Hi all,
I'm trying to load a DrugBank library into a Pandas DataFrame in two different ways: with and without creating a 'Mol' column during load. In principle I'm only interested in the SMILES, so creating the 'Mol' column should not be necessary. However, I noticed that the two procedures actually produce a different number of molecules, and the SMILES are not necessarily the same:

1. Creating the 'Mol' column: 2,410 molecules
2. Not creating the 'Mol' column: 2,413 molecules

I assumed the difference was due to a few molecules for which RDKit could not generate a 'Mol' entry for some reason, and which were then silently dropped. So I tried to find the difference between the two sets with:

>>> drugbank.merge(drugbank_nomol, how='outer', on='SMILES', indicator=True).loc[lambda x: x['_merge'] == 'right_only']

Assuming the SMILES are the same in both sets, this *should* return 3 records, but it returns 1865 records (!), meaning the SMILES are mostly different between the sets.

Could someone help me figure out what is going on here?

To avoid attaching files here, I put a test database and a Jupyter Notebook with the example here:
https://www.dropbox.com/s/v8kf7vzpmrjkidl/RDKit_test.zip?dl=0

Thanks a lot!
--
Gustavo Seabra
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss