Hi all,

I'm trying to load a DrugBank library into a Pandas DataFrame in two
different ways: with and without creating a 'Mol' column during load. In
principle I'm only interested in the SMILES, so creating the 'Mol' column
should not be necessary.

 

However, I noticed that the two procedures actually generate a different
number of molecules, and the SMILES are not necessarily the same:

1. Creating the 'Mol' column: 2,410 molecules
2. Not creating the 'Mol' column: 2,413 molecules

 

I assumed the difference was due to a few molecules for which RDKit could
not generate the 'Mol' entry for some reason, and which were then silently
dropped. So, I tried to find the difference between the two sets with:

 

>>> drugbank.merge(drugbank_nomol, how='outer', on='SMILES', indicator=True).loc[
...     lambda x: x['_merge'] == 'right_only']

 

If the SMILES were the same in both sets, this *should* return 3 records,
but it returns 1,865 (!), meaning the SMILES mostly differ between the two
sets.
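
For reference, here is a minimal sketch of what that merge is counting,
written with plain Python lists instead of Pandas so it runs on its own. The
SMILES are toy values made up for illustration (not taken from the DrugBank
file), and `right_only` is a hypothetical helper name. The comparison is on
exact strings, so the same molecule written in two different ways counts as a
mismatch. Whether that explains the records returned here is not established.

```python
# Toy SMILES, made up for illustration; not taken from the DrugBank file.
with_mol = ["CCO", "c1ccccc1", "CC(=O)O"]
without_mol = ["OCC", "C1=CC=CC=C1", "CC(=O)O", "CCN"]


def right_only(left, right):
    """Rows of `right` whose SMILES string has no exact match in `left`.

    This mirrors the rows an outer merge on 'SMILES' would flag as
    'right_only': the match is a plain string comparison.
    """
    seen = set(left)
    return [smi for smi in right if smi not in seen]


diff = right_only(with_mol, without_mol)
print(len(diff), diff)
```

In this toy case only "CCN" is a genuinely extra molecule; "OCC" and
"C1=CC=CC=C1" are ethanol and benzene written differently from "CCO" and
"c1ccccc1", yet they are still reported as unmatched.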

 

Could someone help me figure out what is going on here?

 

To avoid attaching files here, I put a test database and a Jupyter Notebook
with the example here:

https://www.dropbox.com/s/v8kf7vzpmrjkidl/RDKit_test.zip?dl=0

 

Thanks a lot!

--

Gustavo Seabra

 

 

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
