[Rdkit-discuss] PandasTools LoadSDF: Different treatment of SMILES depending on presence of 'MOL' column?

Gustavo Seabra Mon, 23 Sep 2019 06:04:24 -0700

Hi all,


I'm trying to load a DrugBank library into a Pandas DataFrame in two
different ways: with and without creating a 'Mol' column during the load. In
principle I'm only interested in the SMILES, so creating the 'Mol' column
should not be necessary.

 

However, I noticed that the two procedures actually generate a different
number of molecules, and the SMILES are not necessarily the same: 

 

1. Creating the 'Mol' column: 2,410 molecules

2. Not creating the 'Mol' column: 2,413 molecules

 

I assumed the difference was due to a few molecules for which RDKit could
not generate the 'Mol' object for some reason, and which were then silently
dropped. So I tried to find the difference between the two sets with:

 

>>> drugbank.merge(drugbank_nomol, how='outer', on='SMILES', indicator=True).loc[
...     lambda x: x['_merge'] == 'right_only']

 

Assuming the SMILES are the same in both sets, this *should* return 3
records, but it returns 1865 records (!), meaning the SMILES are mostly
different between the sets.
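
For reference, here is a minimal sketch of what that merge reports, run on
made-up toy data rather than the DrugBank file. The frame contents below are
hypothetical; only the merge call mirrors the one above. It assumes pandas is
installed and does not need RDKit.

```python
import pandas as pd

# Hypothetical toy frames standing in for the two LoadSDF results.
# "CCO" and "OCC" are both valid SMILES for ethanol, just written differently.
drugbank = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1"]})
drugbank_nomol = pd.DataFrame({"SMILES": ["OCC", "c1ccccc1", "CC(=O)O"]})

# Outer merge on the SMILES text; indicator=True adds a '_merge' column
# with the values 'left_only', 'right_only' or 'both'.
merged = drugbank.merge(drugbank_nomol, how="outer", on="SMILES", indicator=True)

# Rows whose SMILES string occurs only in the right-hand frame.
right_only = merged.loc[lambda x: x["_merge"] == "right_only"]
print(sorted(right_only["SMILES"]))
```

In this toy case only one molecule is really extra in the right-hand frame,
yet two rows come back, because the merge compares SMILES as plain strings
and treats two spellings of the same molecule as different keys.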

 

Could someone help me figure out what is going on here?

 

To avoid attaching files here, I put a test database and a Jupyter Notebook
with the example here:

https://www.dropbox.com/s/v8kf7vzpmrjkidl/RDKit_test.zip?dl=0

 

Thanks a lot!

--

Gustavo Seabra

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
