Hello,

First, I'm a statistician with next to no knowledge of chemistry, but I want to test a new approach for generating Tanimoto similarities based on an alternative type of fingerprint. To do that, I want to use RDKit to generate feature sets from molecules and use those to construct fingerprints. I have tried to work out how to do this by reading the docs and looking at the code, but haven't been able to find the relevant code (I'm guessing it's in C++ somewhere, and I'm only really fluent in Python). The nearest I've been able to get is:

>>> import rdkit
>>> from rdkit import Chem
>>> m = Chem.MolFromSmiles('Cc1ccccc1')
>>> len(Chem.RDKFingerprint(m, nBitsPerHash=1, fpSize=1024).GetOnBits())
18
>>> len(Chem.RDKFingerprint(m, nBitsPerHash=1, fpSize=2048).GetOnBits())
18
>>> len(Chem.RDKFingerprint(m, nBitsPerHash=1, fpSize=4096).GetOnBits())
19
>>> len(Chem.RDKFingerprint(m, nBitsPerHash=1, fpSize=16384).GetOnBits())
19

So it appears that there are probably 19 features, and I could take the set bit positions and use them to construct fingerprints. But I'd rather cut out the uncertainty over hash collisions.

I want to compare my approach with that of Kristensen et al. (2010), who used 2 million commercially available molecules from the ZINC database (version 8). They used the CDK fingerprint generator, but don't provide further details.

So, is there some way I can generate features directly? (This would also allow me to calculate the true Tanimoto scores to compare with the estimates generated by fingerprints.)

Any help regarding suitable test data and feature sets would also be appreciated. My download attempts for ZINC data keep failing after a few hundred KB, and I don't want to use CDK if it isn't needed.

Thanks (in advance).

Duncan Smith

Kristensen et al. (2010) A tree-based method for the rapid screening of chemical fingerprints. Algorithms for Molecular Biology 5:9
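
P.S. To make concrete what I mean by "true" versus "estimated" Tanimoto scores, here is a minimal sketch in plain Python. It does not use RDKit: the integer feature ids are made-up placeholders rather than real molecular features, and fold() is a hypothetical stand-in for whatever hashing a fingerprint generator actually does (I simply take each id modulo the fingerprint size).

```python
def tanimoto(a, b):
    """Exact Tanimoto (Jaccard) similarity between two feature sets."""
    a, b = set(a), set(b)
    union = len(a | b)
    # Assumed convention: two empty sets are treated as identical.
    return len(a & b) / union if union else 1.0


def fold(features, fp_size):
    """Map each feature id to a bit position in a fingerprint of fp_size bits.

    Hypothetical stand-in for fingerprint hashing: ids are reduced modulo
    fp_size, so distinct features can collide on the same bit.
    """
    return {f % fp_size for f in features}


# Placeholder feature ids, NOT real RDKit output.
mol_a = {3, 17, 42, 1027, 2051}
mol_b = {3, 17, 99, 1041, 2051}

# Score computed on the full feature sets (no collisions possible).
exact = tanimoto(mol_a, mol_b)

# Score computed on the on-bit sets of 1024-bit folded fingerprints.
estimate = tanimoto(fold(mol_a, 1024), fold(mol_b, 1024))

print(exact, estimate)
```

With these placeholder values the two scores differ, because 1027 and 2051 land on the same bit as 3, and 1041 lands on the same bit as 17. That discrepancy is the collision effect I'd like to be able to measure on real feature sets.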