Dear RDKit community, Happy new year!
I am looking for a way to make the circular Morgan fingerprints more SMILES-like. The background is that with the default definition of atom invariants in the RDKit implementation, Morgan fingerprints do not explicitly take aromaticity into account, and they use more information from higher radii than one would expect when sketching the substructures indexed by the fingerprint. This becomes an issue when drawing the substructures, or encoding them as SMILES. Here are two examples that illustrate the points:

1.) Aromaticity

At radius 1, the atoms in phenyl and the sp2 atoms in cyclohexene yield exactly the same fingerprint, whereas the SMILES for those atoms are different:

In [1]: import rdkit
In [2]: from rdkit import Chem
In [3]: from rdkit.Chem import AllChem
In [4]: phenyl = "[*:1]c1ccccc1"
In [5]: cyclohexyl = "[*:1]C1=CCCCC1"
In [6]: mol1 = Chem.MolFromSmiles(phenyl)
In [7]: mol2 = Chem.MolFromSmiles(cyclohexyl)
In [8]: fp1 = AllChem.GetMorganFingerprint(mol1, 1, fromAtoms=[0])
In [9]: fp2 = AllChem.GetMorganFingerprint(mol2, 1, fromAtoms=[0])
In [10]: fp1==fp2
Out[10]: True

In many cases there is probably a good reason why those atoms can be considered identical, but there are other cases where aromaticity makes a difference. For example, when encoding the substructure as SMILES, the two atoms are written differently ("c" and "C"), which can create confusion when comparing to the fingerprint.

2.) Information from higher radii

The Morgan fingerprint has the concept of a radius. For a radius of 2, I would naively expect that only atom environments up to 2 atoms away from the rooted atom are taken into account.
However, this is not fully true, as shown below:

In [11]: toluene = "[*:1]c1ccccc1C"
In [12]: mol3 = Chem.MolFromSmiles(toluene)
In [13]: fp1 = AllChem.GetMorganFingerprint(mol1, 2, fromAtoms=[0])
In [14]: fp3 = AllChem.GetMorganFingerprint(mol3, 2, fromAtoms=[0])
In [15]: fp1==fp3
Out[15]: False

Toluene and phenyl differ only in the methyl carbon attached at the position ortho to the star atom. This carbon is 3 bonds away from the star atom. Therefore, when calculating the Morgan fingerprint with radius 2 rooted on the star atom, I would expect the two fingerprints derived from phenyl and toluene to be the same. I assume this is not the case because the default connectivity invariants distinguish between a bond to a heavy atom and a bond to a hydrogen, so the ortho carbon (2 bonds away) already carries information about its substituent.

It would be very helpful to get suggestions or even code snippets for how to change the default behaviour of the Morgan fingerprinter such that the representation is closer to what one draws or encodes in SMILES for the atoms within a given radius. The documentation says that atom invariants can be defined, which I hope will help here. If someone has done this before, it would be great if you could share exactly how to do it.

Thanks a lot,
Christian

Dr. Christian Kramer
Computer-Aided Drug Design (CADD)
F. Hoffmann-La Roche Ltd
Pharma Research and Early Development
Bldg. 092/4.92
CH-4070 Basel
Phone +41 61 682 2471
mailto: christian.kra...@roche.com
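
P.S. To make the question more concrete, here is a minimal sketch of the kind of thing I have in mind, using the `invariants` argument of GetMorganFingerprint. The helper names (`atom_invariant`, `smiles_like_invariants`, `rooted_fp`) and the choice of properties (atomic number, aromaticity flag, formal charge, hashed with CRC32) are my own assumptions, not an established recipe, and I have not verified that this gives the behaviour I want in all cases.

```python
import zlib


def atom_invariant(atomic_num, is_aromatic, formal_charge=0):
    """Hypothetical helper: fold SMILES-like atom properties into one
    unsigned 32-bit integer (the invariants are passed as plain ints)."""
    key = f"{atomic_num}|{int(is_aromatic)}|{formal_charge}".encode("ascii")
    return zlib.crc32(key) & 0xFFFFFFFF


def smiles_like_invariants(mol):
    """One invariant per atom. Degree, H count and ring membership are
    deliberately left out (assumption: these are what leak information
    from beyond the nominal radius)."""
    return [
        atom_invariant(a.GetAtomicNum(), a.GetIsAromatic(), a.GetFormalCharge())
        for a in mol.GetAtoms()
    ]


def rooted_fp(smiles, radius, root=0):
    """Morgan fingerprint rooted on a single atom, with the custom invariants."""
    # RDKit is imported here so the helpers above stay usable on their own.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprint(
        mol, radius, invariants=smiles_like_invariants(mol), fromAtoms=[root]
    )
```

The two comparisons from above would then be `rooted_fp("[*:1]c1ccccc1", 1) == rooted_fp("[*:1]C1=CCCCC1", 1)` and `rooted_fp("[*:1]c1ccccc1", 2) == rooted_fp("[*:1]c1ccccc1C", 2)`. Whether dropping degree and H count like this has unwanted side effects is exactly what I am unsure about.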