Dear RDKit community,

Happy new year!

I am looking for a way to make the circular Morgan fingerprints more
SMILES-like. The background is that, with the default definition of atom
invariants in the RDKit implementation, Morgan fingerprints do not
explicitly take aromaticity into account, and they use more information
from higher radii than one would expect when sketching the substructures
indexed by the fingerprint. This becomes an issue when drawing the
substructures or encoding them as SMILES. Here are two examples that
illustrate these points:

1.) Aromaticity:
At radius 1, the aromatic carbon of phenyl and the sp2 carbon of
cyclohexene attached to the star atom yield exactly the same fingerprint,
whereas the SMILES for those atoms are different:

In [1]: import rdkit
In [2]: from rdkit import Chem
In [3]: from rdkit.Chem import AllChem
In [4]: phenyl = "[*:1]c1ccccc1"
In [5]: cyclohexenyl = "[*:1]C1=CCCCC1"
In [6]: mol1 = Chem.MolFromSmiles(phenyl)
In [7]: mol2 = Chem.MolFromSmiles(cyclohexenyl)
In [8]: fp1 = AllChem.GetMorganFingerprint(mol1, 1, fromAtoms=[0])
In [9]: fp2 = AllChem.GetMorganFingerprint(mol2, 1, fromAtoms=[0])
In [10]: fp1==fp2
Out[10]: True

Now in many cases there probably is a good reason why those atoms can be
considered identical, but then there are still other cases when aromaticity
makes a difference. For example, when encoding the substructure as a
SMILES, the two atoms are different ("c" and "C"), which can create
confusion when comparing to the fingerprint.
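
One possible way to make aromaticity visible, sketched here under
assumptions and not as a definitive recipe: keep RDKit's default
connectivity invariants and fold each atom's aromatic flag into them before
passing the result through the `invariants` argument. The helper name and
the bit-shifting scheme are illustrative choices, not part of RDKit.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolDescriptors


def aromaticity_aware_invariants(mol):
    """Hypothetical helper: default invariants plus the aromatic flag."""
    defaults = rdMolDescriptors.GetConnectivityInvariants(mol)
    # Shift each default invariant left by one bit and store the aromatic
    # flag in the lowest bit; mask to 32 bits because the fingerprinter
    # expects unsigned 32-bit integers.
    return [((inv << 1) | int(atom.GetIsAromatic())) & 0xFFFFFFFF
            for inv, atom in zip(defaults, mol.GetAtoms())]


mol_phenyl = Chem.MolFromSmiles("[*:1]c1ccccc1")
mol_cyclohexenyl = Chem.MolFromSmiles("[*:1]C1=CCCCC1")

fp_phenyl = AllChem.GetMorganFingerprint(
    mol_phenyl, 1,
    invariants=aromaticity_aware_invariants(mol_phenyl), fromAtoms=[0])
fp_cyclohexenyl = AllChem.GetMorganFingerprint(
    mol_cyclohexenyl, 1,
    invariants=aromaticity_aware_invariants(mol_cyclohexenyl), fromAtoms=[0])

# With the aromatic flag folded in, the two rooted fingerprints should no
# longer compare equal.
still_identical = fp_phenyl == fp_cyclohexenyl
print(still_identical)
```

This keeps all the information the default invariants carry and only adds
one extra distinction, so it addresses the "c" versus "C" mismatch but not
the radius issue described in the next point.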


2.) Information from higher radii
The Morgan Fingerprint has the concept of radius. For a radius of 2, I
would naively expect that only atom environments up to 2 atoms away from
the rooted atom are taken into account. However, this is not fully true, as
shown below:

In [11]: toluene = "[*:1]c1ccccc1C"
In [12]: mol3 = Chem.MolFromSmiles(toluene)
In [13]: fp1 = AllChem.GetMorganFingerprint(mol1, 2, fromAtoms=[0])
In [14]: fp3 = AllChem.GetMorganFingerprint(mol3, 2, fromAtoms=[0])
In [15]: fp1==fp3
Out[15]: False

The toluene and phenyl fragments differ only in the methyl carbon attached
at the position ortho to the star atom. This carbon is 3 bonds away from
the star atom. Therefore, when calculating the Morgan fingerprint with
radius 2 rooted on the star atom, I would expect the two fingerprints
derived from phenyl and toluene to be the same. I assume this is not the
case because the default atom invariants of the ortho carbon, which is 2
bonds away, distinguish a bond to a heavy atom from a bond to a hydrogen.
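
This assumption can be checked by comparing the starting invariants
directly. The sketch below assumes that
`rdMolDescriptors.GetConnectivityInvariants` returns the same per-atom
values the Morgan fingerprinter starts from when no `invariants` are
passed, and that atom indices follow the atom order in the SMILES strings.

```python
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

phenyl_mol = Chem.MolFromSmiles("[*:1]c1ccccc1")
toluene_mol = Chem.MolFromSmiles("[*:1]c1ccccc1C")

inv_phenyl = list(rdMolDescriptors.GetConnectivityInvariants(phenyl_mol))
inv_toluene = list(rdMolDescriptors.GetConnectivityInvariants(toluene_mol))

# Atom 0 is the star atom, atom 1 the ring carbon bonded to it, and atom 6
# the ring-closing carbon in the ortho position (it carries the methyl
# group in the toluene fragment).
for idx, label in [(0, "star"), (1, "ipso"), (6, "ortho")]:
    identical = inv_phenyl[idx] == inv_toluene[idx]
    print(f"{label:5s} atom {idx}: invariants identical = {identical}")
```

If the star and ipso invariants agree while the ortho invariant differs,
the radius-2 difference comes from the ortho carbon's own invariant
(its hydrogen count), not from the methyl carbon itself.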


It would be very helpful to get suggestions or even code snippets for how
to change the default behaviour of the Morgan Fingerprinter such that the
representation is closer to what one draws or encodes in SMILES for the
atoms within a given radius. The documentation says that atom invariants
can be defined, which I hope will help here. If someone has done this
before, it would be great if you could share exactly how to do it.
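
As a starting point for discussion, here is a minimal sketch that replaces
the default invariants with values built only from what a SMILES atom
symbol shows: the element and the aromatic flag. The helper
`smiles_like_fp` and its encoding are hypothetical, and dropping degree,
hydrogen count and ring membership is an assumption about what
"SMILES-like" should mean, not an established convention.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def smiles_like_fp(smiles, radius, root=0):
    """Hypothetical helper: rooted Morgan fingerprint from element and
    aromaticity only."""
    mol = Chem.MolFromSmiles(smiles)
    # 1 + 2 * Z + aromatic flag; the leading 1 keeps every invariant
    # nonzero, including the one for the dummy (star) atom with Z = 0.
    invariants = [1 + 2 * atom.GetAtomicNum() + int(atom.GetIsAromatic())
                  for atom in mol.GetAtoms()]
    return AllChem.GetMorganFingerprint(
        mol, radius, invariants=invariants, fromAtoms=[root])


phenyl_smi = "[*:1]c1ccccc1"
cyclohexenyl_smi = "[*:1]C1=CCCCC1"
toluene_smi = "[*:1]c1ccccc1C"

# Point 1: aromatic and non-aromatic carbons should now be distinguished.
same_at_radius_1 = (smiles_like_fp(phenyl_smi, 1)
                    == smiles_like_fp(cyclohexenyl_smi, 1))
# Point 2: the methyl group 3 bonds away should no longer leak into radius 2.
same_at_radius_2 = (smiles_like_fp(phenyl_smi, 2)
                    == smiles_like_fp(toluene_smi, 2))
print(same_at_radius_1, same_at_radius_2)
```

The price of this choice is lower resolution: atoms that differ only in
substitution pattern or hydrogen count start from the same invariant, so
such differences only show up once the radius reaches the differing
neighbour.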

Thanks a lot,
Christian


*Dr. Christian Kramer*

Computer-Aided Drug Design (CADD)


F. Hoffmann-La Roche Ltd

Pharma Research and Early Development
Bldg. 092/4.92

CH-4070 Basel


Phone +41 61 682 2471

mailto: christian.kra...@roche.com

