Hi Christian,

The topic of how to specify atom invariants came up recently on the list here:
https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg09400.html
Here's a gist that shows how to specify your own atom invariants based solely
upon atomic number and, optionally, aromaticity:
https://gist.github.com/greglandrum/d31ae7618cc5b7322a7121a529bf8190

The key function is here:

    from rdkit.Chem import rdMolDescriptors

    def get_simple_morgan(m, radius, includeAromaticity=False, **kwargs):
        if not includeAromaticity:
            # invariant = atomic number only
            invars = [x.GetAtomicNum() for x in m.GetAtoms()]
        else:
            # OR in 1000 for aromatic atoms (GetIsAromatic() acts as 0 or 1)
            invars = [x.GetAtomicNum() | (1000 * x.GetIsAromatic()) for x in m.GetAtoms()]
        return rdMolDescriptors.GetMorganFingerprint(m, radius, invariants=invars, **kwargs)

The gist also shows how to use the SMARTS for each atom as its atom invariant:

    import hashlib

    def get_smiles_morgan(m, radius, **kwargs):
        invars = []
        for x in m.GetAtoms():
            # there's almost certainly a more performant way to do this, but....
            h = hashlib.md5()
            h.update(x.GetSmarts().encode())
            # use the first 4 bytes of the digest as the integer invariant
            invars.append(int.from_bytes(h.digest()[:4], 'little'))
        return rdMolDescriptors.GetMorganFingerprint(m, radius, invariants=invars, **kwargs)

Note that this is sensitive to things like atom map numbers (as shown in the
gist).

I am compelled to point out that, at least based on the way you phrase the
question, you are asking for two mutually contradictory things here: The first
question asks about including information about aromaticity, which is
determined by the properties of an entire ring system and is thus definitely
*not* local. The second question wants things to be super local and not
affected by atoms that aren't included in the radius being considered.

-greg

On Thu, Jan 9, 2020 at 11:06 AM Kramer, Christian via Rdkit-discuss
<rdkit-discuss@lists.sourceforge.net> wrote:

> Dear RDKit community,
>
> Happy new year!
>
> I am looking for a way to make the circular Morgan Fingerprints more
> SMILES like.
> The background is that with the default definition of atom
> invariants in the RDKit implementation, Morgan Fingerprints do not
> explicitly take into account aromaticity, and use more information from
> higher radii than what would be expected when sketching the substructures
> indexed by the fingerprint. This becomes an issue when drawing the
> substructures, or encoding them as SMILES. Here are two examples that
> illustrate the points:
>
> 1.) Aromaticity:
> At radius 1, the atoms in phenyl and the sp2 atoms in cyclohexene yield
> exactly the same fingerprint, whereas the SMILES for those atoms is
> different:
>
> In [1]: import rdkit
> In [2]: from rdkit import Chem
> In [3]: from rdkit.Chem import AllChem
> In [4]: phenyl = "[*:1]c1ccccc1"
> In [5]: cyclohexyl = "[*:1]C1=CCCCC1"
> In [6]: mol1 = Chem.MolFromSmiles(phenyl)
> In [7]: mol2 = Chem.MolFromSmiles(cyclohexyl)
> In [8]: fp1 = AllChem.GetMorganFingerprint(mol1, 1, fromAtoms=[0])
> In [9]: fp2 = AllChem.GetMorganFingerprint(mol2, 1, fromAtoms=[0])
> In [10]: fp1==fp2
> Out[10]: True
>
> Now in many cases there probably is a good reason why those atoms can be
> considered identical, but then there are still other cases when aromaticity
> makes a difference. For example, when encoding the substructure as a
> SMILES, the two atoms are different ("c" and "C"), which can create
> confusion when comparing to the fingerprint.
>
>
> 2.) Information from higher radii
> The Morgan Fingerprint has the concept of radius. For a radius of 2, I
> would naively expect that only atom environments up to 2 atoms away from
> the rooted atom are taken into account.
> However, this is not fully true, as
> shown below:
>
> In [11]: toluene = "[*:1]c1ccccc1C"
> In [12]: mol3 = Chem.MolFromSmiles(toluene)
> In [13]: fp1 = AllChem.GetMorganFingerprint(mol1, 2, fromAtoms=[0])
> In [14]: fp3 = AllChem.GetMorganFingerprint(mol3, 2, fromAtoms=[0])
> In [15]: fp1==fp3
> Out[15]: False
>
> Toluene and Phenyl differ in the one C ortho to the star atom. This C is 3
> bonds away from the star atom. Therefore, when calculating the
> MorganFingerprint with radius 2 rooted on the star atom, I would expect the
> two fingerprints derived from phenyl and toluene to be the same. I assume
> this is not the case because the connectivity makes a difference between a
> bond to a heavy atom and to a hydrogen.
>
>
> It would be very helpful to get suggestions or even code snippets for how
> to change the default behaviour of the Morgan Fingerprinter such that the
> representation is closer to what one draws or encodes in SMILES for the
> atoms in a given radius. The documentation says that atom invariants can be
> defined, which I hope will help here. If someone has done this before, it
> would be cool if you could share how to do it exactly.
>
> Thanks a lot,
> Christian
>
>
> *Dr. Christian Kramer*
>
> Computer-Aided Drug Design (CADD)
>
> F. Hoffmann-La Roche Ltd
> Pharma Research and Early Development
> Bldg. 092/4.92
> CH-4070 Basel
>
> Phone +41 61 682 2471
> mailto: christian.kra...@roche.com
>
> *Confidentiality Note: *This message is intended only for the use of the
> named recipient(s) and may contain confidential and/or proprietary
> information. If you are not the intended recipient, please contact the
> sender and delete this message. Any unauthorized use of the information
> contained in this message is prohibited.
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
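
As a small addendum to the invariant functions in the reply above, here is a
minimal pure-Python sketch of the two integer encodings, written without RDKit
so that the arithmetic can be checked in isolation. It assumes aromatic atoms
are flagged by OR-ing 1000 into the atomic number and that the hashed invariant
is the first 4 bytes of an MD5 digest. The helper names simple_invariant and
hashed_invariant, and the atom strings passed to them, are made up for
illustration; they are not part of the gist or of the RDKit API.

```python
import hashlib

AROMATIC_FLAG = 1000  # assumption: same constant as in the reply above


def simple_invariant(atomic_num, is_aromatic=False):
    """Hypothetical helper: atomic number with 1000 OR-ed in for aromatic atoms."""
    return atomic_num | (AROMATIC_FLAG * int(is_aromatic))


def hashed_invariant(atom_string):
    """Hypothetical helper: first 4 bytes of the MD5 digest as an unsigned int."""
    digest = hashlib.md5(atom_string.encode()).digest()
    return int.from_bytes(digest[:4], "little")


# Aliphatic and aromatic carbon receive different invariants.
print(simple_invariant(6), simple_invariant(6, True))  # 6 1006

# The hash is case sensitive, so the strings "C" and "c" are kept apart.
print(hashed_invariant("C") != hashed_invariant("c"))

# An atom map number changes the string and therefore the invariant.
print(hashed_invariant("[CH3:1]") != hashed_invariant("[CH3]"))
```

One caveat that may be worth checking against your own data: 1000 already has
the 8 bit set, so OR-ing is not the same as adding. In this sketch
simple_invariant(7, True) and simple_invariant(15, True) both come out as 1007,
which would merge aromatic N and aromatic P. Using + instead of | would keep
them apart.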