Hi Christian,

The topic of how to specify atom invariants came up recently on the list
here:
https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg09400.html

Here's a gist that shows how to specify your own atom invariants based
solely upon atomic number and, optionally, aromaticity:
https://gist.github.com/greglandrum/d31ae7618cc5b7322a7121a529bf8190
The key function is here:

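from rdkit.Chem import rdMolDescriptors

def get_simple_morgan(m,radius,includeAromaticity=False,**kwargs):
    if not includeAromaticity:
        # one invariant per atom: just the atomic number
        invars = [x.GetAtomicNum() for x in m.GetAtoms()]
    else:
        # add 1000 for aromatic atoms; atomic numbers are all < 1000, so
        # aromatic and non-aromatic atoms can never get the same value
        invars = [x.GetAtomicNum() + 1000*x.GetIsAromatic()
                  for x in m.GetAtoms()]
    return rdMolDescriptors.GetMorganFingerprint(
        m,radius,invariants=invars,**kwargs)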
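
When atomic number and aromaticity are packed into one integer, it is worth
checking that distinct (element, aromatic) pairs really get distinct values.
The sketch below is plain Python with no RDKit dependency; `pack_or` and
`pack_offset` are hypothetical helper names used only for this comparison:

```python
# Hypothetical helpers (not part of RDKit): two ways to pack
# (atomic number, aromatic flag) into a single integer invariant.

def pack_or(atomic_num, is_aromatic):
    # bitwise OR with 1000 (non-aromatic) or 1001 (aromatic)
    return atomic_num | (1000 + is_aromatic)

def pack_offset(atomic_num, is_aromatic):
    # move aromatic atoms into their own range above all atomic numbers
    return atomic_num + 1000 * is_aromatic

# every element from H (1) to Og (118), aromatic or not
pairs = [(z, arom) for z in range(1, 119) for arom in (False, True)]
n_or = len({pack_or(z, a) for z, a in pairs})
n_offset = len({pack_offset(z, a) for z, a in pairs})

print(len(pairs), n_or, n_offset)
print(pack_or(6, True), pack_or(7, False))          # both 1007
print(pack_offset(6, True), pack_offset(7, False))  # 1006 and 7
```

With the OR packing an aromatic carbon and a non-aromatic nitrogen end up
with the same value (1007), while the additive offset keeps all 236 pairs
distinct.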



The gist also shows how to use the SMARTS for each atom as its atom
invariant:

import hashlib
from rdkit.Chem import rdMolDescriptors

def get_smiles_morgan(m,radius,**kwargs):
    invars = []
    for x in m.GetAtoms():
        # there's almost certainly a more performant way to do this, but....
        # hash the atom's SMARTS and keep the first 4 bytes as the invariant
        h = hashlib.md5()
        h.update(x.GetSmarts().encode())
        invars.append(int.from_bytes(h.digest()[:4],'little'))
    return rdMolDescriptors.GetMorganFingerprint(
        m,radius,invariants=invars,**kwargs)


Note that this is sensitive to things like atom map numbers (as shown in
the gist).
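
To see the hashing step in isolation, here is a minimal sketch that needs
only the standard library. `string_invariant` is a hypothetical helper name,
and the strings are typed in by hand as stand-ins for what `GetSmarts()`
might return for an atom; they are assumptions, not RDKit output:

```python
import hashlib

def string_invariant(text):
    # hypothetical helper: first 4 bytes of the MD5 digest, read as a
    # little-endian unsigned integer (so the value fits in 32 bits)
    digest = hashlib.md5(text.encode()).digest()
    return int.from_bytes(digest[:4], 'little')

# hand-written stand-ins for per-atom SMARTS strings
for s in ('C', 'c', '*', '[*:1]'):
    print(s, string_invariant(s))
```

Any change in the string, including an atom map number, gives an unrelated
invariant, which is why mapped and unmapped versions of the same atom are
treated as different.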

I am compelled to point out that, at least based on the way you phrase the
question, you are asking for two mutually contradictory things here.
The first question asks about including information about aromaticity,
which is determined by the properties of an entire ring system and is thus
definitely *not* local. The second question wants things to be super local
and not affected by atoms that aren't included in the radius being
considered.
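
To illustrate how an invariant can carry information from outside the
radius, here is a toy sketch in plain Python. It is *not* RDKit's Morgan
algorithm: `build_adj`, `shell_signature`, `z_only` and `z_and_degree` are
hypothetical names, and the "signature" is just the sorted invariants of the
atoms in each shell around the root. The two graphs are hand-built versions
of the star-substituted phenyl and toluene fragments from the question, with
atom 0 as the star atom:

```python
def build_adj(n_atoms, bonds):
    # adjacency list for an undirected molecular graph
    adj = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj

def shell_signature(adj, invariants, root, radius):
    # collect the sorted invariants of the atoms 0, 1, ..., radius
    # bonds away from the root
    seen = {root}
    frontier = [root]
    shells = [tuple(sorted(invariants[a] for a in frontier))]
    for _ in range(radius):
        nxt = []
        for a in frontier:
            for b in adj[a]:
                if b not in seen:
                    seen.add(b)
                    nxt.append(b)
        frontier = nxt
        shells.append(tuple(sorted(invariants[a] for a in frontier)))
    return tuple(shells)

# atom 0 = star atom, atoms 1-6 = ring carbons, atom 7 = methyl carbon
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
phenyl = build_adj(7, ring)
tolyl = build_adj(8, ring + [(6, 7)])

def z_only(adj):
    # invariant = atomic number only (0 for the star atom, 6 for carbon)
    return {a: (0 if a == 0 else 6) for a in adj}

def z_and_degree(adj):
    # invariant = (atomic number, number of heavy-atom neighbours)
    return {a: (0 if a == 0 else 6, len(adj[a])) for a in adj}

same_z = (shell_signature(phenyl, z_only(phenyl), 0, 2)
          == shell_signature(tolyl, z_only(tolyl), 0, 2))
same_zd = (shell_signature(phenyl, z_and_degree(phenyl), 0, 2)
           == shell_signature(tolyl, z_and_degree(tolyl), 0, 2))
print(same_z, same_zd)  # True False
```

The methyl carbon is three bonds from the star atom, so it is never visited
at radius 2, but it still changes the degree of the ortho carbon, and that
is enough to make the two signatures differ once degree is part of the
invariant.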

-greg




On Thu, Jan 9, 2020 at 11:06 AM Kramer, Christian via Rdkit-discuss <
rdkit-discuss@lists.sourceforge.net> wrote:

> Dear RDKit community,
>
> Happy new year!
>
> I am looking for a way to make the circular Morgan Fingerprints more
> SMILES-like. The background is that with the default definition of atom
> invariants in the RDKit implementation, Morgan Fingerprints do not
> explicitly take into account aromaticity, and use more information from
> higher radii than what would be expected when sketching the substructures
> indexed by the fingerprint. This becomes an issue when drawing the
> substructures, or encoding them as SMILES. Here are two examples that
> illustrate the points:
>
> 1.) Aromaticity:
> At radius 1, the atoms in phenyl and the sp2 atoms in cyclohexene yield
> exactly the same fingerprint, whereas the SMILES for those atoms is
> different:
>
> In [1]: import rdkit
> In [2]: from rdkit import Chem
> In [3]: from rdkit.Chem import AllChem
> In [4]: phenyl = "[*:1]c1ccccc1"
> In [5]: cyclohexyl = "[*:1]C1=CCCCC1"
> In [6]: mol1 = Chem.MolFromSmiles(phenyl)
> In [7]: mol2 = Chem.MolFromSmiles(cyclohexyl)
> In [8]: fp1 = AllChem.GetMorganFingerprint(mol1, 1, fromAtoms=[0])
> In [9]: fp2 = AllChem.GetMorganFingerprint(mol2, 1, fromAtoms=[0])
> In [10]: fp1==fp2
> Out[10]: True
>
> Now in many cases there probably is a good reason why those atoms can be
> considered identical, but then there are still other cases when aromaticity
> makes a difference. For example, when encoding the substructure as a
> SMILES, the two atoms are different ("c" and "C"), which can create
> confusion when comparing to the fingerprint.
>
>
> 2.) Information from higher radii
> The Morgan Fingerprint has the concept of radius. For a radius of 2, I
> would naively expect that only atom environments up to 2 atoms away from
> the rooted atom are taken into account. However, this is not fully true, as
> shown below:
>
> In [11]: toluene = "[*:1]c1ccccc1C"
> In [12]: mol3 = Chem.MolFromSmiles(toluene)
> In [13]: fp1 = AllChem.GetMorganFingerprint(mol1, 2, fromAtoms=[0])
> In [14]: fp3 = AllChem.GetMorganFingerprint(mol3, 2, fromAtoms=[0])
> In [15]: fp1==fp3
> Out[15]: False
>
> Toluene and phenyl differ in the methyl C ortho to the star atom. This C is 3
> bonds away from the star atom. Therefore, when calculating the
> MorganFingerprint with radius 2 rooted on the star atom, I would expect the
> two fingerprints derived from phenyl and toluene to be the same. I assume
> this is not the case because the connectivity makes a difference between a
> bond to a heavy atom and to a hydrogen.
>
>
> It would be very helpful to get suggestions or even code snippets for how
> to change the default behaviour of the Morgan Fingerprinter such that the
> representation is closer to what one draws or encodes in SMILES for the
> atoms in a given radius. The documentation says that atom invariants can be
> defined, which I hope will help here. If someone did this before, it would be
> cool if you could share how to do it exactly.
>
> Thanks a lot,
> Christian
>
>
> *Dr. Christian Kramer*
>
> Computer-Aided Drug Design (CADD)
>
>
> F. Hoffmann-La Roche Ltd
>
> Pharma Research and Early Development
> Bldg. 092/4.92
>
> CH-4070 Basel
>
>
> Phone +41 61 682 2471
>
> mailto: christian.kra...@roche.com
>
>
> *Confidentiality Note: *This message is intended only for the use of the
> named recipient(s) and may contain confidential and/or proprietary
> information. If you are not the intended recipient, please contact the
> sender and delete this message. Any unauthorized use of the information
> contained in this message is prohibited.
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>