Hi Andrew,

Thank you! This is so thorough, and so helpful. We truly appreciate it.

All the best,

-Cyrus

On 10/27/20, 4:28 AM, "Andrew Dalke" <da...@dalkescientific.com> wrote:

    ** EXTERNAL EMAIL **


    On Oct 26, 2020, at 17:41, Cyrus Maher <cma...@vir.bio> wrote:
    > I’m wondering if there is an easy way to retrieve the atom numbers that 
the morgan fingerprints algorithm assigns as its first step.

    Many of the fingerprint function support an optional "bitInfo" parameter. 
If it's a dictionary then the keys are the bit that was set, and the value is 
at tuple of the (atom index, radius) which set it.

    Here's an example with theobromine using r=0, which lets you see the 
initial invariants:

    >>> from rdkit import Chem
    >>> from rdkit.Chem import rdMolDescriptors
    >>> mol = Chem.MolFromSmiles("Cn1cnc2c1c(=O)[nH]c(=O)n2C")
    >>> bitInfo = {}
    >>> fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius=0, 
useFeatures=1, bitInfo=bitInfo)
    >>> for bitno, pairs in sorted(bitInfo.items()):
    ...   print(f"Bitno: {bitno}")
    ...   for atom_idx, r in pairs:
    ...     print(f"  atom {atom_idx} 
({mol.GetAtomWithIdx(atom_idx).GetSymbol()}) with radius {r}")
    ...
    Bitno: 0
      atom 0 (C) with radius 0
      atom 12 (C) with radius 0
    Bitno: 2
      atom 7 (O) with radius 0
      atom 10 (O) with radius 0
    Bitno: 4
      atom 2 (C) with radius 0
      atom 4 (C) with radius 0
      atom 5 (C) with radius 0
      atom 6 (C) with radius 0
      atom 9 (C) with radius 0
    Bitno: 5
      atom 8 (N) with radius 0
    Bitno: 6
      atom 1 (N) with radius 0
      atom 3 (N) with radius 0
      atom 11 (N) with radius 0

    If I follow the code correctly, when useFeatures == 1 then the intial 
invariants are set by getFeatureInvariants() in 
./Code/GraphMol/Fingerprints/FingerprintUtil.cpp , available at:

    
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_rdkit_rdkit_blob_d594998dda2803a15aa0066e06f2477b71ed3be6_Code_GraphMol_Fingerprints_FingerprintUtil.cpp-23L221&d=DwIFaQ&c=euGZstcaTDllvimEN8b7jXrwqOf-v5A_CdpgnVfiiMM&r=b_UdO5RJBZB-KGEyd1F-0g&m=3NVDgW4zQTXqvGgBgN1WpWt1PuJsSWJpydkDuamTkGY&s=tfPrxiPHW2FK-NXmObtRK0ri4Z456d1IlSiKx1tIB9s&e=

    A few lines up, at 
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_rdkit_rdkit_blob_d594998dda2803a15aa0066e06f2477b71ed3be6_Code_GraphMol_Fingerprints_FingerprintUtil.cpp-23L182&d=DwIFaQ&c=euGZstcaTDllvimEN8b7jXrwqOf-v5A_CdpgnVfiiMM&r=b_UdO5RJBZB-KGEyd1F-0g&m=3NVDgW4zQTXqvGgBgN1WpWt1PuJsSWJpydkDuamTkGY&s=rd8o6LjWxXd6iezueStsEXFPmvKD2IoPWRz_vCOnPNI&e=
 , you'll see the features patterns defined in smartsPatterns

    They appear to be identical to the list you gave.

    I reimplemented the initialization function (copied at the end of this 
email). Running the program shows that it produces the same invariants which 
are used as the bit numbers in the Morgan feature fingerprint:

    Invariant: 0
      atom 0 (C)
      atom 12 (C)
    Invariant: 2
      atom 7 (O)
      atom 10 (O)
    Invariant: 4
      atom 2 (C)
      atom 4 (C)
      atom 5 (C)
      atom 6 (C)
      atom 9 (C)
    Invariant: 5
      atom 8 (N)
    Invariant: 6
      atom 1 (N)
      atom 3 (N)
      atom 11 (N)


    I believe that gives you two ways to get the information you want!

    Best regards,

                                    Andrew
                                    da...@dalkescientific.com




    # Python re-implementation of RDKit's getFeatureInvariants() from
    # ./Code/GraphMol/Fingerprints/FingerprintUtil.cpp

    from rdkit import Chem

    smartsPatterns = [
        "[$([N;!H0;v3,v4&+1]),\
    $([O,S;H1;+0]),\
    n&H1&+0]",                                                  # Donor
        "[$([O,S;H1;v2;!$(*-*=[O,N,P,S])]),\
    $([O,S;H0;v2]),\
    $([O,S;-]),\
    $([N;v3;!$(N-*=[O,N,P,S])]),\
    n&H0&+0,\
    $([o,s;+0;!$([o,s]:n);!$([o,s]:c:n)])]",                    # Acceptor
        "[a]",                                                  # Aromatic
        "[F,Cl,Br,I]",                                          # Halogen
        "[#7;+,\
    $([N;H2&+0][$([C,a]);!$([C,a](=O))]),\
    $([N;H1&+0]([$([C,a]);!$([C,a](=O))])[$([C,a]);!$([C,a](=O))]),\
    $([N;H0&+0]([C;!$(C(=O))])([C;!$(C(=O))])[C;!$(C(=O))])]",  # Basic
        "[$([C,S](=[O,S,P])-[O;H1,-1])]"                        # Acidic
        ]

    mol = Chem.MolFromSmiles("Cn1cnc2c1c(=O)[nH]c(=O)n2C")
    invariants = [0] * mol.GetNumAtoms()
    for pattern_idx, smartsPattern in enumerate(smartsPatterns):
        pat = Chem.MolFromSmarts(smartsPattern)
        for (atom_idx,) in mol.GetSubstructMatches(pat):
            invariants[atom_idx] |= (1<<pattern_idx)

    # Show all atoms with the same invariant, ordered by invariant
    from collections import defaultdict
    by_invariant = defaultdict(list)
    for atom_idx, invariant in enumerate(invariants):
        by_invariant[invariant].append(atom_idx)
    for invariant, atom_indices in sorted(by_invariant.items()):
        print(f"Invariant: {invariant}")
        for atom_idx in atom_indices:
            print(f"  atom {atom_idx} 
(https://urldefense.proofpoint.com/v2/url?u=http-3A__-7Bmol.Ge&d=DwIFaQ&c=euGZstcaTDllvimEN8b7jXrwqOf-v5A_CdpgnVfiiMM&r=b_UdO5RJBZB-KGEyd1F-0g&m=3NVDgW4zQTXqvGgBgN1WpWt1PuJsSWJpydkDuamTkGY&s=9p0TLX9AIODjZpvzQA_s-Q3jrUGE6l9exmJiFVhU4ro&e=tAtomWithIdx(atom_idx).GetSymbol()})")






_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to