Hi Wojtek,

From looking at the RDKit code base, my take is that it is currently not possible to generate 64-bit Morgan fingerprints.

The Python fingerprint generator defaults to 64 bit:

In [36]: fp.GetLength()
Out[36]: 18446744073709551615
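
For context, a fingerprint like the one above can be obtained roughly as follows. This is only a minimal sketch, assuming an RDKit build that provides rdFingerprintGenerator.GetMorganGenerator; the caffeine SMILES is the one used elsewhere in this thread, and the radius of 2 is an illustrative choice rather than a recommendation.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Caffeine, the same molecule used elsewhere in this thread.
mol = Chem.MolFromSmiles('Cn1cnc2n(C)c(=O)n(C)c(=O)c12')

# Radius 2 is an illustrative assumption, not a recommendation.
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2)

# Unfolded (sparse) count fingerprint; its index type is 64 bit.
fp = gen.GetSparseCountFingerprint(mol)
elements = fp.GetNonzeroElements()  # dict mapping bit id -> count

print(fp.GetLength())              # nominal length of the sparse vector
print(max(elements).bit_length())  # bits needed for the largest bit id seen
```

Comparing the nominal length with the width of the largest bit id is one way to see whether the extra 32 bits are actually being used.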

Unfortunately, the C++ Morgan fingerprint generator only ever sets the first 32 bits, even if the fingerprint is 64 bit. If you look at MorganFingerprints::getConnectivityInvariants and MorganFingerprints::getFeatureInvariants in Code/GraphMol/Fingerprints/FingerprintUtil.cpp, the generated invariants (which are used to set the fingerprint bits) are unsigned 32-bit ints.

Some RDKit development would be needed to template those functions so that they would work with both 32 and 64 bit fingerprints.
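
As a rough illustration of why the width matters on a large data set, the birthday bound can be evaluated for 32-bit and 64-bit ids. This is a back-of-the-envelope sketch that assumes the ids behave like uniform random hashes (real Morgan invariants will not be exactly uniform), and n_environments is a made-up number of distinct atom environments, not a measured one.

```python
import math


def expected_collisions(n_items, bits):
    """Expected number of colliding pairs for n_items uniform hashes of the given width."""
    return n_items * (n_items - 1) / (2 * 2**bits)


def collision_probability(n_items, bits):
    """Birthday-bound approximation of the chance of at least one collision."""
    return -math.expm1(-expected_collisions(n_items, bits))


# Hypothetical count of distinct atom environments; not a measured value.
n_environments = 10_000_000

for bits in (32, 64):
    print(bits,
          expected_collisions(n_environments, bits),
          collision_probability(n_environments, bits))
```

The relevant count is the number of distinct environments across the data set rather than the number of molecules, so the value to plug in has to come from the data itself.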

Cheers,

Gareth


On 4/21/2021 10:10 PM, Wojtek Plonka wrote:
Hi Gareth,

Thank you. I do exactly as you wrote. That's not the issue.
Please note that all the keys in elements are in the range of 2**32; the main hash function used is definitely 32 bit.
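
A quick way to check this on the GetNonzeroElements() output quoted further down in this thread is sketched below. The bit ids are copied from that output; a single molecule is only consistent with a 32-bit hash, it does not prove it.

```python
# Bit ids copied from the GetNonzeroElements() output quoted later in this thread.
bit_ids = [
    10565946, 348155210, 476388586, 540046244, 553412256,
    864942730, 909857231, 1100037548, 1333761024, 1512818157,
    1981181107, 2030573601, 2041434490, 2092489639, 2246728737,
    2370996728, 2877515035, 2971716993, 2975126068, 3140581776,
    3217380708, 3218693969, 3462333187, 3657471097, 3796970912,
]

largest = max(bit_ids)
print(largest < 2**32)       # True: every id fits in an unsigned 32-bit int
print(largest.bit_length())  # 32
```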

According to https://www.rdkit.org/docs/source/rdkit.Chem.rdFingerprintGenerator.html
both class rdkit.Chem.rdFingerprintGenerator.FingerprintGenerator32
and class rdkit.Chem.rdFingerprintGenerator.FingerprintGenerator64
exist.

However, with my limited knowledge, I don't know how to access the 64 bit version, and that is my problem.
Kindest regards,

Wojtek

Wojtek Plonka
+48885756652
wojtekplonka.com <http://www.wojtekplonka.com>
fb.com/wojtek.plonka <https://fb.com/wojtek.plonka>



On Thu, Apr 22, 2021 at 1:27 AM Gareth Jones <java.jo...@gmail.com> wrote:

    Wojtek,

    You can use GetNonzeroElements() to convert the sparse fingerprint
    to a Python dict of hash to count.

    Cheers,
    Gareth


    In [7]: mol = Chem.MolFromSmiles('Cn1cnc2n(C)c(=O)n(C)c(=O)c12')

    In [8]: fp = AllChem.GetMorganFingerprint(mol, 2)

    In [9]: elements = fp.GetNonzeroElements()

    In [10]: elements
    Out[10]:
    {10565946: 2,
     348155210: 1,
     476388586: 1,
     540046244: 1,
     553412256: 1,
     864942730: 2,
     909857231: 1,
     1100037548: 1,
     1333761024: 1,
     1512818157: 1,
     1981181107: 1,
     2030573601: 1,
     2041434490: 1,
     2092489639: 3,
     2246728737: 3,
     2370996728: 1,
     2877515035: 1,
     2971716993: 1,
     2975126068: 2,
     3140581776: 1,
     3217380708: 4,
     3218693969: 1,
     3462333187: 1,
     3657471097: 3,
     3796970912: 1}

    On 4/21/2021 5:44 AM, Wojtek Plonka wrote:
    Dear All

    Do any of you have a working example of getting Morgan
    Fingerprints, as sparse bit vector (non-hashed) in the 64 bit
    version using Python?
    I'm looking into the issue of collisions on the "main hash" on
    large (100+ million molecules) data
    Thank you very much!
    Kindest regards,

    Wojtek Plonka
    +48885756652
    wojtekplonka.com <http://www.wojtekplonka.com>
    fb.com/wojtek.plonka <https://fb.com/wojtek.plonka>



    _______________________________________________
    Rdkit-discuss mailing list
    Rdkit-discuss@lists.sourceforge.net
    https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


