Dear Gareth,

Your test proves that I have to drop the 64 bit issue for now and get back to working with the 32 bit fingerprints and investigating potential collisions there.
Thank you very much for all the support!

Best wishes,
Wojtek Plonka
+48885756652
wojtekplonka.com <http://www.wojtekplonka.com>
fb.com/wojtek.plonka

On Thu, Apr 22, 2021 at 10:57 PM Gareth Jones <java.jo...@gmail.com> wrote:

> Hi Wojtek,
>
> Yes. I don't want to speak for the developer(s) of the Morgan fingerprint
> code, but I don't think that 64 bit support is there.
>
> If you add the function below to testFingerprintGenerators.cpp and then
> debug, you can see that you create a 64 bit fingerprint but only end up
> setting the first 32 bits through the Morgan invariant functions. This is
> what happens in Python where 64 bit fingerprints are created by default.
>
> void testMorgan64FP() {
>   BOOST_LOG(rdErrorLog) << "-------------------------------------"
>                         << std::endl;
>   BOOST_LOG(rdErrorLog) << " Test Morgan 64 Fingerprints." << std::endl;
>
>   // Radius-3 Morgan generator templated on a 64 bit output type.
>   auto mol = SmilesToMol("Cn1cnc2n(C)c(=O)n(C)c(=O)c12");
>   auto morganGenerator =
>       MorganFingerprint::getMorganGenerator<std::uint64_t>(3);
>   auto fp = morganGenerator->getSparseCountFingerprint(*mol);
>   // Set a breakpoint here to inspect which bits are actually set.
>   fp->getNonzeroElements();
>
>   delete fp;
>   delete morganGenerator;
>   delete mol;
> }
>
> On 4/22/2021 2:06 PM, Wojtek Plonka wrote:
>
> Hi Gareth,
>
> I'm a bit lost now...
> If you look into the CPP testing code
> C:\RDKit\rdkit\Code\GraphMol\Fingerprints\testFingerprintGenerators.cpp
> the testing function void testMorganFP() (line 615) seems to use only the
> FingerprintGenerator<std::uint32_t> *morganGenerator;
> as if the 64 bit version was not maintained.
>
> Wojtek Plonka
> +48885756652
> wojtekplonka.com <http://www.wojtekplonka.com>
> fb.com/wojtek.plonka
>
> On Thu, Apr 22, 2021 at 9:57 PM Gareth Jones <java.jo...@gmail.com> wrote:
>
>> Hi Wojtek,
>>
>> Our findings are the same. There is a Morgan fingerprint generator for
>> 64 bits, which Python uses by default.
>> When you call it, the functions that
>> actually set the bits in the 64 bit fingerprint
>> (MorganFingerprints::getConnectivityInvariants and
>> MorganFingerprints::getFeatureInvariants) will only ever set the first 32
>> bits.
>>
>> So you have a 64 bit fingerprint, but only the first 32 bits are ever set.
>>
>> On 4/22/2021 12:20 PM, Wojtek Plonka wrote:
>>
>> Hi Gareth,
>>
>> Your findings are a bit contrary to mine, so the truth must be somewhere
>> in between :)
>> I downloaded the RDKit sources and some support for 64 bit Morgan
>> Fingerprints seems to be there:
>>
>> Search "getMorganGenerator<std::uint64_t>" (7 hits in 4 files of 661
>> searched)
>> C:\RDKit\rdkit\Code\GraphMol\Fingerprints\catch_tests.cpp (1 hit)
>>   Line 152: MorganFingerprint::getMorganGenerator<std::uint64_t>(radius));
>> C:\RDKit\rdkit\Code\GraphMol\Fingerprints\FingerprintGenerator.cpp (4
>> hits)
>>   Line 461: generator = MorganFingerprint::getMorganGenerator<std::uint64_t>(2);
>>   Line 497: generator = MorganFingerprint::getMorganGenerator<std::uint64_t>(2);
>>   Line 533: generator = MorganFingerprint::getMorganGenerator<std::uint64_t>(2);
>>   Line 569: generator = MorganFingerprint::getMorganGenerator<std::uint64_t>(2);
>> C:\RDKit\rdkit\Code\GraphMol\Fingerprints\testFingerprintGenerators.cpp
>> (1 hit)
>>   Line 2387: MorganFingerprint::getMorganGenerator<std::uint64_t>(2),
>> C:\RDKit\rdkit\Code\GraphMol\Fingerprints\Wrap\MorganWrapper.cpp (1 hit)
>>   Line 78: "GetMorganGenerator", getMorganGenerator<std::uint64_t>,
>>
>> I will have a closer look at that.
>> I don't need to write my code in Python; C++ (with Google's help) is
>> fine, too, as long as I can compile it with Linux tools or MSVC Community
>> Edition.
>> Maybe the 64 bit stuff is simply not complete or not interfaced to Python yet?
>> Thanks!
>>
>> Wojtek Plonka
>> +48885756652
>> wojtekplonka.com <http://www.wojtekplonka.com>
>> fb.com/wojtek.plonka
>>
>> On Thu, Apr 22, 2021 at 7:17 PM Gareth Jones <java.jo...@gmail.com>
>> wrote:
>>
>>> Hi Wojtek,
>>>
>>> From looking at the RDKit code base my take is that it is currently not
>>> possible to generate 64 bit Morgan fingerprints.
>>>
>>> The Python fingerprint generator defaults to 64bit:
>>>
>>> In [36]: fp.GetLength()
>>> Out[36]: 18446744073709551615
>>>
>>> Unfortunately, the C++ Morgan fingerprint generator only ever sets the
>>> first 32 bits even if the fingerprint is 64bit. If you look at
>>> MorganFingerprints::getConnectivityInvariants and
>>> MorganFingerprints::getFeatureInvariants in
>>> Code/GraphMol/Fingerprints/FingerprintUtil.cpp the generated invariants
>>> (that are used to set the fingerprint bits) are unsigned 32 bit ints.
>>>
>>> Some RDKit development would be needed to template those functions so
>>> that they would work with both 32 and 64 bit fingerprints.
>>>
>>> Cheers,
>>>
>>> Gareth
>>>
>>> On 4/21/2021 10:10 PM, Wojtek Plonka wrote:
>>>
>>> Hi Gareth,
>>>
>>> Thank you. I do exactly as you wrote. That's not the issue.
>>> Please note that all the keys in elements are below 2**32, so the
>>> main hash function used is definitely 32 bit.
>>>
>>> According to
>>> https://www.rdkit.org/docs/source/rdkit.Chem.rdFingerprintGenerator.html
>>> both class rdkit.Chem.rdFingerprintGenerator.FingerprintGenerator32
>>> and class rdkit.Chem.rdFingerprintGenerator.FingerprintGenerator64
>>> exist.
>>>
>>> However, with my limited knowledge I don't know how to access the 64 bit
>>> version, and that is my problem.
>>> Kindest regards,
>>>
>>> Wojtek
>>>
>>> Wojtek Plonka
>>> +48885756652
>>> wojtekplonka.com <http://www.wojtekplonka.com>
>>> fb.com/wojtek.plonka
>>>
>>> On Thu, Apr 22, 2021 at 1:27 AM Gareth Jones <java.jo...@gmail.com>
>>> wrote:
>>>
>>>> Wojtek,
>>>>
>>>> You can use GetNonzeroElements() to convert the sparse fingerprint to a
>>>> Python dict of hash to count.
>>>>
>>>> Cheers,
>>>> Gareth
>>>>
>>>> In [7]: mol = Chem.MolFromSmiles('Cn1cnc2n(C)c(=O)n(C)c(=O)c12')
>>>>
>>>> In [8]: fp = AllChem.GetMorganFingerprint(mol, 2)
>>>>
>>>> In [9]: elements = fp.GetNonzeroElements()
>>>>
>>>> In [10]: elements
>>>> Out[10]:
>>>> {10565946: 2,
>>>>  348155210: 1,
>>>>  476388586: 1,
>>>>  540046244: 1,
>>>>  553412256: 1,
>>>>  864942730: 2,
>>>>  909857231: 1,
>>>>  1100037548: 1,
>>>>  1333761024: 1,
>>>>  1512818157: 1,
>>>>  1981181107: 1,
>>>>  2030573601: 1,
>>>>  2041434490: 1,
>>>>  2092489639: 3,
>>>>  2246728737: 3,
>>>>  2370996728: 1,
>>>>  2877515035: 1,
>>>>  2971716993: 1,
>>>>  2975126068: 2,
>>>>  3140581776: 1,
>>>>  3217380708: 4,
>>>>  3218693969: 1,
>>>>  3462333187: 1,
>>>>  3657471097: 3,
>>>>  3796970912: 1}
>>>>
>>>> On 4/21/2021 5:44 AM, Wojtek Plonka wrote:
>>>>
>>>> Dear All
>>>>
>>>> Do any of you have a working example of getting Morgan Fingerprints, as
>>>> sparse bit vector (non-hashed) in the 64 bit version using Python?
>>>> I'm looking into the issue of collisions on the "main hash" on large
>>>> (100+ million molecules) data
>>>> Thank you very much!
>>>> Kindest regards,
>>>>
>>>> Wojtek Plonka
>>>> +48885756652
>>>> wojtekplonka.com <http://www.wojtekplonka.com>
>>>> fb.com/wojtek.plonka
>>>>
>>>> _______________________________________________
>>>> Rdkit-discuss mailing list
>>>> Rdkit-discuss@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
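
Two numeric points come up in this thread: that every key returned by GetNonzeroElements() stays below 2**32, and that hash collisions become a concern at 100+ million molecules. Both can be checked with plain Python. What follows is only a rough sketch and is not taken from the thread: the helper names (max_bit_length, fits_in_bits, expected_collisions) are hypothetical and are not RDKit functions, the keys are copied from the Out[10] listing, and the collision estimate assumes a uniformly distributed hash (the birthday approximation), which a real Morgan hash only approximates.

```python
"""Standard-library-only sketch; helper names are hypothetical, not RDKit API."""

# Keys and counts copied from the Out[10] listing quoted in the thread.
ELEMENTS = {
    10565946: 2, 348155210: 1, 476388586: 1, 540046244: 1, 553412256: 1,
    864942730: 2, 909857231: 1, 1100037548: 1, 1333761024: 1, 1512818157: 1,
    1981181107: 1, 2030573601: 1, 2041434490: 1, 2092489639: 3, 2246728737: 3,
    2370996728: 1, 2877515035: 1, 2971716993: 1, 2975126068: 2, 3140581776: 1,
    3217380708: 4, 3218693969: 1, 3462333187: 1, 3657471097: 3, 3796970912: 1,
}


def max_bit_length(keys):
    """Return the number of bits needed to represent the largest key."""
    return max(int(k).bit_length() for k in keys)


def fits_in_bits(keys, bits=32):
    """Return True if every key is a non-negative integer below 2**bits."""
    limit = 1 << bits
    return all(0 <= k < limit for k in keys)


def expected_collisions(n_items, bits):
    """Estimate the expected number of colliding pairs among n_items values.

    Assumption: hashes are uniform over 2**bits values, so each of the
    n*(n-1)/2 pairs collides with probability 2**-bits.
    """
    pairs = n_items * (n_items - 1) // 2
    return pairs / 2**bits


if __name__ == "__main__":
    print("largest key needs", max_bit_length(ELEMENTS), "bits")
    print("all keys below 2**32:", fits_in_bits(ELEMENTS, 32))
    for bits in (32, 64):
        print(bits, "bit hash:", expected_collisions(10**8, bits))
```

Note that n_items here stands for the number of distinct hashed atom environments, not the number of molecules, so 10**8 is only a placeholder. Under the uniform-hash assumption, that many distinct values gives on the order of a million colliding pairs for a 32 bit hash and far fewer than one for a 64 bit hash.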