Hi Wojtek,
Our findings agree. There is a 64 bit Morgan fingerprint generator, and
it is the one Python uses by default. When you call it, the functions
that produce the values used to set the bits in the 64 bit fingerprint
(MorganFingerprints::getConnectivityInvariants and
MorganFingerprints::getFeatureInvariants) only ever set the first
32 bits.
So you get a 64 bit fingerprint whose upper 32 bits are never used.
On 4/22/2021 12:20 PM, Wojtek Plonka wrote:
Hi Gareth,
Your findings are a bit contrary to mine, so the truth must be
somewhere in between :)
I downloaded the RDKit sources and some support for 64 bit Morgan
Fingerprints seems to be there:
Search "getMorganGenerator<std::uint64_t>" (7 hits in 4 files of 661 searched)
C:\RDKit\rdkit\Code\GraphMol\Fingerprints\catch_tests.cpp (1 hit)
  Line 152: MorganFingerprint::getMorganGenerator<std::uint64_t>(radius));
C:\RDKit\rdkit\Code\GraphMol\Fingerprints\FingerprintGenerator.cpp (4 hits)
  Line 461: generator = MorganFingerprint::getMorganGenerator<std::uint64_t>(2);
  Line 497: generator = MorganFingerprint::getMorganGenerator<std::uint64_t>(2);
  Line 533: generator = MorganFingerprint::getMorganGenerator<std::uint64_t>(2);
  Line 569: generator = MorganFingerprint::getMorganGenerator<std::uint64_t>(2);
C:\RDKit\rdkit\Code\GraphMol\Fingerprints\testFingerprintGenerators.cpp (1 hit)
  Line 2387: MorganFingerprint::getMorganGenerator<std::uint64_t>(2),
C:\RDKit\rdkit\Code\GraphMol\Fingerprints\Wrap\MorganWrapper.cpp (1 hit)
  Line 78: "GetMorganGenerator", getMorganGenerator<std::uint64_t>,
I will have a closer look at that.
I don't need to write my code in Python; C++ (with Google's help) is
fine, too, as long as I can compile it with Linux tools or MSVC
Community Edition.
Maybe simply 64 bit stuff is not complete or not interfaced to Python yet?
Thanks!
Wojtek Plonka
+48885756652
wojtekplonka.com <http://www.wojtekplonka.com>
fb.com/wojtek.plonka <https://fb.com/wojtek.plonka>
On Thu, Apr 22, 2021 at 7:17 PM Gareth Jones wrote:
Hi Wojtek,
From looking at the RDKit code base, my take is that it is
currently not possible to generate 64 bit Morgan fingerprints.
The Python fingerprint generator defaults to 64bit:
In [36]: fp.GetLength()
Out[36]: 18446744073709551615
Unfortunately, the C++ Morgan fingerprint generator only ever sets
the first 32 bits even if the fingerprint is 64bit. If you look
at MorganFingerprints::getConnectivityInvariants and
MorganFingerprints::getFeatureInvariants in
Code/GraphMol/Fingerprints/FingerprintUtil.cpp the generated
invariants (that are used to set the fingerprint bits) are
unsigned 32 bit ints.
Some RDKit development would be needed to template those functions
so that they would work with both 32 and 64 bit fingerprints.
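A minimal sketch of the check described above, assuming the
rdFingerprintGenerator module with GetMorganGenerator and
GetSparseCountFingerprint is available in the installed RDKit. The helper
name sparse_morgan_summary is made up for illustration. It reports the
fingerprint length and how many bits the largest key that was actually set
needs; whether that stays at 32 depends on the RDKit version in use.

```python
try:
    from rdkit import Chem
    from rdkit.Chem import rdFingerprintGenerator
except ImportError:
    Chem = None  # RDKit is not installed; the helper below cannot be called


def sparse_morgan_summary(smiles, radius=2):
    """Return (fingerprint length, bit length of the largest key set).

    Hypothetical helper, not part of RDKit.
    """
    mol = Chem.MolFromSmiles(smiles)
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius)
    fp = generator.GetSparseCountFingerprint(mol)
    keys = fp.GetNonzeroElements().keys()
    return fp.GetLength(), max(keys).bit_length()


if Chem is not None:
    # Same molecule as in the earlier message in this thread.
    length, widest = sparse_morgan_summary('Cn1cnc2n(C)c(=O)n(C)c(=O)c12')
    print(f"fingerprint length: {length}")
    print(f"widest key needs {widest} bits")
```

If the widest key never needs more than 32 bits across many molecules, the
extra 32 bits of the fingerprint are not being used.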
Cheers,
Gareth
On 4/21/2021 10:10 PM, Wojtek Plonka wrote:
Hi Gareth,
Thank you. I do exactly as you wrote. That's not the issue.
Please note that all the keys in elements are below 2**32, so
the main hash function used is definitely 32 bit.
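The observation can be checked directly. The sketch below uses the keys
copied from the GetNonzeroElements() output quoted further down this thread
and needs no RDKit. The variable names are illustrative only.

```python
# Keys from the GetNonzeroElements() output quoted in this thread.
keys = [
    10565946, 348155210, 476388586, 540046244, 553412256,
    864942730, 909857231, 1100037548, 1333761024, 1512818157,
    1981181107, 2030573601, 2041434490, 2092489639, 2246728737,
    2370996728, 2877515035, 2971716993, 2975126068, 3140581776,
    3217380708, 3218693969, 3462333187, 3657471097, 3796970912,
]

# True when every key fits in an unsigned 32 bit integer.
all_fit_in_32_bits = all(key < 2**32 for key in keys)

# Number of bits needed to represent the largest key.
widest_key_bits = max(keys).bit_length()

print(all_fit_in_32_bits, widest_key_bits)
```

A single molecule cannot prove the point, but a 64 bit hash would be
expected to produce keys above 2**32 almost every time.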
According to
https://www.rdkit.org/docs/source/rdkit.Chem.rdFingerprintGenerator.html
both class rdkit.Chem.rdFingerprintGenerator.FingerprintGenerator32
and class rdkit.Chem.rdFingerprintGenerator.FingerprintGenerator64
exist.
However with my limited knowledge I don't know how to access the
64 bit version and that is my problem.
Kindest regards,
Wojtek
Wojtek Plonka
On Thu, Apr 22, 2021 at 1:27 AM Gareth Jones wrote:
Wojtek,
You can use GetNonzeroElements() to convert the sparse
fingerprint to a Python dict mapping each hash to its count.
Cheers,
Gareth
In [5]: from rdkit import Chem
In [6]: from rdkit.Chem import AllChem
In [7]: mol = Chem.MolFromSmiles('Cn1cnc2n(C)c(=O)n(C)c(=O)c12')
In [8]: fp = AllChem.GetMorganFingerprint(mol, 2)
In [9]: elements = fp.GetNonzeroElements()
In [10]: elements
Out[10]:
{10565946: 2,
348155210: 1,
476388586: 1,
540046244: 1,
553412256: 1,
864942730: 2,
909857231: 1,
1100037548: 1,
1333761024: 1,
1512818157: 1,
1981181107: 1,
2030573601: 1,
2041434490: 1,
2092489639: 3,
2246728737: 3,
2370996728: 1,
2877515035: 1,
2971716993: 1,
2975126068: 2,
3140581776: 1,
3217380708: 4,
3218693969: 1,
3462333187: 1,
3657471097: 3,
3796970912: 1}
On 4/21/2021 5:44 AM, Wojtek Plonka wrote:
Dear All
Does anyone have a working example of getting Morgan
fingerprints as a sparse bit vector (non-hashed) in the 64
bit version using Python?
I'm looking into the issue of collisions on the "main hash"
on a large data set (100+ million molecules).
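For a sense of scale, here is a rough birthday-bound estimate of how many
colliding pairs to expect. It assumes a perfectly uniform hash and, as a
simplification, only one hashed value per molecule; real fingerprints hash
many atom environments per molecule, so treat the 32 bit figure as
optimistic. The function name is made up for this sketch.

```python
def expected_colliding_pairs(n_items: int, hash_bits: int) -> float:
    """Approximate number of colliding pairs among n_items uniform hashes.

    There are n*(n-1)/2 pairs, and each pair collides with
    probability 2**-hash_bits.
    """
    return n_items * (n_items - 1) / 2 / 2**hash_bits


n = 100_000_000  # assumption: one hashed value per molecule

pairs_32 = expected_colliding_pairs(n, 32)
pairs_64 = expected_colliding_pairs(n, 64)

print(f"32 bit: about {pairs_32:,.0f} colliding pairs")
print(f"64 bit: about {pairs_64:.6f} colliding pairs")
```

Under these assumptions a 32 bit hash gives on the order of a million
colliding pairs, while a 64 bit hash gives well under one.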
Thank you very much!
Kindest regards,
Wojtek Plonka
_______________________________________________
Rdkit-discuss mailing list
[email protected]
<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
<https://lists.sourceforge.net/lists/listinfo/rdkit-discuss>