I managed to get soundly confused by the RDKit fingerprint code, but have
since figured things out. My confusion is mostly because the RDKit
fingerprint documentation is incomplete and misleading, and because the data I
want isn't directly accessible.
This may help others who want to work with fingerprints.
I'll start by making a fingerprint.
>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("c1ccccc1O")
>>> fp = Chem.RDKFingerprint(mol)
>>> fp.ToBitString()
'000000000001000001000000000000000000000000000000000000001000010000000001
0000000000000000000000000000000000000000000000000100000000000000000000000
...
0000000000000000000000000100000000000000000000001000000000101000000000000
0000000010000000000000000101000000000000000000000001000000010000000000000
0000010000000000000001000000000000000000000000000'
>>> fp.GetNumOnBits()
76
>>> fp.ToBitString().count("1")
76
The bit string is in little-endian bit order (bit 0 comes first), so that
first '1' is in bit position 11 (the 12th bit).
What I want is some way to get the dense fingerprint as a series of bytes.
I tried
>>> fp.ToBinary().encode("hex")
'e0ffffff000800004c000000160a4c08126240301a08226a4e6e4012521e2444463c006a281e003a8c00246c54302c024c108c56422614044c6e360a3a0e5e0a460c046628765a142d000e044c10502c12022820022e0e241e36'
>>> len(fp.ToBinary()) * 8
720
>>> len(fp.ToBitString())
2048
You can clearly see that this is not a simple binary version of the
fingerprint. It's the wrong length and is entirely too dense with bits.
The Python documentation only says "Returns a binary string representation of
the vector". That method name doesn't exist in the C++ documentation. It's part
of the Python wrapper (I think) which is forwarding the request to
ExplicitBitVect::ToString().
Why does the C++ code have "ToString()" while Python has "ToBinary()"?
For those curious, the encoding is a 4-byte version field, followed by the
fingerprint size and the number of on bits (each a 4-byte integer), followed
by the run-length encoded "on" bits.
>>> import struct
>>> struct.unpack("I", fp.ToBinary()[4:8])
(2048,)
>>> struct.unpack("I", fp.ToBinary()[8:12])
(76,)
>>> s = fp.ToBinary()
>>> ord(s[12])/2
11
>>> ord(s[13])/2
5
>>> ord(s[14])/2
38
>>> ord(s[15])/2
4
The details of the run-length encoding are in Code/RDGeneral/StreamOps.h.
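For anyone who wants to follow the layout described above, here is a minimal pure-Python sketch (Python 3, unlike the Python 2 sessions in this message) that unpacks the hex dump shown earlier. It is not RDKit code. The function names are made up for illustration, and the packed-integer rules are an assumption inferred from the bytes in this one fingerprint: an even first byte is a one-byte value equal to byte/2, and a first byte ending in binary 01 starts a two-byte little-endian value equal to (word >> 2) + 128. The two-byte case occurs only once in this data, with a zero high byte, and longer forms are not handled at all.

```python
import struct

# Hex dump of fp.ToBinary(), copied from earlier in this message.
HEX = "e0ffffff000800004c000000160a4c08126240301a08226a4e6e4012521e2444463c006a281e003a8c00246c54302c024c108c56422614044c6e360a3a0e5e0a460c046628765a142d000e044c10502c12022820022e0e241e36"


def read_packed_int(data, pos):
    """Read one packed integer; return (value, position of the next one).

    Assumed rules (inferred from this data, not taken from RDKit docs):
      low bit 0   -> one byte,  value = byte >> 1
      low bits 01 -> two bytes, value = (little-endian word >> 2) + 128
    """
    first = data[pos]
    if first & 1 == 0:
        return first >> 1, pos + 1
    if first & 3 == 1:
        word = first | (data[pos + 1] << 8)
        return (word >> 2) + 128, pos + 2
    raise ValueError("longer packed integers are not handled in this sketch")


def decode_on_bits(data):
    """Return (number of bits, list of on-bit positions)."""
    # Header: version, fingerprint size, number of on bits (little-endian).
    version, nbits, n_on = struct.unpack("<iII", data[:12])
    on_bits = []
    pos = 12
    bit = -1
    for _ in range(n_on):
        # Each value is the count of zero bits before the next on bit.
        gap, pos = read_packed_int(data, pos)
        bit += gap + 1
        on_bits.append(bit)
    return nbits, on_bits


nbits, on_bits = decode_on_bits(bytes.fromhex(HEX))
print(nbits, len(on_bits), on_bits[:7], on_bits[-4:])
```

For this fingerprint one more packed value follows the 76 gaps. It equals 27, the number of zero bits after bit 2020, and the sketch ignores it.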
There's no fast way to unpack that string in Python. I would rather work with
the bit string, or with the bit list I can get directly via
>>> list(fp.GetOnBits())
[11, 17, 56, 61, 71, 121, 154, ... 1977, 1985, 2004, 2020]
That's okay. There are other options to explore for getting at the underlying
raw data. The documentation says:
ToBase64( (ExplicitBitVect)arg1) -> str :
Converts the vector to a base64 string (the Daylight encoding).
I think the documentation here is wrong. Daylight does have a 6-bit encoding
for binary values, but it's not base64. It's something of their own making. I
have an implementation of it if anyone really needs it.
In any case, it doesn't describe what data is encoded. I hoped it was the raw
data. A quick test shows it's the base64 encoding of the ToBinary() string:
>>> fp.ToBinary()
'\xe0\xff\xff\xff\x00\x08\x00\x00L\x00\x00\x00\x16\nL\x08\x12b@0\x1a\x08"jNn@\x12R\x1e$DF<\x00j(\x1e\x00:\x8c\x00$lT0,\x02L\x10\x8cVB&\x14\x04Ln6\n:\x0e^\nF\x0c\x04f(vZ\x14-\x00\x0e\x04L\x10P,\x12\x02( \x02.\x0e$\x1e6'
>>> fp.ToBase64()
'4P///wAIAABMAAAAFgpMCBJiQDAaCCJqTm5AElIeJERGPABqKB4AOowAJGxUMCwCTBCMVkImFARMbjYKOg5eCkYMBGYodloULQAOBEwQUCwSAiggAi4OJB42'
>>> fp.ToBase64().decode("base64")
'\xe0\xff\xff\xff\x00\x08\x00\x00L\x00\x00\x00\x16\nL\x08\x12b@0\x1a\x08"jNn@\x12R\x1e$DF<\x00j(\x1e\x00:\x8c\x00$lT0,\x02L\x10\x8cVB&\x14\x04Ln6\n:\x0e^\nF\x0c\x04f(vZ\x14-\x00\x0e\x04L\x10P,\x12\x02( \x02.\x0e$\x1e6'
>>>
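The same check can be repeated without RDKit, using only the two strings printed in this message. This is a small Python 3 sketch with the standard library's base64 module. The constant names are mine.

```python
import base64

# Output of fp.ToBase64(), copied from this message.
B64 = "4P///wAIAABMAAAAFgpMCBJiQDAaCCJqTm5AElIeJERGPABqKB4AOowAJGxUMCwCTBCMVkImFARMbjYKOg5eCkYMBGYodloULQAOBEwQUCwSAiggAi4OJB42"

# Hex dump of fp.ToBinary(), copied from this message.
HEX = "e0ffffff000800004c000000160a4c08126240301a08226a4e6e4012521e2444463c006a281e003a8c00246c54302c024c108c56422614044c6e360a3a0e5e0a460c046628765a142d000e044c10502c12022820022e0e241e36"

# Decode with the standard base64 alphabet and compare byte for byte.
raw = base64.b64decode(B64)
print(len(raw), raw.hex() == HEX)
```

The string decodes with the standard base64 alphabet into the same 90 bytes, so ToBase64() carries nothing beyond what ToBinary() already provides.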
This means it doesn't seem possible to get the data I want.
I do see that the data is stored in a boost::dynamic_bitset<>, which is very
different than how OpenBabel and OEChem do it. I don't need access to the raw
data structure. A "GetBytes()" would be fine. I ended up writing a Python
function to do that. I also have a very specific requirement for the bit and
byte order, so I'm not pushing for a change in RDKit ... yet. ;)
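To make the "GetBytes()" idea concrete, here is a minimal Python 3 sketch of one way to pack a list of on bits into dense bytes. It is not the function mentioned above, and the bit and byte order is purely an assumption for illustration: bit i goes into byte i // 8, at position i % 8 counted from the least significant bit. The name on_bits_to_bytes is hypothetical.

```python
def on_bits_to_bytes(on_bits, nbits):
    """Pack on-bit positions into a dense byte string.

    Assumed layout (an illustration, not RDKit's or the author's):
    bit i is stored in byte i // 8, at bit position i % 8, where
    position 0 is the least significant bit of the byte.
    """
    buf = bytearray((nbits + 7) // 8)
    for i in on_bits:
        if not 0 <= i < nbits:
            raise ValueError("bit %d is outside a %d-bit fingerprint" % (i, nbits))
        buf[i >> 3] |= 1 << (i & 7)
    return bytes(buf)


# The first five on bits of the phenol fingerprint shown earlier.
dense = on_bits_to_bytes([11, 17, 56, 61, 71], 2048)
print(len(dense), dense[:9].hex())
```

With RDKit available, the inputs would presumably come from fp.GetOnBits() and fp.GetNumBits(). A different bit or byte order only changes the two index expressions inside the loop.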
Cheers!
Andrew