Hi Andrew,
On Sun, Jan 24, 2010 at 4:26 AM, Andrew Dalke <[email protected]> wrote:
> I've managed to get soundly confused by the RDKit fingerprint code, but have
> since gotten things figured out. My confusion is mostly because the RDKit
> fingerprint documentation is incomplete and misleading, and because the data
> I want isn't directly accessible.
>
> This may help others who want to work with fingerprints.
>
>
> I'll start by making a fingerprint.
>
>>>> from rdkit import Chem
>>>> mol = Chem.MolFromSmiles("c1ccccc1O")
>>>> fp = Chem.RDKFingerprint(mol)
>>>> fp.ToBitString()
> '000000000001000001000000000000000000000000000000000000001000010000000001
> 0000000000000000000000000000000000000000000000000100000000000000000000000
> ...
> 0000000000000000000000000100000000000000000000001000000000101000000000000
> 0000000010000000000000000101000000000000000000000001000000010000000000000
> 0000010000000000000001000000000000000000000000000'
>>>> fp.GetNumOnBits()
> 76
>>>> fp.ToBitString().count("1")
> 76
>
> The bits are in little-endian order, so that first '1' is in bit position 11
> (the 12th bit).
>
>
> What I want is some way to get the dense fingerprint as a series of bytes.
As you discovered, there's not really an easy way to do that from
Python (or from C++, for that matter).
> I tried
>
>
>>>> fp.ToBinary().encode("hex")
> 'e0ffffff000800004c000000160a4c08126240301a08226a4e6e4012521e2444463c006a281e003a8c00246c54302c024c108c56422614044c6e360a3a0e5e0a460c046628765a142d000e044c10502c12022820022e0e241e36'
>>>> len(fp.ToBinary()) * 8
> 720
>>>> len(fp.ToBitString())
> 2048
>
> You can clearly see that this is not a simple binary version of the
> fingerprint. It's the wrong length and is entirely too dense with bits.
Correct. The result of ToBinary() is a binary string (i.e. it may
contain \x00) representation of the bit vector. It is, as you figured
out below, generated by the C++ method "ToString()". These binary
representations can be used to construct new bit vectors:
[1]>>> from rdkit import Chem
[2]>>> from rdkit import DataStructs
[3]>>> mol = Chem.MolFromSmiles("c1ccccc1O")
[4]>>> fp = Chem.RDKFingerprint(mol)
[5]>>> txt = fp.ToBinary()
[6]>>> fp2 = DataStructs.ExplicitBitVect(txt)
[7]>>> fp2==fp
Out[7] True
As you might guess, this same information is contained in the pickled
form of ExplicitBitVects:
[10]>>> import cPickle
[11]>>> pkl = cPickle.dumps(fp,True)
[12]>>> txt in pkl
Out[12] True
>
> The Python documentation only says "Returns a binary string representation of
> the vector". That method name doesn't exist in the C++ documentation. It's
> part of the Python wrapper (I think) which is forwarding the request to
> ExplicitBitVect::ToString().
>
> Why does the C++ code have "ToString()" while Python has "ToBinary()" ?
That's a good question. The answer isn't particularly satisfying. The
C++ name is an accurate description of the method, which returns a
string representation of the BitVect. I chose a different name in
Python because, to me, "ToString()" sounds like something you might
call to get a formatted version of your BitVector, which this binary
output clearly is not.
> For those curious, the encoding is a version string, followed by the
> fingerprint size, followed by the number of on bits, followed by run-length
> encoded "on" bits.
Correct.
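To make that layout concrete, here is a rough Python 3 sketch of a decoder,
run against the hex dump from your message so it does not need RDKit. The
name decode_bitvect is made up for illustration, and the details are inferred
from that one dump rather than from a format specification: three
little-endian 32-bit header fields, then one packed integer per on bit giving
the number of zeros in front of it. Only the one- and two-byte packed forms
that show up in the dump are handled.

```python
import struct

# Hex dump of fp.ToBinary() for "c1ccccc1O", copied from the message above.
HEX_DUMP = (
    "e0ffffff000800004c000000"   # header: version, size, number of on bits
    "160a4c08126240301a08"
    "226a4e6e4012521e2444"
    "463c006a281e003a8c00"
    "246c54302c024c108c56"
    "422614044c6e360a3a0e"
    "5e0a460c046628765a14"
    "2d000e044c10502c1202"
    "2820022e0e241e36"
)


def decode_bitvect(data):
    """Return (version, size, on_bits) for a serialized bit vector.

    Hypothetical helper: the layout is inferred from the dump above.
    """
    # Header: three little-endian 32-bit integers (the first read as signed).
    version, size, n_on = struct.unpack_from("<iII", data, 0)
    pos = 12
    bit = -1
    on_bits = []
    while len(on_bits) < n_on:
        first = data[pos]
        if first & 1 == 0:
            # One-byte form: the value sits in the upper 7 bits.
            zeros = first >> 1
            pos += 1
        elif first & 3 == 1:
            # Two-byte form: 14-bit value, offset by 128.
            zeros = ((first | (data[pos + 1] << 8)) >> 2) + 128
            pos += 2
        else:
            raise ValueError("wider packed values are not handled in this sketch")
        bit += zeros + 1          # skip the run of zeros, land on the on bit
        on_bits.append(bit)
    # One more packed value follows (the trailing run of zeros); it is not
    # needed to recover the on bits, so it is ignored here.
    return version, size, on_bits


version, size, on_bits = decode_bitvect(bytes.fromhex(HEX_DUMP))
print(version, size, len(on_bits), on_bits[:2])
```

For this dump the result is a 2048-bit vector with 76 on bits, the first two
at positions 11 and 17, which agrees with the ToBitString() output you
posted.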
>
> That's okay. There are other options to explore trying to get the underlying
> raw data. The documentation says:
>
> ToBase64( (ExplicitBitVect)arg1) -> str :
> Converts the vector to a base64 string (the Daylight encoding).
>
>
> I think the documentation here is wrong. Daylight does have a 6-bit encoding
> for binary values, but it's not base64. It's something of their own making. I
> have an implementation of it if anyone really needs it.
Nice catch. This is a mistake in the documentation and I'll fix it.
The support for the Daylight encoding only includes parsing the
fingerprints (based on Daylight contrib code).
>
> In any case, it doesn't describe what data is encoded. I hoped it was the raw
> data. A quick test shows it's the base64 encoding of the ToBinary() string:
>
I've updated the docs to make this explicit:
[5]>>> ?fp.ToBase64
Type: instancemethod
Base Class: <type 'instancemethod'>
String Form: <bound method ExplicitBitVect.ToBase64 of
<rdkit.DataStructs.cDataStructs.ExplicitBitVect object at 0x898fb8c>>
Namespace: Interactive
Docstring:
ToBase64( (ExplicitBitVect)arg1) -> str :
Converts the vector to a base64 string (the base64 encoded
version of the results of ToString()).
C++ signature :
std::string ToBase64(ExplicitBitVect {lvalue})
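A small Python 3 illustration of that relationship, using only the standard
library and the first 12 bytes of the ToBinary() dump from your message (so
no RDKit is needed). It assumes the standard base64 alphabet, as in Python's
base64 module; the variable names are just for illustration.

```python
import base64
import struct

# First 12 bytes (the header) of the ToBinary() dump quoted in this thread.
header = bytes.fromhex("e0ffffff000800004c000000")

# Base64-encode the raw bytes, then decode them again.
encoded = base64.b64encode(header).decode("ascii")
decoded = base64.b64decode(encoded)

# The decoded bytes are the same binary header: three little-endian
# 32-bit fields (version, size, number of on bits).
version, size, n_on = struct.unpack("<iII", decoded)
print(encoded)
print(version, size, n_on)   # -32 2048 76
```

Since 12 bytes map to exactly 16 base64 characters, the header can be
recognized at the start of an encoded fingerprint without decoding the rest.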
>
> This means it doesn't seem possible to get the data I want.
Unfortunately correct: there's currently no way to get a packed list
of bytes for an ExplicitBitVect.
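Until something like that exists, one workaround is to pack the output of
ToBitString() yourself. A minimal sketch in plain Python 3 follows; the
function name and the bit order (bit i goes into byte i // 8, at position
i % 8 counting from the least significant end) are my assumptions for
illustration, so adjust them to whatever order you need.

```python
def bitstring_to_bytes(bitstring):
    """Pack a string of '0'/'1' characters into bytes.

    Hypothetical helper: bit i of the string is stored in byte i // 8,
    at bit position i % 8 (least significant bit first).
    """
    if len(bitstring) % 8 != 0:
        raise ValueError("length must be a multiple of 8")
    out = bytearray(len(bitstring) // 8)
    for i, ch in enumerate(bitstring):
        if ch == "1":
            out[i >> 3] |= 1 << (i & 7)
        elif ch != "0":
            raise ValueError("unexpected character %r at position %d" % (ch, i))
    return bytes(out)


# The first 24 bits of the fingerprint shown earlier: on bits at 11 and 17.
packed = bitstring_to_bytes("000000000001000001000000")
print(packed.hex())
```

With a real fingerprint this would be called as
bitstring_to_bytes(fp.ToBitString()), giving 256 bytes for a 2048-bit
vector.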
> I do see that the data is stored in a boost::dynamic_bitset<>, which is very
> different than how OpenBabel and OEChem do it. I don't need access to the raw
> data structure. A "GetBytes()" would be fine. I ended up writing a Python
> function to do that. I also have a very specific requirement for the bit and
> byte order, so I'm not pushing for a change in RDKit ... yet. ;)
:-)
-greg
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss