OK, to sum up: for me, writing to binary is a neat, fast, and low-storage
solution for fingerprints. Example:
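import bitstring
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from tqdm.notebook import tqdm

# df is the DataFrame holding the SMILES column
# 64-bit Morgan fingerprints: each one packs into exactly 8 bytes
gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=64)
with open('fingerprints.bin', 'wb') as o:
    for smi in tqdm(df['smiles']):
        mol = Chem.MolFromSmiles(smi)
        fp = gen_mo.GetFingerprint(mol)
        # pack the '0'/'1' string into raw bytes (first bit = most significant)
        bs = bitstring.BitArray(bin=fp.ToBitString())
        o.write(bs.bytes)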
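
The packing step can also be sketched with the standard library alone, for anyone without bitstring installed. This is only an illustration: pack_bits is a hypothetical helper name, and the bit pattern is made up. It assumes the same layout as above, i.e. the first character of the bit string becomes the most significant bit of the first byte.

```python
def pack_bits(bit_string):
    """Pack a string of '0'/'1' characters into bytes (first character = most significant bit)."""
    if len(bit_string) % 8 != 0:
        raise ValueError('bit string length must be a multiple of 8')
    return int(bit_string, 2).to_bytes(len(bit_string) // 8, 'big')


# made-up 64-bit pattern with only the first and last bits set
example = '1' + '0' * 62 + '1'
packed = pack_bits(example)
print(len(packed), packed.hex())
```

A 64-bit fingerprint always comes out as 8 bytes, which is what makes the fixed-width read loop possible.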

Most of the time was going into creating numpy arrays. For instance:
%%timeit
fp = np.array(gen_mo.GetFingerprint(mol))

351 µs ± 2.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

vs:

%%timeit
fp = gen_mo.GetFingerprint(mol)
bs = bitstring.BitArray(bin=fp.ToBitString())

42 µs ± 273 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


So once you remove that step, 100 million ligands take about 4.5 hours in
serial, with no need for parallelization. Cheers!
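
A back-of-envelope check of those numbers, as a rough sketch. It scales the two timeit results to 100 million ligands and works out the file size for 64-bit and 512-bit fingerprints. The timeit cells only cover the fingerprint step, so the rest of the 4.5 hours presumably goes to SMILES parsing and file I/O; that split is an assumption, not something measured here. hours_for is a hypothetical helper name.

```python
N_LIGANDS = 100_000_000


def hours_for(per_item_seconds, n_items=N_LIGANDS):
    """Total hours for n_items at a fixed per-item cost."""
    return per_item_seconds * n_items / 3600


# per-molecule timings taken from the two %%timeit cells
numpy_hours = hours_for(351e-6)      # GetFingerprint + np.array
bitstring_hours = hours_for(42e-6)   # GetFingerprint + ToBitString + BitArray

# disk footprint: fpSize bits per fingerprint, 8 bits per byte
size_64_gb = N_LIGANDS * 64 // 8 / 1e9
size_512_gb = N_LIGANDS * 512 // 8 / 1e9

print(f'numpy path:     {numpy_hours:.2f} h')
print(f'bitstring path: {bitstring_hours:.2f} h')
print(f'64-bit file:    {size_64_gb:.1f} GB')
print(f'512-bit file:   {size_512_gb:.1f} GB')
```

So the numpy conversion alone would account for nearly ten hours, and the 64-bit file comes to under 1 GB.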
PS for newbs like me, read the binary fingerprints like this:
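import numpy as np

bits = []
with open('fingerprints.bin', 'rb') as i:
    for _ in range(num_fps):
        fp_bytes = i.read(8)  # 64 bits = 8 bytes per fingerprint
        bits.append(bitstring.BitArray(bytes=fp_bytes).bin)
# turn the '0'/'1' characters into a (num_fps, 64) array of 0s and 1s
fingerprints_concat = (np.frombuffer(''.join(bits).encode('ascii'), dtype='u1')
                       - ord('0')).reshape(num_fps, 64)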
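
The read loop goes through '0'/'1' text, which gets slow and memory-hungry at 100 million rows. A possible shortcut, sketched here under the assumption that the file was written most-significant-bit first (as bitstring does), is to let numpy read the raw bytes and expand them with np.unpackbits. load_fingerprints is a hypothetical helper name, and the demo fingerprints are made up for a round-trip check.

```python
import os
import tempfile

import numpy as np


def load_fingerprints(path, n_bits=64):
    """Read packed fingerprints into a (num_fps, n_bits) uint8 array of 0s and 1s."""
    packed = np.fromfile(path, dtype=np.uint8)
    # unpackbits emits the most significant bit of each byte first
    return np.unpackbits(packed).reshape(-1, n_bits)


# round-trip check on two made-up 64-bit fingerprints
demo = np.zeros((2, 64), dtype=np.uint8)
demo[0, [0, 5, 63]] = 1
demo[1, [7, 8]] = 1

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.bin')
    np.packbits(demo, axis=1).tofile(path)  # 8 bytes per fingerprint
    size_on_disk = os.path.getsize(path)
    restored = load_fingerprints(path)

print(size_on_disk, restored.shape)
```

Note that the unpacked array uses one byte per bit, so for the full dataset it may be better to keep the packed bytes and unpack in chunks.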

On Wed, Sep 9, 2020 at 1:28 PM Greg Landrum <greg.land...@gmail.com> wrote:

> The most efficient (easy) way to store the fingerprints is using
> DataStructs.BitVectToBinaryText(). That will return a 64byte binary string
> for a 512bit fingerprint.
>
> FWIW: if you haven't seen the recent blog post about similarity searching
> with short fingerprints:
> http://rdkit.blogspot.com/2020/08/doing-similarity-searches-with-highly.html
>
> -greg
>
>
> On Wed, Sep 9, 2020 at 2:37 AM Lewis Martin <lewis.marti...@gmail.com>
> wrote:
>
>> Hi RDKit,
>>
>> Looking for advice on an rdkit-adjacent problem please. Ultimately I'd
>> like to fit an approximate-nearest neighbors index on a dataset of 100
>> million ligands, featurized by morgan fingerprint. The text file of the
>> smiles is ~6gb but this blows out when loaded with pandas.read_csv() or
>> f.readlines() due to weird memory allocation issues.
>>
>>
>> It would take 45hrs to process the file in serial (i.e. read line, create
>> mol, fingerprint, convert to np.arr or sparse arrays) in a streaming manner
>> so now I'd like to parallelize the job with joblib, which would multiply
>> the memory requirements by the number of processes running at a time.
>>
>> So: what is the smallest possible representation for a binary
>> fingerprint? Using `sys.getsizeof` on a
>> rdkit.DataStructs.cDataStructs.ExplicitBitVect object tells me it is 96
>> bytes, but I'm not sure whether to believe that since, like csr_matrix, the
>> size depends on accurately returning the object's data. Here's an example
>> demonstrating this:
>>
>> from sys import getsizeof
>> from scipy import sparse
>> from rdkit import Chem
>> from rdkit.Chem import rdFingerprintGenerator
>> smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
>> mol = Chem.MolFromSmiles(smi)
>> gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=512)
>> fp = gen_mo.GetFingerprint(mol)
>> sparse_fp = sparse.csr_matrix(fp)
>>
>> print('ExplicitBitVect object size:', getsizeof(fp))
>> print('Sparse matrix size (naive):', getsizeof(sparse_fp))
>> print('Sparse matrix size (real):',
>> sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes)
>> print('fp.ToBinary size:', getsizeof(fp.ToBinary()))
>> print('fp.ToBase64 size:', getsizeof(fp.ToBase64()))
>> >>>
>>
>> ExplicitBitVect object size: 96
>> Sparse matrix size (naive): 64
>> Sparse matrix size (real): 476
>> fp.ToBinary size: 85
>> fp.ToBase64 size: 121
>>
>>
>>
>> Note that even the smallest of these multiplied by 100 million would be
>> about 8gb, still larger than the text file storing the smiles codes - not
>> sure if that is to be expected or not?
>>
>> Thanks for your time!
>> Lewis
>>
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>