The most efficient (easy) way to store the fingerprints is to use
DataStructs.BitVectToBinaryText(). That returns a 64-byte binary string
for a 512-bit fingerprint.
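
A rough sketch of the round trip, reusing the SMILES and generator settings from your example below. The variable names are just for illustration, and DataStructs.CreateFromBinaryText() is the call I'd assume you want for turning the stored bytes back into a bit vector:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Same molecule and fingerprint settings as in the quoted example.
smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
mol = Chem.MolFromSmiles(smi)
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=512)
fp = gen.GetFingerprint(mol)

# Pack the bit vector into raw bytes: 8 fingerprint bits per byte.
packed = DataStructs.BitVectToBinaryText(fp)
print('packed length in bytes:', len(packed))

# Rebuild an ExplicitBitVect from the stored bytes when it is needed again.
restored = DataStructs.CreateFromBinaryText(packed)
print('round trip preserved the on bits:',
      list(restored.GetOnBits()) == list(fp.GetOnBits()))
```

At 64 bytes each, 100 million fingerprints come to about 6.4 GB of raw data, so appending the packed strings to a flat binary file as they are generated should keep the memory needed per process small.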

FWIW, in case you haven't seen it, there's a recent blog post about
similarity searching with short fingerprints:
http://rdkit.blogspot.com/2020/08/doing-similarity-searches-with-highly.html

-greg


On Wed, Sep 9, 2020 at 2:37 AM Lewis Martin <lewis.marti...@gmail.com>
wrote:

> Hi RDKit,
>
> Looking for advice on an RDKit-adjacent problem please. Ultimately I'd
> like to fit an approximate-nearest-neighbors index on a dataset of 100
> million ligands, featurized by Morgan fingerprint. The text file of
> SMILES is ~6 GB, but memory use blows up when it is loaded with
> pandas.read_csv() or f.readlines() due to weird memory allocation issues.
>
>
> It would take 45 hours to process the file serially (i.e. read line, create
> mol, fingerprint, convert to a NumPy or sparse array) in a streaming manner,
> so now I'd like to parallelize the job with joblib, which would multiply
> the memory requirements by the number of processes running at a time.
>
> So: what is the smallest possible representation for a binary fingerprint?
> Using `sys.getsizeof` on an rdkit.DataStructs.cDataStructs.ExplicitBitVect
> object tells me it is 96 bytes, but I'm not sure whether to believe that
> since, as with csr_matrix, the reported size is only right if the object
> accurately accounts for its underlying data. Here's an example
> demonstrating this:
>
> from sys import getsizeof
>
> from scipy import sparse
> from rdkit import Chem
> from rdkit.Chem import rdFingerprintGenerator
>
> smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
> mol = Chem.MolFromSmiles(smi)
> gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=512)
> fp = gen_mo.GetFingerprint(mol)
> sparse_fp = sparse.csr_matrix(fp)
>
> print('ExplicitBitVect object size:', getsizeof(fp))
> print('Sparse matrix size (naive):', getsizeof(sparse_fp))
> print('Sparse matrix size (real):',
>       sparse_fp.data.nbytes + sparse_fp.indices.nbytes + sparse_fp.indptr.nbytes)
> print('fp.ToBinary size:', getsizeof(fp.ToBinary()))
> print('fp.ToBase64 size:', getsizeof(fp.ToBase64()))
> >>>
>
> ExplicitBitVect object size: 96
> Sparse matrix size (naive): 64
> Sparse matrix size (real): 476
> fp.ToBinary size: 85
> fp.ToBase64 size: 121
>
>
>
> Note that even the smallest of these multiplied by 100 million would be
> about 8 GB, still larger than the text file storing the SMILES strings.
> I'm not sure whether that is to be expected.
>
> Thanks for your time!
> Lewis
>
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
