The most space-efficient easy way to store the fingerprints is DataStructs.BitVectToBinaryText(), which returns a 64-byte binary string for a 512-bit fingerprint.
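A minimal sketch of that approach, reusing the SMILES and generator settings from the question. The helper name pack_fingerprint, the two-molecule smiles_list, and the NumPy layout (one row of 64 uint8 values per molecule) are illustrative assumptions, not part of the original suggestion. DataStructs.CreateFromBinaryText() is used to turn the bytes back into a bit vector.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

FP_BITS = 512
FP_BYTES = FP_BITS // 8  # 64 bytes of payload per fingerprint

gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=FP_BITS)


def pack_fingerprint(smiles):
    """Hypothetical helper: SMILES -> packed fingerprint bytes (None if unparsable)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return DataStructs.BitVectToBinaryText(gen.GetFingerprint(mol))


# Illustrative input; in practice this would be streamed from the SMILES file.
smiles_list = ['COCCN(CCO)C(=O)/C=C\\c1cccnc1', 'c1ccccc1O']
packed = [p for p in (pack_fingerprint(s) for s in smiles_list) if p is not None]

# Assumed storage layout: one row per molecule, one uint8 column per byte.
store = np.frombuffer(b''.join(packed), dtype=np.uint8).reshape(-1, FP_BYTES)

# Round trip: rebuild an ExplicitBitVect from one stored row.
restored = DataStructs.CreateFromBinaryText(store[0].tobytes())
original = gen.GetFingerprint(Chem.MolFromSmiles(smiles_list[0]))
print(len(packed[0]), store.shape, restored.GetNumBits())
```

At 64 bytes per molecule, 100 million fingerprints come to about 6.4 GB of raw payload. Returning plain bytes from each worker may also keep the per-process overhead lower than passing ExplicitBitVect objects around, though that is worth measuring rather than assuming.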
FWIW, in case you haven't seen it, there is a recent blog post about similarity searching with short fingerprints:
http://rdkit.blogspot.com/2020/08/doing-similarity-searches-with-highly.html

-greg

On Wed, Sep 9, 2020 at 2:37 AM Lewis Martin <lewis.marti...@gmail.com> wrote:

> Hi RDKit,
>
> Looking for advice on an RDKit-adjacent problem please. Ultimately I'd
> like to fit an approximate-nearest-neighbors index on a dataset of 100
> million ligands, featurized by Morgan fingerprint. The text file of the
> SMILES is ~6 GB, but this blows out when loaded with pandas.read_csv() or
> f.readlines() due to weird memory allocation issues.
>
> It would take 45 hours to process the file in serial (i.e. read line,
> create mol, fingerprint, convert to np.array or sparse arrays) in a
> streaming manner, so now I'd like to parallelize the job with joblib,
> which would multiply the memory requirements by the number of processes
> running at a time.
>
> So: what is the smallest possible representation for a binary fingerprint?
> Using `sys.getsizeof` on a rdkit.DataStructs.cDataStructs.ExplicitBitVect
> object tells me it is 96 bytes, but I'm not sure whether to believe that
> since, as with csr_matrix, the reported size depends on the object
> accurately accounting for its own data.
>
> Here's an example demonstrating this:
>
> from sys import getsizeof
> from scipy import sparse
> from rdkit import Chem
> from rdkit.Chem import rdFingerprintGenerator
>
> smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
> mol = Chem.MolFromSmiles(smi)
> gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=512)
> fp = gen_mo.GetFingerprint(mol)
> sparse_fp = sparse.csr_matrix(fp)
>
> # getsizeof reports the Python object itself, not necessarily its payload
> print('ExplicitBitVect object size:', getsizeof(fp))
> print('Sparse matrix size (naive):', getsizeof(sparse_fp))
> print('Sparse matrix size (real):',
>       sparse_fp.data.nbytes + sparse_fp.indices.nbytes + sparse_fp.indptr.nbytes)
> print('fp.ToBinary size:', getsizeof(fp.ToBinary()))
> print('fp.ToBase64 size:', getsizeof(fp.ToBase64()))
>
> >>>
>
> ExplicitBitVect object size: 96
> Sparse matrix size (naive): 64
> Sparse matrix size (real): 476
> fp.ToBinary size: 85
> fp.ToBase64 size: 121
>
> Note that even the smallest of these multiplied by 100 million would be
> about 8 GB, still larger than the text file storing the SMILES codes. Not
> sure if that is to be expected or not?
>
> Thanks for your time!
> Lewis
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss