Hi RDKit,

Looking for advice on an RDKit-adjacent problem, please. Ultimately I'd like
to fit an approximate nearest-neighbors index on a dataset of 100 million
ligands, featurized by Morgan fingerprint. The text file of SMILES is
~6 GB, but memory use blows up well beyond that when it is loaded with
pandas.read_csv() or f.readlines(), due to memory allocation overhead that
I don't fully understand.


Processing the file serially in a streaming manner (i.e. read a line, create
the mol, fingerprint it, convert to a NumPy array or sparse array) would take
about 45 hours, so I'd like to parallelize the job with joblib. That,
however, would multiply the memory requirements by the number of processes
running at a time.
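
One way to keep the per-process footprint bounded is to hand workers
fixed-size batches of lines rather than the whole file. Below is a minimal
sketch of the batching step only, using the standard library. The name
`iter_batches` and the batch size are hypothetical, chosen for illustration,
and the featurization itself is left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_batches(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size stripped, non-empty lines."""
    stripped = (line.strip() for line in lines)
    non_empty = (line for line in stripped if line)
    while True:
        batch = list(islice(non_empty, batch_size))
        if not batch:
            return
        yield batch


# In-memory stand-in for the SMILES file; a real run would pass an open
# file handle instead, which is also consumed lazily, one line at a time.
demo_lines = ["CCO\n", "c1ccccc1\n", "\n", "CC(=O)O\n"]
batches = list(iter_batches(demo_lines, batch_size=2))
print(batches)
```

Each batch could then be submitted as one joblib task. Assuming joblib
consumes the generator lazily, only a bounded number of batches of strings
would be alive at any time.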

So: what is the smallest possible representation for a binary fingerprint?
Using `sys.getsizeof` on a rdkit.DataStructs.cDataStructs.ExplicitBitVect
object tells me it is 96 bytes, but I'm not sure whether to believe that
since, as with csr_matrix, `sys.getsizeof` is only accurate if the object
itself reports the size of its underlying data. Here's an example
demonstrating this:

from sys import getsizeof

from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from scipy import sparse

smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
mol = Chem.MolFromSmiles(smi)
gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=512)
fp = gen_mo.GetFingerprint(mol)
sparse_fp = sparse.csr_matrix(fp)

print('ExplicitBitVect object size:', getsizeof(fp))
# getsizeof on the csr_matrix ignores its underlying NumPy arrays
print('Sparse matrix size (naive):', getsizeof(sparse_fp))
# sum the three arrays that actually hold the sparse data
print('Sparse matrix size (real):',
      sparse_fp.data.nbytes + sparse_fp.indices.nbytes + sparse_fp.indptr.nbytes)
print('fp.ToBinary size:', getsizeof(fp.ToBinary()))
print('fp.ToBase64 size:', getsizeof(fp.ToBase64()))
>>>

ExplicitBitVect object size: 96
Sparse matrix size (naive): 64
Sparse matrix size (real): 476
fp.ToBinary size: 85
fp.ToBase64 size: 121
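
Part of what `getsizeof` reports for a `bytes` object is a fixed Python
object header rather than payload. The following standard-library check
illustrates this for a 64-byte payload (enough room for 512 bits). It is
CPython-specific, and the header size depends on the build, so no particular
value is hard-coded.

```python
import sys

payload = bytes(64)  # 64 zero bytes, i.e. room for 512 bits

header = sys.getsizeof(b"")     # fixed cost of an empty bytes object
total = sys.getsizeof(payload)  # header plus the 64 payload bytes

print("payload bytes:", len(payload))
print("reported size:", total)
print("overhead:", total - len(payload))
```

The difference between the reported size and `len(payload)` is paid once per
object, so it adds up when each fingerprint is stored as a separate Python
object.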



Note that even the smallest believable figure here (the 85-byte ToBinary
output) multiplied by 100 million would be about 8.5 GB, still larger than
the text file storing the SMILES strings. I'm not sure whether that is to
be expected.
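
As a rough point of comparison for a dense encoding: 512 bits pack into 64
bytes, so 100 million fingerprints would need about 6.4 GB as one contiguous
array with no per-object overhead. Below is a sketch with NumPy, using a
random bit matrix as a stand-in for real fingerprints. The array names, the
small `n`, and the bit density are illustrative assumptions.

```python
import numpy as np

n, n_bits = 1000, 512
rng = np.random.default_rng(0)
# Stand-in for real fingerprints: roughly 8% of bits set per row.
bits = (rng.random((n, n_bits)) < 0.08).astype(np.uint8)

packed = np.packbits(bits, axis=1)  # 8 bits per byte -> shape (n, 64)
bytes_per_fp = packed.shape[1]

# Round trip to confirm that packing loses no information.
restored = np.unpackbits(packed, axis=1)

print(packed.shape, packed.nbytes)
print("projected GB for 1e8 fingerprints:", bytes_per_fp * 100_000_000 / 1e9)
```

Whether a sparse or compressed form can beat this depends on how many bits
are set per fingerprint.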

Thanks for your time!
Lewis
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
