OK, to sum it up: for me, writing to binary is a neat, fast, and low-storage solution for fingerprints. Example:

    import bitstring
    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import rdFingerprintGenerator
    from tqdm import tqdm_notebook

    # df is the DataFrame holding the SMILES strings
    o = open('fingerprints.bin', 'wb')
    gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=64)
    for smi in tqdm_notebook(df['smiles']):
        mol = Chem.MolFromSmiles(smi)
        fp = gen_mo.GetFingerprint(mol)
        # 64 bits pack down to 8 bytes per fingerprint
        bs = bitstring.BitArray(bin=fp.ToBitString())
        o.write(bs.bytes)
    o.close()

Most of the time was being taken up in creating numpy arrays. For instance:

    %%timeit
    fp = np.array(gen_mo.GetFingerprint(mol))

    351 µs ± 2.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

vs:

    %%timeit
    fp = gen_mo.GetFingerprint(mol)
    bs = bitstring.BitArray(bin=fp.ToBitString())

    42 µs ± 273 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So when you remove that step, 100 million ligands take about 4 1/2 hours in serial, with no need for parallelization.

Cheers!

PS for newbs like me, read the binary fingerprints like this:

    # num_fps is the number of fingerprints written to the file
    i = open('fingerprints.bin', 'rb')
    bits = ''
    for _ in range(num_fps):
        fp_bytes = i.read(8)  # one 64-bit fingerprint
        bits += bitstring.BitArray(bytes=fp_bytes).bin
    i.close()
    # turn the '0'/'1' characters into a (num_fps, 64) array of 0s and 1s
    fingerprints_concat = (np.frombuffer(bits.encode('ascii'), 'u1') - ord('0')).reshape(num_fps, 64)

On Wed, Sep 9, 2020 at 1:28 PM Greg Landrum <greg.land...@gmail.com> wrote:

> The most efficient (easy) way to store the fingerprints is using
> DataStructs.BitVectToBinaryText(). That will return a 64-byte binary string
> for a 512-bit fingerprint.
>
> FWIW: if you haven't seen the recent blog post about similarity searching
> with short fingerprints:
> http://rdkit.blogspot.com/2020/08/doing-similarity-searches-with-highly.html
>
> -greg
>
>
> On Wed, Sep 9, 2020 at 2:37 AM Lewis Martin <lewis.marti...@gmail.com>
> wrote:
>
>> Hi RDKit,
>>
>> Looking for advice on an rdkit-adjacent problem please. Ultimately I'd
>> like to fit an approximate-nearest-neighbors index on a dataset of 100
>> million ligands, featurized by Morgan fingerprint. The text file of the
>> SMILES is ~6 GB, but this blows out when loaded with pandas.read_csv() or
>> f.readlines() due to weird memory allocation issues.
>>
>> It would take 45 hrs to process the file in serial (i.e. read line, create
>> mol, fingerprint, convert to np.array or sparse arrays) in a streaming
>> manner, so now I'd like to parallelize the job with joblib, which would
>> multiply the memory requirements by the number of processes running at a
>> time.
>>
>> So: what is the smallest possible representation for a binary
>> fingerprint? Using `sys.getsizeof` on a
>> rdkit.DataStructs.cDataStructs.ExplicitBitVect object tells me it is 96
>> bytes, but I'm not sure whether to believe that since, as with csr_matrix,
>> the reported size depends on the object accurately accounting for its
>> underlying data. Here's an example demonstrating this:
>>
>> from sys import getsizeof
>> from scipy import sparse
>> from rdkit import Chem
>> from rdkit.Chem import rdFingerprintGenerator
>> smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
>> mol = Chem.MolFromSmiles(smi)
>> gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=512)
>> fp = gen_mo.GetFingerprint(mol)
>> sparse_fp = sparse.csr_matrix(fp)
>>
>> print('ExplicitBitVect object size:', getsizeof(fp))
>> print('Sparse matrix size (naive):', getsizeof(sparse_fp))
>> print('Sparse matrix size (real):',
>> sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes)
>> print('fp.ToBinary size:', getsizeof(fp.ToBinary()))
>> print('fp.ToBase64 size:', getsizeof(fp.ToBase64()))
>>
>> >>>
>>
>> ExplicitBitVect object size: 96
>> Sparse matrix size (naive): 64
>> Sparse matrix size (real): 476
>> fp.ToBinary size: 85
>> fp.ToBase64 size: 121
>>
>>
>> Note that even the smallest of these multiplied by 100 million would be
>> about 8 GB, still larger than the text file storing the SMILES codes. Not
>> sure if that is to be expected or not?
>>
>> Thanks for your time!
>> Lewis
>>
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
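
For anyone who wants to try the fixed-width binary record idea from this thread without installing anything, here is a minimal standard-library sketch. It is not the RDKit or bitstring API: pack_bits, unpack_bits and read_fingerprints are hypothetical helper names, the example bit strings are made-up stand-ins for fp.ToBitString() output, and an in-memory buffer stands in for fingerprints.bin. It assumes 64-bit fingerprints with the first character stored as the most significant bit.

```python
import io

FP_BITS = 64              # fingerprint length in bits (assumed, as with fpSize=64)
FP_BYTES = FP_BITS // 8   # 8 bytes per fixed-width record


def pack_bits(bit_text):
    """Pack a '0'/'1' string (first character = most significant bit) into bytes."""
    if len(bit_text) != FP_BITS:
        raise ValueError("expected %d bits, got %d" % (FP_BITS, len(bit_text)))
    return int(bit_text, 2).to_bytes(FP_BYTES, "big")


def unpack_bits(record):
    """Inverse of pack_bits: bytes back to a '0'/'1' string, keeping leading zeros."""
    return format(int.from_bytes(record, "big"), "0%db" % FP_BITS)


def read_fingerprints(handle):
    """Yield one bit string per fixed-width record until the stream is exhausted."""
    while True:
        record = handle.read(FP_BYTES)
        if not record:
            break
        if len(record) != FP_BYTES:
            raise ValueError("truncated record at end of file")
        yield unpack_bits(record)


# Hypothetical stand-ins for fp.ToBitString() output.
example_fps = ["1" + "0" * 63, "0" * 63 + "1", "01" * 32]

buffer = io.BytesIO()     # stands in for open('fingerprints.bin', 'wb')
for bits in example_fps:
    buffer.write(pack_bits(bits))

# Read the records back one at a time instead of building one huge string.
buffer.seek(0)
round_tripped = list(read_fingerprints(buffer))
```

Reading with a generator avoids needing num_fps up front and keeps only one record in memory at a time. At 8 bytes per record, 100 million fingerprints come to 800 MB.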