Cheers Francois - that might be the way to go actually. I'll try with 'bitstring' https://github.com/scott-griffiths/bitstring and I guess write the data as concatenated bitarrays in chunked binary files.
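A minimal sketch of the chunked-binary-file idea, using only the Python standard library rather than 'bitstring'. It assumes each fingerprint is available as a list of on-bit indices (for an RDKit ExplicitBitVect that could be list(fp.GetOnBits())). The helper names pack_bits, unpack_bits, write_chunks and read_chunk, and the chunk file naming, are made up for illustration.

```python
import os
import tempfile

N_BITS = 512                 # fingerprint length, as in the thread
RECORD_BYTES = N_BITS // 8   # 64 bytes per fingerprint, no per-object overhead


def pack_bits(on_bits, n_bits=N_BITS):
    """Pack a list of on-bit indices into n_bits // 8 raw bytes."""
    buf = bytearray(n_bits // 8)
    for i in on_bits:
        buf[i // 8] |= 1 << (i % 8)   # bit i lives in byte i // 8
    return bytes(buf)


def unpack_bits(record):
    """Recover the sorted on-bit indices from one packed record."""
    return [8 * j + k for j, byte in enumerate(record)
            for k in range(8) if (byte >> k) & 1]


def write_chunks(fingerprints, out_dir, chunk_size):
    """Stream fingerprints into files of at most chunk_size records each."""
    paths, handle, count = [], None, 0
    for on_bits in fingerprints:
        if count % chunk_size == 0:
            if handle:
                handle.close()
            paths.append(os.path.join(out_dir, f"chunk_{len(paths):05d}.bin"))
            handle = open(paths[-1], "wb")
        handle.write(pack_bits(on_bits))
        count += 1
    if handle:
        handle.close()
    return paths


def read_chunk(path):
    """Yield on-bit index lists from one chunk file."""
    with open(path, "rb") as fh:
        while record := fh.read(RECORD_BYTES):
            yield unpack_bits(record)


# Hypothetical stand-ins for real fingerprints.
fps = [[0, 9, 511], [1, 2, 3], [100]]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_chunks(fps, tmp, chunk_size=2)
    n_files = len(paths)
    restored = [bits for p in paths for bits in read_chunk(p)]
```

Because every record has the same width, record i of a chunk starts at byte offset i * 64, so parallel workers can each write their own chunk files and a reader can seek straight to any fingerprint.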
I'd like to keep it FOSS since it's for academic publication and hopefully to be re-used. Chemfp is amazing but brute-forcing 100 million by 100 million would surely still take a long time compared with an approximate nearest neighbor approach. Straying from RDKit so I'll leave it there - thanks!

On Wed, Sep 9, 2020 at 11:29 AM Francois Berenger <mli...@ligand.eu> wrote:

> On 09/09/2020 09:35, Lewis Martin wrote:
> > Hi RDKit,
> >
> > Looking for advice on an rdkit-adjacent problem please. Ultimately I'd
> > like to fit an approximate-nearest neighbors index on a dataset of 100
> > million ligands, featurized by morgan fingerprint. The text file of
> > the smiles is ~6gb but this blows out when loaded with
> > pandas.read_csv() or f.readlines() due to weird memory allocation
> > issues.
> >
> > It would take 45hrs to process the file in serial (i.e. read line,
> > create mol, fingerprint, convert to np.arr or sparse arrays) in a
> > streaming manner so now I'd like to parallelize the job with joblib,
> > which would multiply the memory requirements by the number of
> > processes running at a time.
> >
> > So: what is the smallest possible representation for a binary
> > fingerprint? Using `sys.getsizeof` on a
> > rdkit.DataStructs.cDataStructs.ExplicitBitVect object tells me it is
> > 96 bytes, but I'm not sure whether to believe that since, like
> > csr_matrix, the size depends on accurately returning the object's
> > data. Here's an example demonstrating this:
> >
> > from sys import getsizeof
> > from scipy import sparse
> > from rdkit import Chem
> > from rdkit.Chem import rdFingerprintGenerator
> > smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
> > mol = Chem.MolFromSmiles(smi)
> > gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2,
> > fpSize=512)
>
> Obviously, if you ask for fpSize = 512, the smallest uncompressed
> representation of the fingerprint will be 512 bits (64 bytes).
>
> 100M of such fingerprints, if there is not any overhead added by the
> programming language,
> would take about 6.4GB of RAM.
>
> But, the really fun things will start when you want to search fast into
> so many molecules. :)
> There are many published methods, some open-source software (like
> Dalke's chemfp) and even some commercial ones
> which claim they are lightning fast (even reaching real-time search
> speed!).
>
> e.g.
> https://chemaxon.com/products/madfast
> https://www.nextmovesoftware.com/arthor.html
>
> Regards,
> F.
>
> > fp = gen_mo.GetFingerprint(mol)
> > sparse_fp = sparse.csr_matrix(fp)
> >
> > print('ExplicitBitVect object size:', getsizeof(fp))
> > print('Sparse matrix size (naive):', getsizeof(sparse_fp))
> > print('Sparse matrix size (real):',
> > sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes)
> > print('fp.ToBinary size:', getsizeof(fp.ToBinary()))
> > print('fp.ToBase64 size:', getsizeof(fp.ToBase64()))
> > >>>
> >
> > ExplicitBitVect object size: 96
> > Sparse matrix size (naive): 64
> > Sparse matrix size (real): 476
> > fp.ToBinary size: 85
> > fp.ToBase64 size: 121
> >
> > Note that even the smallest of these multiplied by 100 million would
> > be about 8gb, still larger than the text file storing the smiles codes
> > - not sure if that is to be expected or not?
> >
> > Thanks for your time!
> > Lewis
> > _______________________________________________
> > Rdkit-discuss mailing list
> > Rdkit-discuss@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
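On the sizes quoted above: sys.getsizeof reports the payload plus the interpreter's per-object header, so the figures for the ToBinary() and ToBase64() results overstate the data they carry, and len() is probably the more useful measure for serialized output. A small sketch of the arithmetic, using a 64-byte bytes object as a stand-in for one packed 512-bit fingerprint (the variable names are illustrative, and the exact header size depends on the Python build):

```python
import sys

N_BITS = 512
N_MOLS = 100_000_000

payload = bytes(N_BITS // 8)           # 64 zero bytes standing in for one packed fingerprint
payload_bytes = len(payload)           # the data itself
object_bytes = sys.getsizeof(payload)  # data plus the interpreter's per-object header

total_packed = payload_bytes * N_MOLS  # bytes needed if records are stored back to back
overhead_per_object = object_bytes - payload_bytes

print(f"payload: {payload_bytes} B, as a Python object: {object_bytes} B")
print(f"{N_MOLS:,} packed fingerprints: {total_packed / 1e9:.1f} GB")
```

Stored back to back in one buffer or file, the 100 million fingerprints need 6.4 GB. Held as 100 million separate Python objects, each one also pays the per-object overhead.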