Cheers Francois - that might be the way to go actually. I'll try with
'bitstring' https://github.com/scott-griffiths/bitstring and I guess write
the data as concatenated bitarrays in chunked binary files.
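
A minimal sketch of the chunked-binary idea, assuming 512-bit fingerprints supplied as lists of on-bit indices (with RDKit these could come from `fp.GetOnBits()`). It uses plain Python `bytes` rather than bitstring, and the helper names `pack_fp`, `unpack_fp`, `write_chunk` and `read_fp` are hypothetical:

```python
FP_BITS = 512
FP_BYTES = FP_BITS // 8  # 64 bytes per fingerprint, no per-record header


def pack_fp(on_bits, n_bits=FP_BITS):
    """Pack a collection of on-bit indices into n_bits // 8 bytes."""
    value = 0
    for i in on_bits:
        value |= 1 << i
    return value.to_bytes(n_bits // 8, "little")


def unpack_fp(buf):
    """Recover the sorted list of on-bit indices from packed bytes."""
    value = int.from_bytes(buf, "little")
    return [i for i in range(len(buf) * 8) if (value >> i) & 1]


def write_chunk(path, fps):
    """Write fingerprints back to back into one binary chunk file."""
    with open(path, "wb") as fh:
        for on_bits in fps:
            fh.write(pack_fp(on_bits))


def read_fp(path, index):
    """Read the index-th fingerprint from a chunk file by seeking to it."""
    with open(path, "rb") as fh:
        fh.seek(index * FP_BYTES)
        return unpack_fp(fh.read(FP_BYTES))
```

Because every record has the same length, any fingerprint can be located by offset alone, so no index file is needed alongside the chunks.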

I'd like to keep it FOSS since it's for an academic publication and will
hopefully be re-used. Chemfp is amazing, but brute-forcing 100 million by
100 million comparisons would surely still take a long time compared with
an approximate nearest neighbor approach.

Straying from RDKit so I'll leave it there - thanks!

On Wed, Sep 9, 2020 at 11:29 AM Francois Berenger <mli...@ligand.eu> wrote:

> On 09/09/2020 09:35, Lewis Martin wrote:
> > Hi RDKit,
> >
> > Looking for advice on an RDKit-adjacent problem please. Ultimately I'd
> > like to fit an approximate nearest neighbors index on a dataset of 100
> > million ligands, featurized by Morgan fingerprint. The text file of
> > the SMILES is ~6 GB, but memory use blows up when it is loaded with
> > pandas.read_csv() or f.readlines() due to weird memory allocation
> > issues.
> >
> > It would take 45 hours to process the file in serial (i.e. read line,
> > create mol, fingerprint, convert to np.array or sparse arrays) in a
> > streaming manner, so now I'd like to parallelize the job with joblib,
> > which would multiply the memory requirements by the number of
> > processes running at a time.
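
A minimal sketch of the streaming pattern described above, assuming one SMILES per line. Only one chunk of lines is held in memory at a time, so with parallel workers the peak is roughly the chunk size times the number of workers. `featurize_chunk` is a hypothetical stand-in for the real SMILES -> mol -> fingerprint step, and the loop here is serial; each chunk could instead be handed to a joblib worker:

```python
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size stripped, non-empty lines."""
    stripped = (ln.strip() for ln in lines)
    non_empty = (ln for ln in stripped if ln)
    while True:
        chunk = list(islice(non_empty, chunk_size))
        if not chunk:
            return
        yield chunk


def featurize_chunk(smiles_chunk):
    """Hypothetical placeholder: the real version would build mols and
    fingerprints. Here it just returns the length of each SMILES string."""
    return [len(smi) for smi in smiles_chunk]


def process_stream(lines, chunk_size=100_000):
    """Process an iterable of lines chunk by chunk without loading it all."""
    for chunk in iter_chunks(lines, chunk_size):
        yield featurize_chunk(chunk)
```

An open file object can be passed directly as `lines`, since iterating over it reads one line at a time rather than the whole file.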
> >
> > So: what is the smallest possible representation for a binary
> > fingerprint? Using `sys.getsizeof` on a
> > rdkit.DataStructs.cDataStructs.ExplicitBitVect object tells me it is
> > 96 bytes, but I'm not sure whether to believe that since, as with
> > csr_matrix, the result is only accurate if the object reports the
> > size of its underlying data. Here's an example demonstrating this:
> >
> > from rdkit import Chem
> > from rdkit.Chem import rdFingerprintGenerator
> > smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
> > mol = Chem.MolFromSmiles(smi)
> > gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2,
> > fpSize=512)
>
> Obviously, if you ask for fpSize = 512, the smallest uncompressed
> representation of the fingerprint will be 512 bits (64 bytes).
>
> 100M of such fingerprints, if there is not any overhead added by the
> programming language,
> would fit into about 6.4GB of RAM.
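
The arithmetic behind that estimate, as a small sketch. It assumes tightly packed fingerprints with no per-object overhead, and `packed_size_bytes` is a hypothetical helper name:

```python
def packed_size_bytes(n_fps, fp_bits):
    """Total bytes for n_fps fingerprints of fp_bits bits, tightly packed."""
    return n_fps * (fp_bits // 8)


# 512 bits is 64 bytes per fingerprint.
per_fp = packed_size_bytes(1, 512)

# 100 million fingerprints of 64 bytes each, expressed in GB (10**9 bytes).
total_gb = packed_size_bytes(100_000_000, 512) / 1e9

print("bytes per fingerprint:", per_fp)
print("GB for 100 million fingerprints:", total_gb)
```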
>
> But the really fun part will start when you want to search quickly
> through so many molecules. :)
> There are many published methods, some open-source software (like
> Dalke's chemfp) and even some commercial ones
> which claim they are lightning fast (even reaching real-time search
> speed!).
>
> e.g.
> https://chemaxon.com/products/madfast
> https://www.nextmovesoftware.com/arthor.html
>
> Regards,
> F.
>
> > from sys import getsizeof
> > from scipy import sparse
> > fp = gen_mo.GetFingerprint(mol)
> > sparse_fp = sparse.csr_matrix(fp)
> >
> > print('ExplicitBitVect object size:', getsizeof(fp))
> > print('Sparse matrix size (naive):', getsizeof(sparse_fp))
> > print('Sparse matrix size (real):',
> > sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes)
> > print('fp.ToBinary size:', getsizeof(fp.ToBinary()))
> > print('fp.ToBase64 size:', getsizeof(fp.ToBase64()))
> > >>>
> >
> > ExplicitBitVect object size: 96
> > Sparse matrix size (naive): 64
> > Sparse matrix size (real): 476
> > fp.ToBinary size: 85
> > fp.ToBase64 size: 121
> >
> > Note that even the smallest of these multiplied by 100 million would
> > be about 8 GB, still larger than the text file storing the SMILES
> > strings - not sure if that is to be expected or not?
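
One caveat when reading the numbers above: `sys.getsizeof` on a Python `bytes` object counts the object header as well as the payload, so it overstates what the same data would occupy in a file or in one contiguous buffer. A minimal sketch, using 64 zero bytes as a stand-in for a packed 512-bit fingerprint (the exact overhead is a CPython implementation detail, and `payload_and_overhead` is a hypothetical helper name):

```python
import sys


def payload_and_overhead(obj):
    """Return (payload length, getsizeof minus payload) for a bytes object."""
    total = sys.getsizeof(obj)
    return len(obj), total - len(obj)


packed = bytes(64)  # stand-in for a packed 512-bit fingerprint
payload, overhead = payload_and_overhead(packed)

print("payload bytes:", payload)
print("per-object overhead in bytes:", overhead)
```

Storing many fingerprints back to back in a single buffer or file pays that overhead once rather than once per fingerprint.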
> >
> > Thanks for your time!
> > Lewis
> > _______________________________________________
> > Rdkit-discuss mailing list
> > Rdkit-discuss@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
