Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

Andrew Dalke Tue, 08 Sep 2020 22:45:11 -0700

On Sep 9, 2020, at 04:00, Lewis Martin <lewis.marti...@gmail.com> wrote:
> I'd like to keep it FOSS since its for academic publication and hopefully to 
> be re-used. Chemfp is amazing but brute-forcing 100million by 100million 
> would surely still take a long time compared with an approximate nearest 
> neighbor approach.

As a rough guideline, the chemfp paper Fig. 3 says that searching 100M PubChem 
fingerprints for the k=10 neighbors averages about 100 ms on a Haswell machine 
(See https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0398-8 ).

That was for chemfp 3.3 using 881 bits. You have 512 bits, which is faster 
(0.6x time, according to my timings). Chemfp 1.6 is about the same performance 
as chemfp 3.4 for this case.

So that's 60ms per query, on average, or 70 days for a single-processor query.

(It's hard to compare directly though as the BitBound algorithm is sublinear; I 
found the PubChem fingerprint kNN search scales at O(n** ~0.6) while Morgan 
2048-bit scales at O(n** ~0.8). Then again, the Haswell machine is several 
years old, so I'm leaving the numbers as-is for an estimate.

This is easily parallelized yourself, or you can use chemfp's built-in 
parallelized NxN kNN searches. (The amount of scaleup on a single multi-core 
machine depends on your memory architecture.)

This means that it shouldn't be hard to get a <1 week performance out of chemfp.

Performance-wise, chemfp 3.4.1 (the latest version of the commercial track) is 
the same as chemfp 1.6.1 (the latest *no-cost*/open-source version).

One caveat - chemfp 1.x has a limit of 2**31 bytes for the fingerprints, or 
33.6M 512-bit fingerprints. (chemfp 2.x or later use 64-bit indexing and don't 
have this limitation.) To get around that, with 1.6.1, you'll need split your 
data into 4 chucks, and search each independently. This also reduces the effect 
of sublinear scaling.

Or, there's no-cost academic licensing for chemfp 3.x using pre-compiled 
binaries which work on most Linux-based OSes.

Still, I think an exact search is entirely doable using FOSS software.

Switching to a smaller fingerprint size is a form of approximate nearest 
neighbors.

If you take Greg's observation and drop the fingerprint size from 512 bits to 
128 bits then you can get about a factor of 3 faster, and have everything fit 
into chemfp 1.6's 2GB limit.

Best regards,

                                Andrew
                                da...@dalkescientific.com

P.S.

On Wed, Sep 9, 2020 at 11:29 AM Francois Berenger <mli...@ligand.eu> wrote:
> There are many published methods, some open-source software (like 
> Dalke's chemfp) and even some commercial ones
> which claim they are lightning fast (even reaching real-time search 
> speed!).

Dalke's chemfp also claims real-time search speed. ;)

It's always a claim of what size that means. 

Also, at least a couple of those systems use multiple threads to speed up 
single-query search, which interferes with multi-threaded batch search, as with 
NxN search on a single machine.

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

Reply via email to