On Sep 9, 2020, at 04:00, Lewis Martin <lewis.marti...@gmail.com> wrote: > I'd like to keep it FOSS since its for academic publication and hopefully to > be re-used. Chemfp is amazing but brute-forcing 100million by 100million > would surely still take a long time compared with an approximate nearest > neighbor approach.
As a rough guideline, the chemfp paper Fig. 3 says that searching 100M PubChem fingerprints for the k=10 neighbors averages about 100 ms on a Haswell machine (See https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0398-8 ). That was for chemfp 3.3 using 881 bits. You have 512 bits, which is faster (0.6x time, according to my timings). Chemfp 1.6 is about the same performance as chemfp 3.4 for this case. So that's 60ms per query, on average, or 70 days for a single-processor query. (It's hard to compare directly though as the BitBound algorithm is sublinear; I found the PubChem fingerprint kNN search scales at O(n** ~0.6) while Morgan 2048-bit scales at O(n** ~0.8). Then again, the Haswell machine is several years old, so I'm leaving the numbers as-is for an estimate. This is easily parallelized yourself, or you can use chemfp's built-in parallelized NxN kNN searches. (The amount of scaleup on a single multi-core machine depends on your memory architecture.) This means that it shouldn't be hard to get a <1 week performance out of chemfp. Performance-wise, chemfp 3.4.1 (the latest version of the commercial track) is the same as chemfp 1.6.1 (the latest *no-cost*/open-source version). One caveat - chemfp 1.x has a limit of 2**31 bytes for the fingerprints, or 33.6M 512-bit fingerprints. (chemfp 2.x or later use 64-bit indexing and don't have this limitation.) To get around that, with 1.6.1, you'll need split your data into 4 chucks, and search each independently. This also reduces the effect of sublinear scaling. Or, there's no-cost academic licensing for chemfp 3.x using pre-compiled binaries which work on most Linux-based OSes. Still, I think an exact search is entirely doable using FOSS software. Switching to a smaller fingerprint size is a form of approximate nearest neighbors. If you take Greg's observation and drop the fingerprint size from 512 bits to 128 bits then you can get about a factor of 3 faster, and have everything fit into chemfp 1.6's 2GB limit. Best regards, Andrew da...@dalkescientific.com P.S. On Wed, Sep 9, 2020 at 11:29 AM Francois Berenger <mli...@ligand.eu> wrote: > There are many published methods, some open-source software (like > Dalke's chemfp) and even some commercial ones > which claim they are lightning fast (even reaching real-time search > speed!). Dalke's chemfp also claims real-time search speed. ;) It's always a claim of what size that means. Also, at least a couple of those systems use multiple threads to speed up single-query search, which interferes with multi-threaded batch search, as with NxN search on a single machine. _______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss