On 6/5/2020 4:45 AM, Greg Landrum wrote:
> Having said that, the team behind ZINC used to use the RDKit cartridge
> with PostgreSQL as the backend for ZINC. They had the database sharded
> across multiple instances and managed to get the fingerprint indices
> working there. I don't remember the substructure search performance
> being terrible, but it wasn't great either. They have since switched to
> a specialized system (Arthor, from NextMove Software), which offers
> significantly better performance.
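(For anyone who hasn't used the cartridge: a substructure query against
it looks roughly like the sketch below. The connection string, table,
and column names are made up for illustration; the @> operator and the
GiST index are the cartridge's documented mechanism.)

    import psycopg2

    # Hypothetical connection and schema: a table mols(id, m) where m
    # is the cartridge's mol type. Substitute your own.
    conn = psycopg2.connect("dbname=zinc")
    cur = conn.cursor()

    # The GiST fingerprint index is what makes substructure search
    # feasible at all on a large table.
    cur.execute("CREATE INDEX IF NOT EXISTS molidx ON mols USING gist (m)")
    conn.commit()

    # @> is the cartridge's substructure-containment operator.
    cur.execute(
        "SELECT id FROM mols WHERE m @> %s::qmol LIMIT 100",
        ("c1ccc2ccccc2c1",),  # naphthalene as the query pattern
    )
    print(cur.fetchall())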
Generally speaking, a database of a billion rows needs hardware capable
of running it. Buy a server with 1 TB of RAM, 64 cores, and a couple of
U.2 NVMe drives, and see how Postgres runs on that.
Then you need to look at the database itself: e.g., a query against an
indexed billion-row table may be fine, but inserting the
billion-and-first row will not be.
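One standard mitigation, sketched here under the same hypothetical
schema as above: skip per-row index maintenance for bulk loads by
dropping the fingerprint index, COPYing the new rows in, and rebuilding
the index once at the end.

    # Bulk-load pattern: one index build instead of a billion
    # incremental updates. File name and schema are hypothetical.
    cur.execute("DROP INDEX IF EXISTS molidx")
    with open("new_mols.csv") as f:
        # Each CSV row: id,SMILES -- the cartridge parses the SMILES
        # text into its mol type on input.
        cur.copy_expert("COPY mols (id, m) FROM STDIN WITH CSV", f)
    cur.execute("CREATE INDEX molidx ON mols USING gist (m)")
    conn.commit()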
If you want to scale to these kinds of volumes, you need to do some
work. (And much of the point of NoSQL/Hadoop "cloud" workflows is that
if you can parallelize what you're doing across multiple machines, at
some data size they will start to outperform a centralized fast search
engine.)
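To make that concrete with the sharded setup Greg mentions: fanning one
substructure query out to N PostgreSQL instances and merging the hits is
only a few lines. The shard DSNs below are placeholders.

    from concurrent.futures import ThreadPoolExecutor
    import psycopg2

    SHARDS = [  # placeholder DSNs, one per shard
        "dbname=zinc host=shard1",
        "dbname=zinc host=shard2",
    ]

    def search_shard(dsn, smarts, limit=100):
        # Run the same indexed substructure query on one shard.
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT id FROM mols WHERE m @> %s::qmol LIMIT %s",
                (smarts, limit),
            )
            return [row[0] for row in cur.fetchall()]

    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        parts = pool.map(lambda d: search_shard(d, "c1ccc2ccccc2c1"), SHARDS)
        hits = [h for part in parts for h in part]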
Dima