Thank you, everyone, for the suggestions. I don't have immediate plans to adopt the cartridge, but it's good to know these things when the time comes.
Best,
Ivan

On Mon, Jun 8, 2020 at 6:49 PM Finnerty, Jim via Rdkit-discuss <rdkit-discuss@lists.sourceforge.net> wrote:

> If you have a billion-molecule data source and would like to try an at-scale test, I'd be willing to help out with provisioning the hardware, looking at the efficiency of the plans, etc., using RDKit with Aurora PostgreSQL.
>
> If I understand how the RDKit GiST index filtering mechanism works for a given similarity metric, a parallel GiST index scan ought to scale almost linearly with the number of cores, provided that the RDBMS is built on a scalable storage subsystem.
>
> If so, the largest instance size that's currently supported has 96 cores, so we can achieve a fairly high degree of parallelism.
>
> On 6/5/20, 1:07 PM, "dmaziuk via Rdkit-discuss" <rdkit-discuss@lists.sourceforge.net> wrote:
>
> > On 6/5/2020 4:45 AM, Greg Landrum wrote:
> >
> > > Having said that, the team behind ZINC used to use the RDKit cartridge with PostgreSQL as the backend for ZINC. They had the database sharded across multiple instances and managed to get the fingerprint indices to work there. I don't remember the substructure search performance being terrible, but it wasn't great either. They have since switched to a specialized system (Arthor, from NextMove Software), which offers significantly better performance.
> >
> > Generally speaking, a database of a billion rows needs hardware capable of running it. Buy a server with 1 TB of RAM, 64 cores, and a couple of U.2 NVMe drives and see how Postgres runs on that.
> >
> > Then you need to look at the database itself: e.g., a query against an indexed billion-row table could be OK, but inserting the billion-and-first row will not be.
> >
> > If you want to scale to these kinds of volumes, you need to do some work.
> > (And much of the point of NoSQL/Hadoop "cloud" workflows is that if you can parallelize what you're doing across multiple machines, at some data size they will start outperforming a centralized fast search engine.)
> >
> > Dima

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
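[Editor's note: the GiST index filtering Jim refers to works by rejecting most candidates from popcount (bit-count) bounds alone, without touching the full fingerprints. For Tanimoto similarity, a query with nq set bits and a candidate with nf set bits can score at most min(nq, nf) / max(nq, nf), so anything below the threshold on that bound can be skipped. The sketch below illustrates the pruning idea only; it is not the cartridge's actual implementation, and the fingerprints are toy Python ints.]

```python
def popcount(x: int) -> int:
    # Number of set bits; fingerprints are modeled as plain Python ints.
    return bin(x).count("1")

def tanimoto(a: int, b: int) -> float:
    # Tanimoto similarity of two bit-vector fingerprints:
    # |a & b| / |a | b|, defined as 0.0 for two empty fingerprints.
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0

def search(query: int, db: list[int], threshold: float) -> list[int]:
    """Return indices of fingerprints with Tanimoto >= threshold,
    skipping candidates whose popcount bound already rules them out."""
    nq = popcount(query)
    hits = []
    for i, fp in enumerate(db):
        nf = popcount(fp)
        # Upper bound from counts alone:
        # |a & b| <= min(nq, nf) and |a | b| >= max(nq, nf).
        if max(nq, nf) == 0 or min(nq, nf) / max(nq, nf) < threshold:
            continue  # index-style rejection, no bitwise comparison needed
        if tanimoto(query, fp) >= threshold:
            hits.append(i)
    return hits
```

Because the bound depends only on stored bit counts, an index can discard most of a billion rows cheaply, and the surviving candidates can be verified in parallel across cores, which is what makes the near-linear parallel scan plausible.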