Thank you everyone for the suggestions. I don't have immediate plans to
adopt the cartridge, but it's good to know these things when the time
comes.

Best,
Ivan

On Mon, Jun 8, 2020 at 6:49 PM Finnerty, Jim via Rdkit-discuss <
rdkit-discuss@lists.sourceforge.net> wrote:

> If you have a billion molecule data source and would like to try an
> at-scale test, I'd be willing to help out with provisioning the hardware,
> looking at the efficiency of the plans, etc., using rdkit with Aurora
> PostgreSQL.
>
> If I understand how the rdkit GIST index filtering mechanism works for a
> given similarity metric, a parallel GIST index scan ought to be able to
> scale almost linearly with the number of cores, provided that the RDBMS
> is built on a scalable storage subsystem.
>
> If so, the largest instance size that's currently supported has 96 cores,
> so we can do a fairly high degree of parallelism.
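>
> To make that concrete, the kind of test I have in mind would look roughly
> like the sketch below (the table and column names are made up for
> illustration, and whether the planner actually parallelizes the GIST scan
> is exactly what the EXPLAIN output would need to confirm):
>
>     CREATE EXTENSION IF NOT EXISTS rdkit;
>
>     -- hypothetical billion-row fingerprint table
>     CREATE TABLE fps (id bigint PRIMARY KEY, mfp2 bfp);
>     CREATE INDEX fps_mfp2_idx ON fps USING gist (mfp2);
>
>     -- Tanimoto similarity search: the % operator filters against the
>     -- rdkit.tanimoto_threshold setting and can use the GIST index
>     SET rdkit.tanimoto_threshold = 0.6;
>     EXPLAIN (ANALYZE, BUFFERS)
>     SELECT id
>     FROM fps
>     WHERE mfp2 % morganbv_fp('Cc1ccncc1'::mol, 2);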
>
> On 6/5/20, 1:07 PM, "dmaziuk via Rdkit-discuss" <
> rdkit-discuss@lists.sourceforge.net> wrote:
>
>     On 6/5/2020 4:45 AM, Greg Landrum wrote:
>
>     > Having said that, the team behind ZINC used to use the RDKit
>     > cartridge with PostgreSQL as the backend for ZINC. They had the
>     > database sharded across multiple instances and managed to get the
>     > fingerprint indices to work there. I don't remember the substructure
>     > search performance being terrible, but it wasn't great either. They
>     > have since switched to a specialized system (Arthor from NextMove
>     > Software), which offers significantly better performance.
>
>     Generally speaking, a billion-row database needs hardware capable of
>     running it. Buy a server with 1 TB of RAM, 64 cores, and a couple of
>     U.2 NVMe drives, and see how Postgres runs on that.
>
>     Then you need to look at the database itself: e.g., a query against
>     an indexed billion-row table can be OK, but inserting the
>     billion-and-first row will not be.
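>
>     E.g. (schema made up for illustration, along the lines of the
>     cartridge tutorial): a bulk load is usually cheaper with the index
>     dropped and rebuilt once at the end than with the index maintained
>     row by row:
>
>         DROP INDEX IF EXISTS fps_mfp2_idx;
>         INSERT INTO fps (id, mfp2)
>             SELECT id, morganbv_fp(m, 2)  -- assumes a mols(id, m mol) table
>             FROM mols;
>         CREATE INDEX fps_mfp2_idx ON fps USING gist (mfp2);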
>
>     If you want to scale to these kinds of volumes, you need to do some
>     work.
>
>     (And much of the point of NoSQL/Hadoop "cloud" workflows is that if
>     you can parallelize what you're doing across multiple machines, at
>     some data size they will start outperforming a centralized fast
>     search engine.)
>
>     Dima
>
>
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
