Hi Naomi,

I once experimented a bit with this idea, using the Lucene-based RDKit
example as guidance. However, that code, and hence my adaptation of it for
Elasticsearch, only does the fingerprint screening part inside the search
engine. For the actual subgraph match, the candidate hits have to be sent to
the caller/client, so the match doesn't run inside Elasticsearch. That means
one must adjust the Elasticsearch results (hit count, paging, ...) before
finally returning them to the end-user application. Simply put, it's a very
hacky and not very usable solution.
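
To make the two-stage flow concrete, here is a minimal sketch in plain
Python. Everything in it is an illustrative assumption, not the code I
actually ran: fingerprints are packed into plain integers, screen() stands in
for the screening step done by the search engine, and is_match is a
hypothetical callback for the exact subgraph check (with RDKit that would be
something along the lines of mol.HasSubstructMatch(query)).

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical record layout: (structure id, fingerprint bits packed into an int).
Record = Tuple[str, int]


def screen(records: Iterable[Record], query_fp: int) -> List[str]:
    """Stage 1: keep ids whose fingerprint contains every bit set in the query.

    This can produce false positives but never false negatives, which is why
    a second, exact check is still needed.
    """
    return [rid for rid, fp in records if fp & query_fp == query_fp]


def verify_and_page(
    candidates: Iterable[str],
    is_match: Callable[[str], bool],
    page: int,
    page_size: int,
) -> Tuple[int, List[str]]:
    """Stage 2 (client side): run the exact check, then redo hit count and paging.

    The hit count and pages reported by the screening stage are no longer
    valid once false positives are removed, so both are recomputed here.
    """
    hits = [rid for rid in candidates if is_match(rid)]
    start = page * page_size
    return len(hits), hits[start:start + page_size]
```

Note that the client has to verify all candidates before it can report a
correct total hit count, which is a large part of why this approach feels
hacky.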

Even ignoring that part, it wasn't very fast either. That could be due to
many things, such as having only one machine for ES (my own machine, no
cluster) and my not being an ES expert (suboptimal config?). Or maybe the
dataset was too small to actually benefit. With the same data, the same query
is much faster in PostgreSQL + RDKit + full-text index, and easier to use.
(Yes, PostgreSQL supports full-text search similar to Elasticsearch. If one
doesn't need very advanced features or have a lot of data, it's certainly
worth a look.)
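
As a rough illustration of what such a combined query can look like, here is
a small Python helper that only builds the SQL string and its parameters.
The table and column names (compounds, mol, description), the LIMIT value and
the helper itself are hypothetical, not my actual schema. The sketch assumes
the RDKit cartridge is installed, so that the @> substructure operator and
the qmol type are available, and uses PostgreSQL's built-in to_tsvector /
plainto_tsquery for the text part.

```python
from typing import Tuple


def build_search(smarts: str, text: str, limit: int = 50) -> Tuple[str, Tuple[str, str, int]]:
    """Build a parameterized query combining substructure and full-text search.

    Assumed (hypothetical) schema: compounds(id, mol, description), where
    `mol` is an RDKit cartridge mol column. The %s placeholders follow the
    DB-API "format" paramstyle used by common PostgreSQL drivers.
    """
    sql = (
        "SELECT id, description "
        "FROM compounds "
        "WHERE mol @> %s::qmol "  # substructure match against a SMARTS query
        "AND to_tsvector('english', description) "
        "@@ plainto_tsquery('english', %s) "  # full-text match
        "LIMIT %s"
    )
    return sql, (smarts, text, limit)
```

Because both conditions live in one WHERE clause, the database returns
correct hit counts and paging directly, with no post-processing on the
client.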

Any "real solution" must also do the subgraph matching inside Elasticsearch
itself, which means writing a plugin/extension for Elasticsearch. This was
simply too involved for me to even try. (If that is of interest, you should
probably also look at the very recent licensing changes to Elasticsearch.)

The presentation Joshua mentioned is actually only about similarity search,
which is naturally easier to implement and to make fast.

Having said that, there is a commercial solution available from PerkinElmer
in their Signals Data Factory offering. Of course, this has nothing to do
with RDKit, but it does hint that it's possible to do this if you have the
time, budget and skills/knowledge.

Another commercial "fast substructure search" option would be NextMove's
Arthor, but that has nothing to do with Elasticsearch. The question is
whether you want Elasticsearch for its speed or for the combination with text
search. I would probably avoid it if the text search part is not important.

Just using RDKit's default functionality is actually pretty fast (see Greg's
blog), although it does run in memory. Nowadays a machine with lots of RAM
doesn't cost all that much, so I could see that scaling to 10-20 million
structures easily.

I hope that helps you a bit in coming to a conclusion on what to do.

Best Regards,

Joos


---------- Forwarded message ----------
> From: Naomi Jacobs <na...@benchling.com>
> To: rdkit-discuss@lists.sourceforge.net
> Cc: Alan Pierce <a...@benchling.com>, Larry Taylor <la...@benchling.com>
> Bcc:
> Date: Wed, 20 Jan 2021 22:27:32 -0800
> Subject: [Rdkit-discuss] RDKit ElasticSearch Plugin
> Hi all,
>
> We're looking for information about whether anyone has built an
> ElasticSearch plugin using RDKit to support chemical search. I didn't see
> anything open-source online, but was thinking some folks may have heard
> about internal efforts and would be willing to share any code and/or chat
> about it. Thanks!
>
> Cheers,
> Naomi
>
> --
> *Naomi Jacobs*
> Software Engineer | benchling.com
> (415) 590-2798
>
>
>
> ---------- Forwarded message ----------
> From: Greg Landrum <greg.land...@gmail.com>
> To: Naomi Jacobs <na...@benchling.com>
> Cc: RDKit Discuss <rdkit-discuss@lists.sourceforge.net>, Larry Taylor <
> la...@benchling.com>
> Bcc:
> Date: Thu, 21 Jan 2021 08:54:08 +0100
> Subject: Re: [Rdkit-discuss] RDKit ElasticSearch Plugin
> Hi Naomi,
>
> I'm not personally aware of any ElasticSearch work, but there is a
> prototype for a lucene plugin which could, I believe, be used as the basis
> for an ElasticSearch plugin:
> https://github.com/rdkit/org.rdkit.lucene
>
> It's (obviously) been a while since anyone did anything with that code and
> it may no longer work, but the more recent (and still functional)
> RDKit-neo4j integration (https://github.com/rdkit/neo4j-rdkit) can
> provide some patterns for how the RDKit java integration can be used in
> this type of context.
>
> I hope this helps, and would be interested to hear if you end up doing
> anything with the RDKit and ElasticSearch.
> -greg
>
>
>
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
