Hi Naomi,

I once experimented with this idea, using the Lucene-based RDKit example as guidance. However, what that code does inside Lucene, and hence what my adaptation does inside Elasticsearch, is only the fingerprint screening part. The actual subgraph match does not run inside Elasticsearch: the screened candidates have to be sent to the caller/client for that, which means one must manipulate the Elasticsearch results (hit count, paging, ...) before finally returning them to the end-user application. Simply put, it is not a very usable solution, and a very hacky one.
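To illustrate why the client-side step is awkward, here is a minimal sketch in plain Python (not my actual code). It assumes the screening step has already returned a list of candidate SMILES strings; `verified_page`, `Page` and the toy `is_match` predicate are hypothetical names, and in practice `is_match` would be a real subgraph match (for example via RDKit).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Page:
    total: int        # number of verified hits, not the screening hit count
    hits: List[str]   # verified hits that fall on the requested page


def verified_page(
    candidates: Iterable[str],
    is_match: Callable[[str], bool],
    page: int,
    page_size: int,
) -> Page:
    """Re-check screened candidates and rebuild hit count and paging.

    `candidates` stands in for everything the fingerprint screen let through;
    `is_match` stands in for the exact subgraph match done on the client.
    """
    # The screen only rules molecules out, so every candidate must be re-checked.
    verified = [c for c in candidates if is_match(c)]
    start = page * page_size
    return Page(total=len(verified), hits=verified[start:start + page_size])


# Toy data: five screened candidates, of which only three are true matches.
candidates = ["CCO", "CCN", "c1ccccc1N", "CCC", "NCC(=O)O"]
toy_match = lambda smiles: "N" in smiles  # placeholder, NOT a real subgraph match

first = verified_page(candidates, toy_match, page=0, page_size=2)
second = verified_page(candidates, toy_match, page=1, page_size=2)
```

Note that the whole candidate set has to be pulled and verified before the total or any page is known, so the hit count and paging reported by the search engine itself cannot be passed through to the application.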
Even ignoring that part, it wasn't very fast either. That could be due to many things, like having only one machine for ES (my own machine, no cluster) and my not being an ES expert (suboptimal config?). Or maybe the dataset was too small to actually benefit. The same data and the same query are much faster in PostgreSQL + RDKit + a full-text index, and easier to use. (Yes, PostgreSQL supports full-text search similar to Elasticsearch. If one doesn't need very advanced features and doesn't have a lot of data, it is certainly worth a look.)

Any "real solution" must also do the subgraph matching inside Elasticsearch itself, which means writing a plugin/extension for Elasticsearch. This was simply too involved for me to even try. (If that is of interest, you should probably also look at the very recent licensing changes to Elasticsearch.) The presentation Joshua mentioned is actually only about similarity search, which is naturally easier to implement and fast.

Having said that, there is a commercial solution available from PerkinElmer in their Signals Data Factory offering. Of course this has nothing to do with RDKit, but it does hint that it's possible to do this if you have the time, budget and skills/knowledge. Another commercial "fast substructure search" option would be NextMove's Arthor, but that has nothing to do with Elasticsearch.

The question is whether you want Elasticsearch for its speed or for the combination with text search. I would probably avoid it if the text search part is not important. Just using RDKit's default functionality is actually pretty fast (see Greg's blog), although it does run in memory. Nowadays a machine with lots of RAM doesn't cost all that much, so I could see that scaling to 10-20 million structures easily.

I hope that helps you a bit to come to a conclusion on what to do.
Best Regards,
Joos

---------- Forwarded message ----------
> From: Naomi Jacobs <na...@benchling.com>
> To: rdkit-discuss@lists.sourceforge.net
> Cc: Alan Pierce <a...@benchling.com>, Larry Taylor <la...@benchling.com>
> Bcc:
> Date: Wed, 20 Jan 2021 22:27:32 -0800
> Subject: [Rdkit-discuss] RDKit ElasticSearch Plugin
> Hi all,
>
> We're looking for information about whether anyone has built an
> ElasticSearch plugin using RDKit to support chemical search. I didn't see
> anything open-source online, but was thinking some folks may have heard
> about internal efforts and would be willing to share any code and/or chat
> about it. Thanks!
>
> Cheers,
> Naomi
>
> --
> *Naomi Jacobs*
> Software Engineer | benchling.com
> (415) 590-2798
>
>
>
> ---------- Forwarded message ----------
> From: Greg Landrum <greg.land...@gmail.com>
> To: Naomi Jacobs <na...@benchling.com>
> Cc: RDKit Discuss <rdkit-discuss@lists.sourceforge.net>, Larry Taylor <
> la...@benchling.com>
> Bcc:
> Date: Thu, 21 Jan 2021 08:54:08 +0100
> Subject: Re: [Rdkit-discuss] RDKit ElasticSearch Plugin
> Hi Naomi,
>
> I'm not personally aware of any ElasticSearch work, but there is a
> prototype for a lucene plugin which could, I believe, be used as the basis
> for an ElasticSearch plugin:
> https://github.com/rdkit/org.rdkit.lucene
>
> It's (obviously) been a while since anyone did anything with that code and
> it may no longer work, but the more recent (and still functional)
> RDKit-neo4j integration (https://github.com/rdkit/neo4j-rdkit) can
> provide some patterns for how the RDKit java integration can be used in
> this type of context.
>
> I hope this helps, and would be interested to hear if you end up doing
> anything with the RDKit and ElasticSearch.
> -greg
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss