Hi Jiri, David,

I only briefly skimmed through the attachments to get an idea of the content, but it looks great at first sight! I'll read it in depth during the upcoming holidays.

But I think I can answer some of your questions already:

1) Yes, it's certainly doable, if you have enough time for it :) But you got our attention; we're certainly interested in helping.

2) It's probably good value for Infinispan to work on an abstraction from the specific indexing engine, although a poorly implemented abstraction would cost us in terms of performance, so we should get that right. Configuration complexity for users is also a frequent concern, so let's try to keep that in mind too. Once we have a proper separation from the current indexing/query engine we can certainly add this as an alternative implementation; this can live as an experimental module for a while and be integrated depending on how far we get and how people like the additional features.

3) In terms of design, I should probably read those papers in depth first, but these are my early doubts:

# to Lucene / not to Lucene

I see in the presentation that Lucene is referred to as a good solution for full-text search. That's true, but it is actually just an encoder/decoder/query engine for a vector space model, and people have built more than just text-based Similarity on top of it. Would it be possible to run this implementation on top of Lucene indexes, or is it required to use a completely different index management solution?

# to Hibernate Search / not to Hibernate Search

Most of the current indexing/query code in Infinispan is based on Hibernate Search, which handles the complexity of Lucene's resource management and Query execution, and makes it easier for developers to map their Domain model. We're working on Hibernate Search to improve its flexibility on dynamic models (more suited to Infinispan users), and also to not necessarily work on Lucene in embedded mode but to delegate to "Lucene like" services.
That means it will probably always assume that some form of Similarity-capable engine based on a vector space model is available to delegate the hard work to, but not necessarily the Lucene project; we're looking at alternatives like Apache Solr and ElasticSearch for now, so essentially still Lucene based but typically running on separate, dedicated cluster node(s).

You could think of integrating the index handling code into Hibernate Search, whose functionality is automatically inherited by Infinispan, or bypass Hibernate Search and integrate with Infinispan directly.

Depending on the "Lucene" question, be aware that Hibernate Search is already able to provide functionality like Spatial queries and indexing of PDF/Office files. Although the latter is text based, the Spatial integration works on numeric distance; the benefit is that we can combine distance criteria with text criteria. I don't think it would be hard to extend this model to support other implementations of Similarity like the mentioned images and songs; in fact that would probably be a relatively easy task if you already know which Similarity implementation you want to use.

The benefit of integrating with Hibernate Search is that you would address the needs of a much larger user base: the same functionality is usable by Hibernate users (Java developers using relational databases: we provide indexing and Similarity-based queries on the data stored in your database).

I'm just listing some options and don't intend to recommend any without further details. While I'm leading the Hibernate Search project, I see good value in a proper abstraction from Infinispan to a pluggable (alternative) query strategy, although considering how many details it takes to get right, I doubt we'll ever be able to make an effective competitor for the current one. So to answer the two points we'd need a better understanding of what exactly you would need to store in the "index" and how you think this can be kept in sync with the data.
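
To make those two open questions a bit more concrete, here is a rough Java sketch of what a pluggable index contract might look like. Every name in it (SimilarityIndexSketch, SimilarityIndex, InMemoryIndex) is made up for illustration and is not part of Infinispan or Hibernate Search, and the brute-force scan only stands in for whatever structure the real similarity index would use. The put/remove methods are the places where the index would have to be kept in sync with the cached data.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Illustration only: hypothetical names, not an Infinispan or Hibernate Search API.
public class SimilarityIndexSketch {

    // What a pluggable index might need to expose to the cache.
    interface SimilarityIndex<K> {
        void put(K key, float[] descriptor);   // entry created or updated in the cache
        void remove(K key);                    // entry removed from the cache
        List<K> nearest(float[] query, int k); // keys of the k closest entries, closest first
    }

    // Brute-force stand-in: stores one descriptor vector per key and scans them all.
    static final class InMemoryIndex<K> implements SimilarityIndex<K> {
        private final Map<K, float[]> descriptors = new ConcurrentHashMap<>();

        @Override
        public void put(K key, float[] descriptor) {
            descriptors.put(key, descriptor.clone());
        }

        @Override
        public void remove(K key) {
            descriptors.remove(key);
        }

        @Override
        public List<K> nearest(float[] query, int k) {
            return descriptors.entrySet().stream()
                    .sorted(Comparator.comparingDouble(
                            (Map.Entry<K, float[]> e) -> squaredDistance(query, e.getValue())))
                    .limit(k)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }

        // Squared Euclidean distance; an assumed metric, the real one depends on the data.
        private static double squaredDistance(float[] a, float[] b) {
            if (a.length != b.length) {
                throw new IllegalArgumentException("descriptor length mismatch");
            }
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        SimilarityIndex<String> index = new InMemoryIndex<>();
        index.put("img-1", new float[] {0.0f, 0.0f});
        index.put("img-2", new float[] {1.0f, 1.0f});
        index.put("img-3", new float[] {5.0f, 5.0f});
        index.remove("img-3"); // mirrors a cache removal
        System.out.println(index.nearest(new float[] {0.9f, 0.8f}, 2));
    }
}
```

If the contract really is this small, the interesting part is not the interface but what goes into the descriptor and who calls put/remove, which is exactly what I'd like to understand better.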

Generally speaking, I think all newcomers will be tempted to avoid both Lucene and Hibernate Search so they don't need to learn too much, but let's keep in mind that, not having unlimited manpower, we need to be smart: these two engines do a lot of heavy work and are constantly evolving in terms of performance. So unless the requirements don't fit at all, I'd rather help to see what could be reused from them.

I haven't done much advanced research using Lucene myself, but I've heard that several researchers use it as a "toolbox" to experiment with new kinds of vector space based analytics, so I expect it would be useful to keep around even in an alternative implementation.

Thanks,
Sanne

On 15 December 2014 at 11:35, Jiri Holusa <[email protected]> wrote:
> Hi,
>
> there is an interesting research around similarity search at my university
> driven by David Novák (CC-ed). If anyone interested, see [1][2][3].
>
> Shortly: they basically achieved similarity search on any data (images,
> songs, etc...) by creating some sort of custom index, that stores a
> "similarity vector" for each object in the database. This index can solve
> queries like "give me the most similar images to this example". So why am I
> posting this here?
>
> The architecture is designed on top of Infinispan and they want to use it to
> speed it up. Basically, they would like to distribute the entries across the
> cluster, each node would have the similarity index of its entries. Then, when
> a query comes, it would be distributed to all the nodes, custom search would
> be performed on the node's indexes and the result returned. This is
> approximately what Index.LOCAL and ClusteredQuery could do.
>
> The difference is that the indexing and searching mechanism must be custom.
> So I wanted to ask what do you think about implementing such a feature to
> Infinispan. I was thinking about somehow extracting general API for
> indexing/searching, then e.g. our Lucene search would become its
> implementation.
>
> I would be happy to take this as a contribution, since I find this extremely
> interesting topic and also create a diploma thesis out of this.
> So here are some questions:
> 1) Is it doable?
> 2) Do we want this feature?
> 3) How to design it/where to start?
>
> Any input is more then welcome :)
>
> Cheers,
> Jiri
>
> [1] https://drive.google.com/file/d/0B4sztQSfpi3rRlJBQjJHMkR2LXc/view
> [2] https://drive.google.com/file/d/0B4sztQSfpi3rU2p2MV9jRE9iTUk/view
> [3] https://drive.google.com/file/d/0B4sztQSfpi3rZUpld24ydzJNclk/view
>
> _______________________________________________
> infinispan-dev mailing list
> [email protected]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
