The product I work on needs to perform arbitrary report-style queries against relational data, including predicates that are difficult to express in the database's own full-text search engine. We developed an in-house solution that combines SQL and Lucene queries by rewriting each query as intersections between pure-relational and pure-Lucene queries. We then use SqlBulkCopy to dump the relevant identifier field from the Lucene result sets into temporary tables, apply an index to each table, and join the database queries onto those ID sets.
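To make that concrete, a minimal sketch of the temp-table half might look like the following. The Lucene search is assumed to have already produced the ID list, and the relational names here (Documents, Id, Title, CreatedOn) are invented for illustration:

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static class LuceneIdJoin
{
    // Dump a set of IDs (e.g. the identifier field pulled from Lucene hits)
    // into an indexed temp table so relational predicates can join onto it.
    public static void QueryWithLuceneIds(SqlConnection conn, IEnumerable<long> luceneIds)
    {
        // 1. Stage the IDs in memory in the shape SqlBulkCopy expects.
        var ids = new DataTable();
        ids.Columns.Add("Id", typeof(long));
        foreach (var id in luceneIds)
            ids.Rows.Add(id);

        // 2. Create the temp table, bulk copy the IDs in, then index them.
        using (var cmd = new SqlCommand("CREATE TABLE #LuceneIds (Id BIGINT NOT NULL);", conn))
            cmd.ExecuteNonQuery();

        using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#LuceneIds" })
            bulk.WriteToServer(ids);

        using (var cmd = new SqlCommand("CREATE CLUSTERED INDEX IX_LuceneIds ON #LuceneIds (Id);", conn))
            cmd.ExecuteNonQuery();

        // 3. The relational half of the query simply joins onto the ID set.
        const string sql = @"
            SELECT d.Id, d.Title
            FROM   Documents d
            JOIN   #LuceneIds l ON l.Id = d.Id
            WHERE  d.CreatedOn >= @since;";

        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@since", new DateTime(2011, 1, 1));
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine("{0}: {1}", reader.GetInt64(0), reader.GetString(1));
            }
        }
    }
}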
As long as the intermediate sets being joined are not overly large (less than a few thousand), it works out to be quite effective, responding in a few tens of milliseconds if I recall correctly, and parallelises nicely. Some 'columns' must be searchable (or filterable) in the Lucene index as well as the database to keep the intermediate set size down, however.

On Thu, 2011-12-22 at 13:56 -0500, Wyatt Barnett wrote:
> For pulling it into a database, check out ravendb -- you can get the best of both worlds as it is a pretty slick document database while also making heavy use of lucene to power the entire indexing engine.
>
> On Thursday, December 22, 2011, Trevor Watson <twat...@datassimilate.com> wrote:
> > My manager always harps more and more on the speed aspect of the program I'm developing. It's hard to make him understand that adding anything is going to add time.
> >
> > In terms of using an index to keep track of tagging of documents, if we were to keep the data in a single index, as far as I know, to do any sort of mass insert/delete of a "Tag" field would require me to loop through each document and remove/add the field and the data for that field, plus I would have to re-create the data for the contents field which isn't stored in the index (makes the searches wayyy too slow). Doing it this way would make the searches fast, but the tagging of documents extremely slow.
> >
> > By creating a 2nd index, we wouldn't deal with the contents field or any of the other stored information, plus we could call DeleteAll on the IndexWriter to clear out Tagged documents if need be (instead of looping through 150,000 records and removing the "Tagged" field (plus re-creating the contents field) on each of them). Even if we had to iterate through the documents, it would be faster as there is less stored information.
> >
> > I kinda wish we could just put this all into a database, but that would get rid of the awesome high speed searching that Lucene gives.
> >
> > The best that I've been able to come up with in terms of my understanding of Lucene is the 2 index system, then do a search on one, a search on the other (retrieving the UniqueID from both) and then using the List.Intersect() function to join the two.
> >
> > There's gotta be a better way to do this
> >
> > On 12/22/2011 1:26 PM, Troy Howard wrote:
> >> Trevor,
> >>
> >> It's a little unclear from your initial description why you want to segregate tagged data in a second index. What is the purpose of the tagging? Generally tagging is not a binary operation, but rather a form of user generated data inversion.
> >>
> >> I encourage you to adopt a design strategy that keeps the data in one place (vs spread into multiple databases or indexes), and find a way to implement your business logic against that data model.
> >>
> >> And by keep it in one place, I'm referring to not splitting aspects of a given unit of data into multiple locations (vs sharding or some other partitioning of data, which is of course, always a good idea if you're running into scale issues).
> >>
> >> Thanks,
> >> Troy
> >>
> >> On Thu, Dec 22, 2011 at 10:02 AM, Trevor Watson <twat...@datassimilate.com> wrote:
> >>> I think sometimes I need to just type an email to myself to get my thoughts out of my head.
> >>>
> >>> Would it be possible to implement a custom filter that does a search on each result from the "contents" index on the "tagged" index? The only problem I have with that would be the overhead cost of opening/closing the tagged index and then doing the search for each item.
> >>>
> >>> On 12/22/2011 9:27 AM, Trevor Watson wrote:
> >>>> Sorry to keep bothering you all.
> >>>>
> >>>> I'd like to create a 2nd index for my project for "tagging" items in the document set. This way, it would be significantly easier to manage the tagged documents (add/remove/mass delete) than to deal with them in the existing index (where I would have to re-build fields and make sure that everything is moved over and can't do a mass delete (indexWriter.DeleteAll())).
> >>>>
> >>>> Is it possible to do this using a MultiSearcher? There would be a common Unique ID field in each index and in the additional database, or would I have to do 2 searches (one from the main index, one from either the database or the 2nd index) and then loop through the results and pull out the tagged documents from there?
> >>>>
> >>>> Thanks in advance.
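For reference, the two-index approach described in the quoted thread (search each index, pull the UniqueID field from the hits, intersect the two sets) might look roughly like this against the Lucene.Net 3.x API. The field names ("contents", "Tagged", "UniqueID") come from the thread; the index paths, query terms, and hit limits are placeholders:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

static class TwoIndexIntersect
{
    // Run a query against one index and pull back the UniqueID of every hit.
    static HashSet<string> CollectIds(string indexPath, Query query, int maxHits)
    {
        using (var searcher = new IndexSearcher(FSDirectory.Open(new DirectoryInfo(indexPath)), true)) // read-only
        {
            var ids = new HashSet<string>();
            foreach (var hit in searcher.Search(query, maxHits).ScoreDocs)
                ids.Add(searcher.Doc(hit.Doc).Get("UniqueID"));
            return ids;
        }
    }

    static void Main()
    {
        // Search the main "contents" index and the separate "tagged" index,
        // then intersect the two UniqueID sets in memory.
        Query contentQuery = new TermQuery(new Term("contents", "contract"));
        Query taggedQuery  = new TermQuery(new Term("Tagged", "true"));

        var contentIds = CollectIds(@"C:\indexes\main",   contentQuery, 200000);
        var taggedIds  = CollectIds(@"C:\indexes\tagged", taggedQuery,  200000);

        foreach (var id in contentIds.Intersect(taggedIds))
            Console.WriteLine(id);
    }
}

Whether this beats a single index depends on how large the two hit sets are; as with the SQL temp-table approach above, keeping the intermediate ID sets small is what keeps the intersection cheap.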