Re: [Lucene.Net] Common ID Field across Indices - MultiSearcher and de-duplication?

Trevor Watson Thu, 22 Dec 2011 10:52:05 -0800

My manager always harps more and more on the speed aspect of the program I'm developing. It's hard to make him understand that adding anything is going to add time.

In terms of using an index to keep track of tagging of documents, if we were to keep the data in a single index, as far as I know, any sort of mass insert/delete of a "Tag" field would require me to loop through each document and remove/add the field and the data for that field. On top of that, I would have to re-create the data for the contents field, which isn't stored in the index (storing it makes the searches way too slow). Doing it this way would make the searches fast, but the tagging of documents extremely slow.

By creating a 2nd index, we wouldn't deal with the contents field or any of the other stored information, plus we could call DeleteAll on the IndexWriter to clear out Tagged documents if need be (instead of looping through 150,000 records and removing the "Tagged" field (plus re-creating the contents field) on each of them). Even if we had to iterate through the documents, it would be faster as there is less stored information.

I kinda wish we could just put this all into a database, but that would get rid of the awesome high speed searching that Lucene gives.

The best that I've been able to come up with, given my understanding of Lucene, is the 2 index system: do a search on one, a search on the other (retrieving the UniqueID from both), and then use the List.Intersect() function to join the two.
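For reference, the join step described above could look roughly like the sketch below. It is written in Java with plain collections only, so no Lucene calls are shown; the two lists stand in for the UniqueID values pulled from each index's hits, and the class and method names (IdIntersect, intersect) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: joins two result lists on UniqueID.
class IdIntersect {
    // Returns the IDs from contentHits that also appear in taggedHits,
    // keeping the order of contentHits (for example, relevance order).
    static List<String> intersect(List<String> contentHits, List<String> taggedHits) {
        // A hash set makes each membership check constant time on average.
        Set<String> tagged = new HashSet<>(taggedHits);
        List<String> result = new ArrayList<>();
        for (String id : contentHits) {
            if (tagged.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }
}
```

Building the set from the smaller, tag-only result and streaming the content hits past it keeps the join cost roughly linear in the number of hits.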


There's gotta be a better way to do this.



On 12/22/2011 1:26 PM, Troy Howard wrote:

Trevor,

It's a little unclear from your initial description why you want to
segregate tagged data in a second index. What is the purpose of the
tagging? Generally tagging is not a binary operation, but rather a
form of user generated data inversion.

I encourage you to adopt a design strategy that keeps the data in one
place (vs spread into multiple databases or indexes), and find a way
to implement your business logic against that data model.

And by keep it in one place, I'm referring to not splitting aspects of
a given unit of data into multiple locations (vs sharding or some
other partitioning of data, which is of course, always a good idea if
you're running into scale issues).

Thanks,
Troy


On Thu, Dec 22, 2011 at 10:02 AM, Trevor Watson
<twat...@datassimilate.com>  wrote:

I think sometimes I need to just type an email to myself to get my thoughts
out of my head.

Would it be possible to implement a custom filter that does a search on each
result from the "contents" index on the "tagged" index?  The only problem I
have with that would be the overhead cost of opening/closing the tagged
index and then doing the search for each item.
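One way to sidestep that per-item overhead might be to read the tagged UniqueIDs once, keep them in memory, and check each "contents" hit against that set. The Java sketch below assumes the tagged set is small enough to hold in memory; it uses plain collections only, and TaggedIdCache and its methods are hypothetical names, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory cache of tagged UniqueIDs (not a Lucene class).
class TaggedIdCache {
    private final Set<String> taggedIds = new HashSet<>();

    // Replace the cache contents, e.g. after reading every ID from the tagged index once.
    void load(Collection<String> ids) {
        taggedIds.clear();
        taggedIds.addAll(ids);
    }

    // Keep the cache in step with changes made to the tagged index.
    void tag(String id) { taggedIds.add(id); }
    void untag(String id) { taggedIds.remove(id); }
    void clear() { taggedIds.clear(); } // mirrors a mass delete of all tags

    // Keep only the content hits whose UniqueID is tagged, preserving hit order.
    List<String> filter(List<String> contentHits) {
        List<String> kept = new ArrayList<>();
        for (String id : contentHits) {
            if (taggedIds.contains(id)) {
                kept.add(id);
            }
        }
        return kept;
    }
}
```

With something like this, the tagged index is only read when the cache is loaded, and every later search pays for a set lookup per hit rather than a second search.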







On 12/22/2011 9:27 AM, Trevor Watson wrote:

Sorry to keep bothering you all.

I'd like to create a 2nd index for my project for "tagging" items in the
document set.  This way, it would be significantly easier to manage the
tagged documents (add/remove/mass delete) than to deal with them in the
existing index (where I would have to re-build fields and make sure that
everything is moved over and can't do a mass delete
(indexWriter.DeleteAll())).

Is it possible to do this using a MultiSearcher?  There would be a common
Unique ID field in each index and in the additional database.  Or would I
have to do 2 searches, (one from the main index, one from either the
database or the 2nd index) and then loop through the results and pull out
the tagged documents from there?

Thanks in advance.
