I'm working on a system that uses Lucene 4.6.0, and I have a couple of use
cases for documents that modify themselves as they're being indexed.

For example, we have text classifiers that we would like to run on the
contents of certain fields.  These classifiers produce field values (i.e.,
the classes that the document is in) that I would like to be part of the
document.

Now, the text classifiers want to tokenize the text in order to do the
classification, and I'd like to avoid re-tokenizing the text multiple
times, so I can build a token filter that collects the tokens and then runs
the classifier.  This filter can know about the
org.apache.lucene.document.Document that's being processed, but I
suspected that adding elements to Document.fields while it's being
indexed would lead to a ConcurrentModificationException.
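In Lucene 4.x that filter would extend
org.apache.lucene.analysis.TokenFilter, copying the CharTermAttribute in
incrementToken() and firing the classifier once the stream is exhausted.
Here's a library-free sketch of just the collect-then-classify pattern
(TokenCollector and the Consumer-based classifier hook are hypothetical
names, not Lucene API):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Wraps an upstream token stream, remembering every token it passes along.
// When the stream runs dry, the collected tokens go to the classifier once.
public final class TokenCollector implements Iterator<String> {
    private final Iterator<String> input;             // upstream tokens
    private final Consumer<List<String>> classifier;  // hypothetical classifier hook
    private final List<String> seen = new ArrayList<>();
    private boolean classified = false;

    public TokenCollector(Iterator<String> input, Consumer<List<String>> classifier) {
        this.input = input;
        this.classifier = classifier;
    }

    @Override public boolean hasNext() {
        if (input.hasNext()) return true;
        if (!classified) {          // end of stream: classify exactly once
            classifier.accept(seen);
            classified = true;
        }
        return false;
    }

    @Override public String next() {
        String tok = input.next();
        seen.add(tok);              // record the token as it flows through
        return tok;
    }
}
```

The point is that the text is tokenized exactly once: the classifier sees
the same tokens the indexing chain consumes.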

Since IndexWriter.addDocument takes an Iterable<IndexableField>, I figured
I could just make my own document class that implemented
Iterable<IndexableField> but let me append new fields to the end of the
document, extending the iteration to cover them.
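A sketch of that appendable document (generic and library-free; the real
class would implement Iterable<IndexableField> over Lucene field objects):
because the iterator walks by position rather than snapshotting the list,
fields appended mid-iteration are still visited.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// A document whose iterator tolerates fields being appended while it runs:
// it reads by index, so anything added before the end is still visited.
public final class AppendableDoc<F> implements Iterable<F> {
    private final List<F> fields = new ArrayList<>();

    public void add(F field) {
        fields.add(field);
    }

    @Override public Iterator<F> iterator() {
        return new Iterator<F>() {
            private int next = 0;   // position-based, not a snapshot

            @Override public boolean hasNext() { return next < fields.size(); }
            @Override public F next() { return fields.get(next++); }
        };
    }
}
```

This avoids the ConcurrentModificationException that a plain
ArrayList iterator would throw on a mid-iteration add.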

I did this, but it didn't have the effect that I was hoping for, because
the fields that were added were never processed.

Working through the code, I discovered that
DocFieldProcessor.processDocument iterates through all the fields in the
document, collecting them by field name (in its own hash table) before
processing them.

Of course, this breaks my add-fields-as-other-fields-are-being-processed
approach because the iterator is exhausted before any of the processing
happens.
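The failure reduces to a couple of lines: if the consumer drains the
iterable into its own collection before doing any work (as
DocFieldProcessor does, keyed by field name), anything appended afterwards
is invisible.  A minimal sketch of that drain-first behavior:

```java
import java.util.ArrayList;
import java.util.List;

// Mimics the drain-first behavior: the iterator is fully exhausted into a
// local copy, and all later processing runs over that copy.
public final class DrainFirst {
    public static List<String> collect(Iterable<String> fields) {
        List<String> copy = new ArrayList<>();
        for (String f : fields) copy.add(f);  // iterator exhausted here
        return copy;                          // "processing" sees only this copy
    }
}
```

Any field the filter adds after collect() returns never reaches
processing, no matter how clever the document's iterator is.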

So, my questions are: Does it make any sense to try to do this?  If so, is
there an approach that will work without having to rewrite a lot of
indexing code?

Thanks,

Steve Green
-- 
Stephen Green
