I'm working on a system that uses Lucene 4.6.0, and I have a couple of use cases for documents that modify themselves as they're being indexed.
For example, we have text classifiers that we would like to run on the contents of certain fields. These classifiers produce field values (i.e., the classes that the document belongs to) that I would like to be part of the document. Now, the classifiers need to tokenize the text in order to do the classification, and I'd like to avoid re-tokenizing the text multiple times, so I can build a token filter that collects the tokens and then runs the classifier. This filter can know about the org.apache.lucene.document.Document that's being processed, but I suspected that adding elements to Document.fields while it's being indexed would lead to a ConcurrentModificationException.

Since IndexWriter.addDocument takes an Iterable<IndexableField>, I figured I could just make my own document class that implemented Iterable but allowed me to add new fields onto the end of the document and extend the iteration to cover those fields. I did this, but it didn't have the effect I was hoping for: the fields that were added were never processed. Working through the code, I discovered that DocFieldProcessor.processDocument iterates through all the fields in the document, collecting them by field name (in its own hash table) before processing any of them. Of course, this breaks my add-fields-as-other-fields-are-being-processed approach, because the iterator is exhausted before any of the processing happens.

So, my questions are: Does it make any sense to try to do this? If so, is there an approach that will work without having to rewrite a lot of indexing code?

Thanks,

Steve Green

--
Stephen Green
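P.S. In case it helps, here's a minimal sketch of the appendable-document idea I described. To keep it self-contained I've left Lucene out and made it generic; in my actual code the element type is IndexableField, and the class name AppendableDocument is just mine, not anything in the Lucene API. The iterator reads by index so that it sees elements appended after iteration has started.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Sketch: an Iterable whose live iterator also sees elements that are
// appended while iteration is in progress. In the real use case T would
// be org.apache.lucene.index.IndexableField and this object would be
// handed to IndexWriter.addDocument(Iterable<IndexableField>).
class AppendableDocument<T> implements Iterable<T> {
    private final List<T> fields = new ArrayList<T>();

    // Callable at any time, including from a token filter mid-iteration.
    public void add(T field) {
        fields.add(field);
    }

    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private int next = 0;

            // Re-checks the current size, so fields appended after
            // iteration began are still returned.
            @Override
            public boolean hasNext() {
                return next < fields.size();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return fields.get(next++);
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}
```

The iterator itself behaves the way I wanted; the problem is exactly what I described above: DocFieldProcessor.processDocument drains the iterator into its own per-field-name structure before any field is analyzed, so nothing appended during analysis is ever seen.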