Hi,

On Thu, Jul 16, 2009 at 12:28, Jukka Zitting<jukka.zitt...@gmail.com> wrote:
> If I understand correctly, we update the search index within the
> transaction but if a text extraction task takes longer than the
> configurable limit, that part of the index update is replaced with an
> empty string ...

correct.

> ... and a new background task is fired to update the index
> for that document once the text extraction is complete.

that requires clarification on the term 'background task' ;)

in general text is extracted using an executor, which might consist
of a pool of threads. by default the executor is equipped with a
number of threads that is twice the number of available processors.

either:

- the extractor task (TextExtractorJob) completes within the
  extractorTimeout and the text gets indexed

or

- the task gets pushed into the indexing queue where it will complete
  at some point in the future. a periodic check task will then update
  the index when the extractor has finished its work.

> Would it be a problem to *always* defer text extraction to a
> background task that's disconnected from the transaction? That would
> make things a lot simpler at a slight loss of functionality.

I don't think that would be a big problem. but the issue remains:
how to detect when the extractor has finished its work.

> Alternatively, we should probably move the extraction timeout handling
> to some getExtractedText(long timeout) method that does a
> wait(timeout) call on the extraction task, waiting for it to return
> the extracted text as a String. If the timeout is reached, then just
> an empty string is used and the rest of the extraction task is placed
> in the indexing queue.

does that mean you want to change the return value of a TextExtractor
from Reader to String? or would that be on top of the existing
interface?

regards
 marcel
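
PS: to make the two paths above concrete, here is a rough sketch of
the timeout handling. it is not the actual Jackrabbit code: the class
ExtractionSketch and the methods getExtractedText, drainFinished and
shutdown are hypothetical names, and it uses Future.get(timeout) from
java.util.concurrent in place of a hand-written wait(timeout). it also
assumes the extraction already yields a String, which is exactly the
open interface question.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch only; names do not correspond to Jackrabbit classes.
class ExtractionSketch {

    // Pool with twice as many threads as available processors.
    private final ExecutorService executor = Executors.newFixedThreadPool(
            2 * Runtime.getRuntime().availableProcessors());

    // Extraction jobs that did not finish within the timeout.
    private final Queue<Future<String>> pending =
            new ConcurrentLinkedQueue<Future<String>>();

    // Returns the extracted text if the job completes in time. Otherwise
    // returns an empty string and keeps the still-running job for later.
    String getExtractedText(Callable<String> extraction, long timeoutMillis)
            throws InterruptedException {
        Future<String> job = executor.submit(extraction);
        try {
            return job.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            pending.add(job); // keeps running in the pool
            return "";
        } catch (ExecutionException e) {
            return ""; // extraction failed, nothing to index
        }
    }

    // Body of the periodic check: collects the text of every deferred job
    // that has finished since the last call.
    List<String> drainFinished() throws InterruptedException {
        List<String> finished = new ArrayList<String>();
        for (Iterator<Future<String>> it = pending.iterator(); it.hasNext();) {
            Future<String> job = it.next();
            if (job.isDone()) {
                it.remove();
                try {
                    finished.add(job.get());
                } catch (ExecutionException e) {
                    // failed extraction: skipped in this sketch
                }
            }
        }
        return finished;
    }

    void shutdown() {
        executor.shutdownNow();
    }
}
```

a real version would have to remember which document each deferred
job belongs to, so that the periodic check can re-index the right
node. the sketch only returns the text. polling isDone() is one
possible answer to "how to detect when the extractor has finished".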