Hi,

On Thu, Jul 16, 2009 at 12:28, Jukka Zitting<jukka.zitt...@gmail.com> wrote:
> If I understand correctly, we update the search index within the
> transaction but if a text extraction task takes longer than the
> configurable limit, that part of the index update is replaced with an
> empty string ...

correct.

> ... and a new background task is fired to update the index
> for that document once the text extraction is complete.

that requires clarification on the term 'background task' ;)

in general text is extracted using an executor, which might consist
of a pool of threads. by default the executor is equipped with
a number of threads that is twice the number of available processors.
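
to illustrate the default sizing, here is a minimal sketch using plain
java.util.concurrent. the class and method names (ExtractorPool,
defaultPoolSize, newExtractorExecutor) are made up for this mail and are
not the actual jackrabbit code; a fixed-size pool is also just an assumption.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper, not part of Jackrabbit.
class ExtractorPool {

    // Default pool size: twice the number of available processors.
    static int defaultPoolSize() {
        return 2 * Runtime.getRuntime().availableProcessors();
    }

    // Assumption: a fixed-size thread pool stands in for the executor.
    static ExecutorService newExtractorExecutor() {
        return Executors.newFixedThreadPool(defaultPoolSize());
    }
}
```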

either:

- the extractor task (TextExtractorJob) completes within the
extractorTimeout and the text gets indexed

or

- the task gets pushed into the indexing queue where it will
complete at some point in the future. a periodic check task
will then update the index when the extractor has finished its work.
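
the two cases above could be sketched roughly like this. it is only an
illustration under assumptions: DeferredExtraction, extract and
collectFinished are hypothetical names, the job is modelled as a
Callable<String> rather than the real TextExtractorJob, and a plain list
stands in for the indexing queue.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, not the actual Jackrabbit implementation.
class DeferredExtraction {

    // Stand-in for the indexing queue: jobs that missed the timeout.
    private final List<Future<String>> pending = new ArrayList<>();
    private final ExecutorService executor;
    private final long timeoutMillis;

    DeferredExtraction(ExecutorService executor, long timeoutMillis) {
        this.executor = executor;
        this.timeoutMillis = timeoutMillis;
    }

    // Returns the extracted text if the job finishes within the timeout.
    // Otherwise returns "" and keeps the still-running job for later.
    String extract(Callable<String> job) throws InterruptedException {
        Future<String> future = executor.submit(job);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            synchronized (pending) {
                pending.add(future);
            }
            return "";
        } catch (ExecutionException e) {
            // Extraction failed: index nothing for this value.
            return "";
        }
    }

    // Periodic check: collects the text of every deferred job that has
    // finished since the last call, so the index can be updated with it.
    List<String> collectFinished() throws InterruptedException {
        List<String> finished = new ArrayList<>();
        synchronized (pending) {
            Iterator<Future<String>> it = pending.iterator();
            while (it.hasNext()) {
                Future<String> future = it.next();
                if (future.isDone()) {
                    it.remove();
                    try {
                        finished.add(future.get());
                    } catch (ExecutionException e) {
                        // Failed extraction: drop it, nothing to index.
                    }
                }
            }
        }
        return finished;
    }
}
```

the sketch only polls isDone() from a periodic check, so it learns that an
extractor has finished with some delay rather than being notified directly.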

> Would it be a problem to *always* defer text extraction to a
> background task that's disconnected from the transaction? That would
> make things a lot simpler at a slight loss of functionality.

I don't think that would be a big problem. but the issue remains, how to
detect when the extractor has finished its work.

> Alternatively, we should probably move the extraction timeout handling
> to some getExtractedText(long timeout) method that does a
> wait(timeout) call on the extraction task, waiting for it to return
> the extracted text as a String. If the timeout is reached, then just
> an empty string is used and the rest of the extraction task is placed
> in the indexing queue.

does that mean you want to change the return value of a TextExtractor from
Reader to String? or would that be on top of the existing interface?

regards
 marcel
