Re: LazyTextExtractorField and background text extraction

Jukka Zitting Thu, 16 Jul 2009 04:25:24 -0700

Hi,

On Thu, Jul 16, 2009 at 1:16 PM, Marcel Reutegger<marcel.reuteg...@gmx.net> wrote:
> On Thu, Jul 16, 2009 at 12:28, Jukka Zitting<jukka.zitt...@gmail.com> wrote:
>> ... and a new background task is fired to update the index
>> for that document once the text extraction is complete.
>
> that requires clarification on the term 'background task' ;)


Yeah, I meant the latter of your points, i.e. the task will fall
outside the ongoing transaction and will be performed at some point in
the (near) future.

>> Would it be a problem to *always* defer text extraction to a
>> background task that's disconnected from the transaction? That would
>> make things a lot simpler at a slight loss of functionality.
>
> I don't think that would be a big problem. but the issue remains, how to
> detect when the extractor has finished its work.

AFAIUI we wouldn't need any timeout detection as there would be no
user transaction to be blocked. Or are the background index updates
also serialized on a single write lock? But even in that case we
can simply wait until we've received all the text from the Reader and
fire the index update only after that.
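
A rough sketch of that "drain the Reader first, then update the index" idea could look like the following. Everything here is hypothetical and only illustrates the ordering: DeferredExtractionSketch, drain, extractInBackground and the indexUpdate callback are made-up names, not existing Jackrabbit API, and the executor stands in for whatever background mechanism would actually be used.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical sketch; none of these names are Jackrabbit API.
class DeferredExtractionSketch {

    // Read the Reader to the end so the caller gets the complete text.
    static String drain(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            text.append(buffer, 0, n);
        }
        return text.toString();
    }

    // Run the extraction outside the caller's thread (and thus outside the
    // user transaction). The index update callback is invoked only after
    // all of the extracted text has been received.
    static Future<Void> extractInBackground(ExecutorService executor,
                                            Reader extracted,
                                            Consumer<String> indexUpdate) {
        Callable<Void> task = () -> {
            try (Reader r = extracted) {
                indexUpdate.accept(drain(r));
            }
            return null;
        };
        return executor.submit(task);
    }
}
```

With this shape any slow extraction cost is paid in drain(), before the index update starts, so a write lock (if there is one) would only be needed for the update itself.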

>> Alternatively, we should probably move the extraction timeout handling
>> to some getExtractedText(long timeout) method that does a
>> wait(timeout) call on the extraction task, waiting for it to return
>> the extracted text as a String. If the timeout is reached, then just
>> an empty string is used and the rest of the extraction task is placed
>> in the indexing queue.
>
> does that mean you want to change the return value of a TextExtractor from
> Reader to String? or that would be on top of the existing interface?

I'm in any case thinking of replacing the TextExtractor interface with
the Tika Parser. We'd do any extra processing on top of that.
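
For reference, the getExtractedText(long timeout) idea quoted above could be sketched roughly as below. This is only an illustration under assumptions: it uses Future.get(timeout, unit) from java.util.concurrent in place of a raw wait(timeout) call, and TimedExtractionSketch, the extractor Callable and the indexingQueue callback are hypothetical names rather than existing Jackrabbit or Tika API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

// Hypothetical sketch; only the getExtractedText(long) name comes from the thread.
class TimedExtractionSketch {
    private final Future<String> extraction;
    private final Consumer<Future<String>> indexingQueue;

    TimedExtractionSketch(ExecutorService executor,
                          Callable<String> extractor,
                          Consumer<Future<String>> indexingQueue) {
        // Start the extraction immediately in the background.
        this.extraction = executor.submit(extractor);
        this.indexingQueue = indexingQueue;
    }

    // Wait up to timeoutMillis for the extracted text. On timeout, return an
    // empty string and hand the still-running task to the indexing queue.
    String getExtractedText(long timeoutMillis) throws InterruptedException {
        try {
            return extraction.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            indexingQueue.accept(extraction);
            return "";
        } catch (ExecutionException e) {
            // Assumption: a failed extraction is indexed as if there were no text.
            return "";
        }
    }
}
```

The extractor returns a String here just to keep the sketch short; whether the real interface hands back a Reader or a String is the open question above.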

BR,

Jukka Zitting
