Re: LazyTextExtractorField and background text extraction

Marcel Reutegger Thu, 16 Jul 2009 02:05:21 -0700

Hi,

On Wed, Jul 15, 2009 at 19:23, Marcel Reutegger<[email protected]> wrote:
> On Wed, Jul 15, 2009 at 18:21, Jukka Zitting<[email protected]> wrote:
>> Many of the text extractors (especially now that we're migrating to
>> Apache Tika) parse the document in streaming mode. So if we do need to
>> convert the Reader to a String, it would be more efficient if the text
>> extractor weren't blocked until when the indexer calls the
>> stringValue() method.
>
> I see. I'll have to look into this in more detail. so far the
> background text extraction basically assumed that when the reader is
> returned the major part of the extraction job is done. but it seems
> that's not the case anymore.


hmm, even if the conversion from reader to string is done in a
separate thread as part of the extractor job, the issue remains
when the reader is used as is. in that case the indexer still has to
wait for the tika parser, which causes the index update to stall.
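
to make the first half concrete, here is a rough sketch of draining the
reader in a separate thread with a timeout. this is not the jackrabbit
code: BackgroundDrain, drain() and extractOrNull() are made-up names, and
it assumes a plain java.util.concurrent executor stands in for the
extractor job pool.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Jackrabbit.
class BackgroundDrain {

    // Reads the whole stream; with a streaming parser this is where the
    // actual parsing work happens.
    static String drain(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            text.append(buf, 0, n);
        }
        return text.toString();
    }

    // Returns the extracted text, or null if it was not ready in time.
    // On timeout the job is left running so its result could be picked
    // up later (how that happens is outside this sketch).
    static String extractOrNull(ExecutorService pool, Reader reader, long timeoutMillis) {
        Callable<String> task = () -> drain(reader);
        Future<String> job = pool.submit(task);
        try {
            return job.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction failed", e.getCause());
        }
    }
}
```

this only covers the string case. when the reader itself is handed to
lucene, nothing in this sketch bounds the time spent reading it.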

we'd have to change how the indexer detects that the extractor has
timed out. currently it decides based on the time it takes to get the
reader from the text extractor, but now it should also take into
account how long it takes to consume the reader. the big issue here
is that the reader is only consumed when the lucene document is added
to the index. that's too late to replace the reader with a dummy
value and put the extractor job into the indexing queue.
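
as an illustration of counting consumption time towards the timeout, a
reader wrapper could add up the time spent inside read() calls. the
sketch below is an assumption about how that might look: TimedReader and
overBudget() are invented names, and it does nothing about the "too
late" problem, it only measures.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical wrapper, not part of Jackrabbit.
class TimedReader extends FilterReader {

    // Total time spent inside read() calls on the wrapped reader.
    private long nanosInRead;

    TimedReader(Reader in) {
        super(in);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        long start = System.nanoTime();
        try {
            return super.read(buf, off, len);
        } finally {
            nanosInRead += System.nanoTime() - start;
        }
    }

    @Override
    public int read() throws IOException {
        long start = System.nanoTime();
        try {
            return super.read();
        } finally {
            nanosInRead += System.nanoTime() - start;
        }
    }

    long nanosInRead() {
        return nanosInRead;
    }

    // The budget covers both phases: obtaining the reader and consuming it.
    static boolean overBudget(long obtainNanos, TimedReader reader, long budgetNanos) {
        return obtainNanos + reader.nanosInRead() > budgetNanos;
    }
}
```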

hmm, unless the TextExtractorReader sitting in between tracks the time
it takes to consume the stream and simply stops feeding the indexer if
it takes too long. however, that would require access to the indexing
queue from within the TextExtractorReader. it would have to create a
new lucene document that goes into the indexing queue and ensure that
it contains all the content. this might be difficult because other
readers may have been read and closed already...
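
a minimal sketch of the "stops feeding" idea, under loud assumptions:
DeadlineReader is a made-up class (not the real TextExtractorReader), and
the onTimeout callback is a hypothetical stand-in for whatever would put
the job into the indexing queue. the deadline is only checked between
read() calls, so a single read() that blocks inside the parser is not
interrupted by this.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper, not part of Jackrabbit.
class DeadlineReader extends FilterReader {

    private final long budgetNanos;
    private final Runnable onTimeout; // hypothetical hook, e.g. re-queue the extraction job
    private boolean started;
    private long deadlineNanos;
    private boolean timedOut;

    DeadlineReader(Reader in, long budgetMillis, Runnable onTimeout) {
        super(in);
        this.budgetNanos = TimeUnit.MILLISECONDS.toNanos(budgetMillis);
        this.onTimeout = onTimeout;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (timedOut) {
            return -1; // keep reporting end of stream after the cut-off
        }
        if (!started) {
            // The clock starts when consumption starts, not at construction.
            started = true;
            deadlineNanos = System.nanoTime() + budgetNanos;
        }
        if (System.nanoTime() - deadlineNanos >= 0) {
            timedOut = true;
            onTimeout.run(); // fired exactly once
            return -1;       // pretend the stream ended here
        }
        return super.read(buf, off, len);
    }

    @Override
    public int read() throws IOException {
        char[] one = new char[1];
        return read(one, 0, 1) == -1 ? -1 : one[0];
    }

    boolean timedOut() {
        return timedOut;
    }
}
```

the hard part described above is untouched by this: the callback would
still need enough context to rebuild a complete lucene document.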

regards
 marcel
