Re: LazyTextExtractorField and background text extraction

Marcel Reutegger Wed, 15 Jul 2009 10:23:50 -0700

Hi,

On Wed, Jul 15, 2009 at 18:21, Jukka Zitting<jukka.zitt...@gmail.com> wrote:
> Hi,
>
> Looking through the indexing code I found that the binary values are
> currently being turned into fulltext fields using the
> LazyTextExtractorField class. This class only reads from the Reader
> returned from the configured TextExtractor when the stringValue()
> method is called. And then it reads the entire extracted text stream
> into a string that gets returned.
>
> Two questions:
>
> * Why do we need to convert the Reader to a String?


The class was introduced for JCR-1730. When a value needs to be stored
in the index, and not just tokenized, one must use a String and cannot
use a Reader.
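
To make the lazy conversion concrete, here is a minimal sketch of the
idea, not the actual Jackrabbit class. The name LazyTextSketch is
hypothetical and it does not extend any Lucene type; it only shows that
nothing is read from the extractor's Reader until stringValue() is
called, and that the whole stream then ends up in a single String.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a lazily evaluated text field.
// Assumption: the Reader comes from a text extractor and is read only once.
final class LazyTextSketch {
    private final Reader reader;  // supplied by the text extractor
    private String extracted;     // cached after the first read

    LazyTextSketch(Reader reader) {
        this.reader = reader;
    }

    // Drains the Reader on the first call and caches the result.
    // The caller's thread blocks here while extraction is still running.
    synchronized String stringValue() {
        if (extracted == null) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            try (Reader r = reader) {
                int n;
                while ((n = r.read(buffer)) != -1) {
                    text.append(buffer, 0, n);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            extracted = text.toString();
        }
        return extracted;
    }

    public static void main(String[] args) {
        LazyTextSketch field = new LazyTextSketch(new StringReader("extracted text"));
        System.out.println(field.stringValue());
    }
}
```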

> * Why isn't the Reader to String conversion happening in one of the
> pooled text extractor threads?

Hmm, I guess I didn't really think about that when I introduced
LazyTextExtractorField. All the extractor code is based on Readers
that will contain the text, so I'm not sure how this can be changed
easily.

> Many of the text extractors (especially now that we're migrating to
> Apache Tika) parse the document in streaming mode. So if we do need to
> convert the Reader to a String, it would be more efficient if the text
> extractor weren't blocked until when the indexer calls the
> stringValue() method.

I see. I'll have to look into this in more detail. So far, the
background text extraction basically assumed that by the time the
Reader is returned, the major part of the extraction job is done. But
it seems that's no longer the case.
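
One possible direction, purely as a sketch under assumptions: drain the
Reader on a pool thread and hand the indexer a Future instead of a
Reader. The names BackgroundExtractionSketch, drain and
extractInBackground are hypothetical, and a plain ExecutorService stands
in for the pooled text extractor threads; this is not how the current
code is structured.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: do the Reader-to-String conversion in the pool.
final class BackgroundExtractionSketch {

    // Reads the whole stream; intended to run on an extractor pool thread.
    static String drain(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    // Submits extraction plus draining as one task, so a streaming parser
    // is not left waiting for the indexer to start reading.
    static Future<String> extractInBackground(ExecutorService pool,
                                              Callable<Reader> extractor) {
        Callable<String> task = () -> drain(extractor.call());
        return pool.submit(task);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<String> text =
                    extractInBackground(pool, () -> new StringReader("extracted text"));
            // The indexer would call get() only when it needs the value.
            System.out.println(text.get());
        } finally {
            pool.shutdown();
        }
    }
}
```

With something like this, stringValue() would reduce to a Future.get()
call, at the cost of holding the full extracted text in memory earlier.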

regards
 marcel
