Hi,

Looking through the indexing code I found that the binary values are
currently being turned into fulltext fields using the
LazyTextExtractorField class. This class reads from the Reader
returned by the configured TextExtractor only when the stringValue()
method is called, and at that point it reads the entire extracted
text stream into a string that gets returned.

Two questions:

* Why do we need to convert the Reader to a String?

* Why isn't the Reader to String conversion happening in one of the
pooled text extractor threads?

Many of the text extractors (especially now that we're migrating to
Apache Tika) parse the document in streaming mode. So if we do need to
convert the Reader to a String, it would be more efficient if the text
extractor weren't left blocked until the indexer calls the
stringValue() method.
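
To make the second question concrete, here is a rough sketch of what
I have in mind. It is not the actual Jackrabbit code: the class and
method names (EagerExtraction, readFully, extractAsync) are made up
for illustration, and a plain java.util.concurrent pool stands in for
the text extractor thread pool. The idea is simply that the Reader is
drained on a pooled thread, so the indexer only has to pick up the
finished String.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration only; none of these names exist in Jackrabbit.
public class EagerExtraction {

    /** Drains the given reader into a String and closes it. */
    static String readFully(Reader reader) {
        try (Reader r = reader) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int n;
            while ((n = r.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
            return text.toString();
        } catch (IOException e) {
            // Wrapped so the method can be used as a Supplier.
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Starts the Reader to String conversion on a pooled thread right away,
     * instead of waiting for the consumer to ask for the value.
     */
    static CompletableFuture<String> extractAsync(Reader extracted, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> readFully(extracted), pool);
    }

    public static void main(String[] args) {
        // Stand-in for the pool of text extractor threads (an assumption).
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // A StringReader stands in for the Reader a text extractor returns.
            CompletableFuture<String> text =
                    extractAsync(new StringReader("extracted text"), pool);
            // The equivalent of stringValue() would only wait for the result.
            System.out.println(text.join());
        } finally {
            pool.shutdown();
        }
    }
}
```

With something like this the streaming parser can run to completion
in the background, and the later stringValue() call blocks only if
the extraction hasn't finished yet.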

BR,

Jukka Zitting
