Hi,

Looking through the indexing code, I found that binary values are currently turned into fulltext fields using the LazyTextExtractorField class. This class only reads from the Reader returned by the configured TextExtractor when the stringValue() method is called, and at that point it reads the entire extracted text stream into a String that gets returned.
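To illustrate the pattern, here is a simplified sketch. It is not the actual LazyTextExtractorField code: the LazyReaderField class, its isExtracted() helper and the StringReader standing in for the extractor output are all made up for this example. The only point is that nothing is read from the Reader until stringValue() is called, and that the whole stream then ends up in memory as a single String.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/**
 * Hypothetical stand-in for a lazily evaluated fulltext field.
 * The Reader is assumed to come from a text extractor.
 */
final class LazyReaderField {

    private final Reader reader;
    private String extract; // cached result, null until first use

    LazyReaderField(Reader reader) {
        this.reader = reader;
    }

    /** Returns true once the Reader has been drained. */
    synchronized boolean isExtracted() {
        return extract != null;
    }

    /** Drains the whole Reader into a String on the first call. */
    synchronized String stringValue() {
        if (extract == null) {
            StringBuilder sb = new StringBuilder();
            char[] buffer = new char[4096];
            try (Reader r = reader) {
                int n;
                while ((n = r.read(buffer)) != -1) {
                    sb.append(buffer, 0, n);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            extract = sb.toString();
        }
        return extract;
    }

    public static void main(String[] args) {
        // StringReader stands in for the extractor's output stream.
        LazyReaderField field =
                new LazyReaderField(new StringReader("extracted text"));
        System.out.println(field.isExtracted()); // nothing read yet
        System.out.println(field.stringValue()); // read happens here
        System.out.println(field.isExtracted());
    }
}
```

In this sketch the thread that calls stringValue() pays for the whole read, which is the behaviour my questions are about.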
Two questions:

* Why do we need to convert the Reader to a String?
* Why isn't the Reader-to-String conversion happening in one of the pooled text extractor threads?

Many of the text extractors (especially now that we're migrating to Apache Tika) parse the document in streaming mode. So if we do need to convert the Reader to a String, it would be more efficient if the text extractor weren't blocked until the indexer calls the stringValue() method.

BR,

Jukka Zitting