Hi,

On Wed, Jul 15, 2009 at 19:23, Marcel Reutegger<[email protected]> wrote:
> On Wed, Jul 15, 2009 at 18:21, Jukka Zitting<[email protected]> wrote:
>> Many of the text extractors (especially now that we're migrating to
>> Apache Tika) parse the document in streaming mode. So if we do need to
>> convert the Reader to a String, it would be more efficient if the text
>> extractor weren't blocked until when the indexer calls the
>> stringValue() method.
>
> I see. I'll have to look into this in more detail. so far the
> background text extraction basically assumed that when the reader is
> returned the major part of the extraction job is done. but it seems
> that's not the case anymore.

hmm, even if the conversion from reader to string is done in a separate thread as part of the extractor job, the issue remains when the reader is used as is. in that case the indexer also has to wait for the Tika parser, which causes the index update to stall.

we'd have to change how the indexer finds out whether the extractor has timed out. currently it does so based on the time it takes to get the reader from the text extractor, but now it should also take into account how long it takes to consume the reader. the big issue here is that the reader is only consumed when the Lucene document is added to the index. that's too late to replace the reader with a dummy value and put the extractor job into the indexing queue.

hmm, unless the TextExtractorReader in between tracks the time it takes to consume the stream and simply stops feeding if it takes too long. however, that would require access to the indexing queue from within the TextExtractorReader. it would have to create a new Lucene document that goes into the indexing queue and ensure that it contains all the content. this might be difficult because other readers may have been read and closed already...

regards
 marcel
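
To make the idea of a reader that tracks its own consumption time more concrete, here is a rough sketch. It is not the existing TextExtractorReader: the class name TimeBudgetReader, the isTimedOut() method and the injectable clock are made up for illustration. It only covers the "stop feeding" part. Re-queuing the extractor job and rebuilding the Lucene document are left out, and a single read() that blocks inside the parser cannot be interrupted this way; the budget is only checked once that read returns.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/**
 * Hypothetical wrapper (not Jackrabbit code): sums up the time spent in
 * read() calls on the wrapped reader and reports end-of-stream once the
 * accumulated time exceeds a budget.
 */
class TimeBudgetReader extends FilterReader {

    private final long budgetNanos;
    private final LongSupplier clock; // injectable so the behaviour can be tested
    private long consumedNanos;
    private boolean timedOut;

    TimeBudgetReader(Reader in, long budgetMillis) {
        this(in, TimeUnit.MILLISECONDS.toNanos(budgetMillis), System::nanoTime);
    }

    TimeBudgetReader(Reader in, long budgetNanos, LongSupplier clock) {
        super(in);
        this.budgetNanos = budgetNanos;
        this.clock = clock;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (timedOut) {
            return -1; // stop feeding; the rest of the text is dropped
        }
        long start = clock.getAsLong();
        int n = super.read(buf, off, len);
        consumedNanos += clock.getAsLong() - start;
        if (consumedNanos > budgetNanos) {
            // the chars just read are still returned; later calls get -1
            timedOut = true;
        }
        return n;
    }

    @Override
    public int read() throws IOException {
        char[] one = new char[1];
        int n = read(one, 0, 1);
        return n == -1 ? -1 : one[0];
    }

    /** True if the budget was exceeded and the text may be truncated. */
    boolean isTimedOut() {
        return timedOut;
    }
}
```

After the document has been added to the index, the caller could check isTimedOut() to learn that the indexed text is incomplete. What it should do then is exactly the open question above.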
