Hi,

On Thu, Jul 16, 2009 at 11:32, Jukka Zitting <jukka.zitt...@gmail.com> wrote:
> On Thu, Jul 16, 2009 at 11:04 AM, Marcel
> Reutegger <marcel.reuteg...@gmx.net> wrote:
>> hmm, even if the conversion from reader to string is done in a
>> separate thread as part of the extractor job, there remains the issue
>> when the reader is used as is.
>
> As far as I can tell from the code, this is currently not the case as
> all the binary values get wrapped into LazyTextExtractorFields.
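(For readers following along: the idea behind wrapping binary values so that extraction is deferred can be sketched roughly as below. This is a hypothetical illustration, not Jackrabbit's actual LazyTextExtractorField implementation; the class and method names are invented, and the expensive parse is simulated by a Supplier.)

```java
import java.util.function.Supplier;

// Hypothetical sketch of a lazily-extracted text value: the expensive
// text extraction runs only when the indexer first asks for the text,
// not when the field object is created.
class LazyTextValue {
    private final Supplier<String> extractor; // stands in for the real parser
    private boolean extracted = false;
    private String text;

    LazyTextValue(Supplier<String> extractor) {
        this.extractor = extractor;
    }

    synchronized boolean isExtracted() {
        return extracted;
    }

    synchronized String getText() {
        if (!extracted) {
            text = extractor.get(); // the potentially slow parse happens here
            extracted = true;
        }
        return text;
    }
}
```

The point of the pattern is that creating the field is cheap; the cost is paid (at most once) when the text is actually needed.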
that's correct. I meant it would be a general problem even if we changed
the way LazyTextExtractorField works, i.e. if it returned the reader in
case the content is not stored in the index.

>> we'd have to change the way how the indexer finds out whether the
>> extractor times out.
>
> Would it help if we added an unlimited buffering mechanism (backed by
> temporary files as needed) to the Readers so that if the indexer gets
> blocked extracting text from one document, all the other pending
> documents can automatically continue text extraction in parallel? This
> might cause occasional blocking in the indexer, but on the average it
> should do about as well as maintaining an explicit indexing queue.

I'm not sure I understand that correctly. With the current design,
multiple nodes are already indexed in parallel, but the index update as
a whole will still be blocked, waiting for *all* nodes to be indexed.

The indexing queue is meant to take over long-running text extractions
and do that work outside of the index update. Instead of indexing the
real content, the timed-out text extracts are replaced with dummy
values. A new index update is done when the extraction has finished
(this is currently detected by an available reader from the indexing
queue) with the complete text extract. There is a configuration
parameter, extractorTimeout, which limits the amount of time spent
extracting text (or waiting for that to happen).

I think it must be possible to configure the repository to never block
on text extractions. This is vital because Jackrabbit currently supports
only one writing transaction at a time, and the indexing is part of that
transaction.

regards
 marcel

> In fact if we did this in Tika, we could avoid the extra buffering
> entirely for things like plain text documents and other formats where
> the parsing overhead is negligible.
>
> BR,
>
> Jukka Zitting
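(Editor's note for readers: the timeout behaviour described above can be sketched with standard java.util.concurrent primitives. This is an illustrative assumption about the mechanism, not Jackrabbit's actual code; the class, the DUMMY placeholder, and extractWithTimeout are invented names.)

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: the indexer waits at most extractorTimeout for a
// text extract; on timeout it indexes a placeholder value and leaves the
// extraction running in the background, so a later index update can
// replace the placeholder with the complete text.
class TimedExtraction {
    static final String DUMMY = ""; // placeholder indexed on timeout

    static String extractWithTimeout(ExecutorService pool,
                                     Callable<String> extraction,
                                     long timeoutMillis) {
        Future<String> future = pool.submit(extraction);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // extraction keeps running; a later index update would pick
            // up future.get() once it completes
            return DUMMY;
        } catch (InterruptedException | ExecutionException e) {
            future.cancel(true);
            return DUMMY;
        }
    }
}
```

With a setup like this, the index update itself never blocks longer than the configured timeout on any single document, which matches the "never block on text extractions" requirement stated above.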