[
https://issues.apache.org/jira/browse/JCR-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jukka Zitting updated JCR-2219:
-------------------------------
Attachment: JCR-2219.patch
Attached a patch that starts the background text extraction thread as early as
possible and counts the extraction timeout not only against the creation of a
Reader but also against reading the extracted text from the Reader.
Note that the patch buffers the *entire* extracted text in memory before
passing it on to indexing. We already buffer the text to a String anyway, so
this isn't much of a regression (though we now hold two copies of the string),
but it would obviously be better to avoid that.
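As a rough illustration of the idea (not the code in the patch), the sketch
below submits an extraction job immediately and has the job both open and drain
the Reader, so that one timed wait covers both steps. The names
TimedExtractionSketch, Extractor, start() and await() are hypothetical and do
not exist in Jackrabbit. Returning an empty result on timeout is also only an
assumption made for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedExtractionSketch {

    /** Hypothetical stand-in for a text extractor; not the Jackrabbit interface. */
    interface Extractor {
        Reader extract() throws IOException;
    }

    /**
     * Submits the job right away so extraction starts as early as possible.
     * The job opens the Reader and reads it to the end, so a later timed
     * wait on the Future covers both steps.
     */
    static Future<String> start(ExecutorService pool, Extractor extractor) {
        Callable<String> job = () -> {
            StringBuilder text = new StringBuilder();
            try (Reader reader = extractor.extract()) {
                char[] buffer = new char[4096];
                int n;
                while ((n = reader.read(buffer)) != -1) {
                    text.append(buffer, 0, n);
                }
            }
            // The entire extracted text is held in memory at this point.
            return text.toString();
        };
        return pool.submit(job);
    }

    /** Waits up to timeoutMillis for the text; empty on timeout or failure. */
    static Optional<String> await(Future<String> job, long timeoutMillis) {
        try {
            return Optional.of(job.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // The job keeps running in the background (assumption of this sketch).
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> job = start(pool, () -> new StringReader("hello world"));
        System.out.println(await(job, 1000).orElse("(timed out)"));
        pool.shutdown();
    }
}
```

Because the job returns a fully buffered String, the caller never blocks on a
half-read Reader. The cost is the extra in-memory copy mentioned above.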
Some of the test cases had implicit assumptions about indexing speed that were
broken by these changes. Based on some previous code snippets I added a new
SearchIndex.flush() method that makes sure that all pending index changes have
been processed and flushed to disk. This method is now automatically called by
the executeSQLQuery() and executeXPATHQuery() methods in AbstractQueryTest to
avoid any issues with late index updates. Later on we might also find uses for
the new flush() method outside the test suite.
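To show why a flush step removes the timing assumptions from the tests, here is
a minimal sketch of an index that applies updates on a background thread. It is
not the SearchIndex implementation: FlushSketch, addLater() and contains() are
hypothetical names, and writing to disk is not modelled. The flush() here only
waits until all previously queued updates have been applied.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class FlushSketch {

    // A single worker thread applies updates in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final List<String> index = new CopyOnWriteArrayList<>();

    /** Queues an index update that is applied some time later. */
    void addLater(String doc) {
        worker.submit(() -> {
            try {
                Thread.sleep(20); // simulate slow text extraction
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            index.add(doc);
        });
    }

    /**
     * Blocks until every update queued before this call has been applied.
     * This works because the single worker runs tasks in FIFO order, so the
     * empty marker task finishes only after all earlier tasks.
     */
    void flush() {
        try {
            worker.submit(() -> { }).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    boolean contains(String doc) {
        return index.contains(doc);
    }

    void close() {
        worker.shutdown();
    }

    public static void main(String[] args) {
        FlushSketch idx = new FlushSketch();
        idx.addLater("doc-1");
        idx.flush(); // without this, the lookup below could run too early
        System.out.println(idx.contains("doc-1"));
        idx.close();
    }
}
```

A test helper that calls flush() before each query, as the
AbstractQueryTest methods now do, sees all earlier updates regardless of how
long the background work took.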
Things to do:
* The patch still mostly follows the existing code structure to make it easier
to review the changes. We could probably simplify the code and avoid the extra
String copy of the extracted text by merging the TextExtractorReader and
TextExtractorJob classes.
* Going further, we could probably drop the PooledTextExtractor class in favor
of a simpler thread pool that the NodeIndexer would use to execute
TextExtractorJobs.
> Improved background text extraction
> -----------------------------------
>
> Key: JCR-2219
> URL: https://issues.apache.org/jira/browse/JCR-2219
> Project: Jackrabbit Content Repository
> Issue Type: Improvement
> Components: indexing, jackrabbit-core
> Reporter: Jukka Zitting
> Priority: Minor
> Attachments: JCR-2219.patch
>
>
> As recently discussed on the mailing list (see
> http://markmail.org/message/syt7lc2guzapt7la), the current approach to text
> extraction in background threads doesn't work that well especially with the
> Tika-based extractors that support streamed parsing of many document types.
> Also, currently *all* of the extracted text streams are buffered into
> Strings before being passed into the Lucene index. It would be good if we
> could somehow get back to passing just Readers to Lucene.