[jira] Updated: (JCR-2219) Improved background text extraction

Jukka Zitting (JIRA) Thu, 16 Jul 2009 08:26:52 -0700

     [ https://issues.apache.org/jira/browse/JCR-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jukka Zitting updated JCR-2219:
-------------------------------

    Attachment: JCR-2219.patch

Attached a patch that starts the background text extraction thread as early as 
possible and counts the extraction timeout not only against the creation of a 
Reader but also against reading the extracted text from the Reader.
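
As a rough illustration of the timeout handling described above (not the code in the attached patch), the sketch below runs both steps, creating the Reader and draining it, inside one background task and applies a single timeout to the whole task. The class name TimedExtractionSketch, the extract() method and its parameters are hypothetical, and the lambda syntax assumes a modern Java version.

```java
import java.io.Reader;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical illustration; these are not the classes touched by the patch. */
public class TimedExtractionSketch {

    /**
     * Runs the extraction in a background thread and waits at most
     * timeoutMillis for BOTH steps: creating the Reader and reading it.
     * Returns null if the full text was not available in time.
     */
    public static String extract(Callable<Reader> extractor, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<String> job = executor.submit(() -> {
                // The whole text is buffered in memory, as in the patch notes.
                StringBuilder text = new StringBuilder();
                try (Reader reader = extractor.call()) {
                    char[] buffer = new char[4096];
                    int n;
                    while ((n = reader.read(buffer)) != -1) {
                        text.append(buffer, 0, n);
                    }
                }
                return text.toString();
            });
            try {
                // One timeout covers Reader creation and reading together.
                return job.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                job.cancel(true); // interrupt the extraction thread
                return null;
            }
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In this sketch a caller that gets null back would index the node without the extracted text; what the real code does with a late extraction result is outside the scope of the example.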

Note that the patch buffers the *entire* extracted text in memory before 
passing it on to indexing. We already buffer the text into a String in any 
case, so this isn't much of a regression (though now we have two copies of the 
string), but obviously it would be better if we could avoid that.

Some of the test cases had implicit assumptions about indexing speed that were 
broken by these changes. Based on some previous code snippets I added a new 
SearchIndex.flush() method that makes sure that all pending index changes have 
been processed and flushed to disk. This method is now automatically called by 
the executeSQLQuery() and executeXPATHQuery() methods in AbstractQueryTest to 
avoid any issues with late index updates. Later on we might find some uses for 
the new flush() method also outside the test suite.
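
To make the flush-before-query idea concrete, here is a minimal sketch of an index that applies changes in a background thread and offers a flush() that blocks until all pending changes have been applied. It is not the SearchIndex implementation: FlushingIndexSketch, addLater() and query() are hypothetical names, the index is just an in-memory list, and nothing is written to disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for an index with asynchronous updates. */
public class FlushingIndexSketch implements AutoCloseable {

    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final List<Future<?>> pending = new ArrayList<>();
    private final List<String> documents = new CopyOnWriteArrayList<>();

    /** Queues an index update that is applied later by the worker thread. */
    public synchronized void addLater(String document) {
        pending.add(worker.submit(() -> {
            try {
                Thread.sleep(20); // simulate slow text extraction
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            documents.add(document);
        }));
    }

    /** Blocks until every update queued so far has been applied. */
    public void flush() throws InterruptedException, ExecutionException {
        List<Future<?>> toWait;
        synchronized (this) {
            toWait = new ArrayList<>(pending);
            pending.clear();
        }
        for (Future<?> update : toWait) {
            update.get();
        }
    }

    /** Flushes first, so the query never misses a late index update. */
    public boolean query(String term)
            throws InterruptedException, ExecutionException {
        flush();
        for (String document : documents) {
            if (document.contains(term)) {
                return true;
            }
        }
        return false;
    }

    @Override
    public void close() {
        worker.shutdown();
    }
}
```

A test helper that calls flush() before every query, as query() does here, no longer depends on how quickly the background updates happen to finish.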

Things to do:

* The patch still mostly follows the existing code structure to make it easier 
to review the changes. We could probably simplify the code and avoid the extra 
String copy of the extracted text by merging the TextExtractorReader and 
TextExtractorJob classes.

* Going further, we could probably drop the PooledTextExtractor class in favor 
of a simpler thread pool that the NodeIndexer would use to execute 
TextExtractorJobs.
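
The second item could look roughly like the sketch below: a plain fixed-size thread pool that runs a batch of extraction jobs under a shared timeout. This is only an assumption about one possible shape of such a pool; ExtractionPoolSketch and extractAll() are hypothetical, and the jobs are modelled as plain Callable<String> instances instead of the real TextExtractorJob class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a simple pool for extraction jobs. */
public class ExtractionPoolSketch {

    /**
     * Runs all jobs on a fixed-size pool and waits at most timeoutMillis.
     * The result list has one entry per job, in order; an entry is null
     * if the job failed or did not finish in time.
     */
    public static List<String> extractAll(
            List<Callable<String>> jobs, int threads, long timeoutMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // invokeAll cancels every job still running at the timeout.
            List<Future<String>> futures =
                    pool.invokeAll(jobs, timeoutMillis, TimeUnit.MILLISECONDS);
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                if (future.isCancelled()) {
                    results.add(null); // timed out
                    continue;
                }
                try {
                    results.add(future.get());
                } catch (ExecutionException e) {
                    results.add(null); // extraction failed
                }
            }
            return results;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A real indexer would more likely keep one long-lived pool instead of creating one per batch; the per-call pool here only keeps the example self-contained.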


> Improved background text extraction
> -----------------------------------
>
>                 Key: JCR-2219
>                 URL: https://issues.apache.org/jira/browse/JCR-2219
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: indexing, jackrabbit-core
>            Reporter: Jukka Zitting
>            Priority: Minor
>         Attachments: JCR-2219.patch
>
>
> As recently discussed on the mailing list (see 
> http://markmail.org/message/syt7lc2guzapt7la), the current approach to text 
> extraction in background threads doesn't work that well especially with the 
> Tika-based extractors that support streamed parsing of many document types.
> Also, currently *all* of the extracted text streams are buffered into 
> Strings before being passed into the Lucene index. It would be good if we 
> could somehow get back to passing just Readers to Lucene.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
