Improved background text extraction
-----------------------------------

                 Key: JCR-2219
                 URL: https://issues.apache.org/jira/browse/JCR-2219
             Project: Jackrabbit Content Repository
          Issue Type: Improvement
          Components: indexing, jackrabbit-core
            Reporter: Jukka Zitting
            Priority: Minor


As recently discussed on the mailing list (see 
http://markmail.org/message/syt7lc2guzapt7la), the current approach to text 
extraction in background threads doesn't work that well, especially with the 
Tika-based extractors that support streamed parsing of many document types.

Also, currently *all* of the extracted text streams are buffered into 
Strings before being passed into the Lucene index. It would be good if we could 
somehow get back to passing just Readers to Lucene.
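
To illustrate the idea of handing Lucene a Reader instead of a fully 
buffered String, here is a minimal sketch that uses only java.io. The class 
name LazyTextReader and its Supplier-based constructor are hypothetical and 
are not part of Jackrabbit, Tika, or Lucene; the sketch only shows how 
extraction could be deferred until the consumer actually starts reading.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Supplier;

/**
 * Hypothetical Reader that defers text extraction until the first read.
 * The supplier stands in for whatever extractor produces the text stream.
 */
final class LazyTextReader extends Reader {

    private final Supplier<Reader> extractor;
    private Reader delegate; // created on first read, null until then

    LazyTextReader(Supplier<Reader> extractor) {
        this.extractor = extractor;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (delegate == null) {
            // Extraction starts only when the consumer asks for characters.
            delegate = extractor.get();
        }
        return delegate.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        if (delegate != null) {
            delegate.close();
        }
    }

    /** Returns true once extraction has been triggered. */
    boolean isStarted() {
        return delegate != null;
    }
}
```

With such a wrapper the extracted text is streamed to the consumer in 
chunks and never has to be held in memory as a single String. Whether this 
fits the background extraction design discussed above is an open question.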
