[jira] Updated: (JCR-2219) Improved background text extraction

Jukka Zitting (JIRA) Thu, 16 Jul 2009 08:26:53 -0700

     [ 
https://issues.apache.org/jira/browse/JCR-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jukka Zitting updated JCR-2219:
-------------------------------

    Status: Patch Available  (was: Open)

> Improved background text extraction
> -----------------------------------
>
>                 Key: JCR-2219
>                 URL: https://issues.apache.org/jira/browse/JCR-2219
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: indexing, jackrabbit-core
>            Reporter: Jukka Zitting
>            Priority: Minor
>         Attachments: JCR-2219.patch
>
>
> As recently discussed on the mailing list (see 
> http://markmail.org/message/syt7lc2guzapt7la), the current approach to text 
> extraction in background threads doesn't work that well, especially with the 
> Tika-based extractors that support streamed parsing of many document types.
> Also, currently *all* of the extracted text streams are buffered into 
> Strings before being passed into the Lucene index. It would be good if we 
> could somehow get back to passing just Readers to Lucene.
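
For illustration only, the difference between buffering extracted text into a
String and handing a Reader to the consumer might be sketched as below. This is
not taken from JCR-2219.patch: the class and method names are hypothetical, and
countTokens is just a stand-in for an indexer that pulls characters from a
Reader on demand. It uses only java.io, so no Jackrabbit, Tika, or Lucene API is
assumed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sketch, not code from Jackrabbit or the attached patch.
public class ReaderVersusString {

    // Buffered approach: the whole extracted text is held in memory
    // before the consumer sees any of it.
    static String bufferToString(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Streamed approach: a stand-in for an indexer that reads characters
    // one at a time and never keeps the full text. It counts
    // whitespace-separated tokens.
    static int countTokens(Reader reader) throws IOException {
        int tokens = 0;
        boolean inToken = false;
        int c;
        while ((c = reader.read()) != -1) {
            if (Character.isWhitespace(c)) {
                inToken = false;
            } else if (!inToken) {
                inToken = true;
                tokens++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        String text = "extracted text from a document";

        // Buffered: materialize the String first, then wrap it again.
        String buffered = bufferToString(new StringReader(text));
        System.out.println(countTokens(new StringReader(buffered)));

        // Streamed: pass the Reader straight to the consumer.
        System.out.println(countTokens(new StringReader(text)));
    }
}
```

Both paths yield the same token count; the streamed path simply avoids the
intermediate String, which is the property the description asks to restore.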

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
