[ https://issues.apache.org/jira/browse/JCR-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771340#action_12771340 ]
Marcel Reutegger commented on JCR-2365:
---------------------------------------
Answering some follow-up questions that I got from Jeremy by email:
> Is my understanding correct in that once upgrading to 1.6.1, the current
> Text-extractors module will become obsolete?
No, 1.6.1 will just be a bug-fix release without changes in module
dependencies. It will contain a fix to the HTML text extractor.
> If so will any changes be required to the workspace.xml for the
> textFilterClasses parameter to enable the use of the Apache Tika
> extractors?
The Apache Tika based text extractor is only available in the upcoming 2.0
release, but not in 1.6.x.
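For reference, in 1.6.x the extractors are selected through the
textFilterClasses parameter of the SearchIndex element in workspace.xml.
A rough sketch of that fragment follows. The class names are the ones
shipped in jackrabbit-text-extractors; the index path and the particular
selection of extractors are only illustrative assumptions, so adjust them
to your own configuration:

```xml
<!-- Illustrative fragment of workspace.xml for Jackrabbit 1.6.x.
     The path value and the list of extractors are assumptions. -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Comma-separated list of text extractor classes used for indexing -->
  <param name="textFilterClasses"
         value="org.apache.jackrabbit.extractor.PlainTextExtractor,
                org.apache.jackrabbit.extractor.HTMLTextExtractor,
                org.apache.jackrabbit.extractor.XMLTextExtractor"/>
</SearchIndex>
```

No change to this parameter should be needed to pick up the 1.6.1 fix,
since the HTML extractor keeps the same class name.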
> Is it possible to enable this for JCR 1.6.0 so that HTML files have their
> numerics extracted and indexed?
It's probably easiest to either patch the 1.6.0 release, build
jackrabbit-text-extractors from the 1.6 branch, or wait for the 1.6.1 release.
> HTML Text Extractor does not extract or index numerics
> ------------------------------------------------------
>
> Key: JCR-2365
> URL: https://issues.apache.org/jira/browse/JCR-2365
> Project: Jackrabbit Content Repository
> Issue Type: Bug
> Components: indexing, jackrabbit-text-extractors
> Affects Versions: 1.6.0
> Environment: Win XP-Pro; Win 2003 Enterprise 32bit
> Reporter: Jeremy Anderson
> Fix For: 1.6.1, 2.0.0
>
>
> Numerics such as addresses/dates/financial figures are not extracted or
> indexed by the current HTML Extractor. These values are handled properly and
> are searchable when extracted via the PlainTextExtractor.