[ https://issues.apache.org/jira/browse/NUTCH-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015812#comment-17015812 ]
Sebastian Nagel commented on NUTCH-2525:
----------------------------------------
Thanks, [~jurian]! I've updated the patch again so that it applies to the
current master, and verified that it fixes a [problem recently discussed on the
user mailing
list|https://lists.apache.org/thread.html/0816be26d6793985dc19f66c3aa9ee3ee5c077e562ff8d10f5e0c077%40%3Cuser.nutch.apache.org%3E]
where mixed-case meta tags (from a PDF parsed by parse-tika) are not properly
indexed.
I will commit soon.
Note:
- For HTML documents, meta tags are treated differently by parse-html and
parse-tika: the former lowercases the meta tag names while the latter keeps the
original casing. We need to decide how to address this to ensure the usability
of the index-metadata plugin.
- The plugin parse-metatags performs a case-insensitive extraction of meta tags
and lowercases the meta tag names. In combination with index-metadata, the
names now have to be configured in lowercase, e.g., "metatag.dc.creator".
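To illustrate the parse-metatags point, a nutch-site.xml fragment could look roughly like the sketch below. It assumes the property names "metatags.names" (parse-metatags) and "index.parse.md" (index-metadata) as documented in nutch-default.xml, and that both plugins are already enabled in "plugin.includes". The tag "dc.creator" is only an example; please check the property descriptions in your Nutch version before relying on it.

```xml
<!-- Sketch only: property names assumed from nutch-default.xml;
     both plugins must also be listed in plugin.includes. -->
<property>
  <name>metatags.names</name>
  <!-- parse-metatags matches case-insensitively and stores the name lowercased -->
  <value>dc.creator</value>
</property>
<property>
  <name>index.parse.md</name>
  <!-- all lowercase, with the "metatag." prefix added by parse-metatags -->
  <value>metatag.dc.creator</value>
</property>
```

With the patch applied, a mixed-case value such as "metatag.DC.creator" in "index.parse.md" would presumably no longer match the lowercased key written by parse-metatags.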
> Metadata indexer cannot handle uppercase parse metadata
> -------------------------------------------------------
>
> Key: NUTCH-2525
> URL: https://issues.apache.org/jira/browse/NUTCH-2525
> Project: Nutch
> Issue Type: Bug
> Components: indexer, plugin
> Affects Versions: 1.14
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.17
>
> Attachments: NUTCH-2525-p1.patch, NUTCH-2525.patch
>
>
> MetadataIndexer lowercases keys for parse metadata, making it impossible to
> index metadata whose keys contain uppercase characters.