[ 
https://issues.apache.org/jira/browse/NUTCH-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015812#comment-17015812
 ] 

Sebastian Nagel commented on NUTCH-2525:
----------------------------------------

Thanks, [~jurian]! I've updated the patch again so that it applies to the 
current master and verified that it fixes a [problem recently discussed on the 
user mailing 
list|https://lists.apache.org/thread.html/0816be26d6793985dc19f66c3aa9ee3ee5c077e562ff8d10f5e0c077%40%3Cuser.nutch.apache.org%3E]
 where mixed-case meta tags (from a PDF parsed by parse-tika) are not properly 
indexed.

I will commit soon.

Note:
- For HTML documents, meta tags are treated differently by parse-html and 
parse-tika: the former lowercases the meta tag names while the latter keeps the 
original casing. We need to decide how to address this to ensure the usability 
of the index-metadata plugin.
- The plugin parse-metatags performs a case-insensitive extraction of meta tags 
and lowercases the meta tag names. In combination with index-metadata, the 
names now have to be configured in lowercase, e.g., "metatag.dc.creator".
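
To illustrate both points, here is a minimal sketch of the casing problem. It is not the actual plugin code: the class MetadataKeyLookup and its methods are hypothetical, and the parse metadata is modelled as a plain case-sensitive map.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration only; not taken from index-metadata or parse-metatags. */
public class MetadataKeyLookup {

    /** Lowercases the configured key before the lookup, so mixed-case keys are missed. */
    public static String lookupLowercased(Map<String, String> parseMeta, String configuredKey) {
        return parseMeta.get(configuredKey.toLowerCase(Locale.ROOT));
    }

    /** Uses the configured key unchanged, so a mixed-case key is found. */
    public static String lookupAsConfigured(Map<String, String> parseMeta, String configuredKey) {
        return parseMeta.get(configuredKey);
    }

    /** Assumed naming scheme: "metatag." prefix plus the lowercased meta tag name. */
    public static String metatagKey(String metaTagName) {
        return "metatag." + metaTagName.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new HashMap<>();
        // Mixed-case key, as a parser that keeps the original casing might store it.
        parseMeta.put("Author", "Jane Doe");

        System.out.println(lookupLowercased(parseMeta, "Author"));   // null
        System.out.println(lookupAsConfigured(parseMeta, "Author")); // Jane Doe
        System.out.println(metatagKey("DC.Creator"));                // metatag.dc.creator
    }
}
```

The first lookup shows why a key stored as "Author" cannot be reached once the configured name is lowercased. The last call shows why, under the assumed naming scheme, a meta tag such as "DC.Creator" has to be configured as "metatag.dc.creator".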

> Metadata indexer cannot handle uppercase parse metadata
> -------------------------------------------------------
>
>                 Key: NUTCH-2525
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2525
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer, plugin
>    Affects Versions: 1.14
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.17
>
>         Attachments: NUTCH-2525-p1.patch, NUTCH-2525.patch
>
>
> MetadataIndexer lowercases keys for parse metadata, making it impossible to 
> index metadata containing uppercase. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
