[ https://issues.apache.org/jira/browse/TIKA-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995648#comment-13995648 ]
Ken Krugler commented on TIKA-1296:
-----------------------------------
Hi Phil - thanks for bringing this up. I didn't even realize that
stringignorecase was an option. Can you think of any reason why we wouldn't
want to just change all of these HTML-related matches to use the
stringignorecase type?
> Add case insensitive matching for text/html mime type
> -----------------------------------------------------
>
> Key: TIKA-1296
> URL: https://issues.apache.org/jira/browse/TIKA-1296
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Affects Versions: 1.5
> Reporter: Phil Lester
>
> Currently, tika-mimetypes.xml provides matches in a couple of different cases
> for the text/html mime type (and possibly others) so that varying HTML writing
> styles are detected. As of Tika 1.5, a match can be made case insensitive
> with the "stringignorecase" type. Using it would consolidate some matches and
> improve detection of poorly-formed HTML that most browsers render regardless
> of case.
> For example:
> <match value="&lt;BODY" type="string" offset="0"/>
> <match value="&lt;body" type="string" offset="0"/>
> could become:
> <match value="&lt;BODY" type="stringignorecase" offset="0"/>
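To make the detection gain concrete, here is a minimal Java sketch of the difference between the two match types. It is not Tika's implementation: the class IgnoreCaseMatchSketch and the helpers matchesAt and toLowerAscii are hypothetical names introduced only for illustration, and the sketch assumes ASCII patterns compared byte by byte at a fixed offset.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical helper, not part of Tika.
final class IgnoreCaseMatchSketch {

    // Returns true if `value` occurs in `data` starting at `offset`.
    // With ignoreCase set, ASCII letters are compared case-insensitively,
    // mimicking the intent of a "stringignorecase" match.
    static boolean matchesAt(byte[] data, int offset, String value, boolean ignoreCase) {
        byte[] pattern = value.getBytes(StandardCharsets.US_ASCII);
        if (offset < 0 || offset + pattern.length > data.length) {
            return false; // pattern does not fit at this offset
        }
        for (int i = 0; i < pattern.length; i++) {
            int a = data[offset + i] & 0xFF;
            int b = pattern[i] & 0xFF;
            if (ignoreCase) {
                a = toLowerAscii(a);
                b = toLowerAscii(b);
            }
            if (a != b) {
                return false;
            }
        }
        return true;
    }

    // Lowercases ASCII A-Z only; all other byte values pass through unchanged.
    static int toLowerAscii(int c) {
        return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
    }
}
```

With input that starts with "<Body", matchesAt(data, 0, "<BODY", false) and matchesAt(data, 0, "<body", false) both return false, so the two case-sensitive rules miss it, while matchesAt(data, 0, "<BODY", true) returns true. A single case-insensitive rule therefore covers the mixed-case spellings that the pair of exact rules does not.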