Phil Lester created TIKA-1296:
---------------------------------

             Summary: Add case insensitive matching for text/html mime type
                 Key: TIKA-1296
                 URL: https://issues.apache.org/jira/browse/TIKA-1296
             Project: Tika
          Issue Type: Improvement
          Components: mime
    Affects Versions: 1.5
            Reporter: Phil Lester


Currently, tika-mimetypes.xml lists the match entries for the text/html mime
type (and possibly others) in a couple of different cases, so that varying
HTML writing styles are matched. As of Tika 1.5, these matches can be made
case insensitive using the "stringignorecase" type. This would allow some
matches to be consolidated and would improve detection of poorly formed HTML
(for example, mixed-case tags) that most browsers render regardless of case.

For example:
      <match value="&lt;BODY" type="string" offset="0"/>
      <match value="&lt;body" type="string" offset="0"/>

could become:
      <match value="&lt;BODY" type="stringignorecase" offset="0"/>
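
To illustrate the intended difference between the two match types, here is a
minimal Python sketch. It is not Tika's implementation: magic_matches and its
parameters are hypothetical names, and the sketch assumes ASCII patterns
compared against the bytes at a fixed offset.

```python
def magic_matches(data: bytes, value: str, offset: int = 0,
                  ignore_case: bool = False) -> bool:
    """Hypothetical helper: check whether `value` occurs in `data` at `offset`.

    ignore_case=False models type="string";
    ignore_case=True models type="stringignorecase" (ASCII case folding only).
    """
    pattern = value.encode("ascii")
    window = data[offset:offset + len(pattern)]
    if len(window) < len(pattern):
        # Not enough bytes at this offset to hold the pattern.
        return False
    if ignore_case:
        return window.lower() == pattern.lower()
    return window == pattern


# Mixed-case markup is missed by both case-sensitive entries...
sample = b"<Body class='main'>"
missed = not (magic_matches(sample, "<BODY") or magic_matches(sample, "<body"))

# ...but is caught by the single case-insensitive entry.
caught = magic_matches(sample, "<BODY", ignore_case=True)
```

Under these assumptions, one case-insensitive entry covers "<BODY", "<body",
and any mixed-case variant, which the two case-sensitive entries do not.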




--
This message was sent by Atlassian JIRA
(v6.2#6252)
