Parsers and non-canonical mimetypes
-----------------------------------

                 Key: TIKA-620
                 URL: https://issues.apache.org/jira/browse/TIKA-620
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 0.9
            Reporter: Nick Burch


As discovered though TIKA-555, we weren't correctly handling the case of non 
canonical mimetypes properly

There are three bits to this:
 * The default parser as created by TikaConfig.getDefaultConfig() needs the 
full mime registry passing in automatically, so it can walk the mime tree. 
Fixed in r1084801
 * The Composite Parser needs to handle child parsers that declare they support 
an alias rather than the canonical mimetype. Initial fix in r1084798, Jukka has 
an idea for a cleaner way
 * When Composite Parser looks for a parser for a mime type, it'll need to 
canonicalise this before checking. (this isn't a problem for the Auto Detect 
parser, but can be for others)

We also probably need a couple more unit tests for this, as TIKA-555 had broken 
auto detect parsing of bmp (though not direct ImageParser), though none of our 
tests noticed...

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to