Hi,

Both of Jukka's options look good to me. Another option is to modify the
existing Parser -- extract the extra information when possible, stick with
current behavior if not.
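
To make that third option concrete, here is a minimal sketch in plain Java.
It deliberately does not use Tika's real Parser API: SimpleParser,
GenericParser, ExtendedParser and the "SUB1" marker are all hypothetical
names invented for illustration. The idea is only that the existing parser
keeps producing what it does today, and the extra information is added when
the subtype is recognised.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a parser; this is NOT Tika's Parser interface.
interface SimpleParser {
    Map<String, String> parse(byte[] data);
}

// Models the current behavior for the generic file type.
class GenericParser implements SimpleParser {
    @Override
    public Map<String, String> parse(byte[] data) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("size", Integer.toString(data.length));
        return metadata;
    }
}

// Models the modified parser: same output as before, plus extra
// information when the (made-up) subtype marker is present.
class ExtendedParser implements SimpleParser {
    // Assumed marker for the subtype; a real format would have its own signature.
    private static final byte[] MAGIC = "SUB1".getBytes(StandardCharsets.US_ASCII);
    private final SimpleParser base = new GenericParser();

    static boolean isSubtype(byte[] data) {
        if (data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    @Override
    public Map<String, String> parse(byte[] data) {
        // Always start from the existing behavior.
        Map<String, String> metadata = base.parse(data);
        // Add the extra information only when the subtype is recognised.
        if (isSubtype(data)) {
            metadata.put("subtype", "SUB1");
        }
        return metadata;
    }
}

class FallbackDemo {
    public static void main(String[] args) {
        SimpleParser parser = new ExtendedParser();
        System.out.println(parser.parse("SUB1 payload".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(parser.parse("plain payload".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

The same delegation shape would probably also fit the fallback in Jukka's
second option quoted below, with the roles reversed: the new parser is the
entry point and hands off to the original one.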

We ran into the same problem with images when trying to run OCR and extract
metadata at the same time. Please see TIKA-1445 (
https://issues.apache.org/jira/browse/TIKA-1445) for a more involved
solution.

Best wishes,
Tyler

On Fri, Jan 2, 2015 at 11:50 AM, Jukka Zitting <[email protected]>
wrote:

> Hi,
>
> 2015-01-02 11:27 GMT-05:00 Grant Ingersoll <[email protected]>:
> > I'm prototyping a new parser (to be donated) for a file type that already
> > has a parser.  This parser will only be applicable for certain sub types
> > of that file type.  How is this best handled in an auto-detection
> > scenario?  Are there hints we can give the MIME detector?
>
> I see two ways to handle this:
>
> 1. The "do the right thing" approach: Tika knows how to handle media
> type hierarchies and optional type parameters when matching the
> detected media type to the appropriate parser. So you could either
> define an extra media type and mark it as a subtype of the more
> generic type (like application/java-archive is to application/zip) or
> add extra type parameters to add more detailed type information (like
> text/plain;charset=utf-8 is to text/plain). You can then define your
> new parser to only accept files of that specific subtype or parameter.
> Once type detection can correctly detect such files, your parser will
> automatically be used to parse them.
>
> 2. The "worse is better" option: The above option requires you to
> define a new subtype or a parameter and to modify the type detection
> mechanism to correctly detect such files. To avoid the extra work, you
> could simply mark your new parser as being able to handle all files of
> the more generic type, and then in your parser include a fallback
> option to call the original Tika parser when encountering a file the
> new parser can't handle.
>
> BR,
>
> Jukka Zitting
>
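
For reference, the matching described in Jukka's first option can be modelled
roughly as below. This is a simplified sketch in plain Java and not Tika's
actual implementation: ParserRegistry, its methods and the parser names are
hypothetical. It assumes the supertype relation is acyclic, and that a type
with parameters falls back to the same type without them.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of subtype-aware parser selection; NOT a Tika class.
class ParserRegistry {
    private final Map<String, String> supertypes = new HashMap<>();
    private final Map<String, String> parsers = new HashMap<>();

    // Declare that 'type' is a specialisation of 'parent'.
    void addSupertype(String type, String parent) {
        supertypes.put(type, parent);
    }

    // Register a parser (identified by name only, for brevity) for a type.
    void register(String type, String parserName) {
        parsers.put(type, parserName);
    }

    // Pick the parser registered for the most specific matching type.
    Optional<String> resolve(String detectedType) {
        String type = detectedType;
        while (type != null) {
            String parser = parsers.get(type);
            if (parser != null) {
                return Optional.of(parser);
            }
            int semicolon = type.indexOf(';');
            if (semicolon >= 0) {
                // Drop the parameters first: text/plain;charset=utf-8 -> text/plain
                type = type.substring(0, semicolon).trim();
            } else {
                // Then walk up the declared hierarchy (null ends the loop).
                type = supertypes.get(type);
            }
        }
        return Optional.empty();
    }
}

class RegistryDemo {
    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        registry.addSupertype("application/java-archive", "application/zip");
        registry.register("application/zip", "ZipParser");
        registry.register("text/plain", "TextParser");

        // No specific parser yet, so the generic one is chosen.
        System.out.println(registry.resolve("application/java-archive"));

        // Once a parser for the subtype exists, it takes over automatically.
        registry.register("application/java-archive", "JarParser");
        System.out.println(registry.resolve("application/java-archive"));
        System.out.println(registry.resolve("text/plain;charset=utf-8"));
    }
}
```

The point of the sketch is that registering a parser for the narrower type is
enough; files of the generic type keep going to the existing parser.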
