On Fri, Jan 2, 2015 at 11:50 AM, Jukka Zitting <[email protected]>
wrote:

> Hi,
>
> 2015-01-02 11:27 GMT-05:00 Grant Ingersoll <[email protected]>:
> > I'm prototyping a new parser (to be donated) for a file type that already
> > has a parser.  This parser will only be applicable for certain sub types of
> > that file type.  How is this best handled in an auto-detection scenario?
> > Are there hints we can give the MIME detector?
>
> I see two ways to handle this:
>
> 1. The "do the right thing" approach: Tika knows how to handle media
> type hierarchies and optional type parameters when matching the
> detected media type to the appropriate parser. So you could either
> define an extra media type and mark it as a subtype of the more
> generic type (like application/java-archive is to application/zip) or
> add extra type parameters to add more detailed type information (like
> text/plain;charset=utf-8 is to text/plain). You can then define your
> new parser to only accept files of that specific subtype or parameter.
> Once type detection can correctly detect such files, your parser will
> automatically be used to parse them.
>
>
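
To make sure I'm reading option 1 correctly, here is roughly how I picture
the lookup. This is a stand-in sketch and not Tika's code: the class, its
methods, and the parser names are made up for illustration, and only the
zip/java-archive pair comes from the example above.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in sketch, not Tika code: pick a parser by walking from the detected
// media type up through its declared supertypes.
final class TypeHierarchySketch {
    private final Map<String, String> supertypes = new HashMap<>(); // subtype -> supertype
    private final Map<String, String> parsers = new HashMap<>();    // type -> parser name

    void addSupertype(String subtype, String supertype) {
        supertypes.put(subtype, supertype);
    }

    void addParser(String type, String parserName) {
        parsers.put(type, parserName);
    }

    // Returns the parser for the most specific type that has one, or null if none.
    String resolve(String detectedType) {
        String type = detectedType;
        // Bound the walk so a misconfigured cyclic hierarchy cannot loop forever.
        for (int hops = 0; type != null && hops <= supertypes.size(); hops++) {
            String parser = parsers.get(type);
            if (parser != null) {
                return parser;
            }
            type = supertypes.get(type);
        }
        return null;
    }

    public static void main(String[] args) {
        TypeHierarchySketch registry = new TypeHierarchySketch();
        registry.addSupertype("application/java-archive", "application/zip");
        registry.addParser("application/zip", "GenericZipParser");      // hypothetical name
        registry.addParser("application/java-archive", "JarOnlyParser"); // hypothetical name
        System.out.println(registry.resolve("application/java-archive"));
        System.out.println(registry.resolve("application/zip"));
    }
}
```

The point is that a subtype with its own parser wins, and a subtype without
one still falls through to the parser for the generic type.
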
I think the problem is that the file types in question can't be told apart
by anything other than the actual content, and inspecting that content is
an expensive operation.

For example purposes, let's say I had a parser that handled JPEGs that are
more than 50% blue with better accuracy than a general-purpose parser.  All
other indicators point to the file being just another JPEG.  So, ideally,
when I see more than 50% blue, I would choose the smarter parser.  The
challenge, of course, is that you have to do most of the work just to
determine whether the file meets the criteria.  That said, there may be
heuristics one could employ to determine whether the input is of the
desired form.
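
One shape such a heuristic could take is a bounded peek at the start of the
stream, followed by a rewind so that whichever parser is chosen still sees
the whole input.  The sketch below uses only java.io and is not Tika API:
the class name, the 4096-byte limit, and the "BLUE" marker are all
assumptions made up for illustration, and a real check would depend on the
format.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.Predicate;

// Hypothetical sketch: run a cheap check on a bounded prefix, then rewind.
final class PrefixHeuristicSketch {
    static final int PEEK_LIMIT = 4096; // assumed bound on how much we inspect

    // Reads at most PEEK_LIMIT bytes, rewinds the stream, then applies the heuristic.
    static boolean looksSpecialized(InputStream in, Predicate<byte[]> heuristic) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(PEEK_LIMIT);
            byte[] buffer = new byte[PEEK_LIMIT];
            int read = in.readNBytes(buffer, 0, buffer.length); // Java 9+
            in.reset(); // the chosen parser starts from the first byte again
            return heuristic.test(Arrays.copyOf(buffer, read));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Toy heuristic helper: does the prefix begin with the given ASCII marker?
    static boolean startsWith(byte[] prefix, String marker) {
        byte[] m = marker.getBytes(StandardCharsets.US_ASCII);
        if (prefix.length < m.length) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (prefix[i] != m[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] data = "BLUE:payload".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new ByteArrayInputStream(data);
        boolean special = looksSpecialized(in, prefix -> startsWith(prefix, "BLUE"));
        System.out.println(special ? "specialized parser" : "general parser");
    }
}
```

This only helps when the distinguishing signal shows up early in the file;
for something like the blue-pixel case it would not avoid the decode.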

I'll poke around here a bit and see if anything stands out.


> 2. The "worse is better" option: The above option requires you to
> define a new subtype or a parameter and to modify the type detection
> mechanism to correctly detect such files. To avoid the extra work, you
> could simply mark your new parser as being able to handle all files of
> the more generic type, and then in your parser include a fallback
> option to call the original Tika parser when encountering a file the
> new parser can't handle.
>
> BR,
>
> Jukka Zitting
>
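
For my own notes, option 2 could look something like the sketch below.  It
deliberately avoids Tika's real interfaces: SimpleParser, the exception
type, and withFallback are hypothetical stand-ins, and the toy parsers in
main exist only to exercise the delegation.

```java
// Hypothetical sketch of the fallback idea: try the specialized parser,
// and hand the same content to the general one when it declines.
final class FallbackParserSketch {
    // Stand-in for a parser: takes the full content, returns extracted text.
    interface SimpleParser {
        String parse(byte[] content);
    }

    // Thrown by the specialized parser when the input is not a sub type it handles.
    static final class UnsupportedSubtypeException extends RuntimeException {
        UnsupportedSubtypeException(String message) {
            super(message);
        }
    }

    // Wraps both parsers; only a "not my sub type" signal triggers the fallback.
    static SimpleParser withFallback(SimpleParser specialized, SimpleParser general) {
        return content -> {
            try {
                return specialized.parse(content);
            } catch (UnsupportedSubtypeException e) {
                return general.parse(content);
            }
        };
    }

    public static void main(String[] args) {
        SimpleParser general = content -> "general:" + content.length;
        SimpleParser specialized = content -> {
            if (content.length == 0 || content[0] != 'B') {
                throw new UnsupportedSubtypeException("not the handled sub type");
            }
            return "specialized:" + content.length;
        };
        SimpleParser parser = withFallback(specialized, general);
        System.out.println(parser.parse(new byte[] {'B', 1, 2}));
        System.out.println(parser.parse(new byte[] {'X', 1, 2}));
    }
}
```

The sketch passes a byte array so both parsers see the same content; with a
stream, the input would need to be buffered or rewound before the fallback
call.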
