[ https://issues.apache.org/jira/browse/TIKA-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340662#comment-14340662 ]
Tyler Palsulich commented on TIKA-1509:
---------------------------------------
Just to reiterate the above and be clear about the issues we're running into
with this, here is a list. Please correct/update if I'm misunderstanding or
leaving something out.
# Multiple Parsers may support any given file, so users should be able to
provide a strategy for which Parser is used or how Parser results are merged.
# The default behavior when multiple Parsers support a file will be one of:
## Pick an initial Parser with _some strategy_. If it fails, keep trying
additional Parsers.
## Run all Parsers and merge results.
# If you're trying multiple Parsers, how should you merge the Metadata?
# If you're trying multiple Parsers, how should you merge the content passed
to the ContentHandler? A ContentHandler is fed information from the Parser
while the Parser consumes the input stream. Possible answers:
## Give ContentHandlers a reset() method that drops all previously passed
content.
## Make users pass in a ContentHandlerFactory, so the Parsers can create a new
ContentHandler when they start parsing. This is essentially a reset in the form
of creating a new ContentHandler.
# How do you reset the given InputStream when starting a new parse?
# How do container-aware Parsers factor into this?
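
To make the "pick a Parser, fall back on failure" option concrete, here is a
minimal sketch in plain Java. It is only an illustration under stated
assumptions, not a proposed implementation: SketchParser, parseWithFallback,
and the StringBuilder "handler" are hypothetical stand-ins, and none of this
uses the real Tika Parser, Metadata, or ContentHandler types. It shows three of
the points above together: the stream is rewound with mark()/reset() before
each attempt, the handler comes from a factory so partial output from a failed
Parser is discarded, and metadata is collected per attempt and only copied out
on success.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

public class FallbackParsingSketch {

    /** Hypothetical stand-in for a parser; NOT the real Tika Parser interface. */
    public interface SketchParser {
        void parse(InputStream in, StringBuilder handler, Map<String, String> metadata)
                throws IOException;
    }

    /**
     * Tries each parser in order. Every attempt gets a rewound stream, a fresh
     * handler from the factory, and a fresh metadata map, so nothing leaks from
     * a failed attempt. maxBytes is an assumed upper bound on how much a parser
     * may read before the mark becomes invalid on a buffered stream.
     */
    public static String parseWithFallback(List<SketchParser> parsers,
                                           InputStream raw,
                                           Supplier<StringBuilder> handlerFactory,
                                           Map<String, String> metadataOut,
                                           int maxBytes) throws IOException {
        // Wrap streams that cannot be rewound on their own.
        InputStream in = raw.markSupported() ? raw : new BufferedInputStream(raw);
        in.mark(maxBytes);
        IOException last = null;
        for (SketchParser parser : parsers) {
            in.reset();                                   // rewind to the mark
            StringBuilder handler = handlerFactory.get(); // "reset" via a new handler
            Map<String, String> attempt = new LinkedHashMap<>();
            try {
                parser.parse(in, handler, attempt);
                metadataOut.putAll(attempt);              // keep metadata only on success
                return handler.toString();
            } catch (IOException e) {
                last = e;                                 // remember and try the next parser
            }
        }
        throw new IOException("All parsers failed", last);
    }

    /** Small demo: the first parser fails midway, the second succeeds. */
    public static String demo() throws IOException {
        SketchParser failing = (in, handler, metadata) -> {
            in.read();                                    // consume part of the stream
            handler.append("partial ");                   // output that must be dropped
            metadata.put("parser", "failing");
            throw new IOException("simulated failure");
        };
        SketchParser working = (in, handler, metadata) -> {
            handler.append(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            metadata.put("parser", "working");
        };
        Map<String, String> metadata = new LinkedHashMap<>();
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        String text = parseWithFallback(List.of(failing, working), in,
                StringBuilder::new, metadata, 1024);
        return text + "|" + metadata.get("parser");
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());
    }
}
```

The sketch sidesteps the merge questions entirely by keeping only the first
successful attempt. A "run all Parsers and merge" strategy would still need an
answer for how to combine the per-attempt metadata maps and handler output.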
> Create configurable strategies for composite parsers
> ----------------------------------------------------
>
> Key: TIKA-1509
> URL: https://issues.apache.org/jira/browse/TIKA-1509
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
>
> Several parsers can handle the same mime type, and we are currently ordering
> which parser is chosen (roughly) by the alphabetic order of the parser class
> name.
> Let's allow users to configure strategies for picking parsers.
> See and contribute to full discussion here:
> http://wiki.apache.org/tika/CompositeParserDiscussion