[
https://issues.apache.org/jira/browse/TIKA-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312369#comment-14312369
]
Nick Burch commented on TIKA-1509:
----------------------------------
Two things now spring to mind as possible problems, both about the
ContentHandler.

If we're going for the "try until one works" approach, and a parser gets
partway then exceptions out, resetting the Metadata shouldn't be too tricky, if
desired. However, what happens if the parser has output some text to the
ContentHandler? Should we try somehow to reset the ContentHandler then restart?
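
One way to sidestep the reset question is to never let a failing parser touch the real ContentHandler at all. The following is only a minimal sketch of that idea, using plain SAX classes from the JDK rather than anything in Tika. `Attempt`, `RecordingHandler` and `FallbackSketch` are hypothetical names, and only `characters` events are recorded to keep it short; a real version would need to record element events too, and it pays for the safety with memory.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for one parser run; this is not a Tika interface. */
interface Attempt {
    void run(ContentHandler handler) throws Exception;
}

/** Records text events so they can be replayed later or simply thrown away. */
final class RecordingHandler extends DefaultHandler {
    private final List<String> chunks = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        chunks.add(new String(ch, start, length));
    }

    /** Forwards everything recorded so far to the given handler. */
    void replayTo(ContentHandler target) throws SAXException {
        for (String chunk : chunks) {
            char[] c = chunk.toCharArray();
            target.characters(c, 0, c.length);
        }
    }
}

final class FallbackSketch {
    /**
     * Runs the attempts in order, each against a fresh buffer. Only the first
     * attempt that completes without an exception is replayed to the real
     * handler. Returns the index of that attempt, or -1 if all of them failed.
     */
    static int firstSuccessful(List<Attempt> attempts, ContentHandler real)
            throws SAXException {
        for (int i = 0; i < attempts.size(); i++) {
            RecordingHandler buffer = new RecordingHandler();
            try {
                attempts.get(i).run(buffer);
            } catch (Exception e) {
                continue; // partial output stays in the discarded buffer
            }
            buffer.replayTo(real);
            return i;
        }
        return -1;
    }
}
```

With this shape, text from a parser that gets partway and then fails never reaches the caller's handler, so there is nothing to reset. The Metadata object would need similar per-attempt treatment, which is not shown here.
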
If we're going for the "try all of them for maximum fidelity" approach, then
having parsers append keys and values to the Metadata object is probably fine.
However, what happens ContentHandler-wise when one parser has finished, then
the second wants to add some more information to the {{<head>}} block?
Appending more text to the body should be fine, provided we wrap the "end
document" call to prevent it going through after the same parser, but what
about things for the header? Buffer the whole thing? Prevent later parsers
getting at the header? Treat all parsers like we would embedded, and put their
header and body into a special set of tags in the body?
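
To make the "wrap the end document call" idea concrete, here is a rough sketch built on the JDK's `XMLFilterImpl`, not on any existing Tika class. `SharedDocumentHandler` and its `finish()` method are hypothetical names. It only deals with the document boundaries: it does nothing about a later parser trying to write into the header, which is the harder question raised above.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/**
 * Lets several parsers write into one downstream document. The first
 * startDocument is forwarded, later ones are dropped, and every endDocument
 * from a parser is swallowed until the caller invokes finish().
 */
final class SharedDocumentHandler extends XMLFilterImpl {
    private boolean started = false;

    SharedDocumentHandler(ContentHandler target) {
        setContentHandler(target);
    }

    @Override
    public void startDocument() throws SAXException {
        if (!started) {
            started = true;
            super.startDocument();
        }
    }

    @Override
    public void endDocument() {
        // Deliberately ignored: a parser finishing must not close the document.
    }

    /** Called once by the composite after the last parser has run. */
    void finish() throws SAXException {
        if (started) {
            super.endDocument();
        }
    }
}
```

All other events pass straight through, so body text from each parser is appended in the order the parsers run.
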
Buffering and merging would potentially mean lots of memory used, and might not
be that simple to do. Putting each parser in their own divs in the body means
that you'll get quite different html from the single parser and composite
parser cases. Only allowing the first parser to output the header seems like it
won't work for many use cases. Saying "only the first parser can output
content" will probably fail for even more use cases.
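
For comparison, the "each parser in its own div" option could look roughly like the sketch below, again with plain SAX classes only. `ParserSections`, `BodyWriter` and the `parser-output` class value are made up for illustration and are not what Tika emits for embedded documents.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

final class ParserSections {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Hypothetical callback that writes one parser's body content. */
    interface BodyWriter {
        void write(ContentHandler out) throws SAXException;
    }

    /** Wraps one parser's output in <div class="parser-output NAME">...</div>. */
    static void writeSection(ContentHandler out, String parserName, BodyWriter body)
            throws SAXException {
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "class", "class", "CDATA", "parser-output " + parserName);
        out.startElement(XHTML, "div", "div", attrs);
        body.write(out);
        out.endElement(XHTML, "div", "div");
    }
}
```

The extra wrapper element is exactly why the composite output would differ from what a single parser produces for the same file.
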
> Create configurable strategies for composite parsers
> ----------------------------------------------------
>
> Key: TIKA-1509
> URL: https://issues.apache.org/jira/browse/TIKA-1509
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
>
> Several parsers can handle the same mime type, and we are currently ordering
> which parser is chosen (roughly) by the alphabetic order of the parser class
> name.
> Let's allow users to configure strategies for picking parsers.
> ***NOTE: this description is just a placeholder, will edit later.