[ https://issues.apache.org/jira/browse/TIKA-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312860#comment-14312860 ]
Chris A. Mattmann commented on TIKA-1509:
-----------------------------------------
Great work as a start Nick, and good discussion Tim. Some comments from me:
bq. One additional thing to consider is that CompositeParser will walk its way up
the type hierarchy until it finds a parser for the type. If someone has two
parsers for Microsoft Excel .xls, and one parser for x-tika-msoffice (the ole2
container that .xls sits in), should they be able to say that all parsers for
parent types also be tried? Or would it just be "go up the type hierarchy until
you find at least one parser, then run all parsers at that level based on the
strategy"?
I think we need to know which Parsers are container-aware, which could help us
here. But if we had a reset method, there is no reason, even if there is a
container, that we shouldn't be able to call its parser along with any other
MIME-matching parsers.
bq. If we're going for the "try until one works" approach, and a parser gets
partway then throws an exception, resetting the Metadata shouldn't be too tricky, if
desired. However, what happens if the parser has output some text to the
ContentHandler? Should we try somehow to reset the ContentHandler then restart?
What about simply creating a BufferedContentHandler that wraps all incoming
ContentHandlers and has the ability to reset()? Similar to Tim's approach. This
would then decorate the incoming handler and take care of the streaming ones.
Maybe some code here would help.
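To make the idea concrete, here is a rough sketch of such a wrapper. It is only
an illustration under assumptions: BufferedContentHandler as written here is a
hypothetical class, not something that exists in Tika, and the reset() and
flush() method names are made up. It records SAX events in memory, drops them
on reset(), and replays them to the wrapped handler on flush(). Only the core
element and text events are buffered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical sketch, not an existing Tika class: buffers SAX events so that
 * the output of a failed parse attempt can be discarded with reset().
 */
class BufferedContentHandler extends DefaultHandler {

    /** One recorded SAX event that can be replayed later. */
    private interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final ContentHandler delegate;
    private final List<Event> events = new ArrayList<>();

    BufferedContentHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startDocument() {
        events.add(ContentHandler::startDocument);
    }

    @Override
    public void endDocument() {
        events.add(ContentHandler::endDocument);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // SAX parsers may reuse the Attributes object, so keep a copy.
        Attributes copy = new AttributesImpl(atts);
        events.add(t -> t.startElement(uri, localName, qName, copy));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add(t -> t.endElement(uri, localName, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The char[] is also reused by the caller, so copy the relevant slice.
        char[] copy = Arrays.copyOfRange(ch, start, start + length);
        events.add(t -> t.characters(copy, 0, copy.length));
    }

    /** Discards everything buffered so far, e.g. after a parser failed. */
    void reset() {
        events.clear();
    }

    /** Replays the buffered events to the wrapped handler, then clears them. */
    void flush() throws SAXException {
        for (Event e : events) {
            e.replay(delegate);
        }
        events.clear();
    }
}
```

The obvious cost is memory: nothing reaches a streaming handler until flush()
is called, so a very large document is held as buffered events in the meantime.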
bq. If we're going for the "try all of them for maximum fidelity" approach,
then having parsers append keys and values to the Metadata object is probably
fine. However, what happens ContentHandler-wise when one parser has finished,
then the second wants to add some more information to the <head> block?
Appending more text to the body should be fine, provided we wrap the "end
document" call to prevent it going through after the same parser, but what
about things for the header? Buffer the whole thing? Prevent later parsers
getting at the header? Treat all parsers like we would embedded, and put their
header and body into a special set of tags in the body?
Good question. Maybe create intermediate outputs, and then merge them together
when we're done? Need to think about this.
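As a starting point for that thinking, one possible shape is sketched below.
Everything in it is an assumption for illustration: ParserOutput and
OutputMerger are hypothetical names, not Tika classes, and the merge rules
(union metadata values per key, keep each parser's body in its own labelled
section) are just one option among those raised above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical holder for what a single parser produced; not a Tika class. */
class ParserOutput {
    final String parserName;
    final Map<String, List<String>> metadata = new LinkedHashMap<>();
    final StringBuilder body = new StringBuilder();

    ParserOutput(String parserName) {
        this.parserName = parserName;
    }

    void addMetadata(String key, String value) {
        metadata.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }
}

/** Hypothetical merge step that runs once all parsers have finished. */
class OutputMerger {

    /** Unions the values per key in first-seen order, dropping exact duplicates. */
    static Map<String, Set<String>> mergeMetadata(List<ParserOutput> outputs) {
        Map<String, Set<String>> merged = new LinkedHashMap<>();
        for (ParserOutput o : outputs) {
            for (Map.Entry<String, List<String>> e : o.metadata.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new LinkedHashSet<>())
                      .addAll(e.getValue());
            }
        }
        return merged;
    }

    /** Keeps each parser's body text in its own section, labelled by parser name. */
    static String mergeBodies(List<ParserOutput> outputs) {
        StringBuilder sb = new StringBuilder();
        for (ParserOutput o : outputs) {
            sb.append('[').append(o.parserName).append("]\n")
              .append(o.body).append('\n');
        }
        return sb.toString();
    }
}
```

Because the header is only built after every parser has finished, this would
sidestep the question of a later parser writing to an already-closed <head>,
at the price of buffering all intermediate outputs.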
> Create configurable strategies for composite parsers
> ----------------------------------------------------
>
> Key: TIKA-1509
> URL: https://issues.apache.org/jira/browse/TIKA-1509
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
>
> Several parsers can handle the same mime type, and we are currently ordering
> which parser is chosen (roughly) by the alphabetic order of the parser class
> name.
> Let's allow users to configure strategies for picking parsers.
> See and contribute to full discussion here:
> http://wiki.apache.org/tika/CompositeParserDiscussion