[ 
https://issues.apache.org/jira/browse/TIKA-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312369#comment-14312369
 ] 

Nick Burch commented on TIKA-1509:
----------------------------------

Two things now spring to mind as possible problems, both about the 
ContentHandler.

If we're going for the "try until one works" approach, and a parser gets 
partway and then throws an exception, resetting the Metadata shouldn't be too 
tricky, if desired. However, what happens if the parser has already output some 
text to the ContentHandler? Should we somehow try to reset the ContentHandler 
and then restart?
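
A rough sketch of one way the "reset" could work, purely as an illustration: 
give each attempt its own buffer and only replay it into the real 
ContentHandler once the attempt has finished cleanly. ParseAttempt, 
RecordingHandler and FirstSuccessfulAttempt are hypothetical names, not Tika 
classes, and only characters() events are recorded here; a real version would 
need to record elements and attributes too.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a parser: writes SAX events or throws. */
interface ParseAttempt {
    void run(ContentHandler handler) throws Exception;
}

/** Records text events so a failed attempt can simply be thrown away. */
class RecordingHandler extends DefaultHandler {
    private final List<String> chunks = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the text, since SAX lets the caller reuse the array.
        chunks.add(new String(ch, start, length));
    }

    /** Replays the recorded text into the real handler. */
    void replayInto(ContentHandler target) throws SAXException {
        for (String chunk : chunks) {
            char[] chars = chunk.toCharArray();
            target.characters(chars, 0, chars.length);
        }
    }
}

class FirstSuccessfulAttempt {
    /** Returns the index of the first attempt that completed, or -1 if none did. */
    static int run(List<ParseAttempt> attempts, ContentHandler target)
            throws SAXException {
        for (int i = 0; i < attempts.size(); i++) {
            // A fresh buffer per attempt is the "reset".
            RecordingHandler buffer = new RecordingHandler();
            try {
                attempts.get(i).run(buffer);
            } catch (Exception e) {
                continue; // partial output is dropped along with the buffer
            }
            buffer.replayInto(target);
            return i;
        }
        return -1;
    }
}
```

The obvious cost is that nothing reaches the real handler until a parser has 
finished, and the whole output of each attempt sits in memory in the meantime.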

If we're going for the "try all of them for maximum fidelity" approach, then 
having parsers append keys and values to the Metadata object is probably fine. 
However, what happens on the ContentHandler side when one parser has finished 
and a second one wants to add more information to the {{<head>}} block? 
Appending more text to the body should be fine, provided we wrap the "end 
document" call so that it doesn't go through as each individual parser 
finishes, but what about things for the header? Buffer the whole thing? Prevent 
later parsers getting at the header? Treat all parsers as we would embedded 
documents, and put their header and body into a special set of tags in the 
body? 
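
To illustrate just the "wrap the end document call" part, here is a minimal 
sketch using the JDK's XMLFilterImpl as a forwarding handler. ParserStep, 
DocumentBoundarySuppressor and CompositeRun are hypothetical names made up for 
this example, not existing Tika classes.

```java
import java.util.List;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical stand-in for a parser writing to a shared handler. */
interface ParserStep {
    void run(ContentHandler handler) throws Exception;
}

/** Forwards every event except the document start and end. */
class DocumentBoundarySuppressor extends XMLFilterImpl {
    DocumentBoundarySuppressor(ContentHandler target) {
        // XMLFilterImpl passes events on to whatever handler is set here.
        setContentHandler(target);
    }

    @Override
    public void startDocument() {
        // Swallowed: the composite opens the document once itself.
    }

    @Override
    public void endDocument() {
        // Swallowed: the composite closes the document once itself.
    }
}

class CompositeRun {
    /** Runs every step against one handler, with a single start and end. */
    static void run(List<ParserStep> steps, ContentHandler target)
            throws Exception {
        target.startDocument();
        ContentHandler shared = new DocumentBoundarySuppressor(target);
        for (ParserStep step : steps) {
            step.run(shared);
        }
        target.endDocument();
    }
}
```

This only deals with the document boundaries. Anything a later parser writes 
for the header still arrives after the first parser's body events, so the 
header question is left open.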

Buffering and merging would potentially mean lots of memory used, and might not 
be that simple to do. Putting each parser's output in its own div in the body 
means that you'll get quite different HTML from the single parser and composite 
parser cases. Only allowing the first parser to output the header seems like it 
won't work for many use cases. Saying "only the first parser can output 
content" will probably fail for even more use cases.

> Create configurable strategies for composite parsers
> ----------------------------------------------------
>
>                 Key: TIKA-1509
>                 URL: https://issues.apache.org/jira/browse/TIKA-1509
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>
> Several parsers can handle the same mime type, and we are currently ordering 
> which parser is chosen (roughly) by the alphabetic order of the parser class 
> name.
> Let's allow users to configure strategies for picking parsers.
> ***NOTE: this description is just a placeholder, will edit later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
