[
https://issues.apache.org/jira/browse/TIKA-458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886110#action_12886110
]
Jukka Zitting commented on TIKA-458:
------------------------------------
The reason why I originally didn't do this was to avoid making it a
backwards-compatibility requirement that the HTML parser uses a SAX content
handler to internally process HTML documents. This assumption may no longer
hold if we decide to use libraries like boilerpipe (see TIKA-420) as the
default HTML parsing mechanism.
That said, I guess in this case the benefits probably outweigh the possible
drawbacks of increased backwards-compatibility requirements on the HTML parser
design.
About the patch itself, the proposed way of using HTMLHandler is a bit
troublesome: the only way for a custom HTMLHandler to access the output
ContentHandler, the Metadata instance and the parse context is for the client
application to pass them in to the custom HTMLHandler instance. This won't
work correctly, for example, when working with composite documents like Zip
archives. A better solution might be to introduce a factory interface like this:
    public interface HTMLHandlerFactory {
        ContentHandler createHTMLHandler(
                ContentHandler handler, Metadata metadata, ParseContext context);
    }
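To make the idea concrete, here is a rough sketch of how such a factory could
be looked up from the parse context and called once per parsed document, so
that each document gets a handler bound to its own output handler and metadata.
This is only an illustration under assumptions, not the patch: the nested
Metadata and ParseContext classes are minimal stand-ins for the Tika types so
that the example compiles on its own, and CountingHandler, handlerFor,
parseDocument and the "element-count" key are hypothetical names.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

import java.util.HashMap;
import java.util.Map;

class HandlerFactorySketch {

    // Stand-in for Tika's Metadata class (assumption, not the real class).
    static class Metadata {
        private final Map<String, String> values = new HashMap<>();
        void set(String name, String value) { values.put(name, value); }
        String get(String name) { return values.get(name); }
    }

    // Stand-in for Tika's ParseContext class (assumption, not the real class).
    static class ParseContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();
        <T> void set(Class<T> key, T value) { entries.put(key, value); }
        <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
    }

    // The factory interface proposed above.
    interface HTMLHandlerFactory {
        ContentHandler createHTMLHandler(
                ContentHandler handler, Metadata metadata, ParseContext context);
    }

    // Hypothetical custom handler: forwards element starts to the output
    // handler and stores the element count in the current document's metadata.
    static class CountingHandler extends DefaultHandler {
        private final ContentHandler output;
        private final Metadata metadata;
        private int elements;

        CountingHandler(ContentHandler output, Metadata metadata) {
            this.output = output;
            this.metadata = metadata;
        }

        @Override
        public void startElement(
                String uri, String local, String qName, Attributes atts)
                throws SAXException {
            elements++;
            output.startElement(uri, local, qName, atts);
        }

        @Override
        public void endDocument() throws SAXException {
            metadata.set("element-count", Integer.toString(elements));
            output.endDocument();
        }
    }

    // Hypothetical parser-side lookup: use the factory from the context if
    // one was set, otherwise fall back to the plain output handler.
    static ContentHandler handlerFor(
            ContentHandler output, Metadata metadata, ParseContext context) {
        HTMLHandlerFactory factory = context.get(HTMLHandlerFactory.class);
        return factory != null
                ? factory.createHTMLHandler(output, metadata, context)
                : output;
    }

    // Simulates parsing one document: a fresh Metadata and a fresh handler
    // are created per call, as would happen for each entry of an archive.
    static String parseDocument(ParseContext context, String... elementNames) {
        Metadata metadata = new Metadata();
        ContentHandler handler =
                handlerFor(new DefaultHandler(), metadata, context);
        try {
            handler.startDocument();
            for (String name : elementNames) {
                handler.startElement("", name, name, new AttributesImpl());
                handler.endElement("", name, name);
            }
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return metadata.get("element-count");
    }

    public static void main(String[] args) {
        ParseContext context = new ParseContext();
        context.set(HTMLHandlerFactory.class,
                (handler, metadata, ctx) -> new CountingHandler(handler, metadata));
        System.out.println(parseDocument(context, "html", "body", "p"));
        System.out.println(parseDocument(context, "html", "body"));
    }
}
```

The client application only registers the factory. The handler instances are
created where the output handler and metadata of each document are known, which
a single pre-built HTMLHandler instance cannot offer.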
PS. The patch seems to contain a few unrelated changes to the HTML parser. Can
you file separate issues for those changes?
PPS. It would be better if we used only spaces for indentation.
> Specify HTMLHandler via Context
> -------------------------------
>
> Key: TIKA-458
> URL: https://issues.apache.org/jira/browse/TIKA-458
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Affects Versions: 0.7
> Reporter: Julien Nioche
> Attachments: TIKA-458.patch
>
>
> One of the recent changes on Tika is the possibility to specify a custom
> HTMLMapper via the Context - which I think is an elegant mechanism. I was
> wondering whether there would be a reason NOT to be able to do the same for
> the HTMLHandler and if nothing is passed via the Context, rely on the current
> implementation. This would give more control to the user on what to do with
> the SAX events while at the same time preserving the functionality by default.
--
This message is automatically generated by JIRA.