[ 
https://issues.apache.org/jira/browse/TIKA-458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886110#action_12886110
 ] 

Jukka Zitting commented on TIKA-458:
------------------------------------

The reason why I originally didn't do this was to avoid making it a 
backwards-compatibility requirement that the HTML parser uses a SAX content 
handler to internally process HTML documents. This assumption may no longer 
hold if we decide to use libraries like boilerpipe (see TIKA-420) as the 
default HTML parsing mechanism.

That said, I guess in this case the benefits probably outweigh the possible 
drawbacks of increased backwards-compatibility requirements on the HTML parser 
design.

About the patch itself, the proposed way of using HTMLHandler is a bit 
troublesome, as the only way for a custom HTMLHandler to access the output 
ContentHandler, the Metadata instance and the parse context is if they've been 
passed in to the custom HTMLHandler instance by the client application. This 
won't work correctly, for example, when working with composite documents like 
Zip archives. A better solution might be to introduce a factory interface like 
this:

    public interface HTMLHandlerFactory {
        ContentHandler createHTMLHandler(
            ContentHandler handler, Metadata metadata, ParseContext context);
    }
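
To illustrate, here is a rough sketch of how the parser could look up such a 
factory from the parse context and fall back to a built-in handler when none is 
set. It is only a sketch under assumptions: Metadata and ParseContext below are 
minimal stand-ins so that the example compiles without Tika on the classpath, 
and DefaultHtmlHandler and resolveHandler are hypothetical names that are not 
part of the patch.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class HtmlHandlerFactorySketch {

    // Stand-in for Tika's Metadata class (assumption: only get/set are needed here).
    static class Metadata {
        private final Map<String, String> values = new HashMap<>();
        void set(String name, String value) { values.put(name, value); }
        String get(String name) { return values.get(name); }
    }

    // Stand-in for Tika's ParseContext: a typed lookup keyed by class.
    static class ParseContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();
        <T> void set(Class<T> key, T value) { entries.put(key, value); }
        <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
    }

    // The factory interface proposed in the comment.
    interface HTMLHandlerFactory {
        ContentHandler createHTMLHandler(
            ContentHandler handler, Metadata metadata, ParseContext context);
    }

    // Hypothetical default handler: forwards character data to the output handler.
    static class DefaultHtmlHandler extends DefaultHandler {
        final ContentHandler output;
        final Metadata metadata;

        DefaultHtmlHandler(ContentHandler output, Metadata metadata) {
            this.output = output;
            this.metadata = metadata;
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            output.characters(ch, start, length);
        }
    }

    // Hypothetical helper the HTML parser could call once per parsed document.
    // Because the parser supplies the arguments, every document (including
    // each entry of an archive) gets a handler wired to its own output
    // handler and Metadata instance.
    static ContentHandler resolveHandler(
            ContentHandler output, Metadata metadata, ParseContext context) {
        HTMLHandlerFactory factory = context.get(HTMLHandlerFactory.class);
        if (factory == null) {
            // Nothing registered in the context: keep the default behaviour.
            return new DefaultHtmlHandler(output, metadata);
        }
        return factory.createHTMLHandler(output, metadata, context);
    }
}
```

With something like this, the client application would only register a factory 
in the context, and the parser would pass the right output handler, metadata 
and context to it for each document it processes.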

PS. The patch seems to contain a few unrelated changes to the HTML parser. Can 
you file separate issues for those changes?

PPS. It would be better if we used only spaces for indentation.


> Specify HTMLHandler via Context
> -------------------------------
>
>                 Key: TIKA-458
>                 URL: https://issues.apache.org/jira/browse/TIKA-458
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Julien Nioche
>         Attachments: TIKA-458.patch
>
>
> One of the recent changes on Tika is the possibility to specify a custom 
> HTMLMapper via the Context - which I think is an elegant mechanism. I was 
> wondering whether there would be a reason NOT to be able to do the same for 
> the HTMLHandler and if nothing is passed via the Context, rely on the current 
> implementation. This would give more control to the user on what to do with 
> the SAX events while at the same time preserving the functionality by default.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
