[ 
https://issues.apache.org/jira/browse/TIKA-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572077#comment-17572077
 ] 

Björn Kautler commented on TIKA-1484:
-------------------------------------

I analysed the code a bit.
As far as I can see, the situation is as follows:
 - Boilerpipe is only used in 
{{tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-commons/src/main/java/org/apache/tika/sax/boilerpipe/BoilerpipeContentHandler.java}}
 - {{BoilerpipeContentHandler}} is only used in
 -- {{tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java}}
 -- {{tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java}}
 -- 
{{tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java}}
 -- 
{{tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java}}
 - {{tika-parser-html-commons}} consists solely of 
{{BoilerpipeContentHandler.java}}

So my suggestion would be to
 * move the test that tests {{BoilerpipeContentHandler}} to 
{{tika-parser-html-commons}}
 * remove the dependency from {{tika-parser-html-module}} to 
{{tika-parser-html-commons}}
 * add a dependency from {{tika-app}} to {{tika-parser-html-commons}} 
 * maybe even rename {{tika-parser-html-commons}} to 
{{tika-parser-boilerplate}} to also reduce the risk of it being used again in 
the future by the html module
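
As a rough sketch of the second and third bullet, the change would amount to moving one dependency block between POMs. The {{org.apache.tika}} group ID is the one Tika publishes under; the use of {{${project.version}}} and the exact shape of the block are assumptions, not copied from the current build files.

{code:xml}
<!-- Sketch only: add to tika-app/pom.xml so that TikaCLI and TikaGUI
     still find BoilerpipeContentHandler on their classpath. -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-html-commons</artifactId>
  <!-- Assumption: modules reference each other via the project version. -->
  <version>${project.version}</version>
</dependency>
<!-- The matching <dependency> block would be deleted from the
     tika-parser-html-module POM, so library users no longer pull in
     Boilerpipe transitively. -->
{code}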

This way {{tika-app}} and {{tika-server}} (which already has an explicit 
dependency on {{tika-parser-html-commons}}) continue to work as before, but 
for users using Tika as a library the Boilerpipe dependency vanishes.

It will be a breaking change in the unlikely case that someone actually 
uses {{BoilerpipeContentHandler}} explicitly, but it will be an improvement 
for almost everyone else.

> Boilerpipe dependency is evil
> -----------------------------
>
>                 Key: TIKA-1484
>                 URL: https://issues.apache.org/jira/browse/TIKA-1484
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Ben McCann
>            Priority: Major
>
> The Boilerpipe project bundles inside it two classes from org.cyberneko.html. 
> We're already using NekoHTML in our project. Depending on which library shows 
> up on our classpath certain parts of our project will either work or not. I'd 
> really love it if Boilerpipe could be fixed or replaced with some other 
> library that is a better citizen.
> I see I'm not the first person to run into this as another Tika user has 
> filed a bug on the Boilerpipe project: 
> https://code.google.com/p/boilerpipe/issues/detail?id=62



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
