[
https://issues.apache.org/jira/browse/TIKA-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572077#comment-17572077
]
Björn Kautler commented on TIKA-1484:
-------------------------------------
I analysed the code a bit.
As far as I can see, the situation is as follows:
- Boilerpipe is only used in
{{tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-commons/src/main/java/org/apache/tika/sax/boilerpipe/BoilerpipeContentHandler.java}}
- {{BoilerpipeContentHandler}} is only used in
-- {{tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java}}
-- {{tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java}}
-- {{tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java}}
-- {{tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java}}
- {{tika-parser-html-commons}} consists solely of
{{BoilerpipeContentHandler.java}}
So my suggestion would be to
* move the test that tests {{BoilerpipeContentHandler}} to
{{tika-parser-html-commons}}
* remove the dependency from {{tika-parser-html-module}} to
{{tika-parser-html-commons}}
* add a dependency from {{tika-app}} to {{tika-parser-html-commons}}
* maybe even rename {{tika-parser-html-commons}} to
{{tika-parser-boilerplate}} to also reduce the risk of it being used again in
the future by the html module
This way {{tika-app}} and {{tika-server}} (which already has the explicit
dependency on {{tika-parser-html-commons}}) continue to work as before, but
for users using Tika as a library the Boilerpipe dependency vanishes.
It will be a breaking change for the unlikely situation where someone actually
used it explicitly, but will be an improvement for almost everyone else.
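For anyone who does use {{BoilerpipeContentHandler}} explicitly, the fix after
such a change would probably be a single extra dependency declaration. A rough
sketch, assuming the module keeps its current artifact name and that
{{tika.version}} is a property defined in your own POM (hypothetical, not
something Tika provides):
{code:xml}
<!-- Sketch only: declare the Boilerpipe handler module explicitly,
     since it would no longer arrive transitively via tika-parser-html-module. -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <!-- Assumes the module is NOT renamed; adjust if the artifactId changes. -->
  <artifactId>tika-parser-html-commons</artifactId>
  <!-- tika.version is a placeholder property you would define yourself. -->
  <version>${tika.version}</version>
</dependency>
{code}
{{tika-app}} would need the equivalent declaration in its own POM.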
> Boilerpipe dependency is evil
> -----------------------------
>
> Key: TIKA-1484
> URL: https://issues.apache.org/jira/browse/TIKA-1484
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.6
> Reporter: Ben McCann
> Priority: Major
>
> The Boilerpipe project bundles inside it two classes from org.cyberneko.html.
> We're already using NekoHTML in our project. Depending on which library shows
> up on our classpath certain parts of our project will either work or not. I'd
> really love it if Boilerpipe could be fixed or replaced with some other
> library that is a better citizen.
> I see I'm not the first person to run into this as another Tika user has
> filed a bug on the Boilerpipe project:
> https://code.google.com/p/boilerpipe/issues/detail?id=62