[
https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217389#comment-16217389
]
ASF GitHub Bot commented on NUTCH-2449:
---------------------------------------
YossiTamari commented on issue #233: NUTCH-2449: Replace Tika
LanguageIdentifier in language-identifier
URL: https://github.com/apache/nutch/pull/233#issuecomment-339083180
Sorry for the mix-up, pipldev is my alternate account...
Speaking as a relatively new user of Nutch, I have never looked for a readme
file in a plugin directory. It's a good thing, since I just sampled a few, and
none had a readme file.
In my experience, getting started with Nutch is challenging, and it often
requires reading the code (and frequently debugging it) to understand how
things are supposed to work.
I do not think adding one more way to confuse users is a good idea. As one
example of why it would be confusing: the plugin has a number of
configuration settings. Do we use the same properties to configure both
plugins? Or do we duplicate all of them in nutch-default.xml and try to explain
in the comments there why there are two sets with exactly the same functionality?
> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
> Key: NUTCH-2449
> URL: https://issues.apache.org/jira/browse/NUTCH-2449
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Affects Versions: 1.13
> Reporter: Yossi Tamari
>
> The language-identifier plugin uses
> org.apache.tika.language.LanguageIdentifier for extracting the language from
> the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and, I suspect, many other languages; see
> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> and it doesn’t even fail gracefully with them: in my experience Chinese was
> recognized as Italian.