[
https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217309#comment-16217309
]
ASF GitHub Bot commented on NUTCH-2449:
---------------------------------------
lewismc commented on issue #233: NUTCH-2449: Replace Tika LanguageIdentifier in
language-identifier
URL: https://github.com/apache/nutch/pull/233#issuecomment-339066193
@pipldev thanks
> 90% of the code of the plugin would be copied, and somebody would have to
manage applying every fix to both plugins.
I agree. There have been VERY few changes to this plugin in almost a decade.
I don't anticipate that there will be too many changes moving forward.
> I think it would be really confusing to users to find two plugins that do
exactly the same thing, with the only difference being an implementation detail.
Not if it is documented in a README.md contained within the plugin
directory. Usually people interested in linguistics are also very interested in
the applications or techniques used in tasks such as language identification,
statistical machine translation, etc. I think any confusion could be easily
avoided if we provide a simple README.
> Keep in mind that while the existing plugin is stable, it relies on
deprecated code, and presumably would break eventually when Tika decides to
remove the deprecated implementation.
I agree, but as I said, it is stable. We have tight integration between the
Nutch and Tika dev communities. Once the code is removed over in Tika, it is
trivial to remove the code over here.
At the end of the day, it is down to the contributor whether this code is
updated or not. I just think preservation is an important part of what we do
here. Retaining the legacy plugin is doing no harm whatsoever.
Thanks
----------------------------------------------------------------
> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
> Key: NUTCH-2449
> URL: https://issues.apache.org/jira/browse/NUTCH-2449
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Affects Versions: 1.13
> Reporter: Yossi Tamari
>
> The language-identifier plugin uses
> org.apache.tika.language.LanguageIdentifier for extracting the language from
> the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and I suspect a lot of other languages -
> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> and it doesn’t even fail gracefully with them - in my experience Chinese was
> recognized as Italian.
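
As an illustration of what "failing gracefully" on CJK input could look like, here is a minimal sketch that uses only the Java standard library. The class name `CjkGuard`, the method `isMostlyCjk`, and the 50% threshold are hypothetical and are not part of Nutch or Tika. The idea, which is an assumption rather than anything proposed in the issue, is to check the Unicode script of the text first, so that a detector without CJK profiles is not trusted on text it cannot handle.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper, not part of Nutch or Tika.
final class CjkGuard {

    // Scripts treated as CJK for this sketch.
    private static final Set<UnicodeScript> CJK_SCRIPTS = EnumSet.of(
            UnicodeScript.HAN,
            UnicodeScript.HIRAGANA,
            UnicodeScript.KATAKANA,
            UnicodeScript.HANGUL);

    private CjkGuard() {
    }

    // Returns true when more than half of the letters in the text
    // belong to a CJK script. The 50% threshold is an arbitrary assumption.
    static boolean isMostlyCjk(String text) {
        int letters = 0;
        int cjk = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, whitespace, punctuation
            }
            letters++;
            if (CJK_SCRIPTS.contains(UnicodeScript.of(cp))) {
                cjk++;
            }
        }
        return letters > 0 && cjk * 2 > letters;
    }
}
```

A caller could skip language identification, or leave the language field unset, whenever `CjkGuard.isMostlyCjk(text)` returns true, instead of recording a wrong guess such as Italian.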