Hi,

Solr supports pluggable language detectors 
<https://solr.apache.org/guide/solr/latest/indexing-guide/language-detection.html>:

> Solr supports three implementations of this feature:
> 
> Tika’s language detection feature: 
> https://tika.apache.org/1.28.4/detection.html
> LangDetect language detection: https://github.com/shuyo/language-detection
> OpenNLP language detection: 
> http://opennlp.apache.org/docs/1.9.4/manual/opennlp.html#tools.langdetect

Since our first implementation, the Tika project 
<https://tika.apache.org/2.7.0/detection.html#Language_Detection> has evolved 
its language detection capabilities and added a pluggable architecture as well:
https://github.com/apache/tika/tree/main/tika-langdetect

One of Solr's langid plugins, "langdetect", has not been updated in 10 years. 
For that reason I'd like to deprecate it and remove it in main.

Longer-term question: Does it make sense for us to keep maintaining our own set 
of language detectors in this landscape?
We could re-purpose the langid module so that it uses Tika's pluggable detectors 
in some way, perhaps with thin wrapper classes in Solr?
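
To make the "thin wrapper" idea a bit more concrete, here is a rough sketch of 
the shape it could take. Everything in it is an assumption for illustration: 
ExternalDetector, ExternalResult, DetectedLanguage and WrappedDetector are 
hypothetical stand-ins, not the actual Tika or Solr classes, and the real 
integration points would have to be checked against both code bases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LangIdWrapperSketch {

    /** Hypothetical stand-in for an external pluggable detector; NOT Tika's real API. */
    public interface ExternalDetector {
        List<ExternalResult> detectAll(String text);
    }

    /** Hypothetical result type returned by the external detector. */
    public static final class ExternalResult {
        public final String language;
        public final float rawScore;

        public ExternalResult(String language, float rawScore) {
            this.language = language;
            this.rawScore = rawScore;
        }
    }

    /** Hypothetical stand-in for the language/certainty pair the langid module consumes. */
    public static final class DetectedLanguage {
        public final String langCode;
        public final double certainty;

        public DetectedLanguage(String langCode, double certainty) {
            this.langCode = langCode;
            this.certainty = certainty;
        }
    }

    /** The thin wrapper: delegates detection and only converts the result type. */
    public static final class WrappedDetector {
        private final ExternalDetector delegate;

        public WrappedDetector(ExternalDetector delegate) {
            this.delegate = delegate;
        }

        public List<DetectedLanguage> detectLanguage(String text) {
            List<DetectedLanguage> out = new ArrayList<>();
            for (ExternalResult r : delegate.detectAll(text)) {
                out.add(new DetectedLanguage(r.language, r.rawScore));
            }
            // Most certain language first, so callers can take the head of the list.
            out.sort((a, b) -> Double.compare(b.certainty, a.certainty));
            return out;
        }
    }

    public static void main(String[] args) {
        // A stub detector with made-up scores, standing in for a real plugin.
        ExternalDetector stub = text -> Arrays.asList(
                new ExternalResult("en", 0.2f),
                new ExternalResult("no", 0.9f));
        for (DetectedLanguage d : new WrappedDetector(stub).detectLanguage("Hei verden")) {
            System.out.println(d.langCode);
        }
    }
}
```

The point of the sketch is only that Solr would keep a single adapter and no 
detection logic of its own, so any detector that plugs into the external 
interface would become usable without a new Solr-side implementation.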

Wdyt?

Jan
