+1 for delegating to Tika, which is a much better place for language detection (and one they are actively evolving).
+1 for deprecating old, unmaintained plugins (langdetect) as well.

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter <https://twitter.com/seaseltd> | Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github <https://github.com/seaseltd>

On Thu, 2 Mar 2023 at 20:22, Jan Høydahl <jan....@cominvent.com> wrote:

> Hi,
>
> Solr supports pluggable language detectors
> <https://solr.apache.org/guide/solr/latest/indexing-guide/language-detection.html>:
>
> > Solr supports three implementations of this feature:
> >
> > Tika’s language detection feature: https://tika.apache.org/1.28.4/detection.html
> > LangDetect language detection: https://github.com/shuyo/language-detection
> > OpenNLP language detection: http://opennlp.apache.org/docs/1.9.4/manual/opennlp.html#tools.langdetect
>
> Since our first implementation, the Tika project
> <https://tika.apache.org/2.7.0/detection.html#Language_Detection> has
> evolved its language detection capabilities and added a pluggable
> architecture as well:
> https://github.com/apache/tika/tree/main/tika-langdetect
>
> One of Solr's langid plugins is "langdetect", which has not been updated in
> 10 years. I'd like to deprecate it and remove it in main for that reason.
>
> Longer term question: Does it make sense for us to keep maintaining our
> own set of language detectors in this landscape?
> We could re-purpose the langid module so that it uses Tika's pluggable
> detectors in some way, perhaps with thin wrapper classes in Solr?
>
> Wdyt?
>
> Jan