+1 for delegating to Tika, which is a much better place for that (and
which they are actively evolving).

+1 for deprecating the old, unmaintained plugins as well (langdetect).

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


On Thu, 2 Mar 2023 at 20:22, Jan Høydahl <jan....@cominvent.com> wrote:

> Hi,
>
> Solr supports pluggable language detectors <
> https://solr.apache.org/guide/solr/latest/indexing-guide/language-detection.html
> >:
>
> > Solr supports three implementations of this feature:
> >
> > Tika’s language detection feature: https://tika.apache.org/1.28.4/detection.html
> > LangDetect language detection: https://github.com/shuyo/language-detection
> > OpenNLP language detection: http://opennlp.apache.org/docs/1.9.4/manual/opennlp.html#tools.langdetect
>
> Since our first implementation, the Tika project <
> https://tika.apache.org/2.7.0/detection.html#Language_Detection> has
> evolved its language detection capabilities and added a pluggable
> architecture as well:
> https://github.com/apache/tika/tree/main/tika-langdetect
>
> One of Solr's langid plugins is "langdetect" which has not been updated in
> 10 years. I'd like to deprecate it and remove it in main for that reason.
>
> Longer term question: Does it make sense for us to keep maintaining our
> own set of language detectors in this landscape?
> We could re-purpose the langid module so that it uses Tika's pluggable
> detectors in some way, perhaps with thin wrapper classes in Solr?
>
> Wdyt?
>
> Jan
