Re: Chinese and Korean being detected as Lithuanian by LanguageDetector

Mike Thomsen Thu, 17 Jan 2019 13:59:00 -0800

Ken,

Here's a Gist version of it:


https://gist.github.com/MikeThomsen/84abb89aab903a8b21d64af532cc369b

Thanks,

Mike

On Thu, Jan 17, 2019 at 2:25 PM Ken Krugler wrote:

> Hi Mike,
>
> I don’t see the script - did it get stripped?
>
> Below is a list of the language profiles that I believe are bundled with
> the language-detector jar that’s pulled in by Tika.
>
> I don’t see “gr” - note that Greek is “el”.
>
> And there’s “zh-CN” and “zh-TW” vs. just “zh”, but otherwise I’d expect
> detection to work for your test cases.
>
> — Ken
>
> af
> an
> ar
> ast
> be
> bg
> bn
> br
> ca
> cs
> cy
> da
> de
> el
> en
> es
> et
> eu
> fa
> fi
> fr
> ga
> gl
> gu
> he
> hi
> hr
> ht
> hu
> id
> is
> it
> ja
> km
> kn
> ko
> lt
> lv
> mk
> ml
> mr
> ms
> mt
> ne
> nl
> no
> oc
> pa
> pl
> pt
> ro
> ru
> sk
> sl
> so
> sq
> sr
> sv
> sw
> ta
> te
> th
> tl
> tr
> uk
> ur
> vi
> yi
> zh-CN
> zh-TW
>
>
> > On Jan 17, 2019, at 9:39 AM, Mike Thomsen wrote:
> >
> > I wrote a Groovy script (attached) to test a bunch of languages against
> > the LanguageDetector class, and these were the results:
> >
> > test  detected
> > ar    fa
> > de    de
> > en    en
> > es    es
> > fr    fr
> > gr    el
> > it    it
> > ko    lt
> > nl    nl
> > ru    ru
> > zh    lt
> >
> > Is there something that needs to be done to enable the detection of
> > Asian languages or should I file this as a bug report?
> >
> > Thanks,
> >
> > Mike
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> Custom big data solutions & training
> Flink, Solr, Hadoop, Cascading & Cassandra
>
>
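
Ken's two observations (Greek is profiled as "el" rather than "gr", and Chinese as "zh-CN"/"zh-TW" rather than plain "zh") can be folded into how the test results are scored. The following is a minimal Java sketch under those assumptions only. The class name LanguageCodeCheck, the methods normalize and matches, and the alias table are hypothetical; they are not part of Tika or the language-detector library, and the sketch does not call the detector itself.

```java
import java.util.Map;

public class LanguageCodeCheck {
    // Hypothetical alias table: test labels that differ from the profile codes.
    // "gr" -> "el" comes from the note in the thread that Greek is "el".
    static final Map<String, String> ALIASES = Map.of("gr", "el");

    // Map a test label to the code the profiles are expected to use.
    static String normalize(String code) {
        return ALIASES.getOrDefault(code, code);
    }

    // True when the detected code equals the normalized label, or is a
    // regional variant of it (for example "zh-CN" or "zh-TW" for "zh").
    static boolean matches(String label, String detected) {
        String want = normalize(label);
        return detected.equals(want) || detected.startsWith(want + "-");
    }

    public static void main(String[] args) {
        // A few label/detected pairs taken from the results in the thread.
        String[][] results = {
            {"ar", "fa"}, {"gr", "el"}, {"ko", "lt"}, {"zh", "lt"}
        };
        for (String[] r : results) {
            System.out.println(r[0] + " -> " + r[1] + " : "
                + (matches(r[0], r[1]) ? "ok" : "MISMATCH"));
        }
    }
}
```

With this comparison, the "gr"/"el" row counts as a match, while the "ar", "ko" and "zh" rows from the reported results remain mismatches.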
