Hi Mike,

So the issues are Arabic, Korean and Chinese, right?

I’d suggest filing an issue for Tika, so at least we can track it, though likely the issue is with the language-detector project we’re using for detection.

I’m leaving on a trip this evening, but back next week, so will try to look at it then.

Regards,

— Ken

> On Jan 17, 2019, at 1:48 PM, Mike Thomsen <[email protected]> wrote:
>
> Ken,
>
> Here's a Gist version of it:
>
> https://gist.github.com/MikeThomsen/84abb89aab903a8b21d64af532cc369b
>
> Thanks,
>
> Mike
>
> On Thu, Jan 17, 2019 at 2:25 PM Ken Krugler <[email protected]>
> wrote:
>
>> Hi Mike,
>>
>> I don’t see the script - did it get stripped?
>>
>> Below is a list of the language profiles that I believe are bundled with
>> the language-detector jar that’s pulled in by Tika.
>>
>> I don’t see “gr” - note that Greek is “el”.
>>
>> And there’s “zh-CN” and “zh-TW” vs. just “zh”, but otherwise I’d expect
>> detection to work for your test cases.
>>
>> — Ken
>>
>> af
>> an
>> ar
>> ast
>> be
>> bg
>> bn
>> br
>> ca
>> cs
>> cy
>> da
>> de
>> el
>> en
>> es
>> et
>> eu
>> fa
>> fi
>> fr
>> ga
>> gl
>> gu
>> he
>> hi
>> hr
>> ht
>> hu
>> id
>> is
>> it
>> ja
>> km
>> kn
>> ko
>> lt
>> lv
>> mk
>> ml
>> mr
>> ms
>> mt
>> ne
>> nl
>> no
>> oc
>> pa
>> pl
>> pt
>> ro
>> ru
>> sk
>> sl
>> so
>> sq
>> sr
>> sv
>> sw
>> ta
>> te
>> th
>> tl
>> tr
>> uk
>> ur
>> vi
>> yi
>> zh-CN
>> zh-TW
>>
>>
>>> On Jan 17, 2019, at 9:39 AM, Mike Thomsen <[email protected]>
>> wrote:
>>>
>>> I wrote a Groovy script (attached) to test a bunch of languages against
>> the LanguageDetector class, and these were the results:
>>>
>>> ar fa
>>> de de
>>> en en
>>> es es
>>> fr fr
>>> gr el
>>> it it
>>> ko lt
>>> nl nl
>>> ru ru
>>> zh lt
>>>
>>> Is there something that needs to be done to enable the detection of
>> Asian languages or should I file this as a bug report?
>>>
>>> Thanks,
>>>
>>> Mike
>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> Custom big data solutions & training
>> Flink, Solr, Hadoop, Cascading & Cassandra
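
As a rough illustration of how the results quoted in this thread could be tallied, the sketch below compares each expected code with the detected one, allowing for the two naming differences Ken points out (Greek is "el", and Chinese is "zh-CN"/"zh-TW" rather than "zh"). The class name `LangCodeCheck`, the alias table, and both helper methods are hypothetical and are not part of Tika or the language-detector project; the input pairs are copied from Mike's results, and no detector is actually invoked here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LangCodeCheck {
    // Hypothetical alias table: test label -> profile codes that should count as a match.
    // Based only on the two differences noted in the thread ("gr" vs "el", "zh" vs "zh-CN"/"zh-TW").
    private static final Map<String, List<String>> ALIASES = Map.of(
            "gr", List.of("el"),
            "zh", List.of("zh-CN", "zh-TW"));

    // True if the detected code equals the expected label or one of its aliases.
    static boolean matches(String expected, String detected) {
        if (expected.equals(detected)) {
            return true;
        }
        return ALIASES.getOrDefault(expected, List.of()).contains(detected);
    }

    // Returns the expected labels whose detected code did not match, in input order.
    static List<String> mismatches(Map<String, String> results) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, String> e : results.entrySet()) {
            if (!matches(e.getKey(), e.getValue())) {
                bad.add(e.getKey());
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Expected -> detected pairs, copied from the results in the thread.
        Map<String, String> results = new LinkedHashMap<>();
        results.put("ar", "fa");
        results.put("de", "de");
        results.put("en", "en");
        results.put("es", "es");
        results.put("fr", "fr");
        results.put("gr", "el");
        results.put("it", "it");
        results.put("ko", "lt");
        results.put("nl", "nl");
        results.put("ru", "ru");
        results.put("zh", "lt");

        System.out.println(mismatches(results));
    }
}
```

Run against those pairs, this prints `[ar, ko, zh]`, which lines up with the three languages named at the top of this message; "gr" drops out because the detector's "el" is the correct code for Greek.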
