I can see that, so I think switching the language codes is something we
should do when we make bigger changes anyway. Maybe for 1.6, together with
a switch to opennlp-ml and maybe bigger changes in our feature generation
code.

Jörn

On 5/17/11 10:32 PM, Benson Margulies wrote:
There are important distinctions missing in the two-letter codes: Farsi /
Dari, and others.

On May 17, 2011, at 4:25 PM, "Jörn Kottmann" <[email protected]> wrote:

Is there support for ISO 639-3 in Java? Currently all we do is check that
the language is a valid two-letter code. The idea when we added it was that
we would be able to have language-dependent feature generation one day, but
up to today we only do something special in the sentence detector for Thai.
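
For illustration only, here is a minimal sketch of the kind of check
described above, using nothing but the JDK. It is not OpenNLP's actual code:
the class name LanguageCodes and both method names are hypothetical.
java.util.Locale can list the two-letter ISO 639-1 codes and map one of them
to a three-letter ISO 639-2/T code, but as far as I know the JDK does not
bundle the complete ISO 639-3 table, so a full ISO 639-3 check would need its
own code list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.MissingResourceException;
import java.util.Set;

// Hypothetical helper for illustration; not part of OpenNLP.
final class LanguageCodes {

    // Two-letter language codes known to the running JDK.
    private static final Set<String> TWO_LETTER =
            new HashSet<>(Arrays.asList(Locale.getISOLanguages()));

    private LanguageCodes() {
    }

    // True if the code is one of the JDK's two-letter language codes.
    static boolean isValidTwoLetterCode(String code) {
        return code != null && TWO_LETTER.contains(code);
    }

    // Maps a two-letter code to the JDK's three-letter code (ISO 639-2/T),
    // or returns null if the input is unknown or has no three-letter form.
    static String toThreeLetterCode(String twoLetterCode) {
        if (!isValidTwoLetterCode(twoLetterCode)) {
            return null;
        }
        try {
            return Locale.forLanguageTag(twoLetterCode).getISO3Language();
        } catch (MissingResourceException e) {
            return null;
        }
    }
}
```

With this sketch, LanguageCodes.toThreeLetterCode("en") yields "eng", the
form proposed at the start of this thread, while a three-letter input such as
"eng" fails the two-letter check.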

Jörn

On 5/17/11 8:50 PM, Benson Margulies wrote:
ISO 639-2 is pretty useless. Use ISO 639-3 if you want to switch.

On Tue, May 17, 2011 at 2:40 PM, Oleg Tikhonov <[email protected]> wrote:
My two cents: tesseract-ocr also uses ISO 639-3, and it would be great for
those who build solutions such as OpenNLP + tesseract.

-Oleg

On Tue, May 17, 2011 at 9:33 PM, Jason Baldridge <[email protected]> wrote:

I think we should change to the three-character convention for
language-specific materials, e.g. "eng" rather than "en" for English.

http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes

Do others agree?

--
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge

