Hi all,

I'm Korean and can provide a large Korean corpus if you need one to train the LanguageIdentifier.
Just let me know which charset the text should be in and what kind of content you need: news, novels, dialogs, technical articles, etc.

Regards,
Cheolgoo

On 1/5/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Hi Jerome,
>
> So all that's needed is a Japanese, Chinese, and Korean text corpus to
> "train" the identifier? Can the LanguageIdentifier properly handle
> multi-byte character sets?
>
> Thanks,
> Otis
>
> ----- Original Message ----
> From: Jérôme Charron <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; Otis Gospodnetic <[EMAIL PROTECTED]>
> Sent: Wed Jan 4 18:54:49 2006
> Subject: [Nutch-general] Re: LanguageIdentifierPlugin and CJK
>
> > I'm interested in the Language Identifier plugin that Sami and Jerome put
> > together. I noticed the list of "supported" languages does not include CJK
> > languages:
> > http://wiki.apache.org/nutch/LanguageIdentifierPlugin
>
> Yes, it's true.
>
> > I'm wondering:
> > 1. why is that? (technical difficulty of some kind?)
>
> Reasons for me:
> 1. My plans were only to improve performance and precision.
> 2. I don't have enough knowledge about CJK (and haven't taken the time to
> test it, but I'm interested in helping to test it).
> 3. My idea was to provide a basis for easily adding supported languages
> (to give each "specialist" in a particular language the keys to add a
> new supported language).
>
> > 2. are there plans for CJK support?
>
> Not for me (but I can help).
> Teruhiko Kurosaka could probably give us some help and keys.
>
> > 3. what would it take to add CJK support?
>
> A CJK textual corpus for constructing the cjk profile.
>
> Regards
>
> Jérôme
