Re: NGram Language Categorization Source

Andrzej Bialecki Mon, 22 Aug 2005 00:51:55 -0700

Kevin Burton wrote:

A lot depends on the reference profiles (which in turn depend on the
quality of your training corpus - in this case, your corpus is not the
best choice, because each text contains a lot of foreign words).



I realize that my corpus isn't the best.  That's one of the reasons
I've open sourced it.  The main improvement in ngramcat (my code) is
that if the result isn't obvious we throw an Exception, so
theoretically we won't see any false positives unless the language
categorization is WAY off.

That's also how other implementations do it - you need to set an
arbitrary threshold, and if the profiles score below that threshold then
an "unknown" value is returned (or null, or an Exception).
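
To make the threshold idea concrete, here is a minimal sketch in Python (not taken from ngramcat or any other implementation). It assumes each language profile has already been scored against the input as a similarity where higher is better; the function name pick_language and the 0.5 default are made up for illustration only.

```python
from typing import Dict, Optional


def pick_language(scores: Dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the best-scoring language, or None ("unknown").

    Assumption: scores are similarities (higher is better). The 0.5
    default is arbitrary and would need tuning on real data.
    """
    if not scores:
        return None
    best_lang = max(scores, key=scores.get)
    # Reject the answer when even the best profile scores too low.
    if scores[best_lang] < threshold:
        return None
    return best_lang
```

A caller that prefers an exception over None could simply raise where None is returned; the decision logic stays the same.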

It was also found that the way you create ngram profiles (e.g. with or
without surrounding spaces, single length or mixed length) affects the
LI performance.



LI???


LI = Language Identification. Sorry for the confusion.
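
To illustrate the profile-building choices mentioned above (surrounding spaces or not, single or mixed n-gram length), here is a rough Python sketch. It is not the ngramcat code; the function name ngram_profile, its parameters and the top_k cut-off of 300 are assumptions chosen for the example.

```python
from collections import Counter
from typing import List


def ngram_profile(text: str, min_n: int = 1, max_n: int = 3,
                  pad: bool = True, top_k: int = 300) -> List[str]:
    """Return the top_k most frequent character n-grams of text.

    pad=True wraps every word in spaces, so word-initial and word-final
    n-grams are distinguished from word-internal ones.
    min_n == max_n gives a single-length profile; otherwise mixed length.
    """
    counts: Counter = Counter()
    for word in text.lower().split():
        token = f" {word} " if pad else word
        for n in range(min_n, max_n + 1):
            for i in range(len(token) - n + 1):
                counts[token[i:i + n]] += 1
    # Most frequent n-grams first; the rank order is the profile.
    return [gram for gram, _ in counts.most_common(top_k)]
```

For example, ngram_profile(text, 3, 3, pad=False) builds a trigram-only profile without surrounding spaces, while the defaults build a padded, mixed-length (1 to 3) profile. Comparing the two on the same corpus is one way to see how much these choices matter.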

I haven't benchmarked it but I'd be interested in any suggestions you have.

For documents with mixed languages it was also found that
methods, which combine ngrams with stopwords, work better.



Hm.. interesting.. where?  Any URL I can read?

Someone mentioned the Linguini paper, where they found that using
"shortwords" features gives performance about as good as using ngrams.


See also the following papers:

http://www.xrce.xerox.com/Publications/Attachments/1995-012/Gref---Comparing-two-language-identification-schemes.pdf
http://citeseer.ist.psu.edu/40861.html
http://www.xs4all.nl/~ajwp/langident.pdf

In general, using stop words works only for texts above a certain
minimum length (greater than with n-gram methods), and then gives nearly
100% accuracy.
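
A stop-word identifier can be sketched roughly as follows (Python, illustration only). The tiny word lists, the name identify_by_stopwords and the min_hits value are all assumptions; a real system would use much larger lists. The min_hits check stands in for the minimum-length requirement: short texts contain too few stop words to decide.

```python
from typing import Dict, Optional, Set

# Toy stop-word lists, for illustration only.
STOPWORDS: Dict[str, Set[str]] = {
    "en": {"the", "and", "of", "to", "is", "in"},
    "de": {"der", "die", "und", "das", "ist", "nicht"},
    "fr": {"le", "la", "et", "les", "des", "est"},
}


def identify_by_stopwords(text: str,
                          stopwords: Dict[str, Set[str]] = STOPWORDS,
                          min_hits: int = 3) -> Optional[str]:
    """Return the language whose stop words occur most often, or None."""
    words = text.lower().split()
    hits = {lang: sum(1 for w in words if w in sw)
            for lang, sw in stopwords.items()}
    if not hits:
        return None
    best = max(hits, key=hits.get)
    # Too few stop words seen: the text is too short to decide.
    if hits[best] < min_hits:
        return None
    return best
```

With these toy lists, a two-word input such as "the cat" comes back as None, which is the behaviour described above: the method needs longer input than n-gram methods before it can answer.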

So, there is still a lot to do in this area, if you come up with some
unique way of improving LI performance...



Maybe I'm being dense but what is LI performance?

Language Identification performance - in the sense that a given
identifier "performs" better if it can correctly identify more
languages, using shorter input text.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


