Kevin Burton wrote:
A lot depends on the reference profiles (which in turn depend on the
quality of your training corpus - in this case, your corpus is not the
best choice, because each text contains a lot of foreign words).
I realize that my corpus isnt' the best. That's one of the reason's
I've open source'd it. The main improvement in ngramcat (my code) is
that if the result isn't obvious we throw an Exception so
theoreticallyi we won't see any false positives unless the language
categorization is WAY off.
That's also how other implementations do it - you need to set an
arbitrary threshold, and if the profiles score below that threshold then
an "unknown" value is returned (or null, or Exception).
It was
also found that the way you create ngram profiles (e.g. with or without
surrounding spaces, single length or mixed length) affects the LI
performance.
LI???
LI = Language Identification. Sorry for the confusion.
I haven't benchmarked it but I'd be interested in any suggestions you have.
For documents with mixed languages it was also found that
methods, which combine ngrams with stopwords, work better.
Hm.. interesting.. where? URL I can reads?
Someone mentioned the Linguini paper, where they found that using "short
words" features gives similarly good performance as using ngrams.
See also the following papers :
http://www.xrce.xerox.com/Publications/Attachments/1995-012/Gref---Comparing-two-language-identification-schemes.pdf
http://citeseer.ist.psu.edu/40861.html
http://www.xs4all.nl/~ajwp/langident.pdf
In general, using stop words works only for texts above certain minimum
length (greater than with n-gram methods), and then gives nearly 100%
accuracy.
So, there is still a lot to do in this area, if you come up with some
unique way of improving LI performance...
Maybe I'm being dense but what is LI performance?
Language Identification performance - in the sense that a given
identifier "performs" better if it can correctly identify more
languages, using shorter input text.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]