On 3 Sep 2009, at 21:59, Evan Prodromou wrote:

This sounds really hard to me, would take a lot of time at notice submit time, and would be almost intractable for the Latin code range. I think probably someone should do this work, somewhere... but it's probably not up to us to do it.


Less difficult for Latin scripts than one might imagine. <http://wiki.musicbrainz.org/Tell_Similar_Languages_Apart> is a quick guide.

e.g. ç is only really used in French, Portuguese and Catalan. Having narrowed down a notice to just those three languages, you can apply a process of elimination. A sequence L-interpunct-L (l·l) is a sure sign of Catalan. Look for acute accents - French only uses them on the letter 'e', so if you find them on any other letter, then it's not French. Look for umlauts (diaereses) - in French they're occasionally seen on 'e', but if you see one on an 'i' or 'u' you're probably looking at Catalan. Portuguese is the only one of these three languages to use a tilde. If you've still not narrowed it down to a single possibility, look at commonly used words - 'I' in French/Portuguese/Catalan is 'je'/'eu'/'jo'; 'and' is 'et'/'e'/'i'.
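
As a rough illustration, the elimination steps above could be sketched in Python along these lines. The function name guess_language and the HINT_WORDS table are made up for this example, the "probably Catalan" diaeresis rule is treated as a soft vote rather than a hard rule, and the code assumes the notice has already been narrowed down to French, Portuguese or Catalan. It's a sketch of the idea, not a tested detector.

```python
import re
import unicodedata

# Hypothetical hint table: the common words mentioned above ('I' and 'and').
HINT_WORDS = {
    "fr": {"je", "et"},
    "pt": {"eu", "e"},
    "ca": {"jo", "i"},
}


def guess_language(text):
    """Return the most likely candidates among 'fr', 'pt' and 'ca', sorted."""
    # Normalise so that accented letters are single code points.
    text = unicodedata.normalize("NFC", text).lower()
    candidates = {"fr", "pt", "ca"}

    # L-interpunct-L is a sure sign of Catalan.
    if "l·l" in text:
        return ["ca"]

    # Portuguese is the only one of the three to use a tilde.
    if re.search("[ãõ]", text):
        return ["pt"]

    # French only puts acute accents on 'e'.
    if re.search("[áíóú]", text):
        candidates.discard("fr")

    scores = dict.fromkeys(candidates, 0)

    # A diaeresis on 'i' or 'u' is only a hint towards Catalan, so it
    # counts as one vote instead of eliminating the other languages.
    if re.search("[ïü]", text):
        scores["ca"] += 1

    # Fall back to commonly used words for whatever is still in the running.
    for word in re.findall(r"[^\W\d_]+", text):
        for lang in candidates:
            if word in HINT_WORDS[lang]:
                scores[lang] += 1

    best = max(scores.values())
    return sorted(lang for lang, score in scores.items() if score == best)
```

If no rule fires and no hint word is found, all remaining candidates come back, which is the honest answer for a very short notice.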

The Perl module Text::Language::Guess is a purely dictionary-based approach and works pretty well:

http://search.cpan.org/dist/Text-Language-Guess/lib/Text/Language/Guess.pm

--
Toby A Inkster
<mailto:m...@tobyinkster.co.uk>
<http://tobyinkster.co.uk>



_______________________________________________
Laconica-dev mailing list
Laconica-dev@laconi.ca
http://mail.laconi.ca/mailman/listinfo/laconica-dev
