Re: [StatusNet-dev] [Laconica-dev] Indicating the language of a notice

Toby Inkster Sat, 05 Sep 2009 04:59:35 -0700

On 3 Sep 2009, at 21:59, Evan Prodromou wrote:

This sounds really hard to me, would take a lot of time at notice submit time, and would be almost intractable for the Latin code range. I think probably someone should do this work, somewhere...but it's probably not up to us to do it.

Less difficult for Latin scripts than one might imagine. <http://wiki.musicbrainz.org/Tell_Similar_Languages_Apart> is a quick guide.

e.g. ç is only really used in French, Portuguese and Catalan. Having narrowed down a notice to just three languages, you can apply a process of elimination. A sequence L-interpunct-L (l·l) is a sure sign of Catalan. Look for acute accents - French only uses them on the letter 'E', so if you find them on any other letter, then it's not French. Look for umlauts (diaereses) - in French they're occasionally seen on 'e', but if you see one on an 'i' or 'u' you're probably looking at Catalan. Portuguese is the only one of these three languages to use a tilde. If you've still not narrowed it down to a single possibility, look at commonly used words - 'I' in French/Portuguese/Catalan is 'je'/'eu'/'jo'; 'and' is 'et'/'e'/'i'.
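To make that concrete, here is a rough sketch of the elimination process in Python. It is only an illustration under some assumptions: the notice has already been narrowed down to French, Portuguese or Catalan (e.g. because it contains a ç), the function name guess_language and the HINT_WORDS table are made up for this example, and the "probably Catalan" hint is treated as decisive, which a real implementation might not want to do.

```python
import re
import unicodedata

# Hypothetical table for this sketch: 'I' and 'and' in each language.
HINT_WORDS = {
    "fr": {"je", "et"},
    "pt": {"eu", "e"},
    "ca": {"jo", "i"},
}


def guess_language(text):
    """Return the set of languages (fr, pt, ca) still possible for text.

    Assumes the text is already known to be one of these three.
    """
    # Lowercase and compose accents so 'e' + combining acute becomes 'é'.
    text = unicodedata.normalize("NFC", text.lower())
    candidates = {"fr", "pt", "ca"}

    # l-interpunct-l is a sure sign of Catalan.
    if "l·l" in text:
        return {"ca"}

    # Of the three, only Portuguese uses a tilde.
    if re.search("[ãõ]", text):
        return {"pt"}

    # French only puts an acute accent on 'e'.
    if re.search("[áíóú]", text):
        candidates.discard("fr")

    # A diaeresis on 'i' or 'u' probably means Catalan (heuristic,
    # treated as decisive here for simplicity).
    if re.search("[ïü]", text):
        return {"ca"}

    # Still ambiguous: count common words and keep the best scorers.
    if len(candidates) > 1:
        words = re.findall(r"\w+", text)
        scores = {
            lang: sum(word in HINT_WORDS[lang] for word in words)
            for lang in candidates
        }
        best = max(scores.values())
        if best > 0:
            candidates = {l for l, s in scores.items() if s == best}

    return candidates
```

The function returns a set rather than a single answer, so a caller can tell when the notice is still ambiguous (all three languages come back) and fall back to something else, such as a dictionary-based guesser.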

The Perl module Text::Language::Guess is a purely dictionary-based approach and works pretty well:

http://search.cpan.org/dist/Text-Language-Guess/lib/Text/Language/Guess.pm


--
Toby A Inkster
<mailto:m...@tobyinkster.co.uk>
<http://tobyinkster.co.uk>



_______________________________________________
Laconica-dev mailing list
Laconica-dev@laconi.ca
http://mail.laconi.ca/mailman/listinfo/laconica-dev
