Re: Language Detection in Spam Assassin

Sidney Markowitz Thu, 27 Jun 2013 04:23:32 -0700

[email protected] wrote, On 6/26/13 3:54 AM:
> The users list would've been a more appropriate place to post this.
> 
> This web search appears to give useful results:  spamassassin language
>


I agree that questions about how to use an existing feature of SpamAssassin,
or, as in this case, about whether SpamAssassin has a feature that a quick
look suggests it lacks, are better asked on the users list.

However, I want to point out that the language detection method that I helped
put into SpamAssassin many years ago, TextCat, has not proven to be all that
practical. This dev list would be the correct forum for discussing better ways
to detect language, if anyone has ideas. Based on what I see in the
abstract, I would start by looking into Radim Řehůřek and Milan Kolkus' 2009
paper "Language Identification on the Web: Extending the Dictionary Method".
The method described in their paper seems simple, elegant, and a logical
improvement over TextCat. However, I haven't tried it yet. Has anyone on the
list had experience with it? I see that there is an online implementation
available to play with at http://mlcomp.org/programs/633 but I don't see much
mention of it besides that.
