On Tue, Jun 11, 2019 at 12:03:27PM +0530, Shreyansh Shrivastava. wrote:
> When Bayes tokenizes the message, it ignores words with length<3 along with a
> list of stop words using a regexp as they lie in the gray area. But for other
> languages, the presence of these English stop words can be a great indication
> for spam. Is there a way to not remove these words for other languages?

One could use the TextCat results to detect whether a message contains
English, but that's not foolproof, and the TextCat module could be disabled
by the user anyway.

Does the current stoplist actually do anything useful?  Someone should try
10-fold cross-validation with and without it.
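
For anyone who wants to try, here is a rough sketch of what such a comparison
could look like. It is a toy naive Bayes classifier in Python, not the actual
SpamAssassin Bayes code: the STOPWORDS set, the tokenizer regexp and all
function names are made up for illustration, and the only thing taken from
the thread is the idea of dropping words shorter than 3 characters and
running k-fold cross-validation with the stoplist on and off.

```python
import math
import random
import re
from collections import Counter

# Hypothetical stoplist for illustration; NOT SpamAssassin's real list.
STOPWORDS = {"the", "and", "for", "you", "are", "with"}

def tokenize(text, use_stoplist):
    """Lowercase, split into words, drop words shorter than 3 characters."""
    words = [w for w in re.findall(r"[a-z0-9']+", text.lower()) if len(w) >= 3]
    if use_stoplist:
        words = [w for w in words if w not in STOPWORDS]
    return words

def train(samples, use_stoplist):
    """Count tokens and messages per class (True = spam, False = ham)."""
    counts = {True: Counter(), False: Counter()}
    docs = {True: 0, False: 0}
    for text, is_spam in samples:
        counts[is_spam].update(tokenize(text, use_stoplist))
        docs[is_spam] += 1
    return counts, docs

def classify(model, text, use_stoplist):
    """Return True if the message scores as spam (Laplace-smoothed log odds)."""
    counts, docs = model
    vocab = len(set(counts[True]) | set(counts[False])) + 1
    scores = {}
    for label in (True, False):
        total = sum(counts[label].values())
        score = math.log((docs[label] + 1) / (docs[True] + docs[False] + 2))
        for tok in tokenize(text, use_stoplist):
            score += math.log((counts[label][tok] + 1) / (total + vocab))
        scores[label] = score
    return scores[True] > scores[False]

def cross_validate(samples, use_stoplist, k=10, seed=0):
    """Return overall accuracy across k held-out folds."""
    data = list(samples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    correct = 0
    for i, held_out in enumerate(folds):
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train(training, use_stoplist)
        correct += sum(classify(model, t, use_stoplist) == y for t, y in held_out)
    return correct / len(data)
```

Given a labelled corpus as a list of (text, is_spam) pairs, comparing
cross_validate(corpus, True) against cross_validate(corpus, False) would give
a first impression. A real test would need a large mixed-language corpus and
should look at false positives and false negatives separately rather than
plain accuracy.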
