On Tue, 11 Jun 2019, 17:41 RW, <[email protected]> wrote:

> On Tue, 11 Jun 2019 13:43:35 +0300
> Henrik K wrote:
>
> > Does the current stoplist actually do anything useful? Someone
> > should try 10-fold cross validation with and without..
>
> My understanding is that it was intended purely as a speed-up.

Speedup plus less storage was the reason for removing stop words.

> The words
> are chosen to be neutral tokens that won't affect the final result.

These words are neutral tokens only when the user's primary language is English. For a Spanish user, an English mail is highly likely to be spam, so we shouldn't remove stop words in that case.
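To make the language point concrete, here is a rough sketch in Python. It uses a simplified Graham-style per-token estimate, not the actual SpamAssassin Bayes computation, and the function name and all message counts are made up for illustration. It only shows how the same stop word can be neutral in one user's corpus and strongly spammy in another's.

```python
def token_spam_prob(spam_hits, n_spam, ham_hits, n_ham):
    """Simplified estimate of P(spam | token) from per-class message counts.

    This is an illustrative formula, not SpamAssassin's real one.
    """
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: treat as neutral
    return spam_rate / (spam_rate + ham_rate)


# Hypothetical counts for the English stop word "the".
# English-speaking user: "the" shows up in nearly all spam and nearly all ham.
english_user = token_spam_prob(spam_hits=95, n_spam=100, ham_hits=94, n_ham=100)
# Spanish-speaking user: "the" shows up in most spam but almost no ham.
spanish_user = token_spam_prob(spam_hits=90, n_spam=100, ham_hits=2, n_ham=100)

print(f"English user: {english_user:.3f}")
print(f"Spanish user: {spanish_user:.3f}")
```

With these invented counts the token sits close to 0.5 for the English user and well above 0.9 for the Spanish user, so dropping it would throw away a useful signal in the second case.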
