On 14/04/13 15:41, anubhav agarwal wrote:
> I don't think we could take rollback into account for automated
> learning. It is not necessarily the case that the person who edited the
> document and then rolled it back did so because it was spam.

Getting the right data to train from is hard, since wikis are so
flexible. The good points of rollback are that a) it's easy to detect,
b) it's restricted (a random user can't use it), and c) on some wikis
policy restricts its use to “clearly bad edits”.

So you _should_ be training with "unwanted edits". But there will be
false positives.
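
To make that concrete, here's a rough sketch of what training on
rollback data could look like. It's Python with toy data, not MediaWiki
code; using "was rolled back" as the spam label is exactly the
assumption discussed above:

# Minimal multinomial naive Bayes sketch for classifying edits as
# spam/ham. The training label comes from rollback: edits that were
# rolled back count as "unwanted", everything else as "wanted".
# The toy data below is illustrative, not from any real wiki.
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; a real tool would also look at URLs etc."""
    return re.findall(r"[a-z0-9']+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        """Log-space naive Bayes with add-one smoothing."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            total = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total + len(vocab)))
            scores[label] = score
        # Normalise the log scores back into P(spam | text).
        m = max(scores.values())
        exp = {k: math.exp(v - m) for k, v in scores.items()}
        return exp["spam"] / (exp["spam"] + exp["ham"])

# Toy training loop over (added_text, was_rolled_back) pairs.
clf = NaiveBayes()
for text, rolled_back in [
    ("buy cheap pills at http://example.com", True),
    ("fixed a typo in the history section", False),
]:
    clf.train(text, "spam" if rolled_back else "ham")
print(clf.spam_probability("cheap pills here"))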



> Though a "Train as spam" checkbox is a good idea. I was thinking about the
> "report spam" button along with "edit" button on the top-right hand corner
> of a section.

However, that only tells you that "somewhere in the page there is spam",
not where the spam is (the last revision? an edit from 2 months ago?),
nor does it encourage anyone to fix it.
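
For what it's worth, if you did want to turn a page-level report into a
guess at which revision is the offender, you could score the text each
edit added, something like the sketch below. Again Python: clf is the
toy classifier from above, and the revision list is assumed to come
from the API somehow.

# Given a page's revision texts (oldest first) and a trained
# classifier, score the text each revision added and return the one
# that looks most like spam.
import difflib

def added_text(old, new):
    """Lines present in `new` but not in `old`, per a line diff."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return "\n".join(line[2:] for line in diff if line.startswith("+ "))

def most_suspicious_revision(revisions, clf):
    """revisions is a list of (rev_id, text) pairs, oldest first."""
    best = (0.0, None)
    for i in range(1, len(revisions)):
        rev_id, new = revisions[i]
        _, old = revisions[i - 1]
        p = clf.spam_probability(added_text(old, new))
        if p > best[0]:
            best = (p, rev_id)
    return best  # (probability, revision id)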


> I was thinking of creating a job queue for big websites like
> Wikipedia: each edit would go into a queue, be processed offline, and
> later be rolled back to the original content if it triggers the alarm.

I'm not a big fan of this. You would have edit conflicts to handle, and
it looks messy to have reverts made by an extension. I recommend that
you work on the Bayesian detection of spam first, and leave the
refactoring needed to run it through the job queue for later.
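
If that refactoring does happen later, I would picture something with
roughly the following shape, where suspicious edits get flagged for
human review instead of auto-reverted, which sidesteps the
edit-conflict and extension-revert problems. queue.Queue and the 0.9
threshold are stand-ins here, not MediaWiki's actual job queue.

# Edits are enqueued at save time and scored offline by a worker.
import queue

edit_queue = queue.Queue()

def enqueue_edit(rev_id, added_text):
    edit_queue.put((rev_id, added_text))

def worker(clf, flag_for_review, threshold=0.9):
    while True:
        rev_id, text = edit_queue.get()
        if clf.spam_probability(text) >= threshold:
            flag_for_review(rev_id)  # a human decides; no automatic revert
        edit_queue.task_done()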

I think I could look in the archives of deleted pages from the WM-ES
wiki for spam data for you.
