Hi Ben and Tom,

Two summers ago, Open Library had an intern who wrote the incredibly useful IAWatchBot. It detected several kinds of obvious mistakes and vandalism, such as deletion of the subject or ocaid fields. It also used NLTK to detect changes in language, which made it easy to catch spammers who were replacing English text with spam in other languages, and it flagged all suspicious links. Once a day it emailed an admin a digest of what it had found, and if someone confirmed that a flagged edit was spam, the edit was reverted. The code is here:

https://github.com/dmontalvo/IAWatchBot/blob/master/iawatchbot.py
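For anyone who wants to resurrect the idea, the core checks are small. Here is a minimal Python sketch: the record shape (plain dicts keyed by field name) and the stopword-overlap language guess are my assumptions about how such checks could work, not code lifted from IAWatchBot. It needs NLTK and a one-time nltk.download("stopwords"):

    # Two checks in the spirit of IAWatchBot: flag edits that blank out
    # commonly vandalized fields, and flag edits that swap the language
    # of the text. Record shape and method are assumptions, not the
    # bot's actual code.
    from nltk import wordpunct_tokenize
    from nltk.corpus import stopwords

    WATCHED_FIELDS = ("subjects", "ocaid")

    def deleted_fields(old_record, new_record):
        """Watched fields that an edit removed or emptied."""
        return [f for f in WATCHED_FIELDS
                if old_record.get(f) and not new_record.get(f)]

    def guess_language(text):
        """Guess a language by stopword overlap, a common NLTK idiom."""
        tokens = {t.lower() for t in wordpunct_tokenize(text)}
        return max(stopwords.fileids(),
                   key=lambda lang: len(tokens & set(stopwords.words(lang))))

    def language_changed(old_text, new_text):
        """True if an edit appears to change the text's language."""
        return guess_language(old_text) != guess_language(new_text)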
Unfortunately, the bot required quite a bit of daily handholding, and since the people who were holding its hand are no longer working at IA, it is no longer active. But perhaps the community can use the IAWatchBot code to build a better anti-spam bot.

-raj

On Jan 31, 2013, at 2:41 PM, Tom Morris <[email protected]> wrote:

> On Thu, Jan 31, 2013 at 8:56 AM, Ben Companjen <[email protected]> wrote:
>
> Last night I wrote down [1] some ideas for learning to detect "bad" (and
> good) edits. They're not brilliant or new, but hopefully inspiring to anyone
> thinking of building some sort of bot to learn how (not) to edit the
> catalogue. I'm envisioning semi-autonomous bots that suggest corrections and,
> in the long run, fully autonomous bots to do repetitive editing tasks.
>
> It's really useful that the complete history is available. Machine learning
> algorithms can use it to find patterns of bad edits to revert (or even stop
> from happening) and good edits that can be applied to other records.
>
> Has anyone ever tried to apply machine learning to OL?
> Any comments?
>
> I think it would be useful to distinguish among spam, vandalism, and mistakes.
> They (probably) all have different signals associated with them.
>
> OpenLibrary link spam may be close enough to blog comment spam to be able to
> take advantage of solutions which target that space, such as
> http://akismet.com/how/
>
> As far as machine learning goes, if you were going to roll your own solution,
> I think that would be an appropriate tool to use. The trick is to choose the
> right set of features and to train a classifier for edits. The solution will
> also probably need a mechanism for collecting user feedback and updating or
> retraining the model as the spam evolves.
>
> Some features that you didn't mention which might be useful include: age of
> account, IP address of client, some measure of frequency of edits, some
> measure of diversity of records added/edited, domains of links added, etc.
>
> It'd be a cool project, but it seems like there might be higher-priority
> things to work on, with OL basically on life support.
>
> Tom
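P.S. Tom's feature list maps pretty directly onto a standard supervised classifier. A rough sketch with scikit-learn, purely illustrative: the feature names and the labeled_edits input are placeholders, since nobody has assembled this training data yet:

    # Sketch of an edit classifier using features like those Tom lists.
    # Every field name here is a placeholder; the labeled data would
    # have to be built from OL's edit history plus admin feedback.
    from sklearn.ensemble import RandomForestClassifier

    def edit_features(edit):
        """Turn one edit (a dict) into a numeric feature vector."""
        return [
            edit["account_age_days"],        # age of account
            edit["edits_last_24h"],          # frequency of edits
            edit["distinct_records_edited"], # diversity of records touched
            edit["links_added"],             # external links added
            edit["new_link_domains"],        # links to never-seen domains
        ]

    def train(labeled_edits):
        """labeled_edits: list of (edit_dict, is_spam) pairs."""
        X = [edit_features(e) for e, _ in labeled_edits]
        y = [is_spam for _, is_spam in labeled_edits]
        return RandomForestClassifier(n_estimators=100).fit(X, y)

The model's predictions could feed the same admin-review queue IAWatchBot used, with each confirmation or rejection becoming a new training label -- the feedback loop Tom mentions.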
