Hi Ben and Tom,

Two summers ago, Open Library had an intern who wrote the incredibly useful IAWatchBot. It detected several kinds of obvious mistakes and vandalism, such as deletion of the subject or ocaid fields. It also used NLTK to detect changes in language, which made it easy to catch spammers who were replacing English text with spam in other languages, and it flagged all suspicious links. Once a day it emailed an admin a digest of what it had found, and if someone confirmed that a flagged edit was spam, the edit was reverted. The code is here:

https://github.com/dmontalvo/IAWatchBot/blob/master/iawatchbot.py
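For anyone who wants to resurrect the idea, the core checks are small. Here is a minimal Python sketch: the record shape (plain dicts keyed by field name) and the stopword-overlap language guess are my assumptions about how such checks could work, not code lifted from IAWatchBot. It needs NLTK and a one-time nltk.download("stopwords"):

    # Two checks in the spirit of IAWatchBot: flag edits that blank out
    # commonly vandalized fields, and flag edits that swap the language
    # of the text. Record shape and method are assumptions, not the
    # bot's actual code.
    from nltk import wordpunct_tokenize
    from nltk.corpus import stopwords

    WATCHED_FIELDS = ("subjects", "ocaid")

    def deleted_fields(old_record, new_record):
        """Watched fields that an edit removed or emptied."""
        return [f for f in WATCHED_FIELDS
                if old_record.get(f) and not new_record.get(f)]

    def guess_language(text):
        """Guess a language by stopword overlap, a common NLTK idiom."""
        tokens = {t.lower() for t in wordpunct_tokenize(text)}
        return max(stopwords.fileids(),
                   key=lambda lang: len(tokens & set(stopwords.words(lang))))

    def language_changed(old_text, new_text):
        """True if an edit appears to change the text's language."""
        return guess_language(old_text) != guess_language(new_text)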
Unfortunately, the bot required quite a bit of daily handholding, and since the people who were holding its hand are no longer working at IA, it is no longer active. But perhaps the community can use the IAWatchBot code to build a better anti-spam bot.

-raj

On Jan 31, 2013, at 2:41 PM, Tom Morris <[email protected]> wrote:

> On Thu, Jan 31, 2013 at 8:56 AM, Ben Companjen <[email protected]> wrote:
>
> Last night I wrote down [1] some ideas for learning to detect "bad" (and
> good) edits. They're not brilliant or new, but hopefully inspiring to anyone
> thinking of building some sort of bot to learn how (not) to edit the
> catalogue. I'm envisioning semi-autonomous bots that suggest corrections and,
> in the long run, fully autonomous bots to do repetitive editing tasks.
>
> It's really useful that the complete history is available. Machine learning
> algorithms can use it to find patterns of bad edits to revert (or even stop
> from happening) and good edits that can be applied to other records.
>
> Has anyone ever tried to apply machine learning to OL?
> Any comments?
>
> I think it would be useful to distinguish among spam, vandalism, and mistakes.
> They (probably) all have different signals associated with them.
>
> OpenLibrary link spam may be close enough to blog comment spam to be able to
> take advantage of solutions which target that space, such as
> http://akismet.com/how/
>
> As far as machine learning goes, if you were going to roll your own solution,
> I think that would be an appropriate tool to use. The trick is to choose the
> right set of features and to train a classifier for edits. The solution will
> also probably need a mechanism for collecting user feedback and updating or
> retraining the model as the spam evolves.
>
> Some features that you didn't mention which might be useful include: age of
> account, IP address of client, some measure of frequency of edits, some
> measure of diversity of records added/edited, domains of links added, etc.
>
> It'd be a cool project, but it seems like there might be higher-priority
> things to work on, with OL basically on life support.
>
> Tom
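P.S. Tom's feature list maps pretty directly onto a standard supervised classifier. A rough sketch with scikit-learn, purely illustrative: the feature names and the labeled_edits input are placeholders, since nobody has assembled this training data yet:

    # Sketch of an edit classifier using features like those Tom lists.
    # Every field name here is a placeholder; the labeled data would
    # have to be built from OL's edit history plus admin feedback.
    from sklearn.ensemble import RandomForestClassifier

    def edit_features(edit):
        """Turn one edit (a dict) into a numeric feature vector."""
        return [
            edit["account_age_days"],        # age of account
            edit["edits_last_24h"],          # frequency of edits
            edit["distinct_records_edited"], # diversity of records touched
            edit["links_added"],             # external links added
            edit["new_link_domains"],        # links to never-seen domains
        ]

    def train(labeled_edits):
        """labeled_edits: list of (edit_dict, is_spam) pairs."""
        X = [edit_features(e) for e, _ in labeled_edits]
        y = [is_spam for _, is_spam in labeled_edits]
        return RandomForestClassifier(n_estimators=100).fit(X, y)

The model's predictions could feed the same admin-review queue IAWatchBot used, with each confirmation or rejection becoming a new training label -- the feedback loop Tom mentions.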
