On Thu, Jan 31, 2013 at 8:56 AM, Ben Companjen <[email protected]> wrote:

> Last night I wrote down [1] some ideas for learning to detect "bad" (and
> good) edits. They're not brilliant or new, but hopefully inspiring to
> anyone thinking of building some sort of bot to learn how (not) to edit the
> catalogue. I'm envisioning semi-autonomous bots that suggest corrections
> and, in the long run, fully autonomous bots to do repetitive editing tasks.
>
> It's really useful that the complete history is available. Machine
> learning algorithms can use it to find patterns of bad edits to revert (or
> even stop from happening) and good edits that can be applied to other
> records.
>
> Has anyone ever tried to apply machine learning to OL?
> Any comments?
>

I think it would be useful to distinguish among spam, vandalism, and
mistakes. They (probably) all have different signals associated with them.

OpenLibrary link spam may be close enough to blog comment spam to take
advantage of solutions that target that space, such as Akismet:
http://akismet.com/how/

As far as machine learning goes, if you were going to roll your own
solution, I think it would be an appropriate tool. The trick is to choose
the right set of features and to train a classifier on edits. The solution
will probably also need a mechanism for collecting user feedback and for
updating or retraining the model as the spam evolves.

Some features you didn't mention that might be useful: age of account,
client IP address, some measure of edit frequency, some measure of the
diversity of records added/edited, domains of links added, etc.
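To make that concrete, here's a toy sketch of an edit classifier: a
mistake-driven linear classifier (a perceptron) over a few of the features
above. The feature set and the handful of training rows are invented for
illustration only; a real system would extract features from OL's edit
history and need far more labeled data.

```python
# Toy perceptron for classifying edits as spam (+1) or good (-1).
# All feature names and data below are hypothetical, for illustration.

def train_perceptron(rows, labels, epochs=20, lr=0.1):
    """Learn weights for a linear classifier: sign(w . x + b)."""
    n = len(rows[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):      # y is +1 (spam) or -1 (good)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:                   # mistake-driven update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def classify(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Features: [account age (days), edits/hour, distinct records touched, links added]
X = [[800, 0.5, 50, 0],    # established editor -> good
     [365, 1.0, 30, 1],    # good
     [1,  40.0,  2, 15],   # new account, link-heavy burst -> spam
     [2,  25.0,  1, 10]]   # spam
y = [-1, -1, 1, 1]

w, b = train_perceptron(X, y)

# Score a new edit: very young account, fast, link-heavy.
print(classify(w, b, [2, 30.0, 1, 12]))   # prints 1 (flag as likely spam)
```

Because the perceptron update is mistake-driven, the same update rule can
be applied incrementally as user feedback arrives, which is one way to get
the retraining mechanism mentioned above.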

It'd be a cool project, but with OL basically on life support, it seems
like there might be higher-priority things to work on.

Tom
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to 
[email protected]
