Hi Tom,

Awesome ideas! I was playing with similar ideas in my head a while ago, but never got around to them. I'm still enrolled in the Coursera machine learning course and consider this an interesting problem.

I remember thinking about automatically correcting records too - the things I used my VacuumBot for: normalising format names, cleaning up various fields - although it's probably hard to do this right. For this type of machine support I would want to look at human-edited/bot-supported changes per field, under the assumption that such edits correct (small) human mistakes. Hmm, VacuumBot did very monotonous work, but could the records it targeted be a source for a sample of human submissions?
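To make the per-field idea a bit more concrete, here is a minimal sketch in Python. It assumes each revision of a record is available as a plain dictionary; the function name and the sample field values are made up for illustration and are not taken from the OpenLibrary code.

```python
def changed_fields(before, after):
    """Return {field: (old, new)} for every top-level field that differs.

    A field that exists in only one revision is reported with None
    on the side where it is missing.
    """
    changes = {}
    for field in sorted(set(before) | set(after)):
        old, new = before.get(field), after.get(field)
        if old != new:
            changes[field] = (old, new)
    return changes


# Hypothetical revisions of one record: a human submission followed by
# a bot clean-up. The values are illustrative only.
human_rev = {"title": "Moby Dick", "physical_format": "paper back"}
bot_rev = {"title": "Moby Dick", "physical_format": "Paperback"}

print(changed_fields(human_rev, bot_rev))
```

Collected over many records, such per-field pairs would show which fields get corrected most often and what the typical corrections look like.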
I was thinking "online learning" might be a useful approach, but it takes a crowd to train such a system. People are excellent at detecting spam, of course, but you will want to spread ham detection over a lot of people if you're going to go this way. Assuming most submissions are okay might be a reasonable alternative.

Concerning integration: an article on Wikipedia vandalism detection [1] mentions alerting editors when a vandalism submission is detected. I haven't read the full article though. A web service (or cluster of services) could be run separately from OpenLibrary that silently flags submissions and users for review. When the ratio of false positives (ham submissions classified as spam) to all submissions is low enough, OpenLibrary could start blocking spam.

Keep us posted!

Ben

[1]: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1494&context=cis_papers

On 9 September 2015 at 03:26, Tom Morris <[email protected]> wrote:
> I'm going to experiment with training some machine learning classifiers to
> detect OpenLibrary spam submissions and spammer accounts. To do this, I'll
> need a corpus of known spam. My current thinking is to take the revert
> history (https://openlibrary.org/recentchanges/revert), extract the reverted
> changesets, examine them to get the reverted revisions and save those as my
> spam training set. Does this seem like a reasonable approach? Has someone
> already curated a corpus of OL spam which would make this effort
> unnecessary?
>
> What attributes should I try for in terms of size, variety, etc. of the
> corpus? For reference, there are just under 5,000 spam accounts identified
> in the reversion history with almost 200,000 changesets, each containing at
> least one change. Given that a lot of the spam is mostly identical, I was
> thinking I'd go for a) diversity over time and b) diversity of accounts.
> Any other attributes to attempt to diversify? Language? Character set?
> Other attributes?
>
> The ham training set is a little trickier since a) I want submissions from
> humans, not bots and b) not all of the spam has been identified, meaning that
> a random sample may contain both spam and ham. On the other hand, if I just
> use edits from a hand-picked list of known good accounts, it may lack
> diversity. I'd also, if possible, like to pick a set of accounts/edits
> which are distributed in a similar way to the spam. As a first
> approximation, I may just go with human-added books which are still in the
> database, on the assumption that the bulk of them will not be spam, and then
> iterate from there. Does anyone have any better ideas?
>
> What format would be most useful for people? The two obvious ones are JSON
> dictionaries of the entire document and just the extracted text, perhaps all
> concatenated together or perhaps in separate fields. Any opinions?
>
> Thanks in advance for any feedback. I'll post the corpus someplace public
> when it's done (along with the corpus extraction code).
>
> Tom
>
> p.s. I included "spam detection" in the subject line, but didn't really
> address it. Does anyone have any good ideas for where to integrate any
> resulting classifiers? It looks like there may be some rudimentary code to
> check a static spam words list when users attempt to add a new book, but
> that's obviously not slowing down the spam much.
>
> _______________________________________________
> Ol-tech mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> Archives: http://www.mail-archive.com/[email protected]/
> To unsubscribe from this mailing list, send email to
> [email protected]
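
The "flag silently first, block later" idea from the reply above could be sketched roughly as follows in Python. This is only an illustration: it assumes reviewers record, for each reviewed submission, whether the classifier flagged it and whether it really was spam, and the function names and both thresholds (max_ratio, min_reviewed) are invented placeholders, not values anyone on the list proposed.

```python
def false_positive_ratio(reviewed):
    """Ratio of false positives to all reviewed submissions.

    `reviewed` is a sequence of (flagged_as_spam, is_really_spam) pairs.
    A false positive is ham that the classifier flagged as spam.
    Returns None when nothing has been reviewed yet.
    """
    reviewed = list(reviewed)
    if not reviewed:
        return None
    false_positives = sum(
        1 for flagged, is_spam in reviewed if flagged and not is_spam
    )
    return false_positives / len(reviewed)


def may_block(reviewed, max_ratio=0.001, min_reviewed=1000):
    """Decide whether flagged submissions may be blocked outright.

    Both thresholds are hypothetical and would need tuning.
    """
    reviewed = list(reviewed)
    ratio = false_positive_ratio(reviewed)
    if ratio is None or len(reviewed) < min_reviewed:
        return False  # not enough evidence yet, keep flagging silently
    return ratio <= max_ratio


# Made-up review log: 50 spam caught, 1 ham wrongly flagged, 949 ham passed.
sample = [(True, True)] * 50 + [(True, False)] * 1 + [(False, False)] * 949
print(false_positive_ratio(sample), may_block(sample))
```

Requiring a minimum number of reviewed submissions keeps a lucky small sample from switching blocking on too early.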
