I'm going to experiment with training some machine learning classifiers to detect OpenLibrary spam submissions and spammer accounts. To do this, I'll need a corpus of known spam. My current thinking is to take the revert history (https://openlibrary.org/recentchanges/revert), extract the reverted changesets, examine them to get the reverted revisions and save those as my spam training set. Does this seem like a reasonable approach? Has someone already curated a corpus of OL spam which would make this effort unnecessary?
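To make the extraction step concrete, here's a minimal sketch of pulling the reverted (i.e. spam) revisions out of a single revert changeset. This assumes the recentchanges feed is also available as JSON (e.g. /recentchanges/revert.json) and that each changeset lists affected documents under a "changes" field holding the new, post-revert revision number; under that assumption, the spam revision would be the one immediately before it. Field names here are guesses, so adjust to whatever the feed actually returns:

```python
def reverted_revisions(changeset):
    """Given one revert changeset (a dict in the assumed feed shape),
    yield (key, revision) pairs for the revisions that were reverted.
    Each "changes" entry records the post-revert revision, so the
    reverted spam revision is assumed to be the one just before it."""
    for change in changeset.get("changes", []):
        if change["revision"] > 1:  # revision 1 has no predecessor
            yield change["key"], change["revision"] - 1

# Synthetic changeset in the assumed shape:
sample = {
    "kind": "revert",
    "author": {"key": "/people/some_admin"},
    "changes": [
        {"key": "/books/OL1M", "revision": 5},
        {"key": "/works/OL2W", "revision": 3},
    ],
}
print(list(reverted_revisions(sample)))
# → [('/books/OL1M', 4), ('/works/OL2W', 2)]
```

The actual spam documents would then be fetched per (key, revision) pair and saved as the training set.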
What attributes should I aim for in terms of size, variety, etc. of the corpus? For reference, there are just under 5,000 spam accounts identified in the reversion history, with almost 200,000 changesets each containing at least one change. Given that a lot of the spam is mostly identical, I was thinking I'd go for a) diversity over time and b) diversity of accounts. Are there other attributes worth diversifying on? Language? Character set? Anything else?

The ham training set is a little trickier, since a) I want submissions from humans, not bots, and b) not all of the spam has been identified, meaning a random sample may contain both spam and ham. On the other hand, if I just use edits from a hand-picked list of known-good accounts, it may lack diversity. If possible, I'd also like to pick a set of accounts/edits that are distributed similarly to the spam. As a first approximation, I may just go with human-added books which are still in the database, on the assumption that the bulk of them will not be spam, and then iterate from there. Does anyone have any better ideas?

What format would be most useful for people? The two obvious ones are JSON dictionaries of the entire document, and just the extracted text, perhaps all concatenated together or perhaps in separate fields. Any opinions?

Thanks in advance for any feedback. I'll post the corpus someplace public when it's done (along with the corpus extraction code).

Tom

p.s. I included "spam detection" in the subject line but didn't really address it. Does anyone have any good ideas for where to integrate any resulting classifiers? It looks like there may be some rudimentary code that checks a static spam-words list when users attempt to add a new book, but that's obviously not slowing down the spam much.
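For what it's worth, the "diversity over time and diversity of accounts" idea above can be sketched as a simple stratified sample: bucket changesets by (account, month) and cap how many are kept per bucket, so one prolific spammer's flood doesn't dominate the corpus. The field names ("author", "timestamp") are assumptions about the changeset shape:

```python
from collections import defaultdict

def diversify(changesets, per_bucket=2):
    """Keep at most per_bucket changesets per (account, month) bucket,
    trading raw volume for diversity across accounts and time.
    Assumes each changeset dict carries an author key and an ISO
    timestamp; those field names are guesses."""
    buckets = defaultdict(list)
    for cs in changesets:
        author = cs["author"]["key"]
        month = cs["timestamp"][:7]  # "YYYY-MM"
        buckets[(author, month)].append(cs)
    sample = []
    for group in buckets.values():
        sample.extend(group[:per_bucket])
    return sample

# Synthetic example: one prolific spammer, one occasional one.
spam = [
    {"author": {"key": "/people/spam1"},
     "timestamp": "2011-06-0%dT00:00:00" % i, "id": i}
    for i in range(1, 6)
] + [
    {"author": {"key": "/people/spam2"},
     "timestamp": "2011-07-01T00:00:00", "id": 99}
]
picked = diversify(spam, per_bucket=2)
print(len(picked))  # → 3: two from spam1's June flood, one from spam2
```

Further strata (language, character set) could be folded into the bucket key the same way.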
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
Archives: http://www.mail-archive.com/[email protected]/
To unsubscribe from this mailing list, send email to [email protected]
