On 9 September 2015 at 13:26, Tom Morris <[email protected]> wrote:

> I'm going to experiment with training some machine learning classifiers to
> detect OpenLibrary spam submissions and spammer accounts. To do this, I'll
> need a corpus of known spam. My current thinking is to take the revert
> history (https://openlibrary.org/recentchanges/revert), extract the
> reverted changesets, examine them to get the reverted revisions and save
> those as my spam training set. Does this seem like a reasonable approach?
> Has someone already curated a corpus of OL spam which would make this
> effort unnecessary?

Hi Tom,

A while back I wrote a little script to try to catch OL spam entries based on the pattern of a new user adding many works soon after account creation. Most legitimate new users do not add new works immediately. It worked relatively well for such a simplistic algorithm, so I put it online and set it to run automatically every day. It lists *all* new users who have added works, highlights the ones that look like spam, and links to the admin interface to delete/revert the user (which in turn should delete all the associated works; sometimes this seems to fail, but infrequently).
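In rough Ruby, the heuristic amounts to something like this. This is only an illustrative sketch, not the actual script: the hash keys (:created, :works), the method name, and the one-day window are all my stand-ins here, not the real OL data shape.

```ruby
require 'time'

# Illustrative sketch: flag accounts whose first added work appears within
# a short window of account creation. Field names are hypothetical.
SUSPICION_WINDOW = 24 * 60 * 60  # one day, in seconds (an assumed cutoff)

def suspicious_new_user?(user)
  return false if user[:works].empty?
  first_work = user[:works].map { |w| w[:created] }.min
  (first_work - user[:created]) <= SUSPICION_WINDOW
end
```

A legitimate user who signs up and only adds works days or weeks later falls outside the window and is left alone.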
Here is the tool: http://ol-spam-finder.herokuapp.com/

It updates every day, and I sporadically re-run the checks by month when I see that a lot of reverts have occurred, in order to keep the data current. I've been in touch with Jessamyn, who has been using it to perform those reverts in the recent changes list.

> What attributes should I try for in terms of size, variety, etc. of the
> corpus? For reference, there are just under 5,000 spam accounts identified
> in the reversion history, with almost 200,000 changesets each containing at
> least one change. Given that a lot of the spam is mostly identical, I was
> thinking I'd go for a) diversity over time and b) diversity of accounts.
> Any other attributes to attempt to diversify? Language? Character set?
> Other attributes?
>
> The ham training set is a little trickier since a) I want submissions from
> humans, not bots, and b) not all of the spam has been identified, meaning
> that a random sample may contain both spam and ham. On the other hand, if I
> just use edits from a hand-picked list of known good accounts, it may lack
> diversity. I'd also, if possible, like to pick a set of accounts/edits
> which are distributed in a similar way to the spam. As a first
> approximation, I may just go with human-added books which are still in the
> database, on the assumption that the bulk of them will not be spam, and then
> iterate from there. Does anyone have any better ideas?

I'm not sure how this tool affects what you want to do, but the algorithm I've used is pretty basic once it narrows down to new users who add works.
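As an aside, on your diversity question: one simple way to get both diversity over time and diversity of accounts would be to bucket the reverted changesets by (account, month) and cap how many you take from each bucket, so neither a single spammer nor a single burst dominates the training set. A hypothetical sketch, with illustrative field names rather than the real changeset format:

```ruby
require 'time'

# Hypothetical sampling sketch: group changesets by (account, month) and
# take at most `per_bucket` from each group. Field names are assumptions.
def diversified_sample(changesets, per_bucket: 3)
  changesets
    .group_by { |c| [c[:account], c[:timestamp].strftime('%Y-%m')] }
    .flat_map { |_, bucket| bucket.first(per_bucket) }
end
```

A burst of near-identical changesets from one account in one month then contributes at most `per_bucket` examples, however large the burst was.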
Here are the regular expressions and matches used to determine whether a work is spam (coded in Ruby). Currently it only looks at the title; it surprised me how effective that is:

    book.title && (
      book.title =~ /[【〚〖┫『《▶➸。ㆍ→≒♥⑧]/ ||
      # book.title.include?('tpm1004.com') ||
      book.title.include?('-BAMWAR닷컴') ||
      book.title.include?('★최신') ||
      book.title =~ /(PDF|FREE|EBOOK|FONT|DRIVER) DOWNLOAD$/ ||
      book.title.include?("POOR CHARLIE'S ALMANACK EBOOK") ||
      book.title =~ /\p{Hangul}.+(COM|com|net|NET|CoM)/ || # Korean with .com, .net, etc.
      book.title.include?('바카라') || # "baccarat" in Korean
      book.title.include?('\\') ||
      book.title =~ /\+\d{9}/ # phone numbers
    )

Some of the strings are very specific, added to catch large amounts of spam that occurred over a particular period. The domain-extension check could be written to catch more combinations, but again, it seems effective enough as it is.

The source code isn't on GitHub yet, but I've been meaning to put it there. I'm more than happy to collaborate or share the approach I've used. It's very rough but has been surprisingly effective at catching most spam. A number of one-off, presumably manually entered, spam works do slip through, and any spammer who pre-creates an account days before adding their first work will slip through too, but it is catching the bulk spammers. Those, as far as I can tell, must be using some combination of manually directed automated tools. I did a little research, and there are people who will charge to use manual labour to break any anti-spam-bot test. With that as a possibility, the only option left is to disincentivise the spammers by deleting their spam promptly, so that the cost of adding it isn't worth it.

Regards,
Charles.

It'll be funny if this message gets blocked as spam because of the Korean and other odd Unicode characters in the code above!
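P.S. For anyone who wants to reuse the checks, here is the same title test wrapped as a standalone predicate so it can be unit-tested. The patterns are copied from the rules quoted above; the method name spam_title? and the pattern-list structure are my own rearrangement.

```ruby
# The title rules above, rewritten as a reusable predicate. Each entry is
# one of the quoted checks; include? calls become literal-match regexes.
SPAM_TITLE_PATTERNS = [
  /[【〚〖┫『《▶➸。ㆍ→≒♥⑧]/,                  # odd punctuation common in the spam
  /-BAMWAR닷컴/,
  /★최신/,
  /(PDF|FREE|EBOOK|FONT|DRIVER) DOWNLOAD$/,
  /POOR CHARLIE'S ALMANACK EBOOK/,
  /\p{Hangul}.+(COM|com|net|NET|CoM)/,       # Korean text with .com/.net etc.
  /바카라/,                                   # "baccarat" in Korean
  /\\/,                                      # a literal backslash in the title
  /\+\d{9}/                                  # phone numbers
].freeze

def spam_title?(title)
  return false if title.nil?
  SPAM_TITLE_PATTERNS.any? { |re| title.match?(re) }
end
```

The nil guard plays the role of the leading `book.title &&` in the original expression.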
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
Archives: http://www.mail-archive.com/[email protected]/
To unsubscribe from this mailing list, send email to [email protected]
