Re: [ol-tech] OpenLibrary spam corpus and spam detection

Ben Companjen Thu, 10 Sep 2015 14:44:25 -0700

Hi Tom,

Awesome ideas! In my head I was playing with similar ideas a while
ago, but never got to it. I'm still enrolled in the Coursera machine
learning course and consider this an interesting problem. I remember
thinking about automatically correcting records too - things I used my
VacuumBot for: normalising format names, cleaning up various fields -
although it's probably hard to do this right. For this type of machine
support I would want to look at human-edited/bot-supported changes per
field, under the assumption that such edits are correcting (small)
human mistakes. Hmm, VacuumBot did very monotonous work, but could the
records it targeted be a source for a sample of human submissions?


I was thinking "online learning" might be a useful approach, but it
takes a crowd to train such a system. People are excellent at
detecting spam, of course, but you will want to divide ham detection
over a lot of people if you're going to go this way. Assuming most
submissions are okay might be a reasonable alternative.
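
To make the "online learning" idea a bit more concrete, here is a
minimal sketch of a classifier that is updated one labelled submission
at a time, so every human verdict improves it immediately. It is only
an illustration: the class name, the tokeniser and the choice of naive
Bayes are my assumptions, not anything that exists in OpenLibrary.

```python
import math
import re
from collections import Counter


class OnlineNaiveBayes:
    """Multinomial naive Bayes updated one labelled example at a time.

    Hypothetical sketch: the labels "spam"/"ham" and the tokeniser are
    assumptions made for illustration only.
    """

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    @staticmethod
    def tokens(text):
        # Very crude tokeniser: lowercase runs of ASCII letters and digits.
        return re.findall(r"[a-z0-9]+", text.lower())

    def learn(self, text, label):
        """Fold one reviewed submission into the model."""
        toks = self.tokens(text)
        self.word_counts[label].update(toks)
        self.vocab.update(toks)
        self.doc_counts[label] += 1

    def spam_log_odds(self, text):
        """Return log P(spam|text) - log P(ham|text); positive means spam."""
        # Add-one (Laplace) smoothing on both the prior and the word counts.
        score = math.log(
            (self.doc_counts["spam"] + 1) / (self.doc_counts["ham"] + 1)
        )
        v = len(self.vocab) + 1
        spam_total = sum(self.word_counts["spam"].values())
        ham_total = sum(self.word_counts["ham"].values())
        for tok in self.tokens(text):
            p_spam = (self.word_counts["spam"][tok] + 1) / (spam_total + v)
            p_ham = (self.word_counts["ham"][tok] + 1) / (ham_total + v)
            score += math.log(p_spam / p_ham)
        return score
```

Each time a reviewer confirms or rejects a flag you would call learn()
with that verdict. Under the "most submissions are okay" assumption,
unreviewed submissions could be fed in as ham, at the cost of some
label noise.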

Concerning integration: in an article on Wikipedia vandalism detection
[1] I see a mention of alerting editors when a vandalism submission is
detected. I haven't read the full article though.
A web service (or cluster of services) could be run separately from
OpenLibrary that silently flags submissions and users for review. When
the ratio of false positives (ham submissions classified as spam) to
all submissions is low enough, OpenLibrary could start blocking spam.
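
As a rough sketch of that "flag silently first, block later" rule: the
function below computes the ratio of false positives to all reviewed
submissions and only allows blocking once enough reviews are in and the
ratio is under a threshold. The function names and both default
thresholds are made up for illustration; the real numbers would have to
come from whatever the community finds acceptable.

```python
def false_positive_ratio(reviews):
    """Ratio of ham submissions flagged as spam to all reviewed submissions.

    `reviews` is an iterable of (flagged_as_spam, actually_spam) booleans,
    one pair per submission that a human has checked.
    """
    total = 0
    false_positives = 0
    for flagged_as_spam, actually_spam in reviews:
        total += 1
        if flagged_as_spam and not actually_spam:
            false_positives += 1
    return false_positives / total if total else 0.0


def may_block(reviews, max_ratio=0.001, min_reviews=1000):
    """Decide whether the service may move from flagging to blocking.

    Both thresholds are assumed values, not recommendations.
    """
    reviews = list(reviews)
    if len(reviews) < min_reviews:
        # Too little evidence yet: keep flagging silently.
        return False
    return false_positive_ratio(reviews) <= max_ratio
```

Until may_block() returns True the service would only record its
verdicts for review, so a bad model cannot hurt legitimate editors.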

Keep us posted!

Ben

[1]: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1494&context=cis_papers

On 9 September 2015 at 03:26, Tom Morris <[email protected]> wrote:
> I'm going to experiment with training some machine learning classifiers to
> detect OpenLibrary spam submissions and spammer accounts.  To do this, I'll
> need a corpus of known spam.  My current thinking is to take the revert
> history (https://openlibrary.org/recentchanges/revert), extract the reverted
> changesets, examine them to get the reverted revisions and save those as my
> spam training set.  Does this seem like a reasonable approach?  Has someone
> already curated a corpus of OL spam which would make this effort
> unnecessary?
>
> What attributes should I try for in terms of size, variety, etc. of the
> corpus?  For reference, there are just under 5,000 spam accounts identified
> in the reversion history with almost 200,000 changesets each containing at
> least one change.  Given that a lot of the spam is mostly identical, I was
> thinking I'd go for a) diversity over time and b) diversity of accounts.
> Any other attributes to attempt to diversify?  Language?  Character set?
> Other attributes?
>
> The ham training set is a little trickier since a) I want submissions from
> humans, not bots and b) not all of the spam has been identified meaning that
> a random sample may contain both spam and ham. On the other hand, if I just
> use edits from a hand picked list of known good accounts, it may lack
> diversity.  I'd also, if possible, like to pick a set of accounts/edits
> which are distributed in a similar way as the spam.  As a first
> approximation, I may just go with human added books which are still in the
> database on the assumption that the bulk of them will not be spam and then
> iterate from there.  Does anyone have any better ideas?
>
> What format would be most useful for people?  The two obvious ones are JSON
> dictionaries of the entire document and just the extracted text, perhaps all
> concatenated together or perhaps in separate fields.  Any opinions?
>
> Thanks in advance for any feedback.  I'll post the corpus someplace public
> when it's done (along with the corpus extraction code).
>
> Tom
>
> p.s. I included "spam detection" in the subject line, but didn't really
> address it.  Does anyone have any good ideas for where to integrate any
> resulting classifiers?  It looks like there may be some rudimentary code to
> check a static spam words list when users attempt to add a new book, but
> that's obviously not slowing down the spam much.
>
> _______________________________________________
> Ol-tech mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> Archives: http://www.mail-archive.com/[email protected]/
> To unsubscribe from this mailing list, send email to
> [email protected]
