I'm going to experiment with training some machine learning classifiers to
detect OpenLibrary spam submissions and spammer accounts.  To do this, I'll
need a corpus of known spam.  My current thinking is to take the revert
history (https://openlibrary.org/recentchanges/revert), extract the
reverted changesets, examine them to get the reverted revisions and save
those as my spam training set.  Does this seem like a reasonable approach?
Has someone already curated a corpus of OL spam which would make this
effort unnecessary?
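To make the extraction step concrete, here's a rough sketch of how I'd pull the reverted (i.e. spam) revisions out of a revert changeset. The field names ("changes", "key", "revision") are my guess at the shape of the recentchanges JSON, and I'm assuming the revision recorded in a revert changeset is the one the revert *created*, so the spam revision is the one just before it -- both assumptions need checking against real data:

```python
def reverted_revisions(changeset):
    """Given a revert changeset dict (shape assumed from the
    /recentchanges JSON API), yield (key, spam_revision) pairs.
    Each "changes" entry records the new revision created by the
    revert; the spam revision is assumed to be the one before it."""
    for change in changeset.get("changes", []):
        yield change["key"], change["revision"] - 1

# Hypothetical example changeset, trimmed to just the fields used above.
example = {
    "kind": "revert",
    "author": {"key": "/people/admin"},
    "changes": [{"key": "/books/OL99999M", "revision": 5}],
}
print(list(reverted_revisions(example)))  # [('/books/OL99999M', 4)]
```

The spam revision keys would then get fetched individually and saved as the training set.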

What attributes should I aim for in terms of size, variety, etc. of the
corpus?  For reference, there are just under 5,000 spam accounts identified
in the reversion history with almost 200,000 changesets each containing at
least one change.  Given that a lot of the spam is mostly identical, I was
thinking I'd go for a) diversity over time and b) diversity of accounts.
Any other attributes to attempt to diversify?  Language?  Character set?
Other attributes?
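The "diversity over time and accounts" idea amounts to stratified sampling: bucket the changesets by (account, month) and draw a capped number from each bucket, so no single prolific spammer or burst of activity dominates. A minimal sketch, with field names assumed to match the changeset JSON:

```python
import random
from collections import defaultdict

def stratified_sample(changesets, per_bucket=1, seed=0):
    """Group changesets into (account, year-month) buckets and draw at
    most `per_bucket` from each, spreading the sample across both time
    and accounts.  "author" and "timestamp" field names are assumed."""
    buckets = defaultdict(list)
    for cs in changesets:
        month = cs["timestamp"][:7]  # "YYYY-MM" prefix of ISO timestamp
        buckets[(cs["author"], month)].append(cs)
    rng = random.Random(seed)
    sample = []
    for group in buckets.values():
        sample.extend(rng.sample(group, min(per_bucket, len(group))))
    return sample

# Hypothetical toy data: two accounts, two months -> two buckets.
data = [
    {"author": "/people/spam1", "timestamp": "2012-01-03T00:00:00"},
    {"author": "/people/spam1", "timestamp": "2012-01-09T00:00:00"},
    {"author": "/people/spam2", "timestamp": "2012-02-01T00:00:00"},
]
print(len(stratified_sample(data)))  # 2
```

Language and character set could be folded in the same way, by adding them to the bucket key once they're detectable.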

The ham training set is a little trickier since a) I want submissions from
humans, not bots, and b) not all of the spam has been identified, meaning
that a random sample may contain both spam and ham. On the other hand, if I
just use edits from a hand-picked list of known good accounts, it may lack
diversity.  I'd also, if possible, like to pick a set of accounts/edits
which are distributed in a similar way as the spam.  As a first
approximation, I may just go with human added books which are still in the
database on the assumption that the bulk of them will not be spam and then
iterate from there.  Does anyone have any better ideas?
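As a first cut at that approximation, the ham sample could be a random draw of still-extant edits with known bot accounts filtered out. The bot list here is made up for illustration; presumably OL has a real way to identify bot accounts that should be used instead:

```python
import random

# Hypothetical bot list -- stand-in for however OL actually flags bots.
KNOWN_BOTS = {"/people/ImportBot", "/people/AccountBot"}

def ham_candidates(edits, n, seed=0):
    """Draw n random edits whose author isn't a known bot.  This is
    only a first approximation: some spam may slip through, so the
    sample still needs review and iteration."""
    humans = [e for e in edits if e["author"] not in KNOWN_BOTS]
    rng = random.Random(seed)
    return rng.sample(humans, min(n, len(humans)))

edits = [
    {"author": "/people/ImportBot", "key": "/books/OL1M"},
    {"author": "/people/alice", "key": "/books/OL2M"},
    {"author": "/people/bob", "key": "/books/OL3M"},
]
print(len(ham_candidates(edits, 2)))  # 2
```

The same (account, month) stratification used for the spam set could then be applied here to get the matching distribution.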

What format would be most useful for people?  The two obvious options are
JSON dictionaries of the entire document, and just the extracted text,
either concatenated together or kept in separate fields.  Any opinions?
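One way to avoid choosing is a record that carries both: the full document plus the extracted text, one JSON object per revision (e.g. as JSON Lines). The field names below are my own invention, not an existing schema:

```python
import json

# Sketch of one corpus record (hypothetical schema, not an OL format):
# the full document for people who want structure, plus a flat "text"
# field for people who just want tokens.
record = {
    "key": "/books/OL99999M",
    "revision": 4,
    "label": "spam",
    "author": "/people/spam1",
    "timestamp": "2012-01-03T00:00:00",
    "document": {"title": "Cheap pills", "description": "Buy now ..."},
    "text": "Cheap pills\nBuy now ...",
}
line = json.dumps(record)  # one line per record in a .jsonl file
print(len(line) > 0)
```

Consumers who only want the text can ignore "document", and vice versa.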

Thanks in advance for any feedback.  I'll post the corpus someplace public
when it's done (along with the corpus extraction code).

Tom

p.s. I included "spam detection" in the subject line, but didn't really
address it.  Does anyone have any good ideas for where to integrate any
resulting classifiers?  It looks like there may be some rudimentary code to
check a static spam words list when users attempt to add a new book, but
that's obviously not slowing down the spam much.
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech