On 9 September 2015 at 13:26, Tom Morris <[email protected]> wrote:

> I'm going to experiment with training some machine learning classifiers to
> detect OpenLibrary spam submissions and spammer accounts.  To do this, I'll
> need a corpus of known spam.  My current thinking is to take the revert
> history (https://openlibrary.org/recentchanges/revert), extract the
> reverted changesets, examine them to get the reverted revisions and save
> those as my spam training set.  Does this seem like a reasonable approach?
> Has someone already curated a corpus of OL spam which would make this
> effort unnecessary?
>
>
Hi Tom,
A while back I wrote a little script to try to catch OL spam entries based
on the pattern of a new user adding many works soon after account creation;
most legitimate new users do not add new works immediately. It worked
relatively well for such a simplistic algorithm, so I put it online and set
it to run automatically every day, listing *all* new users who have added
works and highlighting the ones that look like spam, with links to the
admin interface to delete/revert the user (which in turn should delete all
the associated works; this occasionally seems to fail, but only
infrequently).
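The heuristic described above can be sketched roughly like this (a sketch only, not the actual script; the `User` struct and its fields, and the thresholds, are hypothetical stand-ins for whatever the real script reads from OpenLibrary):

```ruby
require 'date'

# Hypothetical stand-in for the account data the real script would fetch.
User = Struct.new(:name, :created_at, :first_work_added_at, :works_added)

# Flag accounts that start adding works within a day of creation and add
# more than a handful of them. Thresholds are illustrative defaults.
def suspicious?(user, grace_period_days: 1, max_initial_works: 3)
  return false if user.works_added.zero?
  delay = user.first_work_added_at - user.created_at # in days
  delay <= grace_period_days && user.works_added > max_initial_works
end

fast_spammer = User.new('spammy1', Date.new(2015, 9, 1), Date.new(2015, 9, 1), 40)
slow_editor  = User.new('reader99', Date.new(2015, 1, 1), Date.new(2015, 6, 1), 2)

puts suspicious?(fast_spammer) # => true
puts suspicious?(slow_editor)  # => false
```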

Here is the tool: http://ol-spam-finder.herokuapp.com/  It updates every
day, and when I see that a lot of reverts have occurred I re-run the checks
for that month to keep the data current.

I've been in touch with Jessamyn, who has been using it to perform those
reverts in the recent changes list.


> What attributes should I try for in terms of size, variety, etc of the
> corpus.  For reference, there are just under 5,000 spam accounts identified
> in the reversion history with almost 200,000 changesets each containing at
> least one change.  Given that a lot of the spam is mostly identical, I was
> thinking I'd go for a) diversity over time and b) diversity of accounts.
> Any other attributes to attempt to diversify?  Language?  Character set?
> Other attributes?
>
> The ham training set is a little trickier since a) I want submissions from
> humans, not bots and b) not all of the spam has been identified meaning
> that a random sample may contain both spam and ham. On the other hand, if I
> just use edits from a hand picked list of known good accounts, it may lack
> diversity.  I'd also, if possible, like to pick a set of accounts/edits
> which are distributed in a similar way as the spam.  As a first
> approximation, I may just go with human added books which are still in the
> database on the assumption that the bulk of them will not be spam and then
> iterate from there.  Does anyone have any better ideas?
>
>
I'm not sure how this tool affects what you are wanting to do, but the
algorithm I use is pretty basic once it narrows down to new users who add
works. Here are the regular expressions and matches that determine whether
a work is spam (coded in Ruby). Currently it only looks at the title; I was
surprised by how effective that is:

    book.title && (
      book.title =~ /[【〚〖┫『《▶➸。ㆍ→≒♥⑧]/ ||
      # book.title.include?('tpm1004.com') ||
      book.title.include?('-BAMWAR닷컴') ||
      book.title.include?('★최신') ||
      book.title =~ /(PDF|FREE|EBOOK|FONT|DRIVER) DOWNLOAD$/ ||
      book.title.include?("POOR CHARLIE'S ALMANACK EBOOK") ||
      book.title =~ /\p{Hangul}.+(COM|com|net|NET|CoM)/ || # Korean followed by .com, .net, etc.
      book.title.include?('바카라') || # Baccarat in Korean
      book.title.include?('\\') ||
      book.title =~ /\+\d{9}/ # phone numbers
    )

Some of the patterns are very specific, written to catch large amounts of
spam that occurred historically over a particular period of time. The
domain-extension check could be written better to catch more combinations,
but again, it seems to be effective enough as it is.
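For what it's worth, a slightly broader version of that domain check might look something like this (a sketch, not the deployed code: the case-insensitive flag covers all casings of the extensions, and the TLD list here is illustrative, not exhaustive):

```ruby
# Sketch of a broader domain-extension check: Hangul text followed
# somewhere by a dotted TLD, matched case-insensitively. The TLD list
# is illustrative only.
DOMAIN_SPAM = /\p{Hangul}.*\.(com|net|org|kr|cc)\b/i

def domain_spam?(title)
  !!(title =~ DOMAIN_SPAM)
end

puts domain_spam?('바카라사이트 bamwar11.com')  # => true
puts domain_spam?('An Ordinary English Title') # => false
```

Anchoring on a literal dot trades a little recall for fewer accidental matches on titles that merely contain those letter sequences.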

The source code isn't on GitHub yet, but I have been meaning to put it
there. I am more than happy to collaborate or share the approach I have
used. It's very rough, but it has been surprisingly effective at catching
most spam.

There do seem to be a number of one-off, presumably manually entered, spam
works that slip through, and any spammer who pre-creates an account days
before adding their first work will slip through too, but it is catching
the bulk spammers. These, as far as I can tell, must be using some
combination of manually directed automated tools. I did a little research,
and there are people who will charge to use manual labour to break any
anti-spam-bot test. With that as a possibility, the only option left is to
disincentivise the spammers by deleting their spam promptly, so that the
cost of adding it isn't worth it.

Regards,
Charles.

It'll be funny if this message gets blocked as spam because of the Korean
and other odd Unicode characters in the code above!
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
Archives: http://www.mail-archive.com/[email protected]/
To unsubscribe from this mailing list, send email to 
[email protected]