Re: [ol-tech] OpenLibrary spam corpus and spam detection

Charles Horn Sun, 27 Sep 2015 19:16:08 -0700

Thanks Tom for the feedback!

On 25 September 2015 at 06:34, Tom Morris <[email protected]> wrote:



> Charles - thanks for publishing your code (and creating the tool in the
> first place).  When I look at
> http://ol-spam-finder.herokuapp.com/2015-09/10 I notice that:
> - almost everything is spam :-( Do you tally how many of the 675 new books
> were spam?  Looks like 600+ from glancing at
> https://openlibrary.org/recentchanges/2015/09/10/add-book#humans
>

That day 675 new books were created by 40 unique users, and 27 of those
accounts were created on the day. Of those 27, 18 were created for
spamming, implying that 22 users added good books.

I find it interesting that there were actually 575 new sign-ups in one day,
which seems to be a pretty representative number. The percentage of new
users who add books is very low, but unfortunately the majority of those
users tend to be spammers. I think legitimate users who add books sign up,
and take longer to add their first book, and then continue to add more over
time.

To answer your question, the tool doesn't show how many of the 675 books
were spam. Doing the check manually I see that only 50 of those 675 books
were not created by users flagged as spammers. That doesn't surprise me
much, as each of the Korean spammers can add a couple of hundred almost
identical spam works in a short space of time. That's what makes me think
they have some kind of semi-automated system in place. The astrologers and
escort spammers tend to only add between one and three books.
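
For reference, the arithmetic behind those figures can be written out as a
short Python sketch. The numbers are the ones quoted above for 2015-09-10;
the variable names are only illustrative.

```python
# Figures quoted in this thread for 2015-09-10.
total_books = 675
books_not_by_flagged_users = 50
unique_users = 40
spam_accounts = 18

# Books added by users flagged as spammers, and their share of the day's total.
books_by_flagged_users = total_books - books_not_by_flagged_users
spam_share = books_by_flagged_users / total_books

# Users who were not spam accounts.
good_users = unique_users - spam_accounts

print(f"{books_by_flagged_users} of {total_books} books "
      f"({spam_share:.1%}) came from flagged users")
print(f"{good_users} of {unique_users} users added good books")
```

So roughly 625 of the 675 books, a little under 93%, came from flagged
accounts, which is in line with the "600+" estimate.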


> - deleted users get categorized as "good" users in the listing
>

Correct. I was going to class them as something else, e.g. 'cleared' but
decided that they were 'good' in the sense that no further action needed to
be taken as they were dealt with.


> - https://openlibrary.org/people/zamiulcse is an unidentified spammer
> (these less obvious ones are the kinds of things I'm hoping a machine
> learning solution will catch)
>

Yes, I have decided to leave those, as they are much harder to catch using
my approach without risking false positives. This tool tries to be _very_
conservative in order to avoid false positives, so it will let through a
fair bit. I wanted to catch the really obvious spam, the 100s per day of
Korean spam. I have never seen a Korean ham work added, but I am relatively
confident that my script will allow it through, as I am not just matching
on Hangul characters. It's Hangul + some other sign of spam in the title.
Any legitimate Korean books about Baccarat _will_ be flagged as spam
though -- that's an unfortunate bug :)
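
A minimal sketch of the "Hangul + some other sign of spam" idea, in Python.
This is not the tool's actual rule set: the secondary signals below (a
link, phone-like digit groups, and two gambling keywords) are assumptions
chosen only to illustrate the shape of the check.

```python
import re

# Hangul syllables block (U+AC00..U+D7A3); jamo-only text is ignored here.
HANGUL = re.compile(r"[\uac00-\ud7a3]")

# Hypothetical secondary signals, NOT the real tool's list.
SPAM_SIGNALS = [
    re.compile(r"https?://|www\.", re.IGNORECASE),  # embedded link
    re.compile(r"\d{2,4}[-. ]\d{3,4}[-. ]\d{4}"),    # phone-like digit groups
    re.compile(r"바카라|카지노"),                      # 'baccarat', 'casino'
]


def looks_like_korean_spam(title: str) -> bool:
    """True only when the title has Hangul AND at least one other signal."""
    if not HANGUL.search(title):
        return False
    return any(pattern.search(title) for pattern in SPAM_SIGNALS)


print(looks_like_korean_spam("한국의 역사"))           # Hangul only
print(looks_like_korean_spam("바카라 www.example.com"))  # Hangul + link + keyword
print(looks_like_korean_spam("바카라 전략 입문"))        # a would-be legitimate title
```

Requiring two independent signals is what keeps plain Korean titles out of
the net. The third call shows the known weakness: a genuine book about
baccarat trips the keyword signal and is flagged anyway.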

I have seen a couple of self-published e-books that look very similar to
the above entries too, and it appears the submitter is trying to promote
downloads, but they seem like they could be legitimate if OL is including
non-print books. On a second look I can't really tell if that is really
spam, it could be a legitimate book that appears very spam like? I'm not
sure about the OL policy on real books that basically only exist for some
kind of promotional purpose.


> There are also a couple of uncaught Korean spammers on
> http://ol-spam-finder.herokuapp.com/2015-09/22
>

That is an odd format, and again in the interests of being conservative I
wasn't completely sure that it was spam. They are less obvious than the
others; I thought it may be some kind of technical serial published over
consecutive years?... Google Translate shows the word 'massage' appears, so
I think you are correct. I can't see how that even works as spam since it
looks so garbled. The word 'shepherd' is also repeated in one of them
<shrug>.

> The phone number regex didn't catch a few things like:
>  - this abortion drug peddler https://openlibrary.org/people/fatin23
>  - these Korean ?phone numbers? https://openlibrary.org/people/hdl0696
>  - this escort service https://openlibrary.org/people/indianescorts
>

I didn't want to match every long series of digits, in case legitimate
technical manuals or the like were caught, but the phone number regex
could certainly be improved to catch more variants.
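
One possible direction, sketched in Python under stated assumptions: accept
digit runs only when they contain a separator or a leading "+", and only
when the total digit count falls in a plausible range (9 to 15 here, an
arbitrary choice). The pattern and the function name are hypothetical and
are not taken from the existing tool.

```python
import re

# Optional "+" and "(", then digits mixed with spaces, dots, hyphens or
# parentheses, ending on a digit. At least 9 characters after the "+".
CANDIDATE = re.compile(r"\+?\(?\d[\d\s().-]{7,}\d")


def find_phone_numbers(text: str) -> list:
    """Return phone-like substrings, skipping bare unbroken digit runs."""
    hits = []
    for match in CANDIDATE.finditer(text):
        raw = match.group()
        digit_count = len(re.sub(r"\D", "", raw))
        # A bare run such as an ISBN has no separator and no leading "+".
        has_separator = raw.startswith("+") or bool(re.search(r"[\s().-]", raw))
        if has_separator and 9 <= digit_count <= 15:
            hits.append(raw)
    return hits


print(find_phone_numbers("Call +1 555 010 0199 today"))
print(find_phone_numbers("Contact 010-1234-5678"))
print(find_phone_numbers("ISBN 9781234567890"))
```

This still has false positives (a date followed by a short number can look
like a phone number), so it would probably need to be combined with another
signal rather than used on its own.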


> I'm not sure how we'd catch stuff like this:
> https://openlibrary.org/works/OL17198564W/Fai-da-te_con_fiori (basically
> just a display ad in book cover form with no URL, phone number or other
> identifying characteristic)
>
>
Agreed. I have seen some odd things that have been added that are hard to
classify. Some people add books that are obvious tests, with the word
'test' in all fields. Some are just really bad ham. I can't find it now,
but there was a 'harry potter' with no further distinguishing features.
This is another one that looks like it was entered in good faith:
https://openlibrary.org/books/OL25765319M/stephanie_plum_one_for_the_money
but there are problems with every field except publication date, and it
looks like it is meant to be this already existing book here:
https://openlibrary.org/books/OL1437742M/One_for_the_money that is
considerably better, but not perfect either. My thinking is that these
entries are best handled by community policing, and maybe some way to ask
the submitter to clarify what they were trying to do, as happens on
Wikipedia, discogs.com etc.


> Looking at months like Oct 2014, it's clear that there's still some
> historical cleanup to do: http://ol-spam-finder.herokuapp.com/2014-10
>
>
Jessamyn pointed out that a large amount of PDF download spam occurred
around that time and was slipping through. I updated the matcher recently
and re-ran the check to highlight those entries. Looks like they have not
yet been cleared.


> I'll try to eke out some time to do some more experimentation soon and let
> folks know what I find.
>
>
I look forward to hearing how it goes. My tool is about at the limit of its
usefulness. To take it further I could do some similarly basic checking on
description fields for obvious external links, or for the 'Download' image
buttons that appeared in the PDF download spam. Other than that, I could
implement a more intelligent Bayesian algorithm and define a threshold for
spam, but then that sounds like what you are working towards automating.
I'm happy to leave my tool as something to catch the most obvious, lazy,
and prolific spammers :)
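
To make the "Bayesian algorithm plus a threshold" idea concrete, here is a
minimal naive Bayes sketch in Python with add-one smoothing. It is only an
illustration: the class name, the training titles and the 0.9 threshold
are all made up, and a real filter would need far more training data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class NaiveBayesSpamFilter:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed class prior.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                # Add-one smoothing; the extra 1 leaves room for unseen words.
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = score
        # Turn the two log scores into P(spam), clamped to avoid overflow.
        diff = max(min(log_scores["ham"] - log_scores["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


# Made-up training titles, for illustration only.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("download free pdf ebook now", "spam")
spam_filter.train("free pdf download full book", "spam")
spam_filter.train("a history of the roman empire", "ham")
spam_filter.train("introduction to organic chemistry", "ham")

SPAM_THRESHOLD = 0.9  # deliberately high, to stay conservative
for title in ("free pdf download", "history of the roman empire"):
    flagged = spam_filter.spam_probability(title) >= SPAM_THRESHOLD
    print(title, "->", "spam" if flagged else "not flagged")
```

Keeping the threshold high trades recall for precision, which matches the
conservative stance taken elsewhere in this thread.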

The main 'tell' for these spammers is a newly created user that adds a
significant number of books on day one. I found one legit user who added
about 11 on the first day, but that was a standout exception; most will
add only one or two. The rest of the additions are made by long-time users.
I have not yet found an OL spammer who hasn't fit this profile.
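
That profile can be expressed as a simple check, sketched here in Python.
The data layout (a dict of sign-up dates and a list of additions), the
function name and the threshold of 5 are assumptions for illustration, not
how the tool stores or scores anything.

```python
from collections import Counter
from datetime import date


def first_day_bulk_adders(signup_dates, additions, threshold=5):
    """Return {username: count} for users who added at least `threshold`
    books on the same day their account was created.

    signup_dates: {username: date the account was created}
    additions:    iterable of (username, date the book was added)
    """
    first_day_counts = Counter(
        user
        for user, added_on in additions
        if signup_dates.get(user) == added_on
    )
    return {user: n for user, n in first_day_counts.items() if n >= threshold}


# Hypothetical sample data.
day = date(2015, 9, 10)
signup_dates = {
    "new_spammer": day,
    "new_reader": day,
    "long_time_user": date(2010, 1, 1),
}
additions = (
    [("new_spammer", day)] * 20
    + [("new_reader", day)] * 2
    + [("long_time_user", day)] * 30
)

print(first_day_bulk_adders(signup_dates, additions))
```

With a threshold this low, the one legitimate user who added about 11 books
on day one would also be returned, so the result is better treated as a
review queue than as an automatic block list.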

Getting the spam problem sorted is something I want to see too, so yell out
if you want a hand with anything. I understand it's hard to get a decent
block of time to tackle this problem properly as I'm only doing what I can
between other work too.


Charles.

