Thanks Tom for the feedback!

On 25 September 2015 at 06:34, Tom Morris <[email protected]> wrote:
> Charles - thanks for publishing your code (and creating the tool in the
> first place). When I look at
> http://ol-spam-finder.herokuapp.com/2015-09/10 I notice that:
> - almost everything is spam :-( Do you tally how many of the 675 new books
> were spam? Looks like 600+ from glancing at
> https://openlibrary.org/recentchanges/2015/09/10/add-book#humans

That day 675 new books were created by 40 unique users, and 27 of those
accounts were created on the day. Of those 27, 18 were created for
spamming, which implies that 22 users added good books.

I find it interesting that there were actually 575 new sign-ups in one
day, which seems to be a pretty representative number. The percentage of
new users who add books is very low, but unfortunately the majority of
those users tend to be spammers. I think legitimate users who add books
sign up, take longer to add their first book, and then continue to add
more over time.

To answer your question, the tool doesn't show how many of the 675 books
were spam. Doing the check manually, I see that only 50 of those 675
books were not created by users flagged as spammers. That doesn't
surprise me much, as each of the Korean spammers can add a couple of
hundred almost identical spam works in a short space of time. That's
what makes me think they have some kind of semi-automated system in
place. The astrologers and escort spammers tend to add only between one
and three books.

> - deleted users get categorized as "good" users in the listing

Correct. I was going to class them as something else, e.g. 'cleared',
but decided that they were 'good' in the sense that no further action
needed to be taken as they had been dealt with.

> - https://openlibrary.org/people/zamiulcse is an unidentified spammer
> (these less obvious ones are the kinds of things I'm hoping a machine
> learning solution will catch)

Yes, I have decided to leave those, as they are much harder to catch
using my approach without risking false positives.
This tool tries to be _very_ conservative in order to avoid false
positives, so it will let through a fair bit. I wanted to catch the
really obvious spam, the 100s per day of Korean spam. I have never seen
a Korean ham work added, but I am relatively confident that my script
will allow it through, as I am not just matching on Hangul characters.
It's Hangul + some other sign of spam in the title. Any legitimate
Korean books about baccarat _will_ be flagged as spam though -- that's
an unfortunate bug :)

I have seen a couple of self-published e-books that look very similar to
the above entries too, and it appears the submitter is trying to promote
downloads, but they seem like they could be legitimate if OL is
including non-print books. On a second look I can't really tell whether
that is really spam; it could be a legitimate book that just appears
very spam-like. I'm not sure about the OL policy on real books that
basically only exist for some kind of promotional purpose.

> There are also a couple of uncaught Korean spammers on
> http://ol-spam-finder.herokuapp.com/2015-09/22

That is an odd format, and again in the interests of being conservative
I wasn't completely sure that it was spam. They are less obvious than
the others; I thought it might be some kind of technical serial
published over consecutive years. Google Translate shows that the word
'massage' appears, so I think you are correct. I can't see how that even
works as spam since it looks so garbled. The word 'shepherd' is also
repeated in one of them <shrug>.

> The phone number regex didn't catch a few things like:
> - this abortion drug peddler https://openlibrary.org/people/fatin23
> - these Korean ?phone numbers?
> https://openlibrary.org/people/hdl0696
> - this escort service https://openlibrary.org/people/indianescorts

I didn't want to match all long series of numbers in case there were
legitimate technical manuals or similar that would be matched, but the
phone number regex could certainly be improved to catch more variants.

> I'm not sure how we'd catch stuff like this:
> https://openlibrary.org/works/OL17198564W/Fai-da-te_con_fiori (basically
> just a display ad in book cover form with no URL, phone number or other
> identifying characteristic)

Agreed. I have seen some odd additions that are hard to classify. Some
people add books that are obvious tests, with the word 'test' in all
fields. Some are just really bad ham. I can't find it now, but there was
a 'harry potter' with no further distinguishing features. This is
another one that looks like it was entered in good faith:
https://openlibrary.org/books/OL25765319M/stephanie_plum_one_for_the_money
but there are problems with every field except publication date, and it
looks like it is meant to be this already existing book:
https://openlibrary.org/books/OL1437742M/One_for_the_money
which is considerably better, but not perfect either.

I think the best way to handle these entries is community policing, and
maybe some way to ask the submitter to clarify what they were trying to
do, like what happens on Wikipedia, discogs.com, etc.

> Looking at months like Oct 2014, it's clear that there's still some
> historical cleanup to do: http://ol-spam-finder.herokuapp.com/2014-10

Jessamyn pointed out that a large amount of PDF download spam occurred
around that time and was slipping through. I updated the matcher
recently and re-ran the check to highlight it. It looks like those
entries have not yet been cleared.

> I'll try to eek out some time to do some more experimentation soon and let
> folks know what I find.

I look forward to hearing how it goes.
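
For anyone comparing notes, the conservative title check discussed above
(Hangul plus some other sign of spam, with a phone number pattern that
deliberately ignores plain runs of digits) could be sketched roughly as
below. This is a minimal illustration and not the tool's actual code: the
patterns, the keyword list, and the function names are all assumptions
made up for the example.

```python
import re

# Hangul Jamo, Hangul Compatibility Jamo, and Hangul Syllables blocks.
HANGUL = re.compile(r"[\u1100-\u11FF\u3130-\u318F\uAC00-\uD7AF]")

# Hypothetical "other signs of spam"; the real matcher's rules are not
# shown in this thread.
# Digit groups must be separated, so an unbroken run of digits (as in a
# technical manual's part number) is not treated as a phone number.
PHONE = re.compile(r"\d{2,4}[-. ]\d{3,4}[-. ]\d{4}")
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
KEYWORDS = ("바카라", "카지노", "baccarat", "casino")  # baccarat, casino


def other_spam_signs(title):
    """Return the names of the non-Hangul spam signals found in a title."""
    signs = []
    if PHONE.search(title):
        signs.append("phone")
    if URL.search(title):
        signs.append("url")
    lowered = title.lower()
    if any(keyword in lowered for keyword in KEYWORDS):
        signs.append("keyword")
    return signs


def looks_like_korean_spam(title):
    """Flag a title only if it has Hangul AND at least one other signal."""
    return bool(HANGUL.search(title)) and bool(other_spam_signs(title))
```

A check like this shares the weaknesses mentioned above: a legitimate
Korean title about baccarat is still flagged, and a title made of
hyphen-separated years would still match the phone pattern, so any
tightening or loosening of the patterns trades one kind of error for the
other.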
My tool is about at the limit of its usefulness. To take it further I
could do some similarly basic checking on description fields for obvious
external links, or for the 'Download' image buttons that appeared in the
PDF download spam. Other than that, I could implement a more intelligent
Bayesian algorithm and define a threshold for spam, but that sounds like
what you are working towards automating. I'm happy to leave my tool as
something to catch the most obvious, lazy, and prolific spammers :)

The main 'tell' for these spammers is a newly created user that adds a
significant number of books on day one. I found one legit user who added
about 11 on the first day, but that was a standout exception; most will
add only one or two. The rest of the additions are made by long-time
users. I have not yet found an OL spammer who hasn't fit this profile.

Getting the spam problem sorted is something I want to see too, so yell
out if you want a hand with anything. I understand it's hard to get a
decent block of time to tackle this problem properly, as I'm only doing
what I can between other work too.

Charles.
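
P.S. The 'tell' described above (a brand-new account adding a lot of
books on its first day) is simple enough to sketch. The following is
only an illustration under stated assumptions, not the tool's real
implementation: the function name, the input shapes, and the threshold
of 20 are hypothetical, with the threshold chosen merely to sit above
the one legitimate user who added about 11.

```python
from collections import Counter
from datetime import date

# Hypothetical cut-off, not a value taken from the tool.
FIRST_DAY_THRESHOLD = 20


def first_day_suspects(additions, signup_dates, threshold=FIRST_DAY_THRESHOLD):
    """Return users who added `threshold` or more books on their sign-up day.

    additions:    iterable of (username, date_added) pairs, one per new book
    signup_dates: dict mapping username -> account creation date
    """
    first_day_counts = Counter(
        user
        for user, added_on in additions
        if signup_dates.get(user) == added_on
    )
    return sorted(
        user for user, count in first_day_counts.items() if count >= threshold
    )


if __name__ == "__main__":
    day = date(2015, 9, 10)
    signups = {"bulk_adder": day, "newcomer": day, "veteran": date(2010, 1, 1)}
    adds = (
        [("bulk_adder", day)] * 200
        + [("newcomer", day)] * 2
        + [("veteran", day)] * 50
    )
    print(first_day_suspects(adds, signups))
```

A count-based check like this would only surface the prolific accounts;
spammers who add one to three books look the same as ordinary new users
here and would need a different signal.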
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
Archives: http://www.mail-archive.com/[email protected]/
To unsubscribe from this mailing list, send email to [email protected]
