Dear Sean,

That's great! I'm glad you found it useful. I hope your manager isn't too
depressed to see the numbers go down. ;)

Regarding the difference between runs, it looks like it has to do with
the order of the user agent patterns in the file. Without purging, each
pattern is counted independently, so a hit that matches several patterns
is counted several times, whereas a purge removes it at the first match.
For example, there are 325498 hits from "Googlebot" which get purged first,
then there's a later user agent "bot" which matches 520514 requests, but
325498 of those would have already been purged by the "Googlebot" match.
There are also about 100,000 matches for "robot" and "robots", both of
which overlap with the "bot" pattern and each other. Maybe I should add a
note next to the total of hits found to say it's not a reliable number. The
most accurate number is the count of hits actually purged.
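
To illustrate the overlap with made-up data (the user agents and the
three patterns below are invented for the example, and grep merely stands
in for the Solr query), here is a small sketch that counts the same hits
both ways:

```shell
#!/usr/bin/env bash
# Hypothetical sample data: one user agent string per "hit".
hits='Googlebot/2.1
Googlebot/2.1
bingbot/2.0
robots-checker
Mozilla/5.0'

# Patterns in the order they would appear in a (hypothetical) agents file.
patterns='Googlebot bot robots'

# No purge: count matches for every pattern independently and add them up.
found_total() {
    local total=0 p n
    for p in $patterns; do
        n=$(printf '%s\n' "$hits" | grep -c -- "$p")
        total=$((total + n))
    done
    echo "$total"
}

# Purge: remove matches pattern by pattern, so each hit is counted only once.
purged_total() {
    local remaining=$hits total=0 p n
    for p in $patterns; do
        n=$(printf '%s\n' "$remaining" | grep -c -- "$p")
        total=$((total + n))
        remaining=$(printf '%s\n' "$remaining" | grep -v -- "$p")
    done
    echo "$total"
}

echo "found: $(found_total), purged: $(purged_total)"
```

With this sample the "found" total is 7 but only 4 hits are purged,
because "bot" also matches the Googlebot and robots lines.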

Also, I think I'm going to change the purge option to just be "-p" without
an argument, like the debug flag, to be consistent and require less
typing.
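
Something along these lines, as a rough sketch with getopts. The option
letters are the ones used in this thread, but the function and variable
names are made up and this is not necessarily how the script will end up
doing it:

```shell
#!/usr/bin/env bash
# Sketch only: parse -u and -f with arguments, -p and -d as plain switches.
# parse_args, solr_url, agents_file, purge and debug are hypothetical names.
parse_args() {
    local opt OPTIND=1
    solr_url='' agents_file='' purge=no debug=no
    # "u:" and "f:" take an argument; "p" and "d" do not.
    while getopts 'u:f:pd' opt; do
        case $opt in
            u) solr_url=$OPTARG ;;
            f) agents_file=$OPTARG ;;
            p) purge=yes ;;
            d) debug=yes ;;
            *) return 1 ;;
        esac
    done
}

parse_args "$@" || exit 1
echo "purge=$purge debug=$debug"
```

So "-p" alone would turn purging on, and leaving it out would keep the
default of not purging.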

Cheers,

On Tue, Nov 12, 2019 at 2:50 PM Sean Carte <[email protected]> wrote:

> Thanks, Alan!
>
> Total number of bot hits purged: 575004
>
> One thing I found curious is that I first ran it with -pno -d, then -pyes
> and got a different result each time:
>
> dspace@ir:/home/dspace$ scripts/check-spider-hits.sh -u http://localhost:8080/solr -f /dspacecris-dut/config/spiders/agents/example -pno -d
> (DEBUG) Using spiders pattern file: /dspacecris-dut/config/spiders/agents/example
> (DEBUG) Checking for hits from spider: AllenTrack
> (DEBUG) Checking for hits from spider: Arachmo
> (DEBUG) Checking for hits from spider: ContentSmartz
> (DEBUG) Checking for hits from spider: DSurf
> (DEBUG) Checking for hits from spider: EmailSiphon
> (DEBUG) Checking for hits from spider: EmailWolf
> (DEBUG) Checking for hits from spider: GetRight
> (DEBUG) Checking for hits from spider: Googlebot
> Found 325498 hits from Googlebot in statistics
> (DEBUG) Checking for hits from spider: HTTrack
> Found 1366 hits from HTTrack in statistics
> (DEBUG) Checking for hits from spider: LOCKSS
> (DEBUG) Checking for hits from spider: MSNBot
> (DEBUG) Checking for hits from spider: Milbot
> (DEBUG) Checking for hits from spider: MuscatFerre
> (DEBUG) Checking for hits from spider: NABOT
> (DEBUG) Checking for hits from spider: NaverBot
> (DEBUG) Checking for hits from spider: OurBrowser
> (DEBUG) Checking for hits from spider: Readpaper
> (DEBUG) Checking for hits from spider: Strider
> Found 1 hits from Strider in statistics
> (DEBUG) Checking for hits from spider: Teoma
> Found 2 hits from Teoma in statistics
> (DEBUG) Checking for hits from spider: Wanadoo
> Found 7 hits from Wanadoo in statistics
> (DEBUG) Checking for hits from spider: WebCloner
> (DEBUG) Checking for hits from spider: WebCopier
> (DEBUG) Checking for hits from spider: WebReaper
> (DEBUG) Checking for hits from spider: WebStripper
> (DEBUG) Checking for hits from spider: WebZIP
> (DEBUG) Checking for hits from spider: Webinator
> (DEBUG) Checking for hits from spider: Webmetrics
> (DEBUG) Checking for hits from spider: Wget
> Found 170 hits from Wget in statistics
> (DEBUG) Checking for hits from spider: alexa
> Found 238 hits from alexa in statistics
> (DEBUG) Checking for hits from spider: almaden
> (DEBUG) Checking for hits from spider: appie
> (DEBUG) Checking for hits from spider: architext
> (DEBUG) Checking for hits from spider: arks
> Found 18 hits from arks in statistics
> (DEBUG) Checking for hits from spider: asterias
> (DEBUG) Checking for hits from spider: atomz
> (DEBUG) Checking for hits from spider: autoemailspider
> (DEBUG) Checking for hits from spider: awbot
> (DEBUG) Checking for hits from spider: baiduspider
> (DEBUG) Checking for hits from spider: bbot
> (DEBUG) Checking for hits from spider: biadu
> (DEBUG) Checking for hits from spider: biglotron
> (DEBUG) Checking for hits from spider: bjaaland
> (DEBUG) Checking for hits from spider: bloglines
> (DEBUG) Checking for hits from spider: blogpulse
> (DEBUG) Checking for hits from spider: bot
> Found 520514 hits from bot in statistics
> (DEBUG) Checking for hits from spider: bspider
> Found 72 hits from bspider in statistics
> (DEBUG) Checking for hits from spider: bwh3_user_agent
> (DEBUG) Checking for hits from spider: celestial
> (DEBUG) Checking for hits from spider: cfnetwork|checkbot
> (DEBUG) Solr query returned HTTP 400, skipping cfnetwork|checkbot.
> (DEBUG) Checking for hits from spider: combine
> (DEBUG) Checking for hits from spider: contentmatch
> (DEBUG) Checking for hits from spider: core
> (DEBUG) Checking for hits from spider: crawl
> Found 15205 hits from crawl in statistics
> (DEBUG) Checking for hits from spider: crawler
> Found 15191 hits from crawler in statistics
> (DEBUG) Checking for hits from spider: cursor
> (DEBUG) Checking for hits from spider: custo
> Found 4 hits from custo in statistics
> (DEBUG) Checking for hits from spider: daumoa
> (DEBUG) Checking for hits from spider: docomo
> (DEBUG) Checking for hits from spider: dtSearchSpider
> (DEBUG) Checking for hits from spider: dumbot
> (DEBUG) Checking for hits from spider: easydl
> (DEBUG) Checking for hits from spider: exabot
> Found 133 hits from exabot in statistics
> (DEBUG) Checking for hits from spider: fast-webcrawler
> (DEBUG) Checking for hits from spider: favorg
> (DEBUG) Checking for hits from spider: feedburner
> (DEBUG) Checking for hits from spider: ferret
> (DEBUG) Checking for hits from spider: findlinks
> Found 10626 hits from findlinks in statistics
> (DEBUG) Checking for hits from spider: gaisbot
> (DEBUG) Checking for hits from spider: geturl
> (DEBUG) Checking for hits from spider: gigabot
> (DEBUG) Checking for hits from spider: girafabot
> (DEBUG) Checking for hits from spider: gnodspider
> (DEBUG) Checking for hits from spider: google
> Found 327642 hits from google in statistics
> (DEBUG) Checking for hits from spider: grub
> (DEBUG) Checking for hits from spider: gulliver
> (DEBUG) Checking for hits from spider: harvest
> (DEBUG) Checking for hits from spider: heritrix
> Found 765 hits from heritrix in statistics
> (DEBUG) Checking for hits from spider: hl_ftien_spider
> (DEBUG) Checking for hits from spider: holmes
> (DEBUG) Checking for hits from spider: htdig
> (DEBUG) Checking for hits from spider: htmlparser
> (DEBUG) Checking for hits from spider: httrack
> (DEBUG) Checking for hits from spider: iSiloX
> (DEBUG) Checking for hits from spider: ia_archiver
> Found 243 hits from ia_archiver in statistics
> (DEBUG) Checking for hits from spider: ichiro
> Found 1153 hits from ichiro in statistics
> (DEBUG) Checking for hits from spider: iktomi
> (DEBUG) Checking for hits from spider: ilse
> (DEBUG) Checking for hits from spider: internetseer
> (DEBUG) Checking for hits from spider: intute
> (DEBUG) Checking for hits from spider: java
> Found 2 hits from java in statistics
> (DEBUG) Checking for hits from spider: jeeves
> (DEBUG) Checking for hits from spider: jobo
> (DEBUG) Checking for hits from spider: kyluka
> (DEBUG) Checking for hits from spider: larbin
> (DEBUG) Checking for hits from spider: libwww
> Found 113 hits from libwww in statistics
> (DEBUG) Checking for hits from spider: lilina
> (DEBUG) Checking for hits from spider: linkbot
> (DEBUG) Checking for hits from spider: linkcheck
> (DEBUG) Checking for hits from spider: linkchecker
> (DEBUG) Checking for hits from spider: linkscan
> (DEBUG) Checking for hits from spider: linkwalker
> (DEBUG) Checking for hits from spider: lmspider
> (DEBUG) Checking for hits from spider: lwp
> (DEBUG) Checking for hits from spider: megite
> (DEBUG) Checking for hits from spider: milbot
> (DEBUG) Checking for hits from spider: mimas
> (DEBUG) Checking for hits from spider: mj12bot
> (DEBUG) Checking for hits from spider: mnogosearch
> (DEBUG) Checking for hits from spider: moget
> (DEBUG) Checking for hits from spider: mojeekbot
> (DEBUG) Checking for hits from spider: momspider
> (DEBUG) Checking for hits from spider: motor
> Found 8 hits from motor in statistics
> (DEBUG) Checking for hits from spider: msiecrawler
> (DEBUG) Checking for hits from spider: msnbot
> Found 8993 hits from msnbot in statistics
> (DEBUG) Checking for hits from spider: myweb
> (DEBUG) Checking for hits from spider: nagios
> (DEBUG) Checking for hits from spider: netcraft
> (DEBUG) Checking for hits from spider: netluchs
> (DEBUG) Checking for hits from spider: no_user_agent
> (DEBUG) Checking for hits from spider: nomad
> (DEBUG) Checking for hits from spider: nutch
> Found 68 hits from nutch in statistics
> (DEBUG) Checking for hits from spider: ocelli
> (DEBUG) Checking for hits from spider: onetszukaj
> (DEBUG) Checking for hits from spider: perman
> (DEBUG) Checking for hits from spider: pioneer
> (DEBUG) Checking for hits from spider: powermarks
> (DEBUG) Checking for hits from spider: psbot
> Found 3 hits from psbot in statistics
> (DEBUG) Checking for hits from spider: python
> Found 1 hits from python in statistics
> (DEBUG) Checking for hits from spider: qihoobot
> (DEBUG) Checking for hits from spider: rambler
> (DEBUG) Checking for hits from spider: redalert|robozilla
> (DEBUG) Solr query returned HTTP 400, skipping redalert|robozilla.
> (DEBUG) Checking for hits from spider: robot
> Found 56183 hits from robot in statistics
> (DEBUG) Checking for hits from spider: robots
> Found 43145 hits from robots in statistics
> (DEBUG) Checking for hits from spider: rss
> (DEBUG) Checking for hits from spider: scan4mail
> (DEBUG) Checking for hits from spider: scientificcommons
> (DEBUG) Checking for hits from spider: scirus
> (DEBUG) Checking for hits from spider: scooter
> (DEBUG) Checking for hits from spider: seekbot
> (DEBUG) Checking for hits from spider: seznambot
> (DEBUG) Checking for hits from spider: shoutcast
> (DEBUG) Checking for hits from spider: slurp
> Found 104 hits from slurp in statistics
> (DEBUG) Checking for hits from spider: sogou
> Found 2178 hits from sogou in statistics
> (DEBUG) Checking for hits from spider: speedy
> Found 139 hits from speedy in statistics
> (DEBUG) Checking for hits from spider: spider
> Found 23341 hits from spider in statistics
> (DEBUG) Checking for hits from spider: spiderman
> (DEBUG) Checking for hits from spider: spiderview
> (DEBUG) Checking for hits from spider: sunrise
> (DEBUG) Checking for hits from spider: superbot
> (DEBUG) Checking for hits from spider: surveybot
> (DEBUG) Checking for hits from spider: tailrank
> (DEBUG) Checking for hits from spider: technoratibot
> (DEBUG) Checking for hits from spider: titan
> (DEBUG) Checking for hits from spider: turnitinbot
> (DEBUG) Checking for hits from spider: twiceler
> (DEBUG) Checking for hits from spider: ucsd
> (DEBUG) Checking for hits from spider: ultraseek
> (DEBUG) Checking for hits from spider: urlaliasbuilder
> (DEBUG) Checking for hits from spider: urllib
> Found 66 hits from urllib in statistics
> (DEBUG) Checking for hits from spider: voila
> (DEBUG) Checking for hits from spider: webcollage
> (DEBUG) Checking for hits from spider: weblayers
> (DEBUG) Checking for hits from spider: webmirror
> (DEBUG) Checking for hits from spider: webreaper
> (DEBUG) Checking for hits from spider: wordpress
> (DEBUG) Checking for hits from spider: worm
> (DEBUG) Checking for hits from spider: xenu
> (DEBUG) Checking for hits from spider: yacy
> Found 2 hits from yacy in statistics
> (DEBUG) Checking for hits from spider: yahoo
> Found 153 hits from yahoo in statistics
> (DEBUG) Checking for hits from spider: yahoofeedseeker
> (DEBUG) Checking for hits from spider: yahooseeker
> (DEBUG) Checking for hits from spider: yandex
> Found 8591 hits from yandex in statistics
> (DEBUG) Checking for hits from spider: yodaobot
> (DEBUG) Checking for hits from spider: zealbot
> (DEBUG) Checking for hits from spider: zeus
> (DEBUG) Checking for hits from spider: zyborg
> (DEBUG) Checking for hits from spider: parsijoo
> Found 38 hits from parsijoo in statistics
> (DEBUG) Checking for hits from spider: validator
>
> Total number of hits from bots: 1361976
> dspace@ir:/home/dspace$ scripts/check-spider-hits.sh -u http://localhost:8080/solr -f /dspacecris-dut/config/spiders/agents/example -pyes
> Purging 325498 hits from Googlebot in statistics
> Purging 1366 hits from HTTrack in statistics
> Purging 1 hits from Strider in statistics
> Purging 2 hits from Teoma in statistics
> Purging 7 hits from Wanadoo in statistics
> Purging 170 hits from Wget in statistics
> Purging 238 hits from alexa in statistics
> Purging 18 hits from arks in statistics
> Purging 195014 hits from bot in statistics
> Purging 72 hits from bspider in statistics
> Purging 14714 hits from crawl in statistics
> Purging 4 hits from custo in statistics
> Purging 10626 hits from findlinks in statistics
> Purging 2271 hits from google in statistics
> Purging 765 hits from heritrix in statistics
> Purging 5 hits from ia_archiver in statistics
> Purging 598 hits from ichiro in statistics
> Purging 2 hits from java in statistics
> Purging 113 hits from libwww in statistics
> Purging 8 hits from motor in statistics
> Purging 1 hits from python in statistics
> Purging 103 hits from slurp in statistics
> Purging 2178 hits from sogou in statistics
> Purging 139 hits from speedy in statistics
> Purging 20938 hits from spider in statistics
> Purging 66 hits from urllib in statistics
> Purging 49 hits from yahoo in statistics
> Purging 38 hits from parsijoo in statistics
>
> Total number of bot hits purged: 575004
>
>
> On Sun, 10 Nov 2019 at 18:12, Alan Orth <[email protected]> wrote:
>
>> Dear list,
>>
>> I ended up writing a little bash script¹ to read known spider user agents
>> from a file such as DSpace's `example` pattern file and check for matching
>> documents in the Solr statistics core (or yearly statistics shards). It can
>> optionally purge the matched records, but this is disabled by default. In
>> our case, I purged 2 MILLION hits from our statistics core, which has data
>> going back nine years. It feels nice to know that our usage statistics are
>> more accurate now, though the repository managers will be depressed because
>> their content wasn't as popular as they thought. :)
>>
>> To use the script you need to be able to access your DSpace's Solr
>> instance directly, either by running the script on the same machine or by
>> making the port available via an SSH tunnel:
>>
>> $ ssh -L 8080:localhost:8080 dspace.example.edu
>>
>> Then you can run the script, specifying the location of the Solr instance
>> and the location of the patterns file:
>>
>> $ ./check-spider-hits.sh -u http://localhost:8080/solr -f ~/dspace/config/spiders/agents/example
>>
>> Read the script source or check its help text with `-h` to see more
>> options. There is one interesting implementation detail: DSpace
>> uses the spider agents file from the COUNTER-Robots project², which
>> contains some plaintext names as well as regular expressions. Unfortunately
>> Solr 4.x, as used in current DSpace 5 and 6, only has basic support for
>> regular expressions. For example, all patterns are anchored with ^ and $ by
>> default, and you need to use [0-9] instead of \d. As such, my script does
>> some basic filtering of the input pattern file to remove user agents that
>> use regular expression characters. I imagine this is part of the
>> reason why DSpace's mark spider feature was never completed for user
>> agents: the example agents file used by SpiderDetector.java cannot
>> be used when searching Solr later for marking spiders.
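
A rough sketch of that kind of filtering, not the script's actual code:
the function name is hypothetical, and the set of characters treated as
regular expression syntax is an assumption.

```shell
#!/usr/bin/env bash
# Print only the lines of a patterns file that contain none of the listed
# regular expression metacharacters. Which characters to exclude is an
# assumption made for this sketch.
plain_patterns() {
    grep -v -e '[][\\^$.|?*+(){}]' -- "$1"
}

# Usage example with a throwaway file of made-up patterns.
tmp=$(mktemp)
printf '%s\n' 'Googlebot' 'cfnetwork|checkbot' '^Java\/[0-9]' 'bot' > "$tmp"
plain_patterns "$tmp"
rm -f "$tmp"
```

Only the plain names (Googlebot and bot here) are printed; anything that
looks like a regular expression is skipped.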
>>
>> I hope this is helpful for someone. Thanks to the contributors of the
>> COUNTER-Robots project for curating this list.
>>
>> Regards,
>>
>> ¹ https://github.com/ilri/DSpace/blob/5_x-prod/check-spider-hits.sh
>> ² https://github.com/atmire/COUNTER-Robots
>>
>> On Thu, Nov 7, 2019 at 3:55 PM Alan Orth <[email protected]> wrote:
>>
>>> Thank you, Mark. For now I'll just settle for an updated list of spider
>>> agents from COUNTER-Robots¹ (dropping the text file into
>>> dspace/config/spiders/agents seems to work).
>>>
>>> Regards,
>>>
>>> ¹ https://github.com/atmire/COUNTER-Robots
>>>
>>> On Tue, Nov 5, 2019 at 4:02 PM Mark H. Wood <[email protected]>
>>> wrote:
>>>
>>>> On Mon, Nov 04, 2019 at 11:10:25PM +0200, Alan Orth wrote:
>>>> > The DSpace 5.x (and presumably 6.x) documentation[0] suggests that it is
>>>> > possible to mark existing Solr statistics records as being bots or spiders
>>>> > using the following command:
>>>> >
>>>> > $ dspace stats-util -m
>>>> >
>>>> > After trying to test this with an updated list of user agents[1] for a
>>>> > while I realized that the feature is only implemented for IPs. As it stands
>>>> > right now the code in StatisticsClient.java only marks robots based on
>>>> > their IPs, but not on their user agents or domains:
>>>> >
>>>> > else if (line.hasOption('m'))
>>>> > {
>>>> >     SolrLogger.markRobotsByIP();
>>>> > }
>>>> >
>>>> > Strangely enough, SolrLogger has a markRobotByUserAgent() function that is
>>>> > never called anywhere in the Java code base (also it seems to only be
>>>> > partially implemented, as it does not iterate over agents).
>>>> >
>>>> > Should I file a bug? This issue affects DSpace 5.x and 6.x for sure.
>>>>
>>>> https://jira.duraspace.org/browse/DS-2431
>>>>
>>>> There are several Issues related to completing the work on extended
>>>> spider marking and filtering.
>>>>
>>>> --
>>>> Mark H. Wood
>>>> Lead Technology Analyst
>>>>
>>>> University Library
>>>> Indiana University - Purdue University Indianapolis
>>>> 755 W. Michigan Street
>>>> Indianapolis, IN 46202
>>>> 317-274-0749
>>>> www.ulib.iupui.edu
>>>>
>>>> --
>>>> All messages to this mailing list should adhere to the DuraSpace Code
>>>> of Conduct: https://duraspace.org/about/policies/code-of-conduct/
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "DSpace Technical Support" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/dspace-tech/20191105140039.GA30402%40IUPUI.Edu
>>>> .
>>>>
>>>
>>>
>>> --
>>> Alan Orth
>>> [email protected]
>>> https://picturingjordan.com
>>> https://englishbulgaria.net
>>> https://mjanja.ch
>>> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
>>>
>>
>>
>> --
>> Alan Orth
>> [email protected]
>> https://picturingjordan.com
>> https://englishbulgaria.net
>> https://mjanja.ch
>> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
>>
>>
>
>

-- 
Alan Orth
[email protected]
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch
"In heaven all the interesting people are missing." ―Friedrich Nietzsche

