Dear list,

I ended up writing a little bash script¹ to read known spider user agents
from a file such as DSpace's `example` pattern file and check for matching
documents in the Solr statistics core (or yearly statistics shards). It can
optionally purge the matched records, but this is disabled by default. In
our case, I purged 2 MILLION hits from our statistics core, which has data
going back nine years. It feels nice to know that our usage statistics are
more accurate now, though the repository managers will be depressed because
their content wasn't as popular as they thought. :)

To use the script you need to be able to access your DSpace's Solr instance
directly, either by running the script on the same machine or by making the
port available via an SSH tunnel:

$ ssh -L 8080:localhost:8080 dspace.example.edu

Then you can run the script, specifying the location of the Solr instance
and the location of the patterns file:

$ ./check-spider-hits.sh -u http://localhost:8080/solr \
    -f ~/dspace/config/spiders/agents/example
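
For a rough idea of what checking a single agent against Solr involves,
here is a minimal sketch using curl. It is not taken from the script itself:
the function names are hypothetical, and it assumes that the statistics core
is named `statistics`, that the user agent is stored in a field named
`userAgent`, and that the agent name is plain text with no regular
expression characters. Because Solr regular expressions have to match the
whole term, the name is wrapped in `.*` on both sides.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build a Solr regular expression query that matches
# any userAgent value containing the given plain-text agent name.
# Assumption: the field is named "userAgent".
agent_query() {
    printf 'userAgent:/.*%s.*/' "$1"
}

# Hypothetical helper: ask Solr how many documents match, without returning
# the documents themselves (rows=0).
# Assumption: the core is named "statistics".
count_hits() {
    local solr_url="$1" agent="$2"
    curl -s -G "$solr_url/statistics/select" \
        --data-urlencode "q=$(agent_query "$agent")" \
        --data-urlencode 'rows=0' \
        --data-urlencode 'wt=json'
}
```

With the SSH tunnel above in place, `count_hits http://localhost:8080/solr
Googlebot` should return a JSON response whose `numFound` value is the
number of matching hits.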

Read the script source or check its help text with `-h` to see more
options. One implementation detail is worth mentioning: DSpace uses the
spider agents file from the COUNTER-Robots project², which contains some
plaintext names as well as regular expressions. Unfortunately, Solr 4.x, as
used in current DSpace 5 and 6, has only basic support for regular
expressions. For example, all patterns are anchored with ^ and $ by
default, and you need to use [0-9] instead of \d. As such, my script does
some basic filtering of the input pattern file to remove user agents that
contain regular expression characters. I imagine this is part of the reason
why DSpace's mark spider feature was never completed for user agents: the
example agents file used by SpiderDetector.java cannot be used as-is when
searching Solr later to mark spiders.
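
The filtering step can be sketched roughly as follows. This is an
illustration rather than the exact code in the script: the function name is
hypothetical, and the set of characters treated as "regular expression
characters" is an assumption. It is deliberately conservative, so a plain
name that happens to contain a dot is dropped too.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print only the patterns that contain no regular
# expression metacharacters, so they can be used as literal search terms.
filter_plain_agents() {
    local patterns_file="$1"
    # Drop empty lines and any line containing one of:
    #   ] [ \ ^ $ * + ? . ( ) { } |
    grep -v -E '^$|[][\^$*+?.(){}|]' "$patterns_file"
}
```

Running `filter_plain_agents ~/dspace/config/spiders/agents/example` prints
one plain agent name per line.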

I hope this is helpful for someone. Thanks to the contributors of the
COUNTER-Robots project for curating this list.

Regards,

¹ https://github.com/ilri/DSpace/blob/5_x-prod/check-spider-hits.sh
² https://github.com/atmire/COUNTER-Robots

On Thu, Nov 7, 2019 at 3:55 PM Alan Orth <alan.o...@gmail.com> wrote:

> Thank you, Mark. For now I'll just settle for an updated list of spider
> agents from COUNTER-Robots¹ (dropping the text file into
> dspace/config/spiders/agents seems to work).
>
> Regards,
>
> ¹ https://github.com/atmire/COUNTER-Robots
>
> On Tue, Nov 5, 2019 at 4:02 PM Mark H. Wood <mwoodiu...@gmail.com> wrote:
>
>> On Mon, Nov 04, 2019 at 11:10:25PM +0200, Alan Orth wrote:
>> > The DSpace 5.x (and presumably 6.x) documentation[0] suggests that it is
>> > possible to mark existing Solr statistics records as being bots or spiders
>> > using the following command:
>> >
>> > $ dspace stats-util -m
>> >
>> > After trying to test this with an updated list of user agents[1] for a
>> > while I realized that the feature is only implemented for IPs. As it stands
>> > right now the code in StatisticsClient.java only marks robots based on
>> > their IPs, but not on their user agents or domains:
>> >
>> > else if (line.hasOption('m'))
>> > {
>> >     SolrLogger.markRobotsByIP();
>> > }
>> >
>> > Strangely enough, SolrLogger has a markRobotByUserAgent() function that is
>> > never called anywhere in the Java code base (also it seems to only be
>> > partially implemented, as it does not iterate over agents).
>> >
>> > Should I file a bug? This issue affects DSpace 5.x and 6.x for sure.
>>
>> https://jira.duraspace.org/browse/DS-2431
>>
>> There are several Issues related to completing the work on extended
>> spider marking and filtering.
>>
>> --
>> Mark H. Wood
>> Lead Technology Analyst
>>
>> University Library
>> Indiana University - Purdue University Indianapolis
>> 755 W. Michigan Street
>> Indianapolis, IN 46202
>> 317-274-0749
>> www.ulib.iupui.edu
>>
>> --
>> All messages to this mailing list should adhere to the DuraSpace Code of
>> Conduct: https://duraspace.org/about/policies/code-of-conduct/
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "DSpace Technical Support" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to dspace-tech+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/dspace-tech/20191105140039.GA30402%40IUPUI.Edu
>> .
>>
>
>
> --
> Alan Orth
> alan.o...@gmail.com
> https://picturingjordan.com
> https://englishbulgaria.net
> https://mjanja.ch
> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
>


-- 
Alan Orth
alan.o...@gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch
"In heaven all the interesting people are missing." ―Friedrich Nietzsche

