Hello, Alan.

I tried the bash script and got the following error message several times:

-:1: parser error : Document is empty
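
A possible reading of this error, offered as an assumption rather than a confirmed diagnosis: the message has the format that libxml2's xmllint prints when it reads an empty document from standard input ("-"). If the script pipes Solr's HTTP response into an XML parser, an empty response (wrong -u URL, closed SSH tunnel, Solr not listening on that port) would produce it once per query. The sketch below uses a hypothetical helper, require_body, which is not part of check-spider-hits.sh, to make that case fail with a clearer message.

```shell
# Hypothetical helper (not part of check-spider-hits.sh).
# Passes stdin through unchanged, but fails with a readable
# message when the input is empty, instead of letting an XML
# parser report "Document is empty".
require_body() {
    local body
    body=$(cat)
    if [ -z "$body" ]; then
        echo "empty response from Solr; check the -u URL or the SSH tunnel" >&2
        return 1
    fi
    printf '%s\n' "$body"
}
```

Assuming the statistics core is served at /solr/statistics, the helper could sit between the request and the parser, for example: curl -s 'http://localhost:8080/solr/statistics/select?q=*:*&rows=0' | require_body | xmllint --format -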




On Sunday, 10 November 2019 at 14:12:23 UTC-2, Alan Orth wrote:
>
> Dear list,
>
> I ended up writing a little bash script¹ to read known spider user agents 
> from a file such as DSpace's `example` pattern file and check for matching 
> documents in the Solr statistics core (or yearly statistics shards). It can 
> optionally purge the matched records, but this is disabled by default. In 
> our case, I purged 2 MILLION hits from our statistics core, which has data 
> going back nine years. It feels nice to know that our usage statistics are 
> more accurate now, though the repository managers will be depressed because 
> their content wasn't as popular as they thought. :)
>
> To use the script you need to be able to access your DSpace's Solr 
> instance directly, either by running the script on the same machine or by 
> making the port available via an SSH tunnel: 
>
> $ ssh -L 8080:localhost:8080 dspace.example.edu
>
> Then you can run the script, specifying the location of the Solr instance 
> and the location of the patterns file:
>
> $ ./check-spider-hits.sh -u http://localhost:8080/solr -f ~/dspace/config/spiders/agents/example
>
> Read the script source or check its help text with `-h` to see more 
> options. There is one implementation detail that is interesting: DSpace 
> uses the spider agents file from the COUNTER-Robots project², which 
> contains some plaintext names as well as regular expressions. Unfortunately, 
> Solr 4.x, as used in current DSpace 5 and 6, only has basic support for 
> regular expressions. For example, all patterns are implicitly anchored with 
> ^ and $, and you need to use [0-9] instead of \d. As such, my script does 
> some basic filtering of the input pattern file to remove user agents that 
> use regular expression characters. I imagine this is part of the 
> reason why DSpace's mark spider feature was never completed for user 
> agents, because the example agents file used by SpiderDetector.java cannot 
> be used when searching Solr later for marking spiders.
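
The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the code from check-spider-hits.sh: the function name filter_plain_agents is hypothetical, and the set of characters treated as regular expression syntax is an assumption.

```shell
# Hypothetical helper (not taken from check-spider-hits.sh).
# Prints only the lines of a pattern file that look like plain
# user agent names: blank lines and lines containing any common
# regular expression metacharacter are dropped.
filter_plain_agents() {
    # $1: path to a patterns file, one pattern per line
    grep -v -E '[][\^$.|?*+(){}]|^[[:space:]]*$' "$1"
}
```

It would be called as: filter_plain_agents ~/dspace/config/spiders/agents/example. Treating "." as a metacharacter is deliberately conservative, so names that contain a literal dot are skipped as well.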
>
> I hope this is helpful for someone. Thanks to the contributors of the 
> COUNTER-Robots project for curating this list.
>
> Regards,
>
> ¹ https://github.com/ilri/DSpace/blob/5_x-prod/check-spider-hits.sh
> ² https://github.com/atmire/COUNTER-Robots
>
> On Thu, Nov 7, 2019 at 3:55 PM Alan Orth <alan...@gmail.com> wrote:
>
>> Thank you, Mark. For now I'll just settle for an updated list of spider 
>> agents from COUNTER-Robots¹ (dropping the text file into 
>> dspace/config/spiders/agents seems to work). 
>>
>> Regards,
>>
>> ¹ https://github.com/atmire/COUNTER-Robots
>>
>> On Tue, Nov 5, 2019 at 4:02 PM Mark H. Wood <mwood...@gmail.com> wrote:
>>
>>> On Mon, Nov 04, 2019 at 11:10:25PM +0200, Alan Orth wrote:
>>> > The DSpace 5.x (and presumably 6.x) documentation[0] suggests that it is
>>> > possible to mark existing Solr statistics records as being bots or spiders
>>> > using the following command:
>>> > 
>>> > $ dspace stats-util -m
>>> > 
>>> > After trying to test this with an updated list of user agents[1] for a
>>> > while I realized that the feature is only implemented for IPs. As it stands
>>> > right now the code in StatisticsClient.java only marks robots based on
>>> > their IPs, but not on their user agents or domains:
>>> > 
>>> > else if (line.hasOption('m'))
>>> > {
>>> >     SolrLogger.markRobotsByIP();
>>> > }
>>> > 
>>> > Strangely enough, SolrLogger has a markRobotByUserAgent() function that is
>>> > never called anywhere in the Java code base (also it seems to only be
>>> > partially implemented, as it does not iterate over agents).
>>> > 
>>> > Should I file a bug? This issue affects DSpace 5.x and 6.x for sure.
>>>
>>> https://jira.duraspace.org/browse/DS-2431
>>>
>>> There are several Issues related to completing the work on extended
>>> spider marking and filtering.
>>>
>>> -- 
>>> Mark H. Wood
>>> Lead Technology Analyst
>>>
>>> University Library
>>> Indiana University - Purdue University Indianapolis
>>> 755 W. Michigan Street
>>> Indianapolis, IN 46202
>>> 317-274-0749
>>> www.ulib.iupui.edu
>>>
>>> -- 
>>> All messages to this mailing list should adhere to the DuraSpace Code of 
>>> Conduct: https://duraspace.org/about/policies/code-of-conduct/
>>> --- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "DSpace Technical Support" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to dspac...@googlegroups.com.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/dspace-tech/20191105140039.GA30402%40IUPUI.Edu
>>> .
>>>
>>
>>
>> -- 
>> Alan Orth
>> alan...@gmail.com
>> https://picturingjordan.com
>> https://englishbulgaria.net
>> https://mjanja.ch
>> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
>>
>
>
> -- 
> Alan Orth
> alan...@gmail.com
> https://picturingjordan.com
> https://englishbulgaria.net
> https://mjanja.ch
> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
>

