Hi, is there a reputable list of IPs or Agents? I have some download numbers that seem way too high.
Filters I’m using in Solr: isBot:False, -dns:*bot*, -dns:*spider*. Also we have spider IP text files in [dspace]/config/spiders

Thanks,
s

--
Susan Borda
Digital Technologies Development Librarian
Montana State University Library
406-994-1873

From: "Pottinger, Hardy J." <[email protected]>
Date: Friday, May 15, 2015 at 7:19 AM
To: Anthony Petryk <[email protected]>, "Monika C. Mevenkamp" <[email protected]>, [email protected]
Subject: Re: [Dspace-tech] spider ip recognition

Hi, you've run into a known issue, and one I very recently wrestled with myself:

https://jira.duraspace.org/browse/DS-2431

See my last comment on that ticket, I found a way around the issue, by simply deleting the spider docs from the stats index via a query in the Solr admin interface.

--Hardy

________________________________
From: Anthony Petryk [[email protected]]
Sent: Thursday, May 14, 2015 12:06 PM
To: Monika C. Mevenkamp; [email protected]
Subject: Re: [Dspace-tech] spider ip recognition

Hi again,

Unfortunately, the documentation for the stats-util command is incorrect. Specifically this line:

    -i or --delete-spiders-by-ip: Delete Spiders in Solr By IP Address, DNS name, or Agent name. Will prune out all records that match spider identification patterns.

Running “stats-util -i” does not actually remove spiders by DNS name or Agent name.
Here are the relevant sections of the code, from StatisticsClient.java and SolrLogger.java:

    // StatisticsClient.java
    (…)
    else if(line.hasOption('i'))
    {
        SolrLogger.deleteRobotsByIP();
    }

    // SolrLogger.java
    public static void deleteRobotsByIP()
    {
        // Only the configured spider IP addresses are consulted here;
        // DNS names and Agent names are never looked at.
        for(String ip : SpiderDetector.getSpiderIpAddresses()){
            deleteIP(ip);
        }
    }

What this means is that, if a spider is in your Solr stats, there’s no way to remove it other than manually adding its IP to [dspace]/config/spiders; adding its DNS name or Agent name to the configs will not expunge it. Updating the spider files with “stats-util -u” does little to help because the IP lists it pulls from are out of date.

An example is the spider from the Bing search engine: bingbot. As of DSpace 4.3, it’s not in the list of spiders by DNS name or Agent name, nor is it in the list of spider IP addresses. So anyone running DSpace 4.3 likely has usage stats inflated by visits from this spider. The only way to remove it is to specify all the IPs for bingbot. Multiply that by all the other “new” spiders and we’re talking about a lot of work.

I tried briefly to modify the code to take domains/agents into account when marking or deleting spiders, but I wasn’t able to figure out how to query Solr with regex patterns. It’s easier to do with IPs because each IP or IP range is transformed into a String and used as a standard query parameter.

Anthony

From: Monika C. Mevenkamp [[email protected]]
Sent: Thursday, May 14, 2015 11:17 AM
To: Anthony Petryk
Cc: Monika C. Mevenkamp; [email protected]
Subject: Re: [Dspace-tech] spider ip recognition

Anthony

Since DSpace 4 you can filter by userAgent; see
https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-FilteringandPruningSpiders

I have not used this myself and am not sure whether these filters are applied as crawlers access content, or whether you need to run the [dspace]/bin/dspace stats-util command on a regular basis.
You definitely need to run it to prune or mark usage events after you configure a list of userAgents you want to filter against.

Monika
________________
Monika Mevenkamp
phone: 609-258-4161
Princeton University, Princeton, NJ 08544

On May 12, 2015, at 2:13 PM, Anthony Petryk <[email protected]> wrote:

After a bit of investigation, it turns out that a significant portion of our item stats come from spiders. Any thoughts on the best way to go about removing them from Solr retroactively? There’s nothing that I can see in the code that will do this by domain or agent, only IP. We’re not excited at the prospect of pulling out the IPs of all the spiders in order to run “stats-util -i” effectively.

Cheers,
Anthony

From: Monika C. Mevenkamp [[email protected]]
Sent: Friday, May 08, 2015 9:59 AM
To: Anthony Petryk
Cc: [email protected]
Subject: Re: [Dspace-tech] spider ip recognition

Anthony

I wrote a small Ruby script to put Solr queries together when I was poking around my stats; see
https://github.com/akinom/dscriptor/blob/master/solr/solr_query.rb

An example parameter file is
https://github.com/akinom/dscriptor/blob/master/solr/solr_query.yml

Run it as

    ruby solr/solr_query.rb

Of course you need to adjust the parameters in the yml file.

You can query like this:

    http://localhost:YOUR-PORT/solr/statistics/select?wt=json&indent=true&rows=1&facet=true&facet.field=ip&facet.mincount=1&q=type:2+id:218+isBot:false

This will:

    - exclude records that are marked as bots (isBot:false)
    - do type:2 - aka items
    - do id:218 - aka item with id 218
    - return one item (rows=1)
    - facet on ip addresses (facet.field=ip)

Crank up the number of rows to get more matching docs.

Monika
________________
Monika Mevenkamp
phone: 609-258-4161
Princeton University, Princeton, NJ 08544

On May 7, 2015, at 3:26 PM, Anthony Petryk <[email protected]> wrote:

Anyway, we want to determine whether these stats are bona fide or whether there's something wrong with the spider
detection. From the documentation it seems we have to query Solr directly to do this. Not being an expert in Solr, I'm hoping someone on this list could provide the query that retrieves *all the stats for a given item* (i.e. what's listed under "Common stored fields for all usage events" in the documentation).
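
For anyone wanting to script the per-item query asked for here, the example URL from the May 8 reply could be assembled roughly as follows. This is only a sketch: ItemStatsQuery and buildUrl are made-up names (not part of DSpace), the host and port in main are placeholders, and the field names (type, id, isBot, ip) are simply copied from the URL quoted in this thread rather than checked against a particular DSpace version.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of DSpace: rebuilds the select URL quoted in the thread.
final class ItemStatsQuery {

    // Build a select URL for one item's usage events, skipping records flagged as bots
    // and faceting on the ip field so heavy hitters stand out.
    static String buildUrl(String solrBase, int itemId, int rows) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("wt", "json");
        params.put("indent", "true");
        params.put("rows", Integer.toString(rows));
        params.put("facet", "true");
        params.put("facet.field", "ip");
        params.put("facet.mincount", "1");
        // Same clauses as the thread's example: type:2 (items), the item id, non-bot records.
        params.put("q", "type:2 id:" + itemId + " isBot:false");

        StringBuilder url = new StringBuilder(solrBase).append("/statistics/select?");
        boolean first = true;
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!first) {
                url.append('&');
            }
            url.append(e.getKey())
               .append('=')
               .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
            first = false;
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // Placeholder base URL; substitute your own host and port.
        System.out.println(buildUrl("http://localhost:8080/solr", 218, 1));
    }
}
```

The clauses in q are separated by spaces, as in the original URL, so whether they are ANDed together depends on the default operator configured for the statistics core; writing explicit AND between them would remove that dependency. Raising the rows argument returns more matching docs.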

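The workaround mentioned at the top of the thread (deleting spider docs from the stats index with a query) might look something like the sketch below, which builds a Solr delete-by-query message from a list of spider name fragments. To be clear about what is assumed: SpiderDeleteQuery and its methods are invented for illustration, this is not necessarily the query used in DS-2431, and the userAgent and dns field names are taken from the filters discussed in this thread, so check them against your own statistics schema first.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of DSpace: builds a delete-by-query message for Solr.
final class SpiderDeleteQuery {

    // Build one wildcard clause, e.g. userAgent:*bingbot*.
    // Only plain alphanumeric fragments are accepted so that no query or XML escaping is needed.
    static String wildcardClause(String field, String fragment) {
        if (!fragment.matches("[A-Za-z0-9]+")) {
            throw new IllegalArgumentException("unsupported fragment: " + fragment);
        }
        return field + ":*" + fragment + "*";
    }

    // Combine clauses for every fragment, on both the userAgent and dns fields
    // (field names assumed from the thread), into a single delete message.
    static String deleteXml(List<String> fragments) {
        String query = fragments.stream()
                .flatMap(f -> Stream.of(wildcardClause("userAgent", f), wildcardClause("dns", f)))
                .collect(Collectors.joining(" OR "));
        return "<delete><query>" + query + "</query></delete>";
    }

    public static void main(String[] args) {
        // The message would be posted to the statistics core's update handler, followed by a commit.
        System.out.println(deleteXml(List.of("bingbot")));
    }
}
```

Two cautions if you try something along these lines: run the same query as a select first to see how many docs it matches, and back up the stats index beforehand, since a delete cannot be undone. Wildcard matching may also be case-sensitive depending on how the field is defined in the schema.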