Hi Susan,
The one that DSpace uses is http://iplists.com. It was last updated 2 years ago. I haven’t come across another one myself, at least not in such an easy-to-use format. We’ve taken to periodically removing the main offenders by hand (facet your Solr query by IP; the top ones will likely be bots). A more up-to-date list would be welcome indeed!

Anthony

On Wednesday, February 3, 2016 at 5:22:07 PM UTC-5, Susan Borda wrote:
>
> Hi-
> Is there a reputable list of IPs or Agents? I have some download numbers
> that seem way too high.
>
> Filters I’m using in Solr: isBot:False, -dns:*bot*, -dns:*spider*. Also we
> have spider IP text files in [dspace]/config/spiders
>
> Thanks,
> s
> --
> Susan Borda
> Digital Technologies Development Librarian
> Montana State University Library
> 406-994-1873
>
> From: "Pottinger, Hardy J." <[email protected]>
> Date: Friday, May 15, 2015 at 7:19 AM
> To: Anthony Petryk <[email protected]>, "Monika C. Mevenkamp" <[email protected]>,
> "[email protected]" <[email protected]>
> Subject: Re: [Dspace-tech] spider ip recognition
>
> Hi, you've run into a known issue, and one I very recently wrestled with
> myself:
>
> https://jira.duraspace.org/browse/DS-2431
>
> See my last comment on that ticket: I found a way around the issue by
> simply deleting the spider docs from the stats index via a query in the
> Solr admin interface.
>
> --Hardy
>
> ------------------------------
> *From:* Anthony Petryk [[email protected]]
> *Sent:* Thursday, May 14, 2015 12:06 PM
> *To:* Monika C. Mevenkamp; [email protected]
> *Subject:* Re: [Dspace-tech] spider ip recognition
>
> Hi again,
>
> Unfortunately, the documentation for the stats-util command is incorrect.
> Specifically this line:
>
> *-i or --delete-spiders-by-ip: Delete Spiders in Solr By IP Address, DNS
> name, or Agent name. Will prune out all records that match spider
> identification patterns.*
>
> Running “stats-util -i” does not actually remove spiders by DNS name or
> Agent name. Here are the relevant sections of the code, from
> StatisticsClient.java and SolrLogger.java:
>
> (…)
> else if(line.hasOption('i'))
> {
>     SolrLogger.deleteRobotsByIP();
> }
>
> public static void deleteRobotsByIP()
> {
>     for(String ip : SpiderDetector.getSpiderIpAddresses()){
>         deleteIP(ip);
>     }
> }
>
> What this means is that, if a spider is in your Solr stats, there’s no way
> to remove it other than manually adding its IP to [dspace]/config/spiders;
> adding its DNS name or Agent name to the configs will not expunge it.
> Updating the spider files with “stats-util -u” does little to help because
> the IP lists it pulls from are out of date.
>
> An example is the spider from the Bing search engine: bingbot. As of
> DSpace 4.3, it’s not in the list of spiders by DNS name or Agent name, nor
> is it in the list of spider IP addresses. So anyone running DSpace 4.3
> likely has usage stats inflated by visits from this spider. The only way
> to remove it is to specify all the IPs for bingbot. Multiply that by all
> the other “new” spiders and we’re talking about a lot of work.
>
> I tried briefly to modify the code to take domains/agents into account
> when marking or deleting spiders, but I wasn’t able to figure out how to
> query Solr with regex patterns. It’s easier to do with IPs because each IP
> or IP range is transformed into a String and used as a standard query
> parameter.
>
> Anthony
>
> *From:* Monika C. Mevenkamp [mailto:[email protected]]
> *Sent:* Thursday, May 14, 2015 11:17 AM
> *To:* Anthony Petryk
> *Cc:* Monika C.
Mevenkamp; [email protected]
> *Subject:* Re: [Dspace-tech] spider ip recognition
>
> Anthony
>
> Since DSpace 4 you can filter by userAgent; see
> https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-FilteringandPruningSpiders
>
> I have not used this myself and am not sure whether these filters are
> applied as crawlers access content, or whether you need to run the
> [dspace]/bin/dspace stats-util command on a regular basis. You definitely
> need to run it to prune or mark usage events after you configure
> a list of userAgents you want to filter against.
>
> Monika
>
> ________________
> Monika Mevenkamp
> phone: 609-258-4161
> Princeton University, Princeton, NJ 08544
>
> On May 12, 2015, at 2:13 PM, Anthony Petryk <[email protected]> wrote:
>
> After a bit of investigation, it turns out that a significant portion of
> our item stats come from spiders. Any thoughts on the best way to go
> about removing them from Solr retroactively? There’s nothing that I can
> see in the code that will do this by domain or agent, only IP. We’re not
> excited at the prospect of pulling out the IPs of all the spiders in order
> to run “stats-util -i” effectively.
>
> Cheers,
>
> Anthony
>
> *From:* Monika C.
Mevenkamp [mailto:[email protected]]
> *Sent:* Friday, May 08, 2015 9:59 AM
> *To:* Anthony Petryk
> *Cc:* [email protected]
> *Subject:* Re: [Dspace-tech] spider ip recognition
>
> Anthony
>
> I wrote a small ruby script to put solr queries together when I was poking
> around my stats
>
> see https://github.com/akinom/dscriptor/blob/master/solr/solr_query.rb
>
> an example parameter file is
> https://github.com/akinom/dscriptor/blob/master/solr/solr_query.yml
>
> run it as ruby solr/solr_query.rb
>
> of course you need to adjust the parameters in the yml file
>
> you can query like this
>
> http://localhost:YOUR-PORT/solr/statistics/select?wt=json&indent=true&rows=1&facet=true&facet.field=ip&facet.mincount=1&q=type:2+id:218+isBot:false
>
> exclude records that are marked as bots
> do type:2 - aka items
> do id:218 - aka item with id 218
> return one item
> facet on ip addresses
>
> crank up the number of rows to get more matching docs
>
> Monika
>
> ________________
> Monika Mevenkamp
> phone: 609-258-4161
> Princeton University, Princeton, NJ 08544
>
> On May 7, 2015, at 3:26 PM, Anthony Petryk <[email protected]> wrote:
>
> Anyway, we want to determine whether these stats are bona fide or whether
> there's something wrong with the spider detection. From the documentation
> it seems we have to query Solr directly to do this. Not being an expert in
> Solr, I'm hoping someone on this list could provide the query that
> retrieves *all the stats for a given item* (i.e. what's listed under
> "Common stored fields for all usage events" in the documentation).

--
You received this message because you are subscribed to the Google Groups "DSpace Technical Support" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/dspace-tech. For more options, visit https://groups.google.com/d/optout.
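
For anyone landing on this thread later: the "facet your Solr query by IP" approach mentioned above could be scripted roughly like this. It is only a sketch, not anything shipped with DSpace. The function names, the port 8080, and the sample response are my own assumptions for illustration; the query parameters are the same ones used in the example URL earlier in the thread, plus rows=0 and facet.limit, which are standard Solr parameters.

```python
from urllib.parse import urlencode


def facet_by_ip_url(base_url, query="isBot:false", top_n=20):
    """Build a Solr select URL that facets matching usage events by IP."""
    params = {
        "q": query,
        "rows": 0,              # only the facet counts are needed, not the docs
        "wt": "json",
        "facet": "true",
        "facet.field": "ip",
        "facet.mincount": 1,
        "facet.limit": top_n,   # keep only the busiest IPs
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def top_ips(solr_response):
    """Turn Solr's flat [value, count, value, count, ...] facet list into
    (ip, count) pairs, highest count first."""
    flat = solr_response["facet_counts"]["facet_fields"]["ip"]
    pairs = list(zip(flat[0::2], flat[1::2]))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Assumed base URL; substitute your own host and port.
url = facet_by_ip_url("http://localhost:8080/solr/statistics")

# Made-up response fragment (documentation-range IPs), only to show the shape
# of the JSON that Solr returns for a field facet.
sample = {
    "facet_counts": {
        "facet_fields": {"ip": ["198.51.100.7", 12, "192.0.2.10", 5400]}
    }
}

for ip, count in top_ips(sample):
    print(ip, count)
```

The IPs at the top of that list are the candidates to check by hand before adding them to [dspace]/config/spiders.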
