[
https://jira.duraspace.org/browse/DS-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27352#comment-27352
]
Steve Swinsburg commented on DS-790:
------------------------------------
We use a variety of stats packages and recently noticed that the DSpace Solr
stats for the past few months were wildly inflated, while Google Analytics
looked fine. Using Webalizer we found the following spiders in the logs, so I
created a custom IP list, shared here. It turns out you can just drop an extra
file into the spiders directory in the code: the stats-util loads every file
in that directory.
# Funnelback
122.99.95.231
# Google
66.249.74
66.249.76
66.249.77
# Yandex
199.21.99
# MSN
65.55.52
65.55.55
157.55.32
157.55.33
157.55.34
157.55.35
157.55.36
157.56.93
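
For anyone who wants to check a list like the one above against their own
logs, here is a minimal sketch of the idea in Python. It is not the DSpace
stats-util code. It assumes that lines starting with "#" are comments and that
a partial entry such as 66.249.74 stands for every address under that prefix.
The file name ip-list-custom.txt and the function names are made up for the
example.

```python
import tempfile
from pathlib import Path


def load_spider_entries(spider_dir):
    """Collect IP entries from every regular file in spider_dir."""
    entries = set()
    for path in sorted(Path(spider_dir).iterdir()):
        if not path.is_file():
            continue
        for line in path.read_text().splitlines():
            line = line.strip()
            # Assumption: blank lines and "#" lines carry no IP data.
            if line and not line.startswith("#"):
                entries.add(line.rstrip("."))
    return entries


def is_spider_ip(ip, entries):
    """True if ip equals an entry or falls under a partial (prefix) entry."""
    # Appending "." keeps 66.249.74 from matching 66.249.741.x.
    return any(ip == entry or ip.startswith(entry + ".") for entry in entries)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as spider_dir:
        # Hypothetical extra file dropped next to the stock lists.
        Path(spider_dir, "ip-list-custom.txt").write_text(
            "# Funnelback\n122.99.95.231\n# Google\n66.249.74\n"
        )
        entries = load_spider_entries(spider_dir)
        for ip in ("66.249.74.12", "122.99.95.231", "10.0.0.1"):
            print(ip, is_spider_ip(ip, entries))
```

Matching on a dot boundary rather than a bare string prefix is a deliberate
choice here, so that a short entry cannot accidentally cover unrelated ranges.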
> SOLR - Spider detection to match on hostname or useragent
> ---------------------------------------------------------
>
> Key: DS-790
> URL: https://jira.duraspace.org/browse/DS-790
> Project: DSpace
> Issue Type: Improvement
> Components: Solr
> Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.7.0
> Environment: solr
> Reporter: Peter Dietz
> Assignee: Mark H. Wood
> Original Estimate: 0 minutes
> Remaining Estimate: 0 minutes
>
> Spiders are currently detected by matching their IP address against those
> listed in /dspace/config/spiders/ip-list-X.txt. However, as spiders change
> IP addresses, or when the ip-list goes unmaintained, many spiders can slip
> through, even though they usually keep their user agent or hostname intact.
> I've noticed a sore point in my Solr data, where msnbot is completely
> unfiltered by Solr. There is an additional IP list for it
> (http://www.iplists.com/nw/msn.txt), but it is very old, and with additional
> bingbots on the horizon, it would be easier to detect them and filter them
> out of the logs by user agent than to maintain all of the IP address ranges.
> The code to do this in Solr is unimplemented, and this ticket is a
> placeholder to encourage finishing the work to filter based on user agent /
> DNS hostname.
> To see all of the hits from msnbot that are unfiltered, look at:
> http://localhost:8080/solr/statistics/select?q=dns:msnbot*&facet=true&facet.field=dns&facet.mincount=1&facet.limit=5000
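
To illustrate what the ticket asks for, here is a rough Python sketch of
matching on user agent or DNS hostname instead of IP. This is not existing
DSpace code, since the ticket says the feature is unimplemented. The pattern
lists and the is_spider function are hypothetical and only cover the msnbot
and bingbot cases mentioned above.

```python
import re

# Illustrative patterns only; a real deployment would maintain its own lists.
AGENT_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (r"msnbot", r"bingbot")]
# Mirrors the dns:msnbot* query above: hostnames that begin with "msnbot".
HOST_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (r"^msnbot",)]


def is_spider(user_agent=None, hostname=None):
    """True if either the user agent or the resolved hostname looks like a spider."""
    if user_agent and any(p.search(user_agent) for p in AGENT_PATTERNS):
        return True
    if hostname and any(p.search(hostname) for p in HOST_PATTERNS):
        return True
    return False


if __name__ == "__main__":
    print(is_spider(user_agent="msnbot/2.0b (+http://search.msn.com/msnbot.htm)"))
    print(is_spider(hostname="msnbot-65-55-52-10.search.msn.com"))
    print(is_spider(user_agent="Mozilla/5.0 (X11; Linux x86_64)",
                    hostname="host.example.edu"))
```

Unlike the IP lists, these patterns keep working when a crawler moves to a new
address range, as long as it still announces itself in its user agent or its
reverse DNS name.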
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
_______________________________________________
Dspace-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-devel