[ 
https://jira.duraspace.org/browse/DS-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=26862#comment-26862
 ] 

Bram Luyten (@mire) commented on DS-790:
----------------------------------------

Again, this is slightly out of scope for the title, but I preferred to log 
it here instead of starting a new issue.
I came across another service that offers IP lists:

http://myip.ms/browse/blacklist/Blacklist_IP_Blacklist_IP_Addresses_Live_Database_Real-time
text file export: 
http://myip.ms/downloads/blacklist/Blacklist_IP_Blacklist_IP_Addresses_Live_Database_Real-time

http://myip.ms/browse/web_bots/Known_Web_Bots_Web_Bots_2012_Web_Spider_List.html
text file export: 
http://myip.ms/downloads/web_bots/Known_Web_Bots_Web_Bots_2012_Web_Spider_List.html
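
As a rough illustration of how a text export like these could be consumed, here is a minimal Python sketch. The file layout is an assumption (one address per line, with anything after '#' treated as a comment); I have not checked it against the actual downloads. The function names and sample addresses are made up for the example.

```python
import ipaddress


def parse_ip_list(text):
    """Return the set of valid IP addresses found in a list export.

    Assumed layout: one address per line, '#' starts a comment.
    """
    addresses = set()
    for raw_line in text.splitlines():
        # Drop the (assumed) trailing comment and surrounding whitespace.
        entry = raw_line.split("#", 1)[0].strip()
        if not entry:
            continue
        try:
            addresses.add(ipaddress.ip_address(entry))
        except ValueError:
            # Not a single IPv4/IPv6 address (e.g. a range); skip it.
            continue
    return addresses


def is_listed(ip, addresses):
    """True if the given address string is in the parsed set."""
    return ipaddress.ip_address(ip) in addresses


# Hypothetical sample using documentation-only addresses.
SAMPLE = """\
# example export
192.0.2.10   # some comment
2001:db8::1
not-an-address
"""

listed = parse_ip_list(SAMPLE)
print(is_listed("192.0.2.10", listed))
print(is_listed("192.0.2.11", listed))
```

Parsing into address objects rather than comparing raw strings means equivalent spellings of the same IPv6 address still match.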
                
> SOLR - Spider detection to match on hostname or useragent
> ---------------------------------------------------------
>
>                 Key: DS-790
>                 URL: https://jira.duraspace.org/browse/DS-790
>             Project: DSpace
>          Issue Type: Improvement
>          Components: Solr
>    Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.7.0
>         Environment: solr
>            Reporter: Peter Dietz
>            Assignee: Mark H. Wood
>   Original Estimate: 0 minutes
>  Remaining Estimate: 0 minutes
>
> Spiders are currently detected by matching their IP address against the 
> addresses listed in /dspace/config/spiders/ip-list-X.txt. However, when 
> spiders change IP addresses or the ip-list is unmaintained, many spiders 
> slip through, even though they usually keep their user agent or hostname 
> intact.
> I've noticed a sore point in my solr data, where msnbot is completely 
> unfiltered by solr. There is an additional IP list for it: 
> http://www.iplists.com/nw/msn.txt but it is very old, and with additional 
> bingbots on the horizon, it would be easier to detect them and filter them 
> out of the logs by user agent than to maintain all of the IP address 
> ranges. The code to do this in SOLR is unimplemented, and this ticket is a 
> placeholder to encourage finishing the work to filter based on user agent / 
> DNS hostname.
> To see all of the hits from msnbot that are unfiltered, look at: 
> http://localhost:8080/solr/statistics/select?q=dns:msnbot*&facet=true&facet.field=dns&facet.mincount=1&facet.limit=5000
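
The matching requested in the description could look roughly like the Python sketch below. This is not DSpace code: the pattern lists, the function name, and the sample values are assumptions for illustration only, and a real implementation would presumably load its patterns from configuration files next to the existing ip-list files.

```python
import re

# Illustrative patterns only; a real list would be maintained in config.
AGENT_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (r"msnbot", r"bingbot")]
# Mirrors the dns:msnbot* query above: hostnames starting with "msnbot".
HOST_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (r"^msnbot",)]


def is_spider(user_agent, hostname):
    """Return True if the user agent or the DNS hostname matches a pattern.

    Either argument may be None when the value was not recorded.
    """
    if user_agent and any(p.search(user_agent) for p in AGENT_PATTERNS):
        return True
    if hostname and any(p.search(hostname) for p in HOST_PATTERNS):
        return True
    return False


# Hypothetical sample hits.
print(is_spider("msnbot/2.0b (+http://search.msn.com/msnbot.htm)", None))
print(is_spider("Mozilla/5.0 (X11; Linux x86_64)", "msnbot-157-55-39-1.search.msn.com"))
print(is_spider("Mozilla/5.0 (X11; Linux x86_64)", "client.example.org"))
```

Checking both fields independently means a spider is still caught when it spoofs a browser user agent but resolves to a crawler hostname, or the other way around.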

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

_______________________________________________
Dspace-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-devel
