SOLR - Spider detection to match on hostname or useragent
---------------------------------------------------------
Key: DS-790
URL: https://jira.duraspace.org/browse/DS-790
Project: DSpace
Issue Type: Improvement
Components: Solr
Affects Versions: 1.7.0, 1.6.2, 1.6.1, 1.6.0
Environment: solr
Reporter: Peter Dietz
Spiders are currently detected by matching their IP address against the lists
in /dspace/config/spiders/ip-list-X.txt. When spiders change IP addresses, or
the ip-list goes unmaintained, many spiders slip through. They usually keep
their user agent or hostname intact, however.
I've noticed a sore point in my solr data, where msnbot is completely
unfiltered by solr. An additional IP list for it exists at
http://www.iplists.com/nw/msn.txt, but it is very old. With additional
bingbots on the horizon, it would be easier to detect them and filter them out
of the logs by user agent than to maintain all of the IP address ranges. The
code to do this in SOLR is unimplemented; this ticket is a placeholder to
encourage finishing the work to filter based on user agent / dns-hostname.
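As a rough illustration of the idea, here is a minimal sketch of matching on
user agent or hostname. This is not existing DSpace code: the class name
SpiderMatcher, the method isSpider, and the example patterns are all
hypothetical, and a real implementation would presumably load its patterns
from configuration files alongside the ip-lists.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not part of DSpace): flags a request as a spider when
// its user agent or reverse-DNS hostname matches one of the known patterns.
class SpiderMatcher {

    // Example patterns only (assumptions); real ones would come from config.
    private static final List<Pattern> AGENT_PATTERNS = Arrays.asList(
            Pattern.compile("msnbot", Pattern.CASE_INSENSITIVE),
            Pattern.compile("bingbot", Pattern.CASE_INSENSITIVE));

    // Assumes spider hostnames start with "msnbot", as the dns:msnbot* query suggests.
    private static final List<Pattern> HOST_PATTERNS = Arrays.asList(
            Pattern.compile("^msnbot", Pattern.CASE_INSENSITIVE));

    // True if either the user agent or the hostname matches a spider pattern.
    static boolean isSpider(String userAgent, String hostname) {
        return matchesAny(AGENT_PATTERNS, userAgent)
                || matchesAny(HOST_PATTERNS, hostname);
    }

    // Null or empty values never match, so missing headers are not flagged.
    private static boolean matchesAny(List<Pattern> patterns, String value) {
        if (value == null || value.isEmpty()) {
            return false;
        }
        for (Pattern p : patterns) {
            if (p.matcher(value).find()) {
                return true;
            }
        }
        return false;
    }
}
```

A check like this could run in addition to the IP lookup, so a hit that
passes the IP lists would still be flagged when its user agent or hostname
gives it away.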
To see all of the hits from msnbot that are unfiltered, look at:
http://localhost:8080/solr/statistics/select?q=dns:msnbot*&facet=true&facet.field=dns&facet.mincount=1&facet.limit=5000
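The same query can be built for other hostname prefixes. The sketch below
only assembles the URL shown above from its parts; the class and method names
(SpiderFacetQuery, buildFacetUrl) are hypothetical, and the base URL is
assumed to be the local statistics core from this ticket.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds the Solr facet query for hits whose dns field
// starts with the given prefix.
class SpiderFacetQuery {

    static String buildFacetUrl(String solrBase, String hostPrefix) {
        try {
            // Only the q value needs encoding; ':' becomes %3A, '*' is left as-is.
            String q = URLEncoder.encode("dns:" + hostPrefix + "*", "UTF-8");
            return solrBase + "/select?q=" + q
                    + "&facet=true"          // turn faceting on
                    + "&facet.field=dns"     // group hits by hostname
                    + "&facet.mincount=1"    // skip hostnames with no hits
                    + "&facet.limit=5000";   // return up to 5000 hostnames
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, buildFacetUrl("http://localhost:8080/solr/statistics", "msnbot")
yields the msnbot query above with the colon percent-encoded.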
_______________________________________________
Dspace-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-devel