[
https://jira.duraspace.org/browse/DS-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=20003#action_20003
]
Tim Donohue commented on DS-790:
--------------------------------
This issue was discussed during the DSpace Developers Meeting on April 20, 2011:
[20:17] <tdonohue> SOLR - Spider detection to match on hostname or useragent :
https://jira.duraspace.org/browse/DS-790
[20:17] <mdiggory> It's a warning more than an error
[20:17] <mdiggory> referring to DS-875
[20:17] <mhwood> 790: sounds reasonable, needs code.
[20:18] <tdonohue> stuartlewis -- let's wait till after this part of meeting.
we're still playing "catchup" on our large backlog. We can skip to DS-875
during main discussion
[20:18] <stuartlewis> Sure - no problem.
[20:18] <stuartlewis> mhwood: +1
[20:18] <tdonohue> mhwood +1 on DS-790
[20:19] <mdiggory> DS-790 also why not add the known bingbots to a custom txt
file for future releases?
[20:19] <mdiggory> bingbots = dingbats
[20:19] <mhwood> Probably should, but it's another entry in the Red Queen's
Race.
[20:20] <tdonohue> mdiggory -- PeterDietz is suggesting that it's difficult to
maintain listing of bots by IP addresses, and rather than filtering on IP, we
should be filtering on user-agent. (or at least that's how I read it)
[20:20] <mdiggory> I think we avoided doing dns lookups for efficiency
[20:21] <mdiggory> Filtering on both was once an original requirement
[20:21] <mdiggory> # UA "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
[20:21] <mdiggory> and they are in the files
[20:21] <PeterDietz> hi all, sorry, got distracted
[20:21] <tdonohue> right, I know the user-agents are in the files, but does
Solr actually filter based on user-agents? or just by IP?
[20:21] * mhwood wishes for a standard "I am a bot" header for well-behaved
spiders to use. But, short of that, if they are giving us useful agent strings
then we should use them.
[20:22] <stuartlewis> Just IP at present?
[20:23] <tdonohue> I think it's *just IP* at present. But, I'd agree with
DS-790, that we should also allow filtering by user-agent string
[20:23] <mdiggory> It's mostly just enhancing
http://scm.dspace.org/svn/repo/dspace/trunk//dspace-stats/src/main/java/org/dspace/statistics/util/SpiderDetector.java
[20:24] <tdonohue> OK. DS-790 summary: +3 vote. Needs code (enhance
SpiderDetector) & volunteer
[20:24] <mdiggory> +1 for enhancing
[20:24] <richardrodgers> Does someone want to have a go at it, is the question
...
[20:25] <mhwood> Put my name on 790.
...
[20:25] <tdonohue> DS-790 - assign to mhwood (thanks Mark!)
> SOLR - Spider detection to match on hostname or useragent
> ---------------------------------------------------------
>
> Key: DS-790
> URL: https://jira.duraspace.org/browse/DS-790
> Project: DSpace
> Issue Type: Improvement
> Components: Solr
> Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.7.0
> Environment: solr
> Reporter: Peter Dietz
> Assignee: Mark H. Wood
>
> Spiders are currently detected by matching their IP address against those
> listed in /dspace/config/spiders/ip-list-X.txt. However, when spiders change
> IP addresses, or the IP list goes unmaintained, many spiders slip through,
> even though they usually keep their user agent or hostname intact.
> I've noticed a sore point in my Solr data, where msnbot is completely
> unfiltered by Solr. There is an additional IP list,
> http://www.iplists.com/nw/msn.txt, but it is very old, and with additional
> bingbots on the horizon it would be easier to detect them and filter them out
> of the logs by user agent than to maintain all of the IP address ranges. The
> code to do this in Solr is unimplemented, and this ticket is a placeholder
> to encourage finishing the work to filter based on user agent / DNS hostname.
> To see all of the hits from msnbot that are unfiltered, look at:
> http://localhost:8080/solr/statistics/select?q=dns:msnbot*&facet=true&facet.field=dns&facet.mincount=1&facet.limit=5000
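
To illustrate the kind of matching the ticket asks for, here is a minimal
sketch in Java. It is not the existing SpiderDetector code and not the eventual
fix; the class name UserAgentSpiderMatcher, its constructor, and the idea of
feeding it one regular expression per line (with '#' comment lines skipped) are
all assumptions made for illustration. The same check could be applied to
either a user-agent string or a DNS hostname.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not part of DSpace): decides whether a user-agent
 * string or DNS hostname looks like a known spider, based on a list of
 * regular expressions.
 */
class UserAgentSpiderMatcher {
    private final List<Pattern> patterns = new ArrayList<>();

    /** Assumed input: one regex per entry; blank and '#' comment entries are skipped. */
    UserAgentSpiderMatcher(List<String> regexes) {
        for (String regex : regexes) {
            String trimmed = regex.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Case-insensitive so "MSNBot" and "msnbot" are treated alike.
            patterns.add(Pattern.compile(trimmed, Pattern.CASE_INSENSITIVE));
        }
    }

    /** Returns true if any pattern occurs anywhere in the given value. */
    boolean isSpider(String value) {
        if (value == null || value.isEmpty()) {
            return false;
        }
        for (Pattern p : patterns) {
            if (p.matcher(value).find()) {
                return true;
            }
        }
        return false;
    }
}
```

With patterns "msnbot" and "bingbot", isSpider returns true for the user agent
quoted in the meeting log, "msnbot/1.0 (+http://search.msn.com/msnbot.htm)".
Compiling the patterns once avoids per-request regex compilation, and no DNS
lookup is performed here; the caller is assumed to supply whatever hostname it
already has.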