Just to make sure that I understand:
(1) Using the spider files, the IPs listed in these files are filtered out
before stats are computed. In other words, the IP 77.88.25.28 is found in one
of these files, so visits from this IP are not counted.
(2) Also, I noticed that some IPs are commented out in some of the spider
files, so I suppose these are not filtered out. For example:
# Snap.com
# UA "snap.com beta crawler v0"
# UA "Snapbot/1.0"
# UA "semanticdiscovery/0.2(http://www.semanticdiscovery.com/sd/robot.html)"
# UA "semanticdiscovery/0.4(http://www.semanticdiscovery.com/sd/robot.html"
# 38.98.19.100
# 38.98.19.101
# 38.98.19.102
# 38.98.19.103
# 38.98.19.104
# 38.98.19.105
# 38.98.19.106
Why are these commented out?
(3) And finally, in some of the spider files, you find entries that refer to
websites, and they are not commented out, for example,
# Yahoo.com URL verifiers
# UA "Mozilla/4.05"
morgue1.corp.yahoo.com
216.145.54.35
hanta.yahoo.com
216.145.50.40
I suppose that the IPs are filtered out, but what does the code do with
morgue1.corp.yahoo.com and hanta.yahoo.com?
Thanks! Jose
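To make the three questions above concrete, here is a minimal Python sketch of how a spider file like the ones quoted could be read. It is not the DSpace code. It assumes that lines starting with "#" are comments and are skipped, and it only separates the remaining lines into IP entries and everything else. What DSpace actually does with the non-IP lines (the hostnames) is exactly the open question, so the sketch does not try to answer it. The function name parse_spider_file is hypothetical.

```python
import ipaddress


def parse_spider_file(text):
    """Split spider-file lines into IP entries and other (non-IP) entries.

    Assumption: lines starting with '#' are comments and are ignored.
    """
    ips, others = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out entry
        try:
            ipaddress.ip_address(line)
            ips.append(line)
        except ValueError:
            # Not a literal IP address, e.g. a hostname. How such lines
            # should be handled is deliberately left open here.
            others.append(line)
    return ips, others


SAMPLE = """\
# Yahoo.com URL verifiers
# UA "Mozilla/4.05"
morgue1.corp.yahoo.com
216.145.54.35
hanta.yahoo.com
216.145.50.40
# 38.98.19.100
"""

ips, others = parse_spider_file(SAMPLE)
print(ips)     # ['216.145.54.35', '216.145.50.40']
print(others)  # ['morgue1.corp.yahoo.com', 'hanta.yahoo.com']
```

Under this reading, the commented-out 38.98.19.100 never reaches the filter list, and the two Yahoo hostnames end up in a separate bucket.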
From: Bram Luyten [mailto:[email protected]]
Sent: Tuesday, July 20, 2010 4:14 AM
To: [email protected]
Subject: Re: [Dspace-tech] stats and crawler count
Dear Jose,
Here are some relevant excerpts from the documentation at
http://www.dspace.org/1_6_2Documentation/
If they still leave you with questions, please elaborate, so we can improve the
documentation.
5.2. The dspace.cfg Configuration Properties File
5.2.49. DSpace SOLR Statistics Configuration
Property:
solr.log.server
Example Value:
solr.log.server = ${dspace.baseUrl}/solr/statistics
Informational Note:
Is used by the SolrLogger Client class to connect to the SOLR server over http
and perform updates and queries.
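The example value refers to another property through the ${...} syntax. As an illustration only, a minimal Python sketch of expanding such references could look like the following. The function name expand and the property values in PROPS are hypothetical, and this is not how DSpace itself loads dspace.cfg.

```python
import re

_REF = re.compile(r"\$\{([^}]+)\}")


def expand(value, props, max_depth=10):
    """Replace ${name} references in value with entries from props.

    Expansion is repeated so that nested references resolve, and it stops
    after max_depth passes to avoid looping on circular definitions.
    """
    for _ in range(max_depth):
        expanded = _REF.sub(lambda m: props[m.group(1)], value)
        if expanded == value:
            break
        value = expanded
    return value


# Hypothetical values, chosen only for the example.
PROPS = {
    "dspace.baseUrl": "http://localhost:8080",
    "dspace.dir": "/dspace",
}

print(expand("${dspace.baseUrl}/solr/statistics", PROPS))
# http://localhost:8080/solr/statistics
```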
Property:
solr.spidersfile
Example Value:
solr.spidersfile = ${dspace.dir}/config/spiders.txt
Informational Note:
The spiders file is used by the SolrLogger. It is populated by running the
following command:
dsrun org.dspace.statistics.util.SpiderDetector -i <httpd log file>
Property:
solr.dbfile
Example Value:
solr.dbfile = ${dspace.dir}/config/GeoLiteCity.dat
Informational Note:
The following refers to the GeoLiteCity database file utilized by the
LocationUtils to calculate the location of client requests based on IP address.
During the Ant build process (both fresh_install and update) this file will be
downloaded from http://www.maxmind.com/app/geolitecity if a new version has
been published or it is absent from your [dspace]/config directory.
Property:
useProxies
Example Value:
useProxies = true
Informational Note:
Causes statistics logging to look at the X-Forwarded-For header to detect the
IP address of clients that reach DSpace through a proxy service.
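As a rough illustration of what this setting implies, here is a minimal Python sketch. It assumes the header in question is the common X-Forwarded-For header and that the first entry in it is the original client; the function name client_ip and the proxy address 10.0.0.1 are hypothetical, and this is not the DSpace implementation.

```python
def client_ip(remote_addr, headers, use_proxies=True):
    """Return the address to log for a request.

    remote_addr is the address of the direct peer (possibly a proxy).
    headers is a plain dict of request headers.
    """
    if use_proxies:
        forwarded = headers.get("X-Forwarded-For", "")
        # The header may carry a comma-separated chain of addresses;
        # by convention the first one is the original client.
        first = forwarded.split(",")[0].strip()
        if first:
            return first
    return remote_addr


headers = {"X-Forwarded-For": "77.88.25.28, 10.0.0.1"}
print(client_ip("10.0.0.1", headers))                     # 77.88.25.28
print(client_ip("10.0.0.1", headers, use_proxies=False))  # 10.0.0.1
```

The point for crawler filtering is that without this lookup every visit arriving through the proxy would be logged under the proxy's own address, so spider IPs could never match.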
Property:
statistics.item.authorization.admin
Example Value:
statistics.item.authorization.admin = true
Informational Note:
Enables access control restriction on DSpace Statistics pages. Restrictions are
based on access rights to Community, Collection and Item Pages. This will
require the user to sign on to see the statistics. Setting the property to
"false" will make the statistics publicly available.
Chapter 8. DSpace System Documentation: System Administration
8.15. Client Statistics
Table 8.15. Client Statistics Command Table
Command used:
[dspace]/bin/dspace stats-util
Java class:
org.dspace.statistics.util.StatisticsClient
Arguments (short and long forms) and their descriptions:
  -u or --update-spider-files
      Update Spider IP Files from internet into /dspace/config/spiders. Downloads
      Spider files identified in dspace.cfg under property
  -f or --delete-spiders-by-flag
      Delete Spiders in Solr By isBot Flag. Will prune out all records that have
      isBot:true
  -i or --delete-spiders-by-ip
      Delete Spiders in Solr By IP Address. Will prune out all records that have
      IPs that match spider IPs.
  -m or --mark-spiders
      Update isBot Flag in Solr. Marks any records currently stored in statistics
      that have IP addresses matched in spiders files
  -h or --help
      Calls up this brief help table at CLI.
Notes:
Which of these options to use is up to the user. To keep spider entries in the
repository, mark them using "-m"; they will then be excluded from statistics
queries when "solr.statistics.query.filter.isBot = true" is set in dspace.cfg.
To keep spiders out of the Solr repository altogether, use the "-i" option and
they will be removed immediately.
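The difference between marking and deleting can be sketched with a few in-memory records. This is only a Python illustration of the behaviour described in the notes, not the Solr queries DSpace runs. The record layout, the function names, and the address 192.0.2.10 are assumptions made for the example.

```python
SPIDER_IPS = {"77.88.25.28"}

# Hypothetical usage records; in DSpace these would be Solr documents.
RECORDS = [
    {"ip": "77.88.25.28", "isBot": False},
    {"ip": "192.0.2.10", "isBot": False},
    {"ip": "77.88.25.28", "isBot": False},
]


def mark_spiders(records, spider_ips):
    """Like "-m": keep every record, but flag those from spider IPs."""
    return [dict(r, isBot=r["isBot"] or r["ip"] in spider_ips) for r in records]


def delete_spiders_by_ip(records, spider_ips):
    """Like "-i": drop records whose IP matches a spider IP."""
    return [r for r in records if r["ip"] not in spider_ips]


def delete_spiders_by_flag(records):
    """Like "-f": drop records that are already flagged isBot."""
    return [r for r in records if not r["isBot"]]


def count_visits(records, filter_is_bot=True):
    """Count visits, skipping flagged records when the filter is on."""
    if filter_is_bot:
        return sum(1 for r in records if not r["isBot"])
    return len(records)


marked = mark_spiders(RECORDS, SPIDER_IPS)
print(len(marked), count_visits(marked))   # 3 1

pruned = delete_spiders_by_ip(RECORDS, SPIDER_IPS)
print(len(pruned), count_visits(pruned))   # 1 1
```

Both routes report the same visit count while the filter is on. The marked set still holds the spider records, so turning the filter off brings them back into the count, whereas deleted records are gone.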
There are guards in place to control what can be defined as an IP range for a
bot. In [dspace]/config/spiders, spider IP address ranges have to be at least 3
subnet sections in length (for example 123.123.123), and IP ranges can only
span the smallest subnet [123.123.123.0 - 123.123.123.255]. If an entry does
not meet these rules, loading that row will cause exceptions in the DSpace logs
and the IP entry will be excluded.
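The guard described above can be sketched as a small Python validator. This is one reading of the rule, not the DSpace code: an entry is accepted if it is a full address, a three-section prefix, or a range whose two ends share the first three sections. The function names and the "start - end" text format for ranges are assumptions.

```python
def _octets(text):
    """Parse dotted decimal sections, raising ValueError on bad input."""
    parts = text.strip().split(".")
    if not all(p.isdigit() and int(p) <= 255 for p in parts):
        raise ValueError(f"not a dotted decimal entry: {text!r}")
    return [int(p) for p in parts]


def _dotted(octets):
    return ".".join(str(o) for o in octets)


def validate_spider_entry(entry):
    """Return the (low, high) addresses an entry covers, or raise ValueError."""
    if "-" in entry:
        start, end = (_octets(s) for s in entry.split("-", 1))
        if len(start) != 4 or len(end) != 4:
            raise ValueError("range ends must be full addresses")
        if start[:3] != end[:3] or start[3] > end[3]:
            raise ValueError("range must stay within the last subnet section")
        return _dotted(start), _dotted(end)
    parts = _octets(entry)
    if len(parts) == 3:
        # A three-section prefix covers the whole last section.
        return _dotted(parts + [0]), _dotted(parts + [255])
    if len(parts) == 4:
        return _dotted(parts), _dotted(parts)
    raise ValueError("entry needs at least 3 subnet sections")


print(validate_spider_entry("123.123.123"))
# ('123.123.123.0', '123.123.123.255')
print(validate_spider_entry("123.123.123.0 - 123.123.123.255"))
# ('123.123.123.0', '123.123.123.255')
```

An entry such as "123.123", or a range running from 123.123.0.0 to 123.124.0.0, raises ValueError in this sketch, which corresponds to the row being logged as an exception and skipped.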
kindest regards,
Bram Luyten
@mire - http://www.atmire.com
Technologielaan 9 - 3001 Heverlee - Belgium
533 2nd Street - Encinitas, CA 92024 - USA
http://www.togather.eu - Before getting together, get t...@ther
On Mon, Jul 19, 2010 at 9:52 PM, Mark H. Wood <[email protected]> wrote:
On Mon, Jul 19, 2010 at 10:52:26AM -0400, Blanco, Jose wrote:
> I was looking over the dspace stats code to see if it had anything to remove
> counts from crawlers and I don't see anything in there. I just wanted to
> make sure that is the case.
Would that be the Solr-based statistics code new in 1.6? In 1.6.0 there is
a file called config/spiders.txt to contain a list of crawler IP
addresses. This was changed in a later point release to use multiple
files found in config/spiders. There's also a list of update URLs for
spider lists configured in dspace.cfg as solr.spiderips.urls.
There isn't much documentation, though. We need to correct that.
--
Mark H. Wood, Lead System Programmer [email protected]
Balance your desire for bells and whistles with the reality that only a
little more than 2 percent of world population has broadband.
-- Ledford and Tyler, _Google Analytics 2.0_
------------------------------------------------------------------------------
This SF.net email is sponsored by Sprint
What will you do first with EVO, the first 4G phone?
Visit sprint.com/first<http://sprint.com/first> --
http://p.sf.net/sfu/sprint-com-first
_______________________________________________
DSpace-tech mailing list
[email protected]<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/dspace-tech