Just to make sure that I understand:

(1)    Using the spider files, the IPs listed in these files are filtered out 
before stats are computed.  In other words, the IP 77.88.25.28 is found in one 
of these files, so visits from this IP are not counted.


(2)    Also,  I noticed that some IPs are commented out in some of the spider 
files, so I suppose these are not filtered out.  For example:

# Snap.com
# UA "snap.com beta crawler v0"
# UA "Snapbot/1.0"
# UA "semanticdiscovery/0.2(http://www.semanticdiscovery.com/sd/robot.html)"
# UA "semanticdiscovery/0.4(http://www.semanticdiscovery.com/sd/robot.html";
# 38.98.19.100
# 38.98.19.101
# 38.98.19.102
# 38.98.19.103
# 38.98.19.104
# 38.98.19.105
# 38.98.19.106

Why are these commented out?


(3)    And finally, in some of the spider files, you find entries that refer to 
websites, and they are not commented out. For example:

# Yahoo.com URL verifiers
# UA "Mozilla/4.05"
morgue1.corp.yahoo.com
216.145.54.35
hanta.yahoo.com
216.145.50.40

I suppose that the IPs are filtered out, but what does the code do with 
morgue1.corp.yahoo.com and hanta.yahoo.com?
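
A minimal sketch of the three cases above, written in Python rather than DSpace's Java. It is not the DSpace implementation: it assumes that lines starting with "#" are ignored and that every other line is an active entry. The function name classify_spider_lines is hypothetical. Entries that are not IP addresses are only collected separately, since how DSpace treats them is exactly the open question.

```python
import ipaddress


def classify_spider_lines(lines):
    """Split spider-file lines into active IPs, other active entries, and comments.

    Assumption (not confirmed by the DSpace source): a leading '#' marks a
    comment, and anything else is an active entry.
    """
    ips, others, comments = [], [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("#"):
            comments.append(line)  # commented out, so not an active entry
            continue
        try:
            ipaddress.ip_address(line)  # raises ValueError if not an IP
            ips.append(line)
        except ValueError:
            others.append(line)  # e.g. a hostname; handling is undecided here
    return ips, others, comments


sample = [
    "# Yahoo.com URL verifiers",
    '# UA "Mozilla/4.05"',
    "morgue1.corp.yahoo.com",
    "216.145.54.35",
    "hanta.yahoo.com",
    "216.145.50.40",
    "# 38.98.19.100",
]
ips, others, comments = classify_spider_lines(sample)
print("active IPs:", ips)
print("other active entries:", others)
print("commented-out lines:", len(comments))
```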

Thanks! Jose


From: Bram Luyten [mailto:[email protected]]
Sent: Tuesday, July 20, 2010 4:14 AM
To: [email protected]
Subject: Re: [Dspace-tech] stats and crawler count

Dear Jose,

Here are some relevant excerpts from the documentation at 
http://www.dspace.org/1_6_2Documentation/

If this still leaves you with questions, please elaborate so we can improve the 
documentation.

5.2. The dspace.cfg Configuration Properties File
5.2.49. DSpace SOLR Statistics Configuration
Property:

solr.log.server

Example Value:

solr.log.server = ${dspace.baseUrl}/solr/statistics

Informational Note:

Is used by the SolrLogger Client class to connect to the SOLR server over http 
and perform updates and queries.



Property:

solr.spidersfile

Example Value:

solr.spidersfile = ${dspace.dir}/config/spiders.txt

Informational Note:

The spiders file is used by the SolrLogger. It is populated by running the 
following command: dsrun org.dspace.statistics.util.SpiderDetector -i <httpd 
log file>



Property:

solr.dbfile

Example Value:

solr.dbfile = ${dspace.dir}/config/GeoLiteCity.dat

Informational Note:

This is the GeoLiteCity database file used by LocationUtils to determine the 
location of client requests based on IP address. During the Ant build process 
(both fresh_install and update), this file is downloaded from 
http://www.maxmind.com/app/geolitecity if a new version has been published or 
if it is absent from your [dspace]/config directory.



Property:

useProxies

Example Value:

useProxies = true

Informational Note:

Causes statistics logging to look at the X-Forwarded-For header to detect the 
IP of clients that have accessed DSpace through a proxy service. This allows 
detection of the client IP when DSpace is accessed through a proxy.



Property:

statistics.item.authorization.admin

Example Value:

statistics.item.authorization.admin = true

Informational Note:

Enables access control restrictions on DSpace Statistics pages. Restrictions 
are based on access rights to Community, Collection and Item pages. This will 
require the user to sign on to see the statistics. Setting this property to 
"false" will make the statistics publicly available.
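
To illustrate what the useProxies setting above is for, here is a minimal Python sketch of choosing which address to log for a request. It is not the DSpace code: the function name client_ip is hypothetical, headers are modelled as a plain dict, and taking the left-most entry of X-Forwarded-For is an assumption based on the usual convention for that header, not something the excerpt states.

```python
def client_ip(remote_addr, headers, use_proxies):
    """Return the address to record for a request.

    remote_addr: address of the directly connected peer (possibly a proxy).
    headers: dict of request headers (exact-case keys, for simplicity).
    use_proxies: stands in for the useProxies configuration property.
    """
    if use_proxies:
        forwarded = headers.get("X-Forwarded-For", "")
        # Assumption: the left-most non-empty entry is the original client.
        for part in forwarded.split(","):
            candidate = part.strip()
            if candidate:
                return candidate
    # No proxy handling requested, or no usable header: use the peer address.
    return remote_addr


request_headers = {"X-Forwarded-For": "203.0.113.7, 10.0.0.1"}
print(client_ip("10.0.0.1", request_headers, use_proxies=True))
print(client_ip("10.0.0.1", request_headers, use_proxies=False))
```

With useProxies disabled, every visit that arrives through the proxy would be recorded under the proxy's own address, which is why the setting matters for both location lookups and spider filtering by IP.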


Chapter 8. DSpace System Documentation: System Administration
8.15. Client Statistics

Table 8.15. Client Statistics Command Table
Command used:

[dspace]/bin/dspace stats-util

Java class:

org.dspace.statistics.util.StatisticsClient

Arguments (short and long forms), each followed by its description:

-u or --update-spider-files

Update Spider IP Files from internet into /dspace/config/spiders. Downloads 
Spider files identified in dspace.cfg under property

-f or --delete-spiders-by-flag

Delete Spiders in Solr By isBot Flag. Will prune out all records that have 
isBot:true

-i or --delete-spiders-by-ip

Delete Spiders in Solr By IP Address. Will prune out all records that have IPs 
that match spider IPs.

-m or --mark-spiders

Update isBot Flag in Solr. Marks any records currently stored in statistics 
that have IP addresses matched in the spider files.

-h or --help

Calls up this brief help table at CLI.



Notes:

How to use these options is up to the user. If they want to keep spider 
entries in their repository, they can just mark them using "-m", and they will 
be excluded from statistics queries when 
"solr.statistics.query.filter.isBot = true" is set in dspace.cfg.

If they want to keep the spiders out of the Solr repository, they can just use 
the "-i" option and they will be removed immediately.

There are guards in place to control what can be defined as an IP range for a 
bot. In [dspace]/config/spiders, spider IP address ranges have to be at least 3 
subnet sections in length (for example 123.123.123), and IP ranges can only be 
on the smallest subnet [123.123.123.0 - 123.123.123.255]. If not, loading that 
row will cause exceptions in the DSpace logs and that IP entry will be excluded.
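
The guard described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the DSpace implementation: the function name parse_spider_entry is hypothetical, a 4-section entry is treated as a single address, a 3-section entry as the range .0 to .255, and anything else as an error.

```python
import ipaddress


def parse_spider_entry(entry):
    """Return (first_ip, last_ip) for a spider entry, or raise ValueError.

    '123.123.123.5' -> a single address (first == last)
    '123.123.123'   -> the range 123.123.123.0 - 123.123.123.255
    Shorter entries (for example '123.123') are rejected.
    """
    text = entry.strip()
    parts = text.split(".")
    if len(parts) == 4:
        ip = ipaddress.IPv4Address(text)  # raises ValueError if malformed
        return ip, ip
    if len(parts) == 3:
        first = ipaddress.IPv4Address(text + ".0")
        last = ipaddress.IPv4Address(text + ".255")
        return first, last
    raise ValueError(f"entry must have at least 3 sections: {entry!r}")


first, last = parse_spider_entry("123.123.123")
print(first, "-", last)

try:
    parse_spider_entry("123.123")
except ValueError as err:
    # Corresponds to the case where the row is reported and then skipped.
    print("rejected:", err)
```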

Kindest regards,

Bram Luyten

@mire - http://www.atmire.com

Technologielaan 9 - 3001 Heverlee - Belgium
533 2nd Street - Encinitas, CA 92024 - USA

http://www.togather.eu - Before getting together, get t...@ther

On Mon, Jul 19, 2010 at 9:52 PM, Mark H. Wood 
<[email protected]<mailto:[email protected]>> wrote:
On Mon, Jul 19, 2010 at 10:52:26AM -0400, Blanco, Jose wrote:
> I was looking over the dspace stats code to see if it had anything to remove 
> counts from crawlers and I don't see anything in there.  I just wanted to 
> make sure that is the case.
Would that be the Solr-based stat. code new in 1.6?  In 1.6.0 there is
a file called config/spiders.txt to contain a list of crawler IP
addresses.  This was changed in a later point release to use multiple
files found in config/spiders.  There's also a list of update URLs for
spider lists configured in dspace.cfg as solr.spiderips.urls.

There isn't much documentation, though.  We need to correct that.

--
Mark H. Wood, Lead System Programmer   [email protected]
Balance your desire for bells and whistles with the reality that only a
little more than 2 percent of world population has broadband.
       -- Ledford and Tyler, _Google Analytics 2.0_

------------------------------------------------------------------------------
This SF.net email is sponsored by Sprint
What will you do first with EVO, the first 4G phone?
Visit sprint.com/first<http://sprint.com/first> -- 
http://p.sf.net/sfu/sprint-com-first
_______________________________________________
DSpace-tech mailing list
[email protected]<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/dspace-tech
