Hi Mark,

I sent the message below not long ago, which is related to your concern 
about keeping spiders out of the statistics.

The DSpace stats package analyses the DSpace logs, which do not record 
anything that identifies spiders. As far as I can tell, spiders/web crawlers 
can only be identified from the Apache logs. Thus, if you wish to ignore hits 
from web indexing activities, the log analyser needs to examine the Apache 
logs in addition to the DSpace logs.
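
As a rough illustration of the Apache side, here is a minimal Python sketch 
that parses a line in the Apache "combined" log format and flags it as a 
crawler hit based on the User-Agent string. The function names and the 
CRAWLER_HINTS list are my own assumptions for illustration, not part of 
DSpace or Apache, and a real list of crawler signatures would need regular 
maintenance.

```python
import re
from datetime import datetime

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical, incomplete list of substrings seen in crawler User-Agents.
CRAWLER_HINTS = ("slurp", "googlebot", "msnbot", "crawler", "spider")


def parse_apache_line(line):
    """Return a dict of fields for one combined-format line, or None."""
    m = COMBINED.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


def is_crawler(entry):
    """True if the User-Agent contains one of the crawler hints."""
    agent = entry["agent"].lower()
    return any(hint in agent for hint in CRAWLER_HINTS)


if __name__ == "__main__":
    sample = (
        '74.6.86.213 - - [09/Feb/2007:12:04:45 +0000] '
        '"GET /dspace/bitstream/1983/898/1/webb_IEEE_vtc_spring2006.pdf HTTP/1.0" '
        '200 178656 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; '
        'http://help.yahoo.com/help/us/ysearch/slurp)"'
    )
    entry = parse_apache_line(sample)
    print(entry["host"], entry["time"], is_crawler(entry))
```

Matching on the User-Agent avoids having to maintain a list of spider IP 
addresses, but it only catches crawlers that identify themselves honestly.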

Naveed

Hi,

It appears that the default stats package (General Overview report) 
displays high numbers for item views and bitstream views (item downloads) 
because many web crawlers access our repositories to download the full 
text for indexing. The majority of the traffic comes from machines, not end 
users, which gives a misleading impression of the DSpace stats: e.g. 15,000 
downloads and 30,000 item views in our case for January 2007.

For example:

our apache log shows a web crawler accessing the full text of a PDF:-
74.6.86.213 - - [09/Feb/2007:12:04:45 +0000] "GET /dspace/bitstream/1983/898/1/webb_IEEE_vtc_spring2006.pdf HTTP/1.0" 200 178656 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

the corresponding entry in our dspace log:-
2007-02-09 12:04:45,120 INFO  org.dspace.app.webui.servlet.BitstreamServlet @ anonymous:session_id=FED19B6B8FDDB615271F818BB7B766C4:ip_addr=137.222.120.28:view_bitstream:bitstream_id=1706

It would be nice to filter activity coming from web crawlers during log 
file analysis. Is it worth adding this as a feature request? Any ideas on 
how this can be achieved?
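
One possible approach, sketched below in Python, is to drop DSpace log 
entries whose timestamp coincides with a request that the Apache log 
attributes to a crawler. The two example entries above record different IP 
addresses, so this sketch matches on the timestamp (to the second) rather 
than on the address. The function name and the crawler_times input are 
hypothetical, the regular expression is based only on the single DSpace log 
line shown above, and both logs are assumed to use the same time zone. It is 
a crude heuristic: a genuine user request in the same second as a crawler 
request would also be dropped.

```python
import re
from datetime import datetime

# Pattern inferred from the example DSpace log line only:
# "date time,ms LEVEL class @ user:session_id=..:ip_addr=..:action:detail"
DSPACE = re.compile(
    r'(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+\w+\s+\S+\s+@\s+'
    r'(?P<user>[^:]*):session_id=(?P<session>[^:]*):ip_addr=(?P<ip>[^:]*):'
    r'(?P<action>[^:]*):(?P<detail>.*)'
)


def drop_crawler_views(dspace_lines, crawler_times, actions=("view_bitstream",)):
    """Return the lines that are NOT views made at a crawler timestamp.

    crawler_times: set of naive datetimes (second resolution) taken from
    Apache log entries already identified as crawler requests.
    """
    kept = []
    for line in dspace_lines:
        m = DSPACE.match(line)
        if m and m.group("action") in actions:
            seen = datetime.strptime(m.group("time"), "%Y-%m-%d %H:%M:%S")
            if seen in crawler_times:
                continue  # treat as a crawler hit and leave it out
        kept.append(line)  # unparsed or non-matching lines pass through
    return kept


if __name__ == "__main__":
    lines = [
        "2007-02-09 12:04:45,120 INFO  org.dspace.app.webui.servlet.BitstreamServlet @ "
        "anonymous:session_id=FED19B6B8FDDB615271F818BB7B766C4:"
        "ip_addr=137.222.120.28:view_bitstream:bitstream_id=1706",
    ]
    crawler_times = {datetime(2007, 2, 9, 12, 4, 45)}
    print(len(drop_crawler_views(lines, crawler_times)))
```

The filtered lines could then be written to a temporary file and passed to 
the existing log analyser, so the analyser itself would not need changing.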

Thanks,

Naveed
--------------------------------------------------------
Naveed Hashmi
Information Systems and Computing
University of Bristol


Message: 3
Date: Tue, 20 Mar 2007 15:13:08 -0400
From: "Mark H. Wood" <[EMAIL PROTECTED]>
Subject: [Dspace-tech] Keeping spiders out of the statistics
To: dspace-tech@lists.sourceforge.net
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset="us-ascii"

Has anyone found a fairly good automatic method of maintaining a list
of spider addresses, for ignoring hits from web indexing activities
when counting document fetches?

-- 
Mark H. Wood, Lead System Programmer   [EMAIL PROTECTED]
Typically when a software vendor says that a product is "intuitive" he
means the exact opposite.

--------------------------------------------------------
Naveed Hashmi
Information Systems and Computing
University of Bristol
[EMAIL PROTECTED]


