Re: [Dspace-tech] Keeping spiders out of the statistics
I've developed a throttle to slow down fetches from crawlers. It is configured in the dspace-web.xml file as a filter with 3 parameters:

1. PERIOD
2. Number of HITS to allow for the PERIOD. Say you have PERIOD set to 10 seconds and HITS to 20: you will allow 20 hits from a given IP address within that period. If that is exceeded, the system denies access to that IP address.
3. Time to block the IP address that exceeded the hit limit from regaining access.

I got some of the code for this from the Tapir project. If this is something you are interested in, I can send you the code I have.

-Jose

-----Original Message-----
From: Cory Snavely [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 21, 2007 9:22 AM
To: Jose Blanco
Subject: [Fwd: [Dspace-tech] Keeping spiders out of the statistics]

You should consider posting about what you developed.

-------- Forwarded Message --------
From: Mark H. Wood <[EMAIL PROTECTED]>
To: dspace-tech@lists.sourceforge.net
Subject: [Dspace-tech] Keeping spiders out of the statistics
Date: Tue, 20 Mar 2007 15:13:08 -0400

Has anyone found a fairly good automatic method of maintaining a list of spider addresses, for ignoring hits from web indexing activities when counting document fetches?

___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
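For readers who want to see the idea in code, here is a minimal sketch of a per-IP throttle with the three parameters described above. It is not Jose's filter or the Tapir code: the class name `HitThrottle`, the sliding-window bookkeeping, and the 60-second default block time are assumptions made for illustration, and a real deployment would wrap logic like this in a servlet filter.

```python
import time
from collections import defaultdict, deque


class HitThrottle:
    """Allow at most `hits` requests per `period` seconds from one IP.

    An IP that exceeds the limit is denied for `block_time` seconds.
    All names and defaults here are illustrative assumptions.
    """

    def __init__(self, period=10.0, hits=20, block_time=60.0,
                 clock=time.monotonic):
        self.period = period          # PERIOD, in seconds
        self.hits = hits              # HITS allowed per PERIOD
        self.block_time = block_time  # how long an offender stays blocked
        self.clock = clock            # injectable clock, handy for testing
        self._recent = defaultdict(deque)  # ip -> timestamps of recent hits
        self._blocked_until = {}           # ip -> time at which the block ends

    def allow(self, ip):
        """Return True if the request may proceed, False if it is denied."""
        now = self.clock()

        # Still inside a block? Deny. Block expired? Forget it.
        until = self._blocked_until.get(ip)
        if until is not None:
            if now < until:
                return False
            del self._blocked_until[ip]

        # Drop hits that have fallen out of the current period.
        window = self._recent[ip]
        while window and now - window[0] >= self.period:
            window.popleft()

        window.append(now)
        if len(window) > self.hits:
            # Limit exceeded: start a block and reset the counter.
            self._blocked_until[ip] = now + self.block_time
            window.clear()
            return False
        return True
```

With PERIOD set to 10 and HITS to 20, the 21st request from one address inside ten seconds is refused, and so is everything else from that address until the block time has passed. Other addresses are unaffected.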
Re: [Dspace-tech] Keeping spiders out of the statistics
Hi Mark,

I sent the message below not long ago; it relates to your concern about keeping spiders out of the statistics.

The DSpace stats package analyses the DSpace logs, which do not record information about spiders. Information from spiders/web crawlers can only be viewed in the Apache logs? Thus, if you wish to ignore hits from web indexing activities, the log analyser needs to examine the Apache logs in addition to the DSpace logs.

Naveed

Hi,

It appears that the default stats package (General Overview report) displays high numbers for item views and bitstream views (item downloads) because many web crawlers access our repositories to download the full text for indexing. The majority of traffic comes from machines, not end users, which gives a misleading impression of DSpace stats, e.g. 15,000 downloads and 30,000 item views in our case for January 2007.

For example, our Apache log shows a web crawler accessing the full text of a PDF:

74.6.86.213 - - [09/Feb/2007:12:04:45 +] "GET /dspace/bitstream/1983/898/1/webb_IEEE_vtc_spring2006.pdf HTTP/1.0" 200 178656 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

The corresponding entry in our DSpace log:

2007-02-09 12:04:45,120 INFO org.dspace.app.webui.servlet.BitstreamServlet @ anonymous:session_id=FED19B6B8FDDB615271F818BB7B766C4:ip_addr=137.222.120.28:view_bitstream:bitstream_id=1706

It would be nice to filter out activity coming from web crawlers during log file analysis. Is it worth adding this as a feature request? Any ideas on how this can be achieved?

Thanks,
Naveed

Naveed Hashmi
Information Systems and Computing
University of Bristol

Message: 3
Date: Tue, 20 Mar 2007 15:13:08 -0400
From: "Mark H. Wood" <[EMAIL PROTECTED]>
Subject: [Dspace-tech] Keeping spiders out of the statistics
To: dspace-tech@lists.sourceforge.net
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset="us-ascii"

Has anyone found a fairly good automatic method of maintaining a list of spider addresses, for ignoring hits from web indexing activities when counting document fetches?

--
Mark H. Wood, Lead System Programmer [EMAIL PROTECTED]
Typically when a software vendor says that a product is "intuitive" he means the exact opposite.

Naveed Hashmi
Information Systems and Computing
University of Bristol
[EMAIL PROTECTED]
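One way to act on the Apache-log idea above is to parse each access-log line and look at the User-Agent field. The sketch below is only an illustration, not part of the DSpace stats package: it assumes the Apache "combined" log format, and the `CRAWLER_HINTS` substrings and the function names are hypothetical and far from a complete crawler list.

```python
import re

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]*)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical, incomplete list of substrings that suggest a crawler.
CRAWLER_HINTS = ("slurp", "googlebot", "bot", "crawler", "spider")

# The access-log line quoted in the message above, unchanged.
SAMPLE = (
    '74.6.86.213 - - [09/Feb/2007:12:04:45 +] '
    '"GET /dspace/bitstream/1983/898/1/webb_IEEE_vtc_spring2006.pdf HTTP/1.0" '
    '200 178656 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; '
    'http://help.yahoo.com/help/us/ysearch/slurp)"'
)


def parse_line(line):
    """Return the fields of one access-log line as a dict, or None."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


def is_crawler(agent):
    """Guess from the User-Agent string whether the client is a crawler."""
    agent = agent.lower()
    return any(hint in agent for hint in CRAWLER_HINTS)


def human_bitstream_hits(lines):
    """Count successful bitstream fetches that do not look like crawler hits."""
    count = 0
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # not in combined format; skip it
        if "/bitstream/" not in entry["request"]:
            continue  # not a bitstream download
        if entry["status"] != "200":
            continue  # only count successful fetches
        if is_crawler(entry["agent"]):
            continue  # looks like a spider; leave it out of the count
        count += 1
    return count
```

Matching on the User-Agent only catches crawlers that identify themselves; anything that pretends to be a browser still gets counted, which is one reason an IP-based list may also be needed.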
Re: [Dspace-tech] Keeping spiders out of the statistics
I changed the LogAnalyzer code so that it reads a list of IP addresses to exclude from its configuration file, dstat.cfg. So if you know the spiders' IP addresses, you could use that.

Let me know,
Monika

On 3/20/07, Mark H. Wood <[EMAIL PROTECTED]> wrote:
> Has anyone found a fairly good automatic method of maintaining a list
> of spider addresses, for ignoring hits from web indexing activities
> when counting document fetches?
>
> --
> Mark H. Wood, Lead System Programmer [EMAIL PROTECTED]
> Typically when a software vendor says that a product is "intuitive" he
> means the exact opposite.

--
Monika Mevenkamp
Georgia Institute of Technology
Library and Information Center
Phone: 404.385.0108
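The exclusion approach described above might look roughly like the sketch below. This is not Monika's LogAnalyzer change: the comma- or whitespace-separated list format, the function names, and the regular expression are assumptions, based only on the `ip_addr=...:action` layout of the DSpace log line quoted earlier in this thread. It handles IPv4 addresses only.

```python
import re

# Matches ":ip_addr=<address>:<action>" as seen in the DSpace log line
# quoted earlier in the thread. IPv4 only: IPv6 addresses contain colons.
IP_FIELD = re.compile(r":ip_addr=(?P<ip>[^:\s]+):(?P<action>[^:\s]+)")


def parse_excluded(text):
    """Parse a comma- or whitespace-separated list of IP addresses.

    The list format is an assumption; the real dstat.cfg syntax may differ.
    """
    return {token for token in re.split(r"[,\s]+", text) if token}


def count_actions(lines, excluded):
    """Count log actions per type, skipping lines from excluded addresses."""
    counts = {}
    for line in lines:
        match = IP_FIELD.search(line)
        if match is None:
            continue  # no ip_addr field on this line
        if match.group("ip") in excluded:
            continue  # address is on the exclusion list
        action = match.group("action")
        counts[action] = counts.get(action, 0) + 1
    return counts
```

For example, `count_actions(log_lines, parse_excluded("74.6.86.213"))` would tally `view_bitstream` and other actions while leaving out every line logged for that address.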
Re: [Dspace-tech] Keeping spiders out of the statistics
This may be of use: http://www.iplists.com/

Many people already maintain such lists (likewise for spammers). Maintaining a blacklist of your own from these shouldn't be too difficult.

Jim

On Tue, Mar 20, 2007 at 03:13:08PM -0400, Mark H. Wood wrote:
> Has anyone found a fairly good automatic method of maintaining a list
> of spider addresses, for ignoring hits from web indexing activities
> when counting document fetches?
>
> --
> Mark H. Wood, Lead System Programmer [EMAIL PROTECTED]
> Typically when a software vendor says that a product is "intuitive" he
> means the exact opposite.
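As a rough illustration of building your own blacklist from several published lists, the sketch below merges them into one de-duplicated set of networks. The input format is an assumption (one address or CIDR range per line, with `#` starting a comment); the files actually published by any given site may be laid out differently, and the function names are hypothetical.

```python
import ipaddress


def load_networks(lines):
    """Turn text lines into ip_network objects, skipping anything unparsable."""
    networks = []
    for raw in lines:
        entry = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not entry:
            continue
        try:
            networks.append(ipaddress.ip_network(entry, strict=False))
        except ValueError:
            continue  # not an address or CIDR range; ignore it
    return networks


def build_blacklist(*sources):
    """Merge several lists into one, collapsing duplicates and overlaps."""
    networks = [net for source in sources for net in load_networks(source)]
    # collapse_addresses() cannot mix IPv4 and IPv6, so handle them apart.
    v4 = [net for net in networks if net.version == 4]
    v6 = [net for net in networks if net.version == 6]
    return (list(ipaddress.collapse_addresses(v4))
            + list(ipaddress.collapse_addresses(v6)))


def is_listed(ip, blacklist):
    """Return True if the address falls inside any blacklisted network."""
    address = ipaddress.ip_address(ip)
    return any(address in net for net in blacklist)
```

Re-running `build_blacklist` on freshly downloaded copies of the lists, for instance from a nightly job, would keep the merged list current without hand editing.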
[Dspace-tech] Keeping spiders out of the statistics
Has anyone found a fairly good automatic method of maintaining a list of spider addresses, for ignoring hits from web indexing activities when counting document fetches?

--
Mark H. Wood, Lead System Programmer [EMAIL PROTECTED]
Typically when a software vendor says that a product is "intuitive" he means the exact opposite.