Dear All,

A detailed description of the functionality and architecture of the statistics Add-on we have developed can be found in the docs folder of the downloadable file: http://wiki.dspace.org/static_files/6/68/Stats-addon-2.0.tar.gz
On our production implementation of the Add-on on RepositóriUM, we have developed some more tools/functionality for automated and semi-automated detection and exclusion of crawlers (based not only on "well behaved" robots, but also on the patterns and behavior of IP addresses, etc.), which are not available in version 2.0 of the Add-on. As we are currently upgrading RepositóriUM to DSpace 1.5, we hope to release a Stats Add-on 2.1, compatible with DSpace 1.5 and including the new functionality/tools, in late September or October.

Best Regards,

Eloy Rodrigues
Universidade do Minho - Serviços de Documentação
Campus de Gualtar - 4710-057 Braga
Telefone: + 351 253604150; Fax: + 351 253604159
Campus de Azurém - 4800-058 Guimarães
Telefone: + 351 253510168; Fax: + 351 253510117

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
Sent: Wednesday, 27 August 2008 09:31
To: [email protected]
Subject: Dspace-general Digest, Vol 61, Issue 22

Today's Topics:

   1. Re: Week 2: Statistics (Tim Donohue)
   2. Re: Statistics (Mark H. Wood)
   3. Re: Statistics (Dorothea Salo)
   4. Re: Statistics (Tim Donohue)
   5. Re: Week 2: Statistics (Mark H. Wood)
   6. Re: Week 2: Statistics (Dorothea Salo)
   7. Re: Week 2: Statistics (Christophe Dupriez)

----------------------------------------------------------------------

Message: 1
Date: Tue, 26 Aug 2008 11:09:15 -0500
From: Tim Donohue <[EMAIL PROTECTED]>
Subject: Re: [Dspace-general] Week 2: Statistics
To: Dorothea Salo <[EMAIL PROTECTED]>
Cc: [email protected]

Dorothea & all,

Dorothea Salo wrote:
> 2008/8/25 Mark H. Wood <[EMAIL PROTECTED]>:
>> One thing to keep in mind about whole-site statistical tables is that
>> there are already tools to do this for web sites in general, such as
>> AWStats or Webalizer or whatever your favorite may be. We probably
>> should not spend effort to try to duplicate those.
>
> Perhaps not, but if this is the direction we want people to go in, we
> probably ought to document how to do it, at least informally on the
> wiki. Does anybody have such a system in place?

For IDEALS (www.ideals.uiuc.edu), we use AWStats to get site-wide traffic information. However, that information is *not* publicly accessible. We only use it for administrative purposes, since most of the information AWStats generates for us is generally *not* useful to our users.

So, for example, AWStats can provide us with the following general information:

* Which features of DSpace are being used most frequently (e.g. Subject Browse, Community/Collection browse, search, etc.)
* Which web browsers our users are using
* # of overall hits in a given month, week, day, or hour
* Approximate amount of time users spend on our site
* What external resources people use to get to our site (e.g. Google, blog posts, library website, etc.)
* The top searches used to get to our site (in Google, Yahoo, MSN, etc.)

But AWStats only works at a global level.
So, it *cannot* give us any real information at a community, collection or item level, since it doesn't understand DSpace's internal structure and cannot parse DSpace's log files (it parses the *web server* log files, rather than DSpace's internal logs).

So, in the end, AWStats is a worthwhile tool to keep in mind. However, without some major DSpace-specific customizations, it's really more of an administrative tool to help you determine *how* users are using your site. It doesn't give any really worthwhile "statistics" in terms of file downloads or individual community/collection access counts, which are more likely to be useful to your users.

- Tim

--
Tim Donohue
Research Programmer, Illinois Digital Environment for Access to Learning and Scholarship (IDEALS)
University of Illinois at Urbana-Champaign
[EMAIL PROTECTED] | (217) 333-4648

------------------------------

Message: 2
Date: Tue, 26 Aug 2008 15:47:20 -0400
From: "Mark H. Wood" <[EMAIL PROTECTED]>
Subject: Re: [Dspace-general] Statistics
To: [email protected]

On Tue, Aug 26, 2008 at 10:07:43AM -0500, Tim Donohue wrote:
> So, although I think it was already mentioned, I'd add as a requirement
> for a good Statistics Package:
>
> * Must filter out web-crawlers in a semi-automated fashion!

+1! Suggestions as to how?

The Rochester mods could be augmented to filter out the easiest cases more simply. Some well-behaved crawlers can be spotted automatically. (No, I don't recall how.) The filter rules could be made more flexible than just a single type of fixed-size netblocks (if memory serves). I've been meaning to work on these at some point, but haven't yet reached That Point.

Crawler filtering sounds like something that might be abstracted from the various existing stats patches and provided as a common service. We all should invent this wheel only once.

--
Mark H. Wood, Lead System Programmer [EMAIL PROTECTED]
Typically when a software vendor says that a product is "intuitive" he means the exact opposite.

------------------------------

Message: 3
Date: Tue, 26 Aug 2008 15:09:16 -0500
From: "Dorothea Salo" <[EMAIL PROTECTED]>
Subject: Re: [Dspace-general] Statistics
To: [email protected]

2008/8/26 Mark H. Wood <[EMAIL PROTECTED]>:
> On Tue, Aug 26, 2008 at 10:07:43AM -0500, Tim Donohue wrote:
>> So, although I think it was already mentioned, I'd add as a requirement
>> for a good Statistics Package:
>>
>> * Must filter out web-crawlers in a semi-automated fashion!
>
> +1! Suggestions as to how?

The site <http://www.user-agents.org/> maintains a list of user-agents, classified by type. They have an XML-downloadable version at <http://www.user-agents.org/allagents.xml>, as well as an RSS-feed updater. Perhaps polling this would be a useful starting point?
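As a very rough, untested sketch of what that could look like -- plain Java, standard library only; the tag names (user-agent, String, Type) are my guess at the list's schema rather than something I have checked -- a small helper could periodically refresh the robot list and combine it with a few user-agent heuristics:

import java.io.InputStream;
import java.net.URL;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/**
 * Sketch of semi-automated crawler detection:
 * 1. poll the user-agents.org XML list and keep the entries flagged as robots;
 * 2. fall back to a few substring heuristics for agents not on the list.
 */
public class CrawlerFilter {

    private final Set<String> knownRobots = new HashSet<String>();

    /** Download and parse the agent list, keeping agents whose Type marks them as robots. */
    public void refreshRobotList(String listUrl) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputStream in = new URL(listUrl).openStream();
        try {
            Document doc = builder.parse(in);
            // Assumed schema: <user-agent><String>...</String><Type>R</Type></user-agent>
            NodeList agents = doc.getElementsByTagName("user-agent");
            for (int i = 0; i < agents.getLength(); i++) {
                Element agent = (Element) agents.item(i);
                String name = textOf(agent, "String");
                String type = textOf(agent, "Type");
                if (name != null && type != null && type.contains("R")) {
                    knownRobots.add(name.toLowerCase(Locale.ENGLISH));
                }
            }
        } finally {
            in.close();
        }
    }

    /** True if the request's User-Agent header looks like a crawler. */
    public boolean isCrawler(String userAgent) {
        if (userAgent == null) {
            return true;               // no User-Agent header at all is suspicious
        }
        String ua = userAgent.toLowerCase(Locale.ENGLISH);
        if (knownRobots.contains(ua)) {
            return true;               // exact match against the downloaded list
        }
        // Crude fallback for "well behaved" crawlers that identify themselves
        return ua.contains("bot") || ua.contains("crawl")
                || ua.contains("spider") || ua.contains("slurp");
    }

    private static String textOf(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
    }
}

Whatever logs the hit (a servlet filter, or the statistics code itself) could then ask isCrawler() before counting it. Exact matches against the downloaded list plus a handful of substring checks should catch most well-behaved crawlers; badly behaved ones would still need the kind of IP-pattern detection mentioned earlier in this thread.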
Dorothea

--
Dorothea Salo                 [EMAIL PROTECTED]
Digital Repository Librarian  AIM: mindsatuw
University of Wisconsin
Rm 218, Memorial Library
(608) 262-5493

------------------------------

Message: 4
Date: Tue, 26 Aug 2008 15:29:23 -0500
From: Tim Donohue <[EMAIL PROTECTED]>
Subject: Re: [Dspace-general] Statistics
To: Dorothea Salo <[EMAIL PROTECTED]>
Cc: [email protected]

Dorothea Salo wrote:
> 2008/8/26 Mark H. Wood <[EMAIL PROTECTED]>:
>> On Tue, Aug 26, 2008 at 10:07:43AM -0500, Tim Donohue wrote:
>>> So, although I think it was already mentioned, I'd add as a requirement
>>> for a good Statistics Package:
>>>
>>> * Must filter out web-crawlers in a semi-automated fashion!
>>
>> +1! Suggestions as to how?
>
> The site <http://www.user-agents.org/> maintains a list of
> user-agents, classified by type. They have an XML-downloadable version
> at <http://www.user-agents.org/allagents.xml>, as well as an RSS-feed
> updater. Perhaps polling this would be a useful starting point?
>
> Dorothea

Universidade do Minho's Statistics Add-On for DSpace can do some basic automated filtering of web crawlers. See its list of main features on the DSpace Wiki: http://wiki.dspace.org/index.php//StatisticsAddOn

(It looks like they detect spiders by the way spiders tend to identify themselves. Most "nice" spiders, like Google's, identify themselves in a consistent fashion, e.g. "Googlebot".)

Frankly, although our statistics for IDEALS are nice looking... Minho's work is much more extensive and offers a greater variety of features (from what I've seen/heard of it). It's just missing our "Top 10 Downloads" list :)

- Tim

--
Tim Donohue
Research Programmer, Illinois Digital Environment for Access to Learning and Scholarship (IDEALS)
University of Illinois at Urbana-Champaign
[EMAIL PROTECTED] | (217) 333-4648

------------------------------

Message: 5
Date: Tue, 26 Aug 2008 16:34:33 -0400
From: "Mark H. Wood" <[EMAIL PROTECTED]>
Subject: Re: [Dspace-general] Week 2: Statistics
To: [email protected]

On Tue, Aug 26, 2008 at 09:44:45AM -0500, Dorothea Salo wrote:
> 2008/8/25 Mark H. Wood <[EMAIL PROTECTED]>:
> > What might be helpful is to provide some views or stored procedures
> > that stat. tools could use to classify observations. Such tools
> > usually have good facilities for poking around in databases, but could
> > perhaps use help in getting the information they need without having to
> > understand (and track changes to!) the fullness of DSpace's schema.
>
> Interesting. Where would this leave the average repository manager who
> isn't using Stata, but just wants some numbers to show people?

Well, it depends on which numbers are wanted. I do think there will be some reports that are popular enough, and easy enough to get right, that they should be built in. The support for external tools would be aimed at people who do want to use them. What sorts of data would be useful to the manager who isn't into heavy statistical analysis, but aren't likely to be provided as built-ins?
Where I'm going is:

o The realm of reasonable possibilities for statistical analysis and presentation of DSpace activity is rather huge;
o people who understand statistical processing have already figured out the hard parts of analysis and presentation;
o the tail should not be allowed to wag the dog -- we want statistics, but that's subordinate to building excellent document repository software. Part of it, important, but in a supporting role.

So I am hoping that we can mostly satisfy most people with relatively modest built-in statistical support, and take care of the other cases with modest support for the development of external reporting mechanisms. This being a community, I imagine that some will develop external solutions that they can share.

This is one reason why I think that it should be as easy as possible for multiple stats projects to tap into built-in streams of observations. Different sites have different needs, and I think we need to be able to easily play with various ways of doing stats. I'm not convinced that we are going to understand the need sufficiently without getting a selection of solutions into the field that can be easily snapped in and tried by a sizable number of sites. There are a number of good attempts now, but it's not easy to install them, and that limits the amount of experience we can gather.

--
Mark H. Wood, Lead System Programmer [EMAIL PROTECTED]
Typically when a software vendor says that a product is "intuitive" he means the exact opposite.

------------------------------

Message: 6
Date: Tue, 26 Aug 2008 18:13:14 -0500
From: "Dorothea Salo" <[EMAIL PROTECTED]>
Subject: Re: [Dspace-general] Week 2: Statistics
To: [email protected]

2008/8/26 Mark H. Wood <[EMAIL PROTECTED]>:
> Well, it depends on which numbers are wanted. I do think there will
> be some reports that are popular enough, and easy enough to get right,
> that they should be built in. The support for external tools would be
> aimed at people who do want to use them. What sorts of data would be
> useful to the manager who isn't into heavy statistical analysis, but
> aren't likely to be provided as built-ins?

Well, I hope that's where the discussion this week has been pointing. If not, we'll have to find a different way to gather that information. Looking at existing implementations of statistics (e.g. EPrints, SSRN) might be a start.

> o the tail should not be allowed to wag the dog -- we want
> statistics, but that's subordinate to building excellent document
> repository software. Part of it, important, but in a supporting role.

This is such an interesting statement that I think I will make it next week's topic! What *is* excellent document repository software? I have a feeling that the non-developer community may have a rather different take on it from most developers... we'll see if I'm right.

> So I am hoping that we can mostly satisfy most people with relatively
> modest built-in statistical support, and take care of the other cases
> with modest support for the development of external reporting
> mechanisms.

I'd be interested to know how the proposals that have been put forward this week place on a modesty scale. Developers?
> This is one reason why I think that it should be as easy as possible
> for multiple stats projects to tap into built-in streams of
> observations. Different sites have different needs, and I think we
> need to be able to easily play with various ways of doing stats.

Agreed, but just to toss this out: I foresee a countervailing pressure in future toward standardized and aggregated statistics across repositories. I have heard a number of statements to the effect that faculty are using download counts from disciplinary repositories in tenure-and-promotion packages. As their work becomes scattered and/or duplicated across various repositories, they're going to want to aggregate that information.

> There are a
> number of good attempts now, but it's not easy to install them, and
> that limits the amount of experience we can gather.

+1. This is a problem for more than just statistics!

Dorothea

--
Dorothea Salo                 [EMAIL PROTECTED]
Digital Repository Librarian  AIM: mindsatuw
University of Wisconsin
Rm 218, Memorial Library
(608) 262-5493

------------------------------

Message: 7
Date: Wed, 27 Aug 2008 10:37:12 +0200
From: Christophe Dupriez <[EMAIL PROTECTED]>
Subject: Re: [Dspace-general] Week 2: Statistics
To: Dorothea Salo <[EMAIL PROTECTED]>
Cc: dspace <[email protected]>

Hi Dorothea and participants in this discussion!

I would like to say that statistics serve different purposes:

1) detecting errors (why did nobody look at my site last Sunday?);
2) providing KPIs (Key Performance Indicators): measures that a manager follows over the medium term to make organisational decisions;
3) investigating new hypotheses before investing in changing the organisation.

For purpose (3), by its very nature, you need to "open" to analysis the detailed logs of events and the data stored in DSpace. Generic programs like SAS, or report generators, are the best tools for digging into the data and answering new, unforeseen questions. Everybody in the community will be happy to have this "back door" available.

For purpose (2), we need to know which KPIs IR managers need. I will go further: new IRs and their managers would be very happy not to have to reinvent KPIs, and to have good ones already proposed to sustain a documented IR development process. A very big part of DSpace's attractiveness is (and this should really be implemented!) that it provides "best practices" for IR management (and not only for computing). For purpose (2), use cases, practices and measures must be designed up front. This will contribute strongly to the overall specifications of DSpace.

For purpose (1), a more formal, bottom-up, data-driven approach may be sufficient: install validation tools (like the checksum checker) to ensure that DSpace operations are "in line".

So we have no choice: we have to listen to IR managers (please come by!) to learn the good practices DSpace must support...

Have a nice day!

Christophe (peeking at the list when I shouldn't, during my holidays!)

Dorothea Salo wrote:
> Greetings, DSpace community,
>
> I want to thank everyone once again for last week's stimulating
> discussion and impressive chat turnout! I have a new question for
> everyone this week, pursuant to some discussion on the lists:
>
> "Statistics" are one of the commonest requests for a new DSpace
> feature. Without further specification, however, it's hard to know
> what data to present, since there are no standards or even clear best
> practices in this area.
> What statistics do the following groups of
> DSpace users need to see, and in what form are the statistics best
> presented to them?
>
> Depositors
> End-users (defined as "people examining items and downloading
> bitstreams from a DSpace instance"; we may have to refine this further
> in discussion)
> DSpace repository managers (as distinct from systems administrators)
>
> What else should developers keep in mind as they implement this feature?
>
> Because it would be nice to reach a working consensus on this (unlike
> last week's question, which was intended to pull out as broad a
> selection of needs as possible), I think we should start discussing
> immediately. I encourage all respondents to respond TO THE MAILING
> LIST instead of to me.
>
> I will be holding another chat to discuss the weekly question. It will
> take place Wednesday 27 August in the DSpace IRC chatroom, #dspace on
> irc.freenode.net. I apologize to West Coast (USA) community members
> for last week's unconscionably early hour; we'll try 10 am US Central
> (11 am Eastern, 4 pm GMT) this week, and we may go even later next
> week if our European community members can stand it.
>
> For those who don't normally use IRC, there are two easy web gateways.
> One is mibbit.com; the other is specific to our channel and can be
> found at <http://dspace.testathon.net/cgi-bin/irc.cgi>. I encourage
> all of us to become familiar with the channel; it is a source of
> real-time technical information from DSpace developers, as well as a
> community in its own right.
>
> Dorothea

------------------------------

_______________________________________________
Dspace-general mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/dspace-general

End of Dspace-general Digest, Vol 61, Issue 22
**********************************************
