[Dspace-general] Statistics

Leonie Hayes Mon, 25 Aug 2008 15:52:59 -0700

Dear DSpace Community

Statistics


1. From a what-works perspective, there are already beautiful statistics
implementations that address the minimum requirements. The IDEALS
repository (http://www.ideals.uiuc.edu) has what I would be very happy
with; these guys seem to be one step ahead. I remember asking Tim
Donohue about their implementation a few years ago, and he said it was a
very customised solution (please correct me if I am wrong). I also find
the eprints and Fez Fedora stats pretty good.

2. Develop a package that delivers via both the JSP and the XML (Manakin)
interfaces.

3. Keep it fairly compartmentalised and simple, if possible, and separate
the requirements into 3 distinct areas:
a) Item Statistics - downloads, with additional extras like authors
and collections
b) Site Trends - traffic sources, countries etc.; piggyback on tools like
Google Analytics, or the other web analyser tools that Mark Wood mentions
c) More complex reporting that meets specific requirements.

Many thanks for the opportunity to be part of the discussion. We are
very isolated in New Zealand but struggling with all the same problems
everyone else is experiencing, and it helps to move forward. Time zones
don't allow any online interaction; the chat will be at 4am here.


Leonie Hayes
Research Repository Librarian
http://www.library.auckland.ac.nz/contacts/?firstname=&lastname=hayes
http://researchspace.auckland.ac.nz  
 


-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of
[EMAIL PROTECTED]
Sent: Tuesday, 26 August 2008 4:03 a.m.
To: [email protected]
Subject: Dspace-general Digest, Vol 61, Issue 19

Send Dspace-general mailing list submissions to
        [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
        http://mailman.mit.edu/mailman/listinfo/dspace-general
or, via email, send a message with subject or body 'help' to
        [EMAIL PROTECTED]

You can reach the person managing the list at
        [EMAIL PROTECTED]

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Dspace-general digest..."


Today's Topics:

   1. Week 2: Statistics (Dorothea Salo)
   2. Re: Week 2: Statistics (Dorothea Salo)
   3. Re: Week 2: Statistics (Mark H. Wood)


----------------------------------------------------------------------

Message: 1
Date: Mon, 25 Aug 2008 08:08:47 -0500
From: "Dorothea Salo" <[EMAIL PROTECTED]>
Subject: [Dspace-general] Week 2: Statistics
To: dspace <[email protected]>,    "DSpace Tech-List"
        <[EMAIL PROTECTED]>
Message-ID:
        <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=UTF-8

Greetings, DSpace community,

I want to thank everyone once again for last week's stimulating
discussion and impressive chat turnout! I have a new question for
everyone this week, pursuant to some discussion on the lists:

"Statistics" are one of the commonest requests for a new DSpace
feature. Without further specification, however, it's hard to know
what data to present, since there are no standards or even clear best
practices in this area. What statistics do the following groups of
DSpace users need to see, and in what form are the statistics best
presented to them?

Depositors
End-users (defined as "people examining items and downloading
bitstreams from a DSpace instance;" we may have to refine this further
in discussion)
DSpace repository managers (as distinct from systems administrators)

What else should developers keep in mind as they implement this feature?

Because it would be nice to reach a working consensus on this (unlike
last week's question, which was intended to pull out as broad a
selection of needs as possible), I think we should start discussing
immediately. I encourage all respondents to respond TO THE MAILING
LIST instead of to me.

I will be holding another chat to discuss the weekly question. It will
take place Wednesday 27 August in the DSpace IRC chatroom, #dspace on
irc.freenode.net. I apologize to West Coast (USA) community members
for last week's unconscionably early hour; we'll try 10 am US Central
(11 am Eastern, 4 pm GMT) this week, and we may go even later next
week if our European community members can stand it.

For those who don't normally use IRC, there are two easy web gateways.
One is mibbit.com; the other is specific to our channel and can be
found at <http://dspace.testathon.net/cgi-bin/irc.cgi>. I encourage
all of us to become familiar with the channel; it is a source of
real-time technical information from DSpace developers, as well as a
community in its own right.

Dorothea

-- 
Dorothea Salo [EMAIL PROTECTED]
Digital Repository Librarian AIM: mindsatuw
University of Wisconsin
Rm 218, Memorial Library
(608) 262-5493


------------------------------

Message: 2
Date: Mon, 25 Aug 2008 09:07:43 -0500
From: "Dorothea Salo" <[EMAIL PROTECTED]>
Subject: Re: [Dspace-general] Week 2: Statistics
To: dspace <[email protected]>
Message-ID:
        <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=UTF-8

My answers:

> What statistics do the following groups of
> DSpace users need to see, and in what form are the statistics best
> presented to them?
>
> Depositors

At a minimum, I would like depositors to see the number of times an
item's splash page has been visited, and the number of times each
content bitstream (as distinct from e.g. thumbnails) has been
downloaded. I would also like aggregate statistics available for each
author in the system, though I recognize that this creates
authority-control and role-evaluation issues. (For example, if Dr.
Helen Troia is the author of articles in the repository, the editor of
a journal whose backfiles are in the repository, as well as a thesis
advisor for some theses in the thesis collection, the journal and the
theses should NOT count toward her downloads.)
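
A minimal sketch of the role filter described above, assuming each item
record carries a download count and a mapping from role to names. The
record layout and the function name are hypothetical, not DSpace's data
model, and the sketch sidesteps authority control by assuming names have
already been normalised.

```python
def author_download_total(person, items):
    """Sum downloads only for items where `person` holds the author role.

    `items` is a list of dicts with a "downloads" count and a "roles"
    mapping (role name -> list of people); both keys are hypothetical.
    """
    return sum(
        item["downloads"]
        for item in items
        if person in item["roles"].get("author", ())
    )
```

With Dr. Troia as author of one article, editor of a journal and advisor
on a thesis, only the article's downloads would be added to her total.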

HTML items (and similar aggregates, once we can work with them; e.g.
Flash objects) cause trouble for bitstream analysis. To cut through
the jungle, I suggest that only the primary bitstream have its
accesses counted. If possible, it would be nice to count accesses for
all HTML bitstreams, but that can be lived without if need be.

I don't believe these statistics need to be real-time; a daily or even
weekly cron-job would suffice. I do believe we need to take into
account when an item was ingested, recognizing that older items will
pile up the downloads over time. In addition to total-aggregates,
then, I would recommend "in the last week," "in the last month," and
"in the last year/since ingest" information. Per-calendar-year
information should be kept and displayed indefinitely, even if the
underlying data are eventually purged, because authors will use this
in tenure-and-promotion packages. A sense of delta would be nice as
well -- depositors would LOVE to know if suddenly an item's downloads
spike.
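
The reporting windows suggested above could be computed in a periodic
batch job along these lines. This is only a sketch: it assumes the raw
download dates for one item are already available as a list, and the
function name, the 7/30/365-day window lengths and the "weekly_delta"
measure (last week minus the week before) are illustrative choices.

```python
from collections import Counter
from datetime import date, timedelta


def summarize_downloads(download_dates, ingest_date, today):
    """Roll one item's raw download dates up into reporting windows."""
    since_ingest = [d for d in download_dates if ingest_date <= d <= today]

    def within(days):
        # Count downloads in the `days` days ending today.
        cutoff = today - timedelta(days=days)
        return sum(1 for d in since_ingest if d > cutoff)

    last_week = within(7)
    previous_week = within(14) - last_week
    return {
        "last_week": last_week,
        "last_month": within(30),
        "last_year": within(365),
        "since_ingest": len(since_ingest),
        # Per-calendar-year totals can be stored and kept even if the
        # underlying raw data are later purged.
        "per_calendar_year": dict(Counter(d.year for d in since_ingest)),
        # A crude "delta": positive when downloads are picking up.
        "weekly_delta": last_week - previous_week,
    }
```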

Other desiderata, less important: broad-brush geographic information
(country of origin? Google Maps mashup?) for accesses, per-collection
and per-community access counts (because it NEVER hurts to get a sense
of competition going), search terms (in DSpace itself or from search
engines) that land people at a particular item.

> End-users (defined as "people examining items and downloading
> bitstreams from a DSpace instance;" we may have to refine this further
> in discussion)

I think end-users can usefully be shown the per-item and per-bitstream
information discussed above. They don't need to see per-author
information -- or at the very least, authors should be able to decide
whether to make this information public. (We do NOT want to embarrass
anyone; that's a serious turnoff for our potential depositors.)

> DSpace repository managers (as distinct from systems administrators)

I get survey after survey asking for activity information on the
repository. I can't answer them. To do so, I need download information
on the whole repository. (Current JSPUI statistics offer an
approximation to this, but I'm very leery of trusting it; I don't
understand how it's calculated, and the numbers seem incredibly off to
me.) I am sometimes asked about growth rate in accesses, so it would
be useful to break this down by year. Some algorithm for breaking it
down by amount of content in the repository ("downloads-per-item,"
where "item" would have to be some kind of average of
items-in-repository over the period examined) would be useful as well.
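
One possible reading of "downloads-per-item" is sketched below. It
assumes the repository size is sampled at regular intervals over the
period (for example, once a month) and uses the plain mean of those
samples as the "average item count". The function name and the sampling
approach are assumptions, not an agreed definition.

```python
def downloads_per_item(total_downloads, item_count_snapshots):
    """Divide a period's downloads by the mean repository size.

    `item_count_snapshots` holds the number of items in the repository
    at evenly spaced points during the period examined.
    """
    if not item_count_snapshots:
        raise ValueError("need at least one item-count snapshot")
    average_items = sum(item_count_snapshots) / len(item_count_snapshots)
    if average_items == 0:
        return 0.0
    return total_downloads / average_items
```

For example, 1200 downloads in a period during which the repository
held 100, 150 and then 200 items works out to 8 downloads per item.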

(And yes, I absolutely loathe those surveys too, but when they come
from ARL, I don't have the luxury of ignoring them.)

Some "wow" numbers would be useful for marketing purposes. A lot of
what I've already described would do the trick there.

I would also like to be able to track deposits per
collection/community over time; this helps me know where to focus
marketing and collection-development efforts, as well as helping me
report progress to the appropriate administrators. (I run a
system-wide repository, so I have to track deposits by campus; each
campus has its own community.)

> What else should developers keep in mind as they implement this
> feature?

Search-engine crawlers. Excluding them provides a much more realistic
sense of interest. We need to make clear this is happening, though, or
we will be at a perceived disadvantage relative to repositories that
don't strip out these accesses.
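
A rough sketch of crawler exclusion by User-Agent string follows. The
marker list is a hypothetical starting point rather than a complete or
authoritative one, and the hit records are assumed to be dicts with a
"user_agent" key. Crawler hits are returned instead of discarded, so
the number excluded can be reported alongside the human total.

```python
# Hypothetical substrings that suggest an automated client.
CRAWLER_MARKERS = ("bot", "crawler", "spider", "slurp")


def split_crawler_hits(hits):
    """Separate hits into (human, crawler) lists by User-Agent."""
    human, crawler = [], []
    for hit in hits:
        agent = hit.get("user_agent", "").lower()
        if any(marker in agent for marker in CRAWLER_MARKERS):
            crawler.append(hit)
        else:
            human.append(hit)
    return human, crawler
```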

Dorothea

-- 
Dorothea Salo [EMAIL PROTECTED]
Digital Repository Librarian AIM: mindsatuw
University of Wisconsin
Rm 218, Memorial Library
(608) 262-5493


------------------------------

Message: 3
Date: Mon, 25 Aug 2008 10:55:20 -0400
From: "Mark H. Wood" <[EMAIL PROTECTED]>
Subject: Re: [Dspace-general] Week 2: Statistics
To: [email protected]
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset="us-ascii"

One thing to keep in mind about whole-site statistical tables is that
there are already tools to do this for web sites in general, such as
AWStats or Webalizer or whatever your favorite may be.  We probably
should not spend effort to try to duplicate those.

Another consideration is that there are stat.s which would be useful
anytime, and stat.s that you dream up once and may never use again, or
may only find interesting at irregular intervals.  So I think we
should be careful not to try to do too much ourselves.  We can have
some generally-useful stuff built in, but we also need ways to expose
the raw cases in a useful form for ad-hoc analysis with
general-purpose statistical tools (SPSS/BMD/SAS/Stata/R/whatever).

Stuff to be inserted as one component of e.g. an item page probably
needs to be built in.  Stuff that would be a page on its own should
perhaps not be part of DSpace at all, but rather something we make
easy to do with other tools.

We need to keep clearly in mind the distinction between capturing raw
cases (someone fetched a bitstream) and abstracting useful patterns
from the collected cases (frequency histogram of this collection's
fetches over time, last month's fetches broken down by nation of
origin).

What might be helpful is to provide some views or stored procedures
that stat. tools could use to classify observations.  Such tools
usually have good facilities for poking around in databases, but could
perhaps use help in getting the information they need without having to
understand (and track changes to!) the fullness of DSpace's schema.
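
To illustrate the distinction between raw cases and abstracted patterns,
here is a small sketch using SQLite from Python. The table and view
names (fetch_event, collection_monthly_fetches) and their columns are
invented for the example and are not part of DSpace's schema. The point
is only that an external tool could query the view without knowing how
the raw cases are stored.

```python
import sqlite3

# Hypothetical schema: one row per raw case ("someone fetched a bitstream"),
# plus a view that abstracts a pattern (fetches per collection per month).
SCHEMA = """
CREATE TABLE fetch_event (
    bitstream_id INTEGER,
    collection   TEXT,
    country      TEXT,
    fetched_on   TEXT  -- ISO date, e.g. 2008-08-25
);
CREATE VIEW collection_monthly_fetches AS
    SELECT collection,
           substr(fetched_on, 1, 7) AS month,
           COUNT(*) AS fetches
    FROM fetch_event
    GROUP BY collection, substr(fetched_on, 1, 7);
"""


def monthly_fetches(raw_cases):
    """Load raw cases into an in-memory database and read the view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO fetch_event VALUES (?, ?, ?, ?)", raw_cases)
    rows = conn.execute(
        "SELECT collection, month, fetches "
        "FROM collection_monthly_fetches "
        "ORDER BY collection, month"
    ).fetchall()
    conn.close()
    return rows
```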

-- 
Mark H. Wood, Lead System Programmer   [EMAIL PROTECTED]
Typically when a software vendor says that a product is "intuitive" he
means the exact opposite.

------------------------------

_______________________________________________
Dspace-general mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/dspace-general


End of Dspace-general Digest, Vol 61, Issue 19
**********************************************
