I am working on a proprietary digital library whilst at the same time
considering how DSpace might have been used to solve the same problems (it
won't be, but that's another story). When I consider usage event reporting,
some concerns arise when the number of articles and visitors is very large.
The current site has around 6 million articles and roughly 20 million hits
per day. With these sorts of volumes, weblogs are IMHO not the way to go,
and event logs have the same problems of scale. Basically, a file-based
approach is only suitable for small volumes of data.

I considered using an RDBMS and this does get you further but unfortunately,
not far enough. An RDBMS can cope with millions of rows but starts to
struggle when you reach tens of millions or hundreds of millions. Let's do
some maths. In these calculations, there is a requirement to produce year to
date (YTD) figures (this is a requirement of COUNTER). I will assume that the
RDBMS will calculate the YTD, rather than store a running total when
the current month is processed. This means that figures for 12 months need
to be retained. Of those 20 million hits, some will be for the same
article(s). So after article-level aggregation has been performed there will
be a maximum of 6 million rows for one day. Taking a 30-day month, this is
180 million for one month and 2,160 million for 12 months. Now 2 billion
rows seems a bit on the large side to me :-)
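
For anyone who wants to check the sums above, here is a minimal sketch in
Python. The 30-day month and the function name are my own assumptions for
illustration; the 6 million articles and 12 months come from the figures
above. It gives the worst case, where every article is hit every day.

```python
# Worst-case row counts after article-level aggregation.
# Assumption: one row per article per day, and a 30-day month.
ARTICLES = 6_000_000
DAYS_PER_MONTH = 30      # assumed for the estimate
MONTHS_RETAINED = 12     # needed to calculate YTD figures


def worst_case_rows(articles, days_per_month, months):
    """Return (rows per day, rows per month, rows per retention period)."""
    per_day = articles
    per_month = per_day * days_per_month
    per_period = per_month * months
    return per_day, per_month, per_period


per_day, per_month, per_year = worst_case_rows(
    ARTICLES, DAYS_PER_MONTH, MONTHS_RETAINED
)
print(f"per day:   {per_day:,}")
print(f"per month: {per_month:,}")
print(f"per year:  {per_year:,}")
```

In practice not every article is hit every day, so the real figure would
be lower, but this is the ceiling the design has to survive.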

One way around this would be to have a table for each month. Thus a table
might have to cope with 180 million rows, which is manageable, even though it
is large. But in calculating the YTD figures one would need to do a 12-table
join. That's a bit unwieldy.
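
To show what the table-per-month idea looks like, here is a rough sketch
using Python's built-in sqlite3 module. The table names (usage_m01 etc.),
the columns and the sample rows are all hypothetical, and I combine the
monthly tables with UNION ALL rather than a literal join, since that seems
the more natural way to total them. Only three months are shown to keep it
short; it is an illustration of the shape of the query, not a claim about
how it performs at 180 million rows per table.

```python
import sqlite3

# Hypothetical month suffixes; a real system would have twelve.
MONTHS = ["m01", "m02", "m03"]


def create_month_table(conn, month):
    # One table per month: one row per article per day (assumed layout).
    conn.execute(
        f"CREATE TABLE usage_{month} "
        "(article_id INTEGER, day INTEGER, hits INTEGER)"
    )


def ytd_query(months):
    # Stack the monthly tables with UNION ALL, then total per article.
    union = " UNION ALL ".join(
        f"SELECT article_id, hits FROM usage_{m}" for m in months
    )
    return (
        "SELECT article_id, SUM(hits) AS ytd_hits "
        f"FROM ({union}) GROUP BY article_id ORDER BY article_id"
    )


conn = sqlite3.connect(":memory:")
for m in MONTHS:
    create_month_table(conn, m)

# Made-up sample data, purely for illustration.
conn.executemany("INSERT INTO usage_m01 VALUES (?, ?, ?)", [(1, 1, 10), (2, 1, 5)])
conn.executemany("INSERT INTO usage_m02 VALUES (?, ?, ?)", [(1, 3, 7)])
conn.executemany("INSERT INTO usage_m03 VALUES (?, ?, ?)", [(2, 2, 1), (1, 9, 3)])

ytd = conn.execute(ytd_query(MONTHS)).fetchall()
print(ytd)
```

The query has to name every monthly table, so it grows with each month
added, which is the unwieldiness I mean.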

I am beginning to wonder if a different technology would provide better
scalability. I have in mind the open source column store, LucidDB. What do
people think?
-- 
Regards,

Andrew M.
http://www.andrewpetermarlow.co.uk
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
