Hello everybody,

every now and then somebody raises questions on this list regarding clock 
speed, disk space and the like, either while planning a new project or when 
going from development to production. I just skimmed through the archived 
messages of the last year and found several requests for advice. Generally, the 
answer is "well, it depends". All in all, threads on this topic remain quite 
superficial, and people rarely report back after some time. My conclusion is 
that it is really not much of an issue. What do you think?

Still, as I am just planning a new project, I'd like to raise this issue 
again, maybe in a slightly different form than before. Maybe we can find some 
aspects worth remembering and some questions to ask the next time we as a 
community start to gather information on running instances. There have already 
been several initiatives in this direction. I remember Valorie asking for 
information about new instances two years ago; there is the DSpace instances 
list in the DSpace Wiki, and some information is in the Fedora Commons Examples 
and Solution Communities sections, but none of these resources looks really 
comprehensive to me. Let's keep that in mind.

About my new project: I really have little information at hand for serious 
planning as of now, so I won't come up with item counts, gigabytes of storage 
requirements and so on here. What I do know already is that I will have a 
considerable number of automated processes running, such as mass ingest. There 
will be a lot of scanned and OCRed, say *huge*, PDFs to index, which will make 
me learn about the intricacies of PDFBox, I suspect.


My focus is on building a really reliable setup, rather than on tweaking 
performance, though performance, such as database query performance, is one 
aspect of reliability. I'd like to hear your thoughts on both architecture and 
specific components, rather than numbers. Here is a list of areas I am 
pondering.

Conventional setups seem to use two separate boxes, one for development, one 
for production, each running the whole software stack including database and 
JSP container. But what about a dedicated database machine and a dedicated 
Tomcat machine? Maybe with mass storage optimized for each purpose, that is, 
smaller but faster and more expensive drives such as SAS or even solid state 
drives for the database, and huge cheap SATA drives behind a hardware RAID 
controller (RAID 5, RAID 10?) for the Tomcat machine, whereas the database 
machine is fine with a software RAID? Or is more RAM always a better choice for 
the database host compared to fast drives?
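
To make the RAID 5 versus RAID 10 question a bit more concrete, here is a 
minimal sketch of the usable capacity each level leaves from a set of equal 
drives. The function name and the drive counts in the example are made up for 
illustration; only the capacity rules themselves (one drive's worth of parity 
for RAID 5, half the drives mirrored for RAID 10) are standard.

```python
def usable_capacity_tb(level, drives, drive_tb):
    """Return the usable capacity in TB for an array of equal-sized drives.

    Hypothetical helper for illustration only; it ignores hot spares,
    filesystem overhead and controller specifics.
    """
    if level == "raid5":
        # RAID 5 needs at least 3 drives and spends one drive's worth on parity.
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (drives - 1) * drive_tb
    if level == "raid10":
        # RAID 10 needs an even number of drives (at least 4); half are mirrors.
        if drives < 4 or drives % 2 != 0:
            raise ValueError("RAID 10 needs an even number of drives, at least 4")
        return (drives // 2) * drive_tb
    raise ValueError("unsupported RAID level: " + level)


if __name__ == "__main__":
    # Example figures are assumptions, not a recommendation.
    for level in ("raid5", "raid10"):
        print(level, usable_capacity_tb(level, 4, 2.0), "TB usable")
```

With the same four drives, RAID 5 leaves more usable space, while RAID 10 
gives up capacity for mirroring; which trade-off suits the asset store is 
exactly what I am asking about.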

RAM seems to be the most limiting factor. And Java performance seems to be more 
limiting than PostgreSQL performance, at least up to a certain table size. Now 
I am going into numbers again. 2 gigabytes seems to have been a standard for 
small servers for quite a while now. Isn't this outdated? Should one go for a 
standard of, let's say, 8 GB or even much more and tweak the Tomcat/PostgreSQL 
settings to make use of it? Is there a simple rule like giving PostgreSQL 
enough RAM to keep the whole database in RAM and assigning everything left to 
Tomcat?
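
Spelled out as arithmetic, the rule I have in mind would look roughly like the 
sketch below. This only illustrates the question and is not a tested sizing 
recommendation: the function name, the 1 GB reserved for the operating system 
and the example figures are all my own assumptions.

```python
def split_ram(total_gb, db_size_gb, os_reserve_gb=1.0):
    """Split RAM between the database and Tomcat using the simple rule above.

    The database gets enough memory to hold the whole database (capped at what
    is available); Tomcat gets whatever is left. The OS reserve is an assumed
    placeholder value.
    """
    available = total_gb - os_reserve_gb
    if available <= 0:
        raise ValueError("no memory left after the OS reserve")
    postgres_gb = min(db_size_gb, available)
    tomcat_gb = available - postgres_gb
    return {"postgres_gb": postgres_gb, "tomcat_gb": tomcat_gb}


if __name__ == "__main__":
    # Assumed example: an 8 GB machine hosting a 2 GB database.
    print(split_ram(8, 2))
```

The sketch also shows where the rule breaks down: once the database outgrows 
the available memory, nothing is left for Tomcat, so a real setup would need a 
lower bound for the Java heap as well.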

Are there still advantages of a physical machine over a virtualized 
environment, provided the virtual machine gets the same amount of dedicated 
memory? 

Are there preferences anywhere to move towards a different JSP container, say 
Jetty over the preferred Tomcat? In the area of HTTP servers I see a turn to 
lighttpd, favouring speed and simplicity over features. I'd expect something 
similar to happen here. If there is no need to host anything besides DSpace on 
the machine, might one skip Apache completely, handling even HTTPS through 
Tomcat itself?

Has anybody given thought to, or gathered practical experience with, load 
balancing setups, or is a single machine from a quality brand always 
sufficient? Would one start with clustering the database or with duplicating 
the Tomcat box? I guess duplicating the frontend is probably more demanding in 
terms of session handling, compared to configuring and maintaining a database 
cluster. On the other hand, the JSP container is the place where performance 
problems usually strike first, I guess. I do not consider such a setup for my 
project as of now, but I would like to hear whether DSpace hits a ceiling when 
it comes to HA environments.

OK, I guess this is enough for a start?

Bye, Christian


_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
