On Mon, 2007-06-18 at 15:10 -0400, Jose Blanco wrote:
> 2. Does any one have any idea what the performance would be like with
> 12 million records in a lucene environment with or without an
> accompanying database? And would a dual storage system ( Lucene and
> database ) work well when you have to handle 12 million records
> (performance) ?

I've said it before, and I'll say it again - monolithic systems (and that
includes the way most RDBMSs are set up) don't handle large datasets
particularly well.

OK, that's a sweeping, headline-grabbing statement. The reality is a fair bit
more complicated. But even on a quite powerful Oracle setup, it's possible to
have performance issues querying only indexed columns on tables with as few
as 50,000 records. It all depends on the type of queries you need to perform -
how many components, how selective, ordering requirements, etc.

> 4. Has any one out there had to do something like this, and if so
> what have you found that works. One solution that comes to mind is
> Zebra. It is suppose to handle large repositories quite well. Are
> there any users of Zebra out there that might have an opinion on this?

I haven't heard of Zebra before, but the claimed performance makes for
interesting reading. For example, they claim to handle in the region of
50 million records at around 100GB data size - that's 2KB per record. How
large are your records?

Performance for 'very large databases' (they don't specify what counts as a
very large DB - let's just assume it's 50 million records for now) is
good/acceptable - provided your queries only result in hits of around 1000 to
5000 records. Even at the upper limit, that's 0.01% of the database. That's
pretty damn specific, and I personally wouldn't be surprised if an average
user query was at least 10x less specific - how does that impact performance?

If you really want to look at scaling to millions of records, you will almost
certainly want to look at a divide-and-conquer solution. The most obvious
place for you to start would probably be Jargon and GridLucene.

G

_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
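
For anyone wanting to redo the back-of-envelope sums in the reply above, here is a minimal Python sketch. It only reproduces the arithmetic on the figures quoted from the Zebra claims (50 million records, about 100GB, 1000 to 5000 hits per query). The decimal gigabyte (10**9 bytes), the function names, and the scaling of a "10x less specific" query down to a 12 million record repository are my own assumptions for illustration, not anything measured.

```python
# Back-of-envelope check of the figures quoted in the reply.
# Assumption: 1 GB = 10**9 bytes (decimal units).

def bytes_per_record(total_bytes: int, records: int) -> float:
    """Average record size implied by a total data size and record count."""
    return total_bytes / records


def hit_fraction(hits: int, records: int) -> float:
    """Fraction of the whole collection returned by a single query."""
    return hits / records


ZEBRA_RECORDS = 50_000_000      # "in the region of 50 million records"
ZEBRA_BYTES = 100 * 10**9       # "around 100GB data size"
UPPER_HITS = 5_000              # upper end of the "good/acceptable" hit range

avg_size = bytes_per_record(ZEBRA_BYTES, ZEBRA_RECORDS)
upper_fraction = hit_fraction(UPPER_HITS, ZEBRA_RECORDS)

# Hypothetical: a query that is 10x less specific than the upper limit.
broad_hits = 10 * UPPER_HITS

# Hypothetical: the same broad fraction applied to a 12 million record
# repository (integer arithmetic to avoid rounding noise).
REPO_RECORDS = 12_000_000
broad_hits_repo = REPO_RECORDS * broad_hits // ZEBRA_RECORDS

print(f"average record size: {avg_size:.0f} bytes")
print(f"upper-limit hits as share of database: {upper_fraction:.2%}")
print(f"10x less specific query on 50M records: {broad_hits} hits")
print(f"same share of a 12M record repository: {broad_hits_repo} hits")
```

The point of the last two figures is just that a query ten times broader than the documented comfortable range would return tens of thousands of hits, which is outside what the quoted claims cover.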