On Mon, 2007-06-18 at 15:10 -0400, Jose Blanco wrote:
> 2.  Does anyone have any idea what the performance would be like with
> 12 million records in a Lucene environment with or without an
> accompanying database?  And would a dual storage system (Lucene and
> database) perform well when you have to handle 12 million
> records?

I've said it before, and I'll say it again - monolithic systems (and
that includes the way most RDBMS are set up) don't handle large datasets
particularly well.

OK, that's a sweeping, headline-grabbing statement. The reality is a
fair bit more complicated. But even on a quite powerful Oracle setup,
it's possible to have performance issues querying only indexed columns
on tables with as few as 50,000 records.

It all depends on the type of queries you need to perform - how many
components, how selective, ordering requirements, etc.

> 4.  Has anyone out there had to do something like this, and if so
> what have you found that works?  One solution that comes to mind is
> Zebra. It is supposed to handle large repositories quite well.  Are
> there any users of Zebra out there that might have an opinion on this?

I haven't heard of Zebra before, but the claimed performance figures
are interesting.

For example, they claim to handle in the region of 50 million records
at around 100GB data size - that's 2KB per record. How large are your
records?

Performance for 'very large databases' (they don't specify what counts
as very large - let's just assume 50 million records for now) is said
to be good/acceptable, provided your queries only result in hits of
around 1000 to 5000 records. Even at the upper limit, that's 0.01% of
the database. That's pretty damn specific, and I personally wouldn't be
surprised if an average user query was at least 10x less specific - how
does that impact performance?
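
To make the arithmetic explicit, here is a quick back-of-envelope sketch
in Python. The record count, data size and hit counts are the figures
quoted above; treating 1GB as 10^9 bytes and the helper names are my own
assumptions, and the "10x less specific" case is only my guess at a
typical query, not a measured number.

```python
# Back-of-envelope check of the figures discussed above.
RECORDS = 50_000_000        # claimed record count
DATA_BYTES = 100 * 10**9    # ~100GB, assuming 1GB = 10^9 bytes


def bytes_per_record(total_bytes, records):
    """Average stored size of one record, in bytes."""
    return total_bytes / records


def selectivity_percent(hits, records):
    """Share of the whole collection matched by a query, as a percentage."""
    return 100.0 * hits / records


if __name__ == "__main__":
    print("bytes per record:", bytes_per_record(DATA_BYTES, RECORDS))
    for hits in (1000, 5000):
        print(hits, "hits =", selectivity_percent(hits, RECORDS), "% of records")
    # Hypothetical: a query 10x less specific than the upper limit.
    print("10x less specific:", 10 * 5000, "hits")
```

So the open question is how the engine behaves once result sets reach
tens of thousands of hits rather than a few thousand.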

If you really want to look at scaling to millions of records, you will
almost certainly want to look at a divide-and-conquer solution. The most
obvious place for you to start would probably be Jargon and GridLucene.
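
To show what I mean by divide-and-conquer, here is a minimal sketch in
Python: records are split across several small indexes, each one is
searched separately, and the partial results are merged. This is NOT the
Jargon or GridLucene API. The Shard and ShardedIndex classes, the modulo
routing on integer ids and the term-count scoring are all made up for
illustration, with a toy in-memory index standing in for a real one.

```python
import heapq
from collections import defaultdict


class Shard:
    """One partition: a toy in-memory inverted index (hypothetical)."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term count}

    def add(self, doc_id, text):
        for term in text.lower().split():
            counts = self.postings[term]
            counts[doc_id] = counts.get(doc_id, 0) + 1

    def search(self, term, k):
        hits = self.postings.get(term.lower(), {})
        # k best local hits as (score, doc_id); score is the term count.
        return heapq.nlargest(k, ((n, doc_id) for doc_id, n in hits.items()))


class ShardedIndex:
    """Routes records to shards, then scatters queries and gathers results."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, doc_id, text):
        # Assumes integer ids; modulo routing keeps the shards balanced.
        self.shards[doc_id % len(self.shards)].add(doc_id, text)

    def search(self, term, k=10):
        partial = []
        for shard in self.shards:  # in practice these would run in parallel
            partial.extend(shard.search(term, k))
        # Merge the per-shard top-k lists into a global top-k.
        return [doc_id for _, doc_id in heapq.nlargest(k, partial)]


if __name__ == "__main__":
    index = ShardedIndex(num_shards=4)
    index.add(1, "lucene index lucene")
    index.add(2, "database index")
    index.add(6, "lucene")
    print(index.search("lucene", k=2))
```

Each shard only ever has to deal with its own slice of the data, and
because every shard returns at most k hits, the merge step stays cheap
however large the full collection gets.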

G 
 
 

_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
