Re: [Dspace-tech] Lucene/Postgre and scalability

2007-06-19 Thread Graham Triggs
On Mon, 2007-06-18 at 15:10 -0400, Jose Blanco wrote:
> 2.  Does anyone have any idea what the performance would be like with
> 12 million records in a Lucene environment, with or without an
> accompanying database?  And would a dual storage system (Lucene and
> database) work well when you have to handle 12 million records
> (performance)?

I've said it before, and I'll say it again - monolithic systems (and
that includes the way most RDBMS are set up) don't handle large datasets
particularly well.

OK, that's a sweeping, headline-grabbing statement. The reality is a
fair bit more complicated. But on a quite powerful Oracle setup, it's
possible to have performance issues querying only indexed columns on
tables with as few as 50,000 records.

It all depends on the type of queries you need to perform - how many
components, how selective, ordering requirements, etc.

> 4.  Has anyone out there had to do something like this, and if so,
> what have you found that works?  One solution that comes to mind is
> Zebra. It is supposed to handle large repositories quite well.  Are
> there any users of Zebra out there who might have an opinion on this?

I haven't heard of this before, but the claimed performance figures are
interesting.

For example, they claim to handle in the region of 50 million records at
around 100GB of data - that's 2KB per record. How large are your
records?

Performance for 'very large databases' (they don't specify what counts
as a very large DB - let's just assume it's 50 million records for now)
is good/acceptable - providing your queries only result in hits of
around 1,000 to 5,000 records. Even at the upper limit, that's 0.01% of
the database. That's pretty damn specific, and I personally wouldn't be
surprised if an average user query was at least 10x less specific - how
does that impact performance?

If you really want to look at scaling to millions of records, you will
almost certainly want to look at a divide-and-conquer solution. The most
obvious place for you to start would probably be Jargon and GridLucene.
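To make the idea concrete, here is a rough sketch of the plain-Lucene
flavour of divide-and-conquer: several index 'shards' searched through a
single MultiReader. This is not GridLucene's API (I can't speak for that),
it assumes a more recent Lucene API than the one shipped with DSpace today,
and the paths are made up:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class ShardedSearch {
    public static void main(String[] args) throws Exception {
        // Each shard holds a slice of the 12 million records and can be
        // built or rebuilt independently of the others (paths are hypothetical).
        DirectoryReader shard1 = DirectoryReader.open(
                FSDirectory.open(Paths.get("/data/index-shard1")));
        DirectoryReader shard2 = DirectoryReader.open(
                FSDirectory.open(Paths.get("/data/index-shard2")));

        // MultiReader presents the shards to the searcher as one logical index.
        MultiReader all = new MultiReader(shard1, shard2);
        IndexSearcher searcher = new IndexSearcher(all);

        Query q = new QueryParser("title", new StandardAnalyzer())
                .parse("some user query");
        TopDocs hits = searcher.search(q, 10);
        System.out.println("total hits: " + hits.totalHits);

        all.close();
    }
}

The same pattern extends to putting the shards on separate machines, which
is roughly where the grid-style solutions come in.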

G 
 
 



Re: [Dspace-tech] Lucene/Postgre and scalability

2007-06-18 Thread Robert Tansley
Hi Jose,

On 18/06/07, Jose Blanco <[EMAIL PROTECTED]> wrote:
> 1.  DSpace uses Lucene to search the bitstreams and the metadata, and uses
> the database to retrieve metadata for display.  Is there any reason why the
> metadata could not be retrieved using Lucene?  Why use the db at all? I have
> a feeling I know the answer to this question, but it would be nice to hear
> it from one of the architects.

It's true that the search result displays could get metadata straight
from Lucene. There are a few potential reasons I can think of why that
isn't the case right now:

1/ The DB version (in a transactionally-safe system) is authoritative;
the Lucene index may get out of date
2/ To maintain code separation (the code to display item metadata
doesn't know about the search system)
3/ The version in Lucene may be stemmed and have stopwords removed.

In retrospect I think 1 and 3 probably aren't valid reasons, and 2 is
down to the fact that no one has taken the time to figure out a way to
get the search (and browse) results to use Lucene documents instead of
the DB in a way that maintains code separation.  Using the metadata
straight from Lucene for efficient display would be a good way forward
for DSpace with the understanding that it's essentially a cache of the
metadata and the authoritative version is in the database.
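As a rough illustration of that 'cache' idea (this is not the actual
DSIndexer code, it assumes a more recent Lucene API, and the field names
and paths are just placeholders), indexing could store verbatim display
copies alongside the analysed search fields:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class MetadataIndexer {
    public static void indexItem(IndexWriter writer, String handle,
                                 String title, String author) throws Exception {
        Document doc = new Document();
        // Searchable (analysed) copies - these get stemmed/stopworded.
        doc.add(new TextField("title", title, Field.Store.NO));
        doc.add(new TextField("author", author, Field.Store.NO));
        // Verbatim stored copies used only to display search results;
        // the authoritative metadata stays in the database.
        doc.add(new StoredField("handle", handle));
        doc.add(new StoredField("title.display", title));
        doc.add(new StoredField("author.display", author));
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/data/search-index")), cfg)) {
            indexItem(writer, "123456789/1", "An Example Item", "Doe, Jane");
        }
    }
}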

> 2.  Does anyone have any idea what the performance would be like with 12
> million records in a Lucene environment, with or without an accompanying
> database?  And would a dual storage system (Lucene and database) work well
> when you have to handle 12 million records (performance)?

For search/browse, you'd probably get better performance using just
Lucene, as you can pull the Lucene Document from the result set.  No
experience with Lucene at that scale though.
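For example - again only a sketch with made-up field names, assuming the
index was built as in the previous snippet - a results page could be
rendered entirely from the stored fields of the hits, with no database
round trip:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class SearchAndDisplay {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/data/search-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query q = new QueryParser("title", new StandardAnalyzer())
                    .parse("example");
            TopDocs hits = searcher.search(q, 20);
            for (ScoreDoc sd : hits.scoreDocs) {
                // The stored Document carries the display metadata, so no SQL here.
                Document doc = searcher.doc(sd.doc);
                System.out.println(doc.get("handle") + " : "
                        + doc.get("title.display") + " / "
                        + doc.get("author.display"));
            }
        }
    }
}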

> 3.  We will also have the need to update these records periodically, and so
> it seems like following a similar architecture as the one DSpace uses, it
> would take a very unreasonable amount of time to update 12 million records.
> Just last week I used the ItemImporter to load 3,600 records and I believe
> it took about 7 hours for the load to complete.  I'm assuming that the
> reason it took so long was because my repository already has about 35,000
> items and the inserts to the database were taking time, more than anything
> related to lucene.  When I loaded the same number of records in my
> development instance it took less than 2 hours and I have very few records
> there ( probably about 1000 ).  Any thoughts on this?

There are some known issues with the item importer, e.g. indexing is
done serially with the import (and the browse indexing scales very
poorly, and has been waiting 4+ years for someone to take the time to
replace it).  There is no centrally funded development team with a
remit to do work like this, so the essentially 'volunteer' efforts of
the community have tended to focus on adding new features and more
'interesting' aspects.

So the problem is not architectural per se (other than the serial
indexing), but lies in the implementation of the various components.
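As a hypothetical illustration of taking indexing off the import path
(this is not the ItemImporter or DSIndexer code; ItemIndexer below is an
assumed callback), the import could simply queue item handles for a
background pool to index:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BackgroundIndexQueue {
    public interface ItemIndexer { void index(String itemHandle) throws Exception; }

    private final ExecutorService pool;
    private final ItemIndexer indexer;

    public BackgroundIndexQueue(ItemIndexer indexer, int threads) {
        this.indexer = indexer;
        this.pool = Executors.newFixedThreadPool(threads);
    }

    // Called by the import code; returns immediately instead of blocking
    // the import on Lucene (and browse table) updates.
    public void submit(String itemHandle) {
        pool.submit(() -> {
            try {
                indexer.index(itemHandle);
            } catch (Exception e) {
                System.err.println("indexing failed for " + itemHandle + ": " + e);
            }
        });
    }

    // Drain the queue once the whole batch has been imported.
    public void shutdownAndWait() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}

The import loop would call submit() per item and shutdownAndWait() once at
the end, so the database inserts are no longer serialised behind each
index update.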



Re: [Dspace-tech] Lucene/Postgre and scalability

2007-06-18 Thread Richard MAHONEY
Dear Jose,

On Tue, 2007-06-19 at 07:10, Jose Blanco wrote:


> 3.  We will also have the need to update these records periodically,
>  and so it seems like following a similar architecture as the one
>  DSpace uses, it would take a very unreasonable amount of time to
>  update 12 million records.  Just last week I used the ItemImporter to
>  load 3,600 records and I believe it took about 7 hours for the load to
>  complete.  I'm assuming that the reason it took so long was because my
>  repository already has about 35,000 items and the inserts to the
>  database were taking time, more than anything related to lucene.  When
>  I loaded the same number of records in my development instance it took
>  less than 2 hours and I have very few records there ( probably about
>  1000 ).  Any thoughts on this?

This issue has been raised many times on the various DSpace lists, but
I have yet to see any substantive action on the part of the core
developers to address it. Setting up some dedicated test servers with a
decent amount of representative, scalable test data would be a start.
One would take this for granted in any well-organised test and release
cycle, but to my knowledge DSpace releases are not subjected to serious
performance profiling, scalability testing or stress testing. I have
been hoping for some time that this deficiency would be corrected, but
I am beginning to doubt that it will be addressed in the medium term.


> 4.  Has anyone out there had to do something like this, and if so, what
>  have you found that works?  One solution that comes to mind is Zebra.
>  It is supposed to handle large repositories quite well.  Are there any
>  users of Zebra out there who might have an opinion on this?

I definitely suggest that you mail your requirements to the IndexData
list.

http://www.indexdata.dk/zebra/

http://lists.indexdata.dk/cgi-bin/mailman/listinfo/zebralist

Over the years I have found IndexData's developers to be extremely
helpful and responsive. An anecdote: only the other day I found a bug
in `yaz-client'. Not only was it fixed within a couple of days, but,
after consultation, new functionality was added. Just splendid. If only
this attitude were more widespread.



Best regards,
 
 Richard MAHONEY




-- 
Richard MAHONEY | internet: http://indica-et-buddhica.org/
Littledene  | telephone/telefax (man.): +64 3 312 1699
Bay Road| cellular: +64 27 482 9986
OXFORD, NZ  | email: [EMAIL PROTECTED]
~~~
Indica et Buddhica: Materials for Indology and Buddhology
Repositorium: http://indica-et-buddhica.org/repositorium/
Philologica: http://indica-et-buddhica.org/philologica/
Subscriptions: http://subscriptions.indica-et-buddhica.org/

