Hi Jose,

On 18/06/07, Jose Blanco <[EMAIL PROTECTED]> wrote:
> 1.  DSpace uses Lucene to search the bitstreams and the metadata, and uses
> the database to retrieve metadata for display.  Is there any reason why the
> metadata could not be retrieved using Lucene?  Why use the db at all? I have
> a feeling I know the answer to this question, but it would be nice to hear
> it from one of the architects.

It's true that the search result displays could get metadata straight
from Lucene. There are a few potential reasons I can think of why that
isn't the case right now:

1/ The DB version (in a transactionally-safe system) is authoritative;
the Lucene index may get out of date
2/ To maintain code separation (the code to display item metadata
doesn't know about the search system)
3/ The version in Lucene may be stemmed and have stopwords removed.

In retrospect I think 1 and 3 probably aren't valid reasons, and 2
comes down to the fact that no one has taken the time to figure out
how to get the search (and browse) results to use Lucene documents
instead of the DB while still maintaining code separation.  Using the
metadata straight from Lucene for efficient display would be a good
way forward for DSpace, with the understanding that it's essentially a
cache of the metadata and the authoritative version is in the
database.

> 2.  Does anyone have any idea what the performance would be like with 12
> million records in a Lucene environment, with or without an accompanying
> database?  And would a dual storage system (Lucene and database) work well
> when you have to handle 12 million records (performance)?

For search/browse, you'd probably get better performance using just
Lucene, as you can pull the Lucene Document straight from the result
set.  I have no experience with Lucene at that scale, though.

> 3.  We will also have the need to update these records periodically, and so
> it seems like following a similar architecture as the one DSpace uses, it
> would take a very unreasonable amount of time to update 12 million records.
> Just last week I used the ItemImporter to load 3,600 records and I believe
> it took about 7 hours for the load to complete.  I'm assuming that the
> reason it took so long was because my repository already has about 35,000
> items and the inserts to the database were taking time, more than anything
> related to lucene.  When I loaded the same number of records in my
> development instance it took less than 2 hours and I have very few records
> there ( probably about 1000 ).  Any thoughts on this?

There are some known issues with the item importer, e.g. indexing is
done serially with the import (and the browse indexing scales very
poorly; it has been waiting 4+ years for someone to take the time to
replace it).  There is no centrally funded development team with a
remit to do work like this, so the essentially 'volunteer' efforts of
the community have tended to focus on adding new features and more
'interesting' aspects.

So the problem is not architectural per se (other than the lack of
parallelised indexing) but lies in the implementation of the various
components.
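To illustrate what decoupling the indexing from the import might look like, here is a minimal sketch (all names are hypothetical, not the actual ItemImporter code): instead of indexing each item inline with the import, index updates are queued onto a small thread pool so the import loop never blocks on them.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: index work is submitted to a pool rather than run
// serially after each imported item, so imports and indexing overlap.
public class ParallelIndexer {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final AtomicInteger indexed = new AtomicInteger();

    public void importItems(List<String> items) {
        for (String item : items) {
            // the actual import of `item` would happen here, then the
            // index update is queued instead of being done inline
            pool.submit(() -> indexed.incrementAndGet()); // stand-in for indexItem(item)
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public int indexedCount() {
        return indexed.get();
    }
}
```

In a real importer the trade-off is that search results lag slightly behind the import, which is acceptable precisely because the index is a cache and the database is authoritative.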

_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
