At our institution we are working on the problem of making 12 million
bibliographic records searchable.  One of the ideas that has been tossed
around is using Lucene to accomplish this, and it seems to me that we could
use a DSpace instance (or something like it) to do so.

Now, I'm pretty sure we're not going to be using DSpace for this, but if we
use Lucene, whatever we develop will be a little like DSpace.  Because of
this, I have a few questions that will help us figure out what to do next.

1.  DSpace uses Lucene to search the bitstreams and the metadata, and uses
the database to retrieve metadata for display.  Is there any reason why the
metadata could not be retrieved from Lucene as well?  Why use the database
at all?  I have a feeling I know the answer to this question, but it would
be nice to hear it from one of the architects.

2.  Does anyone have any idea what the performance would be like with 12
million records in a Lucene environment, with or without an accompanying
database?  And would a dual storage system (Lucene and database) perform
well when it has to handle 12 million records?

3.  We will also need to update these records periodically, and it seems
that with an architecture similar to the one DSpace uses, it would take an
unreasonable amount of time to update 12 million records.  Just last week I
used the ItemImporter to load 3,600 records, and I believe it took about 7
hours for the load to complete.  I'm assuming it took so long because my
repository already has about 35,000 items and the database inserts were
taking time, more than anything related to Lucene.  When I loaded the same
number of records into my development instance, which has very few records
(probably about 1,000), it took less than 2 hours.  Any thoughts on this?

4.  Has anyone out there had to do something like this, and if so, what have
you found that works?  One solution that comes to mind is Zebra.  It is
supposed to handle large repositories quite well.  Are there any users of
Zebra out there who might have an opinion on this?
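
For a rough sense of scale on question 3, here is a back-of-envelope sketch
in Python that extrapolates the two load rates quoted above to 12 million
records.  It assumes load time grows linearly with record count, which is
probably optimistic given that the larger repository loaded more slowly.
The function names are made up for illustration, and the 2-hour figure is an
upper bound since the development load took less than that.

```python
# Back-of-envelope extrapolation of the observed ItemImporter load rates.
# Assumption: time per record stays constant as the repository grows.

def seconds_per_record(records, hours):
    """Average seconds spent per record for an observed load."""
    return hours * 3600 / records

def projected_days(total_records, records, hours):
    """Days needed to load total_records at the observed rate."""
    return total_records * seconds_per_record(records, hours) / 86400

TOTAL_RECORDS = 12_000_000

# Production instance: 3,600 records in about 7 hours.
prod_days = projected_days(TOTAL_RECORDS, 3600, 7)
# Development instance: 3,600 records in (at most) 2 hours.
dev_days = projected_days(TOTAL_RECORDS, 3600, 2)

print(f"production rate:  {prod_days:.0f} days")
print(f"development rate: {dev_days:.0f} days")
```

By this estimate the production rate works out to roughly 972 days and the
development rate to roughly 278 days, so a record-at-a-time load looks
impractical either way.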

 

Your thoughts on this would be greatly appreciated.  Thank you for taking
the time to consider these questions.

 

-Jose

_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
