I should probably direct this to Doug Cutting, but following that thread I come to Doug's post at http://www.mail-archive.com/[email protected]/msg12709.html .
Doug says:
> 1. On the index master, periodically checkpoint the index. Every minute or
> so the IndexWriter is closed and a 'cp -lr index index.DATE' command is
> executed from Java, where DATE is the current date and time. This
> efficiently makes a copy of the index when it's in a consistent state by
> constructing a tree of hard links. If Lucene re-writes any files (e.g., the
> segments file) a new inode is created and the copy is unchanged.
How can that be so? When the segments file is re-written, surely it will
clobber the copy rather than creating a new inode, because it has the same
name... wouldn't it?
What makes it different from (say)...
mkdir x
echo original > x/x.txt
cp -lr x x.copy
echo update > x/x.txt
diff x/x.txt x.copy/x.txt
...where x.copy/x.txt has "update" rather than "original" (certainly on
Linux).
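One explanation (assuming Lucene replaces the segments file by writing a new file and then renaming it over the old name, rather than truncating it in place — I believe that is what it does, but I haven't verified it) is that rename swaps the directory entry, not the inode. A small shell sketch of the difference:

```shell
mkdir y
echo original > y/y.txt
cp -lr y y.copy            # y.copy/y.txt is a hard link to the same inode
# Overwriting in place (echo update > y/y.txt) would change both links,
# because both directory entries point at the same inode.
# Write-then-rename instead points y/y.txt at a brand-new inode,
# while y.copy/y.txt still references the old one:
echo update > y/y.txt.new
mv y/y.txt.new y/y.txt
cat y/y.txt                # update
cat y.copy/y.txt           # original
```

So the copy is only safe against files that are replaced by rename; files modified in place would indeed be clobbered, as in the `echo update > x/x.txt` example above.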
-----Original Message-----
From: James Pine [mailto:[EMAIL PROTECTED]]
Sent: 06 July 2006 20:09
To: [email protected]
Subject: RE: Managing a large archival (and constantly changing) database
Hey,
I found this thread to be very useful when deciding upon an indexing
strategy.
http://www.mail-archive.com/[email protected]/msg12700.html
The system I work on has 3 million or so documents, and it was (until a
non-Lucene performance issue came up) set up to add/delete documents every
15 minutes in a manner similar to that described in the thread. We were
adding/deleting a few thousand documents every 15 minutes during peak
traffic. We have a dedicated indexing machine and distribute portions of our
index across multiple machines, but you could still follow the same pattern
all on one box, just with separate processes/threads.
Even though Lucene allows certain types of index operations to happen
concurrently with search activity, IMHO, if you can decouple the indexing
process from the searching process, your system as a whole will be more
flexible and scalable, with only a little extra maintenance overhead.
JAMES
--- Larry Ogrodnek <[EMAIL PROTECTED]> wrote:
> We have a similar setup, although probably only 1/5th the number of
> documents and updates. I'd suggest just making periodic index
> backups.
>
> I've been storing my index as follows:
>
> <workdir>/<index-name>/data/ (lucene index
> directory)
> <workdir>/<index-name>/backups/
>
> The "data" is what's passed into
> IndexWriter/IndexReader. Additionally, I create/update a .last_update
> file, which just contains the timestamp of when the last update was
> started, so when the app starts up it only needs to retrieve updates
> from the db since then.
>
> Periodically the app copies the contents of data into a new directory
> in backups named by the date/time, e.g.
> backups/2007-07-04.110051. If
> needed, I can delete data and replace the contents with the latest
> backup, and the app will only retrieve records updated since the
> backup was made (using the backup's .last_update)...
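The layout and backup/restore cycle described above could be sketched roughly as follows (the work directory, index name, and file contents are hypothetical placeholders, not the poster's actual setup):

```shell
# Hypothetical sketch of the timestamped-backup scheme described above.
WORKDIR=./workdir
NAME=articles
mkdir -p "$WORKDIR/$NAME/data" "$WORKDIR/$NAME/backups"
echo "some index file" > "$WORKDIR/$NAME/data/seg0"   # stand-in for Lucene files

# Periodic backup: copy data into a directory named by date/time,
# e.g. backups/2007-07-04.110051.
STAMP=$(date +%Y-%m-%d.%H%M%S)
cp -r "$WORKDIR/$NAME/data" "$WORKDIR/$NAME/backups/$STAMP"

# Restore: replace data with the most recent backup; the app then
# re-fetches from the db anything newer than the backup's .last_update.
LATEST=$(ls "$WORKDIR/$NAME/backups" | sort | tail -n 1)
rm -rf "$WORKDIR/$NAME/data"
cp -r "$WORKDIR/$NAME/backups/$LATEST" "$WORKDIR/$NAME/data"
```

Note that copying a live Lucene directory is only safe while no writer is modifying it, which is presumably why the copy is taken between update passes.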
>
> I'd recommend making complete index creation from scratch a normal
> operation as much as possible (though you're right, for that number of
> documents it will take a while). It's been really helpful here when
> doing additional deploys for testing, or when deciding we want to index
> things differently, etc...
>
> -larry
>
>
> -----Original Message-----
> From: Scott Smith [mailto:[EMAIL PROTECTED]
>
> Sent: Thursday, July 06, 2006 1:48 PM
> To: [email protected]
> Subject: Managing a large archival (and constantly
> changing) database
>
> I've been asked to do a project which provides full-text search for a
> large database of articles. The expectation is that most of the
> articles are fairly small (<2k bytes). There will be an initial
> population of around 400,000 articles, with approximately 2,000 new
> articles added each day. They need to be added in "real time" (within
> a few minutes of arrival), but will be spread out during the day. So,
> roughly another 700,000 articles each year.
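(For what it's worth, that yearly figure checks out:)

```shell
# 2,000 new articles/day over a year
echo $((2000 * 365))   # 730000, i.e. roughly 700,000 per year
```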
>
>
>
> I've read enough to believe that having a Lucene database of several
> million articles is doable, and adding 2,000 articles per day wouldn't
> seem to be that many. My concern is the real-time nature of the
> application. I'm a bit nervous (perhaps without justification) at
> simply growing one monolithic Lucene database. Should there be a
> crash, the database will be unusable and I'll have to rebuild from
> scratch (which, based on my experience, would take hours).
>
>
>
> Some of my thoughts were:
>
> 1) Have monthly databases and use MultiSearcher to search across
> them. That way my exposure to a corrupted database is limited to this
> month's database. This would also seem to give me somewhat better
> control: if a search generated lots of hits, I could display the
> results a month at a time and not bury the user in output. It would
> also spread the search CPU load out better and not prevent other
> individuals from doing a search. If there were very few results, I
> could sleep between each month's search and, again, not lock everyone
> else out from searches.
>
> 2) Have a "this month" searchable and an "everything else"
> searchable. At the beginning of each month, I would consolidate the
> previous month's database into the "everything else" searchable. This
> would give more consistent results for relevancy-ranked searches,
> but it means that a bad search could return lots of results.
>
>
>
> Has anyone else dealt with a similar problem? Am I expecting too much
> from Lucene running on a single machine (or should I be looking at
> Hadoop)? Any comments or links to previous discussions on this topic
> would be appreciated.
>
>
>
> Scott
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
