It's my impression that with optimize running for so long, there will be a significant period of time (many minutes) when the old IndexReader will not be able to find the segments/documents it needs. Am I wrong about that?
- Naomi

> Could you explain why you need to copy the index? It doesn't seem
> like that buys you anything (except maybe if the copy is to a
> physically separate disk).
>
> -Yonik
>
>
> On 5/10/05, Naomi Dushay <[EMAIL PROTECTED]> wrote:
> > Context: our index is currently around 6 gig and takes about an hour just to
> > optimize. Updating it, even in batches, can involve active updating for 15
> > or more minutes.
> >
> > Index updates are done with two different batch processes, as there are
> > currently two different workflows to update the index. No obvious indexing
> > partitioning is suggested by our workflows.** The index is used in a read-only
> > fashion by our REST search service, which runs under Tomcat.
> >
> > Issues: we don't want our REST service to return slow or strange results while
> > the index is being updated.
> >
> > Proposed solution:
> >
> > The world starts with the REST service pointing to index A. Our REST service
> > is read only, so the index file(s) themselves can be read only.
> >
> > To update the index, the following sequence occurs:
> >
> > 1. lock index for update (via a lock/flag file)
> > 2. copy index A into another directory, and make it writable. This new
> >    index will become A'
> > 3. update index A', including optimization
> > 4. set index A' to read only
> > 5. gracefully change the REST service to point to index A'; verify that
> >    the REST service is working properly with A'
> > 6. optional, but likely: remove the old, out-of-date index A
> > 7. unlock index for update (remove lock/flag file)
> >
> > The index updates could keep re-using the same two index directories, or they
> > could create new directories.
> >
> > Does anyone see any problems or have any suggestions for how to improve this?
> >
> > - Naomi Dushay
> >
> > National Science Digital Library - Core Integration
> >
> > Cornell University
> >
> > ** Workflow 1: we get bibliographic metadata that needs to go in the index.
> > This metadata may be new, or it may be an update to existing metadata. The
> > metadata records refer to resource URLs. Workflow 2: we fetch the content
> > for the resource URLs, using Nutch as our crawler and content database.
> > Fetched content may be for a new URL or updated content for a known URL. The
> > Lucene Documents are a combination of bibliographic metadata and fetched
> > content.
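For what it's worth, here is a minimal sketch of what steps 1 through 4 and 7 of the proposed sequence might look like, assuming the classic Lucene 1.x API (IndexWriter.optimize(), Field.Keyword) and hypothetical paths and helper names (/indexes/A, /indexes/A-prime, update.lock, copyIndexDir):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/** Sketch of the proposed lock / copy / update / optimize cycle. */
public class IndexUpdateSketch {

    // Hypothetical layout; substitute the real index and lock locations.
    private static final File INDEX_A      = new File("/indexes/A");
    private static final File INDEX_APRIME = new File("/indexes/A-prime");
    private static final File UPDATE_LOCK  = new File("/indexes/update.lock");

    public static void main(String[] args) throws IOException {
        // Step 1: lock the index for update via a flag file.
        if (!UPDATE_LOCK.createNewFile()) {
            throw new IOException("an index update is already in progress");
        }
        try {
            // Step 2: copy index A into another directory (A'), left writable.
            copyIndexDir(INDEX_A, INDEX_APRIME);

            // Step 3: update A', then optimize (the long-running part).
            // false = open the existing (copied) index rather than create a new one.
            IndexWriter writer = new IndexWriter(INDEX_APRIME, new StandardAnalyzer(), false);
            Document doc = new Document();
            // Example document only; real updates merge metadata and fetched content.
            doc.add(Field.Keyword("url", "http://example.org/resource"));
            writer.addDocument(doc);
            writer.optimize();
            writer.close();

            // Step 4: set A' to read only before the REST service starts using it.
            File[] files = INDEX_APRIME.listFiles();
            for (int i = 0; i < files.length; i++) {
                files[i].setReadOnly();
            }
            // Steps 5 and 6 (repoint the REST service, remove A) happen elsewhere.
        } finally {
            // Step 7: unlock the index for update by removing the flag file.
            UPDATE_LOCK.delete();
        }
    }

    // A Lucene index directory is flat, so a file-by-file copy is enough.
    private static void copyIndexDir(File src, File dest) throws IOException {
        if (!dest.exists() && !dest.mkdirs()) {
            throw new IOException("could not create " + dest);
        }
        File[] files = src.listFiles();
        for (int i = 0; i < files.length; i++) {
            FileInputStream in = new FileInputStream(files[i]);
            FileOutputStream out = new FileOutputStream(new File(dest, files[i].getName()));
            byte[] buf = new byte[8192];
            for (int n = in.read(buf); n > 0; n = in.read(buf)) {
                out.write(buf, 0, n);
            }
            in.close();
            out.close();
        }
    }
}

Note that copying 6 gig is itself slow, so in this scheme the lock file has to be held for the copy plus the batch updates plus the optimize.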
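And a sketch of the swap behind step 5, with illustrative names (SearcherHolder, swapTo) that are not from the thread. Since index A itself is never written in this scheme (the updates go to the copy A'), a searcher already open on A should keep working until A is deleted in step 6, which also bears on the IndexReader question above:

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;

/** Holds the searcher the REST service queries, so it can be repointed at A'. */
public class SearcherHolder {

    private volatile IndexSearcher current;  // search requests read this reference

    public SearcherHolder(String indexPath) throws IOException {
        current = new IndexSearcher(indexPath);
    }

    /** Used by search requests; stays valid while an update runs against the copy. */
    public IndexSearcher getSearcher() {
        return current;
    }

    /** Step 5: switch to A'; the old searcher is closed only after the switch. */
    public synchronized void swapTo(String newIndexPath) throws IOException {
        IndexSearcher previous = current;
        current = new IndexSearcher(newIndexPath);  // open and verify A' first
        // In practice, delay this close until in-flight searches on the old
        // index have finished (or simply postpone step 6 so A's files remain).
        previous.close();
    }
}

The only delicate point is when to close the old searcher: either wait for in-flight searches to drain, or delay removing index A so its files stay available a little longer.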