Re: Lock handling and Lucene 1.9 / 2.0

Pete Lewis Mon, 13 Sep 2004 15:11:05 -0700

Hi Christoph

The directory caching is applied *across* class instances (each directory
is instantiated once) - there is a single cache, and it is updated whenever
FSDirectory is called against a different index.


Multiple indexes will *always* cause directory caching upon calls to
FSDirectory - our searches are made sequentially against all libraries
(or a selection of libraries) and this sequential call to FSDirectory
causes the cache to be updated - it's very, very rare that the cache will
remain the same between two calls to get FSDirectories. This caching
*is* synchronized using the commit.lock (see the code), and two processes
(independent JVMs) will get two completely separate caches *but* are
tied together by the commit lock. This is what causes the spin.
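
As a rough illustration of the kind of per-directory instance cache under
discussion, here is a minimal sketch. It is not the FSDirectory source:
`CachedDir`, `get` and the choice of a normalized absolute path as the cache
key are all assumptions made for the example. It shows only the in-process
side, where every lookup passes through one class-level monitor regardless
of which index is requested, and it does not touch any lock file.

```java
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a cached directory handle; not Lucene's FSDirectory.
final class CachedDir {
    // One JVM-wide table: normalized absolute path -> the single instance for it.
    private static final Map<String, CachedDir> CACHE = new HashMap<>();

    final String path;

    private CachedDir(String path) {
        this.path = path;
    }

    // Every caller, whatever index it wants, goes through this one
    // class-level monitor; that is the in-process contention point.
    static synchronized CachedDir get(String rawPath) {
        String key = Paths.get(rawPath).toAbsolutePath().normalize().toString();
        CachedDir dir = CACHE.get(key);
        if (dir == null) {
            dir = new CachedDir(key);
            CACHE.put(key, dir);
        }
        return dir;
    }

    public static void main(String[] args) {
        CachedDir a = get("indexA");
        CachedDir b = get("./indexA");   // same directory, different spelling
        CachedDir c = get("indexB");     // a different index
        System.out.println("same path shares an instance: " + (a == b));
        System.out.println("different path shares an instance: " + (a == c));
    }
}
```

Because the same path always yields the same object, code elsewhere can use
`synchronized (dir)` on it and be sure that all threads in the JVM that work
on that index are contending for one monitor.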

> FSDirectory.getDirectory has nothing to do with a commit.lock!

Err, wrong. The directory.makeLock(IndexWriter.COMMIT_LOCK_NAME) call
from within the IndexReader.open routine ties the commit.lock to the
FSDirectory by synchronising the code around a *static* instance of the
directory object (see the code!!).
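
The shape of that code (a monitor on the shared directory instance wrapped
around a lock file that is polled until a timeout expires) can be sketched
roughly as follows. This is an illustration under stated assumptions, not
Lucene's Lock.With: `LockFileGuard`, its constructor parameters and the use
of `File.createNewFile()` as the lock primitive are hypothetical choices for
the example.

```java
import java.io.File;
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical sketch of a lock-file guard with a polling timeout.
final class LockFileGuard {
    private final File lockFile;
    private final long timeoutMillis;
    private final long pollMillis;

    LockFileGuard(File lockFile, long timeoutMillis, long pollMillis) {
        this.lockFile = lockFile;
        this.timeoutMillis = timeoutMillis;
        this.pollMillis = pollMillis;
    }

    // Runs body while holding both the monitor of 'shared' (in-process)
    // and the lock file (inter-process).
    <T> T run(Object shared, Callable<T> body) throws Exception {
        synchronized (shared) {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            // createNewFile() creates the file only if it does not exist yet.
            while (!lockFile.createNewFile()) {
                if (System.currentTimeMillis() >= deadline) {
                    throw new IOException("Lock obtain timed out: " + lockFile);
                }
                Thread.sleep(pollMillis);   // the "spin": poll until the file goes away
            }
            try {
                return body.call();
            } finally {
                lockFile.delete();          // release for other processes
            }
        }
    }

    public static void main(String[] args) throws Exception {
        File lock = new File(System.getProperty("java.io.tmpdir"),
                "commit-sketch-" + System.nanoTime() + ".lock");
        LockFileGuard guard = new LockFileGuard(lock, 1000, 50);
        Object directory = new Object();    // stands in for the cached directory instance
        System.out.println(guard.run(directory, () -> "segments read"));
    }
}
```

Two threads in one JVM queue on the monitor; two JVMs each hold their own
monitor and meet only at the lock file, where the loser polls until the file
disappears or the timeout fires.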

Cheers
Pete Lewis

----- Original Message ----- 
From: "Christoph Goller" <[EMAIL PROTECTED]>
To: "Lucene Developers List" <[EMAIL PROTECTED]>
Sent: Monday, September 13, 2004 11:34 AM
Subject: Re: Lock handling and Lucene 1.9 / 2.0


> Pete Lewis wrote:
> > Hi Christoph
>
> > Long answer - there's a heap of horrible, horrible code in the
> > FSDirectory that tries to be clever and I think it's not quite working
> > correctly.
> >
> > Two types of lock - write.lock and commit.lock. The write.lock is used
> > exclusively for synchronising the indexing of documents and has *no*
> > impact on searching whatsoever.
> >
> > Commit.lock is another little story. Commit.lock is used for two
> > things - stopping indexing processes from overwriting segments that
> > another one is currently using, and stopping IndexReaders from
> > overwriting each other when they delete entries (don't even start
> > asking me why a bloody IndexReader can delete documents).
>
> Commit.lock is used to synchronize the committing of changes to an index
> with the process of opening an IndexReader. These changes may come from
> an IndexWriter or an IndexReader. There are good reasons for having the
> delete functionality in IndexReader (see developer mailing list around
> July 16). Write.lock is used to guarantee that there is always only one
> writer.
>
> >
> > *However*, there's another naughty little usage that isn't listed in
> > any of the documentation, and here it is....
> >
> > Doug Cutting wrote FSDirectory in such a way that it caches a directory.
> > Hence, if FSDirectory is called more than once with the same directory,
> > the FSDirectory class uses a static Hashtable to return the current
> > values. However, if FSDirectory is called with a *different* directory,
> > it engages a commit.lock while it updates the values. It *also* makes
> > that Hashtable synchronised.
>
> FSDirectory.getDirectory has nothing to do with a commit.lock!
> Lucene currently uses 2 locking mechanisms, the interprocess
> mechanism with the commit.lock file and an intraprocess mechanism
> based on synchronization on directory instances. The 2nd mechanism
> needs unique directory instances and this is achieved by caching
> directory instances in FSDirectory.
>
> >
> > Creating an IndexSearcher creates (within itself) an IndexReader to read
> > the index. The first thing the IndexReader does is grab an FSDirectory
> > for the index directory - if you are using Lucene with a single index,
> > there is never a problem - it is read once, then cached.
> >
> > Our search process works by searching across all the selected libraries
> > sequentially, building a results list and then culling the results it
> > doesn't need. To search, it loops through each library and creates an
> > IndexSearcher to get at the data.
> >
> > Starting to see the issue yet? Because each library is in a different
> > directory, the internal call to the IndexReader, which then gets an
> > FSDirectory, causes the FSDirectory to update its single cache, which
> > forces a commit.lock to appear.
> >
> > Doug Cutting's little bit of 'neat' code for caching the data in a
> > single place within an FSDirectory is causing us immense headaches. The
> > code is horrible:
> >
> > /** Returns an IndexReader reading the index in the given Directory. */
> >   public static IndexReader open(final Directory directory)
> >       throws IOException {
> >     synchronized (directory) {     // in- & inter-process sync
> >       return (IndexReader)new Lock.With(
> >           directory.makeLock(IndexWriter.COMMIT_LOCK_NAME),
> >           IndexWriter.COMMIT_LOCK_TIMEOUT) {
> >           public Object doBody() throws IOException {
> >             SegmentInfos infos = new SegmentInfos();
> >             infos.read(directory);
> >             if (infos.size() == 1) {    // index is optimized
> >               return new SegmentReader(infos, infos.info(0), true);
> >             } else {
> >                 SegmentReader[] readers =
> >                     new SegmentReader[infos.size()];
> >                 for (int i = 0; i < infos.size(); i++)
> >                   readers[i] = new SegmentReader(infos, infos.info(i),
> >                                                  i==infos.size()-1);
> >                 return new SegmentsReader(infos, directory, readers);
> >             }
> >           }
> >         }.run();
> >     }
> >   }
> >
> > Where directory is passed in from the constructor to IndexReader thus:
> >
> >   return open( FSDirectory.getDirectory( path, false ) );
>
> All threads that open an IndexReader and that don't get a directory
> instance directly have to compete for FSDirectory.getDirectory
> synchronization, independent of the index they are trying to open. So you
> are right. This is a bottleneck.
>
> After that, threads opening an IndexReader only compete with each other
> if they try to read the same index. This is handled by the two above
> mentioned locking mechanisms.
>
> Here are two ideas that could help:
> The bottleneck only occurs if you always start a new process for every
> search, doesn't it? If you make a second search within the same process,
> the directory instances will already be cached and the bottleneck won't
> be a problem. Furthermore, you do not have to open new searchers for
> every search. Can't you reuse your Searcher instances for multiple
> searches?
>
> A question for Lucene 1.9/2.0 is whether we really need both intraprocess
> and interprocess synchronization. Maybe these two mechanisms exist for
> purely historical reasons and the interprocess mechanism alone would be
> enough?
>
> Christoph
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
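
Christoph's second suggestion, reusing searcher instances rather than
opening one per search, could look roughly like the sketch below.
`SearcherPool` and its `opener` function are hypothetical names; the type
parameter `S` stands in for whatever searcher class is in use, so nothing
here is Lucene API. One caveat to keep in mind: a searcher kept open this
way reflects the index as it was when opened, so it needs reopening to see
later changes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical per-index pool: opens a searcher once per path and reuses it.
final class SearcherPool<S> {
    private final ConcurrentMap<String, S> searchers = new ConcurrentHashMap<>();
    private final Function<String, S> opener;

    SearcherPool(Function<String, S> opener) {
        this.opener = opener;
    }

    // Returns the cached searcher for this index, opening it on first use only.
    S get(String indexPath) {
        return searchers.computeIfAbsent(indexPath, opener);
    }

    public static void main(String[] args) {
        AtomicInteger opens = new AtomicInteger();
        // The "searcher" is just a String here; a real opener would do the costly open.
        SearcherPool<String> pool = new SearcherPool<>(path -> {
            opens.incrementAndGet();
            return "searcher for " + path;
        });
        // Three rounds of searching across the same two libraries.
        for (int search = 0; search < 3; search++) {
            for (String library : new String[] {"libA", "libB"}) {
                pool.get(library);
            }
        }
        // Two libraries, so the opener ran twice despite six lookups.
        System.out.println("opens: " + opens.get());
    }
}
```

With this arrangement the directory lookup and the lock acquisition happen
once per index per process rather than once per search.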


