Re: IndexWriter, DirectoryTaxonomyWriter & SearcherTaxonomyManager synchronization

William Moss Tue, 27 Sep 2016 18:06:12 -0700

Thank you both for your quick reply!

Responding to a few of Michael's points first:
* We actually tried the upgrade to 6.0 a few months back (when that was the
newest) and were getting similar errors to the ones I'm seeing now. We were
not able to track them down, which is part of the motivation for me asking
all these questions. We'll get there though :-)


* The last time we tested this (which I think was still post
ConcurrentMergePolicy) we saw that the read speed would slowly degrade over
time. My understanding was that forceMerge was very expensive, but would
make reads faster once complete. Is this not correct? Also, we never
attempted to tune the MergePolicy at all, so while were on the subject, is
there good documentation on how to do that? I'm much prefer to get away
from calling forceMerge. If it's useful information, we've got a relatively
small corpus, only ~2+M documents.

* We want to be able to ensure that if a machine or JVM crashes we are in a
coherent state. To that end, we need to call commit on Lucene and then
commit back what we've read so far to Kafka. Calling commit is the only way
to ensure this, right?

* To make sure I understand how maybeRefresh works, ignoring whether or not
we commit for a second, if I add a document via IndexWriter, it will not be
reflected in IndexSearchers I get by calling acquire on SearcherAndTaxonomy
until I call maybeRefresh?

Now, on to the concurrency issue. I was thinking a little more about this
and I think the fundamental issue is that while IndexWriter and
DirectoryTaxonomyWriter are each thread safe, them together are not. As
suggested by the documentation, we use one instance each of IndexWriter,
DirectoryTaxonomyWriter and SearcherTaxonomyManager. Imagine the following
scenario:
[Thread 1] Add document to DirectoryTaxonomyWriter
[Thread 1] Add document to IndexWriter
[Thread 1] Call commit on DirectoryTaxonomyWriter
[Thread 2] Add document to DirectoryTaxonomyWriter
[Thread 2] Add document to IndexWriter
[Thread 1] Call commit on IndexWriter
The on disc representation now should contain things in the IndexWriter
that are not contained in the DirectoryTaxonomyWriter, right?

Assuming maybeRefresh looks at the state on disk when it's doing it's
update (if this not true I don't understand why it was throwing
NoSuchFileException) then it can be out of sync as well?

I apparently never made a full copy of the stack trace. I'll attempt to
regenerate it and post it here once I have it.

Thanks again!
Will


On Mon, Sep 26, 2016 at 10:05 PM Shai Erera <[email protected]> wrote:

> Hmm ... the commit part of the two indexes is always tricky. The javadocs
> are correct because the order of indexing is as follows: when you index a
> document with facets, the facets are first added to the taxonomy index and
> only then the document is indexed in IW.
>
> Therefore if you concurrently index and commit, then committing TIW first
> ensures that all "known" facets up to this point are committed. Then when
> you commit IW, the documents in there are guaranteed to have their facet
> ordinals already in the committed TIW (which may at this point include more
> facets than are indexed in IW, but that's OK).
>
> When you then refresh the searchers, you can count on the refreshed IR to
> have all its ordinals in the refreshed TIR (which again, may include more
> ordinals, but that's OK).
>
> Hope this helps!
>
> Shai
>
> On Tue, Sep 27, 2016 at 6:57 AM Michael McCandless <
> [email protected]> wrote:
>
> > Hitting NoSuchFileException is no good!  Something serious is wrong.
> > Can you include the full stack trace?
> >
> > Responses inlined below:
> >
> > On Tue, Sep 27, 2016 at 2:08 AM, William Moss
> > <[email protected]> wrote:
> > > We're using Lucene 5.2.0 (I know it's old, we're in the process of
> > > upgrading) to handle searching over our listings here at Airbnb.
> >
> > 6.2.1 is a compelling upgrade because of more efficient indexing and
> > searching of numerics (among many other things!)...
> >
> > > I've been
> > > digging into our realtime indexing code and how we use Lucene and I
> > wanted
> > > to check a few assumptions around synchronization, since we see some
> > > periodic exceptions[1] that I can't quite explain.
> > >
> > > First, a tiny bit of background
> > > 1. We use facets and therefore are writing realtime updates using both
> > > a IndexWriter and DirectoryTaxonomyWriter.
> > > 2. We have multiple update threads, consuming messages (from Kafka) and
> > > updating the index.
> > > 3. Once we process a batch of messages, we call commit (first on
> > > DirectoryTaxonomyWriter then on IndexWriter).
> >
> > I see TaxonomyWriter's javadocs say that is the correct order, but I
> > would have expected the opposite, if you are concurrently indexing
> > documents.
> >
> > > 4. We use SearcherTaxonomyManager to manage instances of IndexSearcher.
> > > 5. We periodically call forceMerge on our IndexWriter (to improve
> > > performance).
> >
> > This is dubious: if your index continues to receive changes, you
> > should skip forceMerge and let Lucene's natural merging run at
> > defaults.  forceMerge is an incredibly costly operation and it's
> > unclear you get that much speedup at search time.
> >
> > > So, now to a few questions:
> > > 1. My understand is the right way to handle a DirectoryTaxonomyWriter
> and
> > > an IndexWriter is to call commit on DirectoryTaxonomyWriter before
> > > IndexWriter. Is this correct? Since we're using multiple threads, we
> need
> > > to synchronize these calls within the process regardless, but curious
> to
> > > understand the design.
> >
> > You should not have to block index updates while committing, if you
> > don't need/want to.
> >
> > If you don't block updates, I would think you need to commit the
> > DirectoryTaxonomyWriter second so that any new nodes in the taxonomy
> > tree, referenced by the main index, are guaranteed to be present in
> > the DirectoryTaxonomyWriter's commit.
> >
> > Maybe Shai can shed some more light here...
> >
> > > 2. What about calls to maybeRefresh on SearcherTaxonomyManager? Do
> those
> > > need to be synchronized with the commit calls to either IndexWriter or
> > > DirectoryTaxonomyWriter?
> >
> > No.
> >
> > Commit can be a costly, slow operation (calling fsync on N files), and
> > it's designed internally in IndexWriter to not block operations like
> > merging and refreshing.
> >
> > > Do we need to call it after ever time we call
> > > commit?  The comment suggests we call it "periodically," but I'm not
> > clear
> > > on how often that should be or what conditions trigger the index to
> > change
> > > in way that this would be required.
> >
> > You don't have to call refresh on every commit.  When you call it is
> > entirely up to you.
> >
> > Commit makes changes durable on disk, so an OS crash, power loss,
> > etc., won't lose those changes (a bad disk WILL lose them of course).
> >
> > Refresh makes changes visible for searching.
> >
> > The two ops are entirely separate.
> >
> > Some apps call commit periodically and never refresh, others call
> > refresh periodically and never commit :)  It's your call.
> >
> > > 3. Lastly, what about forceMerge? Is there any worry there or can that
> > just
> > > safely happen in the background? Is there any need to call commit
> > > afterward? Or does forceMerge effectively do that?
> >
> > Force merge does not call commit itself.
> >
> > If you do force merge, then it is a good idea to both commit and
> > refresh afterwards, as this will let Lucene free up resources (files,
> > file descriptors) with the old un-merged segments.
> >
> > > Presumably, we would not
> > > see the new index until maybeRefresh was called the next time?
> >
> > Exactly.
> >
> > > Sorry, that was a lot of questions, would love help on any and all of
> > them.
> >
> > No worries, keep them coming!
> >
> > > [1] When calling maybeRefresh, we've seen error that look like:
> > > java.nio.file.NoSuchFileException: <snip>/6/_vj1.cfe
> >
> > Need the full stack trace / context here to understand what's
> happening...
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> >
>

Re: IndexWriter, DirectoryTaxonomyWriter & SearcherTaxonomyManager synchronization

Reply via email to