Re: This may be a bug

Simon Svensson Wed, 21 Nov 2012 21:11:06 -0800

Hi,

I'm using my mail client's reply-all functionality to answer both the developer mailing list and you personally, in case you're not part of the mailing list. You only need to respond to the mailing list. (I'm not sure which functionality you use; I got several copies of the answer below.)

For an explanation of the segments_N and segments.gen files, I'll refer you to http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/fileformats.html


   The active segments in the index are stored in the segment info
   file, segments_N. There may be one or more segments_N files in the
   index; however, the one with the largest generation is the active
   one (when older segments_N files are present it's because they
   temporarily cannot be deleted, or, a writer is in the process of
   committing, or a custom IndexDeletionPolicy is in use). [...]

   As of 2.1, there is also a file segments.gen. This file contains the
   current generation (the _N in segments_N) of the index. This is used
   only as a fallback in case the current generation cannot be
   accurately determined by directory listing alone (as is the case for
   some NFS clients with time-based directory cache expiration). [...]
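
To illustrate the "largest generation is the active one" rule quoted above, here is a minimal sketch that only looks at file names. It assumes, as far as I understand the file format, that the N suffix is written in base 36 (digits 0-9, then a-z). SegmentsFiles, ParseGeneration and FindActiveSegmentsFile are names made up for this example; they are not part of Lucene.Net.

```csharp
using System;
using System.IO;

// Hypothetical helper, not a Lucene.Net API: inspects file names only.
public static class SegmentsFiles
{
    private const string Prefix = "segments_";

    // Returns the generation encoded in a segments_N file name,
    // or -1 if the name is not a segments_N file (e.g. segments.gen).
    // Assumption: N is a lowercase base-36 number.
    public static long ParseGeneration(string fileName)
    {
        if (!fileName.StartsWith(Prefix, StringComparison.Ordinal) ||
            fileName.Length == Prefix.Length)
        {
            return -1;
        }

        long generation = 0;
        foreach (char c in fileName.Substring(Prefix.Length))
        {
            int digit;
            if (c >= '0' && c <= '9') digit = c - '0';
            else if (c >= 'a' && c <= 'z') digit = c - 'a' + 10;
            else return -1;
            generation = generation * 36 + digit;
        }
        return generation;
    }

    // Returns the name of the segments_N file with the largest generation,
    // or null if the directory contains no segments_N file at all.
    public static string FindActiveSegmentsFile(string indexDirectory)
    {
        string activeName = null;
        long activeGeneration = -1;
        foreach (string path in Directory.GetFiles(indexDirectory, "segments_*"))
        {
            string name = Path.GetFileName(path);
            long generation = ParseGeneration(name);
            if (generation > activeGeneration)
            {
                activeGeneration = generation;
                activeName = name;
            }
        }
        return activeName;
    }
}
```

Running FindActiveSegmentsFile against a directory that holds only segments.gen and segments_1 would report segments_1 as the active commit, which is what a freshly created, still empty index looks like.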

One of the critical issues is your _PeriodicDocFlusher. Could you share it with us? Does it call IndexWriter.Flush, as indicated by the name, or IndexWriter.Commit? Does it reopen your nrt-reader, using either IndexWriter.GetReader or IndexReader.Reopen? If so, does it also recreate a new IndexSearcher?

Everything so far sounds like you forgot to call IndexWriter.Commit. Flush will move memory buffers and what-not to disk, but will not commit it for others to read (only a reader from IndexWriter.GetReader will be able to read it). All this data will be removed as a cleanup procedure, a kind of rollback, when another writer is opened against the directory after an application restart.
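
A minimal sketch of the pattern I have in mind, assuming the Lucene.Net 2.9.x API (IndexWriter.GetReader, IndexReader.Reopen, IndexWriter.Commit). NrtSearcherHolder, Refresh and CommitToDisk are hypothetical names for this example, and I have not run it against your setup.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

// Hypothetical wrapper, not part of Lucene.Net.
public sealed class NrtSearcherHolder
{
    private readonly object _sync = new object();
    private readonly IndexWriter _writer;
    private IndexReader _reader;
    private IndexSearcher _searcher;

    public NrtSearcherHolder(IndexWriter writer)
    {
        _writer = writer;
        // An NRT reader sees the writer's uncommitted changes
        // as of the moment it was obtained.
        _reader = writer.GetReader();
        _searcher = new IndexSearcher(_reader);
    }

    // Makes recent adds/updates searchable. Does NOT make them durable.
    public void Refresh()
    {
        lock (_sync)
        {
            IndexReader newReader = _reader.Reopen();
            if (newReader != _reader)
            {
                // Swap in the new reader and build a searcher on top of it.
                _searcher.Close();
                _reader.Close();
                _reader = newReader;
                _searcher = new IndexSearcher(_reader);
            }
        }
    }

    // Makes everything written so far durable, so that it survives
    // a restart and is visible to readers opened from the directory.
    public void CommitToDisk()
    {
        _writer.Commit();
    }

    public TopDocs Search(Query query, int maxDocsToReturn)
    {
        lock (_sync)
        {
            return _searcher.Search(query, maxDocsToReturn);
        }
    }
}
```

The point of keeping Refresh and CommitToDisk separate is that they answer different questions: Refresh controls how quickly a search sees a new document, while CommitToDisk controls whether the document is still there after the process restarts.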


// Simon

On 2012-11-21 21:29, Gerry Suggitt wrote:

Yes, I am using NRT. (Or I should say, I am trying to!) I commit within 10 seconds after a document is added. (When a document arrives, I start a timer to allow more documents to come in before making the commitment.)

And before I get into more details of the bug I reported, may I ask you a question about NRT?

---------- start of NRT question ----------

According to the documentation (at least as it is described for Java), NRT should allow "/updates to be efficiently searched hopefully within milliseconds after an update is complete/". I have found that after an update, the document is not found until I have performed a commit.

Here is the code that creates the reader, writer and searcher (_flusher is the 10 second commit timer):
        public void Start()
        {
            _logger.Info( () => "LuceneEngine.Start " + _pathname );
            System.IO.DirectoryInfo dir = new System.IO.DirectoryInfo( _pathname );
            // nolock is OK for now because we have a single thread accessing the directory
            _directory = FSDirectory.Open( dir, new NoLockFactory() );
            _analyzer = new PerFieldAnalyzerWrapper( new WhitespaceAnalyzer() );
            _writer = new IndexWriter( _directory, _analyzer, IndexWriter.MaxFieldLength.UNLIMITED );
            _logger.Info( () => "LuceneEngine.Start begin Optimize" );
            _writer.Optimize();
            _logger.Info( () => "LuceneEngine.Start Optimize finished" );
            _reader = _writer.GetReader();
            _searcher = new IndexSearcher( _reader );
            _flusher = new _PeriodicDocFlusher( 10000, _maxDocsInCache, _OnFlushTimer, _logger );
            _logger.Info( () => "LuceneEngine.Start " + _pathname + " - up and running" );
        }
And here is the code that uses it (_DocAdded() just starts the commit timer if it is not already running):
        public void UpdateDocument( string id, Document doc )
        {
            _writer.UpdateDocument( new Term( "id", id ), doc, _analyzer );
            _DocAdded();
        }
        public TopDocs Search( Query query, int maxDocsToReturn )
        {
            lock( this )
            {
                return _searcher.Search( query, maxDocsToReturn );
            }
        }
To test it, after making a call to UpdateDocument with id = xxx, I looped making repeated calls to Search where the query was id:xxx. This continually returned 0 documents until the timer kicked in and performed the commit. And then Search returned 1 hit.

So is this expected? I didn't think so, but maybe I just misinterpreted the documentation.

---------- end of NRT question ----------

Back to the issue at hand ...
I tried to reproduce the problem as I described and was unable to. So I am uncertain what was happening there.

But I have some more information about the lost databases on our test machines.

When I say the databases were completely empty, the directory actually held two files:
segments.gen
segments_1
On one machine we restored the data from an external backup (actually a SQL database!) and everything worked fine from then on. We could see several files in the database directory.

The other lab machine was untouched and here we discovered something that might be important.

We noticed that on the first machine, after restoring (which essentially performed a series of _writer.UpdateDocument calls) and after we stopped and started the Lucene service, the timestamp on the segments.gen had changed and we now had a file segments_2. (I know, I know, you are going "well, duh", but hold on a sec.)

The second machine we had not touched. And the time on the segments.gen file was November 12, 12:31 PM.
But the reboot of the machines occurred on November 17 at 3:30 PM.
So why wasn't the timestamp updated? My guess: Because there were no index files in the directory!

But ... I have logs that show 500 documents being added successfully to the database AFTER November 12, 12:31 PM. And these logs show commits being performed.
Furthermore, searches are returning documents.
So it appears (and this is just my guess) that the commits were making the necessary updates to the in-memory data structures that allowed searches to work, but the data was never saved to the disk. No exception occurred of the kind that might have been thrown by a failure to write to the disk, so at this point I am baffled.

Now why the data was not saved to the disk last week but is being saved this week is beyond me.

I know we don't have much to work with. I will continue to see if I can reproduce the problem. If there is anything else you would like me to check, please ask.

Thanks - Gerry
----- Original Message -----

    *From:* Simon Svensson <mailto:[email protected]>
    *To:* [email protected] <mailto:[email protected]>
    *Cc:* Gerry Suggitt <mailto:[email protected]>
    *Sent:* Wednesday, November 21, 2012 3:05 AM
    *Subject:* Re: This may be a bug

    Hi,

    This does indeed sound serious. Are you saying that you have a
    snapshot (with committed documents) that is cleared when calling
    IndexWriter.Optimize? Can you share it for reproduction purposes?

    Are you using near-realtime indexing? What you describe could
    happen if you were using nrt, and never called IndexWriter.Commit.
    The index would indeed be cleared next time a writer is opened
    against the directory, a step in clearing out unused index files.
    A kind of rollback of uncommitted changes.

    // Simon


    On 2012-11-20 16:45, Gerry Suggitt wrote:
    > Sorry to send this email directly to the developers, but I
    > couldn't see any other way of entering a defect.
    >
    > My name is Gerry Suggitt and I work for Leafsprout Technologies,
    > a company that creates products for the Medical Information sector.
    >
    > We have created a Master Patient Index using Lucene that works
    > very well - we are able to perform fuzzy matching and all the nice
    > things that you want in an MPI.
    >
    > But something terrible just happened. Fortunately this occurred
    > in our own lab - we have not yet released the product to the field.
    >
    > Sometime over the weekend, the computers holding the Lucene
    > database rebooted (probably from a Windows upgrade). All of the
    > Lucene databases were blown away! Completely empty!
    >
    > Recently, I had noticed the same thing when I was doing some
    > testing, so it may be related.
    >
    > We are currently using version 2.9.4.1
    >
    > What I was doing in my testing was taking a snapshot of the
    > Lucene database files (just a copy to another directory). I would
    > run some tests which would affect the database, so before
    > continuing I would copy the snapshot back.
    >
    > When I started the Lucene service, the database was blown away!
    > Completely empty!
    >
    > I was able to determine what was doing this. At startup, I was
    > performing an optimize. This seems like a good time for me: At
    > startup we know no client is making demands on the system. When I
    > commented out the call to optimize, the database remained intact
    > upon startup.
    >
    > The systems that lost their databases still had the call to
    > optimize in them.
    >
    > Please help!
    >
