Re: This may be a bug - NOPE IT'S NOT (sorry)

Gerry Suggitt Fri, 23 Nov 2012 02:17:03 -0800

You are absolutely right. I wasn't doing a commit in my "flusher" function - I 
was just closing and reopening the reader and searcher.


I vaguely recall having performance issues back when the flusher function 
called commit (before I was using NRT). After moving to NRT, I noticed that 
everything worked fine without the commit (because the reader no longer relied 
on the data on the file system).

I did however have a commit in the service Shutdown code.

But upon reviewing the logs from the Windows Upgrade reboot that occurred last 
Saturday at 3:00, I noticed that the services were not shut down gracefully, so 
the Shutdown code, and hence the commit, was never called.

Both databases that were lost contained a small number of documents, so they 
were probably never committed to the file system automatically.

So sorry to have troubled you, but thanks for all your help.

The good news is that the abrupt shutdown revealed a bug in our code, and we 
have since devised a recovery mechanism that would protect us from another 
disastrous termination.

Thank you once again.

  ----- Original Message ----- 
  From: Simon Svensson 
  To: Gerry Suggitt 
  Cc: [email protected] 
  Sent: Thursday, November 22, 2012 12:09 AM
  Subject: Re: This may be a bug


  Hi, 

  I'm using my mail client's reply-all functionality to answer both to the 
developer mailing list and to you personally, in case you're not part of the 
mailing list. You only need to respond to the mailing list. (I'm not sure which 
functionality you use, I got several copies of the answer below.) 

  For an explanation of the segments_N and segments.gen files, I'll refer you to 
http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/fileformats.html
 

    The active segments in the index are stored in the segment info file, 
segments_N. There may be one or more segments_N files in the index; however, the 
one with the largest generation is the active one (when older segments_N files 
are present it's because they temporarily cannot be deleted, or, a writer is in 
the process of committing, or a custom IndexDeletionPolicy is in use). [...] 

    As of 2.1, there is also a file segments.gen. This file contains the 
current generation (the _N in segments_N) of the index. This is used only as a 
fallback in case the current generation cannot be accurately determined by 
directory listing alone (as is the case for some NFS clients with time-based 
directory cache expiration). [...] 
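
  As a rough illustration of the "largest generation is the active one" rule 
quoted above, here is a minimal C# sketch. The class and method names are 
hypothetical (they are not part of Lucene.Net), and the sketch assumes the N 
suffix is the generation written in base 36, which is how Lucene names these 
files.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of Lucene.Net.
static class SegmentsFiles
{
    private const string Digits = "0123456789abcdefghijklmnopqrstuvwxyz";
    private const string Prefix = "segments_";

    // Returns the generation encoded in a name such as "segments_2",
    // or -1 for any other file name (for example "segments.gen").
    public static long ParseGeneration(string fileName)
    {
        if (!fileName.StartsWith(Prefix, StringComparison.Ordinal)
            || fileName.Length == Prefix.Length)
        {
            return -1;
        }

        long generation = 0;
        foreach (char c in fileName.Substring(Prefix.Length))
        {
            int digit = Digits.IndexOf(char.ToLowerInvariant(c));
            if (digit < 0) return -1;          // not a base-36 suffix
            generation = generation * 36 + digit;
        }
        return generation;
    }

    // Picks the segments_N file with the largest generation,
    // or null when the listing contains no segments_N file at all.
    public static string FindActive(IEnumerable<string> fileNames)
    {
        string best = null;
        long bestGeneration = -1;
        foreach (string name in fileNames)
        {
            long generation = ParseGeneration(name);
            if (generation > bestGeneration)
            {
                bestGeneration = generation;
                best = name;
            }
        }
        return best;
    }
}
```

  Given a directory listing of segments.gen, segments_1 and segments_2, 
FindActive would pick segments_2.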

  One of the critical issues is your _PeriodicDocFlusher. Could you share it 
with us? Does it call IndexWriter.Flush, as indicated by the name, or 
IndexWriter.Commit? Does it reopen your nrt-reader, using either 
IndexWriter.GetReader or IndexReader.Reopen? If so, does it also recreate a new 
IndexSearcher? 

  Everything so far sounds like you forgot to call IndexWriter.Commit. Flush 
will move memory buffers and what-not to disk, but will not commit them for 
others to read (only a reader from IndexWriter.GetReader will be able to read 
them). All this data will be removed as a cleanup procedure, a kind of 
rollback, when another writer is opened against the directory after an 
application restart. 
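
  To make the distinction concrete, here is a minimal sketch of a timer 
callback that both commits and refreshes the near-realtime reader. It assumes 
the Lucene.Net 2.9.x API (IndexWriter.Commit, IndexWriter.GetReader, 
IndexReader.Reopen, IndexReader.Close); the class name and its members are 
hypothetical and are not the _PeriodicDocFlusher from this thread.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

// Hypothetical sketch, assuming the Lucene.Net 2.9.x API.
public class CommittingRefresher
{
    private readonly object _sync = new object();
    private readonly IndexWriter _writer;
    private IndexReader _reader;
    private IndexSearcher _searcher;

    public CommittingRefresher(IndexWriter writer)
    {
        _writer = writer;
        _reader = writer.GetReader();            // near-realtime reader
        _searcher = new IndexSearcher(_reader);
    }

    // Intended to be called from the periodic timer.
    public void OnFlushTimer()
    {
        // Commit makes pending changes part of the index on disk, so a
        // writer opened after a restart will not discard them.
        _writer.Commit();

        // Reopen returns the same instance when nothing has changed.
        IndexReader newReader = _reader.Reopen();
        if (newReader == _reader) return;

        lock (_sync)
        {
            IndexReader oldReader = _reader;
            _reader = newReader;
            _searcher = new IndexSearcher(newReader);
            oldReader.Close();                   // release the old snapshot
        }
    }

    public TopDocs Search(Query query, int maxDocsToReturn)
    {
        lock (_sync)
        {
            return _searcher.Search(query, maxDocsToReturn);
        }
    }
}
```

  The reopen step is what makes new documents searchable; the commit step is 
what makes them survive an abrupt shutdown. The two are independent, so the 
reader could be refreshed more often than the index is committed.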

  // Simon 

  On 2012-11-21 21:29, Gerry Suggitt wrote: 



    Yes, I am using NRT. (Or I should say, I am trying to!) I commit within 10 
seconds after a document is added. (When a document arrives I start a timer to 
allow more documents to come in before committing.)

    And before I get into more details of the bug I reported, may I ask you a 
question about NRT?

    
    ---------- start of NRT question ----------

    According to the documentation (at least as it is described for Java), NRT 
should allow "updates to be efficiently searched hopefully within milliseconds 
after an update is complete". I have found that after an update, the document 
is not found until I have performed a commit.

    Here is the code that creates the reader, writer and searcher (_flusher is 
the 10-second commit timer):
        public void Start()
        {
            _logger.Info( ()=> "LuceneEngine.Start " + _pathname );
            System.IO.DirectoryInfo dir = new System.IO.DirectoryInfo( _pathname );
            // nolock is OK for now because we have a single thread accessing the directory
            _directory = FSDirectory.Open( dir, new NoLockFactory() );
            _analyzer = new PerFieldAnalyzerWrapper( new WhitespaceAnalyzer() );
            _writer = new IndexWriter( _directory, _analyzer, IndexWriter.MaxFieldLength.UNLIMITED );
            _logger.Info( ()=> "LuceneEngine.Start begin Optimize" );
            _writer.Optimize();
            _logger.Info( ()=> "LuceneEngine.Start Optimize finished" );
            _reader = _writer.GetReader();
            _searcher = new IndexSearcher( _reader );
            _flusher = new _PeriodicDocFlusher( 10000, _maxDocsInCache, _OnFlushTimer, _logger );
            _logger.Info( ()=> "LuceneEngine.Start " + _pathname + " - up and running" );
        }
    And here is the code that uses it (_DocAdded() just starts the commit timer 
if it is not already running):

        public void UpdateDocument( string id, Document doc )
        {
            _writer.UpdateDocument( new Term("id", id), doc, _analyzer );
            _DocAdded();
        }

        public TopDocs Search( Query query, int maxDocsToReturn )
        {
            lock( this )
            {    
                return _searcher.Search( query, maxDocsToReturn );
            }
        }
    To test it, after making a call to UpdateDocument with id = xxx, I looped 
making repeated calls to Search where the query was id:xxx. This continually 
returned 0 documents until the timer kicked in and performed the commit. And 
then Search returned 1 hit.

    So is this expected? I didn't think so, but maybe I just misinterpreted the 
documentation.

    
    ---------- end of NRT question ----------

    Back to the issue at hand ...

    I tried to reproduce the problem as I described and was unable to. So I am 
uncertain what was happening there. 

    But I have some more information about the lost databases on our test 
machines. 

    When I say the databases were completely empty, the directory actually held 
two files:
    segments.gen
    segments_1

    On one machine we restored the data from an external backup (actually a SQL 
database!) and everything worked fine from then on. We could see several files 
in the database directory.

    The other lab machine was untouched and here we discovered something that 
might be important.

    On the first machine, after restoring (which essentially performed a 
series of _writer.UpdateDocument calls) and then stopping and restarting the 
Lucene service, we noticed that the timestamp on segments.gen had changed and 
we now had a file segments_2. (I know, I know, you are going "well, duh", but 
hold on a sec.)

    We had not touched the second machine, and the timestamp on its 
segments.gen file was November 12, 12:31 PM.

    But the reboot of the machines occurred on November 17 at 3:30 PM.

    So why wasn't the timestamp updated? My guess: Because there were no index 
files in the directory!

    But ... I have logs that show 500 documents being added successfully to the 
database AFTER November 12, 12:31 PM. And these logs show commits being 
performed. 

    Furthermore, searches are returning documents.

    So it appears (and this is just my guess) that the commits were making the 
necessary updates to the in-memory data structures that allowed searches to 
work, but the data was never saved to the disk. No exception was thrown, as one 
might expect from a failure to write to the disk, so at this point I am 
baffled.

    Now why the data was not saved to the disk last week but is being saved 
this week is beyond me.

    I know we don't have much to work with. I will continue to see if I can 
reproduce the problem. If there is anything else you would like me to check, 
please ask.

    Thanks - Gerry




    ----- Original Message ----- 
      From: Simon Svensson 
      To: [email protected] 
      Cc: Gerry Suggitt 
      Sent: Wednesday, November 21, 2012 3:05 AM
      Subject: Re: This may be a bug


      Hi,

      This does indeed sound serious. Are you saying that you have a snapshot 
      (with committed documents) that is cleared when calling 
      IndexWriter.Optimize? Can you share it for reproduction purposes?

      Are you using near-realtime indexing? What you describe could happen if 
      you were using nrt, and never called IndexWriter.Commit. The index would 
      indeed be cleared next time a writer is opened against the directory, a 
      step in clearing out unused index files. A kind of rollback of 
      uncommitted changes.

      // Simon


      On 2012-11-20 16:45, Gerry Suggitt wrote:
      > Sorry to send this email directly to the developers, but I couldn't see 
any other way of entering a defect.
      >
      > My name is Gerry Suggitt and I work for Leafsprout Technologies, a 
company that creates products for the Medical Information sector.
      >
      > We have created a Master Patient Index using Lucene that works very 
well - we are able to perform fuzzy matching and all the nice things that you 
want in an MPI.
      >
      > But something terrible just happened. Fortunately this occurred in our 
own lab - we have not yet released the product to the field.
      >
      > Sometime over the weekend, the computers holding the Lucene database 
rebooted (probably from a Windows upgrade). All of the Lucene databases were 
blown away! Completely empty!
      >
      > Recently, I had noticed the same thing when I was doing some testing, 
so it may be related.
      >
      > We are currently using version 2.9.4.1
      >
      > What I was doing in my testing was taking a snapshot of the Lucene 
database files (just a copy to another directory). I would run some tests which 
would affect the database, so before continuing I would copy the snapshot back.
      >
      > When I started the Lucene service, the database was blown away! 
Completely empty!
      >
      > I was able to determine what was doing this. At startup, I was 
performing an optimize. This seems like a good time for me: At startup we know 
no client is making demands on the system. When I commented out the call to 
optimize, the database remained intact upon startup.
      >
      > The systems that lost their databases still had the call to optimize in 
them.
      >
      > Please help!
      >
