In my application I was queuing the IDs of the appropriate database records, not whole documents. The Document was created right before it was added to the index. All of this work was done in a separate thread, so the other threads responded very quickly.
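In Lucene 1.9/2.0-era Java, a minimal sketch of that pattern might look like this (the RecordLoader interface and its loadBody() method are my stand-ins for whatever DAO you have, and the queue assumes Java 5's java.util.concurrent):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class IndexingThread extends Thread {

        // Hypothetical DAO hook: fetch the record body for a given ID.
        public interface RecordLoader {
            String loadBody(Long id) throws Exception;
        }

        // Producers enqueue plain record IDs; no Lucene Documents are buffered.
        private final BlockingQueue<Long> queue = new LinkedBlockingQueue<Long>();
        private final IndexWriter writer;
        private final RecordLoader loader;

        public IndexingThread(IndexWriter writer, RecordLoader loader) {
            this.writer = writer;
            this.loader = loader;
        }

        // Called from the content-handler threads; returns almost immediately.
        public void enqueue(Long recordId) throws InterruptedException {
            queue.put(recordId);
        }

        public void run() {
            try {
                while (!isInterrupted()) {
                    Long id = queue.take();
                    // The Document is created only now, right before indexing.
                    Document doc = new Document();
                    doc.add(new Field("id", id.toString(),
                            Field.Store.YES, Field.Index.UN_TOKENIZED));
                    doc.add(new Field("body", loader.loadBody(id),
                            Field.Store.NO, Field.Index.TOKENIZED));
                    writer.addDocument(doc);
                }
            } catch (InterruptedException stop) {
                // normal shutdown path
            } catch (Exception e) {
                e.printStackTrace(); // real code would log and decide whether to retry
            }
        }
    }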

It depends on your application and on how fast your new data arrives.

If the application is killed or crashes, then you'll lose all documents that were buffered in memory (see the IndexWriter.minMergeDocs property).

Actually, I don't know how the JVM's shutdown hooks react to different kill signals. My thought is that the application should not be killed. :) Also, if you need to ensure that your data is indexed properly, then on startup you can do a special check for data that was already added to the database but wasn't added to the index, and enqueue it... But it also depends...
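A rough sketch of that startup check, assuming the IndexingThread above and a hypothetical findUnindexedIds() DAO query (e.g. against an "indexed" flag or a last-indexed timestamp column):

    // Startup reconciliation: anything that reached the database but not
    // the index before the last shutdown gets queued again.
    for (Long id : findUnindexedIds()) {   // hypothetical DAO query
        indexingThread.enqueue(id);
    }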

Rob Staveley (Tom) wrote:
This is a good idea. I had been worried about the additional heap requirement of maintaining a queue without being able to serialize/deserialize Documents (i.e. a build-up of Lucene Documents in RAM). I have been marshalling addDocument() calls using a synchronized object; the same threads have been taking responsibility for creating Documents (unsynchronized) and adding them to the index writer (synchronized). I guess I could have a one-Document queue feeding a single addDocument thread, which would effectively be the same approach, but it would make it easier to ensure that only the create-Document thread is killed when I get a SIGTERM, while the addDocument thread is left to run its course (assuming it hasn't hung!).
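For reference, the marshalling pattern described above is roughly this (a sketch, not the actual code; createDocument() is a stand-in for the unsynchronized build step):

    private final Object indexLock = new Object();

    // In each worker thread: build the Document without holding the lock...
    Document doc = createDocument(message);   // hypothetical; this is the step that may hang

    // ...then serialize only the addDocument() call itself.
    synchronized (indexLock) {
        writer.addDocument(doc);
    }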

Having said that, I'm not sure what I could do in a shutdown hook that wouldn't already have been done by a SIGTERM to get the hung thread to terminate. The reason for SIGKILL was that the daemon wouldn't be killed by SIGTERM. I guess I'd feel more confident about using SIGKILL if I knew that the uninterruptible hung thread was creating a Document, which I could interrupt without corrupting the index, rather than adding the document to the index, which is liable to result in orphaned files and/or a corrupted index if killed.
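One thing a hook can still usefully do on an orderly SIGTERM is close the IndexWriter, so buffered documents are flushed and the write lock is released; a minimal sketch (note a hook never runs on SIGKILL, which is the whole problem with kill -9):

    // Register once, right after opening the writer (which must be final
    // to be visible inside the anonymous class).
    Runtime.getRuntime().addShutdownHook(new Thread() {
        public void run() {
            try {
                writer.close(); // flush buffered docs, release the write lock
            } catch (IOException e) {
                // nothing sensible left to do during shutdown
            }
        }
    });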

-----Original Message-----
From: Volodymyr Bychkoviak [mailto:[EMAIL PROTECTED]
Sent: 06 June 2006 10:54
To: java-user@lucene.apache.org
Subject: Re: Compound / non-compound index files and SIGKILL

If your content handlers need to respond quickly, then you should move the indexing process to a separate thread and maintain the items in a queue.

Rob Staveley (Tom) wrote:
This is a real eye-opener, Volodymyr. Many thanks. I guess that means that my orphan-producing hangs must be in addDocument() calls, and not in the content handlers, as I'd previously assumed. I'll put some debug output before and after my addDocument() calls to confirm (and point my writer's infoStream at System.out).
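Assuming a Lucene release with the setter (1.9+; older versions expose a public infoStream field instead), that looks like this (the "id" field is illustrative):

    // Route the writer's merge/flush diagnostics to stdout.
    writer.setInfoStream(System.out);

    // Bracket each call so a hang can be pinned to addDocument() itself.
    System.out.println("before addDocument: " + doc.get("id"));
    writer.addDocument(doc);
    System.out.println("after addDocument:  " + doc.get("id"));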

-----Original Message-----
From: Volodymyr Bychkoviak [mailto:[EMAIL PROTECTED]
Sent: 05 June 2006 18:33
To: java-user@lucene.apache.org
Subject: Re: Compound / non-compound index files and SIGKILL

Hi.
My five cents :)

It might be helpful to know how Lucene works with compound files. When a segment is flushed to disk, it is written in non-compound form and after that merged into a single .cfs file. If you don't change the default setting for using compound files (which is on), this is the only place (I guess) for these files to appear.
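The setting in question is per-writer; something like the following, assuming the 1.4/2.0-era setUseCompoundFile() method on IndexWriter:

    IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    // On by default: each freshly flushed multi-file segment is packed
    // into a single .cfs shortly after it is written.
    writer.setUseCompoundFile(true);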

If you're working with large indexes, then merging segments can take a while (maybe that is where your problem lies? :) ) (merging happens on the addDocument() call). If you kill the indexing process during such a merge, you'll get many orphaned files...

You can just run optimize() on this index. You'll get three files: segments, deletable, and <segment>.cfs; you can look up the segment's name in the 'segments' file. Everything else is 'garbage' - you can delete it.
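In code, that cleanup amounts to the following (old-style optimize() API; create == false opens the existing index):

    IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    writer.optimize(); // merge everything referenced by 'segments' into one .cfs
    writer.close();
    // Any remaining file not named in 'segments' (other than 'deletable')
    // is an orphan and can be deleted by hand, as described above.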


Rob Staveley (Tom) wrote:
I've been indexing live data into a compound index from an MTA. I'm resolving a bunch of problems unrelated to Lucene (disparate hangs in my content handlers). When I get a hang, I typically need to kill my daemon, alas more often than not using kill -9 (SIGKILL).

However, these SIGKILLs are leaving large temporary(?) files, which I guess are non-compound index files transiently extracted from the working .cfs files:

-rw-r--r--    1  373138432 Jun  2 13:42 _18hup.fdt
-rw-r--r--    1    5054464 Jun  2 13:42 _18hup.fdx
-rw-r--r--    1        426 Jun  2 13:42 _18hup.fnm

-rw-r--r--    1  457253888 Jun  2 09:22 _15djq.fdt
-rw-r--r--    1    6205440 Jun  2 09:22 _15djq.fdx
-rw-r--r--    1        426 Jun  2 09:21 _15djq.fnm

They are left intact after restarting my daemon. Presumably they are not treated as being part of the compound index. I see no corresponding .cfs file for them.

As a consequence of these - I suspect - I am getting a very large overall disk requirement for my index, presumably because of replicated field data. My guess is that the field data in the orphaned .fdt files needs to be regenerated.

In another index directory from a previous test run (again with SIGKILLs), I have 98 GB of index files, with only 12 GB devoted to compound files for the field index (.cfs). The rest of the disk space is used by orphaned uncompounded index files; I see 51 GB devoted to uncompounded field data (.fdt), 13 GB devoted to term positions (.prx) and 13 GB devoted to term frequencies (.frq).

Here's my question:

How can I attempt to merge these orphaned files into the compound index, using IndexWriter.addIndexes(), or would I be foolish to attempt this?

--
regards,
Volodymyr Bychkoviak


--
regards,
Volodymyr Bychkoviak
