This is a good idea. I had been worried about the additional heap requirement of maintaining a queue without being able to serialize/deserialize Documents (i.e. a build-up of Lucene Documents in RAM). I have been marshalling addDocument() calls using a synchronized object; the same threads have been taking responsibility for creating Documents (unsynchronized) and adding them to the index writer (synchronized). I guess I could have a one-Document queue feeding a single addDocument() thread, which would effectively be the same approach, but which would make it easier to ensure that only the create-Document threads are killed when I get a SIGTERM, while the addDocument() thread is left to run its course (assuming it hasn't hung!).
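Something like this is the shape I have in mind - just a sketch against Java 5's java.util.concurrent, where AddDocumentThread, enqueue() and drainAndStop() are my own invented names, not anything from Lucene:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// The single consumer: the only thread that ever touches the IndexWriter.
class AddDocumentThread extends Thread {
    private final BlockingQueue<Document> queue = new ArrayBlockingQueue<Document>(1);
    private final IndexWriter writer;
    private volatile boolean draining = false; // set on SIGTERM

    AddDocumentThread(IndexWriter writer) {
        this.writer = writer;
    }

    // Producers (the content-handler threads) block here while the single
    // slot is occupied, so Documents never build up in RAM.
    void enqueue(Document doc) throws InterruptedException {
        queue.put(doc);
    }

    // On SIGTERM: interrupt only the producers, then call this; the
    // consumer drains the slot and closes the writer cleanly.
    void drainAndStop() {
        draining = true;
    }

    public void run() {
        try {
            while (!draining || !queue.isEmpty()) {
                Document doc = queue.poll(1, TimeUnit.SECONDS);
                if (doc != null) {
                    writer.addDocument(doc); // never interrupted mid-call
                }
            }
            writer.close(); // flush the final segment
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The producers block in enqueue() while the slot is full, and only addDocument()/close() ever touch the writer.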
Having said that, I'm not sure what I could do in a shutdown hook that wouldn't already have been done by a SIGTERM to get the hung thread to terminate. The reason for SIGKILL was that the daemon wouldn't be killed by SIGTERM. I'd feel more confident about using SIGKILL if I knew that the uninterruptible hung thread was creating a Document, which I could interrupt without corrupting the index, rather than adding the document to the index, which is liable to result in orphaned files and/or a corrupted index if killed.

-----Original Message-----
From: Volodymyr Bychkoviak [mailto:[EMAIL PROTECTED]
Sent: 06 June 2006 10:54
To: java-user@lucene.apache.org
Subject: Re: Compound / non-compound index files and SIGKILL

If your content handlers should respond quickly, then you should move the indexing process to a separate thread and maintain items in a queue.

Rob Staveley (Tom) wrote:
> This is a real eye-opener, Volodymyr. Many thanks. I guess that means
> that my orphan-producing hangs must be addDocument() calls, and not in
> the content handlers, as I'd previously assumed. I'll put some debug
> before and after my addDocument() calls to confirm (and point my
> writer's infoStream to System.out).
>
> -----Original Message-----
> From: Volodymyr Bychkoviak [mailto:[EMAIL PROTECTED]
> Sent: 05 June 2006 18:33
> To: java-user@lucene.apache.org
> Subject: Re: Compound / non-compound index files and SIGKILL
>
> Hi.
> My five cents :)
>
> It might be helpful to know how Lucene works with compound files.
> When a segment is flushed to disk it is written non-compound and after
> that is merged into a single .cfs file. If you don't change the default
> setting for using compound files (which is on), this is the only place
> (I guess) for these files to appear.
>
> If you're working with large indexes, then merging segments can take a
> while (maybe here is your problem? :) ) (merging happens on an
> addDocument() call). If you kill the indexing process during such a
> merge you'll get many orphaned files...
>
> You can just run optimize on this index. You'll get three files:
> segments, deletable, <segment>.cfs; you can look up the name of the
> segment in the 'segments' file. Everything else is 'garbage' - you can
> delete it.
>
>
> Rob Staveley (Tom) wrote:
>
>> I've been indexing live data into a compound index from an MTA. I'm
>> resolving a bunch of problems unrelated to Lucene (disparate hangs in
>> my content handlers). When I get a hang, I typically need to kill my
>> daemon, alas more often than not using kill -9 (SIGKILL).
>>
>> However, these SIGKILLs are leaving large temporary(?) files, which I
>> guess are non-compound index files transiently extracted from the
>> working .cfs files:
>>
>> -rw-r--r-- 1 373138432 Jun  2 13:42 _18hup.fdt
>> -rw-r--r-- 1   5054464 Jun  2 13:42 _18hup.fdx
>> -rw-r--r-- 1       426 Jun  2 13:42 _18hup.fnm
>>
>> -rw-r--r-- 1 457253888 Jun  2 09:22 _15djq.fdt
>> -rw-r--r-- 1   6205440 Jun  2 09:22 _15djq.fdx
>> -rw-r--r-- 1       426 Jun  2 09:21 _15djq.fnm
>>
>> They are left intact after restarting my daemon. Presumably they are
>> not treated as being part of the compound index. I see no
>> corresponding .cfs file for them.
>>
>> As a consequence of these - I suspect - I am getting a very large
>> overall disk requirement for my index, presumably because of
>> replicated field data.
>> My guess is that the field data in the orphaned .fdt files needs to
>> be regenerated.
>>
>> In another index directory from a previous test run (again with
>> SIGKILLs), I have 98 GB of index files, with only 12 GB devoted to
>> compound files for the field index (.cfs). The rest of the disk space
>> is used by orphaned non-compound index files; I see 51 GB devoted to
>> non-compound field data (.fdt), 13 GB devoted to term positions (.prx)
>> and 13 GB devoted to term frequencies (.frq).
>>
>> Here's my question:
>>
>> How can I attempt to merge these orphaned files into the compound
>> index using IndexWriter.addIndexes(), or would I be foolish attempting
>> this?
>>
>>

--
regards,
Volodymyr Bychkoviak
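PS: For what it's worth, here is how I read Volodymyr's optimize-then-delete suggestion in code - a rough sketch against the current IndexWriter API (CleanupIndex is my own throwaway name, and I'd want to eyeball the list before deleting anything). On my own closing question: addIndexes() presumably reads the other directory's 'segments' file, so orphaned files that no segments file references wouldn't be visible to it anyway.

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class CleanupIndex {
    public static void main(String[] args) throws Exception {
        String path = args[0]; // the index directory

        // Open the existing index (create == false) and merge all live
        // segments down to a single compound file.
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), false);
        writer.optimize();
        writer.close();

        // The live index should now be just 'segments', 'deletable' and
        // one <segment>.cfs (the one named in 'segments'); everything
        // else is orphaned. Keeping all .cfs files is deliberately
        // over-cautious - an orphaned .cfs would survive this filter.
        String[] names = new File(path).list();
        for (int i = 0; i < names.length; i++) {
            String n = names[i];
            if (n.equals("segments") || n.equals("deletable") || n.endsWith(".cfs")) {
                continue;
            }
            System.out.println("orphan: " + n); // review, then delete by hand
        }
    }
}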