This is a real eye-opener, Volodymyr. Many thanks. I guess that means that
my orphan-producing hangs must be addDocument() calls, and not in the
content handlers, as I'd previously assumed. I'll put some debug before and
after my addDocument() calls to confirm (and point my writer's infoStream to
System.out).

-----Original Message-----
From: Volodymyr Bychkoviak [mailto:[EMAIL PROTECTED] 
Sent: 05 June 2006 18:33
To: java-user@lucene.apache.org
Subject: Re: Compound / non-compound index files and SIGKILL

Hi.
My five cents :)

It might be helpful to know how lucene is working with compound files. 
When segment is flushed to disk it is written uncompound and after that is
merged into single .cfs file. If you don't change default setting for using
compound files (which is on) this is only place (I guess) for these files to
appear.

If you're working with large indexes, than merging segments can take a while
(Maybe here is your problem? :) ) (merging happens on
addDocument() call).  If you will kill indexing process during such merge
you'll get many orphaned files...

You can just run optimize on this index. You'll get three files: 
segments, deletable, <segment>.cfs; you can look name of segment in
'segments' file. Everything else is 'garbage' - you can delete it.


Rob Staveley (Tom) wrote:
> I've been indexing live data into a compound index from an MTA. I'm
> resolving a bunch of problems unrelated to Lucene (disparate hangs in my
> content handlers). When I get a hang, I typically need to kill my daemon,
> alas more often than not using kill -9 (SIGKILL).
>
> However, these SIGKILLs are leaving large temporary(?) files, which I
guess
> are non-compound index files transiently extracted from the working .cfs
> files:
>
> -rw-r--r--    1  373138432 Jun  2 13:42 _18hup.fdt
> -rw-r--r--    1      5054464 Jun  2 13:42 _18hup.fdx
> -rw-r--r--    1              426 Jun  2 13:42 _18hup.fnm
>
> -rw-r--r--    1  457253888 Jun  2 09:22 _15djq.fdt
> -rw-r--r--    1      6205440 Jun  2 09:22 _15djq.fdx
> -rw-r--r--    1              426 Jun  2 09:21 _15djq.fnm
>
> They are left intact after restarting my daemon. Presumably they are not
> treated as being part of the compound index. I see no corresponding .cfs
> file for them. 
>
> As a consequence of these - I suspect - I am getting a very large overall
> disk requirement for my index, presumably because of replicated field
data.
> My guess is that the field data in the orphaned .fdt files needs to be
> regenerated.
>
> In another index directory from a previous test run (again with SIGKILLs),
I
> have 98 GB of index files, with only 12 BG devoted to compound files for
the
> field index (.cfs). The rest of the disk space is used by orphaned
> uncompounded index files; I see 51 GB devoted to uncompounded field data
> (.fdt), 13 BG devoted to term positions (.prx) and 13 BG devoted to term
> frequencies (.frq).
>
> Here's my question:
>
> How can I attempt to merge these orphaned into the compound index, using
> IndexWriter.addIndexes(), or would I be foolish attempting this?
>   

-- 
regards,
Volodymyr Bychkoviak

Attachment: smime.p7s
Description: S/MIME cryptographic signature

Reply via email to