Hi,

here is a concrete example. 

I am starting with an index that has 19017236 docs, which takes 58989 Mb 
on disk:

21.07.2011  15:21                20 segments.gen
21.07.2011  15:21             2'974 segments_2acy4
21.07.2011  13:58                 0 write.lock
16.07.2011  02:21    33'445'798'886 _52aho.fdt
16.07.2011  02:21       178'723'932 _52aho.fdx
16.07.2011  01:58             5'002 _52aho.fnm
16.07.2011  03:10     9'857'410'889 _52aho.frq
16.07.2011  03:10     4'538'234'846 _52aho.prx
16.07.2011  03:10        61'581'767 _52aho.tii
16.07.2011  03:10     5'505'039'790 _52aho.tis
21.07.2011  01:01         1'899'536 _52aho_5.del
21.07.2011  01:05     4'222'206'034 _6t61z.fdt
21.07.2011  01:05        21'424'556 _6t61z.fdx
21.07.2011  01:01             5'002 _6t61z.fnm
21.07.2011  01:12     1'170'370'187 _6t61z.frq
21.07.2011  01:12       598'373'388 _6t61z.prx
21.07.2011  01:12         7'574'912 _6t61z.tii
21.07.2011  01:12       678'766'206 _6t61z.tis
21.07.2011  13:46     1'458'592'058 _7d6me.cfs
21.07.2011  13:48        15'702'654 _7dhgz.cfs
21.07.2011  13:52        16'800'942 _7dphm.cfs
21.07.2011  13:55        16'714'431 _7dxht.cfs
21.07.2011  14:24        17'505'435 _7e0wz.cfs
21.07.2011  14:24         5'875'852 _7e0xu.cfs
21.07.2011  14:48        18'340'470 _7e1x5.cfs
21.07.2011  15:19        16'978'564 _7e3ck.cfs
21.07.2011  15:21         1'208'656 _7e3hv.cfs
21.07.2011  15:21            19'361 _7e3hw.cfs
              28 File(s) 61'855'156'350 bytes

I am doing a delete of some of the older documents. after the delete, I 
commit then I optimize down to 2 segments. at the end of the optimize the 
index contains 18702510 docs (314727 were deleted) and it takes now 58975 
Mb on disk:

21.07.2011  15:37                20 segments.gen
21.07.2011  15:37               524 segments_2acy6
21.07.2011  13:58                 0 write.lock
16.07.2011  02:21    33'445'798'886 _52aho.fdt
16.07.2011  02:21       178'723'932 _52aho.fdx
16.07.2011  01:58             5'002 _52aho.fnm
16.07.2011  03:10     9'857'410'889 _52aho.frq
16.07.2011  03:10     4'538'234'846 _52aho.prx
16.07.2011  03:10        61'581'767 _52aho.tii
16.07.2011  03:10     5'505'039'790 _52aho.tis
21.07.2011  15:23         1'999'945 _52aho_6.del
21.07.2011  15:31     5'194'848'138 _7e3hy.fdt
21.07.2011  15:31        28'613'668 _7e3hy.fdx
21.07.2011  15:25             5'002 _7e3hy.fnm
21.07.2011  15:37     1'529'771'296 _7e3hy.frq
21.07.2011  15:37       726'582'244 _7e3hy.prx
21.07.2011  15:37         8'518'198 _7e3hy.tii
21.07.2011  15:37       763'213'144 _7e3hy.tis
              18 File(s) 61'840'347'291 bytes

as you can see, size on disk did not really change. at this point I 
optimize down to 1 segment and at the end the index takes 48273 Mb on 
disk:

21.07.2011  16:46                20 segments.gen
21.07.2011  16:46               278 segments_2acy8
21.07.2011  13:58                 0 write.lock
21.07.2011  16:06    32'901'423'750 _7e3hz.fdt
21.07.2011  16:06       149'582'052 _7e3hz.fdx
21.07.2011  15:42             5'002 _7e3hz.fnm
21.07.2011  16:46     8'608'541'177 _7e3hz.frq
21.07.2011  16:46     4'392'616'115 _7e3hz.prx
21.07.2011  16:46        50'571'856 _7e3hz.tii
21.07.2011  16:46     4'515'914'658 _7e3hz.tis
              10 File(s) 50'618'654'908 bytes


this means that with the 1 segment optimize I was able to reclaim 10 Gb on 
disk that the 2 segments optimize could not achieve.

how can this be explained? is that a normal behavior?

thanks,

vince












Simon Willnauer <simon.willna...@googlemail.com> 
 
 
20.07.2011 23:11
Please respond to
java-user@lucene.apache.org



To
java-user@lucene.apache.org
cc

Subject
Re: optimize with num segments > 1 index keeps growing






On Wed, Jul 20, 2011 at 2:00 PM,  <v.se...@lombardodier.com> wrote:
> Hi,
>
> I index several millions small documents per day. each day, I remove 
some
> of the older documents to keep the index at a stable number of 
documents.
> after each purge, I commit then I optimize the index. what I found is 
that
> if I keep optimizing with max num segments = 2, then the index keeps
> growing on the disk. but as soon as I optimize with just 1 segment, the
> space gets reclaimed on the disk. so, I have currently adopted the
> following strategy : every night I optimize with 2 segments, except once
> per week where I optimize with just 1 segment.

what do you mean by keeps growing. you have n segments and you
optimize down to 2 and the index is bigger than the one with n
segments?

simon
>
> is that an expected behavior?
> I guess I am doing something special because I was not able to reproduce
> this behavior in a unit test. what could it be?
>
> it would be nice to get some explanatory services within the product to
> help get some understanding on its behavior. something that tells you 
some
> information about your index for instance (number of docs in the 
different
> states, how the space is being used, ...). lucene is a wonderful 
product,
> but to me this is almost like black magic, and when there is a specific
> behavior, I have got little clues to figure out something by myself. 
some
> user oriented logging would be nice as well (the index writer info 
stream
> is really verbose and very low level).
>
> thanks for your help,
>
>
> Vince
>
> ************************ DISCLAIMER ************************
> This message is intended only for use by the person to
> whom it is addressed. It may contain information that is
> privileged and confidential. Its content does not
> constitute a formal commitment by Lombard Odier
> Darier Hentsch & Cie or any of its branches or affiliates.
> If you are not the intended recipient of this message,
> kindly notify the sender immediately and destroy this
> message. Thank You.
> *****************************************************************
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




************************ DISCLAIMER ************************
This message is intended only for use by the person to
whom it is addressed. It may contain information that is
privileged and confidential. Its content does not
constitute a formal commitment by Lombard Odier
Darier Hentsch & Cie or any of its branches or affiliates.
If you are not the intended recipient of this message,
kindly notify the sender immediately and destroy this
message. Thank You.
*****************************************************************

Reply via email to