So the problem here is that you have one really big segment, _52aho.*, and several smaller ones: _7e0wz.*, _7e0xu.*, _7e1x5.*, and so on. If you optimize to 2 segments, all the smaller segments are merged into one, but the large segment remains untouched. This means that the deleted documents in the large segment are not removed and their space is not freed, whereas if you optimize to one segment they are removed. In the single-segment index there is no *.del file present, meaning no deletes. Unless you merge the large segment, the documents you deleted from it are only marked as deleted but not yet removed.
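
To make that concrete, here is a toy model in plain Java. It is not Lucene code and does not use any Lucene API: the class OptimizeModel, its Segment type, the merge rule and all document counts are hypothetical, made up for illustration. It only assumes what is described above: the largest segments are left alone, the rest are merged into one new segment, and only a merge drops documents that are marked as deleted.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the behaviour described above. NOT Lucene code. */
final class OptimizeModel {

    /** A segment: live docs plus docs that are only marked as deleted. */
    static final class Segment {
        final String name;
        final long live;
        final long markedDeleted;

        Segment(String name, long live, long markedDeleted) {
            this.name = name;
            this.live = live;
            this.markedDeleted = markedDeleted;
        }

        long size() {
            return live + markedDeleted;
        }
    }

    /**
     * Assumed rule: keep the (maxSegments - 1) largest segments untouched and
     * merge all the others into one new segment. Only the merge drops deletes.
     */
    static List<Segment> optimize(List<Segment> segments, int maxSegments) {
        if (maxSegments < 1) {
            throw new IllegalArgumentException("maxSegments must be >= 1");
        }
        List<Segment> sorted = new ArrayList<>(segments);
        sorted.sort((a, b) -> Long.compare(b.size(), a.size())); // largest first
        if (sorted.size() <= maxSegments) {
            return sorted; // nothing to merge in this toy model
        }
        int untouched = maxSegments - 1;
        List<Segment> result = new ArrayList<>(sorted.subList(0, untouched));
        long mergedLive = 0;
        for (Segment s : sorted.subList(untouched, sorted.size())) {
            mergedLive += s.live; // marked-deleted docs are not copied over
        }
        result.add(new Segment("merged", mergedLive, 0));
        return result;
    }

    /** Docs that still occupy disk space although they are marked deleted. */
    static long totalMarkedDeleted(List<Segment> segments) {
        long total = 0;
        for (Segment s : segments) {
            total += s.markedDeleted;
        }
        return total;
    }

    /** Hypothetical index: one big segment with deletes, three small ones. */
    static List<Segment> sampleIndex() {
        List<Segment> index = new ArrayList<>();
        index.add(new Segment("big", 1000, 200));
        index.add(new Segment("small1", 50, 5));
        index.add(new Segment("small2", 30, 0));
        index.add(new Segment("small3", 20, 10));
        return index;
    }

    public static void main(String[] args) {
        for (int max = 2; max >= 1; max--) {
            List<Segment> after = optimize(sampleIndex(), max);
            System.out.println("optimize(" + max + "): " + after.size()
                    + " segment(s), " + totalMarkedDeleted(after)
                    + " docs still only marked as deleted");
        }
    }
}
```

In this model, optimizing to 2 segments leaves the 200 deleted documents of the big segment in place, while optimizing to 1 segment removes them, which matches the disk usage you are seeing.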

simon

On Thu, Jul 21, 2011 at 5:50 PM, <v.se...@lombardodier.com> wrote:
> hi,
> closing after the 2 segments optimize does not change it.
> also I am running with lucene 3.1.0.
> cheers,
> vince
>
> Ian Lea <ian....@gmail.com>
> 21.07.2011 17:30
> Please respond to: java-user@lucene.apache.org
> To: java-user@lucene.apache.org
> cc:
> Subject: Re: optimize with num segments > 1 index keeps growing
>
> A write.lock file with timestamp of 13:58 is in all the listings. The
> first thing I'd try is to add some IndexWriter.close() calls.
>
> --
> Ian.
>
> On Thu, Jul 21, 2011 at 4:05 PM, <v.se...@lombardodier.com> wrote:
>> Hi,
>>
>> here is a concrete example.
>>
>> I am starting with an index that has 19017236 docs, which takes 58989 Mb
>> on disk:
>>
>> 21.07.2011 15:21              20  segments.gen
>> 21.07.2011 15:21           2'974  segments_2acy4
>> 21.07.2011 13:58               0  write.lock
>> 16.07.2011 02:21  33'445'798'886  _52aho.fdt
>> 16.07.2011 02:21     178'723'932  _52aho.fdx
>> 16.07.2011 01:58           5'002  _52aho.fnm
>> 16.07.2011 03:10   9'857'410'889  _52aho.frq
>> 16.07.2011 03:10   4'538'234'846  _52aho.prx
>> 16.07.2011 03:10      61'581'767  _52aho.tii
>> 16.07.2011 03:10   5'505'039'790  _52aho.tis
>> 21.07.2011 01:01       1'899'536  _52aho_5.del
>> 21.07.2011 01:05   4'222'206'034  _6t61z.fdt
>> 21.07.2011 01:05      21'424'556  _6t61z.fdx
>> 21.07.2011 01:01           5'002  _6t61z.fnm
>> 21.07.2011 01:12   1'170'370'187  _6t61z.frq
>> 21.07.2011 01:12     598'373'388  _6t61z.prx
>> 21.07.2011 01:12       7'574'912  _6t61z.tii
>> 21.07.2011 01:12     678'766'206  _6t61z.tis
>> 21.07.2011 13:46   1'458'592'058  _7d6me.cfs
>> 21.07.2011 13:48      15'702'654  _7dhgz.cfs
>> 21.07.2011 13:52      16'800'942  _7dphm.cfs
>> 21.07.2011 13:55      16'714'431  _7dxht.cfs
>> 21.07.2011 14:24      17'505'435  _7e0wz.cfs
>> 21.07.2011 14:24       5'875'852  _7e0xu.cfs
>> 21.07.2011 14:48      18'340'470  _7e1x5.cfs
>> 21.07.2011 15:19      16'978'564  _7e3ck.cfs
>> 21.07.2011 15:21       1'208'656  _7e3hv.cfs
>> 21.07.2011 15:21          19'361  _7e3hw.cfs
>>       28 File(s)  61'855'156'350 bytes
>>
>> I am doing a delete of some of the older documents. after the delete, I
>> commit then I optimize down to 2 segments. at the end of the optimize the
>> index contains 18702510 docs (314727 were deleted) and it takes now 58975
>> Mb on disk:
>>
>> 21.07.2011 15:37              20  segments.gen
>> 21.07.2011 15:37             524  segments_2acy6
>> 21.07.2011 13:58               0  write.lock
>> 16.07.2011 02:21  33'445'798'886  _52aho.fdt
>> 16.07.2011 02:21     178'723'932  _52aho.fdx
>> 16.07.2011 01:58           5'002  _52aho.fnm
>> 16.07.2011 03:10   9'857'410'889  _52aho.frq
>> 16.07.2011 03:10   4'538'234'846  _52aho.prx
>> 16.07.2011 03:10      61'581'767  _52aho.tii
>> 16.07.2011 03:10   5'505'039'790  _52aho.tis
>> 21.07.2011 15:23       1'999'945  _52aho_6.del
>> 21.07.2011 15:31   5'194'848'138  _7e3hy.fdt
>> 21.07.2011 15:31      28'613'668  _7e3hy.fdx
>> 21.07.2011 15:25           5'002  _7e3hy.fnm
>> 21.07.2011 15:37   1'529'771'296  _7e3hy.frq
>> 21.07.2011 15:37     726'582'244  _7e3hy.prx
>> 21.07.2011 15:37       8'518'198  _7e3hy.tii
>> 21.07.2011 15:37     763'213'144  _7e3hy.tis
>>       18 File(s)  61'840'347'291 bytes
>>
>> as you can see, size on disk did not really change. at this point I
>> optimize down to 1 segment and at the end the index takes 48273 Mb on
>> disk:
>>
>> 21.07.2011 16:46              20  segments.gen
>> 21.07.2011 16:46             278  segments_2acy8
>> 21.07.2011 13:58               0  write.lock
>> 21.07.2011 16:06  32'901'423'750  _7e3hz.fdt
>> 21.07.2011 16:06     149'582'052  _7e3hz.fdx
>> 21.07.2011 15:42           5'002  _7e3hz.fnm
>> 21.07.2011 16:46   8'608'541'177  _7e3hz.frq
>> 21.07.2011 16:46   4'392'616'115  _7e3hz.prx
>> 21.07.2011 16:46      50'571'856  _7e3hz.tii
>> 21.07.2011 16:46   4'515'914'658  _7e3hz.tis
>>       10 File(s)  50'618'654'908 bytes
>>
>> this means that with the 1 segment optimize I was able to reclaim 10 Gb on
>> disk that the 2 segments optimize could not achieve.
>>
>> how can this be explained? is that a normal behavior?
>>
>> thanks,
>>
>> vince
>>
>> Simon Willnauer <simon.willna...@googlemail.com>
>> 20.07.2011 23:11
>> Please respond to: java-user@lucene.apache.org
>> To: java-user@lucene.apache.org
>> cc:
>> Subject: Re: optimize with num segments > 1 index keeps growing
>>
>> On Wed, Jul 20, 2011 at 2:00 PM, <v.se...@lombardodier.com> wrote:
>>> Hi,
>>>
>>> I index several millions small documents per day. each day, I remove some
>>> of the older documents to keep the index at a stable number of documents.
>>> after each purge, I commit then I optimize the index. what I found is that
>>> if I keep optimizing with max num segments = 2, then the index keeps
>>> growing on the disk. but as soon as I optimize with just 1 segment, the
>>> space gets reclaimed on the disk. so, I have currently adopted the
>>> following strategy : every night I optimize with 2 segments, except once
>>> per week where I optimize with just 1 segment.
>>
>> what do you mean by keeps growing. you have n segments and you
>> optimize down to 2 and the index is bigger than the one with n
>> segments?
>>
>> simon
>>>
>>> is that an expected behavior?
>>> I guess I am doing something special because I was not able to reproduce
>>> this behavior in a unit test. what could it be?
>>>
>>> it would be nice to get some explanatory services within the product to
>>> help get some understanding on its behavior. something that tells you some
>>> information about your index for instance (number of docs in the different
>>> states, how the space is being used, ...). lucene is a wonderful product,
>>> but to me this is almost like black magic, and when there is a specific
>>> behavior, I have got little clues to figure out something by myself. some
>>> user oriented logging would be nice as well (the index writer info stream
>>> is really verbose and very low level).
>>>
>>> thanks for your help,
>>>
>>> Vince
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org