Here’s a neat visualization: 
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

The short form is this: 

- A “segment” is all the files with a particular prefix in your index 
directory, e.g.  _12ey1* is one segment
- Segments are created as documents are indexed and commits occur.
- Periodically, segments are “merged”: some number of segments are 
combined into a single new segment, and then the old segments are deleted.
- During the merge, both the old and new segments occupy index space.
- Deleted documents continue to occupy disk space until the segment containing 
them is merged. NOTE: updating a document deletes the old version and 
adds a new one, so the old version is a “deleted” document for this discussion.
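
To make the “segment = shared file-name prefix” idea concrete, here is a minimal sketch that totals file sizes per segment. It assumes a flat index directory whose segment files start with an underscore followed by a lowercase alphanumeric name (e.g. _12ey1.fdt); the function names are made up for illustration and are not part of Lucene or Jackrabbit.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: segment files look like "_12ey1.fdt" or "_12ey1_1.del".
# Files such as "segments_2" do not start with "_" and are skipped.
SEGMENT_PREFIX = re.compile(r"^(_[0-9a-z]+)")


def group_by_segment(files):
    """Sum sizes per segment prefix for (file_name, size_in_bytes) pairs."""
    totals = defaultdict(int)
    for name, size in files:
        match = SEGMENT_PREFIX.match(name)
        if match:
            totals[match.group(1)] += size
    return dict(totals)


def segment_sizes(index_dir):
    """Per-segment byte totals for the regular files directly in index_dir."""
    entries = (
        (p.name, p.stat().st_size)
        for p in Path(index_dir).iterdir()
        if p.is_file()
    )
    return group_by_segment(entries)
```

Calling segment_sizes() on the index directory a few times over a day should show small segments appearing and then disappearing as they are merged into larger ones.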

So it’s quite common for deletes to accumulate until they are merged away. You 
have two sources of fluctuation:
1> deleted docs
2> the merging process.

And in your case, I see one segment around 25G. That indicates your index has 
been optimized at some point, and also I’d guess you’re on Lucene prior to 
release 7.5, so whenever you optimize again, _all_ segments will be merged 
into a single new segment, meaning your index will _at least_ double in size 
temporarily.
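
For disk planning, the “double in size” point can be turned into rough arithmetic. The sketch below assumes sizes written the way du -h prints them (K/M/G suffixes) and assumes no deleted documents are reclaimed, so the merged segment is as large as all the old ones combined. The helper names are illustrative only, and since the doubling above is “at least”, treat the result as a planning floor rather than a guarantee.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text):
    """Convert a du -h style size such as '2.5G' or '40K' to bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


def optimize_peak_bytes(segment_sizes):
    """Rough bytes on disk while every segment is rewritten into one.

    Assumes nothing is reclaimed from deleted documents, so the old
    segments and an equally large new segment exist at the same time.
    """
    current = sum(parse_size(s) for s in segment_sizes)
    return 2 * current
```

For the roughly 37G index in this thread, that puts the temporary peak at about 74G or more.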

Now, how this happens, you’d have to ask the jackrabbit folks since I don’t 
know that app either. 

For the gory details on optimize, see: 
https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
Even though that’s labeled Solr, it’s really about Lucene, and it applies 
to anything that uses Lucene with the Tiered Merge Policy (which has been the 
default for some time). Whether jackrabbit does anything with this, though, I 
don’t have a clue.

Best,
Erick


> On Nov 4, 2019, at 11:19 AM, Raffaele Gambelli <r.gambe...@westpole.it> wrote:
> 
> As far as you know, is this behaviour, which you called "typical", described 
> in depth somewhere?
> 
> It is fundamental for me to understand it better, including how big an 
> index can grow, so that I can allocate the right amount of disk space.
> 
> Thank you very much
> 
> -----Original Message-----
> From: Raffaele Gambelli <r.gambe...@westpole.it> 
> Sent: Monday, November 4, 2019 15:16
> To: java-user@lucene.apache.org
> Subject: RE: Lucene index directory grows and shrinks
> 
> Thanks for your quick reply. I'm quite a beginner in Lucene concepts, and 
> Jackrabbit hides almost everything about the way it uses Lucene internally.
> 
> Anyway, here is the size of each sub-directory in my index. Please note the 
> biggest one, 25G: is that normal?
> 
> ...repository/workspaces/default/index$ du -h .
> 2.5G    ./_12ey1
> 14M     ./_1dr9s
> 20M     ./_1dr8d
> 2.8G    ./_1b9pj
> 5.8M    ./_1drqc
> 19M     ./_1dr4q
> 2.5G    ./_17lmu
> 4.0M    ./_1drmx
> 11M     ./_1drbf
> 4.3M    ./_1drok
> 13M     ./_1drq1
> 40K     ./_1drqe
> 11M     ./_1drhc
> 260M    ./_1dr3g
> 664M    ./_1by44
> 2.5G    ./_14tet
> 281M    ./_1c4wj
> 25G     ./_zzgq
> 274M    ./_1d2nc
> 638M    ./_1ctf0
> 580K    ./_1drqf
> 304K    ./_1drqd
> 6.5M    ./_1dr6m
> 325M    ./_1djfp
> 37G
> 
> I also tried to download the index directory to my local machine, to inspect 
> it with Luke, which I know a bit, but because of network problems the 
> download always gets interrupted.
> 
>> What is your segment size limit?
> 
> I don't know, where could I see that limit?
> 
>> Have you changed the default merge frequency or max segments configuration?
> 
> Is merge frequency the mergeFactor? If so, I'm using the default, which is 10; 
> see https://jackrabbit.apache.org/archive/wiki/JCR/Search_115513504.html
> 
> Max segments I don't know; where could I see it?
> 
> Bye
> 
> -----Original Message-----
> From: Sharma <a...@apache.org>
> Sent: Monday, November 4, 2019 14:46
> To: java-user@lucene.apache.org
> Subject: Re: Lucene index directory grows and shrinks
> 
> These are typical symptoms of an index merge.
> 
> However, it is hard to say more without more data. What is your 
> segment size limit? Have you changed the default merge frequency or max 
> segments configuration? Do you have an estimate of the ratio of segments 
> that have reached the max size limit to total segments?
> 
> Atri
> 
> On Mon, Nov 4, 2019 at 7:12 PM Raffaele Gambelli <r.gambe...@westpole.it> 
> wrote:
>> 
>> Hi all,
>> 
>> I'm using Jackrabbit 2.18.0 which uses lucene-core 3.6.0.
>> 
>> I'm working on an application whose index directory has reached 37 G. A 
>> few days ago, disk usage quickly reached 100% and then returned to its 
>> pre-growth level.
>> 
>> I believe that was caused by rapid growth of the Lucene index directory. 
>> Looking for such an event, I found only this article describing 
>> something really similar:
>> https://helpx.adobe.com/uk/experience-manager/kb/lucene-index-directory-growth.html
>> 
>> I would like to know more about this behaviour. First of all, can you 
>> confirm this growth and shrinkage?
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> Raffaele Gambelli
> WebRainbow® Software Developer
> 
> P +39 051 8550 576
> M #
> E r.gambe...@westpole.it
> W https://westpole.webex.com/meet/R.Gambelli
> A Via Ettore Cristoni, 84 - 40030 Casalecchio di Reno
> 

