[ 
https://issues.apache.org/jira/browse/OAK-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16135247#comment-16135247
 ] 

Tommaso Teofili commented on OAK-5192:
--------------------------------------

Following some of the feedback received on the Lucene dev list, I've also 
tried tuning other IndexWriterConfig parameters, such as _ramBufferSize_, which 
controls how much data is buffered in memory before it gets flushed as a segment.
By increasing _ramBufferSize_ to 100 MB I reduced the number of merges to 0 in 
the current tests. This is not ideal, because merging is also useful to avoid 
having too many segments to read and therefore possibly slow queries (each 
segment needs to be queried), but it shows that we could also look into other 
options for less aggressive merging.
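
A rough way to see why a larger RAM buffer leaves fewer segments to merge is to estimate how many segments get flushed for a given amount of indexed data. The sketch below assumes a simplified model in which one segment is flushed each time the buffer fills up. The class name, method name and the 90 MB volume are hypothetical; only the 16 MB value is Lucene's documented default RAM buffer size.

```java
// Simplified model, not Lucene code: one segment is flushed each time the
// in-memory buffer fills up, plus one final flush for any remainder.
final class FlushEstimate {

    /** Lucene's default RAM buffer size (IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB). */
    static final long DEFAULT_RAM_BUFFER_MB = 16;

    /** Estimated number of flushed segments: ceil(indexedMb / ramBufferMb). */
    static long flushedSegments(long indexedMb, long ramBufferMb) {
        if (indexedMb < 0 || ramBufferMb <= 0) {
            throw new IllegalArgumentException("indexedMb must be >= 0 and ramBufferMb > 0");
        }
        return (indexedMb + ramBufferMb - 1) / ramBufferMb;
    }

    public static void main(String[] args) {
        long indexedMb = 90; // hypothetical amount of data indexed during a test run
        System.out.println("default buffer: " + flushedSegments(indexedMb, DEFAULT_RAM_BUFFER_MB));
        System.out.println("100 MB buffer:  " + flushedSegments(indexedMb, 100));
    }
}
```

Under this model the smaller buffer produces several small segments that the merge policy may later combine, while the 100 MB buffer produces a single one. Real flush behaviour also depends on commits and on how Lucene accounts for memory, so the numbers are only indicative.
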
On the other hand, I've also implemented a simple algorithm for mitigating 
merging, which only merges when no changes are being made to the index (not 
taking deletions into account). This somewhat reduced the number of merges and 
the related IO, especially for the lucene46 codec, as shown in the table below.

||codec||merge policy||ramBufferSize||segment size||FDS size||no. of merges||
|oakCodec mrl=4000|default|default|285.0 MB|2 GB|7|
|oakCodec mrl=4000|default|100 MB|283.0 MB|1 GB|0|
|oakCodec mrl=4000|mitigated|default|284.9 MB|2 GB|5|
|lucene46 mrl=4000|default|default|284.7 MB|2 GB|8|
|lucene46 mrl=4000|default|100 MB|282.9 MB|1 GB|0|
|lucene46 mrl=4000|mitigated|default|284.7 MB|1 GB|4|
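
The idea behind the mitigated merge policy (merge only while the index is not changing, ignoring deletions) can be sketched as a small gate that a merge policy would consult before selecting merges. This is a hypothetical illustration, not the code from the attached patch: the class name, the method name and the use of a running count of added documents as the change signal are all assumptions.

```java
// Hypothetical sketch, not the patch attached to this issue: merging is
// allowed only if no documents were added since the previous check.
final class QuietIndexMergeGate {

    // Number of added documents seen at the previous check; -1 means "no check yet".
    private long lastAddedDocs = -1;

    /**
     * @param addedDocs total number of documents added so far (deletions are
     *                  deliberately ignored, as described in the comment)
     * @return true if no documents were added since the previous call
     */
    boolean mayMerge(long addedDocs) {
        boolean quiet = addedDocs == lastAddedDocs;
        lastAddedDocs = addedDocs;
        return quiet;
    }

    public static void main(String[] args) {
        QuietIndexMergeGate gate = new QuietIndexMergeGate();
        long[] samples = {100, 250, 250, 400, 400, 400};
        for (long docs : samples) {
            System.out.println(docs + " docs -> merge allowed: " + gate.mayMerge(docs));
        }
    }
}
```

With such a gate, merges are postponed while indexing is active and only run once the document count stops moving, which matches the direction of the "mitigated" rows above (fewer merges than the default policy, but not zero).
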



> Reduce Lucene related growth of repository size
> -----------------------------------------------
>
>                 Key: OAK-5192
>                 URL: https://issues.apache.org/jira/browse/OAK-5192
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene, segment-tar
>            Reporter: Michael Dürig
>            Assignee: Tommaso Teofili
>              Labels: perfomance, scalability
>             Fix For: 1.8, 1.7.8
>
>         Attachments: added-bytes-zoom.png, binSize100.txt, binSize16384.txt, 
> binSizeTotal.txt, diff.txt.zip, nonBinSizeTotal.txt, OAK-5192.0.patch, Screen 
> Shot 2017-07-03 at 16.50.00.png
>
>
> I observed Lucene indexing contributing to up to 99% of repository growth. 
> While the size of the index itself is well inside reasonable bounds, the 
> overall turnover of data being written and removed again can be as much as 
> 99%. 
> In the case of the TarMK this negatively impacts overall system performance 
> due to a fast-growing number of tar files / segments, bad locality of 
> reference, cache misses/thrashing when looking up segments and vastly 
> prolonged garbage collection cycles.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
