I opened OAK-4596 to track the segment leak.
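For a rough sense of scale, the waste described in the thread below can be estimated with a quick sketch. The 512-byte block size is the standard TAR block size; the assumption that each leaked near-empty segment costs about two such blocks (an entry header plus one padded data block) is mine, for illustration only, not a measurement of the actual segment layout.

```python
# Back-of-the-envelope estimate of the disk waste caused by flushing a
# near-empty segment every 5 seconds, as described in the thread below.

TAR_BLOCK = 512     # bytes; TAR pads every entry to a 512-byte boundary
FLUSH_INTERVAL = 5  # seconds between flushes of the segment buffer

def wasted_bytes_per_day(blocks_per_leaked_segment=2):
    # Assume each leaked segment costs at least an entry header block
    # plus one padded data block (an illustrative assumption).
    flushes_per_day = 24 * 60 * 60 // FLUSH_INTERVAL
    return flushes_per_day * blocks_per_leaked_segment * TAR_BLOCK

print(wasted_bytes_per_day())  # 17694720, i.e. roughly 17 MB per day
```

Even under these conservative assumptions, the leak adds up to megabytes per day on an otherwise idle repository.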
2016-07-25 16:01 GMT+02:00 Francesco Mari <[email protected]>:
> I put together some statistics [1] for the process I described above.
> The "dumb" variant requires more segments to store the same amount of
> data because of the increased size of serialised record IDs. As you
> can see, the number of records per segment is definitely lower in the
> dumb variant.
>
> On the other hand, ignoring the growth of the segment ID reference
> table seems to be a good choice. As the average segment size shows,
> dumb segments are usually fuller than their counterparts. Moreover, a
> lower standard deviation shows that full dumb segments are more
> common.
>
> In addition, my analysis seems to have found a bug, too. There are a
> lot of segments with no segment ID references and only one record,
> which is very likely the segment info. The flush thread writes the
> current segment buffer every 5 seconds, provided that the buffer is
> not empty. It turns out that a segment buffer is never empty, since it
> always contains at least one record. As such, we are currently leaking
> an almost empty segment every 5 seconds, and these segments waste
> additional space on disk because of the padding required by the TAR
> format.
>
> [1]: https://docs.google.com/spreadsheets/d/1gXhmPsm4rDyHnle4TUh-mtB2HRtRyADXALARRFDh7z4/edit?usp=sharing
>
> 2016-07-25 10:05 GMT+02:00 Michael Dürig <[email protected]>:
>> Hi Jukka,
>>
>> Thanks for sharing your perspective and the historical background.
>>
>> I agree that repository size shouldn't be a primary concern. However,
>> we have seen many repositories (especially with an external data
>> store) where the content is extremely fine-grained, much more so than
>> in an initial content installation of CQ (which I believe was one of
>> the initial setups for collecting statistics). So we should at least
>> understand the impact of the patch in various scenarios.
>>
>> My main concern is the cache footprint of node records. Those are
>> made up of a list of record ids and would thus grow by a factor of 6
>> with the current patch.
>>
>> Locality is not so much of a concern here. I would expect it to
>> actually improve, as the patch gets rid of the 255-references limit
>> of segments, a limit which in practical deployments leads to
>> degeneration of segment sizes (I regularly see median sizes below
>> 5k). See OAK-2896 for some background on this. Furthermore, we
>> already took a big step forward in improving locality in concurrent
>> write scenarios when we introduced the SegmentBufferWriterPool: in
>> essence, thread affinity for segments.
>>
>> We should probably look more carefully at the micro benchmarks. I
>> guess we neglected this part a bit in the past. Unfortunately, CI
>> infrastructure isn't making this easy for us... OTOH those benchmarks
>> only tell you so much. Many of the problems we recently faced only
>> surfaced in the large: huge repos, high concurrent load, many days of
>> traffic.
>>
>> Michael
>>
>> On 23.7.16 12:34, Jukka Zitting wrote:
>>> Hi,
>>>
>>> Cool! I'm pretty sure there are various ways in which the format
>>> could be improved, as the original design was based mostly on
>>> intuition, guided somewhat by collected stats
>>> <http://markmail.org/message/kxe3iy2hnodxsghe> and the
>>> micro-benchmarks <https://issues.apache.org/jira/browse/OAK-119>
>>> used to optimize common operations.
>>>
>>> Note though that the total size of the repository was not, and
>>> probably shouldn't be, a primary metric, since the size of a typical
>>> repository is governed mostly by binaries and string properties
>>> (though it's a good idea to make sure you avoid things like
>>> duplicates of large binaries). Instead, the rationale for squeezing
>>> things like record ids into as few bytes as possible is captured in
>>> the principles listed in the original design doc
>>> <http://jackrabbit.apache.org/oak/docs/nodestore/segmentmk.html>:
>>>
>>> - Compactness. The formatting of records is optimized for size to
>>>   reduce IO costs and to fit as much content in caches as possible.
>>>   A node stored in SegmentNodeStore typically consumes only a
>>>   fraction of the size it would as a bundle in Jackrabbit Classic.
>>> - Locality. Segments are written so that related records, like a
>>>   node and its immediate children, usually end up stored in the same
>>>   segment. This makes tree traversals very fast and avoids most
>>>   cache misses for typical clients that access more than one related
>>>   node per session.
>>>
>>> Thus I would recommend keeping an eye also on benchmark results, in
>>> addition to raw repository size, when evaluating possible
>>> improvements. Also, the number and size of data segments are good
>>> size metrics to look at in addition to total disk usage.
>>>
>>> BR,
>>>
>>> Jukka Zitting
>>>
>>> On Fri, Jul 22, 2016 at 5:55 AM Francesco Mari
>>> <[email protected]> wrote:
>>>
>>>> The impact on repository size needs to be assessed with more
>>>> specific tests. In particular, I found RecordUsageAnalyserTest and
>>>> SegmentSizeTest unsuitable for this task. It's not a coincidence
>>>> that these tests are usually the first to be disabled or blindly
>>>> updated every time a small fix changes the size of the records.
>>>>
>>>> Regarding GC, the segment graph could be computed during the mark
>>>> phase. Of course, it's handy to have this information pre-computed
>>>> for you, but since the record graph is traversed anyway, we could
>>>> think about dynamically reconstructing the segment graph when
>>>> needed.
>>>>
>>>> There are still many questions to answer, but I think this
>>>> simplification exercise is worth the effort.
>>>>
>>>> 2016-07-22 11:34 GMT+02:00 Michael Dürig <[email protected]>:
>>>>> Hi,
>>>>>
>>>>> Neat! I would have expected a greater impact on the size of the
>>>>> segment store.
>>>>> But as you say, it probably all depends on the binary/content
>>>>> ratio. I think we should look at the #references / repository size
>>>>> ratio for repositories of different structures and see how such a
>>>>> number differs with and without the patch.
>>>>>
>>>>> I like the patch, as it fixes OAK-2896 while at the same time
>>>>> reducing complexity a lot.
>>>>>
>>>>> OTOH we need to figure out how to regain the lost functionality
>>>>> (e.g. gc) and assess its impact on repository size.
>>>>>
>>>>> Michael
>>>>>
>>>>> On 22.7.16 11:32, Francesco Mari wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Yesterday I took some time for a little experiment: how many
>>>>>> optimisations can be removed from the current segment format
>>>>>> while maintaining the same functionality?
>>>>>>
>>>>>> I did some work in a branch on GitHub [1]. The code on that
>>>>>> branch is similar to the current trunk except for the following
>>>>>> changes:
>>>>>>
>>>>>> 1. Record IDs are always serialised in their entirety. As such, a
>>>>>> serialised record ID occupies 18 bytes instead of 3.
>>>>>>
>>>>>> 2. Because of the previous change, the table of referenced
>>>>>> segment IDs is not needed anymore, so I removed it from the
>>>>>> segment header. It turns out that this table is indeed needed for
>>>>>> the mark phase of compaction, so this feature is broken in that
>>>>>> branch.
>>>>>>
>>>>>> Anyway, since the code is in a runnable state, I generated some
>>>>>> content using the current trunk and the dumber version of
>>>>>> oak-segment-tar. This is the repository created by the dumb
>>>>>> oak-segment-tar:
>>>>>>
>>>>>> 524744 data00000a.tar
>>>>>> 524584 data00001a.tar
>>>>>> 524688 data00002a.tar
>>>>>> 460896 data00003a.tar
>>>>>>      8 journal.log
>>>>>>      0 repo.lock
>>>>>>
>>>>>> This is the one created by the current trunk:
>>>>>>
>>>>>> 524864 data00000a.tar
>>>>>> 524656 data00001a.tar
>>>>>> 524792 data00002a.tar
>>>>>> 297288 data00003a.tar
>>>>>>      8 journal.log
>>>>>>      0 repo.lock
>>>>>>
>>>>>> The process that generates the content doesn't change between the
>>>>>> two executions, and the generated content comes from a real-world
>>>>>> scenario. For those familiar with it, the content is generated by
>>>>>> an installation of Adobe Experience Manager.
>>>>>>
>>>>>> It looks like the size of the repository does not change much.
>>>>>> Probably the de-optimisation in the small is dwarfed by the
>>>>>> binary content in the large. Another effect of my change is that
>>>>>> there is no limit on the number of referenced segment IDs per
>>>>>> segment, and this might allow segments to pack more records than
>>>>>> before.
>>>>>>
>>>>>> Open questions aside, the clear advantage of this change is a
>>>>>> great simplification of the code. I guess I can remove some more
>>>>>> lines, but what I peeled off is already a considerable amount.
>>>>>> Look at the code!
>>>>>>
>>>>>> Francesco
>>>>>>
>>>>>> [1]: https://github.com/francescomari/jackrabbit-oak/tree/dumb
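To put the two listings in the last message side by side: summing the data tar files (the sizes look like `du` output; the unit, presumably KiB, is an assumption) shows the dumb repository ends up roughly 8.7% larger overall, which quantifies the "not changing much" observation.

```python
# Totals of the data tar files from the two listings in the thread.
dumb = [524744, 524584, 524688, 460896]   # dumb oak-segment-tar
trunk = [524864, 524656, 524792, 297288]  # current trunk

total_dumb, total_trunk = sum(dumb), sum(trunk)
print(total_dumb, total_trunk)                # 2034912 1871600
print(f"{total_dumb / total_trunk - 1:.1%}")  # 8.7%
```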

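The size arithmetic that runs through the whole thread can be made concrete with a small sketch. The 18-byte full record ID and the 3-byte short form are taken from the messages above; the byte-level breakdown (a 16-byte segment identifier plus a 2-byte offset for the full form, a 1-byte index into the segment's reference table plus a 2-byte offset for the short form) is an assumption for illustration.

```python
# Serialized record ID sizes discussed in the thread.
FULL_RECORD_ID = 18   # bytes: full segment identifier plus record offset
SHORT_RECORD_ID = 3   # bytes: reference-table index plus record offset

def id_list_bytes(n_ids, id_size):
    # A node record is essentially a list of record IDs (children,
    # properties), so its serialized size scales with the per-ID cost.
    return n_ids * id_size

print(FULL_RECORD_ID / SHORT_RECORD_ID)  # 6.0, the "factor of 6" growth
print(id_list_bytes(100, SHORT_RECORD_ID), id_list_bytes(100, FULL_RECORD_ID))

# A 1-byte reference index in the short form would also explain the cap
# of 255 referenced segments per segment, the limit OAK-2896 blames for
# degenerated segment sizes.
MAX_REFERENCES = 2 ** 8 - 1
print(MAX_REFERENCES)  # 255
```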