While the testing effort on dumb segments is ongoing, I opened OAK-4659 and attached a patch to it. This change is based on the dumb segments, and improves the format by implementing logical record IDs. This way, records can be addressed by a record number instead of by their offset inside the segment.
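To make the idea concrete, here is a minimal sketch of the indirection. It is not the code in the patch: the class and method names (RecordNumbers, addRecord, offsetOf) and the in-memory list are hypothetical, and only illustrate how a record number could be resolved to an offset within a single segment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the OAK-4659 implementation: a per-segment table
// that maps logical record numbers to offsets inside the segment.
final class RecordNumbers {

    // offsets.get(n) holds the offset of record number n within the segment.
    private final List<Integer> offsets = new ArrayList<>();

    // Registers a record stored at the given offset and returns its record number.
    public int addRecord(int offset) {
        offsets.add(offset);
        return offsets.size() - 1;
    }

    // Resolves a record number to the offset of the record inside the segment.
    public int offsetOf(int recordNumber) {
        if (recordNumber < 0 || recordNumber >= offsets.size()) {
            throw new IllegalArgumentException("Unknown record number: " + recordNumber);
        }
        return offsets.get(recordNumber);
    }

    public static void main(String[] args) {
        RecordNumbers table = new RecordNumbers();
        int first = table.addRecord(262128);   // offsets are made-up values
        int second = table.addRecord(262100);
        System.out.println(first + " -> " + table.offsetOf(first));
        System.out.println(second + " -> " + table.offsetOf(second));
    }
}
```

In this sketch a reference to a record only needs the record number; the offset is looked up in the table when the record is read.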

2016-07-27 17:06 GMT+02:00 Michael Dürig <[email protected]>:
>
> Looks good! I think we should give this one a spin. Some minor points we
> should keep an eye on before we commit this though:
>
> - does tooling still work with the changes in the segment format? Some of
> them access the segments directly such that expanding the segment header by
> 2 bytes might break them.
>
> - have a look at the micro benchmarks and compare to before.
>
> - remind us to remember ;-) updating the documentation of the segment format
> at some point
>
> - I would like to have something along the lines of the segment size test
> back. Probably not as a unit test but more as a benchmark for record sizes.
> So instead of it failing the build, it would output some numbers which we
> could then graph very much the same way like for performance benchmarks.
>
> Michael
>
> On 26.7.16 11:47 , Francesco Mari wrote:
>>
>> With my latest commits on this branch [1] I enabled every previously
>> ignored test, fixing them when needed. The only two exceptions are
>> RecordUsageAnalyserTest and SegmentSizeTest, that were simply deleted.
>> I also added a couple of tests to cover the cases that work slightly
>> differently than before.
>>
>> [1]: https://github.com/francescomari/jackrabbit-oak/tree/dumb
>>
>> 2016-07-25 17:48 GMT+02:00 Francesco Mari <[email protected]>:
>>>
>>> It might be a variation in the process I tried. This shouldn't affect
>>> much the statistics anyway, given that the population sample is big
>>> enough in both cases.
>>>
>>> 2016-07-25 17:46 GMT+02:00 Michael Dürig <[email protected]>:
>>>>
>>>> Interesting numbers. Most of them look as I would have expected. I.e.
>>>> the distributions in the dumb case are more regular (smaller std. dev,
>>>> mean and median closer to each other), bigger segment sizes, etc.
>>>>
>>>> What I don't understand is the total number of records. These numbers
>>>> differ greatly between current and dumb. Is this a test artefact (i.e.
>>>> test not reproducible) or are we missing out on something.
>>>>
>>>> Michael
>>>>
>>>> On 25.7.16 4:01 , Francesco Mari wrote:
>>>>>
>>>>> I put together some statistics [1] for the process I described above.
>>>>> The "dumb" variant requires more segments to store the same amount of
>>>>> data, because of the increased size of serialised record IDs. As you
>>>>> can see the amount of records per segment is definitely lower in the
>>>>> dumb variant.
>>>>>
>>>>> On the other hand, ignoring the growth of segment ID reference table
>>>>> seems to be a good choice. As shown from the segment size average,
>>>>> dumb segments are usually fuller than their counterpart. Moreover, a
>>>>> lower standard deviation shows that it's more common to have full dumb
>>>>> segments.
>>>>>
>>>>> In addition, my analysis seems to have found a bug too. There are a
>>>>> lot of segments with no segment ID references and only one record,
>>>>> which is very likely to be the segment info. The flush thread writes
>>>>> every 5 seconds the current segment buffer, provided that the buffer
>>>>> is not empty. It turns out that a segment buffer is never empty, since
>>>>> it always contains at least one record. As such, we are currently
>>>>> leaking almost empty segments every 5 seconds, that waste additional
>>>>> space on disk because of the padding required by the TAR format.
>>>>>
>>>>> [1]:
>>>>> https://docs.google.com/spreadsheets/d/1gXhmPsm4rDyHnle4TUh-mtB2HRtRyADXALARRFDh7z4/edit?usp=sharing
>>>>>
>>>>> 2016-07-25 10:05 GMT+02:00 Michael Dürig <[email protected]>:
>>>>>>
>>>>>> Hi Jukka,
>>>>>>
>>>>>> Thanks for sharing your perspective and the historical background.
>>>>>>
>>>>>> I agree that repository size shouldn't be a primary concern. However,
>>>>>> we have seen many repositories (especially with an external data
>>>>>> store) where the content is extremely fine granular. Much more than
>>>>>> in an initial content installation of CQ (which I believe was one of
>>>>>> the initial setup for collecting statistics). So we should at least
>>>>>> understand the impact of the patch in various scenarios.
>>>>>>
>>>>>> My main concern is the cache footprint of node records. Those are
>>>>>> made up of a list of record ids and would thus grow by a factor of 6
>>>>>> with the current patch.
>>>>>>
>>>>>> Locality is not so much of concern here. I would expect it to
>>>>>> actually improve as the patch gets rid of the 255 references limit of
>>>>>> segments. A limit which in practical deployments leads to
>>>>>> degeneration of segment sizes (I regularly see median sizes below
>>>>>> 5k). See OAK-2896 for some background on this.
>>>>>> Furthermore we already did a big step forward in improving locality
>>>>>> in concurrent write scenarios when we introduced the
>>>>>> SegmentBufferWriterPool. In essence: thread affinity for segments.
>>>>>>
>>>>>> We should probably be more carefully looking at the micro benchmarks.
>>>>>> I guess we neglected this part a bit in the past. Unfortunately CI
>>>>>> infrastructure isn't making this easy for us... OTOH those benchmarks
>>>>>> only tell you so much. Many of the problems we recently faced only
>>>>>> surfaced in the large: huge repos, high concurrent load, many days of
>>>>>> traffic.
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>> On 23.7.16 12:34 , Jukka Zitting wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Cool! I'm pretty sure there are various ways in which the format
>>>>>>> could be improved, as the original design was based mostly on
>>>>>>> intuition, guided somewhat by collected stats
>>>>>>> <http://markmail.org/message/kxe3iy2hnodxsghe> and the
>>>>>>> micro-benchmarks <https://issues.apache.org/jira/browse/OAK-119>
>>>>>>> used to optimize common operations.
>>>>>>>
>>>>>>> Note though that the total size of the repository was not and
>>>>>>> probably shouldn't be a primary metric, since the size of a typical
>>>>>>> repository is governed mostly by binaries and string properties
>>>>>>> (though it's a good idea to make sure you avoid things like
>>>>>>> duplicates of large binaries). Instead the rationale for squeezing
>>>>>>> things like record ids to as few bytes as possible is captured in
>>>>>>> the principles listed in the original design doc
>>>>>>> <http://jackrabbit.apache.org/oak/docs/nodestore/segmentmk.html>:
>>>>>>>
>>>>>>> - Compactness. The formatting of records is optimized for size to
>>>>>>> reduce IO costs and to fit as much content in caches as possible. A
>>>>>>> node stored in SegmentNodeStore typically consumes only a fraction
>>>>>>> of the size it would as a bundle in Jackrabbit Classic.
>>>>>>> - Locality. Segments are written so that related records, like a
>>>>>>> node and its immediate children, usually end up stored in the same
>>>>>>> segment. This makes tree traversals very fast and avoids most cache
>>>>>>> misses for typical clients that access more than one related node
>>>>>>> per session.
>>>>>>>
>>>>>>> Thus I would recommend keeping an eye also on benchmark results in
>>>>>>> addition to raw repository size when evaluating possible
>>>>>>> improvements. Also, the number and size of data segments are good
>>>>>>> size metrics to look at in addition to total disk usage.
>>>>>>>
>>>>>>> BR,
>>>>>>>
>>>>>>> Jukka Zitting
>>>>>>>
>>>>>>> On Fri, Jul 22, 2016 at 5:55 AM Francesco Mari
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> The impact on repository size needs to be assessed with more
>>>>>>>> specific tests. In particular, I found RecordUsageAnalyserTest and
>>>>>>>> SegmentSizeTest unsuitable to this task. It's not a coincidence that
>>>>>>>> these tests are usually the first to be disabled or blindly updated
>>>>>>>> every time a small fix changes the size of the records.
>>>>>>>>
>>>>>>>> Regarding GC, the segment graph could be computed during the mark
>>>>>>>> phase. Of course, it's handy to have this information pre-computed
>>>>>>>> for you, but since the record graph is traversed anyway we could
>>>>>>>> think about dynamically reconstructing the segment graph when
>>>>>>>> needed.
>>>>>>>>
>>>>>>>> There are still so many questions to answer, but I think that this
>>>>>>>> simplification exercise can be worth the effort.
>>>>>>>>
>>>>>>>> 2016-07-22 11:34 GMT+02:00 Michael Dürig <[email protected]>:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Neat! I would have expected a greater impact on the size of the
>>>>>>>>> segment store. But as you say it probably all depends on the
>>>>>>>>> binary/content ratio. I think we should look at the #references /
>>>>>>>>> repository size ratio for repositories of different structures and
>>>>>>>>> see how such a number differs with and without the patch.
>>>>>>>>>
>>>>>>>>> I like the patch as it fixes OAK-2896 while at the same time
>>>>>>>>> reducing complexity a lot.
>>>>>>>>>
>>>>>>>>> OTOH we need to figure out how to regain the lost functionality
>>>>>>>>> (e.g. gc) and assess its impact on repository size.
>>>>>>>>>
>>>>>>>>> Michael
>>>>>>>>>
>>>>>>>>> On 22.7.16 11:32 , Francesco Mari wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Yesterday I took some time for a little experiment: how many
>>>>>>>>>> optimisations can be removed from the current segment format while
>>>>>>>>>> maintaining the same functionality?
>>>>>>>>>>
>>>>>>>>>> I made some work in a branch on GitHub [1]. The code on that
>>>>>>>>>> branch is similar to the current trunk except for the following
>>>>>>>>>> changes:
>>>>>>>>>>
>>>>>>>>>> 1. Record IDs are always serialised in their entirety. As such, a
>>>>>>>>>> serialised record ID occupies 18 bytes instead of 3.
>>>>>>>>>>
>>>>>>>>>> 2. Because of the previous change, the table of referenced segment
>>>>>>>>>> IDs is not needed anymore, so I removed it from the segment
>>>>>>>>>> header. It turns out that this table is indeed needed for the mark
>>>>>>>>>> phase of compaction, so this feature is broken in that branch.
>>>>>>>>>>
>>>>>>>>>> Anyway, since the code is in a runnable state, I generated some
>>>>>>>>>> content using the current trunk and the dumber version of
>>>>>>>>>> oak-segment-tar. This is the repository created by the dumb
>>>>>>>>>> oak-segment-tar:
>>>>>>>>>>
>>>>>>>>>> 524744 data00000a.tar
>>>>>>>>>> 524584 data00001a.tar
>>>>>>>>>> 524688 data00002a.tar
>>>>>>>>>> 460896 data00003a.tar
>>>>>>>>>> 8 journal.log
>>>>>>>>>> 0 repo.lock
>>>>>>>>>>
>>>>>>>>>> This is the one created by the current trunk:
>>>>>>>>>>
>>>>>>>>>> 524864 data00000a.tar
>>>>>>>>>> 524656 data00001a.tar
>>>>>>>>>> 524792 data00002a.tar
>>>>>>>>>> 297288 data00003a.tar
>>>>>>>>>> 8 journal.log
>>>>>>>>>> 0 repo.lock
>>>>>>>>>>
>>>>>>>>>> The process that generates the content doesn't change between the
>>>>>>>>>> two executions, and the generated content is coming from a real
>>>>>>>>>> world scenario. For those familiar with it, the content is
>>>>>>>>>> generated by an installation of Adobe Experience Manager.
>>>>>>>>>>
>>>>>>>>>> It looks like that the size of the repository is not changing so
>>>>>>>>>> much. Probably the de-optimisation in the small is dwarfed by the
>>>>>>>>>> binary content in the large. Another effect of my change is that
>>>>>>>>>> there is no limit on the number of referenced segment IDs per
>>>>>>>>>> segment, and this might allow segments to pack more records than
>>>>>>>>>> before.
>>>>>>>>>>
>>>>>>>>>> Questions apart, the clear advantage of this change is a great
>>>>>>>>>> simplification of the code. I guess I can remove some lines more,
>>>>>>>>>> but what I peeled off is already a considerable amount. Look at
>>>>>>>>>> the code!
>>>>>>>>>>
>>>>>>>>>> Francesco
>>>>>>>>>>
>>>>>>>>>> [1]: https://github.com/francescomari/jackrabbit-oak/tree/dumb
