Hi,

On Tue, Jan 29, 2013 at 12:01 PM, Marcel Reutegger <[email protected]> wrote:
>> both internal storage (as seen below) and garbage collection. Segments
>> that are no longer referenced can be efficiently identified by
>> traversing the graph of segment-level references without having to
>> parse or even fetch the contents of each segment.
>
> But it still requires to traverse the complete graph, right?
Right. A 10GB repository would likely contain some 1k-100k segments,
which should still be traversable in just minutes. If that isn't good
enough in practice, we could organize the segments into generations to
reduce the number of segments that need to be traversed in one go.

>> Journals are special, atomically updated documents that record the
>> state of the repository as a sequence of references to successive
>> root node records.
>
> when you write 'root node record', do you mean the segment, which
> contains the root node?

The journal would contain a reference to the root node record, i.e. a
combination of the segment UUID and the offset within that segment. In
some cases it might be possible for a single segment to contain
multiple root nodes.

> how do you decide whether it's OK to write to a lower level journal and
> when to write to the root journal?

That can be decided either at deployment time (this client will always
write to this journal) or at runtime (this commit should go to this
journal) if the client application wants more fine-grained control
over commit behavior (cf. write concerns in MongoDB).

> Who merges the journal and when?

A lower level journal could contain metadata that indicates when and
how it should be merged. A client that commits to or reads from such a
journal should interpret that metadata and take care of merging when
needed.

> How do you make sure a client sticks to a given lower level journal? Switching
> journals means you might not see your own writes.

As described above, in a typical scenario the journal would be
selected at deployment time, so the client would always use the same
journal. If the application wants more fine-grained control, it also
needs to deal with issues such as partial visibility of changes.

>> Each record is uniquely addressable by its location within the segment
>> and the UUID of that segment.
>> Assuming that the size of a segment is limited to 16MB (maximum size
>> of a MongoDB document) and that a single segment can contain
>> references to up to 255 other segments, then a reference to any
>> record in any segment can be stored in just 4 bytes (1 byte to
>> identify the segment, 3 bytes for the record offset).
>
> I'm wondering if 255 references are sufficient. You need those in
> various places. e.g. when building higher level structure like lists,
> maps etc. but also to reference templates.

I did some benchmarking with typical content (CQ5), and in practice
the majority of references would point to the same segment or to a
small set (dozens) of other segments. Once the limit of 255 segment
references is reached, you can always start a new segment and thus
reset the counter.

If this turns out to be a practical limit, we could either limit the
maximum size of segments to, say, 4MB or pad all records to four-byte
boundaries, which would increase the number of possible external
segment references to 1023 while still needing just 4 bytes per record
reference.

> It seems the requirements for the underlying storage system are minimal,
> which is kind of nice, but makes me wonder why we need MongoDB then?
> Most of the data in MongoDB would consist of segment documents that
> contain binaries, right?

Right. The main benefit of using an underlying storage engine like
MongoDB is that we get distribution, sharding, and atomic updates
essentially for free. As noted, any other database that provides
similar functionality should also work with the proposed design.

>> Template records
>
> I'm a bit skeptical if the added complexity is worth the disk space saving.

I thought so too, but simple benchmarks show that the size savings are
significant (20-50%), especially for small, frequently accessed nodes
like ACL entries or CQ paragraphs that would ideally be kept as close
as possible to the containing parent node.
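Coming back to the 4-byte record references above, here's a rough
Python sketch of how such a reference could be packed and unpacked.
The function names are mine, not from the actual design; I'm assuming
each segment keeps a small table of referenced segment UUIDs that the
1-byte index points into:

```python
# Hypothetical sketch of the 4-byte record reference described above:
# 1 byte selects an entry in the segment's table of referenced segment
# UUIDs, 3 bytes give the record offset within that segment (< 16MB).

MAX_SEGMENT_SIZE = 1 << 24   # 16MB, the maximum MongoDB document size
MAX_SEGMENT_REFS = 255       # entries in the referenced-segment table

def pack_reference(segment_index: int, offset: int) -> bytes:
    """Pack a (segment index, record offset) pair into 4 bytes."""
    assert 0 <= segment_index <= MAX_SEGMENT_REFS
    assert 0 <= offset < MAX_SEGMENT_SIZE
    return bytes([segment_index]) + offset.to_bytes(3, "big")

def unpack_reference(ref: bytes) -> tuple[int, int]:
    """Reverse of pack_reference()."""
    return ref[0], int.from_bytes(ref[1:], "big")
```

With records padded to four-byte boundaries, the offset would fit in
22 bits instead of 24, freeing two more bits for the segment index;
that's where the 1023 figure mentioned above comes from.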
However, template records are by no means necessary for the design to
work, so we could well try it without the extra complexity.

>> A node that contains more than N properties or M child nodes (exact size
>> TBD, M ~ 1k) is stored differently, using map records for the properties
>> and child nodes. This way a node can become arbitrarily large and still
>> remain reasonably efficient to access and modify. The main downside of
>> this alternative storage layout is that the ordering of child nodes is
>> lost.
>
> There's no requirement to keep the child nodes in insertion order, right?

That's one of the trade-offs we discussed earlier. It seems safe to
drop the requirement of orderability (or even of insertion order) for
large nodes. This shouldn't be much of a backwards compatibility issue
given the previously poor performance of large nodes, and for safety
we could also add a commit validator that prevents an orderable node
from growing beyond that threshold.

> What is not yet clear to me is, how do you decide what the size of a segment
> should be? This can be quite delicate. On the one hand you want to avoid
> small segments because they are inefficient with all the UUID lookups. But
> larger segments are also not ideal, because they increase the cost of a write.
> I understand the complete segment needs to be re-written in this case,
> or do you have a design in mind where a segment can overlay another one?
> however, this leads to fragmentation.

I was thinking of each commit (or merge) resulting in one or more (in
the case of a large commit) segments. To prevent segments from
remaining too small, a process like Lucene's index merging could be
used in combination with garbage collection to combine them.
For example, if the garbage collector detects a sequence of more than
a few small segments (say < 256kB) referenced by only one journal, it
could replace them all with one or more larger segments containing
copies of all the records reachable from those smaller segments. It
can then update the journal to point to the combined segments and let
the next garbage collection round (once clients no longer refer to the
smaller segments) take care of disposing of the small segments that
are no longer needed.

BR,

Jukka Zitting
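PS: to make the merge heuristic above concrete, here's a rough Python
sketch of the candidate-selection step. The data shapes (a list of
(uuid, size) pairs in storage order, plus a map from segment UUID to
the journals referencing it), the 256kB threshold, and the run length
are all illustrative assumptions, not actual Oak code:

```python
# Illustrative sketch of the segment-merge heuristic described above.

SMALL_SEGMENT_LIMIT = 256 * 1024  # merge candidates below this size
MIN_RUN_LENGTH = 3                # "more than a few" small segments

def find_merge_candidates(segments, referencing_journals):
    """Return runs of adjacent small segments that are each referenced
    by only a single journal; segments is a list of (uuid, size) pairs
    in storage order, referencing_journals maps uuid -> journal list."""
    run, runs = [], []
    for uuid, size in segments:
        if (size < SMALL_SEGMENT_LIMIT
                and len(referencing_journals[uuid]) == 1):
            run.append(uuid)
        else:
            if len(run) >= MIN_RUN_LENGTH:
                runs.append(run)
            run = []
    if len(run) >= MIN_RUN_LENGTH:
        runs.append(run)
    return runs
```

Each returned run would then be rewritten into one or more larger
segments, and the journal updated to point at the combined segments as
described above.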
