I've recently been working on a redesign of the vldb on-disk format (version 5, "vldb5"), together with some others at SNA. This is still in the early stages, but I wanted to provide a rough description of what I'm working with so far, to solicit feedback and give others a chance to raise objections.
In this email, I'm just trying to describe the more interesting aspects of the new format; a followup email will explain a little more about the relevant motivations and possible concerns I have. I'm not trying to provide a full spec for the format here; this is just an informal description of the design and its various features. It's also not intended to cover other practical matters, like how the vlserver will deal with db format upgrades/downgrades; this is just about the new db format itself. Feedback is appreciated.

Motivation
--

The immediate motivation for this work is a cell that will probably exceed the 32-bit file offset limit in ubik in the not-too-distant future. A few other things also need to change to fix that (ubik protocols, APIs), but the vldb4 disk format itself also uses 32-bit offsets to refer to database entries inside hash chains, etc. The naive way to fix _that_ means changing all of the relevant fields to 64 bits and doing a full database conversion. So, if we're doing that anyway, we might as well try to fix some of the other limitations in vldb4 at the same time by defining a completely new format. The main other benefits of this are lifting the various hard-coded limits on various structures (e.g. replication sites), and making it easier to introduce new types of data into the database (e.g. ipv6 addresses, encryption keys).

Code
--

To follow along with concrete code examples, I have a rough Python script here:

  <https://github.com/adeason/afs-tools/blob/adeason/vldb5/debug/vldbconv>

for converting between vldb4, vldb5, and yaml. There is no actual vlserver implementation or any C yet. I also have a format description for vldb5 in the Kaitai Struct language here:

  <https://github.com/adeason/afs-tools/blob/adeason/vldb5/debug/ubik_vldb5.ksy>

This can be used with <https://ide.kaitai.io> to get a kind of interactive explorer for the binary format.
(It's pretty low-level, just decoding the structs, with no knowledge of what the data actually represents.)

Basic structure
--

Data types are encoded in XDR (so, big-endian). The database consists of a sequence of fixed-size blocks that I call "record"s (VL5_Record in the above script). Each record starts with a 4-byte tag indicating what the record is for (a volume, a fileserver, etc). The tag is followed by the record's payload, which consists of a sequence of "value" objects. Each "value" consists of a 4-byte tag, a 4-byte length, and 'length' bytes of payload. Because every value has an explicit length, most values with an unknown tag can be ignored by the vlserver, while still allowing the remaining values to be parsed.

Records and values share the same tag namespace, and there are no record-specific tags or anything like that. Overloading the constants for tags is not allowed; we have 2^32 of them, after all, which is more than enough.

The size of each record is specified by a value in the root/header record: the "recsize". The recsize must be a power of two, as it's currently specified as the log2 of the actual recsize. I'm currently guessing the typical value will be somewhere between 256 bytes and 4k.

Anything directly pointing to another item in the db does so by referencing that item's record number/recno as a 64-bit int (not the file offset); recno N is simply at offset N*recsize. So theoretically, increasing the recsize can be done just by moving some data around, without needing to interpret any of the actual data inside the various structures beyond the header.

A small partial example: say we have a record for a volume. The record's tag is VL5_TAG_VOLUME (0x1D), and contained in that record is the value VL5_TAG_VOL_NAME (0x1E) for the volume's name.
So we have the data (in hex):

  00 00 00 1D  # VL5_TAG_VOLUME
  00 00 00 1E  # VL5_TAG_VOL_NAME
  00 00 00 0C  # value payload length (12 bytes)
  00 00 00 07  # xdr-encoded string "vol.foo"
  76 6F 6C 2E 66 6F 6F 00

And then other values for the volume would immediately follow, until the end of the record is reached or we encounter a value tag of 0x00 (VL5_TAG_NULL).

Fundamental info
--

The first record in the db (recno 0) is the root record, VL5_TAG_ROOT_BE (0x05) in big-endian, so the first 4 bytes in the stream are 0x00 0x00 0x00 0x05. For big-endian encoding, the first value in that record must be VL5_TAG_FUNINFO_BE (0x55555555), which contains information that's fundamental to interpreting the rest of the db: the recsize, and the encoding style / endianness.

To indicate little-endian encoding, a different tag is used: VL5_TAG_FUNINFO_LE (0x66666666), and the root record tag is instead defined as VL5_TAG_ROOT_LE (0x05000000). (The code examples provided don't deal with little-endian, though.)

If a vlserver fails to understand the fundamental info inside one of those values, or if the first value isn't one of those _FUNINFO_ values, the vlserver bails out with an error. Unlike most other values, we can't continue to parse the database if we fail to understand this first one, since it's needed to understand anything else.

Continuation records
--

If something needs more data than fits in a single record, we can link together multiple records using a VL5_TAG_CONTINUE_PTR (0x07) value. The idea is that when something is parsing a record and comes across such a value, it reads the indicated continuation record(s), concatenates their payloads onto the payload of the current record (ignoring VL5_TAG_NULL values), and parsing continues. The continuation record that gets pointed to still has a record header (tagged with VL5_TAG_CONTINUE_REC, 0x08). There are no "extent"-style records that e.g. claim the next 3 records; all records still have a tag as the first 4 bytes.
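Stepping back to the fundamental info for a moment, here's a minimal Python sketch of reading it from recno 0. I'm assuming, per the "08 01 00 00" example record, that the first FUNINFO payload byte is the log2 recsize; the remaining payload bytes are left uninterpreted here, and none of this is meant as a real parser:

```python
import struct

# Tag constants from the format description.
VL5_TAG_ROOT_BE = 0x05
VL5_TAG_FUNINFO_BE = 0x55555555

def parse_root_funinfo(db):
    """Read recno 0 and return the recsize.

    Sketch only: assumes big-endian encoding, and that the first byte
    of the FUNINFO payload is log2(recsize).  A real parser would also
    interpret the encoding-style byte and the rest of the payload.
    """
    (rec_tag,) = struct.unpack_from(">I", db, 0)
    if rec_tag != VL5_TAG_ROOT_BE:
        raise ValueError("recno 0 is not a big-endian root record")

    # The first value must be FUNINFO: 4-byte tag, 4-byte length, payload.
    val_tag, val_len = struct.unpack_from(">II", db, 4)
    if val_tag != VL5_TAG_FUNINFO_BE:
        raise ValueError("first value is not FUNINFO; bail out")

    payload = db[12:12 + val_len]
    return 1 << payload[0]   # recsize is stored as its log2

def recno_offset(recno, recsize):
    # recno N is simply at offset N*recsize
    return recno * recsize
```

For a root record starting with 00 00 00 05 / 55 55 55 55 / 00 00 00 04 / 08 01 00 00, this returns a recsize of 256, and e.g. recno 3 would live at offset 768.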
Example:

recno 0:
  00 00 00 05  # VL5_TAG_ROOT_BE
  55 55 55 55  # VL5_TAG_FUNINFO_BE
  00 00 00 04  # length 4
  08 01 00 00  # recsize=2^8=256, big-endian
  00 00 00 07  # VL5_TAG_CONTINUE_PTR
  00 00 00 0C  # length 12
  00 00 00 02  # xdr-encoded array [1,2]
  00 00 00 01
  00 00 00 02
  00 00 00 00  # VL5_TAG_NULL
  ...

recno 1:
  00 00 00 08  # VL5_TAG_CONTINUE_REC
  00 00 00 09  # VL5_TAG_CONTINUE_HDR
  00 00 00 04  # length 4
  00 00 00 00  # previous recno 0
  XX XX XX XX  # continuation data
  ...
  00 00 00 00  # VL5_TAG_NULL

recno 2:
  00 00 00 08  # VL5_TAG_CONTINUE_REC
  00 00 00 09  # VL5_TAG_CONTINUE_HDR
  00 00 00 04  # length 4
  00 00 00 01  # previous recno 1
  YY YY YY YY  # continuation data
  ...
  00 00 00 00  # VL5_TAG_NULL

When parsing recno 0 and the VL5_TAG_CONTINUE_PTR value is encountered, the specified continuation records are read and appended to the payload of recno 0. So the logical view of recno 0 then looks like this:

  00 00 00 05  # VL5_TAG_ROOT_BE
  55 55 55 55  # VL5_TAG_FUNINFO_BE
  00 00 00 04  # length 4
  08 01 00 00  # recsize=2^8=256, big-endian
  XX XX XX XX  # payload from recno 1
  ...
  YY YY YY YY  # payload from recno 2
  ...
  00 00 00 00  # VL5_TAG_NULL

Volume entry lookups
--

To look up volume entries, the proposed vldb5 uses b+ trees. There is one tree to look up by 64-bit volume id, and one tree to look up by volume name. For the id tree, the volume ids themselves are the keys, and the leaves point directly to the volume's record. For the name tree, the keys are the crc32c "hash" of the name. If there is no collision for the name, the leaf points directly to the volume's record. If there are collisions, the leaf points to a "collision record" (VL5_TAG_VOL_NAMEHTREE_COLL_REC, 0x2C), which contains a simple array of (volume name, recno) pairs, sorted by name. Each node in a tree is a db record (e.g. VL5_TAG_VOL_IDTREE_REC, 0x24), and we just store as many keys/children per node as we can fit in a record.
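To illustrate the name-tree keying and the direct-vs-collision leaf handling, here's a rough Python sketch. The crc32c function is a plain bitwise implementation (note Python's zlib.crc32 is regular CRC-32, not crc32c), and the dict-based "tree" is just a hypothetical in-memory stand-in for the on-disk b+ tree:

```python
def crc32c(data):
    """Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0x82F63B78
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF

def lookup_by_name(tree, name):
    """Hypothetical stand-in for a name-tree lookup.

    'tree' maps a crc32c key to either a recno (no collision), or a
    name-sorted list of (name, recno) pairs, like a collision record.
    """
    leaf = tree.get(crc32c(name))
    if leaf is None:
        return None
    if isinstance(leaf, int):
        return leaf                 # leaf points directly at the volume record
    for entry_name, recno in leaf:  # collision record: scan the pairs
        if entry_name == name:
            return recno
    return None
```

A quick sanity check for any crc32c implementation: crc32c(b"123456789") should be 0xE3069283, the standard CRC-32C check value.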
The root node of the name tree also contains a small value (VL5_TAG_VOL_NAMEHTREE_ROOTINFO, 0x2B) with some metadata (currently just which algorithm is used for the tree's key hash values).

Volume entries don't point/link to other volume entries in any way. To list all volumes in the db, we just look through every record, looking for volume records.

Fileserver/partition indirection
--

Volume records contain, among other things, the list of sites where that volume resides. We indicate a volume site's location by "partition id", an internal id number that gets mapped to "vicepX on fileserver Y" in a global table. And of course "fileserver Y" refers to a fileserver id that maps to a fileserver entry containing the fileserver's address list, uuid, and other information. The mappings of partition/fileserver ids to the relevant data are stored in simple arrays referenced by the root record: VL5_TAG_FILESERVER_LIST_PTR (0x14), VL5_TAG_PARTITION_LIST_PTR (0x17). This data is intended to be cached in memory by the application (since it rarely changes), so the layout of these structures is not very important; no trees, hashes, etc, are needed.

General-purpose notes
--

One feature that is completely new to the db is the ability to attach arbitrary user-defined "note"s to objects in the database, such as volumes or fileservers. A volume record, for instance, can contain a VL5_TAG_NOTE_PTR (0x1A) value, which points to a VL5_TAG_NOTE_REC (0x1B) record that just contains a string set by a user. This is intended to be either an ad-hoc human-readable string set by administrators (e.g. "let u...@example.com know before moving this volume --adeason"), or yaml/json/etc-encoded data for use by automation tools on top of OpenAFS.

-- 
Andrew Deason
adea...@sinenomine.net

_______________________________________________
OpenAFS-devel mailing list
OpenAFS-devel@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-devel