Re: reftable [v3]: new ref storage format
On Fri, Jul 28, 2017 at 7:18 PM, Michael Haggertywrote: > On Fri, Jul 28, 2017 at 3:12 PM, Shawn Pearce wrote: >> I'm with you this far, and like the {min,max}_update_index in the >> header. I'm concerned about update_index in 32 bits. At some point you >> need to reset the counter, or the repository is broken. 4b updates is >> enough for anyone? I'd feel better about this being a 64 bit field. > > Yes, I was a little bit nervous about 32 bits, too. But that's a *lot* > of updates: one per second for 136 years. If that limit were ever > reached, there could be a compaction step, where any update indices > that don't have associated reflog entries are "compacted out" of the > numerical sequence and the remaining indices are renumbered > contiguously. I considered that, but its a bit of a pain for the writer to renumber the remaining records. > But it's ok with me to make it 64 bits. Usually those extra bytes > would be appear as and so should prefix- and zlib-compress > well. That was my thought. Within a single reference these will prefix compress right out, and zlib will fix any sins within the log block at restart points.
Re: reftable [v3]: new ref storage format
On Fri, Jul 28, 2017 at 3:12 PM, Shawn Pearcewrote: > I'm with you this far, and like the {min,max}_update_index in the > header. I'm concerned about update_index in 32 bits. At some point you > need to reset the counter, or the repository is broken. 4b updates is > enough for anyone? I'd feel better about this being a 64 bit field. Yes, I was a little bit nervous about 32 bits, too. But that's a *lot* of updates: one per second for 136 years. If that limit were ever reached, there could be a compaction step, where any update indices that don't have associated reflog entries are "compacted out" of the numerical sequence and the remaining indices are renumbered contiguously. But it's ok with me to make it 64 bits. Usually those extra bytes would be appear as and so should prefix- and zlib-compress well. Michael
Re: reftable [v3]: new ref storage format
On Thu, Jul 27, 2017 at 7:28 AM, Michael Haggertywrote: > On Sat, Jul 22, 2017 at 11:29 AM, Shawn Pearce wrote: >> 3rd iteration of the reftable storage format. >> >> [...] >> ### Objectives >> >> - Near constant time lookup for any single reference, even when the >> repository is cold and not in process or kernel cache. >> - Near constant time verification a SHA-1 is referred to by at least >> one reference (for allow-tip-sha1-in-want). >> - Efficient lookup of an entire namespace, such as `refs/tags/`. >> - Support atomic push `O(size_of_update)` operations. >> - Combine reflog storage with ref storage. > > I think that the optimization enabled by obj_records is only > interesting for server-side repositories that will frequently be > fetched from, and which additionally have allow-tip-sha1-in-want > turned on, right? So could we make it explicitly optional? Correct. Its already optional. Maybe I'm just not clear enough about that in the document. I agree its not required in the file, and writers can choose to omit obj_record. >> obj record >> >> An `obj_record` describes a single object abbreviation, and the blocks >> containing references using that unique abbreviation: >> >> varint( prefix_length ) >> varint( (suffix_length << 3) | cnt_3 ) >> suffix >> varint( cnt_rest )? >> varint( block_delta )+ >> > [...] >> Each record contains `block_count` number of block identifiers for ref >> blocks. The `block_count` is determined by: >> >> block_count = cnt_3 >> if (cnt_3 == 0x7) { >> block_count += cnt_rest >> } >> >> The `cnt_rest` field is only present when `block_count >= 0x7` and >> could overflow the `cnt_3` field available in the record start. This >> encoding scheme is used as the vast majority of abbreviations are >> only one reference (or at most a few), and unlikely to exceed 6 blocks. > > I agree that this `cnt_3` handling is overly cute. There will never be > records with `block_count == 0`, will there? Then I propose that > `cnt_3` be set to zero if `block_count` is greater than 7, in which > case the full block count is stored as a varint. It's simpler, and I > think the only `block_count`s for which this scheme costs a byte more > than your scheme are implausible: 128-134, 16383-16390, etc. That is a great idea, thanks. I've updated the draft to reflect your approach of cnt_3 = 0 indicates the actual count follows in a varint. >> [...] >> ### Log block format > > I'm still not happy about how you store reflogs. Here are my objections: > > * You store reflogs by appending a high-resolution time to the > refname. This is awkward: > * There is always some uncertainty about uniqueness, even with a > high-resolution clock. If you want to eliminate that uncertainty, you > have to read the last entry in the reflog for the reference *for every > update* to be sure that your timestamp is not already "in use". > * There is a need to invent microsecond values on systems with > low-resolution clocks, again to ensure uniqueness. > * If there is clock skew, then either you lose information about the > *order* of reference updates (which I would consider unacceptable), or > you might have to invent fictional times for new updates to retain the > order relative to previous updates. But inventing fictional times is > very problematic: suppose somebody's clock is incorrectly set to the > year 2020 when they make a commit. As far as Git is concerned, that is > plausible, so the reflog entry gets that timestamp. Then the user > corrects the clock and makes another commit. We would either be forced > to rewrite the old reflog entry(ies) or to give a timestamp in 2020 to > the new update. > * With reftable storage, multiple-reference updates can be done > atomically. But the reflogs don't retain the information about which > references were updated together. > * Even if all reflogs are retained, it is not possible to deduce the > global state of all references at a moment in the past. > > I propose to solve all of the above problems, plus some others, as follows: > > Let's keep an integer `update_index`, which starts at zero and > increases by one each atomic reference update (i.e., it increases by > one even if the update touches multiple references). > > Let's store reflog entries under keys `(refname, update_index)`, and > store the timestamp of the entry *within* the record rather than as > part of the key. The tuple `(refname, update_index)` has to be encoded > somehow that ensures the desired order; see below. This encoding > ensures that the order of reference updates is well-defined regardless > of clock skew, and that one can use the reflog to determine which > references were updated together and what the complete state of > references was after any update (assuming that the user retains all > reflogs). > > Let's store in the header of each reftable file the `min_update_index` > and
Re: reftable [v3]: new ref storage format
Thanks for your continued work on this. I have some comments about v3 (inline below). On Sat, Jul 22, 2017 at 11:29 AM, Shawn Pearcewrote: > 3rd iteration of the reftable storage format. > > [...] > ### Objectives > > - Near constant time lookup for any single reference, even when the > repository is cold and not in process or kernel cache. > - Near constant time verification a SHA-1 is referred to by at least > one reference (for allow-tip-sha1-in-want). > - Efficient lookup of an entire namespace, such as `refs/tags/`. > - Support atomic push `O(size_of_update)` operations. > - Combine reflog storage with ref storage. I think that the optimization enabled by obj_records is only interesting for server-side repositories that will frequently be fetched from, and which additionally have allow-tip-sha1-in-want turned on, right? So could we make it explicitly optional? > [...] > obj record > > An `obj_record` describes a single object abbreviation, and the blocks > containing references using that unique abbreviation: > > varint( prefix_length ) > varint( (suffix_length << 3) | cnt_3 ) > suffix > varint( cnt_rest )? > varint( block_delta )+ > > Like in reference blocks, abbreviations are prefix compressed within > an obj block. On large reftable files with many unique objects, > higher block sizes (64k), and higher restart interval (128), a > `prefix_length` of 2 or 3 and `suffix_length` of 3 may be common in > obj records (unique abbreviation of 5-6 raw bytes, 10-12 hex digits). > > Each record contains `block_count` number of block identifiers for ref > blocks. The `block_count` is determined by: > > block_count = cnt_3 > if (cnt_3 == 0x7) { > block_count += cnt_rest > } > > The `cnt_rest` field is only present when `block_count >= 0x7` and > could overflow the `cnt_3` field available in the record start. This > encoding scheme is used as the vast majority of abbreviations are > only one reference (or at most a few), and unlikely to exceed 6 blocks. I agree that this `cnt_3` handling is overly cute. There will never be records with `block_count == 0`, will there? Then I propose that `cnt_3` be set to zero if `block_count` is greater than 7, in which case the full block count is stored as a varint. It's simpler, and I think the only `block_count`s for which this scheme costs a byte more than your scheme are implausible: 128-134, 16383-16390, etc. > [...] > ### Log block format I'm still not happy about how you store reflogs. Here are my objections: * You store reflogs by appending a high-resolution time to the refname. This is awkward: * There is always some uncertainty about uniqueness, even with a high-resolution clock. If you want to eliminate that uncertainty, you have to read the last entry in the reflog for the reference *for every update* to be sure that your timestamp is not already "in use". * There is a need to invent microsecond values on systems with low-resolution clocks, again to ensure uniqueness. * If there is clock skew, then either you lose information about the *order* of reference updates (which I would consider unacceptable), or you might have to invent fictional times for new updates to retain the order relative to previous updates. But inventing fictional times is very problematic: suppose somebody's clock is incorrectly set to the year 2020 when they make a commit. As far as Git is concerned, that is plausible, so the reflog entry gets that timestamp. Then the user corrects the clock and makes another commit. We would either be forced to rewrite the old reflog entry(ies) or to give a timestamp in 2020 to the new update. * With reftable storage, multiple-reference updates can be done atomically. But the reflogs don't retain the information about which references were updated together. * Even if all reflogs are retained, it is not possible to deduce the global state of all references at a moment in the past. I propose to solve all of the above problems, plus some others, as follows: Let's keep an integer `update_index`, which starts at zero and increases by one each atomic reference update (i.e., it increases by one even if the update touches multiple references). Let's store reflog entries under keys `(refname, update_index)`, and store the timestamp of the entry *within* the record rather than as part of the key. The tuple `(refname, update_index)` has to be encoded somehow that ensures the desired order; see below. This encoding ensures that the order of reference updates is well-defined regardless of clock skew, and that one can use the reflog to determine which references were updated together and what the complete state of references was after any update (assuming that the user retains all reflogs). Let's store in the header of each reftable file the `min_update_index` and `max_update_index` that are covered by the file. Do this such that the indexes chain together properly for the reftables
Re: reftable [v3]: new ref storage format
On Mon, Jul 24, 2017 at 4:00 PM, Shawn Pearcewrote: > On Mon, Jul 24, 2017 at 3:22 PM, Stefan Beller wrote: >>> index record >>> >>> An index record describes the last entry in another block. >>> Index records are written as: >>> >>> varint( prefix_length ) >>> varint( (suffix_length << 3) | 0 ) >>> suffix >>> varint( block_offset ) >>> >>> Index records use prefix compression exactly like `ref_record`. >>> >>> Index records store `block_offset` after the suffix, specifying the >>> offset in bytes (from the start of the file) of the block that ends >>> with this reference. >> >> Instead of hardcoding the "0" in the last 3 bits, maybe pick one >> of the reserved bit patterns to be there? I would imagine this >> makes debugging easier: >> >> 0x5? Hah that must be an index block I have been >> looking at the wrong block! > > This is an excellent suggestion. I'll include it in the next iteration. I was thinking about this a bit more and wondering if this would allow mixing all sorts of entries in the same block. Think about one of the most common cases client side: git commit -m "just a lonely ref update, no fancy stuff" For that we need (a) a single ref update and (b) now that the format evolved a single reflog entry, maybe (c) an obj entry, too. I just realize that in this case we'd crank down the block size such that there is very little padding required, but I still wonder if it would be possible to omit the second (and third) block, but rather go for a "m"ixed block containing just these 3 entries. Additionally by setting the block_size to a special value "0", we could indicate the omission of a footer, such that a single ref update reftable is super small.
Re: reftable [v3]: new ref storage format
On Mon, Jul 24, 2017 at 3:22 PM, Stefan Bellerwrote: > On Sat, Jul 22, 2017 at 11:29 AM, Shawn Pearce wrote: >> 3rd iteration of the reftable storage format. >> >> You can read a rendered version of this here: >> https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md >> >> Significant changes from v2: >> - efficient lookup by SHA-1 for allow-tip-sha1-in-want. > > I'll focus on that in the review, it sounds exciting. >> ### Ref block format > ... > >> A variable number of 4-byte `restart_offset` values follow the >> records. Offsets are relative to the start of the block and refer to >> the first byte of any `ref_record` whose name has not been prefix >> compressed. Readers can start linear scans from any of these records. >> Offsets in the first block are relative to the start of the file >> (position 0), and include the file header. This requires the first >> restart in the first block to be at offset 8. >> >> The 2-byte `restart_count_m1` stores *one less* than the number of >> entries in the `restart_offset` list. There is always a restart >> corresponding to the first ref record. Readers are responsible for >> computing `restart_count = restart_count_m1 + 1`. > > I had to reread these two paragraphs a couple of times as it calls out > the uninteresting things[1] and the interesting part is the logical > conclusion that one has to make themselves, > here is how I would write it: > > Readers can start linear scans from any record whose name has > not been prefix compressed. The first record of a block must not > be prefix-compressed. > > To aid finding the entry points for linear scans, a variable number > of 4-byte `restart_offset` values follow the records. Offsets are > relative to the start of the block and refer to the first byte of any > such `ref_record` that is not prefix compressed. > The first record can be omitted in the `restart_offset` values as it > is implicit. No, the first record must be listed in the restart_offset table. Its not implicit. The count of it is implicit, to allow up to 65536 restart_offset entries using only a uint16 restart_count. Maybe that is being too cute, and the count should just be the entry count. > The `restart_offset' values must be sorted (ascending, > descending?). Sorted, ascending. >> The `value` follows. Its format is determined by `value_type`, one of >> the following: >> >> - `0x0`: deletion; no value data (see transactions, below) >> - `0x1`: one 20-byte object id; value of the ref >> - `0x2`: two 20-byte object ids; value of the ref, peeled target > > Up to here it is easy to spot a pattern: > The number indicates how many oids follow. > >> - `0x3`: symbolic reference: `varint( target_len ) target` >> - `0x4`: length delimited extension: `varint( data_len ) data` > > This breaks the pattern. > > Instead of hardcoding the numbers here, I wonder if we rather > want to make the bits more meaningful: > > bit 0, 1: number of oids iff bit 2 unset > > Iff bit 2 set, then we have a "varint (len) data" > that follows, bits 0,1 are used for a different purpose, > 00 indicates 'symlink' and data is the string > 01 indicates 'multihead' such as FETCH_HEAD > 1* is reserved for now. > > This *may* be neat micro optimization, but hardcoding all bits > to a lookup table is fine, too. The problem I have with this bit-based rule is it breaks down later as you use additional codes, or you find yourself limited by the bit scheme and don't have the full range of 8 values available. So I decided to be very literal about what the codes mean, and use a lookup table. > Note that symrefs and multiheads could share the same > type, iff we had dissallowed '\n' in refnames. (we do? > otherwise FETCH_HEAD would be broken) > The differentiator would be the '\n' or '\0' at the end of the > first target. True, \n is not allowed in a ref name, and neither is \0. Symrefs in loose ref format use "ref: \n" as their content. We could use that format in reftable, and then HEAD and FETCH_HEAD could use the same value code, 0x3. >> index record >> >> An index record describes the last entry in another block. >> Index records are written as: >> >> varint( prefix_length ) >> varint( (suffix_length << 3) | 0 ) >> suffix >> varint( block_offset ) >> >> Index records use prefix compression exactly like `ref_record`. >> >> Index records store `block_offset` after the suffix, specifying the >> offset in bytes (from the start of the file) of the block that ends >> with this reference. > > Instead of hardcoding the "0" in the last 3 bits, maybe pick one > of the reserved bit patterns to be there? I would imagine this > makes debugging easier: > > 0x5? Hah that must be an index block I have been > looking at the wrong block! This is an excellent suggestion. I'll include it in the next iteration. >> ### Obj
Re: reftable [v3]: new ref storage format
On Sat, Jul 22, 2017 at 11:29 AM, Shawn Pearcewrote: > 3rd iteration of the reftable storage format. > > You can read a rendered version of this here: > https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md > > Significant changes from v2: > - efficient lookup by SHA-1 for allow-tip-sha1-in-want. I'll focus on that in the review, it sounds exciting. > - type 0x4 for FETCH_HEAD, MERGE_HEAD. So we'd have varint( (suffix_length << 3) | value_type ) which leaves us with 5 bits for the varint. One bit is needed to indicate if ew have more bytes coming in the varint, so only 4 bits in the first byte to encode the suffix length. That means we can have suffixes up to 15 and we'd still be fine with just one byte IIUC. That should be enough, specifically for contiguous numbers as ref names. > ### Block size > > The `block_size` is arbitrarily determined by the writer, and does not > have to be a power of 2. The block size must be larger than the > longest reference name or deflated log entry used in the repository, > as references cannot span blocks. > > Powers of two that are friendly to the virtual memory system or > filesystem (such as 4k or 8k) are recommended. Larger sizes (64k) can > yield better compression, with a possible increased cost incurred by > readers during access. > > The largest block size is `16777215` bytes (15.99 MiB). Thanks for calling this out, 16MB ought to be enough for everyone. (Not a joke, as we can have multiple blocks) > ### Ref block format ... > A variable number of 4-byte `restart_offset` values follow the > records. Offsets are relative to the start of the block and refer to > the first byte of any `ref_record` whose name has not been prefix > compressed. Readers can start linear scans from any of these records. > Offsets in the first block are relative to the start of the file > (position 0), and include the file header. This requires the first > restart in the first block to be at offset 8. > > The 2-byte `restart_count_m1` stores *one less* than the number of > entries in the `restart_offset` list. There is always a restart > corresponding to the first ref record. Readers are responsible for > computing `restart_count = restart_count_m1 + 1`. I had to reread these two paragraphs a couple of times as it calls out the uninteresting things[1] and the interesting part is the logical conclusion that one has to make themselves, here is how I would write it: Readers can start linear scans from any record whose name has not been prefix compressed. The first record of a block must not be prefix-compressed. To aid finding the entry points for linear scans, a variable number of 4-byte `restart_offset` values follow the records. Offsets are relative to the start of the block and refer to the first byte of any such `ref_record` that is not prefix compressed. The first record can be omitted in the `restart_offset` values as it is implicit. The `restart_offset' values must be sorted (ascending, descending?). The 2-byte `restart_count_m1` stores the number of optional entry points, i.e. the number of values in `restart_offset'. As the first record is omitted in the offset table, there is at least one entry at 0. the `restart_count_m1` can be zero for best compression. [1] The last paragraph in my proposal sounds less like an off-by-one error to me, but just states what the number is and how it relates to the surroundings in the file format. ... > The `value` follows. Its format is determined by `value_type`, one of > the following: > > - `0x0`: deletion; no value data (see transactions, below) > - `0x1`: one 20-byte object id; value of the ref > - `0x2`: two 20-byte object ids; value of the ref, peeled target Up to here it is easy to spot a pattern: The number indicates how many oids follow. > - `0x3`: symbolic reference: `varint( target_len ) target` > - `0x4`: length delimited extension: `varint( data_len ) data` This breaks the pattern. Instead of hardcoding the numbers here, I wonder if we rather want to make the bits more meaningful: bit 0, 1: number of oids iff bit 2 unset Iff bit 2 set, then we have a "varint (len) data" that follows, bits 0,1 are used for a different purpose, 00 indicates 'symlink' and data is the string 01 indicates 'multihead' such as FETCH_HEAD 1* is reserved for now. This *may* be neat micro optimization, but hardcoding all bits to a lookup table is fine, too. Note that symrefs and multiheads could share the same type, iff we had dissallowed '\n' in refnames. (we do? otherwise FETCH_HEAD would be broken) The differentiator would be the '\n' or '\0' at the end of the first target. > Types `0x5..0x7` are reserved. The proposal above would make 0x3, 0x6, 0x7 be reserved. > ### Ref index ... > > An index block should only be written if there are at least 4 blocks > in the file, as cold
Re: reftable [v3]: new ref storage format
On Sat, Jul 22, 2017 at 8:29 PM, Shawn Pearcewrote: > 3rd iteration of the reftable storage format. > > You can read a rendered version of this here: > https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md > > Significant changes from v2: > - efficient lookup by SHA-1 for allow-tip-sha1-in-want. > - type 0x4 for FETCH_HEAD, MERGE_HEAD. > - file size up (27.7 M in v1, 34.4 M in v3) I had some feedback on v2 here which still applies: https://public-inbox.org/git/87k234tti7@gmail.com/ It would be good to either get a reply to that, or if you don't think it's sensible for whatever reason and left it out of v3 have a "feedback received but discarded for " in these summaries as you're sending new versions. Aside from the mail I sent I think that would be very useful in general if there's been any other such feedback (I honestly don't know if there has, I haven't been following this actively).
Re: reftable [v3]: new ref storage format
3rd iteration of the reftable storage format. You can read a rendered version of this here: https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md Significant changes from v2: - efficient lookup by SHA-1 for allow-tip-sha1-in-want. - type 0x4 for FETCH_HEAD, MERGE_HEAD. - file size up (27.7 M in v1, 34.4 M in v3) The file size increase is due to lookup by SHA-1 support. By using unique abbreviations its adding about 7 MiB to the file size for 865,258 objects behind 866,456 refs. Average entry for this direction costs 8 bytes, using a 6 byte/12 hex unique abbreviation. ## Overview ### Problem statement Some repositories contain a lot of references (e.g. android at 866k, rails at 31k). The existing packed-refs format takes up a lot of space (e.g. 62M), and does not scale with additional references. Lookup of a single reference requires linearly scanning the file. Atomic pushes modifying multiple references require copying the entire packed-refs file, which can be a considerable amount of data moved (e.g. 62M in, 62M out) for even small transactions (2 refs modified). Repositories with many loose references occupy a large number of disk blocks from the local file system, as each reference is its own file storing 41 bytes (and another file for the corresponding reflog). This negatively affects the number of inodes available when a large number of repositories are stored on the same filesystem. Readers can be penalized due to the larger number of syscalls required to traverse and read the `$GIT_DIR/refs` directory. ### Objectives - Near constant time lookup for any single reference, even when the repository is cold and not in process or kernel cache. - Near constant time verification a SHA-1 is referred to by at least one reference (for allow-tip-sha1-in-want). - Efficient lookup of an entire namespace, such as `refs/tags/`. - Support atomic push `O(size_of_update)` operations. - Combine reflog storage with ref storage. ### Description A reftable file is a portable binary file format customized for reference storage. References are sorted, enabling linear scans, binary search lookup, and range scans. Storage in the file is organized into blocks. Prefix compression is used within a single block to reduce disk space. Block size is tunable by the writer. ### Performance Space used, packed-refs vs. reftable: repository | packed-refs | reftable | % original | avg ref | avg obj ---|:|-:|---:|-:|: android| 62.2 M | 34.4 M | 55.2% | 33 bytes | 8 bytes rails | 1.8 M |1.1 M | 57.7% | 29 bytes | 6 bytes git| 78.7 K | 44.0 K | 60.0% | 50 bytes | 6 bytes git (heads)| 332 b |239 b | 72.0% | 31 bytes | 0 bytes Scan (read 866k refs), by reference name lookup (single ref from 866k refs), and by SHA-1 lookup (refs with that SHA-1, from 866k refs): format | scan| by name| by SHA-1 |:|---:|---: packed-refs | 402 ms | 409,660.1 usec | 412,535.8 usec reftable| 112 ms | 42.7 usec | 340.8 usec Space used for 149,932 log entries for 43,061 refs, reflog vs. reftable: format| size | avg log --|--:|---: $GIT_DIR/logs | 173 M | 1209 bytes reftable | 4 M | 30 bytes ## Details ### Peeling References in a reftable are always peeled. ### Reference name encoding Reference names should be encoded with UTF-8. ### Network byte order All multi-byte, fixed width fields are in network byte order. ### Ordering Blocks are lexicographically ordered by their first reference. ### Directory/file conflicts The reftable format accepts both `refs/heads/foo` and `refs/heads/foo/bar` as distinct references. This property is useful for retaining log records in reftable, but may confuse versions of Git using `$GIT_DIR/refs` directory tree to maintain references. Users of reftable may choose to continue to reject `foo` and `foo/bar` type conflicts to prevent problems for peers. ## File format ### Structure A reftable file has the following high-level structure: first_block { header first_ref_block } ref_blocks* ref_index? obj_blocks* obj_index? log_blocks* log_index? footer ### Block size The `block_size` is arbitrarily determined by the writer, and does not have to be a power of 2. The block size must be larger than the longest reference name or deflated log entry used in the repository, as references cannot span blocks. Powers of two that are friendly to the virtual memory system or filesystem (such as 4k or 8k) are recommended. Larger sizes (64k) can yield better compression, with a possible increased cost incurred by readers during access. The largest block size is `16777215` bytes (15.99 MiB). ### Header An 8-byte header appears at the beginning of the file: '\1REF' uint8(