Ok, I just put in a pull request for this:
https://github.com/apache/orc/pull/133
Let me know if anything is still unclear.
Thanks,
Owen
On Fri, Jun 16, 2017 at 12:19 PM, Dain Sundstrom <[email protected]> wrote:
> Recently I have been working on a custom writer for Presto and during this
> I kept notes on sections of the documentation that might have problems.
> Some of these may have already been addressed:
>
> ## Compression
> see https://orc.apache.org/docs/compression.html
>
> I think the hex sequence for 100000 compressed is [0x41 0x0D 0x03].
No, if it is compressed the low bit is 0. It ends up with:
2 * 100,000 + 0 = 0x30d40
> Also, it is not clear if compressed length is 2 bytes, or .
>
The header is always 3 bytes. I thought about adding a special case if the
chunk size was less than 32k, but didn't.
> ```
> Each header is 3 bytes long with (compressedLength * 2 + isOriginal)
> stored as a little endian value. For example, the header for a chunk that
> compressed to 100,000 bytes would be [0x40, 0x0d, 0x03]. The header for 5
> bytes that did not compress would be [0x0b, 0x00, 0x00].
> ```
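For reference, the header arithmetic in the quoted text can be sketched in Python as below. The function names `encode_chunk_header` and `decode_chunk_header` are made up for illustration and are not part of any ORC library. This is only a minimal sketch of the layout described above, not the actual reader or writer code.

```python
from typing import Tuple


def encode_chunk_header(chunk_length: int, is_original: bool) -> bytes:
    """Pack (chunk_length * 2 + is_original) into 3 little-endian bytes."""
    value = chunk_length * 2 + (1 if is_original else 0)
    if chunk_length < 0 or value >= 1 << 24:
        raise ValueError("chunk length does not fit in a 3-byte header")
    return value.to_bytes(3, "little")


def decode_chunk_header(header: bytes) -> Tuple[int, bool]:
    """Return (chunk_length, is_original) from a 3-byte header."""
    if len(header) != 3:
        raise ValueError("header must be exactly 3 bytes")
    value = int.from_bytes(header, "little")
    # The low bit is the isOriginal flag, the remaining bits are the length.
    return value >> 1, bool(value & 1)


# 100,000 compressed bytes: 2 * 100,000 + 0 = 0x30d40
print(encode_chunk_header(100_000, False).hex(" "))  # 40 0d 03
# 5 bytes stored uncompressed: 2 * 5 + 1 = 0x0b
print(encode_chunk_header(5, True).hex(" "))  # 0b 00 00
```

Decoding [0x41, 0x0d, 0x03] with this sketch gives a length of 100,000 with the low bit set, which would mark the chunk as original (uncompressed) rather than compressed.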
>
> This section is not clear:
> ```
> The default compression chunk size is 256K, but writers can choose their
> own value less than 223.
> ```
> Should that be 223K? If so, that seems strange since I would assume
> any value smaller than 256K is legit.
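One possible reading, offered as an assumption rather than something the quoted text states: "223" may be 2^23 with the superscript lost. That would be consistent with the header format, since 3 bytes hold 24 bits and the low bit is the isOriginal flag, leaving 23 bits for the length. A quick check of that arithmetic in Python (the constant names are made up for illustration):

```python
HEADER_BITS = 3 * 8            # the header is always 3 bytes
LENGTH_BITS = HEADER_BITS - 1  # the low bit is the isOriginal flag

# Largest length value that still fits in the remaining bits.
max_length = (1 << LENGTH_BITS) - 1

print(LENGTH_BITS)                       # 23
print(max_length)                        # 8388607
print((256 * 1024) < (1 << LENGTH_BITS)) # True: the 256K default fits
```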
>
>
> ## String encodings
> see https://orc.apache.org/docs/encodings.html#string-char-and-varchar-columns
>
> This first sentence seems to be describing a heuristic used by the default
> implementation.
>
> ## File tail
> The docs should make it clear that the maximum length stored for varchar
> and char is the maximum number of Unicode characters, and specifically not
> a byte count and not UTF-16 code units (which is what Java counts by default).
> ```
> // the maximum length of the type for varchar or char
> optional uint32 maximumLength = 4;
> ```
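To make the distinction concrete, here is a small Python sketch showing how the three counts differ for the same string. It takes "Unicode characters" to mean code points, which is an assumption. The sample string is arbitrary and the helper name `length_measures` is made up for illustration.

```python
def length_measures(text: str) -> dict:
    """Count a string three ways: code points, UTF-8 bytes, UTF-16 code units."""
    return {
        # Python's str length counts Unicode code points.
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # Each UTF-16 code unit is 2 bytes; this is what Java's String.length() counts.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


sample = "a\u00e9\U0001F600"  # 'a', 'e' with acute accent, and an emoji outside the BMP
print(length_measures(sample))
# {'code_points': 3, 'utf8_bytes': 7, 'utf16_units': 4}
```

Under that reading, a varchar(3) column could hold this sample even though it takes 7 bytes in UTF-8 and Java would report a length of 4.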
>
>