Re: Documentations issues

Owen O'Malley Mon, 19 Jun 2017 13:26:25 -0700

Ok, I just put in a pull request for this:

https://github.com/apache/orc/pull/133


Let me know if anything is still unclear.

Thanks,
    Owen

On Fri, Jun 16, 2017 at 12:19 PM, Dain Sundstrom <[email protected]> wrote:

> Recently I have been working on a custom writer for Presto and during this
> I kept notes on sections of the documentation that might have problems.
> Some of these may have already been addressed:
>
> ## Compression
> see https://orc.apache.org/docs/compression.html
>
> I think the hex sequence for 100000 compressed is [0x41 0x0D 0x03].


No, if it is compressed the low bit is 0. It ends up with:

2 * 100,000 + 0 = 0x30d40


> Also, it is not clear if the compressed length is 2 bytes or 3.
>

The header is always 3 bytes. I thought about adding a special case if the
chunk size was less than 32k, but didn't.
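
To make the arithmetic concrete, here is a minimal sketch of the header
packing in Python. The function names encode_chunk_header and
decode_chunk_header are made up for illustration and are not part of any
ORC library; the only thing taken from the docs is the formula
(length * 2 + isOriginal) stored as 3 little endian bytes.

```python
from typing import Tuple


def encode_chunk_header(length: int, is_original: bool) -> bytes:
    """Pack (length * 2 + is_original) into 3 little-endian bytes.

    Hypothetical helper for illustration only.
    """
    value = length * 2 + (1 if is_original else 0)
    if not 0 <= value < (1 << 24):
        raise ValueError("chunk length does not fit in a 3-byte header")
    return value.to_bytes(3, "little")


def decode_chunk_header(header: bytes) -> Tuple[int, bool]:
    """Return (length, is_original) from a 3-byte header."""
    if len(header) != 3:
        raise ValueError("header must be exactly 3 bytes")
    value = int.from_bytes(header, "little")
    # The low bit is the isOriginal flag; the remaining 23 bits are the length.
    return value >> 1, bool(value & 1)


# 2 * 100,000 + 0 = 0x30d40, stored low byte first.
compressed = encode_chunk_header(100_000, is_original=False)
# 2 * 5 + 1 = 0x0b for 5 bytes that did not compress.
original = encode_chunk_header(5, is_original=True)
```

Since one of the 24 bits is the flag, the length field can only hold
values below 2^23. That may be what the "223" in the quoted docs was
meant to say, but that is my reading and not something stated above.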


> ```
> Each header is 3 bytes long with (compressedLength * 2 + isOriginal)
> stored as a little endian value.   For example, the header for a chunk that
> compressed to 100,000 bytes would be [0x40, 0x0d, 0x03]. The header for 5
> bytes that did not compress would be [0x0b, 0x00, 0x00].
> ```
>
> This section is not clear:
> ```
> The default compression chunk size is 256K, but writers can choose their
> own value less than 223.
> ```
> Should that be 223K?  If so, that seems strange since I would assume
> any value smaller than 256K is legit.
>
>
> ## String encodings
> see https://orc.apache.org/docs/encodings.html#string-char-and-varchar-columns
>
> This first sentence seems to be describing a heuristic used by the default
> implementation.
>
> ## File tail
> The docs should make it clear that the maximum length stored for varchar
> and char is the maximum number of Unicode characters, and specifically not
> a byte count and not UTF-16 sequences (like Java does by default).
> ```
> // the maximum length of the type for varchar or char
>  optional uint32 maximumLength = 4;
> ```
>
>
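
For anyone unsure why the distinction matters, a small Python sketch of
the three ways one string can be measured. The helper name
length_measures and the sample string are assumptions for illustration;
nothing here comes from the ORC code.

```python
def length_measures(text: str) -> dict:
    """Return the length of text in three different units.

    Hypothetical helper for illustration only.
    """
    return {
        # Python's len() on str counts Unicode code points.
        "code_points": len(text),
        # Size of the UTF-8 encoding in bytes.
        "utf8_bytes": len(text.encode("utf-8")),
        # Number of 16-bit units, which is what Java's String.length() reports.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


# "naive" with a diaeresis on the i, a space, and one emoji outside the BMP.
sample = "na\u00efve \U0001F600"
measures = length_measures(sample)
```

For this sample the three counts all differ, so a varchar(7) limit would
accept or reject the value depending on which unit a writer picks.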
