Douglas Drinka created ORC-144:
----------------------------------
Summary: PATCHED BASE Documentation Issues
Key: ORC-144
URL: https://issues.apache.org/jira/browse/ORC-144
Project: Orc
Issue Type: Bug
Components: documentation
Reporter: Douglas Drinka
Priority: Minor
The documentation for Patched Base encoding has two issues.
First, the data field description repeats the line "Data values (W * L bits
padded to the byte)...".
Second, the example itself is wrong. The sample data for each of the other
encoding formats actually triggers its encoder under the logic in the Java
code. This example sequence, however, is too short to trigger either the 90%
cutoff for non-rebased data ((1.0-.9)*10 = 0.99999999999999978, which floors
to 0) or the 95% cutoff for rebased data. At least 20 values are needed for a
single patch to occur.
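
The arithmetic above can be reproduced with a short sketch. It assumes the
cutoff count is the run length multiplied by (1.0 - percentile) and truncated
to an integer, as the calculation in this report suggests; the function name
allowed_outliers is hypothetical and is not taken from the Java code.

```python
def allowed_outliers(length, percentile):
    """Number of values that may fall above the percentile cutoff.

    Assumption: the count is length * (1.0 - percentile), truncated
    toward zero, using ordinary double-precision arithmetic.
    """
    return int(length * (1.0 - percentile))


# Compare the 10-value example with 19- and 20-value runs.
for n in (10, 19, 20):
    print(n, allowed_outliers(n, 0.90), allowed_outliers(n, 0.95))
```

Under that assumption, a run of 10 allows no outliers at either cutoff, and 20
is the shortest run for which both cutoffs allow one.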
I propose the following sequence:
[2030, 2000, 2020, 1000000, 2040, 2050, 2060, 2070, 2080, 2090, 2100, 2110,
2120, 2130, 2140, 2150, 2160, 2170, 2180, 2190]
This encodes to [0x8e, 0x13, 0x2b, 0x21, 0x07, 0xd0, 0x1e, 0x00, 0x14, 0x70,
0x28, 0x32, 0x3c, 0x46, 0x50, 0x5a, 0x64, 0x6e, 0x78, 0x82, 0x8c, 0x96, 0xa0,
0xaa, 0xb4, 0xbe, 0xfc, 0xe8]
The wording in the description should then be "a length of 20 (19)".
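
One way to cross-check the proposed bytes against the proposed sequence is a
small decoder. The sketch below is not the ORC reader. It assumes the Patched
Base layout as I understand it from the ORC specification: a 4-byte header
(2-bit encoding type, 5-bit width code, 9-bit length minus one, 3-bit base
width in bytes minus one, 5-bit patch width code, 3-bit patch gap width minus
one, 5-bit patch list length), then the base value, the packed data values,
and the packed patch list. All function names are hypothetical. It also
assumes each patch entry is exactly gap width + patch width bits wide, and it
ignores the special handling of gaps larger than 255.

```python
def decode_width(code):
    # Assumed 5-bit width table: codes 0..23 map to 1..24 bits,
    # larger codes map to the wider step sizes below.
    wide = {24: 26, 25: 28, 26: 30, 27: 32, 28: 40, 29: 48, 30: 56, 31: 64}
    return wide.get(code, code + 1)


def read_bits(data, bit_pos, width):
    # Read `width` bits, most significant bit first, starting at bit_pos.
    value = 0
    for i in range(bit_pos, bit_pos + width):
        bit = (data[i // 8] >> (7 - i % 8)) & 1
        value = (value << 1) | bit
    return value


def decode_patched_base(buf):
    if buf[0] >> 6 != 2:
        raise ValueError("not a Patched Base run")
    w = decode_width((buf[0] >> 1) & 0x1F)        # data value width in bits
    length = (((buf[0] & 1) << 8) | buf[1]) + 1   # stored as length - 1
    bw = (buf[2] >> 5) + 1                        # base width in bytes
    pw = decode_width(buf[2] & 0x1F)              # patch width in bits
    pgw = (buf[3] >> 5) + 1                       # patch gap width in bits
    pll = buf[3] & 0x1F                           # patch list length

    pos = 4
    raw = int.from_bytes(buf[pos:pos + bw], "big")
    sign = 1 << (bw * 8 - 1)                      # top bit is the sign bit
    base = -(raw & ~sign) if raw & sign else raw
    pos += bw

    data = buf[pos:]
    values = [read_bits(data, i * w, w) for i in range(length)]
    pos += (length * w + 7) // 8                  # data is padded to a byte

    patches = buf[pos:]
    entry = pgw + pw                              # assumed entry width
    index = 0
    for i in range(pll):
        e = read_bits(patches, i * entry, entry)
        index += e >> pw                          # gap from the previous patch
        values[index] |= (e & ((1 << pw) - 1)) << w
    return [base + v for v in values]


encoded = bytes([
    0x8e, 0x13, 0x2b, 0x21, 0x07, 0xd0, 0x1e, 0x00, 0x14, 0x70,
    0x28, 0x32, 0x3c, 0x46, 0x50, 0x5a, 0x64, 0x6e, 0x78, 0x82,
    0x8c, 0x96, 0xa0, 0xaa, 0xb4, 0xbe, 0xfc, 0xe8,
])
print(decode_patched_base(encoded))
```

Under these assumptions the header decodes to a width of 8 bits, a length of
20 (stored as 19), a 2-byte base of 2000, and one 12-bit patch at index 3, and
the output matches the proposed sequence.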
These samples were critical for me to verify my code, and I appreciated them
being provided, particularly since I didn't find any unit tests in the Java
code that directly compare the byte outputs of the encoders.