Douglas Drinka created ORC-144:
----------------------------------

             Summary: PATCHED BASE Documentation Issues
                 Key: ORC-144
                 URL: https://issues.apache.org/jira/browse/ORC-144
             Project: Orc
          Issue Type: Bug
          Components: documentation
            Reporter: Douglas Drinka
            Priority: Minor


The documentation for Patched Base encoding has two issues.

First, the line "Data values (W * L bits padded to the byte)..." appears twice 
in the data field description.

Second is the example given.  The sample data for each of the other encoding 
formats actually triggers its encoder under the logic in the Java code.  
This example sequence, however, is too short to trigger either the 90% cutoff 
for non-rebased data ((1.0 - 0.9) * 10 = 0.99999999999999978, which floors to 
0) or the 95% cutoff for rebased data.  At least 20 values are needed for a 
single patch to occur.
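
The arithmetic behind that claim can be reproduced with a small Python sketch. 
It assumes the Java code truncates length * (1.0 - p) to an int, as the 
calculation above implies; the function name tail_count is hypothetical and 
not taken from the ORC source.  Python floats are IEEE doubles, so the 
rounding matches Java's.

```python
def tail_count(length, p):
    """Values allowed above the p-th percentile (assumed Java-style int truncation)."""
    return int(length * (1.0 - p))

# The double nearest to 0.9 is slightly above 0.9, so the product falls just short of 1.
print((1.0 - 0.9) * 10)      # 0.9999999999999998

# 10 values: no value may exceed the 90% cutoff, so nothing can be patched.
print(tail_count(10, 0.90))  # 0

# The 95% cutoff first allows one outlier at 20 values.
print(tail_count(19, 0.95))  # 0
print(tail_count(20, 0.95))  # 1
```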

I propose the following sequence:
[2030, 2000, 2020, 1000000, 2040, 2050, 2060, 2070, 2080, 2090, 2100, 2110, 
2120, 2130, 2140, 2150, 2160, 2170, 2180, 2190]

This encodes to [0x8e, 0x13, 0x2b, 0x21, 0x07, 0xd0, 0x1e, 0x00, 0x14, 0x70, 
0x28, 0x32, 0x3c, 0x46, 0x50, 0x5a, 0x64, 0x6e, 0x78, 0x82, 0x8c, 0x96, 0xa0, 
0xaa, 0xb4, 0xbe, 0xfc, 0xe8]

Then in the description the wording should be "a length of 20 (19)".
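
As a cross-check, the proposed bytes can be decoded back into the proposed 
sequence.  The following is a minimal Python sketch of a Patched Base decoder 
based on my reading of the header layout in the ORC specification; it is not 
the Java implementation.  It assumes each patch list entry occupies exactly 
gap width + patch width bits (true here at 14 bits) and does not treat the 
special skip entries used for gaps larger than 255.  All function names are 
hypothetical.

```python
def decode_width(code):
    """Map a 5-bit width code to a bit width (width table from the ORC spec)."""
    if code <= 23:
        return code + 1
    return {24: 26, 25: 28, 26: 30, 27: 32, 28: 40, 29: 48, 30: 56, 31: 64}[code]

def read_bits(buf, bit_pos, width):
    """Read `width` bits, most significant bit first, starting at bit `bit_pos`."""
    value = 0
    for i in range(bit_pos, bit_pos + width):
        value = (value << 1) | ((buf[i // 8] >> (7 - i % 8)) & 1)
    return value

def decode_patched_base(buf):
    # 4-byte header: encoding (2 bits), W (5), L-1 (9), BW-1 (3), PW (5), PGW-1 (3), PLL (5)
    assert buf[0] >> 6 == 2, "not a Patched Base run"
    width = decode_width((buf[0] >> 1) & 0x1F)
    length = (((buf[0] & 1) << 8) | buf[1]) + 1
    base_bytes = (buf[2] >> 5) + 1
    patch_width = decode_width(buf[2] & 0x1F)
    gap_width = (buf[3] >> 5) + 1
    patch_count = buf[3] & 0x1F

    # Base value: big endian, with the most significant bit acting as a sign flag.
    pos = 4
    base = int.from_bytes(buf[pos:pos + base_bytes], "big")
    sign_mask = 1 << (base_bytes * 8 - 1)
    if base & sign_mask:
        base = -(base & ~sign_mask)
    pos += base_bytes

    # Data values: W * L bits, padded to a byte boundary.
    data = [read_bits(buf, pos * 8 + i * width, width) for i in range(length)]
    pos += (length * width + 7) // 8

    # Patch list: each entry is a gap followed by the high bits of the patched value.
    entry_width = gap_width + patch_width  # assumption: no rounding of entry width
    index = 0
    for i in range(patch_count):
        entry = read_bits(buf, pos * 8 + i * entry_width, entry_width)
        index += entry >> patch_width
        data[index] |= (entry & ((1 << patch_width) - 1)) << width

    return [base + v for v in data]

encoded = bytes([
    0x8e, 0x13, 0x2b, 0x21, 0x07, 0xd0, 0x1e, 0x00, 0x14, 0x70,
    0x28, 0x32, 0x3c, 0x46, 0x50, 0x5a, 0x64, 0x6e, 0x78, 0x82,
    0x8c, 0x96, 0xa0, 0xaa, 0xb4, 0xbe, 0xfc, 0xe8,
])
print(decode_patched_base(encoded))
```

The header decodes to an 8-bit data width, a length of 20, a 2-byte base of 
2000, a 12-bit patch width, a 2-bit gap width, and a single patch.  That patch 
has a gap of 3, so it restores the value 1000000 at index 3.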

These samples were critical for verifying my code, and I appreciated them 
being provided, particularly since I didn't find any unit tests in the Java 
code that directly compare the byte outputs of the encoders.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
