[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved

Owen O'Malley (JIRA) Sun, 21 Jul 2013 13:05:44 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714795#comment-13714795
 ]


Owen O'Malley commented on HIVE-4123:
-------------------------------------

More comments:
* I don't see why bitpack reader/writer are more than static methods that 
read/write to the underlying stream. So I would have expected a method like 
writeInts(long[] data, int offset, int length, int numBits, OutputStream 
stream) and the corresponding one for reading.
* Utils.bytesToLongBE should take an input stream rather than a byte[].
* In IntegerCompressionReader:
** I'd write a method to translate the int into an opcode rather than use 
ordinal.
** It is probably worth remembering that you are in a repeat, so that you don't 
need to copy the value N times in short repeat.
** It may be easier to loop through the base values and then run through the 
patches. You might even do three loops: unpack the main values, unpack the 
patches, add the base to each value.
** For patched base, only the base is zigzag encoded. The rest of the values 
are always positive.
** For delta only the base and base delta are zigzag encoded. 
* In IntegerCompressionWriter:
** You should give more comments about the patched base encoding.
** Instead of sorting for the percentiles, you could keep a count of how many 
values use each number of bits.
** Replace the commented-out printlns with LOG.debug calls guarded by 
LOG.isDebugEnabled()
** flush should use if/then/else to prevent writing the data twice
** the constructor should probably call clear rather than risk having the 
default values be different
** in write, just copy the data with System.arraycopy instead of cloning the 
array
** write should track whether the values are monotonically increasing or 
decreasing so that we know if delta applies
** there is a lot of duplication of effort in determine encoding
** if the sequence is both non-decreasing and non-increasing, it is constant 
and we should use either short repeat or delta, depending on the length
** delta encoding should return before doing the percentile work
* How much unit test coverage do you have of the new code?
* Have you run the encoder/decoder round trip over the github data to test it?
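
A rough sketch of the static bit-packing helpers suggested in the first comment above. The class name BitPackSketch, the readInts signature, and the most-significant-bit-first layout are assumptions for illustration, not what the patch does:

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper class; names and bit order are assumptions.
final class BitPackSketch {
  private BitPackSketch() {}

  /** Writes data[offset..offset+length) using numBits (1..64) bits per value, MSB first. */
  static void writeInts(long[] data, int offset, int length, int numBits,
                        OutputStream stream) throws IOException {
    int current = 0;   // byte being assembled
    int bitsLeft = 8;  // free bits remaining in current
    for (int i = offset; i < offset + length; i++) {
      long value = data[i];
      int bitsToWrite = numBits;
      while (bitsToWrite > 0) {
        int n = Math.min(bitsToWrite, bitsLeft);
        // Take the top n of the bits that are still unwritten.
        long chunk = (value >>> (bitsToWrite - n)) & ((1L << n) - 1);
        current |= (int) (chunk << (bitsLeft - n));
        bitsToWrite -= n;
        bitsLeft -= n;
        if (bitsLeft == 0) {
          stream.write(current);
          current = 0;
          bitsLeft = 8;
        }
      }
    }
    if (bitsLeft != 8) {
      stream.write(current);  // final partial byte, zero padded
    }
  }

  /** Reads length values of numBits bits each into data[offset..offset+length). */
  static void readInts(long[] data, int offset, int length, int numBits,
                       InputStream stream) throws IOException {
    int current = 0;
    int bitsLeft = 0;  // unread bits remaining in current
    for (int i = offset; i < offset + length; i++) {
      long value = 0;
      int bitsToRead = numBits;
      while (bitsToRead > 0) {
        if (bitsLeft == 0) {
          current = stream.read();
          if (current < 0) {
            throw new EOFException("stream ended in the middle of a value");
          }
          bitsLeft = 8;
        }
        int n = Math.min(bitsToRead, bitsLeft);
        value = (value << n) | ((current >>> (bitsLeft - n)) & ((1 << n) - 1));
        bitsToRead -= n;
        bitsLeft -= n;
      }
      data[i] = value;
    }
  }
}
```

With this layout, eight 3-bit values occupy exactly three bytes, and neither method keeps any state between calls.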

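The percentile comment for IntegerCompressionWriter could look roughly like the following: count how many values need each bit width, then walk the counts. BitWidthHistogramSketch, bitsNeeded, and percentileBits are hypothetical names, and the values are assumed to be non-negative already (for example after subtracting the base):

```java
// Hypothetical sketch; names are assumptions, not taken from the patch.
final class BitWidthHistogramSketch {
  private BitWidthHistogramSketch() {}

  /** Bits needed to hold a non-negative value; zero is counted as 1 bit. */
  static int bitsNeeded(long value) {
    return Math.max(1, 64 - Long.numberOfLeadingZeros(value));
  }

  /**
   * Smallest bit width that covers at least fraction p (0 < p <= 1) of the
   * values. One pass over the data plus a walk over 64 counters, no sort.
   */
  static int percentileBits(long[] values, int offset, int length, double p) {
    int[] counts = new int[65];  // counts[b] = number of values needing b bits
    for (int i = offset; i < offset + length; i++) {
      counts[bitsNeeded(values[i])]++;
    }
    int needed = (int) Math.ceil(p * length);
    int seen = 0;
    for (int bits = 1; bits <= 64; bits++) {
      seen += counts[bits];
      if (seen >= needed) {
        return bits;
      }
    }
    return 64;
  }
}
```

Calling percentileBits with p = 1.0 gives the width of the largest value, so the same histogram answers both the "fits everything" and the "fits most values" questions.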

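For the comments on tracking monotonicity in write and on having the constructor call clear, a minimal sketch. RunShapeSketch and its method names are made up for illustration; the real writer would keep these flags next to its buffered values:

```java
// Hypothetical sketch; class and method names are assumptions.
final class RunShapeSketch {
  private boolean nonDecreasing;
  private boolean nonIncreasing;
  private long previous;
  private int count;

  RunShapeSketch() {
    clear();  // one place defines the initial state
  }

  /** Resets the tracker so it can be reused for the next run. */
  void clear() {
    nonDecreasing = true;
    nonIncreasing = true;
    previous = 0;
    count = 0;
  }

  /** Called once per value as it is buffered. */
  void add(long value) {
    if (count > 0) {
      if (value < previous) {
        nonDecreasing = false;
      }
      if (value > previous) {
        nonIncreasing = false;
      }
    }
    previous = value;
    count++;
  }

  /** True when every value seen so far is equal. */
  boolean isConstant() {
    return count > 0 && nonDecreasing && nonIncreasing;
  }

  /** True when the values never change direction, so delta is a candidate. */
  boolean isMonotonic() {
    return count > 0 && (nonDecreasing || nonIncreasing);
  }
}
```

Because the flags are updated as values arrive, the encoding choice can check isConstant and isMonotonic first and return early, before any percentile work.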
                
> The RLE encoding for ORC can be improved
> ----------------------------------------
>
>                 Key: HIVE-4123
>                 URL: https://issues.apache.org/jira/browse/HIVE-4123
>             Project: Hive
>          Issue Type: New Feature
>          Components: File Formats
>            Reporter: Owen O'Malley
>            Assignee: Prasanth J
>         Attachments: HIVE-4123.1.git.patch.txt, HIVE-4123.2.git.patch.txt, 
> ORC-Compression-Ratio-Comparison.xlsx
>
>
> The run length encoding of integers can be improved:
> * tighter bit packing
> * allow delta encoding
> * allow longer runs

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
