[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved

Prasanth J (JIRA) Wed, 24 Jul 2013 14:06:56 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718825#comment-13718825
 ]


Prasanth J commented on HIVE-4123:
----------------------------------

{quote}Comments:
merge Utils into SerializationUtils.
use the zigzag encode/decode in the the SerializationUtils.read/writeVslong
move Utils.nextLong to the test code
Utils.getTotalBytesRequired should just use long math. (n * numBits + 7) / 8 
should work
Rename IntegerCompressionReader/Writer to RunLengthIntegerReader/WriterV2 
{quote}

Done.

{quote}
Create an interface IntegerReader that has:
seek
next
skip
{quote}

Added hasNext() to interface as well.

{quote}
Make RunLengthIntegerReader and RunLengthIntegerReaderV2 implement IntegerReader
The TreeReaders should declare the fields as IntegerReader.
Each of the startStripe should use the encoding to create the right 
implementation of IntegerReader.
We should do the same with an IntegerWriter interface.
Replace fixedBitSizes with static methods in SerializationUtils:
static int encodeBitWidth(int n)
static int decodeBitWidth(int n)
{quote}

Done.

{quote}
Finding the percentiles seems expensive, we should look at an alternative
{quote}

Done.

{quote}
Why is the delta blob zigzag encoded? The sign should always be positive or 
negative for the entire run.
{quote}

Made the delta base field mandatory, blob is now directly bit packed.

{quote}
Maybe we could create an enum in the Writer that is the version to write that 
would look like enum OrcVersion { V0_11, V0_12 }
and the StreamFactory could provide the version to the TreeWriters.
{quote}

Not done (as per your last comment about passing factory object)

{quote}
I don't see why bitpack reader/writer are more than static methods that 
read/write to the underlying stream. So I would have expected a method like 
writeInts(long[] data, int offset, int length, int numBits, OutputStream 
stream) and the corresponding one for reading.
{quote}

Added as a separate static method. Can we reuse BitFieldReader/BitFieldWriter 
which essentially does the same thing (except it deals with ints)?

{quote}
Utils.bytesToLongBE should take an input stream rather than a byte[].
{quote}

Done.

{quote}
In IntegerCompressionReader:
I'd write a method to translate the int into an opcode rather than use ordinal.
It is probably worth remembering that you are in a repeat, so that you don't 
need to copy the value N times in short repeat.
{quote}

Done.

{quote}
It may be easier to loop through the base values and then run through the 
patches. You might even do three loops: unpack the main values, unpack the 
patches, add the base to each value.
{quote}

My initial implementation was running through 3 loops. But later I refactored 
it to do in a single loop. I think this current patch removed some complexity 
(removed zigzag and changed bitpacking).

{quote}
For patched based only the base is zigzag encoded. The rest of the values are 
always positive.
For delta only the base and base delta are zigzag encoded.
{quote}

Good catch. Updated the patch.

{quote}
In IntegerCompressionWriter:
You should give more comments about the patched base encoding.
Instead of sorting for the percentiles, you could keep a count of how many 
values use each number of bits.
{quote}

Done. Nice idea!

{quote}
Replace the commented out printlns with LOG.debug surrounded by 
LOG.ifDebugEnabled
flush should use if/then/else to prevent writing the data twice
the constructor should probably call clear rather than risk having the default 
values be different
in write, just copy the data with system.arraycopy instead of cloning the array
{quote}

Done. 

{quote}
write should track whether the values are monotonically increasing or 
decreasing so that we know if delta applies
there is a lot of duplication of effort in determine encoding
{quote}

write primarily deals with cutting the runs (determining the scope). There was 
some redundancy that I removed in the current patch. Also tracking min/max was 
wrong with the earlier which is fixed in the new patch. Earlier as and when a 
value is buffered min/max are updated. But this lead to wrong output in some 
cases. For example: 2 3 4 5 6 1 1 1 sequence has min value of 1, but this 1 is 
part of short repeat sequence. This same min value was used for initial delta 
run as well.

min/max/monotonicity/delta computation/percentile are determined while 
iterating through the buffered values.

{quote}
if the sequence is both increasing and decreasing, it is constant and we should 
either use short literal or delta depending on the length
delta encoding should return before doing the percentile work
{quote}

Currently, delta encoding returns before percentile computation. Short repeats 
are determined when buffering values. All other encodings are determined in 
determineEncoding(). 

{quote}
How much unit test coverage do you have of the new code?
{quote}

I have unit tests to cover the basic cases (runs that covers all the encodings) 
and unit tests for edges cases of each encodings. I also have a separate 
synthetic datagen utility that generates sorted/signed/unsigned/sparse/dense 
data. Should I add that datagen utility to unit tests as well?

{quote}
Have you run the encoder/decoder round trip over the github data to test it?
{quote}

Yes. I ran it over 237 columns of github data. Apart from github data, I ran it 
over aol querylog timestamp, Nasdaq stocks data, twitter api/search id dataset 
and few columns from httparchive dataset. 

I am still doing some more refactoring/code cleaning. I will post another patch 
in case of any changes. This patch covers most of the code review changes. 
                
> The RLE encoding for ORC can be improved
> ----------------------------------------
>
>                 Key: HIVE-4123
>                 URL: https://issues.apache.org/jira/browse/HIVE-4123
>             Project: Hive
>          Issue Type: New Feature
>          Components: File Formats
>            Reporter: Owen O'Malley
>            Assignee: Prasanth J
>         Attachments: HIVE-4123.1.git.patch.txt, HIVE-4123.2.git.patch.txt, 
> ORC-Compression-Ratio-Comparison.xlsx
>
>
> The run length encoding of integers can be improved:
> * tighter bit packing
> * allow delta encoding
> * allow longer runs

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved

Reply via email to