[
https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718825#comment-13718825
]
Prasanth J commented on HIVE-4123:
----------------------------------
{quote}Comments:
merge Utils into SerializationUtils.
use the zigzag encode/decode in the the SerializationUtils.read/writeVslong
move Utils.nextLong to the test code
Utils.getTotalBytesRequired should just use long math. (n * numBits + 7) / 8
should work
Rename IntegerCompressionReader/Writer to RunLengthIntegerReader/WriterV2
{quote}
Done.
{quote}
Create an interface IntegerReader that has:
seek
next
skip
{quote}
Added hasNext() to interface as well.
{quote}
Make RunLengthIntegerReader and RunLengthIntegerReaderV2 implement IntegerReader
The TreeReaders should declare the fields as IntegerReader.
Each of the startStripe should use the encoding to create the right
implementation of IntegerReader.
We should do the same with an IntegerWriter interface.
Replace fixedBitSizes with static methods in SerializationUtils:
static int encodeBitWidth(int n)
static int decodeBitWidth(int n)
{quote}
Done.
{quote}
Finding the percentiles seems expensive, we should look at an alternative
{quote}
Done.
{quote}
Why is the delta blob zigzag encoded? The sign should always be positive or
negative for the entire run.
{quote}
Made the delta base field mandatory, blob is now directly bit packed.
{quote}
Maybe we could create an enum in the Writer that is the version to write that
would look like enum OrcVersion { V0_11, V0_12 }
and the StreamFactory could provide the version to the TreeWriters.
{quote}
Not done (as per your last comment about passing factory object)
{quote}
I don't see why bitpack reader/writer are more than static methods that
read/write to the underlying stream. So I would have expected a method like
writeInts(long[] data, int offset, int length, int numBits, OutputStream
stream) and the corresponding one for reading.
{quote}
Added as a separate static method. Can we reuse BitFieldReader/BitFieldWriter
which essentially does the same thing (except it deals with ints)?
{quote}
Utils.bytesToLongBE should take an input stream rather than a byte[].
{quote}
Done.
{quote}
In IntegerCompressionReader:
I'd write a method to translate the int into an opcode rather than use ordinal.
It is probably worth remembering that you are in a repeat, so that you don't
need to copy the value N times in short repeat.
{quote}
Done.
{quote}
It may be easier to loop through the base values and then run through the
patches. You might even do three loops: unpack the main values, unpack the
patches, add the base to each value.
{quote}
My initial implementation was running through 3 loops. But later I refactored
it to do in a single loop. I think this current patch removed some complexity
(removed zigzag and changed bitpacking).
{quote}
For patched based only the base is zigzag encoded. The rest of the values are
always positive.
For delta only the base and base delta are zigzag encoded.
{quote}
Good catch. Updated the patch.
{quote}
In IntegerCompressionWriter:
You should give more comments about the patched base encoding.
Instead of sorting for the percentiles, you could keep a count of how many
values use each number of bits.
{quote}
Done. Nice idea!
{quote}
Replace the commented out printlns with LOG.debug surrounded by
LOG.ifDebugEnabled
flush should use if/then/else to prevent writing the data twice
the constructor should probably call clear rather than risk having the default
values be different
in write, just copy the data with system.arraycopy instead of cloning the array
{quote}
Done.
{quote}
write should track whether the values are monotonically increasing or
decreasing so that we know if delta applies
there is a lot of duplication of effort in determine encoding
{quote}
write primarily deals with cutting the runs (determining the scope). There was
some redundancy that I removed in the current patch. Also tracking min/max was
wrong with the earlier which is fixed in the new patch. Earlier as and when a
value is buffered min/max are updated. But this lead to wrong output in some
cases. For example: 2 3 4 5 6 1 1 1 sequence has min value of 1, but this 1 is
part of short repeat sequence. This same min value was used for initial delta
run as well.
min/max/monotonicity/delta computation/percentile are determined while
iterating through the buffered values.
{quote}
if the sequence is both increasing and decreasing, it is constant and we should
either use short literal or delta depending on the length
delta encoding should return before doing the percentile work
{quote}
Currently, delta encoding returns before percentile computation. Short repeats
are determined when buffering values. All other encodings are determined in
determineEncoding().
{quote}
How much unit test coverage do you have of the new code?
{quote}
I have unit tests to cover the basic cases (runs that covers all the encodings)
and unit tests for edges cases of each encodings. I also have a separate
synthetic datagen utility that generates sorted/signed/unsigned/sparse/dense
data. Should I add that datagen utility to unit tests as well?
{quote}
Have you run the encoder/decoder round trip over the github data to test it?
{quote}
Yes. I ran it over 237 columns of github data. Apart from github data, I ran it
over aol querylog timestamp, Nasdaq stocks data, twitter api/search id dataset
and few columns from httparchive dataset.
I am still doing some more refactoring/code cleaning. I will post another patch
in case of any changes. This patch covers most of the code review changes.
> The RLE encoding for ORC can be improved
> ----------------------------------------
>
> Key: HIVE-4123
> URL: https://issues.apache.org/jira/browse/HIVE-4123
> Project: Hive
> Issue Type: New Feature
> Components: File Formats
> Reporter: Owen O'Malley
> Assignee: Prasanth J
> Attachments: HIVE-4123.1.git.patch.txt, HIVE-4123.2.git.patch.txt,
> ORC-Compression-Ratio-Comparison.xlsx
>
>
> The run length encoding of integers can be improved:
> * tighter bit packing
> * allow delta encoding
> * allow longer runs
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira