Re: ORC double encoding optimization proposal

Gopal Vijayaraghavan Sun, 25 Mar 2018 23:47:52 -0700

>    2. Under seek or predicate pushdown scenario, there’s no need to load the 
> entire stream.
 
Yes, that is a valid scenario where the reader reads partial-streams & causes 
random IO.


The current double encoding is actually 2 streams today & will continue to use 
2 streams for the FLIP implementation.

The SPLIT implementation will go from the current 2 streams to 4 streams (i.e. 
1+1 -> 1+3 streams) & the total data IO will drop by ~2x. More so if one of 
the streams can be suppressed (like in my IoT data-set, where the sign-bit is 
always positive for my electric meter data).
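
To make the stream-suppression point concrete, here is a rough sketch (not the
proposed ORC writer). It assumes the 3 data streams hold the sign bit, the
exponent & the mantissa of each IEEE 754 double; that layout and all the
function names below are my assumptions for illustration only.

```python
import struct

MANTISSA_BITS = 52
MANTISSA_MASK = (1 << MANTISSA_BITS) - 1
EXPONENT_MASK = 0x7FF


def split_double(value):
    """Split one IEEE 754 double into (sign, exponent, mantissa) integers."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    sign = bits >> 63
    exponent = (bits >> MANTISSA_BITS) & EXPONENT_MASK
    mantissa = bits & MANTISSA_MASK
    return sign, exponent, mantissa


def join_double(sign, exponent, mantissa):
    """Inverse of split_double: rebuild the double from its three fields."""
    bits = (sign << 63) | (exponent << MANTISSA_BITS) | mantissa
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def encode_column(values):
    """Hypothetical writer: one list per stream, sign dropped if all zero."""
    signs, exponents, mantissas = [], [], []
    for value in values:
        sign, exponent, mantissa = split_double(value)
        signs.append(sign)
        exponents.append(exponent)
        mantissas.append(mantissa)
    streams = {"exponent": exponents, "mantissa": mantissas}
    if any(signs):
        # Only pay for the sign stream when some value is negative.
        streams["sign"] = signs
    return streams


def decode_column(streams):
    """Hypothetical reader: a missing sign stream means all values are >= 0."""
    count = len(streams["exponent"])
    signs = streams.get("sign", [0] * count)
    return [
        join_double(s, e, m)
        for s, e, m in zip(signs, streams["exponent"], streams["mantissa"])
    ]
```

With all-positive meter readings such as [230.1, 229.8, 231.4], encode_column
emits only the exponent & mantissa streams, and decode_column still
round-trips the values exactly.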

The trade-offs seem to be working out on regular HDDs with locality & for LLAP 
SSD caches - if your use-cases are different, I'd like to hear more about it.

The only significant random IO delays expected seem to be entirely within the 
HDFS API network hops (which offer 0% locality when data is erasure coded or 
stored in cloud-storage), which I hope to fix in the Hadoop-3.x branch with a 
new API.

Cheers,
Gopal
