> Where does the 2x IO drop come from? Based on Cheng Xu’s data, Split +
> Zstd has ~15% improvement over PlainV2 + Zstd in terms of the file size.
That was from my measurements on TPC-DS - from Cheng Xu's excel sheet, let me call out two columns from TPC-DS store_sales here (price & discount).

For LIST_PRICE:
FLIP+ZLIB was 73.66% of original
SPLIT+ZLIB was 30.87% of original

For DISCOUNT_AMT:
FLIP+ZLIB was 24.79% of original
SPLIT+ZLIB was 11.14% of original

On Zstd, the gap is much bigger.

For LIST_PRICE:
FLIP+ZSTD was 40.08% of original
SPLIT+ZSTD was 7.43% of original

For DISCOUNT_AMT:
FLIP+ZSTD was 9.05% of original
SPLIT+ZSTD was 1.02% of original

(A quick ratio check on those ZSTD numbers is sketched below, after the sign-off.)

> The random IOPS would eventually determine the throughput of an HDD. The IO
> queue can build up quickly when there are too many seeks, and that drastically
> affects read/write performance. That's the major concern, and it's not
> related to locality.

There's no doubt that IOPS is a fundamental limit - my measurements say that the latency is elsewhere in the DFS impl & that the OS read-ahead is out-running the seeks.

Shuffle operations, however, are eating up my IOPS.

Cheers,
Gopal
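
P.S. For anyone who wants to reproduce the comparison: this is nothing beyond back-of-envelope arithmetic on the percentages quoted above from Cheng Xu's sheet (no new measurements, and the column names are just labels for those two store_sales columns):

    # % of original size under ZSTD, as quoted above
    flip_zstd  = {"LIST_PRICE": 40.08, "DISCOUNT_AMT": 9.05}
    split_zstd = {"LIST_PRICE": 7.43,  "DISCOUNT_AMT": 1.02}

    for col in flip_zstd:
        factor = flip_zstd[col] / split_zstd[col]
        print(f"{col}: SPLIT+ZSTD output is ~{factor:.1f}x smaller than FLIP+ZSTD")
    # LIST_PRICE ~5.4x, DISCOUNT_AMT ~8.9x - a long way from a ~15% gap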