Re: ORC double encoding optimization proposal

Gopal Vijayaraghavan Sun, 25 Mar 2018 21:59:32 -0700

Hi,


> Since Split creates two separated streams, reading one data batch will need 
> an additional seek in order to reconstruct the column data

If you are seeing a seek like that, we've messed up something else higher up in 
the pipeline & that can be fixed.

ORC columnar reads only do random IO at the column level, not the stream level 
(except for non-column streams like the bloom filters) - adjacent streams are 
read together as a single IO op.

DiskRangeList produces a merged read plan before firing off any read, so the 
actual IO layer will (or should) never do a seek between adjacent streams.
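
To make the merging idea concrete, here is a minimal sketch in Python. It is not the actual DiskRangeList code from ORC; the function name merge_ranges, the max_gap parameter and the stream offsets are all made up for illustration. It only shows how byte ranges that touch each other collapse into one read before any IO is issued.

```python
from typing import List, Tuple

# (start, end) byte offsets within the file, end exclusive.
Range = Tuple[int, int]


def merge_ranges(ranges: List[Range], max_gap: int = 0) -> List[Range]:
    """Coalesce ranges that touch or overlap into single reads.

    Hypothetical helper, not part of ORC. max_gap is the number of
    unread bytes we tolerate between two ranges before giving up on
    merging them (0 means the ranges must be strictly adjacent).
    """
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Adjacent (or overlapping) with the previous range:
            # extend the previous read instead of starting a new one.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Assumed layout: one column split into two adjacent streams,
# followed by a stream of some other column further into the stripe.
streams = [(1000, 1400), (1400, 2200), (5000, 5600)]
plan = merge_ranges(streams)
print(plan)
```

With these made-up offsets the two adjacent streams end up in one read and only the distant stream needs its own IO. If the ranges were (0, 10) and (11, 20) instead, the one-byte hole would leave two separate reads unless max_gap is raised to 1.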

There's a possibility that someone will add an extra byte or something to a 
stream which they do not read ever, which might be a problem.

In early 2016 Rajesh & I went through each read IOP and tuned ORC for S3, which 
performs very poorly if you add irrelevant seeks.

If you do find a similar case in Apache ORC (not Hive-orc), I'll file a 
ticket corresponding to this one:

https://issues.apache.org/jira/browse/HIVE-13161

That was actually about reading 2 columns with an entirely NULL column in the 
middle, not exactly about splitting streams.

The next giant leap in IO performance for seeks is expected from a new HDFS 
API, which allows the scatter-gather to be pushed down further into the IO 
layer.

https://issues.apache.org/jira/browse/HADOOP-11867

This is mainly intended for reading ORC files from Erasure coded streams, where 
the IO layer can reorganize and align the reads along the Erasure Coding 
boundaries (not so much about actual IOPs), instead of assuming normal 
read-ahead for the block reader.

Cheers,
Gopal
