Hi Gopal, the ORC spec doesn't guarantee that streams belonging to the same column are stored together. Even if that were guaranteed, there are reasons why we cannot read adjacent streams with a single IO:
1. Streams can be large. Reading the whole stream(s) adds unnecessary memory pressure.
2. Under seek or predicate pushdown scenarios, there is no need to load the entire stream.

So in many cases the reader will read fixed-size chunks from multiple streams and reconstruct the column data; that is also what I see in the Apache ORC Java and C++ readers. The more streams we have, the more fragmented the IO pattern becomes (see the sketch after the quoted message below).

> On Mar 26, 2018, at 12:59 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>
> Hi,
>
>> Since Split creates two separated streams, reading one data batch will need
>> an additional seek in order to reconstruct the column data
>
> If you are seeing a seek like that, we've messed up something else higher up
> in the pipeline & that can be fixed.
>
> ORC columnar reads only do random IO at the column level, not the stream
> level (except for non-column streams like the bloom filters) - adjacent
> streams are read together as a single IO op.
>
> DiskRangeList produces a merged read plan before firing off any read, so the
> actual IO layer will (or should) never do a seek between adjacent streams.
>
> There's a possibility that someone will add an extra byte or something to a
> stream which they do not read ever, which might be a problem.
>
> In early 2016 Rajesh & I went through each read IOP and tuned ORC for S3,
> which performs very poorly if you add irrelevant seeks.
>
> If you do find a similar case in Apache ORC (not Hive-orc), I'll file a
> corresponding ticket to this:
>
> https://issues.apache.org/jira/browse/HIVE-13161
>
> That was actually about reading 2 columns with an entirely NULL column in the
> middle, not exactly about splitting streams.
>
> The next giant leap of IO performance for seeks is expected from a new HDFS
> API, which allows the scatter-gather to be pushed down further into the
> IO layer:
>
> https://issues.apache.org/jira/browse/HADOOP-11867
>
> This is mainly intended for reading ORC files from erasure-coded streams, where
> the IO layer can reorganize and align the reads along the erasure coding
> boundaries (not so much about actual IOPs), instead of assuming normal
> read-ahead for the block reader.
>
> Cheers,
> Gopal
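To make the fragmented-IO point concrete, here is a minimal, self-contained sketch in plain Java (not the actual DiskRangeList code in ORC; class and method names are made up for illustration) of the kind of range coalescing a columnar reader does: requested byte ranges that sit back to back collapse into one IO op, while a gap left by a skipped stream or a chunk skipped by predicate pushdown splits the plan into separate reads.

import java.util.ArrayList;
import java.util.List;

// Sketch only: coalesce requested byte ranges into a read plan.
public class ReadPlanSketch {

    // A byte range [offset, offset + length) that the reader needs.
    static final class Range {
        final long offset;
        final long length;
        Range(long offset, long length) { this.offset = offset; this.length = length; }
        long end() { return offset + length; }
        @Override public String toString() { return "[" + offset + ", " + end() + ")"; }
    }

    // Merge ranges (sorted by offset) whose gap is at most maxGap bytes.
    // Each entry of the result corresponds to one IO op.
    static List<Range> merge(List<Range> sorted, long maxGap) {
        List<Range> plan = new ArrayList<>();
        Range current = null;
        for (Range r : sorted) {
            if (current != null && r.offset - current.end() <= maxGap) {
                // Adjacent (or close enough): extend the current IO op.
                long newEnd = Math.max(current.end(), r.end());
                current = new Range(current.offset, newEnd - current.offset);
            } else {
                if (current != null) plan.add(current);
                current = r;
            }
        }
        if (current != null) plan.add(current);
        return plan;
    }

    public static void main(String[] args) {
        // Two streams of one column stored back to back: a single IO op.
        System.out.println(merge(List.of(new Range(0, 4096), new Range(4096, 1024)), 0));
        // prints [[0, 5120)]

        // Chunked reads under seek/predicate pushdown: the gap between the
        // chunks forces a second IO op (or an extra seek within the stream).
        System.out.println(merge(List.of(new Range(0, 1024), new Range(8192, 1024)), 0));
        // prints [[0, 1024), [8192, 9216)]
    }
}

The real readers of course work against stripe/stream metadata rather than raw offsets, and typically allow a small gap so that nearly adjacent streams still go out as one read; the point is simply that interleaved or split streams produce more entries in the plan, i.e. more IO ops.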