Hi Gopal,

The ORC spec doesn’t guarantee that streams belonging to the same column are 
stored together. Even if that were guaranteed, there are reasons why we cannot 
read adjacent streams with one single IO -

1. Streams can be large. Reading the whole stream(s) will add unnecessary 
memory pressure.
2. Under seek or predicate-pushdown scenarios, there’s no need to load the 
entire stream.
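
Point 2 can be sketched roughly as follows. This is an illustrative model only, 
not ORC reader code; the row-group byte ranges and the filter result are made 
up for the example:

```python
# Hypothetical sketch: with predicate pushdown, only some row groups
# survive filtering, so only their byte sub-ranges of a stream need
# to be read, rather than the entire stream.

def ranges_to_read(rowgroup_ranges, selected):
    """Keep only the (offset, length) ranges of row groups that passed the filter."""
    return [r for i, r in enumerate(rowgroup_ranges) if i in selected]

# A 1 MB stream split into four 256 KB row-group ranges (made-up numbers):
ranges = [(0, 262_144), (262_144, 262_144), (524_288, 262_144), (786_432, 262_144)]
needed = ranges_to_read(ranges, {1, 3})  # suppose only row groups 1 and 3 match
print(needed)
```

So a filter that keeps two of four row groups means half the stream's bytes 
never need to be fetched at all.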

So in a lot of cases, the reader will read fixed-size chunks from multiple 
streams and reconstruct the column data. That’s also what I see in the Apache 
ORC Java and C++ readers. The more streams we have, the more fragmented the IO 
pattern becomes. 
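
To make the fragmentation concrete, here is a simplified model (not the actual 
ORC reader code; the chunk size and stream offsets are made up) of per-stream 
chunked reads, and of coalescing adjacent byte ranges into one read plan the 
way DiskRangeList-style planning does:

```python
# Simplified model of fragmented per-stream chunked reads vs. a merged
# read plan. All sizes and offsets are invented for illustration.

CHUNK = 256 * 1024  # hypothetical fixed read-chunk size

def chunked_reads(streams):
    """Split each stream (offset, length) into fixed-size chunk reads."""
    reads = []
    for off, length in streams:
        pos, end = off, off + length
        while pos < end:
            n = min(CHUNK, end - pos)
            reads.append((pos, n))
            pos += n
    return reads

def merge_adjacent(reads):
    """Coalesce byte ranges that touch, producing a merged read plan."""
    merged = []
    for off, n in sorted(reads):
        if merged and merged[-1][0] + merged[-1][1] >= off:
            last_off, last_n = merged[-1]
            merged[-1] = (last_off, max(last_off + last_n, off + n) - last_off)
        else:
            merged.append((off, n))
    return merged

# Three adjacent streams of one column (offsets/lengths are made up):
streams = [(0, 600_000), (600_000, 300_000), (900_000, 100_000)]
reads = chunked_reads(streams)   # 6 separate chunk reads
plan = merge_adjacent(reads)     # collapses to 1 contiguous range
print(len(reads), len(plan))
```

When the streams happen to be adjacent, merging collapses everything into one 
IO op; when they are scattered, each gap forces another seek, which is where 
the extra streams hurt.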


> On Mar 26, 2018, at 12:59 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote:
> 
> Hi,
> 
> 
>> Since Split creates two separated streams, reading one data batch will need 
>> an additional seek in order to reconstruct the column data
> 
> If you are seeing a seek like that, we've messed up something else higher up 
> in the pipeline & that can be fixed.
> 
> ORC columnar reads only do random IO at the column level, not the stream 
> level (except for non-column streams like the bloom filters) - adjacent 
> streams are read together as a single IO op.
> 
> DiskRangeList produces a merged read plan before firing off any read, so the 
> actual IO layer will (or should) never do a seek between adjacent streams.
> 
> There's a possibility that someone will add an extra byte or something to a 
> stream which they never read, which might be a problem.
> 
> In early 2016 Rajesh & I went through each read IOP and tuned ORC for S3, 
> which performs very poorly if you add irrelevant seeks.
> 
> If you do find a similar case in Apache ORC (not Hive-orc), I'll file a 
> ticket corresponding to this one:
> 
> https://issues.apache.org/jira/browse/HIVE-13161
> 
> That was actually about reading 2 columns with an entirely NULL column in the 
> middle, not exactly about splitting streams.
> 
> The next giant leap of IO performance for seeks is expected from a new HDFS 
> API, which allows for the scatter-gather to be pushed-down further into the 
> IO layer.
> 
> https://issues.apache.org/jira/browse/HADOOP-11867
> 
> This is mainly intended for reading ORC files from Erasure-coded streams, where 
> the IO layer can reorganize and align the reads along the Erasure Coding 
> boundaries (not so much about actual IOPs), instead of assuming normal 
> read-ahead for the block reader.
> 
> Cheers,
> Gopal
> 