Hi Gopal, the ORC spec doesn't guarantee that streams belonging to the same column are stored together. Even if that were guaranteed, there are reasons why we cannot read adjacent streams with a single IO:
1. Streams can be large. Reading the whole stream(s) adds unnecessary memory pressure.
2. Under seek or predicate pushdown scenarios, there is no need to load the entire stream.

So in many cases the reader will read fixed-size chunks from multiple streams and reconstruct the column data; that is also what I see in the Apache ORC Java and C++ readers. The more streams we have, the more fragmented the IO pattern becomes (see the sketch after the quoted message below).

> On Mar 26, 2018, at 12:59 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>
> Hi,
>
>> Since Split creates two separated streams, reading one data batch will need
>> an additional seek in order to reconstruct the column data
>
> If you are seeing a seek like that, we've messed up something else higher up
> in the pipeline & that can be fixed.
>
> ORC columnar reads only do random IO at the column level, not the stream
> level (except for non-column streams like the bloom filters) - adjacent
> streams are read together as a single IO op.
>
> DiskRangeList produces a merged read plan before firing off any read, so the
> actual IO layer will (or should) never do a seek between adjacent streams.
>
> There's a possibility that someone will add an extra byte or something to a
> stream which they do not read ever, which might be a problem.
>
> In early 2016 Rajesh & I went through each read IOP and tuned ORC for S3,
> which performs very poorly if you add irrelevant seeks.
>
> If you do find a similar case in Apache ORC (not Hive-orc), I'll file a
> corresponding ticket to this:
>
> https://issues.apache.org/jira/browse/HIVE-13161
>
> That was actually about reading 2 columns with an entirely NULL column in the
> middle, not exactly about splitting streams.
>
> The next giant leap of IO performance for seeks is expected from a new HDFS
> API, which allows the scatter-gather to be pushed down further into the
> IO layer:
>
> https://issues.apache.org/jira/browse/HADOOP-11867
>
> This is mainly intended for reading ORC files from erasure-coded streams, where
> the IO layer can reorganize and align the reads along the erasure coding
> boundaries (not so much about actual IOPs), instead of assuming normal
> read-ahead for the block reader.
>
> Cheers,
> Gopal
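To make the fragmented-IO point concrete, here is a minimal, self-contained sketch in plain Java (not the actual DiskRangeList code in ORC; class and method names are made up for illustration) of the kind of range coalescing a columnar reader does: requested byte ranges that sit back to back collapse into one IO op, while a gap left by a skipped stream or a chunk skipped by predicate pushdown splits the plan into separate reads.

import java.util.ArrayList;
import java.util.List;

// Sketch only: coalesce requested byte ranges into a read plan.
public class ReadPlanSketch {

    // A byte range [offset, offset + length) that the reader needs.
    static final class Range {
        final long offset;
        final long length;
        Range(long offset, long length) { this.offset = offset; this.length = length; }
        long end() { return offset + length; }
        @Override public String toString() { return "[" + offset + ", " + end() + ")"; }
    }

    // Merge ranges (sorted by offset) whose gap is at most maxGap bytes.
    // Each entry of the result corresponds to one IO op.
    static List<Range> merge(List<Range> sorted, long maxGap) {
        List<Range> plan = new ArrayList<>();
        Range current = null;
        for (Range r : sorted) {
            if (current != null && r.offset - current.end() <= maxGap) {
                // Adjacent (or close enough): extend the current IO op.
                long newEnd = Math.max(current.end(), r.end());
                current = new Range(current.offset, newEnd - current.offset);
            } else {
                if (current != null) plan.add(current);
                current = r;
            }
        }
        if (current != null) plan.add(current);
        return plan;
    }

    public static void main(String[] args) {
        // Two streams of one column stored back to back: a single IO op.
        System.out.println(merge(List.of(new Range(0, 4096), new Range(4096, 1024)), 0));
        // prints [[0, 5120)]

        // Chunked reads under seek/predicate pushdown: the gap between the
        // chunks forces a second IO op (or an extra seek within the stream).
        System.out.println(merge(List.of(new Range(0, 1024), new Range(8192, 1024)), 0));
        // prints [[0, 1024), [8192, 9216)]
    }
}

The real readers of course work against stripe/stream metadata rather than raw offsets, and typically allow a small gap so that nearly adjacent streams still go out as one read; the point is simply that interleaved or split streams produce more entries in the plan, i.e. more IO ops.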