Hi,
> Since Split creates two separated streams, reading one data batch will need
> an additional seek in order to reconstruct the column data

If you are seeing a seek like that, we've messed up something else higher up in the pipeline, and that can be fixed.

ORC columnar reads only do random IO at the column level, not the stream level (except for non-column streams like the bloom filters); adjacent streams are read together as a single IO op.

DiskRangeList produces a merged read plan before firing off any read, so the actual IO layer will (or should) never see a seek between adjacent streams.

There's a possibility that someone will add an extra byte or something to a stream which they never read, which might be a problem.

In early 2016 Rajesh & I went through each read IOP and tuned ORC for S3, which performs very poorly if you add irrelevant seeks.

If you do find a similar case in Apache ORC (not Hive-orc), I'll file a corresponding ticket to this

https://issues.apache.org/jira/browse/HIVE-13161

That was actually about reading 2 columns with an entirely NULL column in the middle, not exactly about splitting streams.

The next giant leap in IO performance for seeks is expected from a new HDFS API, which allows the scatter-gather to be pushed down further into the IO layer.

https://issues.apache.org/jira/browse/HADOOP-11867

This is mainly intended for reading ORC files from erasure-coded streams, where the IO layer can reorganize and align the reads along the erasure coding boundaries (not so much about actual IOPs), instead of assuming normal read-ahead for the block reader.

Cheers,
Gopal
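
For what it's worth, the merged read plan idea can be sketched roughly as below. This is not the actual DiskRangeList code, just a minimal illustration in Python; the `DiskRange` class and `merge_ranges` function are hypothetical names, and the byte offsets are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DiskRange:
    """Hypothetical stand-in for one stream's byte range in the file."""
    offset: int  # first byte of the range (inclusive)
    end: int     # one past the last byte of the range (exclusive)


def merge_ranges(ranges: List[DiskRange]) -> List[DiskRange]:
    """Coalesce touching or overlapping ranges into single reads."""
    merged: List[DiskRange] = []
    for r in sorted(ranges, key=lambda x: x.offset):
        if merged and r.offset <= merged[-1].end:
            # Starts where the previous range ends (or inside it):
            # extend the previous read instead of issuing a new one.
            last = merged[-1]
            merged[-1] = DiskRange(last.offset, max(last.end, r.end))
        else:
            # There is a gap, so this range needs its own read (a seek).
            merged.append(r)
    return merged


# Two adjacent streams of one column, plus a stream from a distant column.
adjacent_plan = merge_ranges(
    [DiskRange(0, 100), DiskRange(100, 250), DiskRange(4000, 4100)]
)

# The same two streams with one unread byte between them.
gapped_plan = merge_ranges([DiskRange(0, 100), DiskRange(101, 250)])

print(len(adjacent_plan), len(gapped_plan))
```

In this sketch the two adjacent streams collapse into one read, while the one-byte gap in the second plan leaves two separate reads, which is the "extra byte" problem mentioned above.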