AFAIR, ORC used to have a gap threshold: if two ranges it needed were separated by less than that threshold, it would still issue one read covering both rather than two separate reads.
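
To illustrate what I mean by the gap rule, here is a rough sketch. It is not ORC's actual reader code; the function name `coalesce_ranges`, the `max_gap` parameter and the byte values are made up for the example.

```python
def coalesce_ranges(ranges, max_gap):
    """Merge (offset, length) byte ranges whose gap is at most max_gap bytes.

    Hypothetical helper for illustration only, not an ORC API.
    """
    merged = []  # list of [start, end) pairs, kept sorted by start
    for offset, length in sorted(ranges):
        end = offset + length
        if merged and offset - merged[-1][1] <= max_gap:
            # Gap to the previous range is small: extend that read
            # instead of issuing a separate seek + read.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    # Convert back to (offset, length) tuples.
    return [(start, end - start) for start, end in merged]


# Two nearby ranges and one far away; max_gap is an arbitrary example value.
reads = coalesce_ranges([(0, 100), (150, 50), (10000, 10)], max_gap=64)
print(reads)
```

With a larger `max_gap` you trade some wasted bytes for fewer seeks, which is the same trade-off being discussed for the extra streams below.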

On 18/3/25, 23:47, "Gopal Vijayaraghavan" <gop...@apache.org> wrote:

>
>> 2. Under seek or predicate pushdown scenario, there’s no need to
>>load the entire stream.
>
>Yes, that is a valid scenario where the reader reads partial-streams &
>causes random IO.
>
>The current double encoding is actually 2 streams today & will continue
>to use 2 streams for the FLIP implementation.
>
>The SPLIT implementation will go from the current 2 streams to 4 streams
>(i.e 1+1->1+3 streams) & the total data IO will drop by ~2x or so. More
>so if one of the streams can be suppressed (like in my IoT data-set,
>where the sign-bit is always +ve for my electric meter data).
>
>The trade-offs seem to be working out on regular HDDs with locality & for
>LLAP SSD caches - if your use-cases are different, I'd like to hear more
>about it.
>
>The only significant random IO delays expected seem to be entirely within
>the HDFS API network hops (which offers 0% locality when data is erasure
>coded or for cloud-storage), which I hope to fix in the Hadoop-3.x branch
>with a new API.
>
>Cheers,
>Gopal
>
>