> How small are you trying to make the stripes? I ask because all of the
> above should be small, so if they are dominating, I would assume the stripe
> is tiny or the compression really worked well.

I'm not in favour of stripelets, for seek reasons: reading a single column from a remote store is hurt by the extra skipping over stripelet boundaries (or by having to read through them). Flushing at fixed offsets across all columns would not suffer from that and would not change the underlying read patterns.

There's already an "ORC gap cache" in LLAP to hack around the lack of these boundaries, but that is something I'd rather not keep around forever.

> The ORC spec currently calls for sorted dictionaries, so if they are not
> sorted, they are not valid ORC files.

> I find that most dictionaries are a relatively small size compared to the row
> count, so the cost of testing each entry isn’t a big deal.

I agree, moving that out of the spec would be a good thing.

The format can add a future optional stream, a "sort-order-index", which contains the dictionary transform from unsorted to sorted (i.e. the dict-ids in byte-sorted order), so that the reader can remap the dictionary into a sorted list.

But removing the "always sort" requirement for dictionaries would be a good thing for writer throughput and memory consumption.

Cheers,
Gopal
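
A minimal sketch of the sort-order-index idea, for illustration only. It assumes the dictionary is a plain list of byte strings kept in insertion (unsorted) order and that rows reference entries by position. The function names are hypothetical and are not part of the ORC spec or of any ORC API.

```python
def build_sort_order_index(dictionary):
    """Writer side (hypothetical): list the dict-ids in byte-sorted order.

    The dictionary itself stays in insertion order, so the writer never
    has to keep a sorted structure while rows are being added.
    """
    return sorted(range(len(dictionary)), key=lambda dict_id: dictionary[dict_id])


def remap_to_sorted(dictionary, sort_order_index, row_ids):
    """Reader side (hypothetical): rebuild a sorted dictionary and remap rows."""
    sorted_dictionary = [dictionary[dict_id] for dict_id in sort_order_index]

    # Invert the index: old (unsorted) dict-id -> new (sorted) dict-id.
    old_to_new = [0] * len(dictionary)
    for new_id, old_id in enumerate(sort_order_index):
        old_to_new[old_id] = new_id

    return sorted_dictionary, [old_to_new[row_id] for row_id in row_ids]


# Entries in the order the writer first saw them.
dictionary = [b"pear", b"apple", b"fig"]
row_ids = [0, 1, 0, 2]  # pear, apple, pear, fig

index = build_sort_order_index(dictionary)
sorted_dictionary, remapped_rows = remap_to_sorted(dictionary, index, row_ids)
```

A reader that does not need sorted order can ignore the index and decode rows against the unsorted dictionary directly; only readers that rely on ordering pay for the remap.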