>  How small are you trying to make the stripes?  I ask because all of the 
> above should be small, so if they are dominating, I would assume the stripe 
> is tiny or the compression really worked well.

I'm not in favour of stripelets, for seek reasons: reading a single column 
from a remote store pays for the extra skips over stripelet boundaries (or 
I end up reading straight through the boundaries).

Flushing at fixed offsets across all columns would not suffer from that and 
would not change the underlying read patterns.

There's already an "ORC gap cache" in LLAP to hack around the lack of these 
boundaries, but it's something I'd rather not keep around forever.

>  The ORC spec currently calls for sorted dictionaries, so if they are not 
> sorted, they are not valid ORC files.  
>   I find that most dictionaries are relatively small compared to the row 
> count, so the cost of testing each entry isn’t a big deal.

I agree, moving that out of the spec would be a good thing.

The format can add a future optional stream, "sort-order-index", which 
contains the dictionary transform from unsorted to sorted (i.e. the dict-ids 
in byte-sorted order), so that the reader can remap it into a sorted list.
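
As a rough sketch of the idea (nothing like this exists in the ORC spec or 
codebase today; the function names and the in-memory layout are made up 
purely for illustration), the writer would keep the dictionary in insertion 
order and emit the list of dict-ids in byte-sorted order, and a reader that 
needs sorted dictionaries would rebuild them from that list:

```python
def build_sort_order_index(dictionary):
    """Writer side (hypothetical): dict-ids listed in byte-sorted order.

    `dictionary` is a list of bytes values in insertion (unsorted) order.
    """
    return sorted(range(len(dictionary)), key=lambda i: dictionary[i])


def remap_to_sorted(dictionary, sort_order_index, row_ids):
    """Reader side (hypothetical): rebuild a sorted dictionary and row ids."""
    # rank[old_id] is the position of that entry in the sorted dictionary.
    rank = [0] * len(dictionary)
    for new_id, old_id in enumerate(sort_order_index):
        rank[old_id] = new_id
    sorted_dictionary = [dictionary[old_id] for old_id in sort_order_index]
    return sorted_dictionary, [rank[i] for i in row_ids]


# Dictionary as the writer saw the values, plus per-row dict-ids.
dictionary = [b"pear", b"apple", b"fig"]
row_ids = [0, 1, 1, 2, 0]

index = build_sort_order_index(dictionary)
sorted_dictionary, sorted_row_ids = remap_to_sorted(dictionary, index, row_ids)
```

A reader that doesn't care about ordering could skip the stream entirely 
and use the dictionary as written.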

But removing the "always sort" requirement for dictionaries would be a good 
thing for writer throughput and memory consumption.

Cheers,
Gopal


