Thanks Dain. My comments are inline.

> On Oct 1, 2018, at 11:51 AM, Dain Sundstrom <d...@iq80.com> wrote:
> 
> 
> 
>> On Sep 28, 2018, at 2:40 PM, Xiening Dai <xndai....@live.com> wrote:
>> 
>> Hi all,
>> 
>> While we are working on the new Orc v2 spec, I want to bounce some ideas in 
>> this group. If we can get something concrete, I will open JIRAs to follow 
>> up. Some of these ideas were mentioned before in various discussion, but I 
>> just put them together in a list so people can comment and provide feedback. 
>> Thanks.
>> 
>> 
>> *   Clustered Index
>> 
>> Much of the time, data written into an Orc file has a sorting property. For 
>> example, a sort merge join output will result in the data stream being sorted 
>> on the join key(s). Another example is that the DISTRIBUTED BY … SORTED BY … 
>> keywords can enforce a sorting property on a data set. In such cases, if we 
>> just record the sort key value(s) for the first row of each row group, it will 
>> help us a lot when doing lookups using key ranges, because we already have the 
>> row group index, which gives us ~O(1) seek time. During query execution, a SQL 
>> filter predicate can easily be turned into a key range. For example “WHERE id > 
>> 0 and id <= 100” will be translated into the range (0, 100], and this key 
>> range can be passed down all the way to the Orc reader. Then we only need to 
>> load the corresponding row groups that cover this range.
> 
> Don’t we already have min and max for every row group?  If so, isn’t that a 
> superset of this feature?

Not exactly. With min/max it’s O(n) to scan all the row group stats, but with the 
new index structure it becomes O(log n). Another benefit is that when the writer 
takes advantage of the sorting property, the min/max values of the sorting columns 
can be computed in O(1) time. This is particularly beneficial for string columns 
(string comparison can be costly). 

> 
>> *   Stripe Footer Location
>> 
>> Today stripe footers are stored at the end of each stripe. This design 
>> probably comes from the Hive world, where the implementation tries to align 
>> an Orc stripe with an HDFS block. That makes sense when you only need to 
>> read one HDFS block for both the data and the footer. But the alignment 
>> assumption doesn’t hold in other systems that leverage Orc as a columnar 
>> data format. Besides, even for Hive it’s often hard to ensure good 
>> alignment for various reasons - for example, when memory pressure is high a 
>> stripe needs to be flushed to disk earlier. With this in mind, it would make 
>> sense to support saving the stripe footers at the end of the file, together 
>> with all the other file meta. That allows one sequential IO to load all the 
>> meta, and makes it easier to cache it all together. And we can make this 
>> configurable through writer options.
> 
> The end of the file contains a metadata section with a summary of the stripe 
> statistics.  I’m curious what information you would like that isn’t already 
> present in that data structure.  
> 
> Also, I’m curious how these other systems "that leverage Orc as a columnar 
> data format" are accessing this stripe footer information.  Specifically, is 
> it cached or loaded on demand from storage?  Are you using this for planning 
> or data skipping?

We only need the stripe footer when we start loading and decoding the data 
streams, and that’s one additional seek IO per stripe. Since the stripe 
statistics tend to be bigger in size and are already in the file tail, I am not 
sure why we cannot just move the stripe footers to the tail as well. We are not 
caching stripe footers today, and I believe packing all the meta together would 
make it easier to cache.
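
To illustrate the IO pattern, here is a minimal sketch of the single tail read 
this would allow (how the footers are laid out inside the buffer is up to the 
spec; TailReader is just an illustration, not the current ORC reader API):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    final class TailReader {
        // With stripe footers serialized next to the existing file meta, one
        // sequential read of the tail returns the postscript, file footer,
        // stripe statistics and every stripe footer, and the whole buffer can
        // be cached as a single unit - instead of one extra positioned read
        // per stripe as we do today.
        static ByteBuffer readTail(FileChannel file, long tailLength) throws IOException {
            ByteBuffer tail = ByteBuffer.allocate((int) tailLength);
            file.read(tail, file.size() - tailLength);   // single positioned read
            tail.flip();
            return tail;
        }
    }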

> 
>> *   File Level Dictionary
>> 
>> Currently Orc builds dictionary at stripe level. Each stripe has its own 
>> dictionary. But in most cases, data across stripes share a lot of 
>> similarities. Building one file level dictionary is probably more efficient 
>> than having one dictionary each stripe. We can reduce storage footprint and 
>> also improve read performance since we only have one dictionary per column 
>> per file. One challenge with this design is how to do file merge. Two files 
>> can have two different dictionary, and we need to be able to merge them 
>> without rewriting all the data. To solve this problem, we will need to 
>> support multiple dictionaries identified by uuid. Each stripe records the 
>> dictionary ID that identifies the dictionary it uses. And the reader loads 
>> the particular dictionary based on the ID when it loads a stripe. When you 
>> merge two files, dictionary data doesn’t need to be changed, but just to 
>> save the dictionaries from both files in the new merged file.
> 
> In my experience, this overhead is very small when using the default 128MB 
> stripe settings, and I’d guess is reasonable at 64MB or 32MB.  What we 
> typically see is that dictionary columns compress amazingly well, and the 
> other columns in the table take up the majority of the space in a table.  
> Even when you include the repeated cost of the dictionary per stripe, the 
> overhead is tiny.
> 
> On the other hand, there are cases where you have columns that have a small 
> common set of values mixed in with pretty distinct values (skewed data), and 
> in those cases the dictionary blows up as you add more rows to the stripe.  
> The FB fork of ORC, DWRF, addresses this by having support for a "per row 
> group" dictionary.  Another alternative is to support mixed direct and 
> dictionary encodings in the same column, but that is pretty complex to 
> implement and effectively disables downstream dictionary processing.
> 
> BTW, I’m not advocating for either of these dictionary changes, just 
> providing my observations.

The overhead is also on the reader path. A dictionary has to be constructed every 
time a stripe is loaded (IO, decoding, decompression and potentially memory 
allocation/deallocation). It’s interesting to see that DWRF is taking the other 
direction. We might do some prototyping to get a better idea. 
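
For what it’s worth, here is a rough sketch of the metadata shapes the uuid-based 
proposal implies (these classes are purely illustrative, not part of the ORC spec 
or any reader):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.UUID;

    // File tail: column id -> (dictionary id -> dictionary blob). A merged file
    // can carry several dictionaries for the same column.
    final class FileDictionaries {
        final Map<Integer, Map<UUID, byte[]>> byColumn = new HashMap<>();
    }

    // Per stripe: column id -> id of the file-level dictionary it encodes against.
    final class StripeDictionaryRefs {
        final Map<Integer, UUID> byColumn = new HashMap<>();
    }

    final class FileMerger {
        // Merging two files never rewrites dictionary data or stripe streams:
        // copy both dictionary sets into the new tail and leave each stripe's
        // dictionary ids untouched.
        static FileDictionaries merge(FileDictionaries a, FileDictionaries b) {
            FileDictionaries merged = new FileDictionaries();
            for (FileDictionaries source : List.of(a, b)) {
                source.byColumn.forEach((column, byId) ->
                    merged.byColumn
                          .computeIfAbsent(column, c -> new HashMap<>())
                          .putAll(byId));
            }
            return merged;
        }
    }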

> 
>> *   Breaking Compression Block and RLE Runs at Row Group Boundary
>> 
>> Owen has mentioned this in previous discussion. We did a prototype and are 
>> able to show that there’s only a slight increase of file size (< 1%) with 
>> the change. But the benefit is obvious - all the seek to row group operation 
>> will not involve unnecessary decoding/decompression, making it really 
>> efficient. And this is critical in scenarios such as predicate pushdown or 
>> range scan using clustered index (see my first bullet point). The other 
>> benefit is doing so will greatly simply the index implementation we have 
>> today. We will only need to record a file offset for row group index.
> 
> This is one I’d like to see.  The complexity of encodings spanning row 
> groups makes skipping super complex. Would you extend this to compression 
> blocks?
> 
> BTW, there are datasets that would become much bigger with this change, but I 
> think on average the change would be tiny.

Yes, this includes compression blocks - we would break both compression chunks 
and RLE runs at row group boundaries. There’s one issue with the byte RLE 
encoding, but I think we can figure something out. On what kinds of datasets do 
you see a big difference?
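
To show how much the index simplifies, here is a sketch of the per-row-group 
position entry before and after the change (illustrative types only, not the 
RowIndexEntry from the current ORC protobuf):

    // Today, seeking into a stream needs up to three coordinates, and the reader
    // still has to decompress and decode its way through the last two:
    final class RowGroupPositionToday {
        long compressedChunkOffset;      // where the compression chunk starts
        long offsetInDecompressedChunk;  // bytes to skip after decompression
        long valuesIntoRleRun;           // values to discard inside the current RLE run
    }

    // With compression chunks and RLE runs ending at row group boundaries,
    // a single offset is enough: the reader starts a fresh chunk and a fresh
    // run at exactly that byte.
    final class RowGroupPositionProposed {
        long streamOffset;
    }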

> 
>> *   Encoding and Compression
>> 
>> The encoding today doesn’t have a lot of flexibility. Sometimes we would 
>> need to configure and fine tune encoding when it’s needed. For example, in 
>> previous discussions Gang brought up, we found LEB128 causes zStd to perform 
>> really bad. We would end up with much better result by just disabling LEB128 
>> under zstd compression. We don’t have flexibility for these kind of things 
>> today. And we will need additional meta fields for that.
> 
> I think this brings up a much bigger issue.  Why have such flexibility in the 
> compression algorithms?  The interaction between the block compression and 
> the data encodings can have dramatic effects on the format.  Can we consider 
> limiting the compression to LZ4 and ZSTD (or maybe just ZSTD), and then 
> design encodings that play well with them?  Also, ZSTD can have a pre-trained 
> "dictionary" that might help with specific encodings…. Just a thought.
> 

I agree that we should limit the compression codecs to one or two (LZ4 is still 
pretty useful, and we might keep it around). But having the flexibility to 
enable/disable some of the encoding features is still good. We are talking about 
the writer implementation, not the configuration we want to expose to the end 
user. Even with one compression codec, different compression levels might have 
drastically different behaviors.
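
As an example of the kind of writer-side knob I have in mind - nothing like this 
exists in the current writer options, and the names below are hypothetical:

    // Let the writer pick the integer encoding per codec, e.g. keep raw
    // fixed-width integers under zstd (which, per the finding above, compresses
    // them better than LEB128 varints) and only use LEB128 when no
    // general-purpose codec is layered on top.
    enum IntegerEncoding { LEB128_VARINT, FIXED_WIDTH }

    final class EncodingOptions {
        IntegerEncoding integerEncoding = IntegerEncoding.LEB128_VARINT;
        int compressionLevel = 3;

        static EncodingOptions forCodec(String codec) {
            EncodingOptions options = new EncodingOptions();
            if ("zstd".equalsIgnoreCase(codec)) {
                options.integerEncoding = IntegerEncoding.FIXED_WIDTH;
            }
            return options;
        }
    }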


> -dain
> 
