> On Oct 8, 2018, at 5:19 PM, Xiening Dai <[email protected]> wrote:
>
>> On Oct 7, 2018, at 6:42 AM, Dain Sundstrom <[email protected]> wrote:
>>
>>> On Oct 6, 2018, at 11:42 AM, Owen O'Malley <[email protected]> wrote:
>>>
>>> On Mon, Oct 1, 2018 at 3:56 PM Dain Sundstrom <[email protected]> wrote:
>>>
>>>>>> * Breaking Compression Block and RLE Runs at Row Group Boundary
>>>>>>
>>>>>> Owen has mentioned this in a previous discussion. We did a prototype and
>>>>>> were able to show that there is only a slight increase in file size
>>>>>> (< 1%) with the change. But the benefit is obvious - seeks to a row
>>>>>> group never involve unnecessary decoding/decompression, making them
>>>>>> really efficient. And this is critical in scenarios such as predicate
>>>>>> pushdown or range scans using a clustered index (see my first bullet
>>>>>> point). The other benefit is that doing so greatly simplifies the index
>>>>>> implementation we have today: we would only need to record a file
>>>>>> offset for the row group index.
>>>>>
>>>>> Yeah, this is the alternative to the stripelets that I discussed above.
>
> If we go with smaller stripes, we run into problems like dictionary
> duplication, overhead of metas/stats, cross-stripe reads, etc. Actually, a
> couple of the items I propose here are meant to make sure the format still
> works great if we choose a much smaller stripe size. As you mentioned,
> sometimes smaller stripes are inevitable (e.g. under dynamic partition
> insert).
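
To make the trade-off in the quoted proposal concrete, here is a minimal Java
sketch (the class and method names are hypothetical, not the actual ORC reader
API) contrasting the two seek paths: today a row-group index entry has to
describe a position inside a compression block and inside an RLE run, whereas
with blocks and runs broken at row-group boundaries a single stream offset
would be enough.

    import java.io.IOException;

    public class RowGroupSeekSketch {

        /**
         * Today: a row-group index entry must carry enough state to re-enter
         * the middle of a compression block and the middle of an RLE run.
         */
        static final class CurrentPosition {
            final long compressedBlockOffset; // where the compression block starts
            final long decompressedOffset;    // bytes to skip after decompressing
            final long valuesToSkipInRun;     // values to discard from the RLE decoder

            CurrentPosition(long compressedBlockOffset, long decompressedOffset, long valuesToSkipInRun) {
                this.compressedBlockOffset = compressedBlockOffset;
                this.decompressedOffset = decompressedOffset;
                this.valuesToSkipInRun = valuesToSkipInRun;
            }
        }

        /** Seek with today's layout: decompress and decode data we do not need. */
        static void seekCurrent(MockStream stream, CurrentPosition position) throws IOException {
            stream.seek(position.compressedBlockOffset);
            byte[] block = stream.decompressBlock();              // may decompress unwanted bytes
            stream.restartRleDecoder(block, (int) position.decompressedOffset);
            stream.skipValues(position.valuesToSkipInRun);        // decode and throw away values
        }

        /**
         * Proposed: compression blocks and RLE runs are flushed at every
         * row-group boundary, so the index entry is a single stream offset
         * and the decoders simply start fresh.
         */
        static void seekProposed(MockStream stream, long rowGroupOffset) throws IOException {
            stream.seek(rowGroupOffset);
            stream.restartDecoders();   // nothing to skip, nothing extra to decode
        }

        /** Minimal stand-in for a column stream, just so the sketch is self-contained. */
        interface MockStream {
            void seek(long offset) throws IOException;
            byte[] decompressBlock() throws IOException;
            void restartRleDecoder(byte[] block, int offset);
            void skipValues(long count);
            void restartDecoders();
        }
    }
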
How small are you trying to make the stripes? I ask because all of the above
should be small, so if they are dominating, I would assume the stripe is tiny
or the compression really worked well.

>>>>>> * Encoding and Compression
>>>>>>
>>>>>> The encoding today doesn't have a lot of flexibility. Sometimes we would
>>>>>> need to configure and fine-tune the encoding. For example, in a previous
>>>>>> discussion Gang brought up, we found that LEB128 causes zstd to perform
>>>>>> really badly. We ended up with a much better result by simply disabling
>>>>>> LEB128 under zstd compression. We don't have flexibility for these kinds
>>>>>> of things today, and we would need additional meta fields for that.
>>>>>
>>>>> I certainly have had requests for custom encodings, but I've tended to
>>>>> push back because it makes it hard to ensure the files are readable on
>>>>> all of the platforms. I did just add the option to turn off dictionary
>>>>> encoding for particular columns.
>>>>
>>>> Yep. As someone that maintains a reader/writer implementation, I would
>>>> prefer to keep the variety of encodings down for the same reason :)
>>>>
>>>> As for flexibility, the dictionary encoding flag you mentioned wouldn't
>>>> affect the format, so it seems like a reasonable change to me. One
>>>> format-level flexibility change I'd like to see is the ability to not
>>>> sort dictionaries, because no one is taking advantage of it, and it
>>>> makes it impossible to predict the output size of the stripe (sorting
>>>> can make compression better or worse).
>>>
>>> Absolutely. We probably should make that the default for ORC v2.
>>
>> If we keep this feature, I don't see any reason not to enable it, since the
>> reader must still have all of the complex code. My preference would be to
>> require a reset at row groups, and then we can remove that complex code.
>
> We implemented a non-sorted dictionary encoding using a hash map. It
> performs better than the sorted version. But having the dictionary entries
> sorted can help runtime execution run directly on the dictionary (some call
> this lazy materialization). This can be a huge gain in a lot of scenarios.
> Does Presto have it?

The ORC spec currently calls for sorted dictionaries, so if they are not
sorted, they are not valid ORC files.

I'm not sure exactly what you mean by "lazy materialization", but Presto has
done lazy materialization since the first version of its ORC reader. Also,
Presto has always understood and optimized for dictionary and run-length
encoded data. The only thing I can see a sorted dictionary helping with is
optimizing exact or range filters (making them O(1) instead of
O(dictionaryCount)). I find that most dictionaries are relatively small
compared to the row count, so the cost of testing each entry isn't a big
deal.

-dain
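
Here is a minimal, self-contained Java sketch of the dictionary-level filter
evaluation discussed in the last paragraph, using hypothetical names rather
than Presto's actual reader classes: the predicate is tested once per
dictionary entry (O(dictionaryCount)) and then projected onto rows; a sorted
dictionary would only let the per-entry step use binary search instead of a
linear scan.

    import java.util.Arrays;

    public class DictionaryFilterSketch {

        /**
         * Returns which rows match {@code target}, given a column stored as a
         * dictionary plus per-row dictionary ids.
         */
        static boolean[] evaluateEquals(String[] dictionary, int[] rowToDictionaryId, String target) {
            // Step 1: test each dictionary entry once - O(dictionaryCount) work,
            // independent of the (usually much larger) row count.
            boolean[] entryMatches = new boolean[dictionary.length];
            for (int i = 0; i < dictionary.length; i++) {
                entryMatches[i] = dictionary[i].equals(target);
            }
            // If the dictionary were guaranteed to be sorted, this loop could be
            // replaced by Arrays.binarySearch(dictionary, target).

            // Step 2: project the per-entry results onto rows with a simple lookup.
            boolean[] rowMatches = new boolean[rowToDictionaryId.length];
            for (int row = 0; row < rowToDictionaryId.length; row++) {
                rowMatches[row] = entryMatches[rowToDictionaryId[row]];
            }
            return rowMatches;
        }

        public static void main(String[] args) {
            String[] dictionary = {"apple", "banana", "cherry"};
            int[] rows = {2, 0, 0, 1, 2, 2};
            System.out.println(Arrays.toString(evaluateEquals(dictionary, rows, "cherry")));
            // prints [true, false, false, false, true, true]
        }
    }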

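For the encoding discussion earlier in the thread, here is an illustrative
Java sketch (not ORC's actual writer code) of the two byte layouts being
compared: unsigned LEB128 varints versus plain fixed-width little-endian
longs handed to the general-purpose compressor. Whether zstd does better on
one or the other depends on the data, as the thread notes.

    import java.io.ByteArrayOutputStream;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class VarintLayoutSketch {

        /** Unsigned LEB128: 7 data bits per byte, high bit set on all but the last byte. */
        static void writeLeb128(long value, ByteArrayOutputStream out) {
            while ((value & ~0x7fL) != 0) {
                out.write((int) ((value & 0x7f) | 0x80));
                value >>>= 7;
            }
            out.write((int) value);
        }

        /** Fixed-width 8-byte little-endian layout, no length variability. */
        static void writeFixed64(long value, ByteArrayOutputStream out) {
            ByteBuffer buffer = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
            buffer.putLong(value);
            out.writeBytes(buffer.array());
        }

        public static void main(String[] args) {
            long[] values = {1_000_000L, 1_000_007L, 1_000_014L, 1_000_021L};
            ByteArrayOutputStream varint = new ByteArrayOutputStream();
            ByteArrayOutputStream fixed = new ByteArrayOutputStream();
            for (long v : values) {
                writeLeb128(v, varint);
                writeFixed64(v, fixed);
            }
            // The varint stream is smaller before compression, but its values do
            // not sit on aligned, repeating byte positions; the fixed-width stream
            // is larger but highly regular, which a compressor like zstd can
            // sometimes exploit better.
            System.out.println("LEB128 bytes:      " + varint.size());
            System.out.println("fixed-width bytes: " + fixed.size());
        }
    }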