On Fri, Sep 28, 2018 at 2:40 PM Xiening Dai <xndai....@live.com> wrote:

> Hi all,
>
> While we are working on the new Orc v2 spec, I want to bounce some ideas
> in this group. If we can get something concrete, I will open JIRAs to
> follow up. Some of these ideas were mentioned before in various discussion,
> but I just put them together in a list so people can comment and provide
> feedback. Thanks.
>
>
>   *   Clustered Index
>
> Much of the time, data written into an Orc file has a sorting property. For
> example, a sort-merge join output will result in the data stream being
> sorted on the join key(s). Another example is the DISTRIBUTED BY … SORTED BY …
> keywords, which enforce a sort order on a data set. In such cases, recording
> the sort key value(s) for the first row of each row group would help a lot
> when doing lookups by key range, because we already have the row group
> index, which gives us ~O(1) seek time. During query execution, a SQL filter
> predicate can easily be turned into a key range. For example, “WHERE id > 0
> AND id <= 100” translates into the range (0, 100], and that key range can be
> passed all the way down to the Orc reader, which then only needs to load the
> row groups that cover this range.
>

Dain covered this already, but the min/max statistics should already do this.

One piece that we should add is automatically recording whether the data is
strictly ascending or descending.
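
As a rough illustration of how that lookup already works, here is a minimal
sketch (illustrative names, not the actual reader internals) of pruning row
groups for the predicate from the example above, "WHERE id > 0 AND id <= 100",
using the per-row-group min/max statistics. If the data is also recorded as
strictly ascending on `id`, the surviving row groups form one contiguous
range, which is what the clustered-index proposal relies on.

import java.util.ArrayList;
import java.util.List;

public class RowGroupPruning {
    static final class RowGroupStats {
        final long min, max;                          // min/max of `id` in the row group
        RowGroupStats(long min, long max) { this.min = min; this.max = max; }
    }

    // Overlap test between the row group's [min, max] and the key range (low, high].
    static boolean mayContain(RowGroupStats s, long lowExclusive, long highInclusive) {
        return s.max > lowExclusive && s.min <= highInclusive;
    }

    static List<Integer> selectRowGroups(RowGroupStats[] groups) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < groups.length; i++) {
            if (mayContain(groups[i], 0L, 100L)) {    // WHERE id > 0 AND id <= 100
                selected.add(i);                      // only these row groups are read
            }
        }
        return selected;
    }
}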


>
>   *   Stripe Footer Location
>
> Today stripe footers are stored at the end of each stripe. This design
> probably comes from the Hive world, where the implementation tries to align
> Orc stripes with HDFS blocks. That makes sense when you only need to read
> one HDFS block for both the data and the footer. But the alignment
> assumption doesn’t hold in other systems that leverage Orc as a columnar
> data format. Besides, even for Hive it is often hard to ensure good
> alignment for various reasons - for example, when memory pressure is high a
> stripe needs to be flushed to disk early. With this in mind, it would make
> sense to support saving stripe footers at the end of the file, together with
> all the other file metadata. That allows one sequential IO to load all the
> metadata, and makes it easier to cache it all together. We can make this
> configurable through writer options.
>

Some systems, like S3 and Wasabi, have a strong preference for reading
forward in files, so I think it is reasonable to put the stripe "footer"
first in the file, followed by the indexes, and finally the data. Putting
the "footer" first isn't hard for the implementation since it has to buffer
the entire stripe before anything is written.
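
A minimal sketch of what that write order could look like, assuming the
stripe is fully buffered; the types and method here are hypothetical, not
the ORC writer API:

import java.io.IOException;
import java.io.OutputStream;

public class FooterFirstStripeWriter {
    // Because the whole stripe is buffered before flushing, every stream's
    // size is known up front, so the footer can lead the stripe and readers
    // on forward-only stores get a single forward pass per stripe.
    static void flushStripe(byte[] footer, byte[] indexes, byte[] data,
                            OutputStream out) throws IOException {
        out.write(footer);    // 1. stripe metadata first (stream sizes, encodings)
        out.write(indexes);   // 2. row group indexes
        out.write(data);      // 3. column data streams last
    }
}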

One of the ideas that I've been playing with is to make "stripelets" where
each one contains 128k rows and flush the streams at that point. That would
enable you to read the first stripelet and start processing while you read
the next stripelet.
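
To make the benefit concrete, here is a small sketch of the read pipelining
that stripelets would enable; the StripeletSource interface is made up for
illustration and is not an ORC API:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

public class StripeletPipeline {
    interface StripeletSource {
        int count();
        byte[] fetch(int i);                        // blocking read of one stripelet
    }

    static void readAll(StripeletSource src, Consumer<byte[]> process) {
        if (src.count() == 0) {
            return;
        }
        ExecutorService io = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<byte[]> next =
                CompletableFuture.supplyAsync(() -> src.fetch(0), io);
            for (int i = 0; i < src.count(); i++) {
                byte[] current = next.join();       // wait for the prefetched bytes
                if (i + 1 < src.count()) {
                    final int n = i + 1;
                    next = CompletableFuture.supplyAsync(() -> src.fetch(n), io);
                }
                process.accept(current);            // decode while the next fetch runs
            }
        } finally {
            io.shutdown();
        }
    }
}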


>
>   *   File Level Dictionary
>
> Currently Orc builds dictionaries at the stripe level. Each stripe has its
> own dictionary. But in most cases, data across stripes shares a lot of
> similarity. Building one file-level dictionary is probably more efficient
> than having one dictionary per stripe. We can reduce the storage footprint
> and also improve read performance, since there is only one dictionary per
> column per file. One challenge with this design is how to merge files. Two
> files can have two different dictionaries, and we need to be able to merge
> them without rewriting all the data. To solve this, we would need to support
> multiple dictionaries identified by UUID. Each stripe records the ID of the
> dictionary it uses, and the reader loads that dictionary when it loads the
> stripe. When two files are merged, the dictionary data doesn’t need to
> change; the merged file just keeps the dictionaries from both files.
>

I haven't seen the dictionary duplication being a problem. On the other
hand, having cross-block reads, which a global dictionary would require, is
painful. As you point out, stripe merging becomes much more complicated in
this case. (You don't need UUID schemes, because the writer would always
know the mapping of which stripes & columns are using which dictionary.)
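
A small sketch of what that bookkeeping could look like when merging two
files; the structures are hypothetical and not part of the ORC spec:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictionaryMergeSketch {
    static final class FileMeta {
        final List<byte[]> dictionaries = new ArrayList<>();    // file-level dictionaries
        // stripe index -> (column id -> ordinal into `dictionaries`)
        final List<Map<Integer, Integer>> stripeDictIds = new ArrayList<>();
    }

    static FileMeta merge(FileMeta a, FileMeta b) {
        FileMeta merged = new FileMeta();
        merged.dictionaries.addAll(a.dictionaries);
        merged.dictionaries.addAll(b.dictionaries);
        merged.stripeDictIds.addAll(a.stripeDictIds);            // a's ordinals are unchanged
        int offset = a.dictionaries.size();
        for (Map<Integer, Integer> stripe : b.stripeDictIds) {   // shift b's ordinals
            Map<Integer, Integer> remapped = new HashMap<>();
            stripe.forEach((col, dict) -> remapped.put(col, dict + offset));
            merged.stripeDictIds.add(remapped);
        }
        return merged;
    }
}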


>
>   *   Breaking Compression Block and RLE Runs at Row Group Boundary
>
> Owen mentioned this in a previous discussion. We built a prototype and were
> able to show that there is only a slight increase in file size (< 1%) with
> the change. But the benefit is obvious - seeking to a row group no longer
> involves unnecessary decoding/decompression, making it really efficient.
> This is critical in scenarios such as predicate pushdown or range scans
> using a clustered index (see my first bullet point). The other benefit is
> that doing so greatly simplifies the index implementation we have today: we
> would only need to record a single file offset per row group in the index.
>

Yeah, this is the alternative to the stripelets that I discussed above.
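
Roughly, the change shrinks each row group index entry from several
positions per stream down to a single offset. The field names below are
illustrative, not the actual protobuf definitions:

public class RowGroupIndexSketch {
    // Current shape (roughly): seeking requires replaying partial state.
    static final class PositionToday {
        long compressedBlockStart;   // where the compression block begins
        long decompressedOffset;     // bytes to skip inside the decompressed block
        long rleRunOffset;           // values to skip inside the current RLE run
    }

    // Proposed shape: the row group starts exactly at a stream offset.
    static final class PositionProposed {
        long streamOffset;           // seek here and start decoding immediately
    }
}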

>
>
>   *   Encoding and Compression
>
> The encodings today don’t have a lot of flexibility. Sometimes we need to
> configure and fine-tune the encoding. For example, in a previous discussion
> Gang brought up, we found that LEB128 causes zstd to perform really badly.
> We ended up with a much better result by just disabling LEB128 under zstd
> compression. We don’t have the flexibility for that kind of thing today, and
> we would need additional metadata fields for it.
>

I certainly have had requests for custom encodings, but I've tended to push
back because it makes it hard to ensure the files are readable on all of
the platforms. I did just add the option to turn off dictionary encoding
for particular columns.
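
For reference, that per-column override is set through the writer
configuration; the property name below is from memory
("orc.column.encoding.direct", if I recall correctly), so treat it as an
assumption and verify it against the release notes:

import org.apache.hadoop.conf.Configuration;

public class DirectEncodingExample {
    static Configuration configure() {
        Configuration conf = new Configuration();
        // Force direct (non-dictionary) encoding for the listed columns only;
        // the property name is from memory, check the ORC docs.
        conf.set("orc.column.encoding.direct", "id,comments");
        return conf;
    }
}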

With respect to zstd, we need to test it under different data sets and
build up an understanding of when it works and when it doesn't. It sounds
like zstd with the options that you were using was a bad fit for what we
need. I would guess that longer windows and pre-loaded dictionaries may
help. We need to do more work to figure out what the right parameters are
in general by looking at more data sets.

.. Owen
