Re: ORC double encoding optimization proposal

Owen O'Malley Fri, 30 Mar 2018 08:37:47 -0700

On Wed, Mar 28, 2018 at 1:01 AM, Xiening Dai <xndai....@live.com> wrote:


> This modification will increase the complexity of implementation, and I am
> not sure how much we will gain by not closing compression and rle chunks.
> You probably have some data when you firstly designed row group and index.
>

Actually, I didn't. Let's take that as a first step. I'll hack a change so
that we can get a sense of what the new format would look like.

Ok, so what I'm trying is:
* Move the dictionaries (the string contents and lengths) between the
indexes and the data.
* Remove the positions from the row indexes (we don't need them if we flush
at the row group level)
* Close the rle and compression after each row group
* Write the data streams for each of the columns
   - the streams are ordered as data, length, secondary, present

So this has a few impacts:
* We can read and process any row group by reading just the bytes for that
row group.
  - That enables a much better async io reader.
  - We reduce the memory required to read a stripe to just the dictionaries
and row group.
* It also means that we could flush the row group to the file as we write.
  - Less memory consumed by the writer
  - We could use async io for writing.

I won't have a lot of time for the next week and a half, but this sounds fun.
:)

.. Owen
