On Wed, Mar 28, 2018 at 1:01 AM, Xiening Dai <xndai....@live.com> wrote:
> This modification will increase the complexity of implementation, and I am
> not sure how much we will gain by not closing compression and rle chunks.
> You probably have some data when you firstly designed row group and index.

Actually, I didn't.

Let's take that as a first step. I'll hack a change so that we can get a sense of what the new format would look like.

Ok, so what I'm trying is:

* Move the dictionaries (the string contents and lengths) between the indexes and the data.
* Remove the positions from the row indexes (we don't need them if we flush at the row group level).
* Close the RLE and compression after each row group.
* Write the data streams for each of the columns.
  - The streams are ordered as data, length, secondary, present.

So this has a few impacts:

* We can read and process any row group by reading just the bytes for that row group.
  - That enables a much better async IO reader.
  - We reduce the memory required to read a stripe to just the dictionaries and the row group.
* It also means that we could flush the row group to the file as we write.
  - Less memory consumed by the writer.
  - We could use async IO for writing.

I won't have a lot of time for the next week and a half, but this sounds fun. :)

.. Owen
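
A rough illustration of the "close compression after each row group" idea, for anyone following along. This is not ORC code and not the proposed format: it is a toy sketch in Python that assumes a single string column, zlib as the codec, and a made-up ROW_GROUP_SIZE. The names write_stripe and read_row_group are hypothetical. It only shows that once every row group is compressed on its own, the index needs nothing more than a byte range per row group, and a reader can decode one row group from just those bytes.

```python
import io
import zlib

ROW_GROUP_SIZE = 4  # hypothetical value, kept tiny for the demo


def write_stripe(values, out):
    """Write values row group by row group; return [(offset, length), ...].

    Each row group gets its own zlib stream, so the compression state is
    closed at every row group boundary and no intra-chunk positions are
    needed in the index.
    """
    index = []
    for start in range(0, len(values), ROW_GROUP_SIZE):
        group = values[start:start + ROW_GROUP_SIZE]
        # Toy encoding: newline-joined strings (values must not contain "\n").
        raw = "\n".join(group).encode("utf-8")
        chunk = zlib.compress(raw)
        index.append((out.tell(), len(chunk)))
        # The chunk is complete, so it could be flushed to the file right away.
        out.write(chunk)
    return index


def read_row_group(data, index, n):
    """Decode row group n using only its own byte range."""
    offset, length = index[n]
    raw = zlib.decompress(data[offset:offset + length])
    return raw.decode("utf-8").split("\n")


if __name__ == "__main__":
    buf = io.BytesIO()
    rows = [f"row{i}" for i in range(10)]
    idx = write_stripe(rows, buf)
    print(read_row_group(buf.getvalue(), idx, 1))
```

Running it decodes the second row group without touching the bytes of the first or third. The trade-off raised in the quoted question still applies: restarting the compressor per row group will usually cost some compression ratio, and this sketch does not measure that.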