Thanks for the support. I've created PARQUET-1784 <https://issues.apache.org/jira/browse/PARQUET-1784> to track this one. Do not hesitate to comment on the Jira.
Cheers,
Gabor

On Tue, Feb 4, 2020 at 6:34 PM Radev, Martin <[email protected]> wrote:

> Dear all,
>
> in our project of using Parquet for streaming fp data with various
> entropy, we definitely needed to treat the columns differently.
>
> For fp data with low entropy, dictionary encoding provided good results.
> For fp data with entropy > 15 bits per element, the newly added encoding
> + zstd yielded much better results in terms of compression throughput and
> ratio.
>
> I think that it makes sense to offer the possibility to configure each
> column differently if the developer wants to, but still keep the default
> paths intact.
>
> Kind regards,
>
> Martin
>
> ________________________________
> From: Manik Singla <[email protected]>
> Sent: Tuesday, February 4, 2020 6:03:14 PM
> To: Parquet Dev
> Subject: Re: Allow users to fine-tune parquet writing
>
> I think making parquet more configurable is a nice idea.
> We had a similar kind of requirement where we wanted to have different
> configurations for different columns. (I don't even remember the details
> now, as it's been 2-3 months.)
>
> We already had some kind of optimizations in the system for frequently
> queried columns which exist irrespective of data format. So, we thought to
> save some money by saving on storage.
>
> Regards
> Manik Singla
> +91-9996008893
> +91-9665639677
>
> "Life doesn't consist in holding good cards but playing those you hold
> well."
>
> On Tue, Feb 4, 2020 at 1:53 PM Gabor Szadovszky <[email protected]> wrote:
>
> > Hi Ryan,
> >
> > I wouldn't say they are expensive. But, in case the column data is kind
> > of random, the column indexes would not help in filtering but would have
> > a small overhead in performance. Why would we save column indexes for
> > such columns, wasting (a little amount of) space and some time at
> > filtering? A simpler situation is when the user knows that the related
> > column won't be used for selective queries at all.
> >
> > But, column indexes were only an example where the user might want to
> > fine-tune the configuration. If we want to introduce new encodings it
> > would help a lot if the user would be able to select the exact encoding
> > for a column and/or switch the dictionary encoding off for that column.
> > Of course, the default mechanism would remain the same as it is: we try
> > to figure out the best configuration for each column.
> >
> > What do you think about the approach of introducing the suffix described
> > above?
> >
> > Regards,
> > Gabor
> >
> > On Mon, Feb 3, 2020 at 6:26 PM Ryan Blue <[email protected]>
> > wrote:
> >
> > > Are column indexes so expensive that we don't want to use them for all
> > > columns?
> > >
> > > On Mon, Feb 3, 2020 at 6:41 AM Gabor Szadovszky <[email protected]>
> > > wrote:
> > >
> > > > Dear All,
> > > >
> > > > After adding some new statistics and encodings into Parquet it is
> > > > getting very hard to be smart and choose the best configs
> > > > automatically. For example, for which columns should we save column
> > > > index and/or bloom-filters? Is it worth using dictionary for a
> > > > column that we know will fall back to another encoding?
> > > > I think we should allow the users to decide the encoding of one
> > > > column or whether some optional statistics are to be saved for
> > > > another. It would also help testing new encodings/statistics.
> > > >
> > > > We already have some configuration keys but only for setting the
> > > > property for the current writing (e.g. parquet.enable.dictionary to
> > > > enable dictionary encoding). I suggest extending such existing
> > > > properties by a suffix that can specify the related column:
> > > > parquet.enable.dictionary#column.path.col_1 or
> > > > parquet.enable.dictionary#3
> > > > (3 is the index of the column in the projection). For new properties
> > > > we would keep the same format. I don't know if '#' is a valid
> > > > character in the keys but I guess it should be fine.
> > > >
> > > > What do you think?
> > > >
> > > > Cheers,
> > > > Gabor
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
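
For readers following the thread, the suffix scheme proposed in the original message could be resolved roughly as sketched below. This is only an illustration in Python, not Parquet's implementation: the function name `resolve_property` is hypothetical, the configuration is modelled as a plain dictionary, and the precedence (column path first, then column index, then the unsuffixed key) is an assumption the thread does not specify.

```python
from typing import Mapping, Optional

# Separator proposed in the thread; whether '#' is valid in real
# configuration keys is left open there, so treat it as an assumption.
SEPARATOR = "#"


def resolve_property(
    conf: Mapping[str, str],
    key: str,
    column_path: str,
    column_index: Optional[int] = None,
) -> Optional[str]:
    """Return the value of `key` for one column (hypothetical helper).

    Assumed lookup order: "<key>#<column path>", then "<key>#<column index>",
    then the plain "<key>" that applies to the whole write.
    """
    candidates = [f"{key}{SEPARATOR}{column_path}"]
    if column_index is not None:
        candidates.append(f"{key}{SEPARATOR}{column_index}")
    candidates.append(key)  # global default, i.e. today's behaviour

    for candidate in candidates:
        if candidate in conf:
            return conf[candidate]
    return None  # key not configured at all


# Example configuration using the keys quoted in the thread.
conf = {
    "parquet.enable.dictionary": "true",
    "parquet.enable.dictionary#column.path.col_1": "false",
    "parquet.enable.dictionary#3": "false",
}
```

With this sketch, a column without its own entry falls back to the unsuffixed key, which matches the stated goal of keeping the default behaviour unchanged unless the user overrides it.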
