Hi Cheng Xu,

It would be easier if we had comment access to the document. After a first look, I have the following comments:

- "different [codec] implementations may not be compatible with others due to different purposes." This is a huge problem. Parquet specifies the compression codecs that the format supports. We have already had issues caused by not specifying the codecs properly (see PARQUET-1241 <https://issues.apache.org/jira/browse/PARQUET-1241> for details). We must not allow situations like this one. If a Parquet file is written with a compression codec from the spec, it must be readable by any other Parquet implementation that supports that codec, independently of the provider.
- Providers of the compression codecs are usually implementation dependent. How would different Parquet implementations handle the different providers (e.g. a Java-based compression provider that is to be used by parquet-cpp)?
- How do we specify the provider names?

Regards,
Gabor

On Fri, Jun 19, 2020 at 4:30 PM Xu, Cheng A <[email protected]> wrote:

> Hi folks, any suggestions on this?
>
> Thanks
> Cheng Xu
>
> -----Original Message-----
> From: Dong, Xin <[email protected]>
> Sent: Friday, June 5, 2020 2:19 PM
> To: [email protected]
> Subject: RE: Proposal for CompressionCodec Provider-aware Compression
> Codec Lookup for parquet-mr
>
> Hi, Walid,
>
> We've moved the doc here for public access:
>
> https://docs.google.com/document/d/1ueSYq2FIzaom23cpHXppig93ylOxe8CU6EwS82dov2E/
>
> Thanks,
> Xin Dong
>
> -----Original Message-----
> From: Gara Walid <[email protected]>
> Sent: Thursday, June 4, 2020 2:14 PM
> To: [email protected]
> Subject: Re: Proposal for CompressionCodec Provider-aware Compression
> Codec Lookup for parquet-mr
>
> Hi Xin,
>
> Thanks for the proposal. Could you please make the google doc public?
>
> Cheers,
> Walid
>
> On Thu, Jun 4, 2020, 6:46 AM Dong, Xin <[email protected]> wrote:
>
> > Hi, All,
> >
> > The existing Parquet compress codec framework only supports codec name
> > based compression implementation lookup. And it's one-2-one mapping
> > which means only one implementation is supported given a codec name.
> > However, there are various implementations for the same codec name.
> > And different implementations may not be compatible with others due to
> > different purposes. Given Gzip as an example, for some accelerators,
> > it's limited in memory capacity and the history buffer size is
> > relatively smaller than CPU based. And currently codec framework
> > doesn't provide a mechanism to allow users to customize standard
> > compression codec for their own purposes (e.g. performance acceleration,
> > workload offloading).
> > To address the problem, we propose a provider-aware compression codec
> > lookup for parquet-mr.
> > We've put the proposal here:
> >
> > https://docs.google.com/document/d/1sbCjDxEjM5UkbMPNmGqEfF-LYPDWhM-B474dZZeOFD4/edit?ts=5ecb2462#heading=h.5b2qz2ba32wm
> >
> > Any comment is welcome and please let us know your feedback.
> >
> > Thanks,
> > Xin Dong
> >
>
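
For readers of this thread, the lookup under discussion and the compatibility requirement raised in the reply can be sketched roughly as below. This is only an illustration in Python, not parquet-mr code and not the design in the proposal document: `CodecRegistry`, `lookup`, the provider names "default" and "small-window", and the fallback rule are all hypothetical. The "small-window" provider stands in for an accelerator with a smaller history buffer by asking zlib for a 512-byte window.

```python
import gzip
import zlib
from typing import Callable, Dict, Optional, Tuple

Compressor = Callable[[bytes], bytes]


class CodecRegistry:
    """Hypothetical registry keyed by (codec name, provider name)."""

    DEFAULT_PROVIDER = "default"  # assumed name, not from the proposal

    def __init__(self) -> None:
        self._compressors: Dict[Tuple[str, str], Compressor] = {}

    def register(self, codec: str, provider: str, compressor: Compressor) -> None:
        self._compressors[(codec.upper(), provider)] = compressor

    def lookup(self, codec: str, provider: Optional[str] = None) -> Compressor:
        # Prefer the requested provider; fall back to the default provider
        # for the same codec if the requested one is not registered.
        name = codec.upper()
        key = (name, provider or self.DEFAULT_PROVIDER)
        if key in self._compressors:
            return self._compressors[key]
        fallback = (name, self.DEFAULT_PROVIDER)
        if fallback not in self._compressors:
            raise KeyError(f"no compressor registered for codec {name}")
        return self._compressors[fallback]


def small_window_gzip(data: bytes) -> bytes:
    # wbits = 16 + 9 selects the gzip container with a 2**9-byte window,
    # imitating a provider with a limited history buffer.
    compressor = zlib.compressobj(6, zlib.DEFLATED, 16 + 9)
    return compressor.compress(data) + compressor.flush()


registry = CodecRegistry()
registry.register("GZIP", CodecRegistry.DEFAULT_PROVIDER, gzip.compress)
registry.register("GZIP", "small-window", small_window_gzip)

# Whichever provider wrote the bytes, a standard gzip reader must decode them.
page = b"parquet page bytes " * 1000
for provider_name in (None, "small-window"):
    compressed = registry.lookup("GZIP", provider_name)(page)
    assert gzip.decompress(compressed) == page
```

The check at the end is the important part: a provider may change how the bytes are produced, but the output still has to be plain gzip that any reader of that codec can decode. A provider that cannot guarantee this would run into the objection in the first comment above.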
