Thanks, Gabor, for the comments. I've updated the document to allow comment
access:
https://docs.google.com/document/d/1ueSYq2FIzaom23cpHXppig93ylOxe8CU6EwS82dov2E/edit

Yes, ideally every codec, including customized ones, should remain backward 
compatible. However, in some cases that is hard to support. Some users may 
rely on accelerators to do the compression work, and those accelerators have 
limited memory, which doesn't allow a large history buffer for decompression. 

My understanding of this proposal is that we are trying to introduce a 
framework that lets customers customize their compression codecs. It is then 
the customer's own responsibility if they use an incompatible format in return 
for better performance.
This is similar to what airlift did. Airlift is effectively a codec provider: 
it provides a few of the codecs supported by Parquet. We could have some 
officially supported codec provider IDs, such as built-in and airlift, and 
users could make their own decisions to add their own codec providers.
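
As a rough illustration only, a lookup keyed by provider ID plus codec name 
might look like the sketch below. All class and method names here are 
hypothetical; they are not taken from the proposal document or from parquet-mr, 
and the fallback to the built-in provider is my own assumption about one 
possible behaviour.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: resolve a codec by (provider ID, codec name) instead of codec name alone. */
final class ProviderAwareCodecRegistry {

    /** Hypothetical minimal codec contract; the real parquet-mr codec interfaces differ. */
    interface Codec {
        byte[] compress(byte[] input) throws IOException;
        byte[] decompress(byte[] input) throws IOException;
    }

    /** Assumed ID for the default provider, as floated in this thread. */
    static final String BUILT_IN = "built-in";

    private final Map<String, Codec> codecs = new HashMap<>();

    /** Codec names are treated as case-insensitive; provider IDs are not. */
    private static String key(String providerId, String codecName) {
        return providerId + "/" + codecName.toUpperCase(Locale.ROOT);
    }

    void register(String providerId, String codecName, Codec codec) {
        Objects.requireNonNull(codec, "codec");
        codecs.put(key(providerId, codecName), codec);
    }

    /** Looks up the provider's codec first, then falls back to the built-in one (an assumption). */
    Codec lookup(String providerId, String codecName) {
        Codec codec = codecs.get(key(providerId, codecName));
        if (codec == null) {
            codec = codecs.get(key(BUILT_IN, codecName));
        }
        if (codec == null) {
            throw new IllegalArgumentException(
                "No codec '" + codecName + "' for provider '" + providerId + "'");
        }
        return codec;
    }

    /** A stand-in built-in codec backed by the JDK's gzip streams. */
    static Codec jdkGzip() {
        return new Codec() {
            @Override
            public byte[] compress(byte[] input) throws IOException {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                    gz.write(input);
                }
                return out.toByteArray();
            }

            @Override
            public byte[] decompress(byte[] input) throws IOException {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(input))) {
                    byte[] buffer = new byte[4096];
                    int read;
                    while ((read = gz.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                }
                return out.toByteArray();
            }
        };
    }
}
```

With something like this, a user would register an accelerator-backed codec 
under their own provider ID, and readers that don't know that provider would 
still resolve the built-in implementation for the same codec name.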

Your thoughts on this?

Thanks
Cheng Xu

-----Original Message-----
From: Gabor Szadovszky <[email protected]> 
Sent: Monday, June 22, 2020 5:09 PM
To: Parquet Dev <[email protected]>
Subject: Re: Proposal for CompressionCodec Provider-aware Compression Codec 
Lookup for parquet-mr

Hi Cheng Xu,

It would be easier if we had comment access to the document.
After a first look, I have the following comments:
- "different [codec] implementations may not be compatible with others due to 
different purposes." - This is a huge problem. Parquet specifies the 
compression codecs that the format supports. We've already had issues by not 
specifying the codecs properly (see PARQUET-1241 
<https://issues.apache.org/jira/browse/PARQUET-1241> for details). We shall not 
allow situations like this one. A parquet file written with a compression 
codec from the spec shall be readable by any other parquet implementation that 
supports that codec, independently of the provider.
- providers of the compression codecs are usually implementation dependent.
How would different parquet implementations handle the different providers?
(e.g. a Java-based compression provider is to be used by parquet-cpp)
- how do we specify the provider names?

Regards,
Gabor

On Fri, Jun 19, 2020 at 4:30 PM Xu, Cheng A <[email protected]> wrote:

> Hi folks, any suggestions on this?
>
> Thanks
> Cheng Xu
>
> -----Original Message-----
> From: Dong, Xin <[email protected]>
> Sent: Friday, June 5, 2020 2:19 PM
> To: [email protected]
> Subject: RE: Proposal for CompressionCodec Provider-aware Compression 
> Codec Lookup for parquet-mr
>
> Hi, Walid,
>
> We've moved the doc here for public access:
>
> https://docs.google.com/document/d/1ueSYq2FIzaom23cpHXppig93ylOxe8CU6EwS82dov2E/
>
> Thanks,
> Xin Dong
>
> -----Original Message-----
> From: Gara Walid <[email protected]>
> Sent: Thursday, June 4, 2020 2:14 PM
> To: [email protected]
> Subject: Re: Proposal for CompressionCodec Provider-aware Compression 
> Codec Lookup for parquet-mr
>
> Hi Xin,
>
> Thanks for the proposal. Could you please make the google doc public?
>
> Cheers,
> Walid
>
> On Thu, Jun 4, 2020, 6:46 AM Dong, Xin <[email protected]> wrote:
>
> > Hi, All,
> >
> > The existing Parquet compression codec framework only supports looking 
> > up a compression implementation by codec name. It is a one-to-one 
> > mapping, which means only one implementation is supported for a given 
> > codec name. However, there are various implementations for the same 
> > codec name, and different implementations may not be compatible with 
> > others due to different purposes. Take Gzip as an example: some 
> > accelerators are limited in memory capacity, so their history buffer 
> > is smaller than that of CPU-based implementations. The current codec 
> > framework also doesn't provide a mechanism that lets users customize 
> > a standard compression codec for their own purposes (e.g. performance 
> > acceleration, workload offloading).
> > To address the problem, we propose a provider-aware compression 
> > codec lookup for parquet-mr. We've put the proposal here:
> >
> > https://docs.google.com/document/d/1sbCjDxEjM5UkbMPNmGqEfF-LYPDWhM-B
> > 47 4dZZeOFD4/edit?ts=5ecb2462#heading=h.5b2qz2ba32wm
> >
> > Any comments are welcome; please let us know your feedback.
> >
> > Thanks,
> > Xin Dong
> >
>
