Hi Cheng Xu,

It would be easier if we had comment access to the document.
After a first look, I have the following comments:
- "different [codec] implementations may not be compatible with others due
to different purposes." - This is a huge problem. Parquet specifies the
compression codecs that the format supports. We have already had issues
caused by not specifying the codecs properly (see PARQUET-1241
<https://issues.apache.org/jira/browse/PARQUET-1241> for details). We must
not allow situations like this one. A Parquet file written with a
compression codec from the spec must be readable by any other Parquet
implementation that supports that codec, regardless of the provider.
- Providers of compression codecs are usually implementation-dependent.
How would different Parquet implementations handle the different providers?
(E.g. a Java-based compression provider that is to be used by parquet-cpp.)
- How do we specify the provider names?
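
To make the compatibility concern concrete, here is a rough sketch of how a
provider-aware lookup could stay safe. It is only an illustration under my
own assumptions: ProviderAwareCodecs, Codec, JdkGzipCodec, and the provider
names "default" and "fast" are all hypothetical and are not part of
parquet-mr or of the proposal. The idea is that the provider is a local,
writer-side hint only, and a reader that does not have that provider falls
back to its default implementation of the same codec name and must still be
able to read the data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: none of these names exist in parquet-mr.
class ProviderAwareCodecs {
  static final String DEFAULT_PROVIDER = "default";

  interface Codec {
    byte[] compress(byte[] raw);
    byte[] decompress(byte[] compressed);
  }

  // Gzip via java.util.zip. The deflate level stands in for whatever a
  // provider tunes internally; the output is still a standard gzip stream.
  static final class JdkGzipCodec implements Codec {
    private final int level;

    JdkGzipCodec(int level) { this.level = level; }

    @Override
    public byte[] compress(byte[] raw) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      // The initializer block reaches the protected Deflater field `def`.
      try (GZIPOutputStream gz = new GZIPOutputStream(out) {{ def.setLevel(level); }}) {
        gz.write(raw);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      return out.toByteArray();
    }

    @Override
    public byte[] decompress(byte[] compressed) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
        byte[] buf = new byte[4096];
        for (int n = in.read(buf); n != -1; n = in.read(buf)) {
          out.write(buf, 0, n);
        }
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      return out.toByteArray();
    }
  }

  // Codec name (as named in the spec) -> provider name -> implementation.
  private final Map<String, Map<String, Codec>> registry = new HashMap<>();

  void register(String codecName, String provider, Codec codec) {
    registry.computeIfAbsent(codecName, k -> new HashMap<>()).put(provider, codec);
  }

  // Returns the preferred provider if installed, otherwise the default one.
  Codec lookup(String codecName, String preferredProvider) {
    Map<String, Codec> providers = registry.get(codecName);
    if (providers == null) {
      throw new IllegalArgumentException("Unsupported codec: " + codecName);
    }
    Codec codec = providers.get(preferredProvider);
    if (codec == null) {
      codec = providers.get(DEFAULT_PROVIDER);
    }
    if (codec == null) {
      throw new IllegalStateException("No default provider for: " + codecName);
    }
    return codec;
  }
}
```

With something like this, a round trip where the writer uses the "fast"
provider and the reader only has "default" is exactly the case that must
never fail. A provider whose output cannot pass that check (for example
because of a non-standard history buffer) would not be a valid
implementation of the codec named in the spec.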

Regards,
Gabor

On Fri, Jun 19, 2020 at 4:30 PM Xu, Cheng A <[email protected]> wrote:

> Hi folks, any suggestions on this?
>
> Thanks
> Cheng Xu
>
> -----Original Message-----
> From: Dong, Xin <[email protected]>
> Sent: Friday, June 5, 2020 2:19 PM
> To: [email protected]
> Subject: RE: Proposal for CompressionCodec Provider-aware Compression
> Codec Lookup for parquet-mr
>
> Hi, Walid,
>
> We've moved the doc here for public access:
>
> https://docs.google.com/document/d/1ueSYq2FIzaom23cpHXppig93ylOxe8CU6EwS82dov2E/
>
> Thanks,
> Xin Dong
>
> -----Original Message-----
> From: Gara Walid <[email protected]>
> Sent: Thursday, June 4, 2020 2:14 PM
> To: [email protected]
> Subject: Re: Proposal for CompressionCodec Provider-aware Compression
> Codec Lookup for parquet-mr
>
> Hi Xin,
>
> Thanks for the proposal. Could you please make the Google doc public?
>
> Cheers,
> Walid
>
> On Thu, Jun 4, 2020, 6:46 AM Dong, Xin <[email protected]> wrote:
>
> > Hi, All,
> >
> > The existing Parquet compression codec framework only supports looking
> > up a compression implementation by codec name. The mapping is
> > one-to-one, which means only one implementation is supported for a
> > given codec name. However, there are various implementations for the
> > same codec name, and different implementations may not be compatible
> > with others due to different purposes. Taking Gzip as an example, some
> > accelerators have limited memory capacity, so their history buffer is
> > smaller than that of CPU-based implementations. Currently the codec
> > framework does not provide a mechanism that allows users to customize
> > a standard compression codec for their own purposes (e.g. performance
> > acceleration, workload offloading).
> > To address the problem, we propose a provider-aware compression codec
> > lookup for parquet-mr. We've put the proposal here:
> >
> > https://docs.google.com/document/d/1sbCjDxEjM5UkbMPNmGqEfF-LYPDWhM-B474dZZeOFD4/edit?ts=5ecb2462#heading=h.5b2qz2ba32wm
> >
> > Any comment is welcome and please let us know your feedback.
> >
> > Thanks,
> > Xin Dong
> >
>
