Hi, Ryan,

Currently the compression codec part of parquet-mr is tightly coupled with the
Hadoop codec API. It makes sense for parquet-mr to introduce a separate
compression API so that both Hadoop codecs and other customized codec
implementations can be used.
However, our proposal does not intend to address that issue this time. Codecs
that leverage accelerators will still implement the Hadoop codec API.
We already have a pull request here:
https://github.com/apache/parquet-mr/pull/762. In fact, it just allows the
codec to be configured through the configuration instead of only being looked
up via CompressionCodecName.
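
To make that concrete, here is a rough sketch of the idea (not the actual code
from PR #762): resolve the codec by class name from the Hadoop Configuration
instead of only through the CompressionCodecName enum. The property key
"parquet.compression.codec.class" and the codec class "com.example.QatGzipCodec"
below are hypothetical placeholders; the custom codec still implements the
Hadoop CompressionCodec interface, so nothing downstream of the lookup changes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CustomCodecLookup {

  // Load a codec class named in the configuration; the returned object is a
  // plain Hadoop CompressionCodec, so the reader/writer paths stay unchanged.
  public static CompressionCodec loadCodec(Configuration conf)
      throws ClassNotFoundException {
    String className = conf.get("parquet.compression.codec.class");
    Class<?> codecClass = conf.getClassByName(className);
    return (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical accelerator-backed codec that implements CompressionCodec.
    conf.set("parquet.compression.codec.class", "com.example.QatGzipCodec");
    CompressionCodec codec = loadCodec(conf);
    System.out.println("Loaded codec: " + codec.getClass().getName());
  }
}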

Thanks,
Xin Dong

-----Original Message-----
From: Dong, Xin <[email protected]> 
Sent: Wednesday, March 11, 2020 8:18 AM
To: Ryan Blue <[email protected]>; [email protected]
Subject: RE: Provide pluggable APIs to support user customized compression codec

Hi, Ryan,
Could you point me to the proposal for the module that you mentioned?
Does it mean that the FileSystem API is separated out so that different
file systems can plug in? We just need to figure out whether our proposal
can meet its requirements.
Thanks,
Xin Dong

-----Original Message-----
From: Ryan Blue <[email protected]>
Sent: Saturday, March 7, 2020 2:32 AM
To: Parquet Dev <[email protected]>
Subject: Re: Provide pluggable APIs to support user customized compression codec

I think it's a good idea to make this more customizable, as long as the 
compression codecs themselves are an agreed-upon set.

One of the only blockers preventing us from building a module that doesn't rely 
on the Hadoop FileSystem API is that we use the Hadoop compression API. Being 
able to plug in something else would be great!
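
For discussion, such a plug-in point could be as small as the interface below.
This is not an existing parquet-mr API, just a sketch of one possible
Hadoop-free shape, and the names are placeholders.

import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical sketch of a Hadoop-free compression SPI for parquet-mr.
public interface ParquetCompressionCodec {

  // Compress the readable contents of `input` into a newly allocated buffer.
  ByteBuffer compress(ByteBuffer input) throws IOException;

  // Decompress `input`; `uncompressedSize` is known from the page header.
  ByteBuffer decompress(ByteBuffer input, int uncompressedSize) throws IOException;

  // Name recorded in the file metadata so readers can pick the same codec.
  String name();
}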

On Fri, Mar 6, 2020 at 7:55 AM Xu, Cheng A <[email protected]> wrote:

> Hi Martin
> > I suppose this is only about the Parquet Writer/Reader implementation,
> > not about changes to the Parquet specification.
> [Cheng's comments] Yes, we don't need to change the specification (Parquet
> format) unless we want to introduce a new compression codec. More often,
> customers will extend or replace the built-in codecs with their own, so these
> are codec-level changes in the Parquet reader/writer.
>
> > I would like to know whether offloading the task of
> > compressing/decompressing some data is really beneficial performance-wise.
> [Cheng's comments] There are two benefits expected from accelerators:
> 1) CPU offloading and 2) better performance. For the second one, you can
> check the following talk, which includes detailed performance numbers.
>
> https://databricks.com/session_eu19/accelerating-apache-spark-with-int
> el-quickassist-technology
>
> > - The accelerator might require the compressed data to be copied over to
> > decompress it. This will only make compression/decompression slower since
> > many of the supported codecs actually have quite fast parsers and
> > decompressors. The accelerator would have to copy it back.
> > - Even if it doesn't have to be copied over, I suppose this accelerator is
> > connected over the PCI-E bus so reading chunks would be expensive. Also,
> > many of those decompressors reference chunks observed previously and
> > perform a memcpy. The accelerator implementation has to be smart about
> > those things.
> > - Many of the decompressors do some decoding and essentially perform a
> > memcpy which makes them quite fast.
>
> [Cheng's comments] Yes, there are several ways to address this, such as
> implementing a DMA engine in the FPGA or using shared memory.
>
> > - Can the supported codecs like zstd, lz4, etc. run on those accelerators?
> [Cheng's comments] Yes.
>
> > Have you done some measurements?
> [Cheng's comments] Please see the slides above as a reference.
>
> Thanks
> Cheng Xu
>
> -----Original Message-----
> From: Radev, Martin <[email protected]>
> Sent: Wednesday, March 4, 2020 6:02 PM
> To: [email protected]
> Subject: Re: Provide pluggable APIs to support user customized 
> compression codec
>
> Hi Xin,
>
>
> thanks for the interest in extending Parquet. I suppose this is only 
> about the Parquet Writer/Reader implementation, not about changes to 
> the Parquet specification.
>
> I would like to know whether offloading the task of 
> compressing/decompressing some data is really beneficial performance-wise.
>
> I suppose I don't understand how all of this would come together. Here 
> are my points:
>
> - The accelerator might require the compressed data to be copied over to
> decompress it. This will only make compression/decompression slower since
> many of the supported codecs actually have quite fast parsers and
> decompressors. The accelerator would have to copy it back.
>
> - Even if it doesn't have to be copied over, I suppose this 
> accelerator is connected over the PCI-E bus so reading chunks would be 
> expensive. Also, many of those decompressors reference chunks observed 
> previously and perform a memcpy. The accelerator implementation has to 
> be smart about those things.
> - Many of the decompressors do some decoding and essentially perform a 
> memcpy which makes them quite fast.
>
> - Can the supported codecs like zstd, lz4, etc. run on those accelerators?
>
> Have you done some measurements?
>
>
> Kind regards,
>
> Martin
>
>
>
> ________________________________
> From: Dong, Xin <[email protected]>
> Sent: Wednesday, March 4, 2020 1:46:29 AM
> To: [email protected]
> Subject: Provide pluggable APIs to support user customized compression 
> codec
>
> Hi,
> In pursuit of better performance, quite a few end users want to leverage
> accelerators (e.g. FPGA, Intel QAT) to offload compression computation.
> However, in the current parquet-mr code, the codec implementation can't be
> customized to leverage accelerators. We would like to propose a pluggable
> API to support customized compression codecs.
> I've opened a JIRA, https://issues.apache.org/jira/browse/PARQUET-1804,
> for this issue. What are your thoughts on it?
> Best Regards,
> Xin Dong
>


--
Ryan Blue
Software Engineer
Netflix
