Thanks Micah and Wes for the reply. Regarding the scope, we are working 
together with the Parquet community to refine our proposal. 
https://www.mail-archive.com/dev@parquet.apache.org/msg12463.html

The proposal here is more general to Arrow (indeed it can be used by native 
Parquet as well). Since Arrow is an in-memory format used mostly for 
intermediate data, I would expect fewer backward-compatibility concerns than 
with the on-disk Parquet format. Considering this, we can discuss the two 
parts separately. The Parquet part should behave consistently with Java 
Parquet, and the Arrow part should also be compatible with the new extensible 
Parquet compression codec framework. We can start with the Parquet part first.
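To make the discussion concrete, here is a minimal sketch of what a codec 
plugin registry could look like. All names here (`CodecRegistry`, `QatGzip`, 
etc.) are illustrative assumptions, not the actual Arrow or proposal API: the 
idea is simply that a user-registered factory overrides the built-in 
implementation for a given codec name, while unregistered names fall back to 
the built-in codec.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical sketch only -- these names are assumptions, not Arrow's API.
struct Codec {
  virtual ~Codec() = default;
  virtual std::string name() const = 0;
};

// Stand-in for the built-in software gzip codec.
struct BuiltinGzip : Codec {
  std::string name() const override { return "builtin-gzip"; }
};

// Stand-in for an accelerator-backed (e.g. QAT-offloaded) gzip codec.
struct QatGzip : Codec {
  std::string name() const override { return "qat-gzip"; }
};

using CodecFactory = std::function<std::unique_ptr<Codec>()>;

class CodecRegistry {
 public:
  // User plugins register a factory under a codec name.
  void Register(const std::string& codec_name, CodecFactory factory) {
    plugins_[codec_name] = std::move(factory);
  }

  // Lookup prefers a registered plugin; otherwise falls back to the
  // built-in implementation, so the output format stays the same.
  std::unique_ptr<Codec> Create(const std::string& codec_name) const {
    auto it = plugins_.find(codec_name);
    if (it != plugins_.end()) return it->second();
    return std::make_unique<BuiltinGzip>();
  }

 private:
  std::map<std::string, CodecFactory> plugins_;
};
```

Because the plugin produces bitstream-compatible output (e.g. standard gzip), 
a file written with the accelerated codec would remain readable by readers 
that only have the built-in one.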

Thanks
Cheng Xu

From: Micah Kornfield <emkornfi...@gmail.com>
Sent: Tuesday, June 23, 2020 12:11 PM
To: dev <dev@arrow.apache.org>
Cc: Xu, Cheng A <cheng.a...@intel.com>; Xie, Qi <qi....@intel.com>
Subject: Re: Proposal for the plugin API to support user customized compression 
codec

It would be good to clarify the exact scope of this.  If it is particular to 
parquet then we should wait for the discussion on dev@parquet to conclude 
before moving forward.  If it is more general to Arrow, then working through 
scenarios of how this would be used for decompression when the Codec can't 
support generic input would be useful (the codec library is a singleton across 
the arrow codebase).

On Mon, Jun 22, 2020 at 4:23 PM Wes McKinney 
<wesmck...@gmail.com<mailto:wesmck...@gmail.com>> wrote:
hi XieQi,

Is the idea that your custom Gzip implementation would automatically
override any places in the codebase where the built-in one would be
used (like the Parquet codebase)? I see some things in the design doc
about serializing the plugin information in the Parquet file metadata
(assuming you want to speed up decompression of Parquet data pages) -- is
there a reason to believe that the plugin would be _required_ in order
to read the file? I recall some messages to the Parquet mailing list
about user-defined codecs.

In general, having a plugin API to provide a means to substitute one
functionally identical implementation for another seems reasonable to me (I could
envision having people customizing kernel execution in the future). We
should try to create a general enough API so that it can be used for
customizations beyond compression codecs so we don't have to go
through a design exercise to support plugin/algorithm overrides for
something else. This is something we could hash out during code review
-- I should have some opinions and I'm sure others will as well.

- Wes

On Fri, Jun 19, 2020 at 10:21 AM Xie, Qi 
<qi....@intel.com<mailto:qi....@intel.com>> wrote:
>
> Hi,
>
>
> In demand of better performance, quite a few end users want to leverage 
> accelerators (e.g. FPGA, Intel QAT) to offload compression. However, the 
> current Arrow compression framework only supports codec-name-based 
> compression implementations and can't be customized to leverage 
> accelerators. For example, for the gzip format, we can't call a customized 
> codec to accelerate compression. We would like to propose a plugin API to 
> support customized compression codecs. We've put the proposal here:
>
>
>
> https://docs.google.com/document/d/1W_TxVRN7WV1wBVOTdbxngzBek1nTolMlJWy6aqC6WG8/edit
>
>
>
> Any comment is welcome and please let us know your feedback.
>
>
>
> Thanks,
>
> XieQi
>
>
>
