My thought here is that Parquet, as a data format, should provide a plugin mechanism. With such APIs, users would be able to plug in their own optimized implementations. Any implementation built on those APIs should also address compatibility, for example through a fallback mechanism. Accelerators typically ship with a CPU-based implementation as well, and the device buffer size is sometimes limited; in that case, the implementation falls back to the CPU-based path.
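For illustration, a rough sketch in Java of what such a pluggable compressor with CPU fallback could look like (all names are hypothetical; this is not an existing parquet-mr API):

// Hypothetical SPI sketch -- not an existing parquet-mr interface.
// Shows an accelerator-backed compressor that falls back to a CPU
// implementation when the input exceeds the device buffer limit.
import java.io.IOException;
import java.nio.ByteBuffer;

interface PluggableCompressor {
    // Compress src into dest, returning the number of bytes written.
    int compress(ByteBuffer src, ByteBuffer dest) throws IOException;
}

final class AcceleratedCompressor implements PluggableCompressor {
    // Assumed hardware constraint, purely illustrative.
    private static final int DEVICE_BUFFER_LIMIT = 2 * 1024 * 1024;

    private final PluggableCompressor cpuFallback;

    AcceleratedCompressor(PluggableCompressor cpuFallback) {
        this.cpuFallback = cpuFallback;
    }

    @Override
    public int compress(ByteBuffer src, ByteBuffer dest) throws IOException {
        // Fall back to the CPU path when the accelerator cannot take the input.
        if (src.remaining() > DEVICE_BUFFER_LIMIT || !deviceAvailable()) {
            return cpuFallback.compress(src, dest);
        }
        return offloadToDevice(src, dest);
    }

    private boolean deviceAvailable() {
        return false; // placeholder: query the FPGA/QAT driver here
    }

    private int offloadToDevice(ByteBuffer src, ByteBuffer dest) {
        throw new UnsupportedOperationException("device offload stub");
    }
}

The API would only fix the contract; each vendor decides when to offload and when to fall back.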
Thoughts?

Thanks,
Cheng Xu

-----Original Message-----
From: Gabor Szadovszky <[email protected]>
Sent: Wednesday, March 4, 2020 5:59 PM
To: Parquet Dev <[email protected]>
Subject: Re: Provide pluggable APIs to support user customized compression codec

Hi,

My problem with this idea is that I cannot see how we can control that a customized codec would compress the data in the specified way, so that every reader that supports the codec can read it. We already have an issue about an incompatibility between the Java and C++ implementations of LZ4 compression (see https://issues.apache.org/jira/browse/PARQUET-1241 for details). Meanwhile, there might be several ways to generate a compatible compression, so it is fair to allow configuration of the codec; we just don't know how to properly control the output.

Cheers,
Gabor

On Tue, Mar 3, 2020 at 7:00 PM Dong, Xin <[email protected]> wrote:
> Hi,
>
> In demand of better performance, quite a few end users want to leverage
> accelerators (e.g. FPGA, Intel QAT) to offload compression computation.
> However, in the current parquet-mr code, the codec implementation can't
> be customized to leverage accelerators. We would like to propose a
> pluggable API to support customized compression codecs.
>
> I've opened a JIRA, https://issues.apache.org/jira/browse/PARQUET-1804,
> for this issue. What are your thoughts?
>
> Best Regards,
> Xin Dong
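As an aside on the compatibility concern Gabor raises above, a minimal roundtrip check is the usual way to gain confidence that a custom codec's output stays readable by a reference implementation. A sketch (all names hypothetical; parquet-mr ships no such harness):

// Hypothetical compatibility harness -- not part of parquet-mr.
// Verifies that bytes produced by a custom codec can be decompressed
// by a reference implementation back to the original input.
import java.io.IOException;
import java.util.Arrays;

final class CodecCompatCheck {
    interface Codec {
        byte[] compress(byte[] input) throws IOException;
        byte[] decompress(byte[] input, int uncompressedSize) throws IOException;
    }

    // Returns true if the reference codec can read the custom codec's output.
    static boolean crossReadable(Codec custom, Codec reference, byte[] sample)
            throws IOException {
        byte[] compressed = custom.compress(sample);
        byte[] restored = reference.decompress(compressed, sample.length);
        return Arrays.equals(sample, restored);
    }
}

Such a check catches exactly the class of divergence seen in PARQUET-1241, where two implementations of the same nominal codec produced mutually unreadable output.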
