Hi all,

The existing Parquet compression codec framework looks up compression 
implementations by codec name only. The mapping is one-to-one, so only one 
implementation can be used for a given codec name.
However, a single codec name can have several implementations, and they are 
not always compatible with each other because they serve different purposes. 
Take Gzip as an example: some accelerators have limited memory capacity, so 
their history buffer is smaller than that of a CPU-based implementation. The 
codec framework currently provides no mechanism for users to customize a 
standard compression codec for their own purposes (e.g. performance 
acceleration, workload offloading).
To address the problem, we propose a provider-aware compression codec lookup 
for parquet-mr. The proposal is here:
https://docs.google.com/document/d/1sbCjDxEjM5UkbMPNmGqEfF-LYPDWhM-B474dZZeOFD4/edit?ts=5ecb2462#heading=h.5b2qz2ba32wm
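
To make the idea concrete, here is a rough sketch of what a lookup keyed by 
both codec name and provider could look like. It is illustrative only and is 
not the design in the linked document or any existing parquet-mr API: the 
class ProviderAwareCodecRegistry, the Codec interface, the "default" provider 
name, and the fallback to the default provider are all assumptions made for 
this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names come from parquet-mr or the proposal.
final class ProviderAwareCodecRegistry {

    /** Stand-in for a real compression codec implementation. */
    interface Codec {
        String describe();
    }

    /** Assumed name of the provider used when no specific one matches. */
    static final String DEFAULT_PROVIDER = "default";

    // codec name -> (provider name -> factory for that implementation)
    private final Map<String, Map<String, Supplier<Codec>>> factories = new HashMap<>();

    /** Registers one implementation of a codec under a provider name. */
    void register(String codecName, String provider, Supplier<Codec> factory) {
        factories.computeIfAbsent(codecName, k -> new HashMap<>()).put(provider, factory);
    }

    /**
     * Looks up a codec by name and provider. If the requested provider is not
     * registered for that codec, falls back to the default provider
     * (an assumption of this sketch, not a stated requirement).
     */
    Codec lookup(String codecName, String provider) {
        Map<String, Supplier<Codec>> byProvider = factories.get(codecName);
        if (byProvider == null) {
            throw new IllegalArgumentException("Unknown codec: " + codecName);
        }
        Supplier<Codec> factory = byProvider.get(provider);
        if (factory == null) {
            factory = byProvider.get(DEFAULT_PROVIDER);
        }
        if (factory == null) {
            throw new IllegalArgumentException(
                "No provider '" + provider + "' and no default for codec: " + codecName);
        }
        return factory.get();
    }
}
```

With something like this, a user could register an accelerator-backed Gzip 
under its own provider name next to the CPU-based one, and choose between 
them per job without changing the codec name written in the file metadata.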

Any comments and feedback are welcome.

Thanks,
Xin Dong
