Parquet encoding improvement

Sun, Dapeng Fri, 21 Jul 2017 01:54:56 -0700

Hello,

I have recently been looking into the encodings in Parquet and found some 
potential optimizations. I have filed JIRAs for them, and any comments are 
welcome :). Here are the related JIRAs:


PARQUET-1058: Support enable/disable dictionary for column
I ran into this issue on the Hive side. Thanks to Uwe's comments, I learned 
that Parquet already supports configuring the encoding used by ValuesWriters 
(PARQUET-601). On the Hive (Hadoop) side, however, ParquetOutputFormat is the 
entry point to ParquetWriter, and ParquetOutputFormat currently does not 
support setting a custom ValuesWriterFactory. I therefore filed another JIRA 
to support setting the factory on the Hive side (PARQUET-1062). The initial 
patch can be found at 
https://github.com/apache/parquet-mr/pull/419
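
To illustrate what a per-column switch could look like from the configuration 
side, here is a rough sketch. It is not parquet-mr code: only the global key 
"parquet.enable.dictionary" (default true) is an existing property; the 
per-column key format and the class and method names are hypothetical and 
were made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; not part of parquet-mr. */
final class ColumnDictionaryConfigSketch {

    /** Existing global property used by ParquetOutputFormat. */
    static final String GLOBAL_KEY = "parquet.enable.dictionary";

    /** Hypothetical per-column key: global key + ".column." + column path. */
    static String columnKey(String columnPath) {
        return GLOBAL_KEY + ".column." + columnPath;
    }

    /** A per-column setting wins; otherwise fall back to the global setting. */
    static boolean dictionaryEnabled(Map<String, String> conf, String columnPath) {
        String perColumn = conf.get(columnKey(columnPath));
        if (perColumn != null) {
            return Boolean.parseBoolean(perColumn);
        }
        return Boolean.parseBoolean(conf.getOrDefault(GLOBAL_KEY, "true"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(GLOBAL_KEY, "true");
        // Disable the dictionary only for a (hypothetical) high-cardinality column.
        conf.put(columnKey("user_id"), "false");

        System.out.println("user_id: " + dictionaryEnabled(conf, "user_id"));
        System.out.println("country: " + dictionaryEnabled(conf, "country"));
    }
}
```

In the real writer, the resolved flag would have to reach the 
ValuesWriterFactory, which is what PARQUET-1062 is about.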

PARQUET-1060: Parquet Dictionary should support encoding
As far as I can tell, the Parquet dictionary itself is currently written with 
plain encoding. I think it could be made smaller with bit packing.
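
A minimal sketch of the size difference, assuming an INT32 dictionary with 
small non-negative values. The class and method names are made up for this 
example and the packing layout is not the one parquet-mr uses; it only shows 
that a fixed bit width can beat 4 bytes per value.

```java
import java.util.Arrays;

/** Illustrative sketch only; not part of parquet-mr. */
final class DictionaryBitPackingSketch {

    /** Number of bits needed to represent the non-negative value max. */
    static int bitWidth(int max) {
        return 32 - Integer.numberOfLeadingZeros(max);
    }

    /** Packs non-negative ints at a fixed bit width, least significant bit first. */
    static byte[] pack(int[] values, int width) {
        byte[] out = new byte[(values.length * width + 7) / 8];
        int bitPos = 0;
        for (int v : values) {
            for (int b = 0; b < width; b++, bitPos++) {
                if (((v >>> b) & 1) != 0) {
                    out[bitPos / 8] |= (byte) (1 << (bitPos % 8));
                }
            }
        }
        return out;
    }

    /** Reverses pack() for count values of the given bit width. */
    static int[] unpack(byte[] packed, int width, int count) {
        int[] out = new int[count];
        int bitPos = 0;
        for (int i = 0; i < count; i++) {
            for (int b = 0; b < width; b++, bitPos++) {
                if (((packed[bitPos / 8] >> (bitPos % 8)) & 1) != 0) {
                    out[i] |= 1 << b;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] dictionary = {3, 17, 42, 100, 1000};
        int width = bitWidth(1000);
        byte[] packed = pack(dictionary, width);

        // Plain encoding stores each INT32 value in 4 bytes.
        System.out.println("plain bytes:  " + dictionary.length * Integer.BYTES);
        System.out.println("packed bytes: " + packed.length + " at " + width + " bits/value");
        System.out.println("round trip ok: "
                + Arrays.equals(dictionary, unpack(packed, width, dictionary.length)));
    }
}
```

A real implementation would also need to record the bit width in the 
dictionary page, and the gain depends on the value range of the dictionary.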

PARQUET-1059: Improve the RLE encoding for Parquet Dictionary IDs
(https://issues.apache.org/jira/browse/PARQUET-1059)
The IDs produced by Parquet dictionary encoding are written with 
RunLengthBitPackingHybridEncoder. It handles repeated values and bit packing 
well, but I think it could be improved further with an approach like 
DeltaBinaryPackingWriter, because in some cases consecutive dictionary IDs 
are adjacent or close to each other.
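
A small sketch of why adjacent IDs favor a delta-based approach. It only 
compares bits per value and ignores headers, block layout, and the run-length 
half of the hybrid encoder; the class and method names are made up for this 
example. It assumes, as delta binary packing does, that the minimum delta is 
subtracted before the deltas are packed.

```java
/** Illustrative sketch only; not part of parquet-mr. */
final class DictionaryIdDeltaSketch {

    /** Number of bits needed to represent the non-negative value max. */
    static int bitWidth(long max) {
        return 64 - Long.numberOfLeadingZeros(max);
    }

    /** Bits per ID when the IDs are bit-packed directly: width of the largest ID. */
    static int packedWidth(int[] ids) {
        int max = 0;
        for (int id : ids) {
            max = Math.max(max, id);
        }
        return bitWidth(max);
    }

    /** Bits per delta after subtracting the minimum delta from every delta. */
    static int deltaWidth(int[] ids) {
        if (ids.length < 2) {
            return 0;
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (int i = 1; i < ids.length; i++) {
            long delta = (long) ids[i] - ids[i - 1];
            min = Math.min(min, delta);
            max = Math.max(max, delta);
        }
        return bitWidth(max - min);
    }

    public static void main(String[] args) {
        // Adjacent IDs: no repeated values, so run-length encoding finds no runs.
        int[] adjacent = {100, 101, 102, 103, 104, 105, 106, 107};
        // Nearby but unordered IDs.
        int[] nearby = {100, 102, 101, 104, 103};

        System.out.println("adjacent: packed=" + packedWidth(adjacent)
                + " bits, delta=" + deltaWidth(adjacent) + " bits");
        System.out.println("nearby:   packed=" + packedWidth(nearby)
                + " bits, delta=" + deltaWidth(nearby) + " bits");
    }
}
```

For IDs that are spread randomly over the dictionary, the deltas can be as 
wide as the IDs themselves, so this would only help for some data.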


Regards,
Dapeng
