Hello,

I have been looking into the encodings in Parquet recently and found some potential optimizations. I have filed JIRAs for them, and any comments are welcome :). Here are the related JIRAs:

PARQUET-1058: Support enable/disable dictionary for column
I found this issue on the Hive side. Thanks to Uwe's comments, I learned that Parquet already supports configuring the encoding used by ValueWriters (PARQUET-601). On the Hive (Hadoop) side, however, ParquetOutputFormat is the entry point to ParquetWriter, and ParquetOutputFormat currently doesn't support setting a custom ValuesWriterFactory. So I filed another JIRA to support setting the factory on the Hive side (PARQUET-1062). The initial patch can be found at https://github.com/apache/parquet-mr/pull/419

PARQUET-1060: Parquet Dictionary should support encoding
Currently the Parquet dictionary is plain encoded. I think it could be improved with bit packing.

PARQUET-1059 (https://issues.apache.org/jira/browse/PARQUET-1059): Improve the RLE encoding for Parquet Dictionary IDs
The IDs of the Parquet dictionary encoding are written with RunLengthBitPackingHybridEncoder. It handles repeated values and bit packing well, but I think it could still be improved with a method like DeltaBinaryPackingWriter, because in some cases the dictionary IDs may be adjacent or close to each other.

Regards,
Dapeng
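
To make the PARQUET-1059 idea more concrete, here is a rough size model in Python. It is only a sketch under simplifying assumptions, not the actual parquet-mr encoders: the function names are hypothetical, the bit-packed case assumes every ID is stored at the width of the largest possible ID, and the delta case assumes one 32-bit first value, one 32-bit minimum delta, and a single common width for all remaining deltas. The real delta binary packing format splits values into blocks and mini-blocks, so actual sizes will differ.

```python
def bit_width(value: int) -> int:
    """Number of bits needed to represent a non-negative integer."""
    return value.bit_length()


def bit_packed_size(ids, dictionary_size):
    """Bits used when every ID is packed at the width of the largest
    possible ID (simplified model of the bit-packed case)."""
    return len(ids) * bit_width(dictionary_size - 1)


def delta_packed_size(ids):
    """Bits used by a simplified delta scheme (assumption, not the real
    Parquet layout): 32 bits for the first value, 32 bits for the minimum
    delta, then (delta - min_delta) packed at one common width."""
    if not ids:
        return 0
    deltas = [b - a for a, b in zip(ids, ids[1:])]
    if not deltas:
        return 32
    min_delta = min(deltas)
    width = bit_width(max(d - min_delta for d in deltas))
    return 32 + 32 + len(deltas) * width


# 256 adjacent IDs taken from a dictionary with 65536 entries.
adjacent_ids = list(range(1000, 1256))
print(bit_packed_size(adjacent_ids, 65536))  # 256 values * 16 bits
print(delta_packed_size(adjacent_ids))       # all deltas equal, so width 0
```

In this model, adjacent IDs do not repeat, so run-length encoding cannot collapse them and each one costs the full ID width, while the delta form costs almost nothing per value. IDs that are close but not strictly adjacent would fall between the two cases.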
