Hello,

While playing with "parquet-tools", I found that column statistics are
not printed when the following command is executed:

$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar schema --detailed perf.1000.parquet

The output for a row group looks like this:

=====================================================================================================================

row group 1: RC:747388 TS:134218473 OFFSET:4
--------------------------------------------------------------------------------
cust_key:  INT64 UNCOMPRESSED DO:0 FPO:4 SZ:5979444/5979444/1.00
VC:747388 ENC:PLAIN,RLE,BIT_PACKED
name:  BINARY UNCOMPRESSED DO:0 FPO:5979448 SZ:16443766/16443766/1.00
VC:747388 ENC:PLAIN,RLE,BIT_PACKED
address:  BINARY UNCOMPRESSED DO:0 FPO:22423214
SZ:21716568/21716568/1.00 VC:747388 ENC:PLAIN,RLE,BIT_PACKED
nation_key:  INT32 UNCOMPRESSED DO:0 FPO:44139782
SZ:2989697/2989697/1.00 VC:747388 ENC:PLAIN,RLE,BIT_PACKED
phone:  BINARY UNCOMPRESSED DO:0 FPO:47129479
SZ:14201364/14201364/1.00 VC:747388 ENC:PLAIN,RLE,BIT_PACKED
acctbal:  DOUBLE UNCOMPRESSED DO:0 FPO:61330843
SZ:5979444/5979444/1.00 VC:747388 ENC:PLAIN,RLE,BIT_PACKED
mktsegment:  BINARY UNCOMPRESSED DO:0 FPO:67310287
SZ:9714675/9714675/1.00 VC:747388 ENC:PLAIN,RLE,BIT_PACKED
comment_col:  BINARY UNCOMPRESSED DO:0 FPO:77024962
SZ:57193515/57193515/1.00 VC:747388 ENC:PLAIN,RLE,BIT_PACKED

=====================================================================================================================



I then dug into the code and found that the function "private static void
showDetails(PrettyPrintWriter out, ColumnChunkMetaData meta, boolean name)"
in MetaDataUtils.java is responsible for printing the output above. I
inserted a few lines into "showDetails(...)" to print the MIN and MAX
values of the Statistics object, which can be retrieved from the
ColumnChunkMetaData that is passed as an argument to "showDetails".
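
As a rough illustration of that kind of change (not the actual patch), the
sketch below appends MIN and MAX after the value count when statistics are
present. It is self-contained on purpose: "ColumnStats" and "formatStats"
are hypothetical stand-ins for the Statistics object and the relevant part
of "showDetails", not parquet-mr APIs.

```java
public class ShowDetailsSketch {

    /** Hypothetical stand-in for the statistics attached to a column chunk. */
    public static final class ColumnStats {
        final Object min;
        final Object max;
        final boolean empty;

        public ColumnStats(Object min, Object max, boolean empty) {
            this.min = min;
            this.max = max;
            this.empty = empty;
        }
    }

    /**
     * Builds the "VC:... MIN: ... MAX: ..." fragment of a column line.
     * MIN and MAX are skipped when no statistics were recorded.
     */
    public static String formatStats(long valueCount, ColumnStats stats) {
        StringBuilder sb = new StringBuilder();
        sb.append("VC:").append(valueCount);
        if (stats != null && !stats.empty) {
            sb.append(" MIN: ").append(stats.min);
            sb.append(" MAX: ").append(stats.max);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Values taken from the cust_key column in the output of this mail.
        System.out.println(formatStats(747388L, new ColumnStats(0L, 747387L, false)));
        // A chunk without statistics keeps the original format.
        System.out.println(formatStats(747388L, null));
    }
}
```

Guarding on missing or empty statistics keeps the original line format for
files that were written without them.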


After the modification, the output looks like this:

=====================================================================================================================

row group 1: RC:747388 TS:134218473 OFFSET:4
--------------------------------------------------------------------------------
cust_key:  INT64 UNCOMPRESSED DO:0 FPO:4 SZ:5979444/5979444/1.00
VC:747388 MIN: 0 MAX: 747387 ENC:PLAIN,BIT_PACKED,RLE
name:  BINARY UNCOMPRESSED DO:0 FPO:5979448 SZ:16443766/16443766/1.00
VC:747388 MIN: Binary{18 bytes, [67, 117, 115, 116, 111, 109, 101,
114, 35, 48, 48, 48, 48, 48, 48, 48, 48, 49]} MAX: Binary{18 bytes,
[67, 117, 115, 116, 111, 109, 101, 114, 35, 48, 48, 48, 49, 53, 48,
48, 48, 48]} ENC:PLAIN,BIT_PACKED,RLE
address:  BINARY UNCOMPRESSED DO:0 FPO:22423214
SZ:21716568/21716568/1.00 VC:747388 MIN: Binary{13 bytes, [32, 32, 32,
50, 117, 90, 119, 86, 104, 81, 118, 119, 65]} MAX: Binary{23 bytes,
[122, 122, 120, 71, 107, 116, 122, 88, 84, 77, 75, 83, 49, 66, 120,
90, 108, 103, 81, 57, 110, 113, 81]} ENC:PLAIN,BIT_PACKED,RLE
nation_key:  INT32 UNCOMPRESSED DO:0 FPO:44139782
SZ:2989697/2989697/1.00 VC:747388 MIN: 0 MAX: 24
ENC:PLAIN,BIT_PACKED,RLE
phone:  BINARY UNCOMPRESSED DO:0 FPO:47129479
SZ:14201364/14201364/1.00 VC:747388 MIN: Binary{15 bytes, [49, 48, 45,
49, 48, 48, 45, 49, 48, 54, 45, 49, 54, 49, 55]} MAX: Binary{15 bytes,
[51, 52, 45, 57, 57, 57, 45, 54, 49, 56, 45, 54, 56, 56, 49]}
ENC:PLAIN,BIT_PACKED,RLE
acctbal:  DOUBLE UNCOMPRESSED DO:0 FPO:61330843
SZ:5979444/5979444/1.00 VC:747388 MIN: -999.99 MAX: 9999.99
ENC:PLAIN,BIT_PACKED,RLE
mktsegment:  BINARY UNCOMPRESSED DO:0 FPO:67310287
SZ:9714675/9714675/1.00 VC:747388 MIN: Binary{10 bytes, [65, 85, 84,
79, 77, 79, 66, 73, 76, 69]} MAX: Binary{9 bytes, [77, 65, 67, 72, 73,
78, 69, 82, 89]} ENC:PLAIN,BIT_PACKED,RLE
comment_col:  BINARY UNCOMPRESSED DO:0 FPO:77024962
SZ:57193515/57193515/1.00 VC:747388 MIN: Binary{116 bytes, [32, 84,
105, 114, 101, 115, 105, 97, 115, 32, 97, 99, 99, 111, 114, 100, 105,
110, 103, 32, 116, 111, 32, 116, 104, 101, 32, 115, 108, 121, 108,
121, 32, 98, 108, 105, 116, 104, 101, 32, 105, 110, 115, 116, 114,
117, 99, 116, 105, 111, 110, 115, 32, 100, 101, 116, 101, 99, 116, 32,
113, 117, 105, 99, 107, 108, 121, 32, 97, 116, 32, 116, 104, 101, 32,
115, 108, 121, 108, 121, 32, 101, 120, 112, 114, 101, 115, 115, 32,
99, 111, 117, 114, 116, 115, 46, 32, 101, 120, 112, 114, 101, 115,
115, 32, 100, 105, 110, 111, 115, 32, 119, 97, 107, 101, 32]} MAX:
Binary{41 bytes, [122, 122, 108, 101, 46, 32, 98, 108, 105, 116, 104,
101, 108, 121, 32, 114, 101, 103, 117, 108, 97, 114, 32, 105, 110,
115, 116, 114, 117, 99, 116, 105, 111, 110, 115, 32, 99, 97, 106, 111,
108]} ENC:PLAIN,BIT_PACKED,RLE

=====================================================================================================================
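
The BINARY MIN/MAX values above are printed as raw byte lists, which are
hard to read. Assuming these columns hold UTF-8 text (a BINARY column does
not have to), a small helper like the following sketch shows what they
contain. "BinaryStatDecoder" and "decode" are hypothetical names used only
for illustration and are not part of parquet-tools.

```java
import java.nio.charset.StandardCharsets;

public class BinaryStatDecoder {

    /** Turns a list of byte values, as printed above, into a UTF-8 string. */
    public static String decode(int[] values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // MIN and MAX of the mktsegment column, copied from the output above.
        int[] min = {65, 85, 84, 79, 77, 79, 66, 73, 76, 69};
        int[] max = {77, 65, 67, 72, 73, 78, 69, 82, 89};
        // Prints: MIN: AUTOMOBILE MAX: MACHINERY
        System.out.println("MIN: " + decode(min) + " MAX: " + decode(max));
    }
}
```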


Was this feature left out intentionally?


Regards,
Onur Soyer
