Hi all,

I ran an experiment to compare the Parquet files generated by parquet-cpp with 
those generated by stock Spark.

I found that the Parquet file generated by parquet-cpp is missing some 
metadata.

I used the command "hadoop jar parquet-cli-1.9.1-runtime.jar 
org.apache.parquet.cli.Main meta file_name.parquet" to read the generated 
Parquet files.


Case 1: Enable dictionary (config("parquet.enable.dictionary", "true"))

Metadata of the Parquet file generated by parquet-cpp:

Properties: (none)
Schema:
message schema {
  optional int32 c1;
  optional int64 c2;
}


Row group 0:  count: 1000000  5.66 B records  start: 4  total: 5.397 MB
--------------------------------------------------------------------------------
    type      encodings count     avg size   nulls   min / max
c1  INT32     S RBR_    1000000   3.19 B     0       -999996 / 0
c2  INT64     S RBR_    1000000   2.47 B     0       -999996 / 0

Metadata of the Parquet file generated by Spark:

Properties:
  org.apache.spark.sql.parquet.row.metadata: 
{"type":"struct","fields":[{"name":"c1","type":"integer","nullable":true,"metadata":{}},{"name":"c2","type":"long","nullable":true,"metadata":{}}]}
Schema:
message spark_schema {
  optional int32 c1;
  optional int64 c2;
}


Row group 0:  count: 1000000  5.36 B records  start: 4  total: 5.112 MB
--------------------------------------------------------------------------------
    type      encodings count     avg size   nulls   min / max
c1  INT32     S _ R     1000000   3.21 B     0       -999996 / 0
c2  INT64     S _ R_ F  1000000   2.15 B     0       -999996 / 0


This shows that both the Properties and the encodings metadata differ between 
the two Parquet files.

I looked into the root cause.

The encodings differ because the Parquet file generated by parquet-cpp lacks 
"list<PageEncodingStats> encoding_stats" in its metadata. The list is empty in 
the parquet-cpp generated metadata but populated in the Spark generated 
metadata (the Spark generated metadata appears to be the correct one).
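
To make the missing field concrete, here is a minimal sketch of how a writer 
could build that list. It assumes the PageEncodingStats layout from 
parquet.thrift (page_type, encoding, count). The EncodingStatsCollector class 
and the example_stats function are hypothetical names used only for 
illustration; they are not parquet-cpp or parquet-mr APIs, and the page counts 
are made up rather than taken from the files above.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class PageType(Enum):
    # Values as defined for PageType in parquet.thrift.
    DATA_PAGE = 0
    INDEX_PAGE = 1
    DICTIONARY_PAGE = 2
    DATA_PAGE_V2 = 3


class Encoding(Enum):
    # Subset of the Encoding enum in parquet.thrift.
    PLAIN = 0
    PLAIN_DICTIONARY = 2
    RLE = 3
    BIT_PACKED = 4


@dataclass(frozen=True)
class PageEncodingStats:
    # Mirrors the three fields of the Thrift struct.
    page_type: PageType
    encoding: Encoding
    count: int


class EncodingStatsCollector:
    """Hypothetical helper: counts pages per (page type, encoding) pair."""

    def __init__(self):
        self._counts = Counter()

    def record_page(self, page_type, encoding):
        # Would be called once for every page written to the column chunk.
        self._counts[(page_type, encoding)] += 1

    def to_list(self):
        # Sort by enum value so the output order is deterministic.
        keys = sorted(self._counts, key=lambda k: (k[0].value, k[1].value))
        return [PageEncodingStats(pt, enc, self._counts[(pt, enc)])
                for pt, enc in keys]


def example_stats():
    # Illustrative column chunk: one dictionary page, three dictionary-encoded
    # data pages, then one PLAIN data page after a fallback.
    collector = EncodingStatsCollector()
    collector.record_page(PageType.DICTIONARY_PAGE, Encoding.PLAIN_DICTIONARY)
    for _ in range(3):
        collector.record_page(PageType.DATA_PAGE, Encoding.PLAIN_DICTIONARY)
    collector.record_page(PageType.DATA_PAGE, Encoding.PLAIN)
    return collector.to_list()
```

A reader that finds this list populated can tell, for example, whether every 
data page in a column chunk is dictionary encoded. With an empty list it can 
only fall back to the plain set of encodings.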

I also checked the Parquet documentation at 
https://parquet.apache.org/documentation/latest/ for the metadata 
specification, but could not find any information about encoding_stats there 
either.


Is this expected behaviour?

Thanks,
Omega

PS: In a separate branch of mine I have fixed this by adding encoding_stats to 
the metadata written by parquet-cpp.
