Hi all, I ran an experiment to compare the parquet files generated by parquet-cpp with those generated by regular Spark.
I then found that the parquet file generated by parquet-cpp is missing some metadata. I used the following command to read the generated parquet files:

    hadoop jar parquet-cli-1.9.1-runtime.jar org.apache.parquet.cli.Main meta file_name.parquet

Case 1: Enable dictionary (config("parquet.enable.dictionary", "true"))

Metadata of the cpp-generated parquet file:

    Properties: (none)
    Schema:
    message schema {
      optional int32 c1;
      optional int64 c2;
    }

    Row group 0: count: 1000000 5.66 B records start: 4 total: 5.397 MB
    --------------------------------------------------------------------------------
        type   encodings  count    avg size  nulls  min / max
    c1  INT32  S RBR_     1000000  3.19 B    0      -999996 / 0
    c2  INT64  S RBR_     1000000  2.47 B    0      -999996 / 0

Metadata of the Spark-generated parquet file:

    Properties:
      org.apache.spark.sql.parquet.row.metadata: {"type":"struct","fields":[{"name":"c1","type":"integer","nullable":true,"metadata":{}},{"name":"c2","type":"long","nullable":true,"metadata":{}}]}
    Schema:
    message spark_schema {
      optional int32 c1;
      optional int64 c2;
    }

    Row group 0: count: 1000000 5.36 B records start: 4 total: 5.112 MB
    --------------------------------------------------------------------------------
        type   encodings  count    avg size  nulls  min / max
    c1  INT32  S _ R      1000000  3.21 B    0      -999996 / 0
    c2  INT64  S _ R_ F   1000000  2.15 B    0      -999996 / 0

This shows that the Properties and encodings metadata differ between the two parquet files.

I did some analysis to find the root cause. The encodings differ because the cpp-generated parquet file lacks "list<PageEncodingStats> encoding_stats" in its metadata: the list is empty in the cpp-generated metadata, but it is populated in the Spark-generated metadata (the Spark-generated metadata seems to be the correct one).

Furthermore, I checked the parquet documentation at https://parquet.apache.org/documentation/latest/ for the metadata specification, but I could not find any information about encoding_stats there either.

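To make the missing field concrete: as far as I can tell, encoding_stats is declared in parquet.thrift (parquet-format) as an optional list on ColumnMetaData, and each PageEncodingStats entry holds a page_type, an encoding and a count. The sketch below is only an illustration in Python of how a writer could accumulate such a list, not the parquet-cpp change itself. The page list and the names collect_encoding_stats and all_data_pages_dictionary_encoded are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PageEncodingStats:
    """Plain-Python stand-in for the Thrift struct (page_type, encoding, count)."""
    page_type: str
    encoding: str
    count: int


def collect_encoding_stats(pages: Iterable[Tuple[str, str]]) -> List[PageEncodingStats]:
    """Count written pages per (page_type, encoding) pair, in first-seen order.

    `pages` is a hypothetical log of (page_type, encoding) tuples, one per page
    written for a single column chunk.
    """
    counts = Counter(pages)  # Counter keeps insertion order on Python 3.7+
    return [PageEncodingStats(pt, enc, n) for (pt, enc), n in counts.items()]


def all_data_pages_dictionary_encoded(stats: List[PageEncodingStats]) -> bool:
    """True only if stats exist and every data page uses a dictionary encoding."""
    dictionary_encodings = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}
    data = [s for s in stats if s.page_type.startswith("DATA_PAGE")]
    return bool(data) and all(s.encoding in dictionary_encodings for s in data)


# Hypothetical column chunk: one dictionary page, two dictionary-encoded data
# pages, then one PLAIN data page after the writer fell back from the dictionary.
pages = [
    ("DICTIONARY_PAGE", "PLAIN_DICTIONARY"),
    ("DATA_PAGE", "PLAIN_DICTIONARY"),
    ("DATA_PAGE", "PLAIN_DICTIONARY"),
    ("DATA_PAGE", "PLAIN"),
]
stats = collect_encoding_stats(pages)
for entry in stats:
    print(entry)
print(all_data_pages_dictionary_encoded(stats))
```

With an empty list, as in the cpp-generated file, a reader cannot make this kind of per-page determination and has to fall back to the plain set of encodings, which may be why the encodings column is rendered differently for the two files.
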
Is this expected behaviour?

Thanks,
Omega

PS: In a separate branch of mine I corrected this (added encoding_stats to the metadata in parquet-cpp).