I opened https://issues.apache.org/jira/browse/PARQUET-1780. We would welcome a PR for this.
On Tue, Jan 28, 2020 at 12:33 PM Omega Accelr <omega.gam...@accelr.net> wrote:
>
> Hi all,
>
> I ran an experiment to compare the Parquet files generated by parquet-cpp
> and by Spark.
>
> I found that the Parquet file generated by parquet-cpp is missing some
> metadata.
>
> I used the command
> "hadoop jar parquet-cli-1.9.1-runtime.jar org.apache.parquet.cli.Main meta file_name.parquet"
> to read the generated Parquet files.
>
> Case 1: Enable dictionary (config("parquet.enable.dictionary", "true"))
>
> Metadata of the cpp-generated Parquet file:
>
> Properties: (none)
> Schema:
> message schema {
>   optional int32 c1;
>   optional int64 c2;
> }
>
> Row group 0: count: 1000000 5.66 B records start: 4 total: 5.397 MB
> --------------------------------------------------------------------------------
>     type   encodings  count    avg size  nulls  min / max
> c1  INT32  S RBR_     1000000  3.19 B    0      -999996 / 0
> c2  INT64  S RBR_     1000000  2.47 B    0      -999996 / 0
>
> Metadata of the Spark-generated Parquet file:
>
> Properties:
>   org.apache.spark.sql.parquet.row.metadata:
>   {"type":"struct","fields":[{"name":"c1","type":"integer","nullable":true,"metadata":{}},{"name":"c2","type":"long","nullable":true,"metadata":{}}]}
> Schema:
> message spark_schema {
>   optional int32 c1;
>   optional int64 c2;
> }
>
> Row group 0: count: 1000000 5.36 B records start: 4 total: 5.112 MB
> --------------------------------------------------------------------------------
>     type   encodings  count    avg size  nulls  min / max
> c1  INT32  S _ R      1000000  3.21 B    0      -999996 / 0
> c2  INT64  S _ R_ F   1000000  2.15 B    0      -999996 / 0
>
> This shows that the Properties and encodings metadata differ between the
> two Parquet files.
>
> I did an analysis to find the root cause.
>
> The encodings differ because the cpp-generated Parquet file lacks
> "list<PageEncodingStats> encoding_stats" in its metadata. In fact, this
> list is empty in the cpp-generated metadata, but it is not empty in the
> Spark-generated metadata (the Spark-generated metadata seems to be correct).
>
> Furthermore, I checked the Parquet documentation at
> https://parquet.apache.org/documentation/latest/ for the metadata
> specification. I couldn't find any information about encoding_stats
> there either.
>
> Is this expected behaviour?
>
> Thanks,
> Omega
>
> PS: In a separate branch of mine I corrected this (added encoding_stats to
> the metadata in parquet-cpp).
>
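
For anyone following the thread, here is a rough sketch of what populating encoding_stats involves. In parquet.thrift, each PageEncodingStats entry records a page_type, an encoding, and a count of pages with that combination. The Python below is only an illustration of that bookkeeping, not parquet-cpp code: EncodingStatsCollector, record_page, and build_example_stats are hypothetical names, the enums list only a subset of the Thrift values, and the page sequence is made up.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class PageType(Enum):
    # Subset of the PageType enum in parquet.thrift.
    DATA_PAGE = 0
    DICTIONARY_PAGE = 2


class Encoding(Enum):
    # Subset of the Encoding enum in parquet.thrift.
    PLAIN = 0
    PLAIN_DICTIONARY = 2


@dataclass(frozen=True)
class PageEncodingStats:
    # Mirrors the three fields of the Thrift struct of the same name.
    page_type: PageType
    encoding: Encoding
    count: int


class EncodingStatsCollector:
    """Hypothetical helper: counts pages per (page_type, encoding) in one column chunk."""

    def __init__(self):
        self._counts = Counter()

    def record_page(self, page_type, encoding):
        # Called once for every page the writer emits.
        self._counts[(page_type, encoding)] += 1

    def encoding_stats(self):
        # Emit one entry per distinct combination, in a stable order.
        ordered = sorted(
            self._counts.items(),
            key=lambda item: (item[0][0].value, item[0][1].value),
        )
        return [PageEncodingStats(pt, enc, n) for (pt, enc), n in ordered]


def build_example_stats():
    # Made-up column chunk: one dictionary page, three dictionary-encoded
    # data pages, then one plain data page after a dictionary fallback.
    collector = EncodingStatsCollector()
    collector.record_page(PageType.DICTIONARY_PAGE, Encoding.PLAIN_DICTIONARY)
    for _ in range(3):
        collector.record_page(PageType.DATA_PAGE, Encoding.PLAIN_DICTIONARY)
    collector.record_page(PageType.DATA_PAGE, Encoding.PLAIN)
    return collector.encoding_stats()


if __name__ == "__main__":
    for entry in build_example_stats():
        print(entry.page_type.name, entry.encoding.name, entry.count)
```

A writer that never records these counts leaves the list empty, which matches what the quoted message reports for the cpp-generated file.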