I opened https://issues.apache.org/jira/browse/PARQUET-1780. We would
welcome a PR for this.

On Tue, Jan 28, 2020 at 12:33 PM Omega Accelr <omega.gam...@accelr.net> wrote:
>
>
> Hi all,
>
> I ran an experiment to compare the Parquet files generated by parquet-cpp
> with those generated by standard Spark.
>
> I found that the Parquet file generated by parquet-cpp is missing some
> metadata information.
>
> I used the command "hadoop jar parquet-cli-1.9.1-runtime.jar
> org.apache.parquet.cli.Main meta file_name.parquet" to read the generated
> Parquet files.
>
>
> Case 1: Enable dictionary (config("parquet.enable.dictionary", "true"))
>
> Metadata of the parquet-cpp-generated Parquet file:
>
> Properties: (none)
> Schema:
> message schema {
>   optional int32 c1;
>   optional int64 c2;
> }
>
>
> Row group 0:  count: 1000000  5.66 B records  start: 4  total: 5.397 MB
> --------------------------------------------------------------------------------
>     type      encodings count     avg size   nulls   min / max
> c1  INT32     S RBR_    1000000   3.19 B     0       -999996 / 0
> c2  INT64     S RBR_    1000000   2.47 B     0       -999996 / 0
>
> Metadata of the Spark-generated Parquet file:
>
> Properties:
>   org.apache.spark.sql.parquet.row.metadata: 
> {"type":"struct","fields":[{"name":"c1","type":"integer","nullable":true,"metadata":{}},{"name":"c2","type":"long","nullable":true,"metadata":{}}]}
> Schema:
> message spark_schema {
>   optional int32 c1;
>   optional int64 c2;
> }
>
>
> Row group 0:  count: 1000000  5.36 B records  start: 4  total: 5.112 MB
> --------------------------------------------------------------------------------
>     type      encodings count     avg size   nulls   min / max
> c1  INT32     S _ R     1000000   3.21 B     0       -999996 / 0
> c2  INT64     S _ R_ F  1000000   2.15 B     0       -999996 / 0
>
>
> This shows that the Properties and encodings metadata differ between the two
> Parquet files.
>
> I did an analysis to find the root cause for this.
>
> The reason for the difference in encodings is that the cpp-generated Parquet
> file lacks "list<PageEncodingStats> encoding_stats" in its metadata.
> In fact, this list is empty in the cpp-generated metadata, but it is not
> empty in the Spark-generated metadata (it seems that the Spark-generated
> metadata is correct).
>
> Furthermore, I checked the Parquet documentation at
> https://parquet.apache.org/documentation/latest/ for the Parquet
> metadata specification. There, too, I couldn't find any information about
> encoding_stats in the metadata.
>
>
> Is this expected behaviour?
>
> Thanks,
> Omega
>
> PS: In a separate branch of mine I corrected this (added encoding_stats to
> the metadata in parquet-cpp).
>
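
For anyone picking up the PR: in parquet.thrift, each PageEncodingStats entry in encoding_stats carries a page_type, an encoding, and a count of pages. The sketch below only models that shape in plain Python to show what a writer would need to accumulate per column chunk. It is not the parquet-cpp or parquet-mr writer code; the helper name build_encoding_stats and the sample page sequence are hypothetical, chosen purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PageEncodingStats:
    """Mirrors the three fields of the Thrift struct of the same name."""
    page_type: str   # e.g. "DATA_PAGE" or "DICTIONARY_PAGE"
    encoding: str    # e.g. "PLAIN" or "RLE_DICTIONARY"
    count: int       # number of pages with this (page_type, encoding) pair


def build_encoding_stats(
    pages: Iterable[Tuple[str, str]],
) -> List[PageEncodingStats]:
    """Hypothetical helper: count pages per (page_type, encoding) pair."""
    counts = Counter(pages)
    return [
        PageEncodingStats(page_type, encoding, n)
        for (page_type, encoding), n in sorted(counts.items())
    ]


# Made-up column chunk: one dictionary page, two dictionary-encoded data
# pages, then one plain data page (as if the writer fell back to PLAIN).
pages = [
    ("DICTIONARY_PAGE", "PLAIN"),
    ("DATA_PAGE", "RLE_DICTIONARY"),
    ("DATA_PAGE", "RLE_DICTIONARY"),
    ("DATA_PAGE", "PLAIN"),
]
stats = build_encoding_stats(pages)
for entry in stats:
    print(entry)
```

A reader that finds this list empty cannot tell how many pages used each encoding, which may be why the encodings column differs between the two outputs quoted above.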
