bsikander opened a new issue, #23179: URL: https://github.com/apache/beam/issues/23179
### What happened?

Hello, I am facing a strange issue where my Parquet file sizes are exploding.

My environment:
* Beam SDK: 2.35.0
* parquet-mr: 1.12.0 (build db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20)
* Execution environment: Google Dataflow
* Compression: Snappy

I have a pipeline which writes data to GCS and BigQuery. I noticed a strange behavior where the Parquet files written to the GCS directory were very big: instead of totaling about 5GB as they normally should, they were around 35-40GB. I suspected that the write process might have failed, but that was not the case.

I ran a few tests using Spark instead of Dataflow/Beam:
* The record counts in BigQuery and GCS are the same.
* If I read the data from BigQuery and write it to GCS using Spark, the output size is as expected (5GB).
* If I read the data from the big GCS folder (35-40GB) and write it again using Spark, it stays the same size.
* If I read the data from the big GCS folder (35-40GB) and `repartition(10)` it, I still get the same 35-40GB.

The only difference I can see between the old and new pipelines is the upgrade of the Beam SDK from 2.20.0 to 2.35.0. I searched online and through the release notes but couldn't find anything. Is this a known issue? My only suspicion is that ParquetIO is doing something wrong. Any help would be much appreciated.

### Issue Priority

Priority: 1

### Issue Component

Component: io-java-parquet
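For reference, a minimal sketch of the kind of ParquetIO write path described above (the schema, input source, and output location are illustrative assumptions, not the reporter's actual code):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Illustrative Avro schema; the real pipeline's schema is not shown in the report.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"},"
            + "{\"name\":\"payload\",\"type\":\"string\"}]}");

    // Stand-in input; the actual pipeline presumably reads from a real source
    // (e.g. the same data that also lands in BigQuery).
    GenericRecord example =
        new GenericRecordBuilder(schema).set("id", "a").set("payload", "b").build();
    PCollection<GenericRecord> records =
        pipeline.apply(Create.of(example).withCoder(AvroCoder.of(schema)));

    // Snappy-compressed Parquet written to GCS via FileIO + ParquetIO.sink,
    // which is the write path the report suspects.
    records.apply(
        FileIO.<GenericRecord>write()
            .via(
                ParquetIO.sink(schema)
                    .withCompressionCodec(CompressionCodecName.SNAPPY))
            .to("gs://my-bucket/output/") // assumed output location
            .withSuffix(".parquet"));

    pipeline.run();
  }
}
```

One knob worth comparing between the 2.20.0 and 2.35.0 pipelines is the sink's row group size (`ParquetIO.Sink#withRowGroupSize`): smaller row groups tend to compress and dictionary-encode less effectively, which could plausibly contribute to larger output files. This is only a suggestion of where to look, not a confirmed cause.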
