bsikander opened a new issue, #23179: URL: https://github.com/apache/beam/issues/23179
### What happened?

Hello, I am facing a strange issue where my Parquet file sizes are exploding.

My environment:
* Beam SDK: 2.35.0
* parquet-mr: 1.12.0 (build db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20)
* Execution environment: Google Dataflow
* Compression: Snappy

I have a pipeline which writes data to GCS and BigQuery. I noticed a strange behavior where the Parquet files written to the GCS directory were very big: instead of totaling about 5GB as they normally should, they were around 35-40GB. I suspected that the write process might have failed, but that was not the case.

I ran a few tests using Spark instead of Dataflow/Beam:
* The record counts in BigQuery and GCS are the same.
* If I read the data from BigQuery and write it to GCS using Spark, the output size is as expected (5GB).
* If I read the data from the big GCS folder (35-40GB) and write it again using Spark, it stays the same size.
* If I read the data from the big GCS folder (35-40GB) and `repartition(10)` it, I still get the same 35-40GB.

The only difference I can see between the old and new pipelines is the upgrade of the Beam SDK from 2.20.0 to 2.35.0. I searched online and through the release notes but couldn't find anything. Is this a known issue? My only suspicion is that ParquetIO is doing something wrong. Any help would be much appreciated.

### Issue Priority

Priority: 1

### Issue Component

Component: io-java-parquet
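For reference, a minimal sketch of the kind of ParquetIO write path described above (the schema, input source, and output location are illustrative assumptions, not the reporter's actual code):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Illustrative Avro schema; the real pipeline's schema is not shown in the report.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"},"
            + "{\"name\":\"payload\",\"type\":\"string\"}]}");

    // Stand-in input; the actual pipeline presumably reads from a real source
    // (e.g. the same data that also lands in BigQuery).
    GenericRecord example =
        new GenericRecordBuilder(schema).set("id", "a").set("payload", "b").build();
    PCollection<GenericRecord> records =
        pipeline.apply(Create.of(example).withCoder(AvroCoder.of(schema)));

    // Snappy-compressed Parquet written to GCS via FileIO + ParquetIO.sink,
    // which is the write path the report suspects.
    records.apply(
        FileIO.<GenericRecord>write()
            .via(
                ParquetIO.sink(schema)
                    .withCompressionCodec(CompressionCodecName.SNAPPY))
            .to("gs://my-bucket/output/") // assumed output location
            .withSuffix(".parquet"));

    pipeline.run();
  }
}
```

One knob worth comparing between the 2.20.0 and 2.35.0 pipelines is the sink's row group size (`ParquetIO.Sink#withRowGroupSize`): smaller row groups tend to compress and dictionary-encode less effectively, which could plausibly contribute to larger output files. This is only a suggestion of where to look, not a confirmed cause.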
