[GitHub] [beam] mosche commented on issue #23179: [Bug]: Parquet size exploded for no apparent reason

GitBox Wed, 21 Sep 2022 03:18:07 -0700


mosche commented on issue #23179:
URL: https://github.com/apache/beam/issues/23179#issuecomment-1253499739


   @bsikander I don't think `DropFields.fields` triggers a shuffle; as far as 
I know, it's just a projection that changes the schema. Are you also applying 
a deduplication, e.g. `Deduplicate.values()`? If so, that would certainly 
explain what you are seeing.
   
   It's hard to give a general recommendation here without knowing the data 
and your read patterns. You can experiment with partitioning (GroupByKey) and 
ordering (SortValues). In any case, it will be a trade-off: more compute time 
when writing, in exchange for better storage efficiency and less compute time 
when reading (mostly because entire files / row groups can be pruned).
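
   To make the ordering point concrete, here is a toy sketch in plain Java 
(no Beam or Parquet involved). It counts runs of equal adjacent values in a 
column, which is only a rough proxy for how well that column would run-length 
encode. The class `RunLengthSketch`, its methods, and the four-value sample 
column are hypothetical names and data made up for illustration; this is not 
how Parquet's encoder is implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Toy illustration only: fewer runs of equal adjacent values means the column
// would run-length encode better. This does not model Parquet's real encoder.
final class RunLengthSketch {

  // Counts maximal runs of equal adjacent values in the column.
  static int countRuns(List<String> column) {
    if (column.isEmpty()) {
      return 0;
    }
    int runs = 1;
    for (int i = 1; i < column.size(); i++) {
      if (!column.get(i).equals(column.get(i - 1))) {
        runs++;
      }
    }
    return runs;
  }

  // Hypothetical low-cardinality column: 4 distinct values, `copies` of each.
  static List<String> sampleColumn(int copies) {
    List<String> column = new ArrayList<>();
    for (String value : new String[] {"DE", "FR", "NL", "US"}) {
      for (int i = 0; i < copies; i++) {
        column.add(value);
      }
    }
    return column;
  }

  public static void main(String[] args) {
    List<String> shuffled = sampleColumn(1000);
    // Stand-in for the arbitrary element order after a shuffle.
    Collections.shuffle(shuffled, new Random(42));

    List<String> sorted = new ArrayList<>(shuffled);
    Collections.sort(sorted);

    System.out.println("runs after shuffle: " + countRuns(shuffled));
    System.out.println("runs after sort:    " + countRuns(sorted));
  }
}
```

   With the sorted column every distinct value forms a single run, so the run 
count equals the number of distinct values. In a Beam pipeline, grouping and 
sorting before the write is one way to get a comparable ordering, at the cost 
of extra compute at write time.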


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

