mosche commented on issue #23179: URL: https://github.com/apache/beam/issues/23179#issuecomment-1253499739
@bsikander I don't think `DropFields.fields` would trigger a shuffle; as far as I know, it's just a projection that affects the schema. Are you also applying a deduplication, e.g. `Deduplicate.values()`? If so, that would certainly explain what you are seeing. It's hard to give a general recommendation here without knowing both the data and your read patterns. You can experiment with partitioning (`GroupByKey`) and ordering (`SortValues`). In any case, it will be a tradeoff between compute time when writing and storage efficiency plus compute time when reading (mostly due to pruning entire files / row groups).
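To make the pruning point a bit more concrete, here is a minimal plain-Java sketch (no Beam or Parquet involved). It assumes a reader that skips a row group whenever the group's min/max statistics fall outside the filter range. The class `RowGroupPruning`, its methods, and the data are hypothetical and only illustrate the effect of ordering values before they are written in fixed-size chunks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RowGroupPruning {

    // Splits values into consecutive chunks ("row groups") of groupSize and
    // counts the chunks whose [min, max] range overlaps the filter [lo, hi].
    // Chunks that do not overlap could be skipped by a stats-based reader.
    public static int groupsToRead(List<Integer> values, int groupSize, int lo, int hi) {
        int count = 0;
        for (int start = 0; start < values.size(); start += groupSize) {
            int end = Math.min(start + groupSize, values.size());
            List<Integer> group = values.subList(start, end);
            int min = Collections.min(group);
            int max = Collections.max(group);
            if (max >= lo && min <= hi) {
                count++;
            }
        }
        return count;
    }

    // Hypothetical data: the values 0..99 in a scrambled but deterministic order.
    public static List<Integer> scrambled() {
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            values.add((i * 37) % 100);
        }
        return values;
    }

    public static void main(String[] args) {
        List<Integer> unsorted = scrambled();
        List<Integer> sorted = new ArrayList<>(unsorted);
        Collections.sort(sorted);

        // Filter: 20 <= value <= 29, with row groups of 10 values each.
        System.out.println("row groups read (unsorted): " + groupsToRead(unsorted, 10, 20, 29));
        System.out.println("row groups read (sorted):   " + groupsToRead(sorted, 10, 20, 29));
    }
}
```

With sorted input, only the one chunk that covers the range has to be read, while the scrambled input spreads matching values over several chunks. The sort itself is the extra compute you pay at write time.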
