I would like to extend FileIO with the ability to specify a custom sharding function: https://issues.apache.org/jira/browse/BEAM-12493
I have two use-cases for this:

1. I need to generate shards that are compatible with Hive bucketing, so the shard assignment must be decided from data fields of the input element.

2. When running e.g. on Spark and the job encounters a failure that causes a loss of some data from previous stages, Spark reissues the necessary tasks in the necessary stages to recover that data. Because the default shard assignment function is random, some of the recomputed data will end up in different shards and cause duplicates in the final output.

Please let me know your thoughts in case you see a reason not to add such an improvement.

Thanks,
Jozef
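To illustrate the idea behind both use-cases: if the shard is derived deterministically from an element's bucketing fields instead of being chosen at random, the same element always lands in the same shard, so Spark recomputes reproduce identical files. A minimal sketch, independent of any Beam API; the `UserEvent` record and `assignShard` helper are hypothetical names for illustration only:

```java
import java.util.Objects;

public class DeterministicSharding {
    // Hypothetical element type; userId is the Hive-style bucketing field.
    record UserEvent(String userId, String payload) {}

    // Derive the shard from the element's data fields, not at random:
    // the same userId always maps to the same shard, which is both
    // Hive-bucketing compatible and stable across recomputes.
    static int assignShard(UserEvent e, int numShards) {
        // floorMod keeps the result non-negative even for negative hashes.
        return Math.floorMod(Objects.hashCode(e.userId()), numShards);
    }

    public static void main(String[] args) {
        UserEvent a = new UserEvent("alice", "first payload");
        UserEvent b = new UserEvent("alice", "second payload");
        // Same bucketing field, same shard, regardless of other fields.
        System.out.println(assignShard(a, 4) == assignShard(b, 4)); // prints "true"
    }
}
```

With today's random assignment, the two calls above could return different shards on a recompute; a pluggable function like this is what the proposal would allow users to supply.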