HeartSaVioR edited a comment on pull request #31355: URL: https://github.com/apache/spark/pull/31355#issuecomment-768107101
Actually, the proposal is more about letting a data source force a static number of partitions regardless of the output data. I see valid concerns about the drawbacks when a data source has no idea about the output data but can still provide a static number of partitions. Since the data source is blind to the amount of data, the value it gives is likely to be sub-optimal. But as I described in the PR description, there are some exceptional cases that require a "static" number of partitions. We could say "not supported on DSv2", but then we would never be able to deprecate DSv1, because DSv2 won't have exhaustive coverage.

Furthermore, it's also true that Spark is blind to the characteristics of the data source. Spark assumes the data source has unlimited bandwidth and that performance will always be better with higher parallelism. That would depend on the actual range Spark provides, but I wonder whether it always holds for arbitrary external storage if we assume an arbitrary number.

If an arbitrary degree of write parallelism is not ideal for the target data source, end users may try to repartition manually beforehand to stick with some static number of partitions, but Spark will repartition again if the data source requires a distribution/ordering, and ignore the adjustment. That's probably a different issue, since a human is giving the hint, but the point is that there can be cases with a desired number of partitions. A static number of partitions may not be a good idea for such a case either (though it might not be too bad if that's what the human intends), but a (lower limit, upper limit) range can be given heuristically, accounting for those characteristics.

Another case is that the data source may know better how to optimize parallelism relative to data volume. This requires Spark to provide some statistics back to the data source and let the data source decide the parallelism if possible.

Except for static partitioning, the others are just sketched ideas. Just my 2 cents.
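
To make the (lower limit, upper limit) idea a bit more concrete, here is a minimal sketch in Python. It is not Spark's API and not what this PR implements: `resolve_num_partitions`, `engine_choice`, `lower`, and `upper` are hypothetical names used only for illustration. The engine proposes a partition count, the data source optionally supplies bounds, and the static requirement is just the special case where both bounds are equal.

```python
from typing import Optional


def resolve_num_partitions(engine_choice: int,
                           lower: Optional[int] = None,
                           upper: Optional[int] = None) -> int:
    """Clamp the engine's proposed partition count into the range
    requested by the data source (hypothetical helper, not a Spark API)."""
    if engine_choice < 1:
        raise ValueError("engine_choice must be at least 1")
    if lower is not None and upper is not None and lower > upper:
        raise ValueError("lower must not exceed upper")

    result = engine_choice
    if lower is not None:
        # The source needs at least this many write tasks.
        result = max(result, lower)
    if upper is not None:
        # The source cannot usefully absorb more parallel writers than this.
        result = min(result, upper)
    return result


# Static requirement: both bounds are the same value.
static = resolve_num_partitions(200, lower=16, upper=16)

# Heuristic range: the engine's choice is kept when it already fits.
ranged = resolve_num_partitions(32, lower=8, upper=64)
```

With no bounds given, the engine's choice passes through unchanged, which would correspond to a data source that expresses no preference.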
