HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768107101
Actually, the proposal is more about letting the data source force a static number of partitions regardless of the output data. I see the valid concern about the drawback: the data source has no idea about the output data, yet it gets to dictate a static number of partitions. Since the data source is blind to the amount of data, the value it provides is likely to be sub-optimal. But as I described in the PR description, there are exceptional cases which require a "static" number of partitions. We could say "not supported in DSv2", but then we'll never be able to deprecate DSv1, because DSv2 wouldn't have exhaustive coverage.

Furthermore, it's also true that Spark is blind to the characteristics of the data source. Spark assumes the data source has unlimited bandwidth and that performance is always better with higher parallelism. Whether that holds depends on the actual range of parallelism Spark ends up with, but I doubt it's always true for arbitrary external storage if the number of partitions can be arbitrary.

If an arbitrary degree of write parallelism is not ideal for the target data source, end users may try to repartition manually beforehand to stick to some static number of partitions, but Spark will repartition again (ignoring the adjustment) if the data source requires a distribution/ordering. That's arguably a different issue, since a human is giving the hint, but the point is that there can be cases where a desired number of partitions exists. A static number of partitions may not be a good idea for such cases either, but a (lower limit, upper limit) range could be provided heuristically, accounting for the data source's characteristics; a rough sketch of both shapes follows below.
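To make this concrete, here is a minimal sketch of what such a write-side capability could look like, covering both the static count and the (lower limit, upper limit) variant. The trait and method names are hypothetical, purely for illustration; this is not the API proposed in this PR:

```scala
// Hypothetical sketch only: a DSv2 write-side capability letting a data source
// pin the number of write partitions regardless of the output data size.
trait SupportsStaticWritePartitioning {
  // Static partition count the sink wants; a non-positive value could mean
  // "no requirement, let Spark decide".
  def requiredNumPartitions(): Int
}

// Hypothetical variant of the (lower limit, upper limit) idea: the sink
// advertises a heuristic range based on its own characteristics, and Spark
// clamps its planned write parallelism into that range.
trait SupportsWritePartitioningRange {
  def minNumPartitions(): Int // lowest parallelism the sink tolerates well
  def maxNumPartitions(): Int // highest parallelism the sink tolerates well

  // One possible reconciliation: clamp Spark's planned value into the range,
  // so Spark still decides, but within bounds the sink can live with.
  final def resolveNumPartitions(planned: Int): Int =
    planned.max(minNumPartitions()).min(maxNumPartitions())
}
```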

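For reference, the manual workaround mentioned above looks like this in the standard DataFrame API (the table name is made up). This is the adjustment that gets discarded when the sink requires a distribution/ordering:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val df = spark.range(1000000L).toDF("id")

// The user pins write parallelism to 8 before the write. If the sink declares
// a required distribution/ordering, Spark inserts its own shuffle to satisfy
// it, and this manual repartitioning is effectively ignored.
df.repartition(8)
  .writeTo("catalog.db.target_table") // hypothetical table name
  .append()
```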