HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768107101


   Actually, the proposal is more about letting the data source force a 
static number of partitions regardless of the output data.
   
   I see valid concerns about the drawbacks when a data source has no idea of the 
output data but is still able to dictate a static number of partitions. Since the 
data source is blind to the amount of data, the value it provides is likely to be 
sub-optimal. But as I described in the PR description, there are exceptional cases 
that require a "static" number of partitions. We could say this is "not 
supported in DSv2", but then we'll never be able to deprecate DSv1, because DSv2 
wouldn't have exhaustive coverage.
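
   To make the shape of the proposal concrete, here is a minimal, hypothetical sketch. The names (`Write`, `FixedPartitionWrite`, `requiredNumPartitions()`) are illustrative only, not the actual Spark DSv2 API; the point is just that the writer declares a fixed partition count up front, independent of data size:

```java
// Hypothetical sketch: a DSv2-style write that forces a fixed number of
// output partitions. Names are illustrative, not the real Spark API.
interface Write {
    // Planner-facing hook: how many partitions the writer insists on,
    // regardless of the size of the data being written.
    int requiredNumPartitions();
}

class FixedPartitionWrite implements Write {
    private final int numPartitions;

    FixedPartitionWrite(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    @Override
    public int requiredNumPartitions() {
        return numPartitions;
    }
}

public class Demo {
    public static void main(String[] args) {
        Write write = new FixedPartitionWrite(8);
        // The planner would repartition the query output to exactly this count.
        System.out.println(write.requiredNumPartitions());
    }
}
```

   With such a hook, the planner could insert a repartition to exactly this count before the write, which is the "static" behavior discussed above.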
   
   Conversely, it's also true that Spark is blind to the characteristics of the 
data source. Spark assumes the data source has unlimited bandwidth and that 
performance is always better with higher parallelism. That depends on the actual 
range Spark would pick, but I doubt it's always true for an arbitrary external 
storage if we allow an arbitrary number.
   
   If it's not ideal for the target data source to accept an arbitrary degree of 
write parallelism, end users may try to repartition manually beforehand to 
stick with some static number of partitions, but Spark will repartition again 
if the data source requires a distribution/ordering, ignoring the adjustment. 
That's arguably a different issue, since a human is providing the hint, but the 
point is that there can be cases where a specific number of partitions is 
desired. A single static number may not be a good fit for such cases either, 
but a (lower limit, upper limit) pair could be given heuristically, accounting 
for the storage's characteristics.
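
   A minimal sketch of that (lower limit, upper limit) idea, assuming hypothetical names (`PartitionBounds`, `clamp`) and that Spark's own data-driven choice arrives as `desired`:

```java
// Hypothetical sketch: instead of one static number, the data source
// advertises a range, and the planner clamps its own choice into it.
class PartitionBounds {
    final int lower;
    final int upper;

    PartitionBounds(int lower, int upper) {
        if (lower < 1 || upper < lower) {
            throw new IllegalArgumentException("invalid bounds");
        }
        this.lower = lower;
        this.upper = upper;
    }

    // Spark proposes `desired` based on the data; the source's bounds win
    // when the proposal falls outside what the storage can handle well.
    int clamp(int desired) {
        return Math.max(lower, Math.min(upper, desired));
    }
}

public class BoundsDemo {
    public static void main(String[] args) {
        PartitionBounds bounds = new PartitionBounds(4, 32);
        System.out.println(bounds.clamp(200)); // storage caps parallelism at 32
        System.out.println(bounds.clamp(1));   // keep at least 4 writers busy
        System.out.println(bounds.clamp(16));  // within range: unchanged
    }
}
```

   This keeps Spark's cost-based choice in play while still letting the source express its bandwidth limits.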


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


