[GitHub] [spark] HeartSaVioR edited a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

GitBox Wed, 27 Jan 2021 00:12:05 -0800


HeartSaVioR edited a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768107101



   Actually, the proposal is more about letting the data source force a 
static number of partitions, regardless of the output data. 
   
   I see valid concerns about the drawbacks when a data source has no idea 
about the output data but is still able to provide a static number of 
partitions. As the data source is blind to the amount of data, the value it 
gives is likely to be sub-optimal. But as I described in the PR description, 
there are some exceptional cases where a "static" number of partitions is 
required. We might be able to say "not supported on DSv2", but then we'll 
never be able to deprecate DSv1, because DSv2 doesn't have exhaustive coverage.
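
   To make the static case concrete, here is a minimal sketch of the decision 
being discussed. It is plain Python, not Spark code: `WriteRequirement`, 
`required_num_partitions`, and `choose_num_partitions` are hypothetical names 
made up for illustration, and the convention that 0 means "no requirement" is 
an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteRequirement:
    """Hypothetical stand-in for what a data source reports for a write."""
    clustering_columns: tuple          # columns the output must be clustered by
    required_num_partitions: int = 0   # assumption: 0 means "let the engine decide"


def choose_num_partitions(req: WriteRequirement, engine_default: int) -> int:
    """Pick the partition count for the shuffle that precedes the write."""
    if req.required_num_partitions < 0:
        raise ValueError("required_num_partitions must be >= 0")
    if req.required_num_partitions > 0:
        # The data source insists on a static number, regardless of data volume.
        return req.required_num_partitions
    # No requirement: fall back to whatever the engine would use on its own.
    return engine_default
```

   With an engine default of 200, a source that asks for 8 partitions gets 8, 
and a source that asks for nothing gets 200.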
   
   Furthermore, it's also true that Spark is blind to the characteristics of 
the data source. Spark assumes the data source has unlimited bandwidth and 
that performance will always be better with higher parallelism. That would 
depend on the actual range Spark provides, but I wonder whether it always 
holds for arbitrary external storage if we assume an arbitrary number.
   
   If an arbitrary degree of write parallelism is not ideal for the target 
data source, end users may try to repartition manually beforehand to stick 
with some static number of partitions, but Spark will repartition again if 
the data source requires distribution/ordering, and will ignore the 
adjustment. That's probably a different issue, as a human is giving a hint, 
but the point is that there can be cases with a desired number of partitions. 
A static number of partitions may not be a good idea for such cases either 
(it might not be too bad if that's the human's intention), but a (lower 
limit, upper limit) range can be given heuristically, taking the 
characteristics of the data source into account.
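
   The (lower limit, upper limit) idea could look roughly like the sketch 
below. Again this is plain Python with hypothetical names 
(`clamp_num_partitions`, `lower`, `upper`), not an existing Spark API: the 
engine proposes a number and the data source's bounds only clamp it.

```python
def clamp_num_partitions(engine_choice: int, lower: int, upper: int) -> int:
    """Clamp the engine's proposed partition count into [lower, upper]."""
    if lower < 1 or upper < lower:
        raise ValueError("expected 1 <= lower <= upper")
    return max(lower, min(engine_choice, upper))
```

   Under this model, a static number of partitions is just the special case 
where `lower == upper`.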
   
   Another case is that the data source may know better how to optimize the 
relation between data volume and parallelism. This requires Spark to provide 
some statistics back to the data source and let the data source decide the 
parallelism, if possible.
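
   As a rough illustration of that last idea, assume the engine could hand 
the data source an estimated output size. The function below is a 
hypothetical sketch (the name, parameters, and sizing rule are all 
assumptions, not something Spark provides): the source sizes the write by a 
target volume per partition and caps it at the parallelism it can sustain.

```python
def partitions_from_stats(estimated_bytes: int,
                          target_bytes_per_partition: int,
                          max_parallelism: int) -> int:
    """Derive a partition count from an estimated output size."""
    if target_bytes_per_partition < 1 or max_parallelism < 1:
        raise ValueError("target size and max parallelism must be >= 1")
    # Ceiling division: partitions needed so none exceeds the target size.
    needed = -(-max(estimated_bytes, 0) // target_bytes_per_partition)
    # Always write at least one partition; never exceed what the sink tolerates.
    return max(1, min(needed, max_parallelism))
```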
   
   Except for static partitioning, the others are just sketched ideas. Just 
my 2 cents.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



