HeartSaVioR commented on pull request #29066: URL: https://github.com/apache/spark/pull/29066#issuecomment-735558685
From what I understand, the plan is to replace the existing `Distribution` and `ClusteredDistribution` in the read path (package `org.apache.spark.sql.connector.read.partitioning`) with the new additions here; otherwise the new additions would only be used for the write path, and the read and write paths would end up diverging. Do I understand correctly?

I'm not sure how many data sources in the ecosystem have already taken steps to follow the DSv2 changes in Spark 3.0.0. The existing interfaces are marked as `Evolving`, so they're allowed to change, though I'm not 100% sure it's also OK to unify them, which would change both the package and the signatures. It's probably worth raising a discussion on the dev@ mailing list.

Btw, the information on distribution and sort order is an optimization (avoiding a shuffle) on the read path, but on the write path it's not just an optimization: it can be a hard "requirement" for specific data sources, and without it those sources can't be implemented at all. I personally feel this should have higher priority than the read path, unfortunately.
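For context, here is a rough sketch of how the existing read-path interfaces fit together. The types below are simplified local stand-ins so the example compiles without Spark on the classpath (the real ones are `@Evolving` Java interfaces in `org.apache.spark.sql.connector.read.partitioning`), and `KeyHashPartitioning` with its `"user_id"` key is a hypothetical source-side implementation, not anything from this PR:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Simplified stand-in for the read-path Distribution marker interface.
interface Distribution {}

// Simplified stand-in: rows sharing values of these columns must be co-located.
final class ClusteredDistribution implements Distribution {
    final String[] clusteredColumns;
    ClusteredDistribution(String[] clusteredColumns) {
        this.clusteredColumns = clusteredColumns;
    }
}

// Simplified stand-in for the read-path Partitioning interface: Spark asks the
// source whether its physical layout already satisfies a required distribution;
// answering true lets Spark skip a shuffle on read.
interface Partitioning {
    int numPartitions();
    boolean satisfy(Distribution distribution);
}

// Hypothetical source that is hash-partitioned on a fixed key set. It can
// satisfy any clustering whose columns include all of its partition keys:
// two rows agreeing on those clustering columns then agree on the keys,
// so they hash to the same partition.
final class KeyHashPartitioning implements Partitioning {
    private final Set<String> keys;
    private final int parts;

    KeyHashPartitioning(Set<String> keys, int parts) {
        this.keys = keys;
        this.parts = parts;
    }

    public int numPartitions() { return parts; }

    public boolean satisfy(Distribution distribution) {
        if (distribution instanceof ClusteredDistribution) {
            String[] cols = ((ClusteredDistribution) distribution).clusteredColumns;
            return new HashSet<>(Arrays.asList(cols)).containsAll(keys);
        }
        return false;
    }
}
```

For example, a source partitioned on `user_id` satisfies a distribution clustered on `(user_id)` or `(user_id, country)`, but not one clustered on `country` alone. The point of the comment above is that on the read path a `false` answer merely costs a shuffle, whereas on the write path an unsatisfiable requirement can make the sink unimplementable.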
