HeartSaVioR commented on pull request #29066: URL: https://github.com/apache/spark/pull/29066#issuecomment-735558685
From what I understand, the plan is to replace the existing `Distribution` and `ClusteredDistribution` in the read path (package `org.apache.spark.sql.connector.read.partitioning`) with the new additions here; otherwise the new additions would only be used for the write path, and the read and write paths would end up diverging. Do I understand correctly?

I'm not sure how many data sources in the ecosystem have already taken steps to follow the DSv2 changes in Spark 3.0.0. The existing interfaces are marked as `Evolving`, so they're allowed to change, though I'm not 100% sure it's also OK to unify them, which would change both the package and the signatures. It's probably worth raising a discussion on the dev@ mailing list.

Btw, the information on distribution and sort order is an optimization (avoiding a shuffle) on the read path, but on the write path it's not just an optimization: it can be a hard "requirement" for specific data sources, and without it those sources can't be implemented at all. I personally feel this should have higher priority than the read path, unfortunately.
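For context, here is a rough sketch of how the existing read-path interfaces fit together. The types below are simplified local stand-ins so the example compiles without Spark on the classpath (the real ones are `@Evolving` Java interfaces in `org.apache.spark.sql.connector.read.partitioning`), and `KeyHashPartitioning` with its `"user_id"` key is a hypothetical source-side implementation, not anything from this PR:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Simplified stand-in for the read-path Distribution marker interface.
interface Distribution {}

// Simplified stand-in: rows sharing values of these columns must be co-located.
final class ClusteredDistribution implements Distribution {
    final String[] clusteredColumns;
    ClusteredDistribution(String[] clusteredColumns) {
        this.clusteredColumns = clusteredColumns;
    }
}

// Simplified stand-in for the read-path Partitioning interface: Spark asks the
// source whether its physical layout already satisfies a required distribution;
// answering true lets Spark skip a shuffle on read.
interface Partitioning {
    int numPartitions();
    boolean satisfy(Distribution distribution);
}

// Hypothetical source that is hash-partitioned on a fixed key set. It can
// satisfy any clustering whose columns include all of its partition keys:
// two rows agreeing on those clustering columns then agree on the keys,
// so they hash to the same partition.
final class KeyHashPartitioning implements Partitioning {
    private final Set<String> keys;
    private final int parts;

    KeyHashPartitioning(Set<String> keys, int parts) {
        this.keys = keys;
        this.parts = parts;
    }

    public int numPartitions() { return parts; }

    public boolean satisfy(Distribution distribution) {
        if (distribution instanceof ClusteredDistribution) {
            String[] cols = ((ClusteredDistribution) distribution).clusteredColumns;
            return new HashSet<>(Arrays.asList(cols)).containsAll(keys);
        }
        return false;
    }
}
```

For example, a source partitioned on `user_id` satisfies a distribution clustered on `(user_id)` or `(user_id, country)`, but not one clustered on `country` alone. The point of the comment above is that on the read path a `false` answer merely costs a shuffle, whereas on the write path an unsatisfiable requirement can make the sink unimplementable.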
