maryannxue commented on pull request #30494: URL: https://github.com/apache/spark/pull/30494#issuecomment-735522219
The output partitioning and sort order are entirely implementation-dependent. Say, an SMJ preserves partitioning from both sides, while a BHJ preserves it from only one side, and we could well have some join implementation that preserves partitioning from neither side. How can Spark possibly guarantee that? On the other hand, if the user asks for a certain partitioning, they should specify it by using `repartition`, and it's Spark's job to optimize it out if possible (see the sketch below the quoted reply).

On Sun, Nov 29, 2020, 9:05 PM Manu Zhang <[email protected]> wrote:

> To be clear, we can't make any guarantee about the internal shuffles, as
> they're not reliable (they depend on join strategy, aggregate strategy, etc.).
>
> When it's the final stage, it's crossing the boundary and interacting with
> the target table and downstream jobs. It's no longer internal, and I think
> we need to have some guarantee here.
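To make the `repartition` suggestion concrete, here is a minimal sketch, assuming hypothetical tables `t1`, `t2`, and `target` joined on a hypothetical column `key`:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("repartition-sketch").getOrCreate()

// The join's output partitioning depends on the strategy Spark picks
// (SMJ, BHJ, ...), so nothing about it should be relied upon downstream.
val joined = spark.table("t1").join(spark.table("t2"), "key")

// If a particular partitioning is needed at the final stage, request it
// explicitly; the optimizer can elide the shuffle if the data is already
// partitioned that way.
joined
  .repartition(col("key"))
  .write
  .mode("overwrite")
  .saveAsTable("target")
```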
