maryannxue commented on pull request #30494:
URL: https://github.com/apache/spark/pull/30494#issuecomment-735522219


   The output partitioning and sort order are entirely implementation
   dependent. For example, a sort-merge join (SMJ) preserves the partitioning
   of both sides, while a broadcast hash join (BHJ) preserves only one side's.
   And we could well have some other join implementation that does not
   preserve partitioning from either side. How could Spark possibly guarantee
   that?
   
   On the other hand, if the user wants a certain partitioning, they should
   ask for it explicitly with "repartition", and it is Spark's job to
   optimize it away when possible.
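   The distinction can be sketched without Spark itself. Below is a minimal
   plain-Python model (the `hash_partition` helper is hypothetical, not
   Spark's API) of what an explicit hash repartition guarantees: rows are
   assigned by hash(key) mod numPartitions, so all rows sharing a key land
   in the same partition, independently of whatever join implementation ran
   upstream.

```python
def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition by the hash of its key, the way a
    hash repartition pins the output layout. Hypothetical sketch, not
    Spark code."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        pid = hash(row[key]) % num_partitions
        parts[pid].append(row)
    return parts

rows = [{"k": i % 3, "v": i} for i in range(9)]
parts = hash_partition(rows, "k", 4)

# The invariant an explicit repartition buys: every key maps to exactly
# one partition.
for k in {r["k"] for r in rows}:
    owners = {pid for pid, p in enumerate(parts) for r in p if r["k"] == k}
    assert len(owners) == 1
```

   A join operator, by contrast, offers no such contract: whether its output
   happens to keep this layout depends on which physical implementation the
   planner picked.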
   
   On Sun, Nov 29, 2020, 9:05 PM Manu Zhang <[email protected]> wrote:
   
   > To be clear, we can't make any guarantee about the internal shuffles, as
   > it's not reliable (depends on join strategy, aggregate strategy, etc.)
   >
   > When it's the final stage, it's crossing the boundary and interacting with
   > the target table and downstream jobs. It's no longer internal, and I think
   > we need to have some guarantee here.
   >
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


