dongjoon-hyun edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658828708


   @hvanhovell, I agree with you on the following points.
   > AFAIK nested ordering can be ignored from a relation algebra point of 
view. 
   > Regarding the shuffles. ...
   
   However, the following is not reasonable. There is nothing wrong with the 
file formats. They are just consumers, and they perform better on sorted input 
because they are columnar, vectorized formats. I guess you assume that this 
behavior is specific to ORC, but I'm sure you will find that your customers 
rely on it in Parquet, too.
   > If you want sorted runs in ORC then you ought to fix is there, and not 
rely on some implicit system behavior.
   
   This is not an implicit system behavior in Apache Spark. Apache Spark has 
been working in the procedural way shown above. If we start to ignore a valid 
pattern that works in production, it becomes a huge regression.
   
   In short, `saving` to a file is a totally different and valid story. To 
optimize the final output files, the above pattern has been used in 
production by Apache Spark users for a long time. If an optimizer rule 
ignores this existing usage, it will obviously cause a large regression in 
terms of cost (for example, on S3).
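
   To illustrate why sorted input matters to a columnar writer, here is a 
minimal sketch. It is a toy run-length encoder in plain Python, not the actual 
ORC or Parquet encoder, and the column values and the `run_length_encode` 
helper are hypothetical names chosen only for this example.

```python
import random


def run_length_encode(values):
    """Collapse consecutive equal values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([v, 1])  # start a new run
    return runs


# Hypothetical low-cardinality column (assumption: three distinct values).
rng = random.Random(0)
column = [rng.choice(["KR", "NL", "US"]) for _ in range(10_000)]

unsorted_runs = run_length_encode(column)
sorted_runs = run_length_encode(sorted(column))

# The sorted column needs one run per distinct value; the unsorted one
# needs a new run every time the value changes between adjacent rows.
print(len(unsorted_runs), len(sorted_runs))
```

   With far fewer runs to store, the sorted column encodes into much less 
data, which is the kind of saving that shows up as smaller output files.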


