dongjoon-hyun edited a comment on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658828708
@hvanhovell, I agree with you on the following points.

> AFAIK nested ordering can be ignored from a relation algebra point of view.

> Regarding the shuffles. ...

However, the following is not reasonable. There is nothing wrong with the file formats. They are just consumers, and they perform better on a sorted input sequence because they are columnar, vectorized formats. I guess you assume that this behavior is specific to ORC, but I'm sure you will find that your customers rely on it in Parquet, too.

> If you want sorted runs in ORC then you ought to fix is there, and not rely on some implicit system behavior.

This is not an implicit system behavior in Apache Spark. Apache Spark has been working in the procedural way shown above. If we start to ignore a valid working pattern that is used in production, it becomes a huge regression.

In short, `saving` to a file is a totally different and valid story. To optimize the final output files, Apache Spark users have relied on the above pattern in production for a long time. If an optimizer rule ignores this existing usage, the result is a large regression in cost (for example, on S3).
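To illustrate why a columnar writer benefits from sorted input, here is a minimal sketch in plain Python. It is a toy model, not the ORC or Parquet encoder: it only counts how many runs a simple run-length encoder would have to emit for one column, before and after sorting. The column contents, the cardinality of 20, and the helper name `run_count` are assumptions made for the example.

```python
import random
from itertools import groupby


def run_count(values):
    """Count runs of equal adjacent values (entries a toy run-length encoder would emit)."""
    return sum(1 for _ in groupby(values))


random.seed(0)
# Hypothetical low-cardinality column: 10,000 rows drawn from 20 distinct keys.
column = [random.randrange(20) for _ in range(10_000)]

unsorted_runs = run_count(column)
sorted_runs = run_count(sorted(column))

# After sorting, each distinct value forms exactly one run.
print(f"runs before sort: {unsorted_runs}")
print(f"runs after sort:  {sorted_runs}")
```

After sorting, the run count collapses to the number of distinct values in the column, which is the kind of saving that is lost if an optimizer rule drops the sort before the write.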
