hvanhovell commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-660698710
@dongjoon-hyun I am a bit late with my response but here goes :)

> However, the following is not reasonable. There is nothing wrong in the file formats. They are just consumers and showing a better performance in a sorted input sequence because they are columnar vectorized format. I guess you assume that this is only a behavior at ORC. But, I'm sure that you can find your customers are relying on this in Parquet, too.

That is making the argument for explicitly organizing the data before the write, right? You are currently just lucky that the system accidentally produces a nice layout for you; 99% of our users won't be as lucky. The only way you can be sure is to add these things yourself.

> This is not an implicit system behavior in Apache Spark. Apache Spark has been working in the procedural ways as you see in the above. If we start to ignore the valid working pattern in the production, it becomes a huge regression.

> In short, saving to a file is a totally different and valid story. To optimize the final output files, the above pattern have been used in the production among Apache Spark users for a long time. If some optimizer rule ignores the existing usage, this ends up at a large regression in terms of the cost (for example, S3) obviously.

If you generalize the procedural argument, then we also should not do things like join reordering or swapping window operators. The whole point of a declarative system like Spark SQL is that you don't care about how the system executes a query, and that it has the freedom to move operations around to make execution more optimal.

Have you considered that your regression is someone else's speed-up? Sorting is not free, and if we can avoid it we should. There might be a large group of users who are adversely affected by spurious sorts in their queries (e.g. an order by in a view).

Finally, I do want to point out that there is no mechanism that captures this regression if it pops up again.
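To make the layout point concrete, here is a toy sketch in plain Python, not Spark code. It assumes a hypothetical column of 10,000 rows over 20 distinct keys and uses run-length encoding as a stand-in for what a columnar writer might do; neither the data nor the `run_length_encode` and `encoded_runs` helpers come from this thread. It only illustrates that the compact layout appears reliably when the sort is requested explicitly right before the write, rather than inherited by accident from upstream ordering.

```python
import random
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def encoded_runs(values):
    """Number of runs a toy columnar writer would have to store."""
    return len(run_length_encode(values))


# Hypothetical column: 10,000 rows drawn from 20 distinct keys (assumption).
rng = random.Random(0)
column = [rng.randrange(20) for _ in range(10_000)]

# Layout left to chance: rows arrive in whatever order upstream produced.
unsorted_runs = encoded_runs(column)

# Layout requested explicitly: sort right before the "write".
sorted_runs = encoded_runs(sorted(column))

print(f"runs without explicit sort: {unsorted_runs}")
print(f"runs with explicit sort:    {sorted_runs}")
```

With the explicit sort, the number of runs equals the number of distinct keys regardless of how the rows arrived. In Spark the analogous explicit step would presumably be a sort or `sortWithinPartitions` placed directly before the write, though the exact form depends on the job.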
