hvanhovell commented on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-660698710


   @dongjoon-hyun I am a bit late with my response but here goes :)
   
   > However, the following is not reasonable. There is nothing wrong in the 
file formats. They are just consumers and showing a better performance in a 
sorted input sequence because they are columnar vectorized format. I guess you 
assume that this is only a behavior at ORC. But, I'm sure that you can find 
your customers are relying on this in Parquet, too.
   
   That is making the argument for explicitly organizing the data before the 
write, right? You are currently just lucky that the system accidentally produces 
a nice layout for you; 99% of our users won't be as lucky. The only way you can 
be sure is to add that organization yourself.
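
   To make the layout point concrete, here is a toy sketch in plain Python 
(not Spark, and not the actual ORC or Parquet encoder). It assumes a made-up 
column of country codes and a hypothetical `run_length_encode` helper, and only 
illustrates why a column organized before the write tends to encode into far 
fewer runs than the same values in arrival order:

```python
from itertools import groupby
import random


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


# Hypothetical column: 10,000 rows drawn from 5 distinct country codes.
rng = random.Random(42)
column = [rng.choice(["BR", "DE", "IN", "NL", "US"]) for _ in range(10_000)]

# Same values, two layouts: arrival order versus explicitly sorted.
unsorted_runs = run_length_encode(column)
sorted_runs = run_length_encode(sorted(column))

print(len(sorted_runs), "runs when sorted")
print(len(unsorted_runs), "runs in arrival order")
```

   The sorted layout yields one run per distinct value, and it does so only 
because the sort was requested explicitly rather than left to chance.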
   
   > This is not an implicit system behavior in Apache Spark. Apache Spark has 
been working in the procedural ways as you see in the above. If we start to 
ignore the valid working pattern in the production, it becomes a huge 
regression.
   > In short, saving to a file is a totally different and valid story. To 
optimize the final output files, the above pattern have been used in the 
production among Apache Spark users for a long time. If some optimizer rule 
ignores the existing usage, this ends up at a large regression in terms of the 
cost (for example, S3) obviously.
   
   If you generalize the procedural argument, then we also should not do things 
like join reordering or swapping window operators. The whole point of a 
declarative system like Spark SQL is that you don't care about how the system 
executes a query, and that it has the freedom to move operations around to make 
execution more efficient.
   
   Have you considered that your regression is someone else's speed-up? Sorting 
is not free, and if we can avoid it we should. There might be a large group of 
users who are adversely affected by spurious sorts in their queries (e.g. an 
order by in a view).
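
   As a toy illustration of a spurious sort (plain Python standing in for a 
query plan; the `rows` data and the `count_by_key` helper are made up for this 
sketch): an order-insensitive aggregate returns the same answer with or without 
the upstream sort, so the sort is pure cost.

```python
from collections import Counter


def count_by_key(rows):
    """Order-insensitive aggregate: number of rows per key."""
    return Counter(key for key, _ in rows)


# Hypothetical (key, value) rows, as a view might expose them.
rows = [("b", 3), ("a", 1), ("c", 2), ("a", 5)]

# "View" with an ORDER BY, followed by an aggregate on top of it.
with_sort = count_by_key(sorted(rows))

# Same aggregate with the sort removed.
without_sort = count_by_key(rows)

print(with_sort == without_sort)
```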
   
   Finally, I do want to point out that there is no mechanism that would catch 
this regression if it pops up again.
   




