[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

GitBox Mon, 20 Jul 2020 14:41:03 -0700


dongjoon-hyun edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-661344352



   @hvanhovell, thank you for your feedback. The following looks a little 
wrong to me, because the above optimization was one of the recommendations we 
gave many Hortonworks customers to reduce their HDFS usage. I know of many 
production workloads that rely on it. I had almost forgotten about that, but 
it suddenly came back to me during this PR. (Sadly, after I had merged it.)
   >  You are currently just lucky that the system accidentally produces a nice 
layout for you; 99% of our users won't be as lucky. The only way you can be 
sure, is when you add these things yourself.
   
   I fully understand your point of view. However, I wonder whether you could 
persuade customers to waste their storage by generating 160x bigger files 
(the example from SPARK-32318). Do you think you can?
   
   ```
   -rw-r--r--   1 dongjoon  wheel  939 Jul 14 22:12 part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
   ```
   ```
   -rw-r--r--   1 dongjoon  wheel  150741 Jul 14 22:08 part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
   ```
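
As a rough illustration of why row order alone can change output size this much, here is a minimal sketch. It is not the SPARK-32318 reproduction: it assumes `zlib` as a stand-in for ORC with snappy, and the column (100,000 rows, 1,000 distinct values) and the helper `compressed_size` are made up for the example. It only shows that the same values compress far better when equal values sit next to each other.

```python
import random
import struct
import zlib


def compressed_size(values):
    """Return the zlib-compressed size in bytes of values packed as 16-bit ints."""
    raw = struct.pack(f"<{len(values)}H", *values)
    return len(zlib.compress(raw))


# Hypothetical column: 100,000 rows with 1,000 distinct values (an assumption,
# not data from the PR).
rng = random.Random(0)  # fixed seed so the run is repeatable
shuffled = [i % 1000 for i in range(100_000)]
rng.shuffle(shuffled)
ordered = sorted(shuffled)

shuffled_size = compressed_size(shuffled)
ordered_size = compressed_size(ordered)

print(f"shuffled: {shuffled_size} bytes, sorted: {ordered_size} bytes")
```

The exact ratio depends on the data and the codec, so the numbers printed here should not be compared directly with the 939 and 150741 byte files above.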
   
   
   For the following, SPARK-32318 added test coverage on master/3.0/2.4. Are 
you suggesting that is not enough? If so, we can add more.
   > Finally I do want to point out that there is no mechanism that captures 
this regression if it pops up again.


