GuoPhilipse commented on pull request #28568:
URL: https://github.com/apache/spark/pull/28568#issuecomment-630288947


   
   
   
   Thanks Bart for your advice.
   We are currently upgrading our execution engine (from MR to Spark). During this 
period, if Spark behaves differently from Hive or produces unexpected results, we 
switch back automatically until we find a way to handle that case. So my reasons for 
raising this PR are:
   
   
   1) We need the same SQL to run on both Hive and Spark during the migration, in 
case Spark fails or behaves unexpectedly. With a compatibility flag, as you said, we 
can migrate the workloads easily without changing users' SQL (see the sketch after 
these two points). By the way, we can change users' behavior after we migrate all 
tasks to Spark (maybe at Spark 3.0) so that they accept Spark's dialect and new 
features.
   
   
   2) Currently, if we do nothing, the migrated Hive tasks risk producing wrong data, 
which would be serious. If we instead block this case behind a legacy flag, then we 
need to detect the affected tasks ahead of the migration and notify our users, which 
also requires some additional work.
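   
   To make point 1) concrete, here is a minimal sketch of how such a compatibility 
flag could be used during the migration. The config name 
spark.sql.legacy.hiveCompatibleNumericToTimestampCast and the table/column names are 
purely hypothetical, only to illustrate that the same SQL text runs unchanged on both 
engines:
   
       -- Hypothetical session-level switch, set only for workloads migrated from Hive.
       -- (This flag name is an illustration, not an existing Spark configuration.)
       SET spark.sql.legacy.hiveCompatibleNumericToTimestampCast=true;
   
       -- The user's original Hive SQL stays exactly as written; only the flag
       -- decides whether the cast follows Hive semantics or Spark semantics.
       SELECT CAST(event_epoch AS TIMESTAMP) AS event_ts
       FROM user_events;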
   
   
   So my point is that #28534 is needed after the migration: once we put Hive aside, 
users have to use the functions correctly. But during the migration, I think many big 
companies would suffer less while embracing Spark if we had an elegant solution.
   
   
   That is my humble view. Thanks again for your advice.
   Best regards!
   
   At 2020-05-18 23:11:00, "Bart Samwel" <[email protected]> wrote:
   
   @cloud-fan @MaxGekk FYI
   
   There's also PR #28534, which tries to solve the same thing using explicit 
functions.
   
   To be honest, I'm not a big fan of using compatibility flags unless we're 
actually planning to deprecate the old behavior and change the behavior by 
default. Realistically, the next time we can change the default behavior is in 
Spark 4.0, which is likely to be several years out. And until then, throughout 
the Spark 3.x line, you may have Spark deployments out there where some query 
unexpectedly has different semantics than on other Spark deployments. The 
behavior change also doesn't stick if you then port that same workload over to 
other deployments of Spark, and given that it's not made explicit in the 
queries what they mean, and there are no errors, you may silently produce 
incorrect results after changing the deployment.
   
   If anything, I'd be in favor of:
   
   1) Doing the thing from PR #28534 (adding TIMESTAMP_FROM_SECONDS etc.).
   2) If we really care enough to change the behavior (and hence break existing 
workloads), using a legacy compatibility flag that disables this CAST by default and 
lets people choose between the (legacy) Spark behavior or the (new) Hive behavior, 
with strong advice in the "this is disabled" error message to migrate to the 
functions above instead and to leave the setting at "disabled". Then people can shoot 
themselves in the foot if they really want to, but then at least we told them so.
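   
   For concreteness, here is a minimal sketch of what the explicit-function route 
from #28534 looks like next to the implicit cast. TIMESTAMP_FROM_SECONDS is the name 
proposed in that PR (the final name may differ), and the table/column names are made 
up for the example:
   
       -- Implicit cast: the unit of the numeric value is invisible in the query,
       -- so its meaning depends on whichever semantics the deployment uses.
       SELECT CAST(login_epoch AS TIMESTAMP) AS login_ts FROM logins;
   
       -- Explicit function (as proposed in #28534): the query itself states that
       -- the number is seconds since the Unix epoch, so the intent survives a
       -- move to another Spark deployment.
       SELECT TIMESTAMP_FROM_SECONDS(login_epoch) AS login_ts FROM logins;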
   

