Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/20023
Thanks for your analysis @gatorsmile. Actually, the rule you specified for
Oracle is the one it uses when casting, rather than when performing arithmetic
operations. Yes, DB2 has rather different rules for defining the output type of
operations. Anyway, we can get a behavior practically identical to DB2's by
changing the value of `MINIMUM_ADJUSTED_SCALE` to 31. Therefore, instead of
using the configuration you pointed out, I'd propose a configuration for the
`MINIMUM_ADJUSTED_SCALE`: by changing it we can get both the Hive/SQLServer
behavior and the DB2 one. What do you think?
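    To make the proposal concrete, here is a minimal, self-contained sketch (not
the actual Spark code; the names are mine) of how the result type of a decimal
operation exceeding 38 digits of precision could be adjusted, and how the value
of `MINIMUM_ADJUSTED_SCALE` drives the outcome:

```scala
// Sketch only: illustrates the kind of adjustment discussed above,
// not the exact implementation in this PR.
object DecimalAdjustmentSketch {
  final case class DecType(precision: Int, scale: Int)

  val MaxPrecision = 38

  // Keep the integer digits of the "ideal" result type and trim the scale,
  // but never below the configured minimum adjusted scale.
  def adjust(precision: Int, scale: Int, minimumAdjustedScale: Int): DecType =
    if (precision <= MaxPrecision) DecType(precision, scale)
    else {
      val intDigits = precision - scale
      val minScale  = math.min(scale, minimumAdjustedScale)
      DecType(MaxPrecision, math.max(MaxPrecision - intDigits, minScale))
    }

  def main(args: Array[String]): Unit = {
    // decimal(38, 18) * decimal(38, 18) would ideally need decimal(77, 36)
    println(adjust(77, 36, minimumAdjustedScale = 6))   // DecType(38,6)
    println(adjust(77, 36, minimumAdjustedScale = 31))  // DecType(38,31)
  }
}
```

    With a minimum adjusted scale of 6 the fractional part is trimmed
aggressively (the Hive/SQLServer-like behavior), while with 31 the scale is
preserved up to 31 digits, which is practically what DB2 does.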
The reason why I am suggesting this is that my first concern is not Hive
compliance, but SQL standard compliance. Indeed, as you can see from the
summary, on point 1 there is no uniform behavior (but this is fine with the SQL
standard, since it leaves freedom there). But on point 2 we are the only ones
who are not compliant with the SQL standard, and having this behavior by default
doesn't look like the right thing to do IMHO. On point 3, only we and Hive are
not compliant, so I think that should be changed too. But in that case, we can't
use the same flag, because it would be inconsistent. What do you think?
I can understand and agree that losing precision looks scary. But to me
returning `NULL` is even scarier, if possible: indeed, `NULL` is what should be
returned if either of the two operands is `NULL`. Thus queries running on other
DBs which rely on this might return very bad results. For instance, let's think
of a report where we join a prices table and a sold_product table per country.
In this use case, we can assume that if the result is `NULL`, it means that no
product was sold in that country, and then coalesce the output of the
multiplication to 0. This would work well on any DB but Spark.
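    Here is a hypothetical version of that report (all table and column names
are made up, and it assumes a `SparkSession` called `spark`). If the
multiplication overflows and we return `NULL` instead of a less precise value,
the `COALESCE` silently turns the overflow into 0, which is indistinguishable
from a country with no sales:

```scala
// Hypothetical report: schema and names are illustrative only.
val report = spark.sql("""
  SELECT c.country,
         -- NULL here should only mean "no product sold in this country",
         -- but with the current behavior it can also mean "the multiplication
         -- overflowed", and both cases silently show up as 0
         COALESCE(p.price * s.quantity, 0) AS revenue
  FROM countries c
  LEFT JOIN sold_products s ON s.country_id = c.id
  LEFT JOIN prices p        ON p.product_id = s.product_id
""")
report.show()
```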
With my proposal of making the `MINIMUM_ADJUSTED_SCALE` tunable, each customer
can decide (query by query) how much precision loss they can tolerate. And if we
agree to change the point 3 behavior to the SQL standard, when it is not
possible to meet their requirements we throw an exception, giving them the
choice about what to do: allow more precision loss, change their input data
type, etc. This is the safer way IMHO.
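    Just to illustrate what I mean by "query by query" (the property name below
is purely hypothetical, it does not exist in Spark and is only meant to sketch
the proposal):

```scala
// Hypothetical property name, used only to sketch a tunable
// MINIMUM_ADJUSTED_SCALE; it is not an existing Spark configuration.
spark.sql("SET spark.sql.decimalOperations.minimumAdjustedScale = 31")  // DB2-like
// ... run the queries that need the larger scale ...
spark.sql("SET spark.sql.decimalOperations.minimumAdjustedScale = 6")   // Hive/SQLServer-like
```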
I would be happy to help improve the test cases. May I just kindly ask you
how you meant to do that? What would you like to see tested more? Would you like
me to add more test cases in the scope of this PR, or to open a new one for that?
Thank you for your time reading my long messages. I just want us to make the
best choice, and to give you all the elements I have so that we can decide for
the best together.
Thank you.