Thanks for your answer, Xiao. The point is that behaving like this is against
the SQL standard and also differs from Hive's behavior. I would then propose to
add a configuration flag to switch between the two behaviors: either being SQL
and Hive compliant, or behaving like now (as Hermann was suggesting in the PR).
Do we agree on this approach? If so, is there any way to read a configuration
property in the catalyst project?
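For reference, here is a minimal sketch of what such a flag could look like,
following the pattern of the entries already declared in
org.apache.spark.sql.internal.SQLConf (which lives in the catalyst module). The
flag name, doc text and default are made up for illustration, not an agreed API:

    // Hypothetical entry, to be added inside the SQLConf object in
    // sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala.
    // Name and default are illustrative only.
    val DECIMAL_OPERATIONS_SQL_COMPLIANT =
      buildConf("spark.sql.decimalOperations.sqlCompliant")
        .doc("When true, arithmetic on decimals rounds the result and fails on " +
          "overflow, as required by the SQL standard; when false, Spark keeps " +
          "the current behavior.")
        .booleanConf
        .createWithDefault(true)

    // Wherever a SQLConf instance is available (e.g. in an analysis rule), the
    // value can then be read with conf.getConf(DECIMAL_OPERATIONS_SQL_COMPLIANT).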

Thank you,
Marco

----- Original Message -----
From: "Xiao Li" <gatorsm...@gmail.com>
Sent: 21/12/2017 22:46
To: "Marco Gaido" <marcogaid...@gmail.com>
Cc: "Reynold Xin" <r...@databricks.com>; "dev@spark.apache.org"
<dev@spark.apache.org>
Subject: Re: Decimals

Losing precision is not acceptable to financial customers. Thus, instead of
returning NULL, I saw that DB2 issues the following error message:

SQL0802N  Arithmetic overflow or other arithmetic exception occurred.
SQLSTATE=22003

DB2 on z/OS has been used by most of the biggest banks and financial
institutions since the 1980s. Either issuing an exception (what DB2 does) or
returning NULL (what we are doing) looks fine to me. If they want to avoid
getting NULL or exceptions, users should manually add the round functions
themselves.
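For instance, manually adding the round functions could look like the following
hypothetical sketch (table and column names are made up, and a SparkSession
called spark is assumed). Reducing the operands' scale up front also reduces the
scale of the derived result type, leaving more digits for the integral part:

    // Rounding the operands to scale 6 before the multiplication means the
    // result scale is 12 rather than the sum of the original columns' scales,
    // so typical values no longer overflow to NULL.
    val products = spark.sql("""
      SELECT round(a, 6) * round(b, 6) AS product
      FROM   amounts
    """)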


Also see the DB2 for z/OS technote:
http://www-01.ibm.com/support/docview.wss?uid=swg21161024


2017-12-19 8:41 GMT-08:00 Marco Gaido <marcogaid...@gmail.com>:

Hello everybody,


I did some further research and I am now sharing my findings. I am sorry, it is
going to be quite a long e-mail, but I'd really appreciate some feedback when
you have time to read it.


Spark's current implementation of arithmetic operations on decimals was
"copied" from Hive. Thus, the initial goal of the implementation was to be
compliant with Hive, which itself aims to reproduce SQLServer's behavior.
Therefore I compared these 3 DBs and of course I checked the SQL ANSI standard
2011 (you can find it at
http://standards.iso.org/ittf/PubliclyAvailableStandards/c053681_ISO_IEC_9075-1_2011.zip)
and a late draft of the 2003 standard
(http://www.wiscorp.com/sql_2003_standard.zip). There are 3 main topics:
 1. how to determine the precision and scale of a result;
 2. how to behave when the result is a number which is not representable
    exactly with the result's precision and scale (i.e. it requires precision
    loss);
 3. how to behave when the result is out of the range of the values
    representable with the result's precision and scale (i.e. it is bigger than
    the biggest representable number or lower than the lowest one); both cases
    are illustrated right below.
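To make the difference between points 2 and 3 concrete, here is a small sketch
using plain java.math.BigDecimal (not Spark code), taking decimal(5, 2) as the
target result type:

    import java.math.{BigDecimal => JBigDecimal, RoundingMode}

    // Case 2: 1.00 / 3.00 = 0.333... cannot be represented exactly at scale 2,
    // so representing it requires precision loss (rounding or truncation).
    val third = new JBigDecimal("1.00")
      .divide(new JBigDecimal("3.00"), 2, RoundingMode.HALF_UP)
    // third == 0.33: some accuracy was lost

    // Case 3: 999.99 * 999.99 = 999980.0001 needs 6 integral digits, but
    // decimal(5, 2) can hold at most 999.99, so the result is out of range
    // no matter how it is rounded.
    val big = new JBigDecimal("999.99").multiply(new JBigDecimal("999.99"))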

Currently, Spark behaves as follows:
 1. it follows some rules taken from the initial Hive implementation;
 2. it returns NULL;
 3. it returns NULL (see the example below).
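As a concrete example of point 3 (assuming a SparkSession called spark), this
is essentially the case reported in SPARK-22036:

    // With the current rules, the result type of decimal(38,18) * decimal(38,18)
    // is decimal(38,36), which keeps only two integral digits: even 10 * 10 = 100
    // does not fit, and the result silently becomes NULL.
    val product = spark.sql(
      "SELECT CAST(10 AS DECIMAL(38, 18)) * CAST(10 AS DECIMAL(38, 18)) AS product")
    product.show()   // the single row contains null instead of 100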


The SQL ANSI standard is pretty clear about points 2 and 3, while it says
nearly nothing about point 1. I am citing SQL ANSI:2011, page 27:


If the result cannot be represented exactly in the result type, then whether it
is rounded or truncated is implementation-defined. An exception condition is
raised if the result is outside the range of numeric values of the result type,
or if the arithmetic operation is not defined for the operands.


Then, as you can see, Spark is not respecting the SQL standard for either point
2 or point 3. Someone might then argue that we need compatibility with Hive.
Then, let's take a look at it. Since Hive 2.2.0 (HIVE-15331), Hive's behavior
is:
 1. the rules have changed a bit, to reflect the SQLServer implementation as
    described in this blog post
    (https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/),
    sketched right below;
 2. it rounds the result;
 3. it returns NULL (HIVE-18291 is open to be compliant with the SQL ANSI
    standard and throw an exception).
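For reference, here is a minimal sketch (not Hive's or Spark's actual code) of
the derivation described in that blog post for multiplication, assuming a
maximum precision of 38 and a minimum preserved scale of 6:

    // Result type of decimal(p1, s1) * decimal(p2, s2): the exact result needs
    // decimal(p1 + p2 + 1, s1 + s2); when that exceeds 38 digits, the integral
    // digits are kept and the scale is reduced, but never below 6.
    object DecimalRules {
      val MaxPrecision = 38
      val MinAdjustedScale = 6

      def multiplyResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) =
        adjust(p1 + p2 + 1, s1 + s2)

      private def adjust(precision: Int, scale: Int): (Int, Int) =
        if (precision <= MaxPrecision) {
          (precision, scale)
        } else {
          val intDigits = precision - scale
          val adjustedScale =
            math.max(MaxPrecision - intDigits, math.min(scale, MinAdjustedScale))
          (MaxPrecision, adjustedScale)
        }
    }

    // e.g. DecimalRules.multiplyResultType(38, 10, 38, 10) == (38, 6):
    // the integral part wins over the scale, down to a minimum scale of 6.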

As far as the other DBs are concerned, there is little to say about Oracle and
Postgres, since they have a nearly infinite precision, thus it is also hard to
test the behavior in these conditions; but SQLServer has the same precision as
Hive and Spark. Thus, this is SQLServer's behavior:
 1. the rules should be the same as Hive's, as described in their post (tests
    about the behavior confirm this);
 2. it rounds the result;
 3. it throws an exception.
Therefore, since I think that Spark should be compliant with SQL ANSI (first)
and Hive, I propose the following changes:
 1. update the rules used to derive the result type so that they reflect Hive's
    new rules (which are SQLServer's);
 2. change Spark's behavior to round the result, as done by Hive and SQLServer
    and prescribed by the SQL standard;
 3. change Spark's behavior, introducing a configuration parameter to determine
    whether to return NULL or throw an exception (by default I propose to throw
    an exception in order to be compliant with the SQL standard, which IMHO is
    more important than being compliant with Hive); a sketch of the combined
    semantics of points 2 and 3 follows below.
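As a rough illustration of what I mean by points 2 and 3 combined, here is a
minimal sketch on plain java.math.BigDecimal; it is not a proposal for the
actual implementation, and the function name and flag are made up:

    import java.math.{BigDecimal => JBigDecimal, RoundingMode}

    // Fit an exact arithmetic result into the derived decimal(precision, scale):
    // first round to the target scale (point 2); if the value still does not fit,
    // either throw or return None/NULL depending on the new flag (point 3).
    def fitToType(value: JBigDecimal, precision: Int, scale: Int,
                  throwOnOverflow: Boolean): Option[JBigDecimal] = {
      val rounded = value.setScale(scale, RoundingMode.HALF_UP)
      if (rounded.precision() <= precision) {
        Some(rounded)                  // representable after rounding: only precision loss
      } else if (throwOnOverflow) {
        // SQL-standard behavior, like DB2 and SQLServer
        throw new ArithmeticException(
          s"Value $value does not fit in DECIMAL($precision, $scale)")
      } else {
        None                           // current Spark/Hive behavior: NULL
      }
    }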

For 1 and 2, I prepared a PR, which is
https://github.com/apache/spark/pull/20023. For 3, I'd love to get your
feedback in order to agree on what to do, and then I will eventually create a
PR which reflects what is decided here by the community.
I would really love to get your feedback either here or on the PR.


Thanks for your patience and your time reading this long email,
Best regards.
Marco




2017-12-13 9:08 GMT+01:00 Reynold Xin <r...@databricks.com>:

Responses inline



On Tue, Dec 12, 2017 at 2:54 AM, Marco Gaido <marcogaid...@gmail.com> wrote:

Hi all,


I have seen in these weeks that there are a lot of problems related to decimal
values (SPARK-22036, SPARK-22755, for instance). Some are related to historical
choices which I don't know about, thus please excuse me if I am saying dumb
things:


 - why are we interpreting literal constants in queries as Decimal and not as
Double? I think it is very unlikely that a user would enter a number which is
beyond Double precision.



Probably just to be consistent with some popular databases.
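For context, a quick hedged illustration of the current typing of literals
(assuming a SparkSession called spark; behavior as observed on recent Spark
versions):

    // An exact numeric literal is typed as a decimal with the minimal
    // precision/scale that holds it, while scientific notation gives a double.
    spark.sql("SELECT 1.5 AS x").schema("x").dataType    // DecimalType(2,1)
    spark.sql("SELECT 1.5e0 AS x").schema("x").dataType  // DoubleType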


 
 - why are we returning null in case of precision loss? Is this approach better
than just giving a result which might lose some accuracy?



The contract with decimal is that it should never lose precision (it is created
for financial reports, accounting, etc.). Returning null at least tells the
user that the data type can no longer support the precision required.


 


Thanks,

Marco
