Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

via GitHub Mon, 13 Nov 2023 17:44:58 -0800


beliefer commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1391880151



##########
docs/sql-ref-ansi-compliance.md:
##########
@@ -240,6 +240,25 @@ The least common type resolution is used to:
 - Derive the result type for expressions such as the case expression.
 - Derive the element, key, or value types for array and map constructors.
 Special rules are applied if the least common type resolves to FLOAT. With 
float type values, if any of the types is INT, BIGINT, or DECIMAL the least 
common type is pushed to DOUBLE to avoid potential loss of digits.
+
+Decimal type is a bit more complicated here, as it's not a simple type but has 
parameters: precision and scale.
+A `decimal(precision, scale)` means the value can has at most `precision - 
scale` digits in the integral part and `scale` digits in the fractional part.
+A least common type between decimal types should have enough digits in both 
integral and fractional parts to represent all values.
+More precisely, a least common type between `decimal(p1, s1)` and `decimal(p2, 
s2)` has the scale of `max(s1, s2)` and precision of `max(s1, s2) + max(p1 - 
s1, p2 - s2)`.
+However, decimal types in Spark has a maximum precision: 38. If the final 
decimal type needs more precision, we must do truncation.
+Since the digits in the integral part are more significant, Spark truncates 
the digits in the fractional part first. For example, `decimal(48, 20)` will be 
reduced to `decimal(38, 10)`.
+
+Note, arithmetic operations have special rules to calculate the least common 
type for decimal inputs:
+
+| Operation  | Result precision                         | Result scale        |
+|------------|------------------------------------------|---------------------|
+| e1 + e2    | max(s1, s2) + max(p1 - s1, p2 - s2) + 1  | max(s1, s2)         |
+| e1 - e2    | max(s1, s2) + max(p1 - s1, p2 - s2) + 1 | max(s1, s2)         |
+| e1 * e2    | p1 + p2 + 1                             | s1 + s2             |
+| e1 / e2    | p1 - s1 + s2 + max(6, s1 + p2 + 1)       | max(6, s1 + p2 + 1) |
+| e1 % e2    | min(p1 - s1, p2 - s2) + max(s1, s2)      | max(s1, s2)         |

Review Comment:
   AFAIK, the arithmetic operations did not strictly follow this rule.



##########
sql/api/src/main/scala/org/apache/spark/sql/types/DecimalType.scala:
##########
@@ -146,6 +146,18 @@ object DecimalType extends AbstractDataType {
     DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE))
   }
 
+  private[sql] def boundedPreferIntegralDigits(precision: Int, scale: Int): 
DecimalType = {
+    if (precision <= MAX_PRECISION) {
+      DecimalType(precision, scale)
+    } else {
+      // If we have to reduce the precision, we should retain the digits in 
the integral part first,
+      // as they are more significant to the value. Here we reduce the scale 
as well to drop the
+      // digits in the fractional part.

Review Comment:
   Looks good.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Reply via email to