beliefer opened a new pull request, #37536:
URL: https://github.com/apache/spark/pull/37536
### What changes were proposed in this pull request?
Extend Catalyst's type system with a new type, Int128Type, which represents an
Int128 value.
### Why are the changes needed?
Spark SQL today supports the Decimal data type. The implementation of Spark
Decimal holds a BigDecimal or Long value. Spark Decimal provides some operators
like +, -, *, /, % and so on. These operators rely heavily on the computational
power of BigDecimal or Long itself. For ease of understanding, take the + as an
example. The implementation shows below.
```
def + (that: Decimal): Decimal = {
  if (decimalVal.eq(null) && that.decimalVal.eq(null) && scale == that.scale) {
    Decimal(longVal + that.longVal, Math.max(precision, that.precision) + 1, scale)
  } else {
    Decimal(toBigDecimal.bigDecimal.add(that.toBigDecimal.bigDecimal))
  }
}
```
We can see that Long's + is called when both Spark Decimals hold Long values
with the same scale; otherwise, BigDecimal's add is called. The other operators
of Spark Decimal follow a similar pattern.
Furthermore, the code shown above calls Decimal.apply to construct a new
Spark Decimal instance, and the add operator of BigDecimal itself constructs a
new BigDecimal. So, calling + on a Spark Decimal that holds a Long value
constructs one new `Decimal` instance; otherwise, Spark constructs both a new
BigDecimal instance and a new `Decimal` instance.
From this rough analysis, we know:
1. The computational power of Spark Decimal may depend on BigDecimal.
2. The arithmetic operators of Spark Decimal create many new Decimal
instances and may also create many new BigDecimal instances.
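Point 2 can be demonstrated outside of Spark. The sketch below uses
java.math.BigDecimal directly to show that BigDecimal is immutable, so every
add allocates a fresh instance (the variable names are illustrative only):

```scala
import java.math.BigDecimal

// java.math.BigDecimal is immutable: add() always allocates and returns a
// new instance rather than mutating the receiver.
val a = new BigDecimal("1.10")
val b = new BigDecimal("2.20")
val sum = a.add(b)
// `sum` is a distinct object; `a` and `b` are left unchanged.
```

Under an aggregation like SUM, this per-operation allocation is what piles up
into garbage-collection pressure.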
If a large table has a column 'colA of Decimal type, executing SUM('colA)
will create a large number of Spark Decimal instances and BigDecimal
instances, which leads to frequent garbage collection.
In this new feature, we will introduce the Int128 type.
`Int128` is a high-performance data type, roughly 2X~10X more efficient than
Spark Decimal for typical operations. It uses a fixed, finite (128-bit)
precision and can handle values up to decimal(38, X). The implementation of
Int128 simply uses two Long values to represent the high and low 64 bits of
the 128-bit value. Int128 is lighter-weight than Spark `Decimal`, reducing
the cost of allocation and garbage collection.
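As a rough sketch of the two-Long representation (this is not the actual
implementation from the PR; the class name, carry handling, and addition-only
scope are illustrative assumptions), 128-bit addition needs only one
unsigned-overflow check on the low word:

```scala
// Illustrative sketch only: a 128-bit integer stored as two Long words.
final case class Int128(high: Long, low: Long) {
  def +(that: Int128): Int128 = {
    val newLow = low + that.low
    // If the unsigned sum wrapped around 2^64, carry 1 into the high word.
    val carry = if (java.lang.Long.compareUnsigned(newLow, low) < 0) 1L else 0L
    Int128(high + that.high + carry, newLow)
  }
}
```

Because both words are plain Longs, such a value can live in two registers or
two primitive fields, avoiding the per-operation heap allocation that
BigDecimal-backed arithmetic incurs.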
This is a starting PR. See more details in
https://issues.apache.org/jira/browse/SPARK-40097
### Does this PR introduce _any_ user-facing change?
No. This adds a new Int128 data type that is still in development and not
yet user-facing.
### How was this patch tested?
New test cases.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]