[ https://issues.apache.org/jira/browse/SPARK-40032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577942#comment-17577942 ]
jiaan.geng commented on SPARK-40032:
------------------------------------

We are editing the design doc.

> Support Decimal128 type
> -----------------------
>
>                 Key: SPARK-40032
>                 URL: https://issues.apache.org/jira/browse/SPARK-40032
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: jiaan.geng
>            Priority: Major
>         Attachments: Performance comparison between decimal128 and spark decimal benchmark.pdf
>
>
> Spark SQL today supports the DECIMAL data type. The Decimal implementation can hold a BigDecimal or a Long, and it provides operators such as +, -, *, and /.
> Taking + as an example, the implementation is shown below.
> {code:java}
> def + (that: Decimal): Decimal = {
>   if (decimalVal.eq(null) && that.decimalVal.eq(null) && scale == that.scale) {
>     Decimal(longVal + that.longVal, Math.max(precision, that.precision) + 1, scale)
>   } else {
>     Decimal(toBigDecimal.bigDecimal.add(that.toBigDecimal.bigDecimal))
>   }
> }
> {code}
> We can see that each branch performs an addition and then calls Decimal.apply. The add operator of BigDecimal constructs a new BigDecimal instance, and Decimal.apply calls new to construct a new Decimal instance that wraps that new BigDecimal instance.
> If a large table has a Decimal field called 'colA, executing SUM('colA) creates a large number of Decimal and BigDecimal instances, and these instances cause garbage collection to occur frequently.
> Decimal128 is a high-performance decimal, roughly 8x more efficient than Java BigDecimal for typical operations. It uses a fixed 128-bit representation and can handle up to decimal(38, x). It is also "mutable", so the contents of an existing object can be changed in place, which reduces the cost of new() and garbage collection.
> We have generated a benchmark report comparing Spark Decimal, Java BigDecimal, and Decimal128; please see the attachment.
> With this new feature, we will introduce DECIMAL128 to accelerate decimal calculation.
> h3. Milestone 1 – Spark Decimal equivalency (the new Decimal128 type meets or exceeds all functionality of the existing SQL Decimal):
> * Add a new DataType implementation for Decimal128.
> * Support Decimal128 in Dataset/UDF.
> * Decimal128 literals
> * Decimal128 arithmetic (e.g. Decimal128 + Decimal128, Decimal128 - Decimal)
> * Decimal or math functions/operators: POWER, LOG, ROUND, etc.
> * Cast to and from Decimal128: cast String/Decimal to Decimal128, cast Decimal128 to String (pretty printing)/Decimal, with SQL syntax to specify the types
> * Support sorting Decimal128.
> h3. Milestone 2 – Persistence:
> * Ability to create tables with Decimal128 columns
> * Ability to write to common file formats such as Parquet and JSON
> * INSERT, SELECT, UPDATE, MERGE
> * Discovery
> h3. Milestone 3 – Client support:
> * JDBC support
> * Hive Thrift server
> h3. Milestone 4 – PySpark and SparkR integration:
> * Python UDFs can take and return Decimal128
> * DataFrame support
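To make the mutability argument above concrete, here is a minimal sketch of what an in-place 128-bit addition could look like, with the unscaled value held in two Long fields. The class name Decimal128, the high/low field layout, and the plusInPlace method are illustrative assumptions for this comment, not the design in the attached doc; overflow beyond 128 bits and operands with different scales are not handled.

{code:java}
// Sketch only: a mutable 128-bit unscaled value stored as two Longs.
final class Decimal128(private var high: Long, private var low: Long, val scale: Int) {

  // Adds `that` into `this` without allocating a new object, which is the
  // property that reduces new() calls and GC pressure compared with
  // java.math.BigDecimal, whose add() always returns a fresh instance.
  def plusInPlace(that: Decimal128): Decimal128 = {
    require(scale == that.scale, "this sketch assumes both operands share the same scale")
    val newLow = low + that.low
    // Carry out of the unsigned low 64 bits: the wrapped sum is smaller
    // (unsigned) than the original operand exactly when a carry occurred.
    val carry = if (java.lang.Long.compareUnsigned(newLow, low) < 0) 1L else 0L
    high = high + that.high + carry
    low = newLow
    this
  }

  override def toString: String = s"Decimal128(high=$high, low=$low, scale=$scale)"
}

object Decimal128Sketch {
  def main(args: Array[String]): Unit = {
    // The accumulator object is reused instead of being replaced, which is how
    // a SUM over a Decimal128 column could avoid the per-row allocations
    // described in the issue above.
    val acc = new Decimal128(0L, 0L, 2)
    acc.plusInPlace(new Decimal128(0L, 123456789L, 2)) // unscaled 123456789, i.e. 1234567.89
    acc.plusInPlace(new Decimal128(0L, 1L, 2))         // unscaled 123456790, i.e. 1234567.90
    println(acc)
  }
}
{code}

A real implementation would also need overflow detection, scale alignment, and the full arithmetic and cast surface listed in Milestone 1; the point illustrated here is only the allocation-free inner loop.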