zhengruifeng opened a new pull request, #52637:
URL: https://github.com/apache/spark/pull/52637
### What changes were proposed in this pull request?
Fix decimal rescaling in createDataFrame
### Why are the changes needed?
This query works in Spark Classic but fails in Spark Connect:
classic
```py
In [1]: import decimal
...: spark.createDataFrame([(decimal.Decimal(1.234),)], ["d"]).show()
+--------------------+
|                   d|
+--------------------+
|1.233999999999999986|
+--------------------+
```
connect
```py
In [1]: import decimal
...: spark.createDataFrame([(decimal.Decimal(1.234),)], ["d"])
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
Cell In[1], line 2
1 import decimal
----> 2 spark.createDataFrame([(decimal.Decimal(1.234),)], ["d"])
File ~/spark/python/pyspark/sql/connect/session.py:740, in SparkSession.createDataFrame(self, data, schema, samplingRatio, verifySchema)
    733 from pyspark.sql.conversion import (
    734     LocalDataToArrowConversion,
    735 )
    737 # Spark Connect will try its best to build the Arrow table with the
    738 # inferred schema in the client side, and then rename the columns and
    739 # cast the datatypes in the server side.
--> 740 _table = LocalDataToArrowConversion.convert(_data, _schema, prefers_large_types)
...
ArrowInvalid: Rescaling Decimal value would cause data loss
```
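The root cause is that the Arrow conversion refuses a lossy rescale: `decimal.Decimal(1.234)` holds the exact binary value of the float, which has far more than 18 fractional digits, so casting it to `decimal128(38, 18)` would drop digits and Arrow raises instead.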
```py
In [8]: d = decimal.Decimal(1.234)
In [9]: d
Out[9]: Decimal('1.2339999999999999857891452847979962825775146484375')
In [10]: pa.scalar(d).cast(pa.decimal128(38, 18))
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
Cell In[10], line 1
----> 1 pa.scalar(d).cast(pa.decimal128(38, 18))
...
ArrowInvalid: Rescaling Decimal value would cause data loss
```
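For illustration only (this is not the patch itself), here is a stdlib-only sketch of why the rescale is lossy and of one way a value could be rounded to scale 18 before conversion. `fractional_digits` and `fit_to_scale` are hypothetical helper names, and the choice of `ROUND_HALF_UP` and of precision 38 is an assumption made for this example.

```py
import decimal


def fractional_digits(d: decimal.Decimal) -> int:
    """Return the number of digits after the decimal point (finite values only)."""
    exponent = d.as_tuple().exponent
    return max(0, -exponent)


def fit_to_scale(
    d: decimal.Decimal,
    scale: int = 18,
    rounding: str = decimal.ROUND_HALF_UP,  # assumed rounding mode
) -> decimal.Decimal:
    """Round d so that it has exactly `scale` fractional digits."""
    # Decimal((sign, digits, exponent)) builds 1E-scale, the quantization target.
    target = decimal.Decimal((0, (1,), -scale))
    # Precision 38 mirrors decimal128(38, 18); values needing more digits raise.
    return d.quantize(target, rounding=rounding, context=decimal.Context(prec=38))


d = decimal.Decimal(1.234)          # exact binary expansion of the float 1.234
rounded = fit_to_scale(d)

print(fractional_digits(d) > 18)    # True: too many digits for scale 18
print(fractional_digits(rounded))   # 18
print(rounded)                      # 1.233999999999999986
```

The rounded value matches what classic mode prints above; whether the actual fix rounds on the client or handles the cast differently is not stated here.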
### Does this PR introduce _any_ user-facing change?
Yes, the query above works in Spark Connect after this PR.
### How was this patch tested?
Added a test.
### Was this patch authored or co-authored using generative AI tooling?
no