zhengruifeng opened a new pull request, #52637:
URL: https://github.com/apache/spark/pull/52637

   ### What changes were proposed in this pull request?
   Fix decimal rescaling in `createDataFrame` so that it no longer fails in Spark Connect.
   
   ### Why are the changes needed?
   This query works in Spark Classic but fails in Spark Connect.
   
   Classic:
   ```py
   In [1]: import decimal
      ...: spark.createDataFrame([(decimal.Decimal(1.234),)], ["d"]).show()
   +--------------------+
   |                   d|
   +--------------------+
   |1.233999999999999986|
   +--------------------+
   ```
   
   Connect:
   ```py
   In [1]: import decimal
      ...: spark.createDataFrame([(decimal.Decimal(1.234),)], ["d"])
   ---------------------------------------------------------------------------
   ArrowInvalid                              Traceback (most recent call last)
   Cell In[1], line 2
         1 import decimal
   ----> 2 spark.createDataFrame([(decimal.Decimal(1.234),)], ["d"])
   
   File ~/spark/python/pyspark/sql/connect/session.py:740, in SparkSession.createDataFrame(self, data, schema, samplingRatio, verifySchema)
       733     from pyspark.sql.conversion import (
       734         LocalDataToArrowConversion,
       735     )
       737     # Spark Connect will try its best to build the Arrow table with the
       738     # inferred schema in the client side, and then rename the columns and
       739     # cast the datatypes in the server side.
   --> 740     _table = LocalDataToArrowConversion.convert(_data, _schema, prefers_large_types)
   
   ...
   
   ArrowInvalid: Rescaling Decimal value would cause data loss
   ```
   
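   The root cause is that the Arrow conversion refuses to rescale a value when doing so would lose digits. `decimal.Decimal(1.234)` holds the exact binary expansion of the float, which has far more than 18 fractional digits, so the cast to `decimal128(38, 18)` raises: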
   ```
   In [8]: d = decimal.Decimal(1.234)
   
   In [9]: d
   Out[9]: Decimal('1.2339999999999999857891452847979962825775146484375')
   
   In [10]: pa.scalar(d).cast(pa.decimal128(38, 18))
   ---------------------------------------------------------------------------
   ArrowInvalid                              Traceback (most recent call last)
   Cell In[10], line 1
   ----> 1 pa.scalar(d).cast(pa.decimal128(38, 18))
   
   ...
   
   ArrowInvalid: Rescaling Decimal value would cause data loss
   ```
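
One way to sidestep the error is to round the Python value to the target scale before Arrow casts it. The following is a minimal sketch under that assumption, not necessarily how this PR implements the fix: the helper name `rescale_decimal` is hypothetical, and `ROUND_HALF_UP` is assumed because it reproduces the Classic output shown earlier.

```python
import decimal


def rescale_decimal(value: decimal.Decimal, scale: int = 18) -> decimal.Decimal:
    """Round `value` to `scale` fractional digits (hypothetical helper)."""
    # 38 digits of precision mirrors decimal128(38, 18); the default
    # context (28 digits) could reject larger values during quantize.
    ctx = decimal.Context(prec=38)
    # Decimal(1).scaleb(-18) is Decimal('1E-18'), i.e. the target exponent.
    quantum = decimal.Decimal(1).scaleb(-scale)
    return value.quantize(quantum, rounding=decimal.ROUND_HALF_UP, context=ctx)


d = decimal.Decimal(1.234)  # exact binary expansion of the float
print(rescale_decimal(d))   # 1.233999999999999986
```

After rounding, the value has exactly 18 fractional digits, so a later cast to `decimal128(38, 18)` no longer has to drop any.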
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, the query above works in Spark Connect after this PR.
   
   
   ### How was this patch tested?
   Added a test.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

