Github user a10y commented on a diff in the pull request:
https://github.com/apache/spark/pull/18945#discussion_r139450187
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1810,17 +1810,20 @@ def _to_scala_map(sc, jm):
return sc._jvm.PythonUtils.toScalaMap(jm)
-def _to_corrected_pandas_type(dt):
+def _to_corrected_pandas_type(field, strict=True):
"""
When converting Spark SQL records to Pandas DataFrame, the inferred
data type may be wrong.
This method gets the corrected data type for Pandas if that type may
be inferred incorrectly.
"""
import numpy as np
+ dt = field.dataType
if type(dt) == ByteType:
return np.int8
elif type(dt) == ShortType:
return np.int16
elif type(dt) == IntegerType:
+ if not strict and field.nullable:
+ return np.float32
--- End diff ---
Is loss of precision a concern here? float32 has only a 24-bit
significand, so any integer in the original dataset with magnitude
above 2**24 will be rounded to the nearest representable float32
value, if I'm not mistaken.
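
A minimal illustration of the effect with plain NumPy (not code from
the PR; this just shows the rounding the cast would introduce):

    import numpy as np

    x = 2**24 + 1          # 16777217, exactly representable as int32
    y = np.float32(x)      # cast to a 24-bit-significand float
    print(y)               # 16777216.0, rounded to the nearest float32
    print(int(y) == x)     # False: the original value cannot be recovered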
---