ianmcook commented on code in PR #46529:
URL: https://github.com/apache/spark/pull/46529#discussion_r1618145696
##########
python/pyspark/sql/pandas/types.py:
##########
@@ -86,30 +110,58 @@ def to_arrow_type(dt: DataType) -> "pa.DataType":
arrow_type = pa.binary()
elif type(dt) == DateType:
arrow_type = pa.date32()
- elif type(dt) == TimestampType:
+ elif type(dt) == TimestampType and timestamp_utc:
# Timestamps should be in UTC, JVM Arrow timestamps require a timezone to be read
arrow_type = pa.timestamp("us", tz="UTC")
+ elif type(dt) == TimestampType:
+ arrow_type = pa.timestamp("us", tz=None)
elif type(dt) == TimestampNTZType:
arrow_type = pa.timestamp("us", tz=None)
elif type(dt) == DayTimeIntervalType:
arrow_type = pa.duration("us")
elif type(dt) == ArrayType:
- field = pa.field("element", to_arrow_type(dt.elementType), nullable=dt.containsNull)
+ field = pa.field(
+ "element",
+ to_arrow_type(dt.elementType, error_on_duplicated_field_names_in_struct, timestamp_utc),
+ nullable=dt.containsNull,
+ )
arrow_type = pa.list_(field)
elif type(dt) == MapType:
- key_field = pa.field("key", to_arrow_type(dt.keyType), nullable=False)
- value_field = pa.field("value", to_arrow_type(dt.valueType), nullable=dt.valueContainsNull)
+ key_field = pa.field(
+ "key",
+ to_arrow_type(dt.keyType, error_on_duplicated_field_names_in_struct, timestamp_utc),
+ nullable=False,
+ )
+ value_field = pa.field(
+ "value",
+ to_arrow_type(dt.valueType, error_on_duplicated_field_names_in_struct, timestamp_utc),
+ nullable=dt.valueContainsNull,
+ )
arrow_type = pa.map_(key_field, value_field)
elif type(dt) == StructType:
+ field_names = dt.names
+ if error_on_duplicated_field_names_in_struct and len(set(field_names)) != len(field_names):
Review Comment:
For Classic, it worked fine before. Duplicated field names in a struct
already raised the correct PySpark error.
But for Connect, this condition was unhandled by Spark before, and was only
caught by PyArrow.
For example, see the last few lines of the test
`test_toArrow_duplicate_field_names`, which exercise this condition. Without
enforcing that struct field names are unique, this is the error raised in that
test:
```
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong
order: Input fields: struct<x_0: string, x_1: int32> output fields: struct<x:
string, x: int32>
```
With the code added here to enforce it, we see the correct error in Connect,
the same as in Classic:
```
pyspark.errors.exceptions.base.UnsupportedOperationException:
[DUPLICATED_FIELD_NAME_IN_ARROW_STRUCT] Duplicated field names in Arrow Struct
are not allowed, got ['x', 'x']
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]