[GitHub] [spark] physinet commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

GitBox Tue, 17 May 2022 08:44:19 -0700


physinet commented on code in PR #36545:
URL: https://github.com/apache/spark/pull/36545#discussion_r874983582



##########
python/pyspark/sql/session.py:
##########
@@ -570,10 +570,20 @@ def _inferSchemaFromList(
         if not data:
             raise ValueError("can not infer schema from empty dataset")
         infer_dict_as_struct = self._jconf.inferDictAsStruct()
+        infer_array_from_first_element = self._jconf.legacyInferArrayTypeFromFirstElement()

Review Comment:
   Previously, a Python list could contain mixed types, as long as every 
element could be cast to the element type of the schema inferred from the 
first element:
   ```python
   >>> df = spark.createDataFrame([{"a": ["1", 2]}])
   >>> df.show()
   +------+
   |     a|
   +------+
   |[1, 2]|
   +------+
   >>> df.schema
   StructType(List(StructField(a,ArrayType(StringType,true),true)))
   ```
   With this change, creating the same DataFrame raises an error:
   ```python
   >>> df = spark.createDataFrame([{"a": ["1", 2]}])
   ...
   TypeError: Unable to infer the type of the field a.
   ```
   Because this breaks code that worked before, I think it makes sense to make 
the inference behavior configurable.
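
To make the difference between the two strategies concrete, here is a minimal pure-Python sketch. It is a toy model, not PySpark's code: `infer_element_type` and its `from_first_element` flag are hypothetical names, and Python type names stand in for Spark data types. The real inference also handles nulls, nested types, and type merging, none of which is modeled here.

```python
from typing import Any, List


def infer_element_type(values: List[Any], from_first_element: bool) -> str:
    """Toy model of array element type inference; not PySpark's actual code."""
    if not values:
        raise ValueError("can not infer element type from an empty list")
    if from_first_element:
        # Legacy-style: only the first element decides the type.
        return type(values[0]).__name__
    # New-style: every element is inspected and all must agree on one type.
    names = sorted({type(v).__name__ for v in values})
    if len(names) > 1:
        raise TypeError(f"Unable to infer a single element type from {names}")
    return names[0]


mixed = ["1", 2]
legacy_result = infer_element_type(mixed, from_first_element=True)
try:
    strict_result = infer_element_type(mixed, from_first_element=False)
except TypeError as exc:
    strict_result = f"TypeError: {exc}"

print(legacy_result)
print(strict_result)
```

With the flag set, the mixed list is accepted and typed after its first element. Without it, the same list is rejected, which mirrors the regression described above.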



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


