[ https://issues.apache.org/jira/browse/SPARK-39168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-39168:
------------------------------------

    Assignee:     (was: Apache Spark)

> Consider all values in a python list when inferring schema
> ----------------------------------------------------------
>
>                 Key: SPARK-39168
>                 URL: https://issues.apache.org/jira/browse/SPARK-39168
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Brian Schaefer
>            Priority: Major
>
> Schema inference fails on the following case:
> {code:python}
> >>> data = [{"a": [1, None], "b": [None, 2]}]
> >>> spark.createDataFrame(data)
> ValueError: Some of types cannot be determined after inferring
> {code}
> This is because only the first value in the array is used to infer the
> element type for the array:
> [https://github.com/apache/spark/blob/b63674ea5f746306a96ab8c39c23a230a6cb9566/python/pyspark/sql/types.py#L1260].
> The element type of the "b" array is inferred as {{NullType}} but I think it
> makes sense to infer the element type as {{LongType}}.
> One approach to address the above would be to infer the type from the first
> non-null value in the array. However, consider a case with structs:
> {code:python}
> >>> spark.conf.set("spark.sql.pyspark.inferNestedDictAsStruct.enabled", True)
> >>> data = [{"a": [{"b": 1}, {"c": 2}]}]
> >>> spark.createDataFrame(data).schema
> StructType([StructField('a', ArrayType(StructType([StructField('b',
> LongType(), True)]), True), True)])
> {code}
> The element type of the "a" array is inferred as a struct with one field,
> "b". However, it would be convenient to infer the element type as a struct
> with both fields "b" and "c".
> Omitted fields from each dictionary would become null values in each
> struct:
> {code:java}
> +----------------------+
> |                     a|
> +----------------------+
> |[{1, null}, {null, 2}]|
> +----------------------+
> {code}
> To support both of these cases, the type of each array element could be
> inferred, and those types could be merged, similar to the approach
> [here|https://github.com/apache/spark/blob/b63674ea5f746306a96ab8c39c23a230a6cb9566/python/pyspark/sql/session.py#L574-L576].
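
For illustration only, the "infer every element, then merge" idea from the issue could look roughly like the sketch below. It is plain Python and does not use PySpark: the type representation (strings such as "long", and tuples for arrays and structs) and the names infer_type and merge_types are hypothetical stand-ins, not PySpark's actual DataType classes or helpers. It also assumes every dict is treated as a struct, as when spark.sql.pyspark.inferNestedDictAsStruct.enabled is set.

```python
from functools import reduce

# Hypothetical, simplified type descriptions (NOT PySpark's DataType classes):
#   "null", "boolean", "long", "double", "string"  -> atomic types
#   ("array", element_type)                         -> array
#   ("struct", {field_name: field_type})            -> struct


def merge_types(a, b):
    """Merge two inferred types; "null" gives way to any other type."""
    if a == "null":
        return b
    if b == "null":
        return a
    if isinstance(a, tuple) and isinstance(b, tuple) and a[0] == b[0]:
        if a[0] == "array":
            return ("array", merge_types(a[1], b[1]))
        # Structs: take the union of the fields, merging fields seen in both.
        fields = dict(a[1])
        for name, typ in b[1].items():
            fields[name] = merge_types(fields[name], typ) if name in fields else typ
        return ("struct", fields)
    if a == b:
        return a
    raise TypeError(f"Cannot merge types {a!r} and {b!r}")


def infer_type(value):
    """Infer a type, looking at every element of a list rather than the first."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):  # assumption: dicts are always structs here
        return ("struct", {k: infer_type(v) for k, v in value.items()})
    if isinstance(value, (list, tuple)):
        # Infer each element separately, then fold the results together.
        return ("array", reduce(merge_types, (infer_type(v) for v in value), "null"))
    raise TypeError(f"Unsupported value: {value!r}")


# The two cases from the issue description:
print(infer_type({"a": [1, None], "b": [None, 2]}))
print(infer_type({"a": [{"b": 1}, {"c": 2}]}))
```

With this sketch, both "a" and "b" in the first case come out as arrays of "long", and the array in the second case comes out as a struct with both fields "b" and "c". Elements with genuinely conflicting types (for example a string next to a struct) raise an error rather than being coerced; whether that is the right behaviour for PySpark is a separate question.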