HyukjinKwon commented on a change in pull request #33214:
URL: https://github.com/apache/spark/pull/33214#discussion_r663757786



##########
File path: python/pyspark/sql/types.py
##########
@@ -1020,10 +1020,19 @@ def _infer_type(obj):
         return dataType()
 
     if isinstance(obj, dict):
-        for key, value in obj.items():
-            if key is not None and value is not None:
-                return MapType(_infer_type(key), _infer_type(value), True)
-        return MapType(NullType(), NullType(), True)
+        from pyspark.sql.session import SparkSession
+        if (SparkSession._activeSession.conf.get(
+                "spark.sql.pyspark.inferNestedStructByMap").lower() == "true"):

Review comment:
       Can you pass a bool argument to `_infer_type` (and `_infer_schema`), and access the configuration value like https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/conversion.py#L79?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -3327,6 +3327,13 @@ object SQLConf {
     .intConf
     .createWithDefault(0)
 
+  val INFER_NESTED_STRUCT_BY_MAP = buildConf("spark.sql.pyspark.inferNestedStructByMap")
+    .internal()

Review comment:
       ```suggestion
   ```

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -3327,6 +3327,13 @@ object SQLConf {
     .intConf
     .createWithDefault(0)
 
+  val INFER_NESTED_STRUCT_BY_MAP = buildConf("spark.sql.pyspark.inferNestedStructByMap")

Review comment:
       ```suggestion
    val INFER_NESTED_DICT_AS_STRUCT = buildConf("spark.sql.pyspark.inferNestedDictAsStruct")
   ```

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -3327,6 +3327,13 @@ object SQLConf {
     .intConf
     .createWithDefault(0)
 
+  val INFER_NESTED_STRUCT_BY_MAP = buildConf("spark.sql.pyspark.inferNestedStructByMap")
+    .internal()
+    .doc("When set to false, inferring the nested struct by StructType. MapType is default.")

Review comment:
       ```suggestion
      .doc("When set to false, infers the nested dict as a struct. By default, it infers it as a map.")
   ```

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -3327,6 +3327,13 @@ object SQLConf {
     .intConf
     .createWithDefault(0)
 
+  val INFER_NESTED_STRUCT_BY_MAP = buildConf("spark.sql.pyspark.inferNestedStructByMap")

Review comment:
       ```suggestion
    val INFER_NESTED_DICT_AS_STRUCT = buildConf("spark.sql.pyspark.inferNestedDictAsStruct.enabled")
   ```

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -3327,6 +3327,13 @@ object SQLConf {
     .intConf
     .createWithDefault(0)
 
+  val INFER_NESTED_STRUCT_BY_MAP = buildConf("spark.sql.pyspark.inferNestedStructByMap")
+    .internal()
+    .doc("When set to false, inferring the nested struct by StructType. MapType is default.")
+    .version("3.2.0")
+    .booleanConf
+    .createWithDefault(true)

Review comment:
       ```suggestion
       .createWithDefault(false)
   ```
   

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -4040,6 +4047,8 @@ class SQLConf extends Serializable with Logging {
 
   def maxConcurrentOutputFileWriters: Int = getConf(SQLConf.MAX_CONCURRENT_OUTPUT_FILE_WRITERS)
 
+  def inferNestedStructByMap: Boolean = getConf(SQLConf.INFER_NESTED_STRUCT_BY_MAP)

Review comment:
       ```suggestion
    def inferDictAsStruct: Boolean = getConf(SQLConf.INFER_NESTED_DICT_AS_STRUCT)
   ```

##########
File path: python/pyspark/sql/tests/test_types.py
##########
@@ -196,6 +196,12 @@ def test_infer_nested_schema(self):
         df = self.spark.createDataFrame(nestedRdd2)
         self.assertEqual(Row(f1=[[1, 2], [2, 3]], f2=[1, 2]), df.collect()[0])
 
+        with self.sql_conf({"spark.sql.pyspark.inferNestedStructByMap": False}):

Review comment:
      Can you create a separate test? We should also add a comment, see "Pull Request" in https://spark.apache.org/contributing.html

##########
File path: python/pyspark/sql/tests/test_types.py
##########
@@ -204,6 +204,18 @@ def test_infer_nested_schema(self):
         df = self.spark.createDataFrame(rdd)
         self.assertEqual(Row(field1=1, field2=u'row1'), df.first())
 
+    def test_infer_nested_dict(self):
+        # SPARK-35929: Test inferring nested dict as a struct type.
+        NestedRow = Row("f1", "f2")
+
+        with self.sql_conf({"spark.sql.pyspark.inferNestedDictAsStruct.enabled": True}):
+            test = self.spark._wrapped._conf.inferDictAsStruct()
+            test1 = self.spark.conf.get("spark.sql.pyspark.inferNestedDictAsStruct.enabled")

Review comment:
      Looks like we should remove these.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -3327,6 +3327,12 @@ object SQLConf {
     .intConf
     .createWithDefault(0)
 
+  val INFER_NESTED_DICT_AS_STRUCT = buildConf("spark.sql.pyspark.inferNestedDictAsStruct.enabled")
+    .doc("When set to true, infers the nested dict as a struct. By default, it infers it as a map")
+    .version("3.2.0")

Review comment:
      3.3.0 since branch-3.2 is cut out. New features won't go to 3.2

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -3327,6 +3327,12 @@ object SQLConf {
     .intConf
     .createWithDefault(0)
 
+  val INFER_NESTED_DICT_AS_STRUCT = buildConf("spark.sql.pyspark.inferNestedDictAsStruct.enabled")
+    .doc("When set to true, infers the nested dict as a struct. By default, it infers it as a map")

Review comment:
      Can you write where the inference happens? `SparkSession.createDataFrame`.

##########
File path: python/pyspark/sql/tests/test_types.py
##########
@@ -204,6 +204,16 @@ def test_infer_nested_schema(self):
         df = self.spark.createDataFrame(rdd)
         self.assertEqual(Row(field1=1, field2=u'row1'), df.first())
 
+    def test_infer_nested_dict(self):
+        # SPARK-35929: Test inferring nested dict as a struct type.
+        NestedRow = Row("f1", "f2")
+
+        with self.sql_conf({"spark.sql.pyspark.inferNestedDictAsStruct.enabled": True}):
+            nestedRdd = self.sc.parallelize([NestedRow([{"payment": 200.5, "name": "A"}], [1, 2]),
+                                             NestedRow([{"payment": 100.5, "name": "B"}], [2, 3])])
+            df = self.spark.createDataFrame(nestedRdd)
+            self.assertEqual(Row(f1=[Row(payment=200.5, name='A')], f2=[1, 2]), df.collect()[0])

Review comment:
       ```suggestion
            self.assertEqual(Row(f1=[Row(payment=200.5, name='A')], f2=[1, 2]), df.first())
   ```

##########
File path: python/pyspark/sql/tests/test_types.py
##########
@@ -204,6 +204,16 @@ def test_infer_nested_schema(self):
         df = self.spark.createDataFrame(rdd)
         self.assertEqual(Row(field1=1, field2=u'row1'), df.first())
 
+    def test_infer_nested_dict(self):

Review comment:
       ```suggestion
       def test_infer_nested_dict_as_struct(self):
   ```

##########
File path: python/pyspark/sql/tests/test_types.py
##########
@@ -204,6 +204,16 @@ def test_infer_nested_schema(self):
         df = self.spark.createDataFrame(rdd)
         self.assertEqual(Row(field1=1, field2=u'row1'), df.first())
 
+    def test_infer_nested_dict(self):
+        # SPARK-35929: Test inferring nested dict as a struct type.
+        NestedRow = Row("f1", "f2")
+
+        with self.sql_conf({"spark.sql.pyspark.inferNestedDictAsStruct.enabled": True}):
+            nestedRdd = self.sc.parallelize([NestedRow([{"payment": 200.5, "name": "A"}], [1, 2]),

Review comment:
      Can we also test with a local dataset instead of an RDD? e.g., `spark.createDataFrame([NestedRow([{"payment": 200.5, "name": ", ...)` because the code path is different.
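Why the distinction matters can be sketched without Spark at all. Below is a plain-Python illustration (names invented for the sketch, not pyspark internals): distributed input is typically inferred from only a sample of rows, while a local list can be scanned directly, so a config flag threaded through inference must be honored on both entry paths.

```python
# Plain-Python sketch of two schema-inference entry paths sharing one
# flag. Function names are illustrative, not pyspark's internals.
def _infer_row(row, infer_dict_as_struct=False):
    out = {}
    for name, value in row.items():
        if isinstance(value, dict):
            # The flag decides how nested dicts are categorized.
            out[name] = "struct" if infer_dict_as_struct else "map"
        else:
            out[name] = type(value).__name__
    return out


def infer_from_local(rows, infer_dict_as_struct=False):
    # Local-list path: every row is available, so inspect them all.
    return [_infer_row(r, infer_dict_as_struct) for r in rows]


def infer_from_distributed(rows, infer_dict_as_struct=False, sample=1):
    # RDD-like path: only a sampled prefix of rows is inspected.
    return [_infer_row(r, infer_dict_as_struct) for r in rows[:sample]]
```

Both helpers must receive the flag; testing only the RDD entry point would leave the local-list path uncovered.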




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


