harshmotw-db commented on code in PR #49487:
URL: https://github.com/apache/spark/pull/49487#discussion_r1915804666
##########
python/pyspark/sql/connect/conversion.py:
##########
@@ -333,6 +340,7 @@ def convert(data: Sequence[Any], schema: StructType) -> "pa.Table":
         LocalDataToArrowConversion._create_converter(
             field.dataType,
             field.nullable,
+            variants_as_dicts=True
Review Comment:
This is mostly a hack: the data produced by these converters is almost
directly fed to a PyArrow API to create a PyArrow table [later in the
method](https://github.com/apache/spark/blob/9e6867537d17c013d84f8f5d0cfb2f33e35ce23a/python/pyspark/sql/connect/conversion.py#L384C37-L384C43).
That API doesn't know how to handle `VariantVal`, and since PyArrow is a
third-party library, we cannot change it.
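
For illustration (this sketch is not code from the PR, and the `VariantVal`
class below is a simplified stand-in for `pyspark.sql.types.VariantVal`),
PyArrow's type inference rejects Python objects it does not recognize:

```python
import pyarrow as pa

class VariantVal:
    """Simplified stand-in for pyspark.sql.types.VariantVal."""
    def __init__(self, value: bytes, metadata: bytes):
        self.value = value
        self.metadata = metadata

# pa.array() infers an Arrow type from the Python values; it has no rule
# for arbitrary objects like VariantVal, so the call fails.
try:
    pa.array([VariantVal(b"\x01", b"\x00")])
except (pa.ArrowInvalid, pa.ArrowTypeError) as e:
    print(e)  # "Could not convert ... did not recognize Python value type"
```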
The Arrow schema is a struct with metadata stating that it is a Variant, so
we convert the data to dicts, which the PyArrow API then turns into Arrow
structs.
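
A minimal sketch of why dicts work, assuming a struct-of-binaries storage
layout for Variant (the field names here are illustrative, not necessarily
the exact ones Spark uses):

```python
import pyarrow as pa

# Assumed storage layout: a struct of two binary fields. In Spark the Arrow
# field additionally carries metadata marking it as a Variant.
variant_storage = pa.struct(
    [pa.field("value", pa.binary()), pa.field("metadata", pa.binary())]
)

# Dicts keyed by the struct's field names are something PyArrow already
# knows how to convert into struct arrays.
arr = pa.array([{"value": b"\x01", "metadata": b"\x00"}], type=variant_storage)
print(arr.type)  # struct<value: binary, metadata: binary>
```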
I have set it to `True` only in this specific part of the codebase so that
`createDataFrame` works. I am thinking about cleaner ways to do this; if I
find one, I could merge it as a follow-up.
Ideally, Arrow would have its own Variant type (which could be defined using
Arrow extension types). There was [some
discussion](https://github.com/apache/arrow/issues/42069) about it.
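
As a rough sketch of what that could look like with PyArrow's extension-type
API (the `spark.variant` name and the storage layout are assumptions, not a
settled Arrow spec):

```python
import pyarrow as pa

class VariantType(pa.ExtensionType):
    """Hypothetical Variant extension type wrapping a struct of binaries."""

    def __init__(self):
        storage = pa.struct(
            [pa.field("value", pa.binary()), pa.field("metadata", pa.binary())]
        )
        super().__init__(storage, "spark.variant")

    def __arrow_ext_serialize__(self) -> bytes:
        return b""  # the type has no parameters to serialize

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

# Registration lets the type survive IPC round trips.
pa.register_extension_type(VariantType())
```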