harshmotw-db commented on code in PR #49487:
URL: https://github.com/apache/spark/pull/49487#discussion_r1915804666
##########
python/pyspark/sql/connect/conversion.py:
##########
@@ -333,6 +340,7 @@ def convert(data: Sequence[Any], schema: StructType) -> "pa.Table":
         LocalDataToArrowConversion._create_converter(
             field.dataType,
             field.nullable,
+            variants_as_dicts=True
Review Comment:
This is mostly a hack: the data produced by these converters is almost
directly fed to a PyArrow API to create a PyArrow table [later in the
method](https://github.com/apache/spark/blob/9e6867537d17c013d84f8f5d0cfb2f33e35ce23a/python/pyspark/sql/connect/conversion.py#L384C37-L384C43).
That API doesn't know how to handle `VariantVal`, and since PyArrow is a
third-party library, we cannot change it.
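
For illustration (this sketch is not code from the PR, and the `VariantVal`
class below is a simplified stand-in for `pyspark.sql.types.VariantVal`),
PyArrow's type inference rejects Python objects it does not recognize:

```python
import pyarrow as pa

class VariantVal:
    """Simplified stand-in for pyspark.sql.types.VariantVal."""
    def __init__(self, value: bytes, metadata: bytes):
        self.value = value
        self.metadata = metadata

# pa.array() infers an Arrow type from the Python values; it has no rule
# for arbitrary objects like VariantVal, so the call fails.
try:
    pa.array([VariantVal(b"\x01", b"\x00")])
except (pa.ArrowInvalid, pa.ArrowTypeError) as e:
    print(e)  # "Could not convert ... did not recognize Python value type"
```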
The Arrow schema is a struct with metadata stating that it is a Variant, so
we convert the data to dicts, which the PyArrow API then turns into Arrow
structs.
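
A minimal sketch of why dicts work, assuming a struct-of-binaries storage
layout for Variant (the field names here are illustrative, not necessarily
the exact ones Spark uses):

```python
import pyarrow as pa

# Assumed storage layout: a struct of two binary fields. In Spark the Arrow
# field additionally carries metadata marking it as a Variant.
variant_storage = pa.struct(
    [pa.field("value", pa.binary()), pa.field("metadata", pa.binary())]
)

# Dicts keyed by the struct's field names are something PyArrow already
# knows how to convert into struct arrays.
arr = pa.array([{"value": b"\x01", "metadata": b"\x00"}], type=variant_storage)
print(arr.type)  # struct<value: binary, metadata: binary>
```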
I have set it to `True` only in this specific part of the codebase so that
`createDataFrame` works. I am thinking about cleaner ways to do this; if I
find one, I could merge it as a follow-up.
Ideally, Arrow would have its own Variant type (which could be defined using
Arrow extension types). There was [some
discussion](https://github.com/apache/arrow/issues/42069) about it.
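
As a rough sketch of what that could look like with PyArrow's extension-type
API (the `spark.variant` name and the storage layout are assumptions, not a
settled Arrow spec):

```python
import pyarrow as pa

class VariantType(pa.ExtensionType):
    """Hypothetical Variant extension type wrapping a struct of binaries."""

    def __init__(self):
        storage = pa.struct(
            [pa.field("value", pa.binary()), pa.field("metadata", pa.binary())]
        )
        super().__init__(storage, "spark.variant")

    def __arrow_ext_serialize__(self) -> bytes:
        return b""  # the type has no parameters to serialize

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

# Registration lets the type survive IPC round trips.
pa.register_extension_type(VariantType())
```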