Austin Warner created SPARK-52355:
-------------------------------------
Summary: VariantVal schema improperly inferred as
struct<metadata:binary,value:binary>
Key: SPARK-52355
URL: https://issues.apache.org/jira/browse/SPARK-52355
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 4.0.0
Reporter: Austin Warner
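When a DataFrame is created from VariantVal objects constructed locally in
Python and only column names are supplied (no column types), the column type is
incorrectly inferred as a struct with binary metadata and value fields instead
of variant.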
{quote}{{>>> from pyspark.sql.types import VariantVal}}
{{>>> df = spark.createDataFrame([(VariantVal.parseJson("[1]"),)],
schema=['value'])}}
{{>>> df.printSchema()}}
{{root}}
{{|-- value: struct (nullable = true)}}
{{| |-- metadata: binary (nullable = true)}}
{{| |-- value: binary (nullable = true)}}
{{>>> df.collect()}}
{{[Row(value=Row(metadata=bytearray(b'\x01\x00\x00'),
value=bytearray(b'\x03\x01\x00\x02\x0c\x01')))]}}
{quote}
When the schema is passed explicitly, everything works as intended:
{quote}{{>>> from pyspark.sql.types import VariantVal}}
{{>>> df = spark.createDataFrame([(VariantVal.parseJson("[1]"),)],
schema='value variant')}}
{{>>> df.printSchema()}}
{{root}}
{{|-- value: variant (nullable = true)}}
{{>>> df.collect()}}
{{[Row(value=VariantVal(bytearray(b'\x03\x01\x00\x02\x0c\x01'),
bytearray(b'\x01\x00\x00')))]}}
{{>>> df.collect()[0].value.toJson()}}
{{'[1]'}}
{quote}
This appears to be because the
[{{pyspark.sql.types._infer_schema}}|https://github.com/apache/spark/blob/e3321aa44ea255365222c491657b709ef41dc460/python/pyspark/sql/types.py#L2325-L2380]
function does not include a case for VariantVal objects.
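
The suspected mechanism can be sketched in plain Python with stand-in classes rather than PySpark itself. This is an illustration under assumptions, not the actual PySpark code: it assumes the inference fallback builds a struct from an object's {{__dict__}} in sorted field order, which would be consistent with the metadata/value struct shown above. The names {{VariantValStandIn}}, {{infer_type}} and {{variant_aware}} are hypothetical.

```python
# Simplified stand-ins for illustration only; none of these names are
# PySpark's real classes or functions.


class VariantValStandIn:
    """Mimics an object that keeps its payload in two bytes attributes."""

    def __init__(self, value: bytes, metadata: bytes):
        self.value = value
        self.metadata = metadata


def infer_type(obj, variant_aware: bool) -> str:
    """Return a DDL-like type string for obj.

    With variant_aware=False there is no dedicated branch for the variant
    stand-in, so it falls through to the generic __dict__ fallback.
    """
    if variant_aware and isinstance(obj, VariantValStandIn):
        return "variant"
    if isinstance(obj, (bytes, bytearray)):
        return "binary"
    if isinstance(obj, bool):  # checked before int because bool subclasses int
        return "boolean"
    if isinstance(obj, int):
        return "bigint"
    if isinstance(obj, str):
        return "string"
    if hasattr(obj, "__dict__"):
        # Assumed fallback: treat any other object as a struct of its attributes,
        # with fields ordered by name.
        fields = sorted(vars(obj).items())
        inner = ",".join(
            f"{name}:{infer_type(val, variant_aware)}" for name, val in fields
        )
        return f"struct<{inner}>"
    raise TypeError(f"not supported type: {type(obj)}")


sample = VariantValStandIn(b"\x03\x01\x00\x02\x0c\x01", b"\x01\x00\x00")
without_case = infer_type(sample, variant_aware=False)
with_case = infer_type(sample, variant_aware=True)
print(without_case)  # struct<metadata:binary,value:binary>
print(with_case)     # variant
```

If the real inference code behaves like the fallback branch here, adding a dedicated VariantVal check ahead of it would likely be enough to produce a variant column. Until then, passing an explicit schema such as {{'value variant'}} avoids the inference path entirely.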