[jira] [Commented] (SPARK-35386) parquet read with schema should fail on non-existing columns

Rafal Wojdyla (Jira) Sun, 16 May 2021 18:09:07 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-35386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17345811#comment-17345811
 ]


Rafal Wojdyla commented on SPARK-35386:
---------------------------------------

{quote}
I think the logic is that, when users specify a schema, users know and they are 
sure on the data has the specific schema, and then it should be able to read it 
as specified.
{quote}

[~hyukjin.kwon] Is this documented somewhere? This assumption isn't very 
user friendly, since there are no utilities to easily check it. Also, the 
current behaviour of changing *required* fields to *nullable* seems like a 
bug given that assumption. 

{quote}
To do the assertion, you should manually check with one liner: 
{{assert(spark.read.parquet(...).schema == userSpecifiedSchema)}}
{quote}

This check doesn't really work, because the schema on the LHS (the one 
inferred from the file) can include extra metadata etc., in which case the 
exact equality fails.
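
A looser check would compare only field names and data types. Below is a 
minimal sketch of that idea, not an existing Spark utility: 
{{schema_mismatches}} is a hypothetical helper, and it assumes only that both 
arguments expose {{fields}} whose items have {{name}} and {{dataType}} 
attributes, as PySpark's {{StructType}} and {{StructField}} do. It 
deliberately ignores metadata and nullability.

```python
from typing import List


def schema_mismatches(actual, expected) -> List[str]:
    """Hypothetical helper: list the columns of `expected` that are missing
    from `actual` or have a different data type.

    Both arguments are StructType-like objects. Only `name` and `dataType`
    are compared; field metadata and nullability are ignored on purpose.
    """
    actual_types = {f.name: f.dataType for f in actual.fields}
    problems = []
    for f in expected.fields:
        if f.name not in actual_types:
            problems.append(f"missing column: {f.name}")
        elif actual_types[f.name] != f.dataType:
            problems.append(
                f"type mismatch for {f.name}: "
                f"expected {f.dataType}, found {actual_types[f.name]}"
            )
    return problems


# Assumed usage with the objects from the issue description:
#   file_schema = spark.read.parquet("/tmp/data.snappy.parquet").schema
#   problems = schema_mismatches(file_schema, read_schema)
#   if problems:
#       raise ValueError("; ".join(problems))
```

Extra columns in the file are not reported, so reading a subset of columns 
still passes; only columns requested in the read schema are checked.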

> parquet read with schema should fail on non-existing columns
> ------------------------------------------------------------
>
>                 Key: SPARK-35386
>                 URL: https://issues.apache.org/jira/browse/SPARK-35386
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, PySpark
>    Affects Versions: 3.0.1
>            Reporter: Rafal Wojdyla
>            Priority: Major
>
> When a read schema is specified, as a user I would prefer that Spark fail 
> on missing columns.
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import DoubleType, StructField, StructType
> spark: SparkSession = ...
> spark.read.parquet("/tmp/data.snappy.parquet")
> # inferred schema, includes 3 columns: col1, col2, new_col
> # DataFrame[col1: bigint, col2: bigint, new_col: bigint]
> # let's specify a custom read_schema, with **non nullable** col3
> # (which is not present):
> read_schema = StructType(fields=[StructField("col3", DoubleType(), False)])
> df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")
> df.schema
> # we get a DataFrame with **nullable** col3:
> # StructType(List(StructField(col3,DoubleType,true)))
> df.count()
> # 0
> {code}
> Is this a feature or a bug? In this case there's just a single parquet file. 
> I have also tried {{option("mergeSchema", "true")}}, which doesn't help.
> A similar read pattern would fail in pandas (and likely dask).



