Ryan Blue created SPARK-23418:

             Summary: DataSourceV2 should not allow userSpecifiedSchema without 
                 Key: SPARK-23418
                 URL: https://issues.apache.org/jira/browse/SPARK-23418
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Ryan Blue

DataSourceV2 currently does not reject user-specified schemas when a source 
does not implement ReadSupportWithSchema. This is confusing behavior. Here's a 
quote from a discussion on SPARK-23203:
{quote}I think this will cause confusion when source schemas change. Also, I 
can't think of a situation where it is a good idea to pass a schema that is 

Here's an example of how this will be confusing: think of a job that supplies a 
schema identical to the table's schema and runs fine, so it goes into 
production. What happens when the table's schema changes? If someone adds a 
column to the table, then the job will start failing and report that the source 
doesn't support user-supplied schemas, even though it had previously worked 
just fine with a user-supplied schema. In addition, the change to the table is 
actually compatible with the old job because the new column will be removed by 
a projection.
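
A minimal sketch of that scenario, written in plain Python rather than against
Spark's API. `FixedSchemaSource`, `load_lenient`, and the column names are
hypothetical stand-ins used only for illustration; the check assumes the
lenient behavior described above, where a user-supplied schema is accepted as
long as it happens to equal the source's own schema.

```python
class UnsupportedUserSchema(Exception):
    """Raised when a source cannot honor a user-supplied schema."""


class FixedSchemaSource:
    """Hypothetical source that reports its own schema and cannot accept one."""

    def __init__(self, columns):
        self.columns = list(columns)

    def read_schema(self):
        return list(self.columns)


def load_lenient(source, user_schema=None):
    """Assumed lenient check: tolerate a user schema only if it matches exactly."""
    table_schema = source.read_schema()
    if user_schema is not None and list(user_schema) != table_schema:
        raise UnsupportedUserSchema("source does not support user-supplied schemas")
    return table_schema


# The job hard-codes a schema that happens to match the table, so it runs.
job_schema = ["id", "name"]
table = FixedSchemaSource(["id", "name"])
first_run = load_lenient(table, job_schema)

# Someone adds a column to the table; the unchanged job now fails.
table.columns.append("created_at")
try:
    load_lenient(table, job_schema)
    second_run_error = None
except UnsupportedUserSchema as err:
    second_run_error = str(err)
```

The job itself never changed between the two runs; only the table did, and the
error message gives no hint of that.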

To fix this situation, it may be tempting to use the user-supplied schema as an 
initial projection. But that doesn't make sense because we don't need two 
projection mechanisms. If we used this as a second way to project, it would be 
confusing that you can't actually leave out columns (at least for CSV), and it 
would be odd that this path lets you coerce types, which should usually be done 
by Spark.

I think it is best not to allow a user-supplied schema when it isn't supported 
by a source.
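
As a rough sketch of that rule, again in plain Python and not Spark's actual
implementation: `InferringSource`, `load_strict`, and the `read_with_schema`
method name are hypothetical. The point is only that the outcome depends on
what the source supports, never on whether the supplied schema happens to
match.

```python
class UserSchemaNotAllowed(Exception):
    """Raised when a schema is passed to a source that cannot accept one."""


class InferringSource:
    """Hypothetical source that only reports its own schema."""

    def __init__(self, columns):
        self.columns = list(columns)

    def read_schema(self):
        return list(self.columns)


def load_strict(source, user_schema=None):
    """Reject any user schema unless the source has a schema-accepting reader."""
    if user_schema is None:
        return source.read_schema()
    if not hasattr(source, "read_with_schema"):
        raise UserSchemaNotAllowed("source does not support user-supplied schemas")
    return source.read_with_schema(user_schema)


def try_load(source, user_schema=None):
    """Helper for the demo: return the schema or a short rejection message."""
    try:
        return load_strict(source, user_schema)
    except UserSchemaNotAllowed as err:
        return "rejected: " + str(err)


table = InferringSource(["id", "name"])

# Rejected up front, even though the schema matches the table today.
matching = try_load(table, ["id", "name"])

# Still rejected, for the same reason, after the table gains a column.
table.columns.append("created_at")
after_change = try_load(table, ["id", "name"])

# Without a user schema the job keeps working and can project what it needs.
no_schema = try_load(table)
```

With this rule the failure shows up the first time the job is run, not later
when the table's schema changes.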
