Ryan Blue created SPARK-23418:
---------------------------------
Summary: DataSourceV2 should not allow userSpecifiedSchema without ReadSupportWithSchema
Key: SPARK-23418
URL: https://issues.apache.org/jira/browse/SPARK-23418
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 2.3.0
Reporter: Ryan Blue
DataSourceV2 currently does not reject a user-specified schema when a source
does not implement ReadSupportWithSchema; the schema is silently accepted
instead of causing an error. This is confusing behavior, illustrated by the
sketch below.
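As a rough illustration (the source class name here is hypothetical), a read
like the following currently succeeds even though the source cannot honor the
supplied schema:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("schema-ignored").master("local[*]").getOrCreate()

// Hypothetical v2 source that implements ReadSupport but not
// ReadSupportWithSchema. If userSchema happens to match the source's own
// schema, the read succeeds and the supplied schema has no effect; if the
// source's schema later changes, this same call starts failing.
val userSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("value", StringType)))

val df = spark.read
  .format("com.example.spark.SimpleSourceV2") // hypothetical source class
  .schema(userSchema)                         // no error, but no effect
  .load()
{code}

Here's a quote from a discussion on SPARK-23203 making the case that this
should fail instead: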
{quote}
I think this will cause confusion when source schemas change. Also, I can't
think of a situation where it is a good idea to pass a schema that is ignored.

Here's an example of how this will be confusing: think of a job that supplies a
schema identical to the table's schema and runs fine, so it goes into
production. What happens when the table's schema changes? If someone adds a
column to the table, then the job will start failing and report that the source
doesn't support user-supplied schemas, even though it had previously worked
just fine with a user-supplied schema. In addition, the change to the table is
actually compatible with the old job because the new column will be removed by
a projection.

To fix this situation, it may be tempting to use the user-supplied schema as an
initial projection. But that doesn't make sense because we don't need two
projection mechanisms. If we used this as a second way to project, it would be
confusing that you can't actually leave out columns (at least for CSV) and it
would be odd that using this path you can coerce types, which should usually be
done by Spark.

I think it is best not to allow a user-supplied schema when it isn't supported
by a source.
{quote}
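The fix proposed here amounts to failing fast whenever a schema is supplied to
a source that cannot accept one. A minimal sketch, assuming a validation helper
in the read path (the method name validateUserSchema is invented for
illustration; it is not the actual Spark patch):

{code:scala}
import org.apache.spark.sql.sources.v2.{DataSourceV2, ReadSupportWithSchema}
import org.apache.spark.sql.types.StructType

// Sketch only: reject a user-specified schema up front unless the source
// explicitly supports one, instead of silently ignoring it.
def validateUserSchema(
    source: DataSourceV2,
    userSpecifiedSchema: Option[StructType]): Unit = {
  userSpecifiedSchema.foreach { _ =>
    if (!source.isInstanceOf[ReadSupportWithSchema]) {
      // Inside Spark the real check would raise an AnalysisException.
      throw new UnsupportedOperationException(
        s"${source.getClass.getName} does not support user-specified schemas; " +
          "implement ReadSupportWithSchema to accept one.")
    }
  }
}
{code}

Failing at load time keeps the error next to the call that supplied the schema,
which matches the argument above: a job should not go into production relying
on a schema the source never actually honors.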