[ 
https://issues.apache.org/jira/browse/SPARK-26744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-26744:
-----------------------------------
    Description: 
The internal API supportDataType in FileFormat validates the input/output 
schema before task execution starts, so that we avoid launching read/write 
tasks that would fail and users see clean error messages.

This PR implements the same internal API in the FileDataSourceV2 framework. 
Compared to FileFormat, FileDataSourceV2 has multiple layers. The API is 
added in two places:

1. Read path: the table schema is determined in TableProvider.getTable. The 
actual read schema can be a subset of the table schema. This PR proposes to 
validate the actual read schema in FileScan.
2. Write path: validate the actual output schema in FileWriteBuilder.
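
The two validation points above can be sketched roughly as follows. This is 
a minimal, language-neutral model in Python, not Spark's actual Scala code: 
Field, validate_schema, csv_supports, and prune are hypothetical names, and 
the set of types the example format rejects is an assumption made purely for 
illustration. It shows one plausible reason for checking the actual read 
schema rather than the table schema: a table containing an unsupported 
column can still be read once that column is pruned away.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field (not Spark's StructField)."""
    name: str
    data_type: str  # simplified type name, e.g. "int", "string", "interval"


class UnsupportedSchemaError(Exception):
    """Raised before any task is launched when a column type is unsupported."""


def validate_schema(fmt: str, fields: Iterable[Field],
                    supports: Callable[[str], bool], operation: str) -> None:
    # Fail fast on the first unsupported column, with a readable message.
    for field in fields:
        if not supports(field.data_type):
            raise UnsupportedSchemaError(
                f"{fmt} data source does not support {field.data_type} "
                f"data type for column '{field.name}' ({operation})."
            )


def csv_supports(data_type: str) -> bool:
    # Assumption for illustration only: this format rejects these types.
    return data_type not in {"interval", "array", "map", "struct"}


def prune(schema: List[Field], required: set) -> List[Field]:
    # The actual read schema can be a subset of the table schema.
    return [f for f in schema if f.name in required]


table_schema = [
    Field("id", "int"),
    Field("name", "string"),
    Field("span", "interval"),
]

# Read path: validate only the columns that will actually be read.
read_schema = prune(table_schema, {"id", "name"})
validate_schema("CSV", read_schema, csv_supports, "read")  # passes

# Write path: validate the full output schema before writing anything.
write_error: Optional[str] = None
try:
    validate_schema("CSV", table_schema, csv_supports, "write")
except UnsupportedSchemaError as exc:
    write_error = str(exc)
```

In this sketch the read succeeds because the unsupported column is not 
requested, while the write is rejected up front with a message naming the 
offending column and type.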

  was:
The method supportDataType in FileFormat helps validate the input/output 
schema before execution starts, so that we can avoid some invalid data source 
IO and users can see clean error messages.

This PR implements the same method in the FileDataSourceV2 framework. 
Compared to FileFormat, FileDataSourceV2 has multiple layers. The API is added 
in two places:

1. FileWriteBuilder: this is where we can get the actual write schema
2. FileScan: this is where we can get the actual read schema.


> Support schema validation in File Source V2
> -------------------------------------------
>
>                 Key: SPARK-26744
>                 URL: https://issues.apache.org/jira/browse/SPARK-26744
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Gengliang Wang
>            Priority: Major
>


