GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/22009
[SPARK-24882][SQL] improve data source v2 API
## What changes were proposed in this pull request?
Improve the data source v2 API according to the [design
doc](https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing).
Summary of the changes:
1. Rename `ReadSupport` -> `DataSourceReader` -> `InputPartition` ->
`InputPartitionReader` to `BatchReadSupportProvider` -> `BatchReadSupport` ->
`InputPartition`/`PartitionReaderFactory` -> `PartitionReader`. Similar
renames apply to the streaming and write APIs.
2. Create `ScanConfig` to store query-specific information such as the operator
pushdown result, streaming offsets, etc. This makes the batch and streaming
`ReadSupport` (previously named `DataSourceReader`) immutable. All other
methods take `ScanConfig` as input, which implies that applying operator
pushdown and getting streaming offsets happen before everything else (getting
input partitions, reporting statistics, etc.).
3. Split `InputPartition` into `InputPartition` and
`PartitionReaderFactory`. This is a natural separation: data splitting and
reading are orthogonal, and we should not mix them in one interface. This also
makes the naming consistent between the read and write APIs:
`PartitionReaderFactory` vs `DataWriterFactory`.
4. Separate the batch and streaming interfaces. It is sometimes painful to
force the streaming interface to extend the batch interface, as we may need to
override some batch methods to return false, or even leak streaming concepts
into the batch API (e.g.
`DataWriterFactory#createWriter(partitionId, taskId, epochId)`).
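
To illustrate points 2 and 3, here is a minimal, self-contained sketch of the shape described above. It is not the code in this PR: every class, method name, and signature below is a simplified, hypothetical stand-in chosen for illustration, and the integer-range source and the `minValue` filter are invented. It only shows that the read support holds no per-query state, that pushdown results live in an immutable `ScanConfig`, and that planning splits is separate from creating readers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for the interfaces named in this PR.
// These are NOT the real Spark signatures.
public class ScanSketch {

    /** Immutable per-query state, e.g. the result of operator pushdown. */
    static final class ScanConfig {
        final int minValue; // assumed pushed-down filter: keep values > minValue
        ScanConfig(int minValue) { this.minValue = minValue; }
    }

    /** Describes one split of the data; carries no reading logic. */
    static final class InputPartition {
        final int start; // inclusive
        final int end;   // exclusive
        InputPartition(int start, int end) { this.start = start; this.end = end; }
    }

    /** Iterates over the records of a single partition. */
    interface PartitionReader {
        boolean next();
        int get();
    }

    /** Knows how to read a partition; independent of how data is split. */
    interface PartitionReaderFactory {
        PartitionReader createReader(InputPartition partition);
    }

    /** Stateless read support: every method takes the ScanConfig as input. */
    static final class RangeReadSupport {
        // Pushdown happens first and is captured in the ScanConfig.
        ScanConfig newScanConfig(int minValue) { return new ScanConfig(minValue); }

        // Data splitting: the values 0..9 in two partitions.
        List<InputPartition> planInputPartitions(ScanConfig config) {
            List<InputPartition> parts = new ArrayList<>();
            parts.add(new InputPartition(0, 5));
            parts.add(new InputPartition(5, 10));
            return parts;
        }

        // Data reading: the factory applies the pushed-down filter.
        PartitionReaderFactory createReaderFactory(ScanConfig config) {
            final int minValue = config.minValue;
            return partition -> new PartitionReader() {
                private int current = partition.start - 1;

                @Override
                public boolean next() {
                    // Advance to the next value that passes the filter.
                    do {
                        current++;
                    } while (current < partition.end && current <= minValue);
                    return current < partition.end;
                }

                @Override
                public int get() { return current; }
            };
        }
    }

    /** Runs a full scan: build config, plan partitions, then read each one. */
    static List<Integer> scan(int minValue) {
        RangeReadSupport support = new RangeReadSupport();
        ScanConfig config = support.newScanConfig(minValue);
        PartitionReaderFactory factory = support.createReaderFactory(config);
        List<Integer> out = new ArrayList<>();
        for (InputPartition p : support.planInputPartitions(config)) {
            PartitionReader reader = factory.createReader(p);
            while (reader.next()) {
                out.add(reader.get());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan(6)); // prints [7, 8, 9]
    }
}
```

Because `RangeReadSupport` keeps no fields, the same instance could serve several queries at once, each with its own `ScanConfig`. That is the property point 2 is after.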
## How was this patch tested?
Existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark redesign
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22009.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22009
----
commit 770a43dfdc1648dd0fb91eea2249da728cfdb360
Author: Wenchen Fan <wenchen@...>
Date: 2018-08-03T04:54:45Z
improve data source v2 API
----