GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/22009

    [SPARK-24882][SQL] improve data source v2 API

    ## What changes were proposed in this pull request?
    
    Improve the data source v2 API according to the [design doc](https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing).
    
    Summary of the changes:
    1. Rename `ReadSupport` -> `DataSourceReader` -> `InputPartition` -> 
`InputPartitionReader` to `BatchReadSupportProvider` -> `BatchReadSupport` -> 
`InputPartition`/`PartitionReaderFactory` -> `PartitionReader`. Similar 
renames apply to the streaming and write APIs.
    2. Create `ScanConfig` to store query-specific information such as the 
operator pushdown result, streaming offsets, etc. This makes the batch and 
streaming `ReadSupport` (previously named `DataSourceReader`) immutable. All 
other methods take a `ScanConfig` as input, which implies that applying 
operator pushdown and getting streaming offsets happen before everything else 
(getting input partitions, reporting statistics, etc.).
    3. Split `InputPartition` into `InputPartition` and 
`PartitionReaderFactory`. This is a natural separation: data splitting and 
reading are orthogonal, and we should not mix them in one interface. This also 
makes the naming consistent between the read and write APIs: 
`PartitionReaderFactory` vs `DataWriterFactory`.
    4. Separate the batch and streaming interfaces. Forcing the streaming 
interface to extend the batch interface is sometimes painful, as we may need to 
override some batch methods to return false, or even leak streaming concepts 
into the batch API (e.g. `DataWriterFactory#createWriter(partitionId, taskId, 
epochId)`).
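
    A rough sketch of how points 2 and 3 fit together, for illustration only. 
The interface names follow the renames listed above, but every method signature, 
the toy `RangeReadSupport` source, and the `scan` driver are simplified 
assumptions and not the actual Spark API. The sketch shows a read support that 
holds no per-query state, a `ScanConfig` that carries the pushdown result, and 
partition planning kept separate from reading.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for the interfaces named above.
// They are NOT the real Spark signatures; they only illustrate the shape.
class ScanConfigSketch {

    // Query-specific state: here, a pushed-down filter "value >= minValue".
    static final class ScanConfig {
        final int minValue;
        ScanConfig(int minValue) { this.minValue = minValue; }
    }

    // Describes one split [start, end) of the data; carries no reading logic.
    static final class InputPartition {
        final int start;
        final int end;
        InputPartition(int start, int end) { this.start = start; this.end = end; }
    }

    // Reads the rows of a single partition.
    interface PartitionReader {
        boolean next();
        int get();
    }

    // Knows how to read a partition, independent of how the data was split.
    interface PartitionReaderFactory {
        PartitionReader createReader(InputPartition partition);
    }

    // Holds no per-query state: every method receives the ScanConfig.
    interface BatchReadSupport {
        ScanConfig newScanConfig(int pushedMinValue);
        List<InputPartition> planInputPartitions(ScanConfig config);
        PartitionReaderFactory createReaderFactory(ScanConfig config);
    }

    // Toy source (an assumption for this sketch): the integers 0 until limit.
    static final class RangeReadSupport implements BatchReadSupport {
        private final int limit;
        private final int partitionSize;

        RangeReadSupport(int limit, int partitionSize) {
            this.limit = limit;
            this.partitionSize = partitionSize;
        }

        @Override
        public ScanConfig newScanConfig(int pushedMinValue) {
            return new ScanConfig(pushedMinValue);
        }

        @Override
        public List<InputPartition> planInputPartitions(ScanConfig config) {
            List<InputPartition> partitions = new ArrayList<>();
            for (int start = 0; start < limit; start += partitionSize) {
                int end = Math.min(start + partitionSize, limit);
                // Skip partitions that lie entirely below the pushed-down bound.
                if (end > config.minValue) {
                    partitions.add(new InputPartition(start, end));
                }
            }
            return partitions;
        }

        @Override
        public PartitionReaderFactory createReaderFactory(ScanConfig config) {
            final int minValue = config.minValue;
            return partition -> new PartitionReader() {
                // Start just before the first row that passes the filter.
                private int current = Math.max(partition.start, minValue) - 1;
                @Override public boolean next() { return ++current < partition.end; }
                @Override public int get() { return current; }
            };
        }
    }

    // Driver: pushdown happens first, then planning, then reading.
    static List<Integer> scan(BatchReadSupport source, int pushedMinValue) {
        ScanConfig config = source.newScanConfig(pushedMinValue);
        PartitionReaderFactory factory = source.createReaderFactory(config);
        List<Integer> rows = new ArrayList<>();
        for (InputPartition partition : source.planInputPartitions(config)) {
            PartitionReader reader = factory.createReader(partition);
            while (reader.next()) {
                rows.add(reader.get());
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        BatchReadSupport source = new RangeReadSupport(8, 4);
        System.out.println(scan(source, 3)); // [3, 4, 5, 6, 7]
        System.out.println(scan(source, 5)); // [5, 6, 7]
    }
}
```

    Because all query state lives in the `ScanConfig`, the same `source` 
instance serves both scans in `main` without being mutated between them.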
    
    ## How was this patch tested?
    
    Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark redesign

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22009.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22009
    
----
commit 770a43dfdc1648dd0fb91eea2249da728cfdb360
Author: Wenchen Fan <wenchen@...>
Date:   2018-08-03T04:54:45Z

    improve data source v2 API

----

