[jira] [Commented] (SPARK-24882) separate responsibilities of the data source v2 read API

Wenchen Fan (JIRA) Mon, 30 Jul 2018 09:40:10 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562114#comment-16562114
 ]


Wenchen Fan commented on SPARK-24882:
-------------------------------------

[~rdblue] I agree that creating a catalog via reflection and then using the 
catalog to create a `ReadSupport` instance is cleaner. The problem is that we 
would then have to make `CatalogSupport` mandatory for data sources instead of 
an optional plugin. How about we rename the old `ReadSupport` to 
`ReadSupportProvider` for data sources that don't have a catalog? It works like 
a dynamic constructor of `ReadSupport`, so that Spark can create a 
`ReadSupport` by reflection.
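
To make the `ReadSupportProvider` idea concrete, here is a minimal sketch in 
Java. Everything in it is a simplified, hypothetical stand-in rather than the 
actual Spark API: the `describe` and `createReadSupport` methods, the 
`ExampleSource` class, and the `ReflectiveLoader` helper are assumptions made 
for illustration only.

```java
import java.util.Map;

// Hypothetical, simplified stand-ins for illustration; not the actual Spark interfaces.
interface ReadSupport {
    String describe();
}

// Implemented by sources that have no catalog; acts as a dynamic constructor.
interface ReadSupportProvider {
    ReadSupport createReadSupport(Map<String, String> options);
}

// A toy source. It needs a no-arg constructor so it can be instantiated by reflection.
class ExampleSource implements ReadSupportProvider {
    public ExampleSource() {}

    @Override
    public ReadSupport createReadSupport(Map<String, String> options) {
        String path = options.getOrDefault("path", "<none>");
        return () -> "ExampleSource reading " + path;
    }
}

// Sketch of what the engine side might do: load the provider class by name,
// instantiate it reflectively, then ask it for a ReadSupport.
class ReflectiveLoader {
    static ReadSupport load(String className, Map<String, String> options) {
        try {
            Object instance = Class.forName(className).getDeclaredConstructor().newInstance();
            if (!(instance instanceof ReadSupportProvider)) {
                throw new IllegalArgumentException(className + " is not a ReadSupportProvider");
            }
            return ((ReadSupportProvider) instance).createReadSupport(options);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + className, e);
        }
    }
}
```

With this shape, only the provider has to be constructible by reflection; the 
`ReadSupport` it returns can take whatever constructor arguments the source 
needs.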

For the builder issue, I'm OK with adding a mutable `ScanConfigBuilder` that 
can mix in the `SupportsPushdownXYZ` traits, so that `ScanConfig` itself can be 
immutable. I think this model is simpler: the `ScanConfigBuilder` tracks all 
the pushed operators, checks the current status, and gives feedback to Spark 
about the next operator pushdown. We can design a pure builder-like pushdown 
API for `ScanConfigBuilder` later. We need to support pushdown of more 
operators to evaluate the design, so it seems safer to keep the pushdown API 
unchanged for now. What do you think?
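
A rough sketch of the builder model, again in Java and under stated 
assumptions: filters and columns are plain strings, the two mix-in interfaces 
are hypothetical examples of `SupportsPushdownXYZ`, and the rule that only 
equality predicates can be pushed is invented for the toy source. It is meant 
to show the mutable builder / immutable config split, not the real API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified names for illustration; not the actual Spark API.
// Immutable result of the pushdown negotiation.
final class ScanConfig {
    private final List<String> pushedFilters;
    private final List<String> requiredColumns;

    ScanConfig(List<String> pushedFilters, List<String> requiredColumns) {
        // Defensive copies, so later changes to the builder cannot leak in.
        this.pushedFilters = Collections.unmodifiableList(new ArrayList<>(pushedFilters));
        this.requiredColumns = Collections.unmodifiableList(new ArrayList<>(requiredColumns));
    }

    List<String> pushedFilters() { return pushedFilters; }
    List<String> requiredColumns() { return requiredColumns; }
}

interface ScanConfigBuilder {
    ScanConfig build();
}

// Mix-in: the return value is the feedback to the engine, i.e. the filters
// the source could not handle and the engine must evaluate itself.
interface SupportsPushDownFilters extends ScanConfigBuilder {
    List<String> pushFilters(List<String> filters);
}

// Mix-in for column pruning.
interface SupportsPushDownRequiredColumns extends ScanConfigBuilder {
    void pruneColumns(List<String> columns);
}

// Mutable builder that tracks every pushed operator.
final class ExampleScanConfigBuilder
        implements SupportsPushDownFilters, SupportsPushDownRequiredColumns {
    private final List<String> pushed = new ArrayList<>();
    private List<String> columns = new ArrayList<>();

    @Override
    public List<String> pushFilters(List<String> filters) {
        List<String> unsupported = new ArrayList<>();
        for (String filter : filters) {
            // Assumption: this toy source only understands equality predicates.
            if (filter.contains(" = ")) {
                pushed.add(filter);
            } else {
                unsupported.add(filter);
            }
        }
        return unsupported;
    }

    @Override
    public void pruneColumns(List<String> requiredColumns) {
        this.columns = new ArrayList<>(requiredColumns);
    }

    @Override
    public ScanConfig build() {
        return new ScanConfig(pushed, columns);
    }
}
```

Here all mutation stays in the builder, and the `ScanConfig` returned by 
`build()` is a snapshot that can be shared safely.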

> separate responsibilities of the data source v2 read API
> --------------------------------------------------------
>
>                 Key: SPARK-24882
>                 URL: https://issues.apache.org/jira/browse/SPARK-24882
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>
> Data source V2 has been out for a while; see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
> We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems that we want to address before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 read API. For details, see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

