[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562058#comment-16562058 ]

Ryan Blue commented on SPARK-24882:
-----------------------------------

[~cloud_fan], when you say that "ReadSupport is created via reflection and 
should be very light-weight", that raises a red flag for me. As I've mentioned 
in the discussions on table catalog support, I think that we should instantiate 
the catalog through reflection, not the data source. Starting with the data 
source is backward and only works now because we have just one global catalog. 
In my catalog support PR, the catalog is what gets instantiated via reflection. 
In that model, the initialization you need would be in the Table instance's 
life-cycle and the catalog would return tables that implement ReadSupport. So, 
I think we should consider the impact of multiple catalog support.

I think that if we go with the catalog proposal, it would make this change 
cleaner because we would be able to remove the extra DataSourceReader level. 
This would also end up working like a lot of other Table and Scan interfaces. 
Iceberg, for example, has a Table that you call newScan on to get a 
configurable TableScan. That's very similar to how this API would end up: Table 
implements ReadSupport, ReadSupport exposes newScanConfig. Getting a Reader 
that doesn't really read data just doesn't make sense.
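
To make that shape concrete, here's a rough sketch of how the pieces could fit 
together. The names follow this discussion ({{Table}}, {{ReadSupport}}, 
{{ScanConfig}}, {{newScanConfig}}), and the {{InputPartition}} placeholder just 
stands in for whatever partition type we end up with; none of this is meant as 
a final API:

{code:java}
import org.apache.spark.sql.types.StructType;

// Placeholder for whatever unit of parallel work a scan is split into.
interface InputPartition { }

// Immutable description of one scan: pushed filters, required schema, etc.
interface ScanConfig {
  StructType readSchema();
}

// Mixed into a table that can be read; there is no separate DataSourceReader.
interface ReadSupport {
  ScanConfig newScanConfig();                              // like Iceberg's Table.newScan()
  InputPartition[] planInputPartitions(ScanConfig config);
}

// What a catalog (e.g. a TableCatalog's loadTable) would hand back.
interface Table extends ReadSupport {
  String name();
  StructType schema();
}
{code}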

For the builder discussion: does the current proposal work for the 
{{Filter(Limit(Filter(Scan)))}} case? I don't think it does, because 
implementations expect predicates to be pushed just once. These cases should 
probably be handled by a more generic push-down that passes parts of the plan 
tree to the source. If so, the builder can stay simple and immutable.

I'd rather go with the builder for now to set the expectation that 
{{ScanConfig}} is immutable. We *could* instead add an intermediate class, like 
{{ConfigurableScanConfig}}, that exposes the {{SupportsXYZ}} pushdown 
interfaces, plus a method on {{ConfigurableScanConfig}} that returns a final, 
immutable {{ScanConfig}}. But that's really just a builder with an interface 
that is harder to implement.
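
For comparison, here is roughly what the builder route looks like; 
{{ScanConfigBuilder}} and the specific {{SupportsPushDown*}} mixins are 
illustrative names, not a settled API:

{code:java}
import org.apache.spark.sql.sources.Filter;
import org.apache.spark.sql.types.StructType;

// Same role as the ScanConfig sketched above, now also reporting what was pushed.
interface ScanConfig {
  StructType readSchema();
  Filter[] pushedFilters();
}

// The only mutable phase; Spark applies the push-down mixins, then calls build().
interface ScanConfigBuilder {
  ScanConfig build();
}

// Optional mixins an implementation opts into; all applied before build().
interface SupportsPushDownFilters extends ScanConfigBuilder {
  Filter[] pushFilters(Filter[] filters);   // returns the filters the source cannot handle
}

interface SupportsPushDownRequiredColumns extends ScanConfigBuilder {
  void pruneColumns(StructType requiredSchema);
}
{code}

Either way, the implementation never sees a half-configured scan: everything 
mutable lives in the builder, and {{build()}} hands Spark an immutable 
{{ScanConfig}}.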

If we do find pushdown operations that are incompatible with a builder 
specifically (not ones that, like your example, are incompatible with both a 
builder and the current proposal), then we can always add a new way to build or 
configure a {{ScanConfig}} later. The important thing is that {{ScanConfig}} is 
immutable, because that provides the guarantee you want from this API: the 
{{ScanConfig}} won't change between calls to {{estimateStatistics}} and 
{{planInputPartitions}}.
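
Concretely, that guarantee looks something like this on Spark's side. This is 
just a sketch reusing the illustrative types from the two snippets above, plus 
a hypothetical {{SupportsScan}} surface with {{newScanConfigBuilder}} and 
{{estimateStatistics}}; it describes the contract I'm arguing for, not existing 
code:

{code:java}
import org.apache.spark.sql.sources.Filter;

// Placeholder; the real thing would expose row count and size in bytes.
interface Statistics { }

// Hypothetical read-side surface of a table under the builder model.
interface SupportsScan {
  ScanConfigBuilder newScanConfigBuilder();
  Statistics estimateStatistics(ScanConfig config);
  InputPartition[] planInputPartitions(ScanConfig config);
}

final class ReadPathSketch {
  static InputPartition[] planScan(SupportsScan table, Filter[] translatedFilters) {
    ScanConfigBuilder builder = table.newScanConfigBuilder();
    if (builder instanceof SupportsPushDownFilters) {
      // push-down happens exactly once, before the config is frozen
      ((SupportsPushDownFilters) builder).pushFilters(translatedFilters);
    }
    ScanConfig config = builder.build();

    // Both calls see exactly the same pushed filters and schema,
    // because the config cannot change in between.
    Statistics stats = table.estimateStatistics(config);
    return table.planInputPartitions(config);
  }
}
{code}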

> separate responsibilities of the data source v2 read API
> --------------------------------------------------------
>
>                 Key: SPARK-24882
>                 URL: https://issues.apache.org/jira/browse/SPARK-24882
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>
> Data source V2 has been out for a while; see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
> We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source V2 read API. For details, please see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing


