[GitHub] spark issue #22547: [SPARK-25528][SQL] data source V2 read side API refactor...

rdblue Fri, 19 Oct 2018 15:37:33 -0700

Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/22547
  
    After looking at the changes, I want to reiterate the request for a design 
doc. I think that code is a great way to prototype a design, but that we need 
to step back and make sure that the design makes sense when you view it from a 
high level.
    
    I have two main motivations for that point. First, there are some classes 
that I don't see a justification for, like having a separate ScanConfig, 
BatchScan, and PartitionReaderFactory. Are all of those separate classes 
necessary? Can a ScanConfigBuilder return a BatchScan? Can BatchScan expose a 
createBatchReader(InputPartition) method?
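
    To make the question concrete, here is a rough sketch of what the collapsed shape could look like. It is only an illustration under stated assumptions: ScanConfigBuilder, BatchScan, InputPartition, and createBatchReader are the names used above, but every signature, the BatchReader interface, and the Range* toy classes are hypothetical and not taken from the PR. It also ignores serialization and where readers are actually created.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for one split of the input; not the PR's definition.
interface InputPartition {}

// Hypothetical minimal reader over a single partition.
interface BatchReader<T> extends Iterator<T> {}

// The scan plans partitions and creates readers itself,
// so this sketch has no separate PartitionReaderFactory.
interface BatchScan<T> {
  List<InputPartition> planInputPartitions();
  BatchReader<T> createBatchReader(InputPartition partition);
}

// The builder returns the scan directly, with no intermediate ScanConfig.
interface ScanConfigBuilder<T> {
  BatchScan<T> build();
}

// Toy source (hypothetical): the integers [0, limit) split into fixed-size ranges.
final class RangePartition implements InputPartition {
  final int start;
  final int end;
  RangePartition(int start, int end) { this.start = start; this.end = end; }
}

final class RangeScan implements BatchScan<Integer> {
  private final int limit;
  private final int splitSize;
  RangeScan(int limit, int splitSize) { this.limit = limit; this.splitSize = splitSize; }

  @Override
  public List<InputPartition> planInputPartitions() {
    List<InputPartition> parts = new ArrayList<>();
    for (int s = 0; s < limit; s += splitSize) {
      parts.add(new RangePartition(s, Math.min(s + splitSize, limit)));
    }
    return parts;
  }

  @Override
  public BatchReader<Integer> createBatchReader(InputPartition partition) {
    final RangePartition p = (RangePartition) partition;
    return new BatchReader<Integer>() {
      private int current = p.start;
      @Override public boolean hasNext() { return current < p.end; }
      @Override public Integer next() {
        if (!hasNext()) throw new NoSuchElementException();
        return current++;
      }
    };
  }
}

final class RangeScanBuilder implements ScanConfigBuilder<Integer> {
  private final int limit;
  private final int splitSize;
  RangeScanBuilder(int limit, int splitSize) { this.limit = limit; this.splitSize = splitSize; }
  @Override public BatchScan<Integer> build() { return new RangeScan(limit, splitSize); }
}

final class ScanDemo {
  // Reads every partition in order and concatenates the values.
  static List<Integer> readAll(BatchScan<Integer> scan) {
    List<Integer> out = new ArrayList<>();
    for (InputPartition p : scan.planInputPartitions()) {
      BatchReader<Integer> reader = scan.createBatchReader(p);
      while (reader.hasNext()) out.add(reader.next());
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(readAll(new RangeScanBuilder(5, 2).build()));
  }
}
```

    With a limit of 5 and a split size of 2, the toy scan plans three partitions and reads the values 0 through 4. Whether this collapsed shape is actually workable is exactly the kind of question a design doc should answer.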
    
    My second motivation for saying we need a clear design doc is that I think 
that the current way to interact with v2 doesn't fit well with catalogs. This 
is based around Format, which is based on the v1 method of loading read and 
write implementations. But that isn't the primary way that v2 will be used. 
It happens to be the only way to call into the v2 API from Spark today, 
but the primary use of v2 is to integrate sources that are actually modeled as 
tables in some catalog.
    
    For example, Format exposes getTable that returns a Table implementation 
from DataSourceOptions. Those options have tableName and databaseName methods. 
But tables that are identified by name shouldn't be loaded by a Format, they 
should be loaded by a catalog. getTable also uses the same options for both 
table options and read options because there isn't a way to pass them 
separately. But most tables will 
be created with table options by a catalog and will accept read-specific 
options passed to the DataFrameReader.
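
    As a rough illustration of that separation, the sketch below has a catalog load a table by name, with the table keeping the options it was created with and read options supplied per read. Everything here is hypothetical: the Table, TableCatalog, and InMemoryCatalog types, their method names, and the rule that read options override table options are assumptions made for the example, not part of the PR or of any existing Spark API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical: a table remembers the options it was created with.
final class Table {
  private final String name;
  private final Map<String, String> tableOptions;

  Table(String name, Map<String, String> tableOptions) {
    this.name = name;
    this.tableOptions = new HashMap<>(tableOptions);
  }

  String name() { return name; }

  // Read-specific options arrive per read and are layered over the table
  // options. Letting read options win on conflict is an assumption.
  Map<String, String> effectiveOptions(Map<String, String> readOptions) {
    Map<String, String> merged = new HashMap<>(tableOptions);
    merged.putAll(readOptions);
    return Collections.unmodifiableMap(merged);
  }
}

// Hypothetical: tables identified by name are loaded by a catalog, not a Format.
interface TableCatalog {
  Table loadTable(String database, String table);
}

final class InMemoryCatalog implements TableCatalog {
  private final Map<String, Table> tables = new HashMap<>();

  // Table options are fixed when the catalog creates the table.
  void createTable(String database, String table, Map<String, String> tableOptions) {
    String key = database + "." + table;
    tables.put(key, new Table(key, tableOptions));
  }

  @Override
  public Table loadTable(String database, String table) {
    String key = database + "." + table;
    Table found = tables.get(key);
    if (found == null) {
      throw new IllegalArgumentException("No such table: " + key);
    }
    return found;
  }
}
```

    In this shape the caller never packs databaseName or tableName into an options map: the name goes to the catalog, the table options stay with the table, and only read-specific options travel with the read.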
    
    I think we would approach a usable API much sooner if this work was planned 
based on a shared understanding of how catalogs and tables will interact in the 
future. Not having a catalog API right now is affecting the way tables work in 
this PR, and that's a concern for me.


