Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/22547
After looking at the changes, I want to reiterate the request for a design
doc. I think that code is a great way to prototype a design, but that we need
to step back and make sure that the design makes sense when you view it from a
high level.
I have two main motivations for that point. First, there are some classes
that I don't see a justification for, like having a separate ScanConfig,
BatchScan, and PartitionReaderFactory. Are all of those separate classes
necessary? Can a ScanConfigBuilder return a BatchScan? Can BatchScan expose a
createBatchReader(InputPartition) method?
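
To make the two questions above concrete, here is a minimal sketch of what the consolidated shape could look like: the builder returns a BatchScan directly, and BatchScan itself creates readers for each InputPartition. Every interface here is a simplified, hypothetical stand-in written for illustration, not the interfaces in this PR or in Spark. `limitTo`, `RangePartition`, `RangeScanBuilder`, and `ScanSketch` are invented names, and the `PartitionReader` type is a bare iterator assumed for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical, simplified interfaces for illustration only; these are
// not the interfaces in this PR or in Spark.
interface InputPartition {}

interface PartitionReader<T> extends Iterator<T> {}

interface BatchScan {
    List<InputPartition> planInputPartitions();

    // Takes over the role of a separate PartitionReaderFactory.
    PartitionReader<Integer> createBatchReader(InputPartition partition);
}

interface ScanConfigBuilder {
    // Hypothetical stand-in for whatever configuration gets pushed down.
    ScanConfigBuilder limitTo(int maxValue);

    // Returns the scan directly, with no intermediate ScanConfig.
    BatchScan build();
}

// A partition covering the half-open integer range [start, end).
final class RangePartition implements InputPartition {
    final int start;
    final int end;

    RangePartition(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

final class RangeScanBuilder implements ScanConfigBuilder {
    private int maxValue = 10;

    @Override
    public ScanConfigBuilder limitTo(int maxValue) {
        this.maxValue = maxValue;
        return this;
    }

    @Override
    public BatchScan build() {
        final int end = maxValue;
        return new BatchScan() {
            @Override
            public List<InputPartition> planInputPartitions() {
                // Split [0, end) into two partitions at the midpoint.
                int mid = end / 2;
                List<InputPartition> parts = new ArrayList<>();
                parts.add(new RangePartition(0, mid));
                parts.add(new RangePartition(mid, end));
                return parts;
            }

            @Override
            public PartitionReader<Integer> createBatchReader(InputPartition partition) {
                final RangePartition range = (RangePartition) partition;
                return new PartitionReader<Integer>() {
                    private int cursor = range.start;

                    @Override
                    public boolean hasNext() {
                        return cursor < range.end;
                    }

                    @Override
                    public Integer next() {
                        if (!hasNext()) {
                            throw new NoSuchElementException();
                        }
                        return cursor++;
                    }
                };
            }
        };
    }
}

final class ScanSketch {
    // Plans the partitions, then reads each one through the scan itself.
    static List<Integer> readAll(BatchScan scan) {
        List<Integer> rows = new ArrayList<>();
        for (InputPartition partition : scan.planInputPartitions()) {
            PartitionReader<Integer> reader = scan.createBatchReader(partition);
            while (reader.hasNext()) {
                rows.add(reader.next());
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(readAll(new RangeScanBuilder().limitTo(6).build()));
    }
}
```

Whether this collapse is actually workable depends on constraints the sketch ignores, such as what has to be serialized and sent to executors. That is the kind of question a design doc would need to settle.
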
My second motivation for saying we need a clear design doc is that I think
that the current way to interact with v2 doesn't fit well with catalogs. This
is based around Format, which is based on the v1 method of loading read and
write implementations. But that isn't the primary way that v2 will be used.
It happens to be the only way to call into the v2 API from Spark today,
but the primary use of v2 is to integrate sources that are actually modeled as
tables in some catalog.
For example, Format exposes getTable that returns a Table implementation
from DataSourceOptions. Those options have tableName and databaseName methods.
But tables that are identified by name shouldn't be loaded by a Format, they
should be loaded by a catalog. getTable also uses the same options as both
table options and read options because there isn't a way to pass the two
separately. But most tables will be created with table options by a catalog
and will accept read-specific options passed to the DataFrameReader.
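
As a rough illustration of the separation described above, the sketch below has a catalog load a table by name with table options fixed at creation time, while read options arrive separately with each scan. No catalog API exists in this PR, so every name here (`Catalog`, `Table`, `TableScan`, `InMemoryCatalog`, `CatalogSketch`, `newScan`) is hypothetical, and the option keys are made up for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical names for illustration only; this is not an API from
// this PR or from Spark.
final class TableScan {
    final Map<String, String> tableOptions; // fixed when the table was created
    final Map<String, String> readOptions;  // supplied for this read only

    TableScan(Map<String, String> tableOptions, Map<String, String> readOptions) {
        this.tableOptions = tableOptions;
        this.readOptions = readOptions;
    }
}

interface Table {
    String name();

    Map<String, String> tableOptions();

    // Read-specific options are passed here, not mixed into table options.
    TableScan newScan(Map<String, String> readOptions);
}

interface Catalog {
    // Tables identified by name are resolved by the catalog.
    Table loadTable(String databaseName, String tableName);
}

final class InMemoryCatalog implements Catalog {
    private final Map<String, Table> tables = new HashMap<>();

    void createTable(String databaseName, String tableName, Map<String, String> tableOptions) {
        final String fullName = databaseName + "." + tableName;
        final Map<String, String> fixedOptions =
            Collections.unmodifiableMap(new HashMap<>(tableOptions));
        tables.put(fullName, new Table() {
            @Override
            public String name() {
                return fullName;
            }

            @Override
            public Map<String, String> tableOptions() {
                return fixedOptions;
            }

            @Override
            public TableScan newScan(Map<String, String> readOptions) {
                return new TableScan(
                    fixedOptions,
                    Collections.unmodifiableMap(new HashMap<>(readOptions)));
            }
        });
    }

    @Override
    public Table loadTable(String databaseName, String tableName) {
        Table table = tables.get(databaseName + "." + tableName);
        if (table == null) {
            throw new IllegalArgumentException(
                "No such table: " + databaseName + "." + tableName);
        }
        return table;
    }
}

final class CatalogSketch {
    public static void main(String[] args) {
        InMemoryCatalog catalog = new InMemoryCatalog();
        // Option keys below are invented for the example.
        catalog.createTable("db", "events", Collections.singletonMap("format", "parquet"));
        Table table = catalog.loadTable("db", "events");
        TableScan scan = table.newScan(Collections.singletonMap("batch-size", "100"));
        System.out.println(table.name() + " " + scan.tableOptions + " " + scan.readOptions);
    }
}
```

The point of the sketch is only that the two kinds of options travel through different calls, so neither the table nor the reader has to guess which entries in a single options map belong to it.
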
I think we would approach a usable API much sooner if this work were planned
based on a shared understanding of how catalogs and tables will interact in the
future. Not having a catalog API right now is affecting the way tables work in
this PR, and that's a concern for me.