I’d just pass them [partitioning/bucketing] as options, until there are
clear (and strong) use cases to do them otherwise.

I don’t think it makes sense to pass partitioning and bucketing information
*into* this API. The writer should already know the table structure and
should pass relevant information back out to Spark so it can sort and group
data for storage.

I think the idea of passing the table structure into the writer comes from
the current implementation, where the table may not exist before a data
frame is written. But that isn’t something that should be carried forward.
I think the writer should be responsible for writing into an
already-configured table. That’s the normal case we should design for.
Creating a table at the same time (CTAS) is a convenience, but should be
implemented by creating an empty table and then running the same writer
that would have been used for an insert into an existing table.
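
To make that concrete, CTAS would be roughly these two steps. This is just a
sketch that uses today’s public API to illustrate the flow; the "parquet"
source and the empty options map are placeholders, and in the real design the
create step is where partitioning/bucketing would be configured:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    def ctasAsCreateThenInsert(spark: SparkSession, df: DataFrame, table: String): Unit = {
      // 1. Create the empty, fully configured table first.
      spark.catalog.createTable(table, "parquet", df.schema, Map.empty[String, String])
      // 2. Run the exact same writer that an insert into an existing table would use.
      df.write.insertInto(table)
    }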

Otherwise, there’s confusion about how to handle the options. What should
the writer do when the partitioning passed in doesn’t match the table’s
partitioning? We already have this situation in the DataFrameWriter API,
where calling partitionBy and then insertInto throws an exception. I’d like
to keep that case out of this API by setting the expectation that the tables
this API writes to already exist.
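
For reference, this is the existing behavior I mean (the table name below is
just a placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val df = spark.range(10).selectExpr("id", "id % 2 AS part")

    // Throws AnalysisException with a message along the lines of
    // "insertInto() can't be used together with partitionBy()", because the
    // table's own partition columns are expected to win.
    df.write.partitionBy("part").insertInto("some_existing_table")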

rb

On Wed, Sep 20, 2017 at 9:52 AM, Reynold Xin <r...@databricks.com> wrote:

>
>
> On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Hi all,
>>
>> I want to have some discussion about the Data Source V2 write path before
>> starting a vote.
>>
>> The Data Source V1 write path asks implementations to write a DataFrame
>> directly, which is painful:
>> 1. Exposing an upper-level API like DataFrame to the Data Source API is
>> not good for maintenance.
>> 2. Data sources may need to preprocess the input data before writing, like
>> clustering/sorting the input by some columns. It's better to do that
>> preprocessing in Spark than in each data source.
>> 3. Data sources need to take care of transactions themselves, which is
>> hard. And different data sources may come up with very similar approaches
>> to transactions, which leads to a lot of duplicated code.
>>
>>
>> To solve these pain points, I'm proposing a data source writing framework
>> which is very similar to the reading framework, i.e., WriteSupport ->
>> DataSourceV2Writer -> WriteTask -> DataWriter. You can take a look at my
>> prototype to see what it looks like:
>> https://github.com/apache/spark/pull/19269
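>>
>> Roughly, the chain would look something like this (a heavily simplified
>> sketch; the real interfaces and signatures are in the PR and may still
>> change, and the WriterCommitMessage name below is just a placeholder):
>>
>>     import org.apache.spark.sql.Row
>>     import org.apache.spark.sql.types.StructType
>>
>>     // Placeholder for whatever a task hands back to the driver on commit.
>>     trait WriterCommitMessage extends Serializable
>>
>>     // Mixed into a data source that supports writing.
>>     trait WriteSupport {
>>       def createWriter(schema: StructType, options: Map[String, String]): DataSourceV2Writer
>>     }
>>
>>     // Driver side: creates write tasks and does the job-level commit/abort.
>>     trait DataSourceV2Writer {
>>       def createWriteTask(): WriteTask
>>       def commit(messages: Seq[WriterCommitMessage]): Unit
>>       def abort(messages: Seq[WriterCommitMessage]): Unit
>>     }
>>
>>     // Serialized and sent to executors; creates one DataWriter per partition.
>>     trait WriteTask extends Serializable {
>>       def createDataWriter(partitionId: Int): DataWriter
>>     }
>>
>>     // Executor side: consumes rows and does the task-level commit/abort.
>>     trait DataWriter {
>>       def write(row: Row): Unit
>>       def commit(): WriterCommitMessage
>>       def abort(): Unit
>>     }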
>>
>> There are some other details that need further discussion:
>> 1. *partitioning/bucketing*
>> Currently only the built-in file-based data sources support them, but
>> there is nothing stopping us from exposing them to all data sources. One
>> question is: shall we make them mix-in interfaces for the data source v2
>> reader/writer, or just encode them into data source options (a
>> string-to-string map)? Ideally they're more like options: Spark just passes
>> this user-given information to the data sources and doesn't do anything
>> with it.
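>>
>> For example, if they're just options, the user-facing side could be as
>> simple as this (the option keys below are made up, not a proposal):
>>
>>     import org.apache.spark.sql.SparkSession
>>
>>     val spark = SparkSession.builder().getOrCreate()
>>     spark.range(100).selectExpr("id", "id % 10 AS part")
>>       .write
>>       .format("com.example.source")           // placeholder data source
>>       .option("partitionColumns", "part")     // made-up option key
>>       .option("bucketColumns", "id")          // made-up option key
>>       .option("numBuckets", "8")              // made-up option key
>>       .save()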
>>
>
>
> I'd just pass them as options, until there are clear (and strong) use
> cases to do them otherwise.
>
>
> +1 on the rest.
>
>
>
>>
>> 2. *input data requirement*
>> Data sources should be able to ask Spark to preprocess the input data,
>> and this can be a mix-in interface for DataSourceV2Writer. I think we need
>> to add a clustering request and a sort-within-partitions request; anything
>> more?
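>>
>> As a strawman, such a mix-in could look something like this (the trait and
>> method names below are made up):
>>
>>     // A possible mix-in for DataSourceV2Writer: the writer tells Spark how
>>     // the input should be preprocessed before rows reach the write tasks.
>>     trait SupportsInputDataRequirement {
>>       // Rows with the same values for these columns should end up in the
>>       // same write task (i.e. Spark shuffles/clusters the input by them).
>>       def requiredClustering: Seq[String]
>>       // Within each write task, rows should be sorted by these columns.
>>       def requiredSortOrder: Seq[String]
>>     }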
>>
>> 3. *transaction*
>> I think we can just follow `FileCommitProtocol`, which is the internal
>> framework Spark uses to guarantee transactions for the built-in file-based
>> data sources. Generally speaking, we need task-level and job-level
>> commit/abort.
>> Again you can see more details in my prototype about it:
>> https://github.com/apache/spark/pull/19269
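>>
>> In terms of the interfaces sketched above, the flow Spark would drive is
>> roughly this (error handling trimmed down to the essentials):
>>
>>     import org.apache.spark.sql.Row
>>
>>     // Executor side: write every row of the partition, then do the
>>     // task-level commit; abort the task if anything goes wrong.
>>     def runTask(task: WriteTask, partitionId: Int, rows: Iterator[Row]): WriterCommitMessage = {
>>       val writer = task.createDataWriter(partitionId)
>>       try {
>>         rows.foreach(writer.write)
>>         writer.commit()
>>       } catch {
>>         case t: Throwable =>
>>           writer.abort()
>>           throw t
>>       }
>>     }
>>
>>     // Driver side: the job-level commit happens only after every task has
>>     // committed; otherwise the whole job is aborted.
>>     def runJob(writer: DataSourceV2Writer, partitions: Seq[(Int, Iterator[Row])]): Unit = {
>>       val task = writer.createWriteTask()
>>       try {
>>         val messages = partitions.map { case (id, rows) => runTask(task, id, rows) }
>>         writer.commit(messages)
>>       } catch {
>>         case t: Throwable =>
>>           writer.abort(Seq.empty)
>>           throw t
>>       }
>>     }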
>>
>> 4. *data source table*
>> This is the trickiest one. In Spark you can create a table which points
>> to a data source, so you can read/write this data source easily by
>> referencing the table name. Ideally a data source table is just a pointer
>> to a data source with a list of predefined options, to save users from
>> typing these options again and again for each query.
>> If that's all, then everything is good and we don't need to add more
>> interfaces to Data Source V2. However, data source tables provide special
>> operators like ALTER TABLE SCHEMA, ADD PARTITION, etc., which require data
>> sources to have some extra abilities.
>> Currently these special operators only work for the built-in file-based
>> data sources, and I don't think we will extend them in the near future, so
>> I propose to mark them as out of scope.
>>
>>
>> Any comments are welcome!
>> Thanks,
>> Wenchen
>>
>
>


-- 
Ryan Blue
Software Engineer
Netflix
