I'm adding my own +1 (binding). (For anyone skimming the thread, I've appended a few rough sketches of how I read the proposed interfaces at the bottom of this mail.)

On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
> I'm going to update the proposal: for the last point, although the
> user-facing API (`df.write.format(...).option(...).mode(...).save()`)
> mixes data and metadata operations, we are still able to separate them in
> the data source write API. We can have a mix-in trait `MetadataSupport`
> which has a method `create(options)`, so that data sources can mix in this
> trait and provide metadata creation support. Spark will call this `create`
> method inside `DataFrameWriter.save` if the specified data source has it.
>
> Note that file format data sources can ignore this new trait and still
> write data without metadata (they don't have metadata anyway).
>
> With this updated proposal, I'm calling a new vote for the data source v2
> write path.
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Thanks!
>
> On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Hi all,
>>
>> Now that we have merged the infrastructure of the data source v2 read
>> path and had some discussion about the write path, I'm sending this email
>> to call a vote for the Data Source v2 write path.
>>
>> The full document of the Data Source API V2 is:
>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>
>> The ready-for-review PR that implements the basic infrastructure for the
>> write path:
>> https://github.com/apache/spark/pull/19269
>>
>> The Data Source V1 write path asks implementations to write a DataFrame
>> directly, which is painful:
>> 1. Exposing an upper-level API like DataFrame to the data source API is
>> not good for maintenance.
>> 2. Data sources may need to preprocess the input data before writing,
>> e.g., cluster/sort the input by some columns. It's better to do the
>> preprocessing in Spark than in each data source.
>> 3. Data sources need to take care of transactions themselves, which is
>> hard. And different data sources may come up with very similar approaches
>> to transactions, which leads to a lot of duplicated code.
>>
>> To solve these pain points, I'm proposing the data source v2 write
>> framework, which is very similar to the read framework, i.e.,
>> WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter.
>>
>> The Data Source V2 write path follows the existing FileCommitProtocol and
>> has task/job-level commit/abort, so that data sources can implement
>> transactions more easily.
>>
>> We can create a mix-in trait for DataSourceV2Writer to specify
>> requirements for the input data, like clustering and ordering.
>>
>> Spark provides a very simple protocol for users to connect to data
>> sources. A common way to write a DataFrame to a data source is
>> `df.write.format(...).option(...).mode(...).save()`.
>> Spark passes the options and save mode to the data source and schedules
>> the write job on the input data. The data source should take care of the
>> metadata, e.g., the JDBC data source can create the table if it doesn't
>> exist, or fail the job and ask users to create the table in the
>> corresponding database first. Data sources can define options that let
>> users pass metadata information like partitioning/bucketing.
>>
>> The vote will be up for the next 72 hours. Please reply with your vote:
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following technical
>> reasons.
>>
>> Thanks!
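The sketches I mentioned above, for anyone who wants a concrete picture.

First, the `MetadataSupport` mix-in. The trait name and the `create(options)` method come from Wenchen's mail; the parameter type and the rest below are my guess at a minimal shape, not what the PR actually does:

```scala
// Minimal sketch of the proposed mix-in. The trait name and the `create`
// method come from the proposal above; the parameter type is an assumption.
trait MetadataSupport {
  // Called by Spark inside DataFrameWriter.save() before any data is
  // written. A JDBC source, for example, would issue CREATE TABLE here
  // (or fail fast and ask the user to create the table first).
  def create(options: Map[String, String]): Unit
}
```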
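Second, the write-path chain (WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter) with the task/job-level commit/abort. The real interfaces are in https://github.com/apache/spark/pull/19269; the signatures below are simplified guesses, meant only to show how the pieces relate:

```scala
// Placeholder for whatever a task reports back on commit; an assumption.
trait WriterCommitMessage extends Serializable

trait WriteSupport {
  // Driver side: Spark asks the data source for a writer for this job.
  def createWriter(options: Map[String, String]): DataSourceV2Writer
}

trait DataSourceV2Writer {
  // Driver side: the factory is serialized and shipped to executors.
  def createWriterFactory(): DataWriterFactory
  // Job-level commit/abort, mirroring FileCommitProtocol: called with the
  // commit messages collected from all successfully committed tasks.
  def commit(messages: Seq[WriterCommitMessage]): Unit
  def abort(messages: Seq[WriterCommitMessage]): Unit
}

trait DataWriterFactory extends Serializable {
  // Executor side: one writer per task attempt.
  def createWriter(partitionId: Int, attemptNumber: Int): DataWriter
}

trait DataWriter {
  def write(row: Seq[Any]): Unit // Seq[Any] stands in for Spark's row type
  // Task-level commit: the message is sent back to the driver, which calls
  // DataSourceV2Writer.commit only after all tasks have committed.
  def commit(): WriterCommitMessage
  def abort(): Unit
}
```

The nice property of this shape is that the two-level commit is what lets Spark, not each data source, own the transaction protocol, which addresses pain point 3 above.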
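Third, the mix-in for declaring input-data requirements (clustering and ordering). The mail only mentions the idea, so the names below are purely illustrative:

```scala
// Purely illustrative names: a DataSourceV2Writer could mix this in to
// tell Spark how to pre-shuffle/pre-sort the input before writing.
trait SupportsInputRequirements {
  // Spark would cluster the input by these columns, so that rows with the
  // same values end up in the same task.
  def requiredClustering: Seq[String]
  // ...and sort the rows within each task by these columns.
  def requiredOrdering: Seq[String]
}
```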
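Finally, the user-facing protocol, which this proposal doesn't change. The format name and option below are made up; the point is that the options and save mode are forwarded to the source, which owns the metadata handling:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("v2-write-demo").getOrCreate()
val df = spark.range(10).toDF("id")

// "mydatasource" and the "table" option are hypothetical. Spark passes
// them, plus the save mode, to the data source; under this proposal the
// source decides whether to create the table or fail the job.
df.write
  .format("mydatasource")
  .option("table", "demo")
  .mode(SaveMode.ErrorIfExists)
  .save()
```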