Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-16 Thread Wenchen Fan
This vote passes with 3 binding +1 votes, 5 non-binding +1 votes, and no -1
votes.

Thanks all!

+1 votes (binding):
Wenchen Fan
Reynold Xin
Cheng Lian


+1 votes (non-binding):
Xiao Li
Weichen Xu
Vaquar khan
Liwei Lin
Dongjoon Hyun


On Tue, Oct 17, 2017 at 12:30 AM, Dongjoon Hyun wrote:

> +1

Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-16 Thread Cheng Lian

+1



Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-12 Thread Liwei Lin
+1 !

Cheers,
Liwei


Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-12 Thread vaquar khan
+1

Regards,
Vaquar khan


Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-11 Thread Weichen Xu
+1


Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-11 Thread Xiao Li
+1

Xiao

Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-09 Thread Reynold Xin
+1

One thing with MetadataSupport: it's a bad idea to call it that unless
adding new functions to that trait wouldn't break source/binary
compatibility in the future.
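
(Editorial illustration of the compatibility concern, not part of the original
message: whether a trait like this can grow safely depends on whether new
methods get default bodies. The trait body below is an assumption, not the
proposed API.)

    trait MetadataSupport {
      def create(options: Map[String, String]): Unit
    }

    // Adding another *abstract* method later, e.g.
    //   def drop(options: Map[String, String]): Unit
    // breaks every existing implementation at compile time, and sources
    // compiled against the old trait can fail at runtime with
    // AbstractMethodError. Only a method with a default body, e.g.
    //   def drop(options: Map[String, String]): Unit = ()
    // leaves old implementations working, hence the concern about giving the
    // trait a name as broad as "MetadataSupport".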



Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-09 Thread Wenchen Fan
I'm adding my own +1 (binding).



Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-09 Thread Wenchen Fan
I'm going to update the proposal: for the last point, although the
user-facing API (`df.write.format(...).option(...).mode(...).save()`) mixes
data and metadata operations, we are still able to separate them in the
data source write API. We can have a mix-in trait `MetadataSupport` which
has a method `create(options)`, so that data sources can mix in this trait
and provide metadata creation support. Spark will call this `create` method
inside `DataFrameWriter.save` if the specified data source has it.

Note that file format data sources can ignore this new trait and still
write data without metadata (they don't have metadata anyway).
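
(For concreteness, a minimal sketch of what such a trait could look like; the
trait and method names come from the proposal, while everything else here,
including the parameter type, is an assumption rather than the final API.)

    trait MetadataSupport {
      // Called by `DataFrameWriter.save` before the write job is scheduled,
      // so the data source can create its metadata, e.g. a JDBC table.
      def create(options: Map[String, String]): Unit
    }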

With this updated proposal, I'm calling a new vote for the data source v2
write path.

The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical
reasons.

Thanks!

On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan wrote:

> Hi all,
>
> Now that we have merged the infrastructure of the data source v2 read path
> and had some discussion about the write path, I'm sending this email to call
> a vote for the Data Source v2 write path.
>
> The full document of the Data Source API V2 is:
> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-
> Z8qU5Frf6WMQZ6jJVM/edit
>
> The ready-for-review PR that implements the basic infrastructure for the
> write path:
> https://github.com/apache/spark/pull/19269
>
>
> The Data Source V1 write path asks implementations to write a DataFrame
> directly, which is painful:
> 1. Exposing an upper-level API like DataFrame to the Data Source API is not
> good for maintenance.
> 2. Data sources may need to preprocess the input data before writing, for
> example clustering/sorting the input by some columns. It's better to do the
> preprocessing in Spark instead of in the data source.
> 3. Data sources need to take care of transactions themselves, which is
> hard. And different data sources may come up with very similar approaches
> to transactions, which leads to a lot of duplicated code.
>
> To solve these pain points, I'm proposing a data source v2 write framework
> that is very similar to the read framework, i.e.,
> WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter.
>
> The Data Source V2 write path follows the existing FileCommitProtocol and
> has task/job-level commit/abort, so that data sources can implement
> transactions more easily.
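
(Illustrative sketch only, added alongside the proposal text: one way the four
interfaces and the task/job-level commit/abort could fit together in Scala.
The interface names are from the proposal; every signature below is an
assumption, not the API being voted on.)

    trait WriteSupport {
      // Driver side: asked to create a writer for a job, given the save mode
      // and user options; a source may refuse the write by returning None.
      def createWriter(
          mode: org.apache.spark.sql.SaveMode,
          options: Map[String, String]): Option[DataSourceV2Writer]
    }

    trait DataSourceV2Writer {
      // Produces a serializable factory that Spark ships to executors.
      def createWriterFactory(): DataWriterFactory
      // Job-level commit/abort, called on the driver once all tasks finish,
      // mirroring FileCommitProtocol.commitJob / abortJob.
      def commit(messages: Seq[WriterCommitMessage]): Unit
      def abort(messages: Seq[WriterCommitMessage]): Unit
    }

    trait DataWriterFactory extends Serializable {
      // Executor side: one writer per task/partition attempt.
      def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter
    }

    trait DataWriter {
      def write(row: org.apache.spark.sql.Row): Unit
      // Task-level commit/abort, mirroring FileCommitProtocol.commitTask / abortTask.
      def commit(): WriterCommitMessage
      def abort(): Unit
    }

    // Marker for whatever a committed task reports back to the driver (assumed).
    trait WriterCommitMessage extends Serializable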
>
> We can create a mix-in trait for DataSourceV2Writer to specify requirements
> on the input data, like clustering and ordering.
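
(Again purely illustrative and built on the sketch above; the trait name and
both methods are assumptions.)

    trait SupportsWriteRequirements { self: DataSourceV2Writer =>
      // Columns Spark should cluster the input by before invoking the writer.
      def requiredClustering: Seq[String] = Nil
      // Per-partition sort columns Spark should apply to the input.
      def requiredOrdering: Seq[String] = Nil
    }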
>
> Spark provides a very simple protocol for users to connect to data sources.
> A common way to write a dataframe to a data source is
> `df.write.format(...).option(...).mode(...).save()`.
> Spark passes the options and save mode to the data source and schedules the
> write job on the input data. The data source should take care of the
> metadata, e.g., the JDBC data source can create the table if it doesn't
> exist, or fail the job and ask users to create the table in the
> corresponding database first. Data sources can define options that let users
> pass metadata information like partitioning/bucketing.
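
(A usage example of the user-facing protocol described above, assuming `spark`
is an existing SparkSession. `format`, `option`, `mode`, and `save` are the
real DataFrameWriter calls; the format name and option keys are hypothetical
things a data source might define.)

    val df = spark.range(100).toDF("id")            // any DataFrame
    df.write
      .format("com.example.jdbc")                   // hypothetical v2 data source
      .option("url", "jdbc:postgresql://host/db")   // connection metadata
      .option("table", "events")                    // table to create or append to
      .option("partitionColumns", "id")             // metadata option the source may define
      .mode("append")                               // save mode passed to the source
      .save()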
>
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Thanks!
>