+1
On 10/12/17 20:10, Liwei Lin wrote:
+1 !
Cheers,
Liwei
On Thu, Oct 12, 2017 at 7:11 PM, vaquar khan <vaquar.k...@gmail.com> wrote:
+1
Regards,
Vaquar khan
On Oct 11, 2017 10:14 PM, "Weichen Xu" <weichen...@databricks.com> wrote:
+1
On Thu, Oct 12, 2017 at 10:36 AM, Xiao Li <gatorsm...@gmail.com> wrote:
+1
Xiao
On Mon, 9 Oct 2017 at 7:31 PM Reynold Xin <r...@databricks.com> wrote:
+1
One thing with MetadataSupport - It's a bad idea to
call it that unless adding new functions in that trait
wouldn't break source/binary compatibility in the future.
On Mon, Oct 9, 2017 at 6:07 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
I'm adding my own +1 (binding).
On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
I'm going to update the proposal: for the last
point, although the user-facing API
(`df.write.format(...).option(...).mode(...).save()`)
mixes data and metadata operations, we are
still able to separate them in the data source
write API. We can have a mix-in trait
`MetadataSupport` which has a method
`create(options)`, so that data sources can
mix in this trait and provide metadata
creation support. Spark will call this
`create` method inside `DataFrameWriter.save`
if the specified data source has it.
Note that file format data sources can ignore
this new trait and still write data without
metadata (they don't have metadata anyway).
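A minimal sketch of the optional mix-in described above, in plain Java with no Spark dependency. Only `MetadataSupport` and `create(options)` come from the proposal, and their signatures here are assumptions. `WritableSource`, `TableSource`, `FileSource` and the `save` helper are hypothetical names used purely for illustration; `save` stands in for `DataFrameWriter.save`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public final class MetadataSupportSketch {
    /** Hypothetical stand-in for the write side of a data source. */
    interface WritableSource {
        void write(List<String> rows, Map<String, String> options);
    }

    /** Optional mix-in from the proposal; the signature here is an assumption. */
    interface MetadataSupport {
        void create(Map<String, String> options);
    }

    /** JDBC-like source (hypothetical): creates its table before data arrives. */
    static final class TableSource implements WritableSource, MetadataSupport {
        final List<String> log = new ArrayList<>();

        @Override
        public void create(Map<String, String> options) {
            log.add("create " + options.get("table"));
        }

        @Override
        public void write(List<String> rows, Map<String, String> options) {
            log.add("write " + rows.size() + " rows");
        }
    }

    /** File-like source (hypothetical): has no metadata, so it skips the mix-in. */
    static final class FileSource implements WritableSource {
        final List<String> log = new ArrayList<>();

        @Override
        public void write(List<String> rows, Map<String, String> options) {
            log.add("write " + rows.size() + " rows");
        }
    }

    /** Plays the role of DataFrameWriter.save: call create only if it exists. */
    static void save(WritableSource source, List<String> rows, Map<String, String> options) {
        if (source instanceof MetadataSupport) {
            ((MetadataSupport) source).create(options);
        }
        source.write(rows, options);
    }

    public static void main(String[] args) {
        Map<String, String> options = Collections.singletonMap("table", "events");
        List<String> rows = Arrays.asList("a", "b");

        TableSource table = new TableSource();
        save(table, rows, options);
        System.out.println(table.log); // create step, then write step

        FileSource file = new FileSource();
        save(file, rows, options);
        System.out.println(file.log); // write step only
    }
}
```

The `instanceof` check is what keeps the trait optional: a source that does not implement it takes the plain write path unchanged.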
With this updated proposal, I'm calling a new
vote for the data source v2 write path.
The vote will be up for the next 72 hours.
Please reply with your vote:
+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because
of the following technical reasons.
Thanks!
On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
Hi all,
Now that we have merged the infrastructure
for the Data Source V2 read path and had
some discussion about the write path, I'm
sending this email to call a vote on the
Data Source V2 write path.
The full document of the Data Source API
V2 is:
https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
The ready-for-review PR that implements
the basic infrastructure for the write path:
https://github.com/apache/spark/pull/19269
The Data Source V1 write path asks
implementations to write a DataFrame
directly, which is painful:
1. Exposing an upper-level API like
DataFrame to the Data Source API is bad
for maintenance.
2. Data sources may need to preprocess the
input data before writing, e.g. clustering
or sorting the input by some columns.
It's better to do the preprocessing in
Spark instead of in the data source.
3. Data sources need to take care of
transactions themselves, which is hard.
Different data sources may also come up
with very similar approaches to
transactions, which leads to a lot of
duplicated code.
To solve these pain points, I'm proposing
the data source v2 writing framework which
is very similar to the reading framework,
i.e., WriteSupport -> DataSourceV2Writer
-> DataWriterFactory -> DataWriter.
The Data Source V2 write path follows the
existing FileCommitProtocol and has
task/job level commit/abort, so that data
sources can implement transactions more
easily.
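To make the chain and the task/job level commit/abort concrete, here is a rough sketch in plain Java with no Spark dependency. The four interface names come from the proposal, but every method signature below is a simplified assumption (commit messages are plain strings, rows are strings) and not the API in the PR. `InMemoryWriter` and `runJob` are hypothetical, and `runJob` runs the tasks sequentially where Spark would schedule them in parallel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public final class WritePathSketch {
    // Simplified, assumed signatures; the real interfaces live in the PR.
    interface WriteSupport { DataSourceV2Writer createWriter(); }

    interface DataSourceV2Writer {
        DataWriterFactory createWriterFactory();
        void commit(List<String> messages); // job-level commit
        void abort(List<String> messages);  // job-level abort
    }

    interface DataWriterFactory { DataWriter createWriter(int partitionId); }

    interface DataWriter {
        void write(String row);
        String commit(); // task-level commit, returns a commit message
        void abort();    // task-level abort
    }

    /** Hypothetical source: rows become visible only after the job commits. */
    static final class InMemoryWriter implements DataSourceV2Writer {
        final List<String> staged = new ArrayList<>();
        final List<String> visible = new ArrayList<>();

        @Override
        public DataWriterFactory createWriterFactory() {
            return partitionId -> new DataWriter() {
                private final List<String> buffer = new ArrayList<>();

                @Override public void write(String row) {
                    // An empty row simulates a task failure.
                    if (row.isEmpty()) throw new IllegalArgumentException("empty row");
                    buffer.add(row);
                }

                @Override public String commit() {
                    staged.addAll(buffer);
                    return "partition " + partitionId + ": " + buffer.size() + " rows";
                }

                @Override public void abort() { buffer.clear(); }
            };
        }

        @Override public void commit(List<String> messages) {
            visible.addAll(staged);
            staged.clear();
        }

        @Override public void abort(List<String> messages) { staged.clear(); }
    }

    /** Stand-in for the engine: one task per partition, then one job commit. */
    static boolean runJob(DataSourceV2Writer writer, List<List<String>> partitions) {
        DataWriterFactory factory = writer.createWriterFactory();
        List<String> messages = new ArrayList<>();
        try {
            for (int p = 0; p < partitions.size(); p++) {
                DataWriter task = factory.createWriter(p);
                try {
                    for (String row : partitions.get(p)) task.write(row);
                    messages.add(task.commit());
                } catch (RuntimeException e) {
                    task.abort(); // the failed task cleans up its own output
                    throw e;
                }
            }
            writer.commit(messages); // every task committed: publish the job
            return true;
        } catch (RuntimeException e) {
            writer.abort(messages);  // any failure: discard committed task output
            return false;
        }
    }

    public static void main(String[] args) {
        InMemoryWriter writer = new InMemoryWriter();
        WriteSupport source = () -> writer;
        boolean ok = runJob(source.createWriter(),
                Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c")));
        System.out.println(ok + " " + writer.visible);
    }
}
```

The point of the two levels is that a task's output stays staged until the job as a whole commits, so a failure in any one task leaves nothing visible.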
We can create a mix-in trait for
DataSourceV2Writer to specify the
requirement for input data, like
clustering and ordering.
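One way such a requirement mix-in could look, sketched in plain Java with no Spark dependency. The proposal does not name these traits, so `RequiresClustering`, `RequiresOrdering`, `PrefixWriter` and `prepare` are all hypothetical; `prepare` stands in for the preprocessing Spark would do before handing rows to the writer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public final class InputRequirementSketch {
    /** Hypothetical mix-in: rows with the same key must end up together. */
    interface RequiresClustering { Function<String, String> clusterKey(); }

    /** Hypothetical mix-in: rows within each cluster must arrive sorted. */
    interface RequiresOrdering { Comparator<String> ordering(); }

    /** Hypothetical writer that wants input clustered by first letter, sorted. */
    static final class PrefixWriter implements RequiresClustering, RequiresOrdering {
        @Override public Function<String, String> clusterKey() {
            return row -> row.substring(0, 1);
        }

        @Override public Comparator<String> ordering() {
            return Comparator.naturalOrder();
        }
    }

    /** Engine side: apply only the requirements the writer declares. */
    static Map<String, List<String>> prepare(Object writer, List<String> rows) {
        Function<String, String> key = row -> ""; // default: a single cluster
        if (writer instanceof RequiresClustering) {
            key = ((RequiresClustering) writer).clusterKey();
        }
        Map<String, List<String>> clusters = new TreeMap<>();
        for (String row : rows) {
            clusters.computeIfAbsent(key.apply(row), k -> new ArrayList<>()).add(row);
        }
        if (writer instanceof RequiresOrdering) {
            Comparator<String> order = ((RequiresOrdering) writer).ordering();
            for (List<String> cluster : clusters.values()) {
                cluster.sort(order);
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("b2", "a2", "b1", "a1");
        System.out.println(prepare(new PrefixWriter(), rows));
        System.out.println(prepare(new Object(), rows)); // no requirements declared
    }
}
```

With this shape the data source only declares what it needs, and the clustering and sorting happen on the engine side, which matches pain point 2 above.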
Spark provides a very simple protocol for
users to connect to data sources. A common
way to write a DataFrame to a data source is
`df.write.format(...).option(...).mode(...).save()`.
Spark passes the options and save mode to
data sources, and schedules the write job
on the input data. And the data source
should take care of the metadata, e.g.,
the JDBC data source can create the table
if it doesn't exist, or fail the job and
ask users to create the table in the
corresponding database first. Data sources
can define some options for users to carry
some metadata information like
partitioning/bucketing.
The vote will be up for the next 72 hours.
Please reply with your vote:
+1: Yeah, let's go forward and implement
the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea
because of the following technical reasons.
Thanks!