I'm adding my own +1 (binding). (For anyone skimming the thread, I've appended a few rough sketches of how I read the proposed interfaces at the bottom of this mail.)

On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
> I'm going to update the proposal: for the last point, although the
> user-facing API (`df.write.format(...).option(...).mode(...).save()`)
> mixes data and metadata operations, we are still able to separate them in
> the data source write API. We can have a mix-in trait `MetadataSupport`
> which has a method `create(options)`, so that data sources can mix in this
> trait and provide metadata creation support. Spark will call this `create`
> method inside `DataFrameWriter.save` if the specified data source has it.
>
> Note that file format data sources can ignore this new trait and still
> write data without metadata (they don't have metadata anyway).
>
> With this updated proposal, I'm calling a new vote for the data source v2
> write path.
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Thanks!
>
> On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Hi all,
>>
>> Now that we have merged the infrastructure of the data source v2 read
>> path and had some discussion about the write path, I'm sending this email
>> to call a vote for the Data Source v2 write path.
>>
>> The full document of the Data Source API V2 is:
>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>
>> The ready-for-review PR that implements the basic infrastructure for the
>> write path:
>> https://github.com/apache/spark/pull/19269
>>
>> The Data Source V1 write path asks implementations to write a DataFrame
>> directly, which is painful:
>> 1. Exposing an upper-level API like DataFrame to the data source API is
>> not good for maintenance.
>> 2. Data sources may need to preprocess the input data before writing,
>> e.g., cluster/sort the input by some columns. It's better to do the
>> preprocessing in Spark than in each data source.
>> 3. Data sources need to take care of transactions themselves, which is
>> hard. And different data sources may come up with very similar approaches
>> to transactions, which leads to a lot of duplicated code.
>>
>> To solve these pain points, I'm proposing the data source v2 write
>> framework, which is very similar to the read framework, i.e.,
>> WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter.
>>
>> The Data Source V2 write path follows the existing FileCommitProtocol and
>> has task/job-level commit/abort, so that data sources can implement
>> transactions more easily.
>>
>> We can create a mix-in trait for DataSourceV2Writer to specify
>> requirements for the input data, like clustering and ordering.
>>
>> Spark provides a very simple protocol for users to connect to data
>> sources. A common way to write a DataFrame to a data source is
>> `df.write.format(...).option(...).mode(...).save()`.
>> Spark passes the options and save mode to the data source and schedules
>> the write job on the input data. The data source should take care of the
>> metadata, e.g., the JDBC data source can create the table if it doesn't
>> exist, or fail the job and ask users to create the table in the
>> corresponding database first. Data sources can define options that let
>> users pass metadata information like partitioning/bucketing.
>>
>> The vote will be up for the next 72 hours. Please reply with your vote:
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following technical
>> reasons.
>>
>> Thanks!
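The sketches I mentioned above, for anyone who wants a concrete picture.

First, the `MetadataSupport` mix-in. The trait name and the `create(options)` method come from Wenchen's mail; the parameter type and the rest below are my guess at a minimal shape, not what the PR actually does:

```scala
// Minimal sketch of the proposed mix-in. The trait name and the `create`
// method come from the proposal above; the parameter type is an assumption.
trait MetadataSupport {
  // Called by Spark inside DataFrameWriter.save() before any data is
  // written. A JDBC source, for example, would issue CREATE TABLE here
  // (or fail fast and ask the user to create the table first).
  def create(options: Map[String, String]): Unit
}
```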
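Second, the write-path chain (WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter) with the task/job-level commit/abort. The real interfaces are in https://github.com/apache/spark/pull/19269; the signatures below are simplified guesses, meant only to show how the pieces relate:

```scala
// Placeholder for whatever a task reports back on commit; an assumption.
trait WriterCommitMessage extends Serializable

trait WriteSupport {
  // Driver side: Spark asks the data source for a writer for this job.
  def createWriter(options: Map[String, String]): DataSourceV2Writer
}

trait DataSourceV2Writer {
  // Driver side: the factory is serialized and shipped to executors.
  def createWriterFactory(): DataWriterFactory
  // Job-level commit/abort, mirroring FileCommitProtocol: called with the
  // commit messages collected from all successfully committed tasks.
  def commit(messages: Seq[WriterCommitMessage]): Unit
  def abort(messages: Seq[WriterCommitMessage]): Unit
}

trait DataWriterFactory extends Serializable {
  // Executor side: one writer per task attempt.
  def createWriter(partitionId: Int, attemptNumber: Int): DataWriter
}

trait DataWriter {
  def write(row: Seq[Any]): Unit // Seq[Any] stands in for Spark's row type
  // Task-level commit: the message is sent back to the driver, which calls
  // DataSourceV2Writer.commit only after all tasks have committed.
  def commit(): WriterCommitMessage
  def abort(): Unit
}
```

The nice property of this shape is that the two-level commit is what lets Spark, not each data source, own the transaction protocol, which addresses pain point 3 above.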
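Third, the mix-in for declaring input-data requirements (clustering and ordering). The mail only mentions the idea, so the names below are purely illustrative:

```scala
// Purely illustrative names: a DataSourceV2Writer could mix this in to
// tell Spark how to pre-shuffle/pre-sort the input before writing.
trait SupportsInputRequirements {
  // Spark would cluster the input by these columns, so that rows with the
  // same values end up in the same task.
  def requiredClustering: Seq[String]
  // ...and sort the rows within each task by these columns.
  def requiredOrdering: Seq[String]
}
```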
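Finally, the user-facing protocol, which this proposal doesn't change. The format name and option below are made up; the point is that the options and save mode are forwarded to the source, which owns the metadata handling:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("v2-write-demo").getOrCreate()
val df = spark.range(10).toDF("id")

// "mydatasource" and the "table" option are hypothetical. Spark passes
// them, plus the save mode, to the data source; under this proposal the
// source decides whether to create the table or fail the job.
df.write
  .format("mydatasource")
  .option("table", "demo")
  .mode(SaveMode.ErrorIfExists)
  .save()
```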