Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

Cheng Lian Sun, 15 Oct 2017 23:44:04 -0700

+1


On 10/12/17 20:10, Liwei Lin wrote:

+1 !

Cheers,
Liwei

On Thu, Oct 12, 2017 at 7:11 PM, vaquar khan <vaquar.k...@gmail.com<mailto:vaquar.k...@gmail.com>> wrote:


    +1

    Regards,
    Vaquar khan

    On Oct 11, 2017 10:14 PM, "Weichen Xu" <weichen...@databricks.com
    <mailto:weichen...@databricks.com>> wrote:

        +1

        On Thu, Oct 12, 2017 at 10:36 AM, Xiao Li
        <gatorsm...@gmail.com <mailto:gatorsm...@gmail.com>> wrote:

            +1

            Xiao

            On Mon, 9 Oct 2017 at 7:31 PM Reynold Xin
            <r...@databricks.com <mailto:r...@databricks.com>> wrote:

                +1

                One thing with MetadataSupport - It's a bad idea to
                call it that unless adding new functions in that trait
                wouldn't break source/binary compatibility in the future.


                On Mon, Oct 9, 2017 at 6:07 PM, Wenchen Fan
                <cloud0...@gmail.com <mailto:cloud0...@gmail.com>> wrote:

                    I'm adding my own +1 (binding).

                    On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan
                    <cloud0...@gmail.com <mailto:cloud0...@gmail.com>>
                    wrote:

                        I'm going to update the proposal: for the last
                        point, although the user-facing API
                        (`df.write.format(...).option(...).mode(...).save()`)
                        mixes data and metadata operations, we are
                        still able to separate them in the data source
                        write API. We can have a mix-in trait
                        `MetadataSupport` which has a method
                        `create(options)`, so that data sources can
                        mix in this trait and provide metadata
                        creation support. Spark will call this
                        `create` method inside `DataFrameWriter.save`
                        if the specified data source has it.

                        Note that file format data sources can ignore
                        this new trait and still write data without
                        metadata (they don't have metadata anyway).
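
A minimal sketch of the mix-in idea described above, written in Java with an interface standing in for the trait. Only the names `MetadataSupport` and `create(options)` come from the proposal; `DataSource`, `TableSource`, `FileSource`, and `SaveSketch` are hypothetical stand-ins for illustration and are not Spark classes.

```java
import java.util.Map;

// Hypothetical stand-in for a data source; not a Spark interface.
interface DataSource {
    void write(Map<String, String> options, Iterable<String> rows);
}

// Optional mix-in from the proposal: only sources that manage metadata implement it.
interface MetadataSupport {
    void create(Map<String, String> options);
}

// A table-like source that has metadata to create before the write.
class TableSource implements DataSource, MetadataSupport {
    boolean created = false;
    int rowsWritten = 0;

    @Override
    public void create(Map<String, String> options) {
        created = true;
    }

    @Override
    public void write(Map<String, String> options, Iterable<String> rows) {
        for (String ignored : rows) rowsWritten++;
    }
}

// A file-format-like source: no metadata, so it skips the mix-in entirely.
class FileSource implements DataSource {
    int rowsWritten = 0;

    @Override
    public void write(Map<String, String> options, Iterable<String> rows) {
        for (String ignored : rows) rowsWritten++;
    }
}

// Hypothetical stand-in for the check that would live inside DataFrameWriter.save.
class SaveSketch {
    static void save(DataSource source, Map<String, String> options, Iterable<String> rows) {
        // Call create(...) only if the source opted in to metadata support.
        if (source instanceof MetadataSupport) {
            ((MetadataSupport) source).create(options);
        }
        source.write(options, rows);
    }
}
```

With this shape, a source that does not mix in `MetadataSupport` goes straight to the data write, which matches the note about file format data sources.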

                        With this updated proposal, I'm calling a new
                        vote for the data source v2 write path.

                        The vote will be up for the next 72 hours.
                        Please reply with your vote:

                        +1: Yeah, let's go forward and implement the SPIP.
                        +0: Don't really care.
                        -1: I don't think this is a good idea because
                        of the following technical reasons.

                        Thanks!

                        On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan
                        <cloud0...@gmail.com
                        <mailto:cloud0...@gmail.com>> wrote:

                            Hi all,

                            Now that we have merged the infrastructure
                            for the data source v2 read path and had
                            some discussion about the write path, I'm
                            sending this email to call a vote on the
                            Data Source v2 write path.

                            The full document of the Data Source API
                            V2 is:
                            https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
                            <https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit>

                            The ready-for-review PR that implements
                            the basic infrastructure for the write path:
                            https://github.com/apache/spark/pull/19269
                            <https://github.com/apache/spark/pull/19269>


                            The Data Source V1 write path asks
                            implementations to write a DataFrame
                            directly, which is painful:
                            1. Exposing an upper-level API like
                            DataFrame to the Data Source API is not
                            good for maintenance.
                            2. Data sources may need to preprocess the
                            input data before writing, like
                            cluster/sort the input by some columns.
                            It's better to do the preprocessing in
                            Spark instead of in the data source.
                            3. Data sources need to take care of
                            transactions themselves, which is hard.
                            Different data sources may also come up
                            with very similar approaches to
                            transactions, which leads to a lot of
                            duplicated code.

                            To solve these pain points, I'm proposing
                            the data source v2 writing framework which
                            is very similar to the reading framework,
                            i.e., WriteSupport -> DataSourceV2Writer
                            -> DataWriterFactory -> DataWriter.

                            The Data Source V2 write path follows the
                            existing FileCommitProtocol and has
                            task/job level commit/abort, so that data
                            sources can implement transactions more
                            easily.
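
The chain above, together with task/job level commit/abort, can be sketched roughly as follows. The four interface names come from the proposal; every method signature here is a simplified assumption for illustration and does not claim to match the PR. `InMemorySink`, `StagingWriter`, and `WriteJobSketch` are hypothetical, and the tasks run sequentially in one process rather than on executors.

```java
import java.util.ArrayList;
import java.util.List;

// Names from the proposal; method signatures are simplified assumptions.
interface DataWriter<T> {
    void write(T record);
    List<T> commit();   // task-level commit: returns a "commit message" (here, the staged rows)
    void abort();       // task-level abort: discards the staged rows
}

interface DataWriterFactory<T> {
    DataWriter<T> createWriter(int partitionId);
}

interface DataSourceV2Writer<T> {
    DataWriterFactory<T> createWriterFactory();
    void commit(List<List<T>> messages);  // job-level commit: publishes all task output
    void abort();                         // job-level abort: publishes nothing
}

interface WriteSupport<T> {
    DataSourceV2Writer<T> createWriter();
}

// Hypothetical task writer that stages rows and rejects null to simulate a task failure.
class StagingWriter implements DataWriter<String> {
    private final List<String> staged = new ArrayList<>();

    public void write(String record) {
        if (record == null) throw new IllegalArgumentException("null record");
        staged.add(record);
    }

    public List<String> commit() { return new ArrayList<>(staged); }

    public void abort() { staged.clear(); }
}

// Hypothetical sink: rows become visible only after the job-level commit.
class InMemorySink implements WriteSupport<String> {
    final List<String> committed = new ArrayList<>();

    public DataSourceV2Writer<String> createWriter() {
        return new DataSourceV2Writer<String>() {
            public DataWriterFactory<String> createWriterFactory() {
                return partitionId -> new StagingWriter();
            }

            public void commit(List<List<String>> messages) {
                for (List<String> message : messages) committed.addAll(message);
            }

            public void abort() { /* nothing was published, so nothing to undo */ }
        };
    }
}

// Stand-in for Spark's side: one task per partition, job commit only if every task commits.
class WriteJobSketch {
    static boolean run(WriteSupport<String> source, List<List<String>> partitions) {
        DataSourceV2Writer<String> writer = source.createWriter();
        DataWriterFactory<String> factory = writer.createWriterFactory();
        List<List<String>> messages = new ArrayList<>();
        try {
            for (int i = 0; i < partitions.size(); i++) {
                DataWriter<String> task = factory.createWriter(i);
                try {
                    for (String row : partitions.get(i)) task.write(row);
                    messages.add(task.commit());
                } catch (RuntimeException e) {
                    task.abort();
                    throw e;
                }
            }
            writer.commit(messages);
            return true;
        } catch (RuntimeException e) {
            writer.abort();
            return false;
        }
    }
}
```

In this sketch the data source only supplies the staging and publishing steps, while the driver loop decides when to commit or abort, which is the division of labour the proposal is after.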

                            We can create a mix-in trait for
                            DataSourceV2Writer to specify the
                            requirement for input data, like
                            clustering and ordering.
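
One possible shape for such a requirement mix-in, shown for ordering only. All names here (`RequiresOrdering`, `SimpleWriter`, `SortedSink`, `PreprocessSketch`) are hypothetical, and the sort is done on an in-memory list as a stand-in for Spark preprocessing the input before the write.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical mix-in: a writer declares the order it needs its input in.
interface RequiresOrdering<T> {
    Comparator<T> requiredOrdering();
}

// Hypothetical, heavily simplified writer interface.
interface SimpleWriter<T> {
    void write(List<T> rows);
}

// A sink that asks for its input in natural order and records what it receives.
class SortedSink implements SimpleWriter<String>, RequiresOrdering<String> {
    final List<String> received = new ArrayList<>();

    public Comparator<String> requiredOrdering() {
        return Comparator.naturalOrder();
    }

    public void write(List<String> rows) {
        received.addAll(rows);
    }
}

// Stand-in for Spark: sort the input first if the writer asks for it.
class PreprocessSketch {
    @SuppressWarnings("unchecked")
    static <T> void save(SimpleWriter<T> writer, List<T> rows) {
        List<T> input = new ArrayList<>(rows);  // copy so the caller's list is untouched
        if (writer instanceof RequiresOrdering) {
            input.sort(((RequiresOrdering<T>) writer).requiredOrdering());
        }
        writer.write(input);
    }
}
```

The point of the mix-in is that the sink never sorts anything itself; it only states the requirement, and the preprocessing happens before `write` is called.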

                            Spark provides a very simple protocol for
                            users to connect to data sources. A common
                            way to write a DataFrame to data sources:
                            `df.write.format(...).option(...).mode(...).save()`.
                            Spark passes the options and save mode to
                            data sources, and schedules the write job
                            on the input data. And the data source
                            should take care of the metadata, e.g.,
                            the JDBC data source can create the table
                            if it doesn't exist, or fail the job and
                            ask users to create the table in the
                            corresponding database first. Data sources
                            can define some options for users to carry
                            some metadata information like
                            partitioning/bucketing.


                            The vote will be up for the next 72 hours.
                            Please reply with your vote:

                            +1: Yeah, let's go forward and implement
                            the SPIP.
                            +0: Don't really care.
                            -1: I don't think this is a good idea
                            because of the following technical reasons.

                            Thanks!
