[GitHub] spark pull request #21145: SPARK-24073: Rename DataReaderFactory to ReadTask...

rdblue Tue, 24 Apr 2018 12:59:09 -0700

GitHub user rdblue opened a pull request:

    https://github.com/apache/spark/pull/21145


    SPARK-24073: Rename DataReaderFactory to ReadTask.

    
    ## What changes were proposed in this pull request?
    
    This reverses the changes in SPARK-23219, which renamed ReadTask to
    DataReaderFactory. The intent of that change was to make the read and
    write API match (write side uses DataWriterFactory), but the underlying
    problem is that the two classes are not equivalent.
    
    ReadTask and DataReader function as an Iterable and Iterator pair. Each
    ReadTask is the read task for one specific partition of the data, in
    contrast to DataWriterFactory, where the same factory instance is used
    in all write tasks. ReadTask's purpose is to manage the lifecycle of a
    DataReader, with an explicit create operation to mirror the close
    operation. The DataReaderFactory name obscures this: it appears more
    generic than it is, and it does not explain why a read produces a set
    of them.
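
The Iterable/Iterator relationship described above can be sketched roughly as follows. This is a minimal illustration, not the Spark API: the interfaces below are simplified stand-ins that borrow the names ReadTask and DataReader, and `ListReadTask`, `readAll`, and `ReadTaskSketch` are hypothetical helpers introduced only for this example.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified stand-ins for illustration only; NOT the real Spark interfaces.
final class ReadTaskSketch {

    /** Iterator-like: walks one partition's records, then is closed. */
    interface DataReader<T> extends Closeable {
        boolean next();

        T get();

        @Override
        void close(); // narrowed: no checked exception in this sketch
    }

    /** Iterable-like: one instance per partition; creates that partition's reader. */
    interface ReadTask<T> {
        DataReader<T> createDataReader();
    }

    /** Hypothetical in-memory task bound to a single partition's rows. */
    static final class ListReadTask<T> implements ReadTask<T> {
        private final List<T> partitionRows;

        ListReadTask(List<T> partitionRows) {
            this.partitionRows = partitionRows;
        }

        @Override
        public DataReader<T> createDataReader() {
            final Iterator<T> rows = partitionRows.iterator();
            return new DataReader<T>() {
                private T current;

                @Override
                public boolean next() {
                    if (!rows.hasNext()) {
                        return false;
                    }
                    current = rows.next();
                    return true;
                }

                @Override
                public T get() {
                    return current;
                }

                @Override
                public void close() {
                    // Nothing to release for an in-memory list.
                }
            };
        }
    }

    /** Runs each task in turn; create and close bracket the reader's lifetime. */
    static <T> List<T> readAll(List<ReadTask<T>> tasks) {
        List<T> out = new ArrayList<>();
        for (ReadTask<T> task : tasks) {
            try (DataReader<T> reader = task.createDataReader()) {
                while (reader.next()) {
                    out.add(reader.get());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<ReadTask<String>> tasks = new ArrayList<>();
        tasks.add(new ListReadTask<>(Arrays.asList("a", "b")));
        tasks.add(new ListReadTask<>(Arrays.asList("c")));
        System.out.println(readAll(tasks));
    }
}
```

In this sketch each task carries the state for exactly one partition, which is why a read yields a set of tasks. A writer factory in the same style would instead be a single shared instance that every write task calls.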
    
    ## How was this patch tested?
    
    Existing tests, which have been updated to use the new name.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark SPARK-24073-revert-data-reader-factory-rename

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21145.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21145
    
----
commit c364c05d3141bbe0ed29a2b02cecfa541d9c8212
Author: Ryan Blue <blue@...>
Date:   2018-04-24T19:55:25Z

    SPARK-24073: Rename DataReaderFactory to ReadTask.
    

----

