GitHub user rdblue opened a pull request:
https://github.com/apache/spark/pull/21145
SPARK-24073: Rename DataReaderFactory to ReadTask.
## What changes were proposed in this pull request?
This reverts the change in SPARK-23219, which renamed ReadTask to
DataReaderFactory. That change was intended to make the read and write
APIs match (the write side uses DataWriterFactory), but the underlying
problem is that the two classes are not equivalent.
ReadTask and DataReader function like Iterable and Iterator. Each ReadTask
is a specific read task for one partition of the data to be read, whereas
the same DataWriterFactory instance is used in all write tasks. ReadTask's
purpose is to manage the lifecycle of a DataReader, with an explicit create
operation to mirror the close operation. The DataReaderFactory name obscures
this: it makes the class appear more generic than it is, and it does not
explain why a set of them is produced for a read.
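The Iterable/Iterator analogy can be sketched roughly as follows. This is a minimal illustration, not the actual Spark API: the `ReadTask` and `DataReader` interfaces below are simplified stand-ins, and `ReadTaskSketch`, `RangeReadTask`, `planTasks`, and `readAll` are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for illustration only; not the real Spark interfaces.
class ReadTaskSketch {
    /** Iterator-like: walks the rows of one partition and must be closed. */
    interface DataReader<T> extends AutoCloseable {
        boolean next();
        T get();
        @Override
        void close();
    }

    /** Iterable-like: describes one partition and creates a reader for it. */
    interface ReadTask<T> {
        DataReader<T> createDataReader();
    }

    /** Hypothetical task covering the integers in [start, end). */
    static final class RangeReadTask implements ReadTask<Integer> {
        private final int start;
        private final int end;

        RangeReadTask(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public DataReader<Integer> createDataReader() {
            return new DataReader<Integer>() {
                // Positioned before the first row until next() is called.
                private int current = start - 1;

                @Override
                public boolean next() {
                    if (current + 1 < end) {
                        current++;
                        return true;
                    }
                    return false;
                }

                @Override
                public Integer get() {
                    return current;
                }

                @Override
                public void close() {
                    // A real reader would release files or connections here.
                }
            };
        }
    }

    /** A read is planned as a set of tasks, one per partition. */
    static List<ReadTask<Integer>> planTasks(int total, int partitions) {
        List<ReadTask<Integer>> tasks = new ArrayList<>();
        int size = (total + partitions - 1) / partitions;
        for (int start = 0; start < total; start += size) {
            tasks.add(new RangeReadTask(start, Math.min(start + size, total)));
        }
        return tasks;
    }

    /** Each task's create call is paired with a close on the reader. */
    static List<Integer> readAll(List<ReadTask<Integer>> tasks) {
        List<Integer> rows = new ArrayList<>();
        for (ReadTask<Integer> task : tasks) {
            try (DataReader<Integer> reader = task.createDataReader()) {
                while (reader.next()) {
                    rows.add(reader.get());
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<ReadTask<Integer>> tasks = planTasks(10, 3);
        System.out.println(tasks.size() + " tasks, rows " + readAll(tasks));
    }
}
```

In this sketch each task carries its own partition bounds, which is why a read yields a set of them. A writer factory, by contrast, would be a single object handed to every write task.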
## How was this patch tested?
Existing tests, which have been updated to use the new name.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rdblue/spark SPARK-24073-revert-data-reader-factory-rename
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21145.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21145
----
commit c364c05d3141bbe0ed29a2b02cecfa541d9c8212
Author: Ryan Blue <blue@...>
Date: 2018-04-24T19:55:25Z
SPARK-24073: Rename DataReaderFactory to ReadTask.
----