GitHub user rxin opened a pull request:
https://github.com/apache/spark/pull/15562
[SPARK-18021][SQL] Refactor file name specification for data sources
## What changes were proposed in this pull request?
Currently each data source OutputWriter is responsible for specifying the
entire file name for each output file. This is problematic because Spark SQL
relies on the file name for certain behaviors, e.g. the bucket id. The current
approach allows individual data sources to break the implementation of
bucketing.
We also don't want to move file name specification entirely out of the data
sources, because different data sources need to specify different extensions.
This patch breaks file name specification into two parts: the first part is
a prefix specified by the caller of OutputWriter (in WriteOutput), and the
second part is the suffix that can be specified by the OutputWriter itself.
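The prefix/suffix split described above can be illustrated with a minimal sketch. This is not the Spark implementation: the function `make_file_prefix`, the class `ParquetLikeWriter`, and the exact `part-r-...` layout are hypothetical names and formats chosen only to show which side owns which part of the file name.

```python
from typing import Optional


def make_file_prefix(staging_dir: str, split_id: int, job_uuid: str,
                     bucket_id: Optional[int] = None) -> str:
    """Caller-side half (hypothetical): builds everything except the extension.

    The bucket id, when present, is encoded here, so an individual data
    source cannot accidentally drop or change it.
    """
    bucket_part = "" if bucket_id is None else f"_{bucket_id:05d}"
    return f"{staging_dir}/part-r-{split_id:05d}-{job_uuid}{bucket_part}"


class ParquetLikeWriter:
    """Data-source half (hypothetical): contributes only the suffix."""

    suffix = ".snappy.parquet"

    def __init__(self, file_prefix: str) -> None:
        # The writer appends its own extension to the prefix it was given.
        self.path = file_prefix + self.suffix


prefix = make_file_prefix("/tmp/staging", split_id=3, job_uuid="abc", bucket_id=7)
writer = ParquetLikeWriter(prefix)
print(writer.path)
```

Under these assumptions, a writer for a different format would only change `suffix`; the part of the name that carries the bucket id stays under the caller's control.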
Note that a side effect of this change is that all file-based data sources
now support bucketing automatically.
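To see why a caller-controlled prefix makes bucketing independent of the data source, here is a rough sketch of recovering a bucket id from a file name whatever its extension. The naming convention (an underscore followed by digits just before the extension) and the helper `bucket_id_from_name` are assumptions for illustration, not a statement of Spark's actual format.

```python
import re
from typing import Optional

# Assumed convention: "<anything>_<digits>" optionally followed by ".<extension>".
_BUCKET_PATTERN = re.compile(r".*_(\d+)(?:\..*)?$")


def bucket_id_from_name(file_name: str) -> Optional[int]:
    """Return the bucket id encoded in the name, or None if there is none."""
    match = _BUCKET_PATTERN.match(file_name)
    return int(match.group(1)) if match else None


# The same parser works across formats because the extension is ignored.
for name in ["part-r-00000-abc_00003.snappy.parquet",
             "part-r-00000-abc_00012.csv.gz",
             "part-r-00000-abc.json"]:
    print(name, bucket_id_from_name(name))
```

Because the parser only looks at the portion before the extension, any data source that appends its own suffix to the given prefix would stay compatible with it.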
There are also some other minor cleanups:
- Removed the UUID that was passed through a generic Configuration string
- Some minor rewrites for better clarity
- Renamed "path" in multiple places to "stagingDir", to more accurately
reflect its meaning
## How was this patch tested?
This should be covered by existing data source tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rxin/spark SPARK-18021
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15562.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15562
----
commit 426ed1f3d8b680adb83b3983ddbd368612be1e5f
Author: Reynold Xin <[email protected]>
Date: 2016-10-20T05:29:23Z
[SPARK-18012][SQL] Simplify WriterContainer follow-up
commit 6b79d88b9c66aa7a9faed335297e1646972e6526
Author: Reynold Xin <[email protected]>
Date: 2016-10-20T06:16:47Z
[SPARK-18021][SQL] Refactor file name specification for data sources
----