GitHub user yanbohappy opened a pull request:
https://github.com/apache/spark/pull/4607
[SPARK-5821] [SPARK-5746] [SQL] JSON data source refactor initial draft
JSON data source refactor
1, The path in "CREATE TABLE AS SELECT" must be a directory. In this
scenario we need to write or append files to the existing table, and an
underlying directory is a better fit for append operations, authentication,
and authorization.
For SPARK-5821, if we don't have write permission on the parent directory,
the CTAS command will fail.
Another reason is that we can't append to the HDFS files that represent an
RDD; to implement append semantics, we need to add new files to a specific
directory.
2, New INSERT OVERWRITE implementation.
First, write the newly generated table to a temporary directory named
"_temporary" under the path directory. After the insert finishes, delete the
original files. Finally, rename "_temporary" to "data".
This fixes the bug described in SPARK-5746.
Why rename "_temporary" to "data" rather than move all files in
"_temporary" to the path and then delete "_temporary"? Because Spark's
RDD.saveAsTextFile(path) and related operations store the whole RDD as HDFS
files named like "part-*****" under the path. If the original files were
produced this way and we then use "INSERT" without overwrite, the newly
generated table files are also named "part-*****", which would produce a
corrupted table.
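The three INSERT OVERWRITE steps above can be sketched as follows. This is
only an illustration under stated assumptions, not the code in this pull
request: it uses the local filesystem in place of HDFS, and the helper names
(write_parts, insert_overwrite, read_table) are hypothetical. The
"part-NNNNN" file names imitate what saveAsTextFile produces.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def write_parts(directory: Path, rows: Iterable[str]) -> None:
    """Write one file per row, named part-00000, part-00001, ...

    Hypothetical stand-in for the part files a Spark job would write.
    """
    directory.mkdir(parents=True, exist_ok=True)
    for i, row in enumerate(rows):
        (directory / f"part-{i:05d}").write_text(row + "\n")


def insert_overwrite(table_path: Path, rows: Iterable[str]) -> None:
    """Replace the table contents using the three steps described above."""
    tmp = table_path / "_temporary"
    data = table_path / "data"

    # Step 1: write the new output under <path>/_temporary.
    if tmp.exists():
        shutil.rmtree(tmp)
    write_parts(tmp, rows)

    # Step 2: delete the original files (everything except _temporary).
    for child in table_path.iterdir():
        if child == tmp:
            continue
        if child.is_dir():
            shutil.rmtree(child)
        else:
            child.unlink()

    # Step 3: rename _temporary to data, so the new part files never share
    # a directory with the old ones.
    tmp.rename(data)


def read_table(table_path: Path) -> List[str]:
    """Read every part file under the table path, in sorted path order."""
    return [p.read_text().rstrip("\n")
            for p in sorted(table_path.rglob("part-*"))]
```

In this sketch the old files stay in place until the new output has been
fully written, and the final rename keeps the new "part-*****" files in their
own directory instead of mixing them with files that have the same names.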
This is an initial draft and needs optimization. Looking forward to your
opinions and comments.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yanbohappy/spark JSONDataSourceRefactor
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4607.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4607
----
commit 8683a483c074f692152159d63a101f78c3c3fe58
Author: Yanbo Liang <[email protected]>
Date: 2015-02-14T17:37:05Z
JSON data source refactor initial draft
----