[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON data sour...

yanbohappy Sat, 14 Feb 2015 10:03:07 -0800

GitHub user yanbohappy opened a pull request:

    https://github.com/apache/spark/pull/4607


    [SPARK-5821] [SPARK-5746] [SQL] JSON data source refactor initial draft

    JSON data source refactor
    1. The path in "CREATE TABLE AS SELECT" must be a directory. In this 
scenario we need to write or append files to an existing table, so an 
underlying directory is a more reasonable target for append operations, 
authentication and authorization.
    For SPARK-5821: if we don't have write permission on the parent 
directory, the CTAS command will fail.
    Another reason is that we can't append to the HDFS files that represent 
an RDD. To implement append semantics, we need to write new files and add 
them to a specific directory.
    2. New INSERT OVERWRITE implementation.
    First, write the newly generated table to a temporary directory named 
"_temporary" under the path directory. After the insert finishes, delete the 
original files. Finally, rename "_temporary" to "data".
    This fixes the bug described in SPARK-5746.
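    The three steps above can be sketched roughly as follows. This is only 
an illustration on the local filesystem in Python, not the Scala/HDFS code in 
this pull request; the function name insert_overwrite and the write_new_data 
callback are hypothetical names chosen for the example.

```python
import os
import shutil
import tempfile


def insert_overwrite(table_path, write_new_data):
    """Replace the contents of table_path (hypothetical helper, local FS only)."""
    tmp_dir = os.path.join(table_path, "_temporary")
    data_dir = os.path.join(table_path, "data")

    # Step 1: write the newly generated table into <path>/_temporary.
    if os.path.exists(tmp_dir):
        shutil.rmtree(tmp_dir)  # assumption: leftovers of a failed run are discarded
    os.makedirs(tmp_dir)
    write_new_data(tmp_dir)

    # Step 2: only after the write has finished, delete the original files.
    for name in os.listdir(table_path):
        if name == "_temporary":
            continue
        entry = os.path.join(table_path, name)
        if os.path.isdir(entry):
            shutil.rmtree(entry)
        else:
            os.remove(entry)

    # Step 3: rename "_temporary" to "data".
    os.rename(tmp_dir, data_dir)
    return data_dir


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    with open(os.path.join(root, "part-00000"), "w") as f:
        f.write('{"v": "old"}\n')

    def writer(directory):
        with open(os.path.join(directory, "part-00000"), "w") as f:
            f.write('{"v": "new"}\n')

    print(os.listdir(insert_overwrite(root, writer)))
```

    Because the old files are removed only after the new data has been fully 
written, a failure during step 1 leaves the original table untouched.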
    Why rename "_temporary" to "data" rather than move all files in 
"_temporary" to the path and then delete "_temporary"? Because Spark's 
RDD.saveAsTextFile(path) and related operations store the whole RDD as HDFS 
files named like "part-*****" under the path. If the original files were 
produced this way and we then use "INSERT" without overwrite, the newly 
generated table files are also named "part-*****", which would produce a 
corrupted table.
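    The name clash can be illustrated with a small local-filesystem sketch 
in Python. It assumes part files are numbered from zero in every write, which 
matches the "part-*****" pattern above; the helpers write_parts and 
colliding_names are hypothetical and exist only for this example.

```python
import os
import tempfile


def write_parts(directory, count):
    """Create `count` files named part-00000, part-00001, ... in `directory`."""
    os.makedirs(directory, exist_ok=True)
    for i in range(count):
        with open(os.path.join(directory, "part-%05d" % i), "w") as f:
            f.write("{}\n")


def colliding_names(existing_dir, new_dir):
    """Return the file names that exist in both directories."""
    return sorted(set(os.listdir(existing_dir)) & set(os.listdir(new_dir)))


if __name__ == "__main__":
    table = tempfile.mkdtemp()
    write_parts(table, 2)                               # files from an earlier save
    write_parts(os.path.join(table, "_temporary"), 3)   # newly generated files
    # Moving the new files up into `table` would clash with these names:
    print(colliding_names(table, os.path.join(table, "_temporary")))
```

    Renaming the whole "_temporary" directory moves no individual file into 
the path directory, so the existing "part-*****" files are never overwritten 
by files with the same name.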
    This is an initial draft and needs optimization. Looking forward to your 
opinions and comments.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanbohappy/spark JSONDataSourceRefactor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4607.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4607
    
----
commit 8683a483c074f692152159d63a101f78c3c3fe58
Author: Yanbo Liang <[email protected]>
Date:   2015-02-14T17:37:05Z

    JSON data source refactor initial draft

----


