GitHub user yanbohappy opened a pull request:

    https://github.com/apache/spark/pull/4610

    JSON external data source INSERT improvements initial draft

    
    
    JSON external data source INSERT operation improvements and bug fix:
    1, The path in "CREATE TABLE AS SELECT" must be a directory, whether or not it already exists; we use a directory to represent the table.
    In this scenario we need to write (INSERT OVERWRITE) or append (INSERT INTO) data to the existing table, and we can't append to the HDFS files that back an RDD. To implement append semantics, we have to generate new files and add them to the table's directory.
    Another reason is that basing the table on a directory is more reasonable for access control, authentication, and authorization. As SPARK-5821 mentioned, if we don't have write permission on the parent directory of the table, the CTAS command will fail. It's reasonable that a user is not granted access rights to directories outside the table's scope, so basing the table on a directory may be the better choice.
    This restriction applies only to "CREATE TABLE AS SELECT"; other DDL such as "CREATE TABLE" can be based on an ordinary file or a directory, since it only scans the table and never inserts new data. A minimal sketch of the path check follows.
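
    The sketch below shows the directory check described above, using the Hadoop FileSystem API; the helper name ensureTableDirectory is hypothetical, not the exact code in this patch:

        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.Path

        // Hypothetical helper: reject a CTAS path that exists but is a plain
        // file, and create the directory when it does not exist yet.
        def ensureTableDirectory(pathString: String, conf: Configuration): Path = {
          val path = new Path(pathString)
          val fs = path.getFileSystem(conf)
          if (fs.exists(path)) {
            require(fs.getFileStatus(path).isDirectory,
              s"CREATE TABLE AS SELECT requires a directory, but $path is a file")
          } else {
            fs.mkdirs(path)  // the table is represented by this directory
          }
          path
        }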
    
    2, New INSERT OVERWRITE implementation.
    First, write the newly generated table into a temporary directory named "_temporary" under the table's path. After the insert finishes, delete the original files. Finally, rename "_temporary" to "data". A sketch of this flow follows.
    This fixes the bug mentioned in SPARK-5746.
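
    The flow above, sketched with the Hadoop FileSystem API; overwriteTable and writeRows are hypothetical names, not the code in this patch:

        import org.apache.hadoop.fs.{FileSystem, Path}

        // Hypothetical sketch of the overwrite flow: write the new data to
        // <tableDir>/_temporary, drop the old files, then rename.
        def overwriteTable(fs: FileSystem, tableDir: Path)(writeRows: Path => Unit): Unit = {
          val tmpDir = new Path(tableDir, "_temporary")
          writeRows(tmpDir)                                  // 1. write new result
          fs.listStatus(tableDir)                            // 2. delete originals,
            .filterNot(_.getPath.getName == "_temporary")    //    keeping _temporary
            .foreach(status => fs.delete(status.getPath, true))
          fs.rename(tmpDir, new Path(tableDir, "data"))      // 3. expose new data
        }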
    
    3, Why rename "_temporary" to "data" rather than move all the files in "_temporary" up into the path and then delete "_temporary"? Because Spark's RDD.saveAsTextFile(path) and related operations store the whole RDD under the path as HDFS files named like "part-*****". If the original files were produced this way and we then run "INSERT" without overwrite, the newly generated table files are also named "part-*****", and the name collisions would corrupt the table. An illustration follows.
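
    For illustration, suppose the original table at a/b/c (the example path from the Todo section below) was written by rdd.saveAsTextFile("a/b/c"):

        a/b/c/part-00000
        a/b/c/part-00001

    A later INSERT job emits files with the same "part-*****" names, so moving them straight into a/b/c would clash with, or silently replace, the existing files. Renaming the whole "_temporary" directory sidesteps the collision.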
    
    Todo:
    1, If there is an existing RDD based on path a/b/c that has already been cached, after an "INSERT" operation we need to recompute this RDD by rescanning the directory. Can we trigger a rescan after "INSERT"?
    2, Is the rename of "_temporary" to "data" mentioned above enough? If the base directory was produced by another "CTAS" command, there will already be a "data" directory under it. Can we append a unique number after "data", or just use the jobId or taskId to identify the subdirectory? A sketch of such a naming scheme follows this list.
    I think it may be better to resolve these problems in a follow-up PR.
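
    One possible shape for Todo 2, purely as an assumption (uniqueDataDir is a hypothetical helper, and embedding a job identifier is just one candidate scheme):

        import org.apache.hadoop.fs.Path

        // Hypothetical naming scheme: suffix the data directory with a job
        // identifier so repeated CTAS/INSERT runs into the same base
        // directory never collide on a single "data" name.
        def uniqueDataDir(tableDir: Path, jobId: String): Path =
          new Path(tableDir, s"data_$jobId")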
    
    This is an initial draft and needs optimization. Looking forward to your comments.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanbohappy/spark jsonInsertImprovements

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4610.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4610
    
----
commit 307278ff6bdcd1d1a5a50650fb3dfa6da3db070f
Author: Yanbo Liang <[email protected]>
Date:   2015-02-15T05:23:14Z

    JSON external data source INSERT improvements initial draft

----

