[jira] [Commented] (TAJO-2087) Implement DirectOutputCommitter

ASF GitHub Bot (JIRA) Wed, 16 Mar 2016 03:39:51 -0700

    [ 
https://issues.apache.org/jira/browse/TAJO-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197160#comment-15197160
 ]


ASF GitHub Bot commented on TAJO-2087:
--------------------------------------

GitHub user blrunner opened a pull request:

    https://github.com/apache/tajo/pull/979

    TAJO-2087: Implement DirectOutputCommitter

    Here is prototype code for ``DirectOutputCommitter``. This PR is not ready 
to review; it shows my approach to implementing ``DirectOutputCommitter``. The 
current version works as follows:
    
    - Register the commit history in the catalog (TODO).
    - Each task writes its output data directly to the final location.
    - In the commit phase, delete existing files according to the query type, 
as follows. First, back up the existing files or directories to the staging 
directory. Then delete the backup files or directories. 
    - Update the status of the commit history in the catalog (TODO).
    - If the query fails, QueryMaster will delete the committed files and 
update the status of the query history in the catalog (TODO).
    - When ``TajoMaster`` starts, it will check the status of the query 
histories in the catalog. If it finds a running query, it will delete the 
committed files and update the status of the query history (TODO).
    - Add unit test cases for failed queries (TODO).
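
The commit-phase and failure-handling steps listed above can be illustrated with a small, self-contained sketch. This is not the code in the pull request: it is written in Python against a local directory, and the function names (`commit_overwrite`, `abort`), the `backup` subdirectory, and the use of a file-name prefix to recognize a query's own output are all assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def commit_overwrite(final_dir: Path, staging_dir: Path, query_prefix: str) -> None:
    """Back up pre-existing files to the staging directory, then delete the backups."""
    backup_dir = staging_dir / "backup"  # hypothetical layout, not Tajo's
    backup_dir.mkdir(parents=True, exist_ok=True)
    for path in list(final_dir.iterdir()):
        # Files written directly by this query carry its prefix and stay in place.
        if not path.name.startswith(query_prefix):
            shutil.move(str(path), str(backup_dir / path.name))
    shutil.rmtree(backup_dir)


def abort(final_dir: Path, query_prefix: str) -> None:
    """On failure, remove only the files this query wrote to the final location."""
    for path in list(final_dir.iterdir()):
        if path.name.startswith(query_prefix):
            if path.is_dir():
                shutil.rmtree(path)
            else:
                path.unlink()


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    table = root / "table"
    table.mkdir()
    (table / "old_000000").write_text("old data")
    (table / "q1_000000").write_text("new data")
    commit_overwrite(table, root / "staging", "q1")
    print(sorted(p.name for p in table.iterdir()))
```

Moving files into a backup before deleting them keeps the removal of old data as a separate step from the writes of new data, which is the ordering the description gives for the commit phase.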

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/blrunner/tajo direct-output-committer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tajo/pull/979.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #979
    
----
commit 083bed51db1e68ed840961e2e169695dde60e116
Author: JaeHwa Jung <[email protected]>
Date:   2016-02-24T02:08:08Z

    Add the list of output files and backup files to TaskAttemptContext

commit b39c8d1bcb153d53aae028577935499034bd4b6f
Author: JaeHwa Jung <[email protected]>
Date:   2016-02-24T05:31:55Z

    Add outputFiles and backupFiles to Protocol Buffer

commit e3b26ea738ba33e1a6c8b8c856793f5a584eb861
Author: JaeHwa Jung <[email protected]>
Date:   2016-02-24T05:48:02Z

    Add property for setting Direct Output Committer to TajoConf and SessionVars

commit 9efb4662957ff39ff215a3c829ece5e69d9ebe36
Author: JaeHwa Jung <[email protected]>
Date:   2016-02-25T01:59:26Z

    Remove related property from SessionVars

commit 234f2829768f18fab7c7894aab2ccf7780ae3ffb
Author: JaeHwa Jung <[email protected]>
Date:   2016-03-04T02:44:52Z

    Add temporary codes for testing

commit 7effec1fc663d246ffd3e25bfd4a98c803b22607
Author: JaeHwa Jung <[email protected]>
Date:   2016-03-15T09:01:43Z

    Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into 
direct-output-committer

commit cb762766848c2af5d25e20ab552a2041c67924cc
Author: JaeHwa Jung <[email protected]>
Date:   2016-03-15T09:30:36Z

    Prefix of output file name must be the id of query.

commit dce41c6be686916a346dc15a033bea39cc79550b
Author: JaeHwa Jung <[email protected]>
Date:   2016-03-16T05:50:32Z

    Implement direct Output Committer to FileTablespace

commit 908ccd2b6c2ebbd602892b979c1ff41d7ed4a820
Author: JaeHwa Jung <[email protected]>
Date:   2016-03-16T06:30:45Z

    Implement a method for renaming recursively directories

commit bd1e1b3f16e8b6263ef4e762b621a4ba2235aa34
Author: JaeHwa Jung <[email protected]>
Date:   2016-03-16T06:43:56Z

    Remove proto modifications

commit 95e513a04bcfee10643ebe17b0e21074057f0be2
Author: JaeHwa Jung <[email protected]>
Date:   2016-03-16T10:21:06Z

    Add session variable and add more unit test cases

----


> Implement DirectOutputCommitter
> -------------------------------
>
>                 Key: TAJO-2087
>                 URL: https://issues.apache.org/jira/browse/TAJO-2087
>             Project: Tajo
>          Issue Type: Sub-task
>          Components: QueryMaster, S3
>            Reporter: Jaehwa Jung
>            Assignee: Jaehwa Jung
>
> Currently, the Tajo output committer works as follows:
> * Each task writes its output to a temp directory.
> * {{FileTablespace::commitTable}} renames the first successful task's temp 
> directory to the final destination.
> However, this approach can cause a {{FileNotFoundException}} because of the 
> eventual consistency of S3. To resolve this, we need to implement 
> DirectOutputCommitter.
> There are three possible ways to implement it.
> The first way is to change the naming scheme for the files Tajo creates. 
> Instead of {{part-00000}}, we should use names like {{UUID_000000}}, where all 
> files generated by a single insert into use the same prefix. The prefix 
> consists of a UUID and the query id. This guarantees that a new insert into 
> will not stomp on data produced by an earlier query. After a query finishes 
> successfully, Tajo will delete all files that don't begin with the same UUID. 
> Of course, when executing an insert into statement, Tajo never deletes 
> existing files. But if the query fails or is killed, Tajo will delete all 
> files that begin with the same UUID. I was inspired by Qubole's slides 
> (http://www.slideshare.net/qubolemarketing/new-york-city-hadoop-meetup-4-232015).
> The second way is to store the inserted file names and the existing file 
> names in tables of {{CatalogStore}} or in member variables of 
> {{TaskAttemptContext}}. Before inserting files, Tajo will store the existing 
> file names in some storage. Whenever a task attempt finishes, Tajo will store 
> the inserted file names in some storage. Tajo will then delete or keep files 
> using the stored file names, according to the final status of the query.
> The other way is to write the data to local disk. This output committer works 
> as follows:
> * Each task writes its output to local disk instead of S3 (in a CTAS 
> statement or INSERT statement).
> * It copies the first successful task's temp directory to S3.
> For reference, I was inspired by the Netflix slides on integrating Spark 
> (http://www.slideshare.net/piaozhexiu/netflix-integrating-spark-at-petabyte-scale-53391704).
> I would like to implement DirectOutputCommitter using the first way.
> Please feel free to comment if you have any questions/ideas.
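
The cleanup rules of the first approach in the issue description can be written down as a small decision function. The sketch below is in Python and is only an illustration: the helper names, the exact prefix format (a UUID followed by the query id), and the `overwrite` flag, which stands for statements that replace existing data as opposed to a plain insert into, are assumptions and are not taken from Tajo.

```python
import uuid


def make_prefix(query_id: str) -> str:
    """Build a per-query prefix from a fresh UUID and the query id (assumed format)."""
    return f"{uuid.uuid4().hex}_{query_id}"


def output_name(prefix: str, task_index: int) -> str:
    """Name an output file like PREFIX_000000 instead of part-00000."""
    return f"{prefix}_{task_index:06d}"


def files_to_delete(names, prefix, succeeded, overwrite):
    """Pick the files to remove once the query has reached its final status."""
    if not succeeded:
        # Failed or killed: remove only what this query wrote.
        return [n for n in names if n.startswith(prefix)]
    if overwrite:
        # Successful overwrite: remove everything not written by this query.
        return [n for n in names if not n.startswith(prefix)]
    # Successful insert into: existing files are never deleted.
    return []
```

Because every file of one query shares the prefix, the decision needs only the file names and the final status of the query, with no record of which files existed before the query started.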



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
