[
https://issues.apache.org/jira/browse/TAJO-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197160#comment-15197160
]
ASF GitHub Bot commented on TAJO-2087:
--------------------------------------
GitHub user blrunner opened a pull request:
https://github.com/apache/tajo/pull/979
TAJO-2087: Implement DirectOutputCommitter
Here is prototype code for ``DirectOutputCommitter``. This PR is not ready
for review; it only shows my approach to implementing ``DirectOutputCommitter``.
The current version works as follows:
- Register the commit history in the catalog (TODO).
- Each task writes its output data directly to the final location.
- In the commit phase, delete existing files according to the query type, as
follows: first, back up the existing files or directories to the staging
directory, and then delete the backed-up files or directories.
- Update the status of the commit history in the catalog (TODO).
- If the query fails, QueryMaster will delete the committed files and update
the status of the query history in the catalog (TODO).
- When ``TajoMaster`` starts, it will check the status of the query histories
in the catalog. If it finds a running query, it will delete the committed
files and update the status of the query history (TODO).
- Add unit test cases for failed queries (TODO).
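The commit-phase and failure steps listed above can be illustrated with a
minimal sketch. It is not the code in this PR: it uses ``java.nio.file`` on a
local file system instead of the Hadoop ``FileSystem`` API, it assumes an
overwrite-style query, and the class and method names (``DirectCommitSketch``,
``commitOverwrite``, ``abort``) are hypothetical. It also assumes that every
output file name starts with the query id, as one of the commits below states.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Tajo.
class DirectCommitSketch {

  /** Lists regular files in dir whose name does (matching) or does not start with prefix. */
  static List<Path> list(Path dir, String prefix, boolean matching) throws IOException {
    List<Path> result = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      for (Path p : stream) {
        boolean hasPrefix = p.getFileName().toString().startsWith(prefix);
        if (Files.isRegularFile(p) && hasPrefix == matching) {
          result.add(p);
        }
      }
    }
    return result;
  }

  /** Commit of an overwrite query: back up older files to staging, then delete the backups. */
  static void commitOverwrite(Path finalDir, Path stagingDir, String queryId)
      throws IOException {
    Files.createDirectories(stagingDir);
    List<Path> backups = new ArrayList<>();
    // Step 1: move every file not written by this query into the staging directory.
    for (Path existing : list(finalDir, queryId, false)) {
      Path backup = stagingDir.resolve(existing.getFileName());
      Files.move(existing, backup);
      backups.add(backup);
    }
    // Step 2: all moves succeeded, so the backups can be removed.
    for (Path backup : backups) {
      Files.delete(backup);
    }
  }

  /** Cleanup for a failed query: delete the files this query wrote to finalDir. */
  static void abort(Path finalDir, String queryId) throws IOException {
    for (Path committed : list(finalDir, queryId, true)) {
      Files.delete(committed);
    }
  }
}
```

Keeping the backup and the delete as two separate passes means the old files
still exist in the staging directory if a move fails part-way through.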
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/blrunner/tajo direct-output-committer
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/tajo/pull/979.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #979
----
commit 083bed51db1e68ed840961e2e169695dde60e116
Author: JaeHwa Jung <[email protected]>
Date: 2016-02-24T02:08:08Z
Add the list of output files and backup files to TaskAttemptContext
commit b39c8d1bcb153d53aae028577935499034bd4b6f
Author: JaeHwa Jung <[email protected]>
Date: 2016-02-24T05:31:55Z
Add outputFiles and backupFiles to Protocol Buffer
commit e3b26ea738ba33e1a6c8b8c856793f5a584eb861
Author: JaeHwa Jung <[email protected]>
Date: 2016-02-24T05:48:02Z
Add property for setting Direct Output Committer to TajoConf and SessionVars
commit 9efb4662957ff39ff215a3c829ece5e69d9ebe36
Author: JaeHwa Jung <[email protected]>
Date: 2016-02-25T01:59:26Z
Remove related property from SessionVars
commit 234f2829768f18fab7c7894aab2ccf7780ae3ffb
Author: JaeHwa Jung <[email protected]>
Date: 2016-03-04T02:44:52Z
Add temporary codes for testing
commit 7effec1fc663d246ffd3e25bfd4a98c803b22607
Author: JaeHwa Jung <[email protected]>
Date: 2016-03-15T09:01:43Z
Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into
direct-output-committer
commit cb762766848c2af5d25e20ab552a2041c67924cc
Author: JaeHwa Jung <[email protected]>
Date: 2016-03-15T09:30:36Z
Prefix of output file name must be the id of query.
commit dce41c6be686916a346dc15a033bea39cc79550b
Author: JaeHwa Jung <[email protected]>
Date: 2016-03-16T05:50:32Z
Implement direct Output Committer to FileTablespace
commit 908ccd2b6c2ebbd602892b979c1ff41d7ed4a820
Author: JaeHwa Jung <[email protected]>
Date: 2016-03-16T06:30:45Z
Implement a method for renaming recursively directories
commit bd1e1b3f16e8b6263ef4e762b621a4ba2235aa34
Author: JaeHwa Jung <[email protected]>
Date: 2016-03-16T06:43:56Z
Remove proto modifications
commit 95e513a04bcfee10643ebe17b0e21074057f0be2
Author: JaeHwa Jung <[email protected]>
Date: 2016-03-16T10:21:06Z
Add session variable and add more unit test cases
----
> Implement DirectOutputCommitter
> -------------------------------
>
> Key: TAJO-2087
> URL: https://issues.apache.org/jira/browse/TAJO-2087
> Project: Tajo
> Issue Type: Sub-task
> Components: QueryMaster, S3
> Reporter: Jaehwa Jung
> Assignee: Jaehwa Jung
>
> Currently, the Tajo output committer works as follows:
> * Each task writes its output to a temp directory.
> * {{FileTablespace::commitTable}} renames the first successful task's temp
> directory to the final destination.
> However, this approach can cause a {{FileNotFoundException}} because of the
> eventual consistency of S3. To resolve this, we need to implement
> DirectOutputCommitter. There are three possible ways to implement it.
> The first way is to change the naming scheme for the files Tajo creates.
> Instead of {{part-00000}}, we should use names like {{UUID_000000}}, where all
> files generated by a single INSERT INTO share the same prefix. The prefix
> consists of a UUID and the query id. This guarantees that a new INSERT INTO
> will not stomp on data produced by an earlier query. After the query finishes
> successfully, Tajo will delete all files that don't begin with the same UUID.
> Of course, when executing an INSERT INTO statement, Tajo never deletes
> existing files. But if the query fails or is killed, Tajo will delete all
> files that begin with the same UUID. I was inspired by Qubole's slides
> (http://www.slideshare.net/qubolemarketing/new-york-city-hadoop-meetup-4-232015).
> The second way is to store the inserted file names and the existing file
> names in tables of {{CatalogStore}} or in member variables of
> {{TaskAttemptContext}}. Before inserting files, Tajo will store the existing
> file names in some storage. Whenever a task attempt finishes, Tajo will store
> the inserted file names in the same storage. Tajo will then delete or keep
> files, using the stored file names, according to the final status of the query.
> The third way is to write the data to local disk. This output committer works
> as follows:
> * Each task writes its output to local disk instead of S3 (in a CTAS
> statement or an INSERT statement).
> * It copies the first successful task's temp directory to S3.
> For reference, I was inspired by Netflix's slides on integrating Spark
> (http://www.slideshare.net/piaozhexiu/netflix-integrating-spark-at-petabyte-scale-53391704).
> I would like to implement DirectOutputCommitter using the first way.
> Please feel free to comment if you have any questions/ideas.
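As an illustration of the first way described in the issue, here is a minimal
sketch of the naming scheme and the cleanup rule. It works on plain file names
only and does not touch S3 or the Tajo code base. The class and method names
(``PrefixedOutputNames``, ``queryPrefix``, ``outputFileName``,
``filesToDelete``) and the exact ``<uuid>_<queryId>_<index>`` layout are
assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.UUID;

// Hypothetical illustration only; none of these names exist in Tajo.
class PrefixedOutputNames {

  /** Builds the prefix shared by all files of one query: "<uuid>_<queryId>". */
  static String queryPrefix(UUID uuid, String queryId) {
    return uuid + "_" + queryId;
  }

  /** Names one task's output file: "<prefix>_<six-digit task index>". */
  static String outputFileName(String prefix, int taskIndex) {
    return String.format(Locale.ROOT, "%s_%06d", prefix, taskIndex);
  }

  /**
   * Decides which files to delete once the query has ended.
   * Failed or killed query: the files written by this query.
   * Successful overwrite: every file that does not carry this query's prefix.
   * Successful INSERT INTO (overwrite == false): nothing.
   */
  static List<String> filesToDelete(List<String> names, String prefix,
                                    boolean succeeded, boolean overwrite) {
    List<String> doomed = new ArrayList<>();
    for (String name : names) {
      boolean own = name.startsWith(prefix + "_");
      if ((!succeeded && own) || (succeeded && overwrite && !own)) {
        doomed.add(name);
      }
    }
    return doomed;
  }
}
```

Because every decision depends only on the prefix, no rename is needed at
commit time, which is the step that eventual consistency makes unreliable.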
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)