[
https://issues.apache.org/jira/browse/TAJO-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249497#comment-15249497
]
ASF GitHub Bot commented on TAJO-2087:
--------------------------------------
Github user blrunner commented on the pull request:
https://github.com/apache/tajo/pull/979#issuecomment-212331340
I implemented unit test cases to verify the following cases:
* ``DirectOutputCommitter`` can recover existing files successfully when a
query fails.
* ``DirectOutputCommitter`` can remove output files successfully when a
query fails.
* When executing an ``INSERT INTO`` query, ``DirectOutputCommitter``
maintains existing files.
For reference, I found that the outputs of ``TestInsertQuery`` and
``TestTablePartitions`` with ``DirectOutputCommitter`` were equal to their
outputs without it.
> Support DirectOutputCommitter for AWS S3 file system
> ----------------------------------------------------
>
> Key: TAJO-2087
> URL: https://issues.apache.org/jira/browse/TAJO-2087
> Project: Tajo
> Issue Type: Sub-task
> Components: QueryMaster, S3
> Reporter: Jaehwa Jung
> Assignee: Jaehwa Jung
>
> Currently, the Tajo output committer works as follows:
> * Each task writes its output to a temp directory.
> * {{FileTablespace::commitTable}} renames the first successful task's temp
> directory to the final destination.
> But the above approach can cause a {{FileNotFoundException}} because of the
> eventual consistency of S3. To resolve this, we need to implement a
> DirectOutputCommitter.
> There are three possible ways to implement it.
> The first way is changing the naming scheme for the files Tajo creates.
> Instead of {{part-00000}}, we would use names like {{UUID_000000}}, where all
> files generated by a single INSERT INTO use the same prefix. The prefix
> consists of a UUID and the query id. This guarantees that a new INSERT INTO
> will not stomp on data produced by an earlier query. After a query finishes
> successfully, Tajo will delete all files that don't begin with the same UUID.
> Of course, when executing an INSERT INTO statement, Tajo never deletes
> existing files. But if a query fails or is killed, Tajo will delete all files
> that begin with the same UUID. I was inspired by Qubole's slides
> (http://www.slideshare.net/qubolemarketing/new-york-city-hadoop-meetup-4-232015).
> The second way is storing the inserted file names and the existing file names
> in tables of {{CatalogStore}} or in member variables of
> {{TaskAttemptContext}}. Before inserting files, Tajo would store the existing
> file names in some storage, and whenever a task attempt finishes, Tajo would
> store the inserted file names there as well. Tajo would then delete or keep
> files using the stored file names according to the query's final status.
> The other way is writing the data to local disk. This output committer works
> as follows:
> * Each task writes its output to local disk instead of S3 (for CTAS or
> INSERT statements).
> * Tajo copies the first successful task's temp directory to S3.
> For reference, I was inspired by Netflix's "Integrating Spark at Petabyte
> Scale" slides
> (http://www.slideshare.net/piaozhexiu/netflix-integrating-spark-at-petabyte-scale-53391704).
> I would like to implement DirectOutputCommitter using the first way.
> Please feel free to comment if you have any questions/ideas.
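The first (UUID-prefix) approach described above can be sketched as follows. This is a minimal illustration with hypothetical class and method names, not Tajo's actual API; it simulates the commit/abort behavior on a local filesystem and assumes Java 11+ for ``Files.writeString``.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;
import java.util.stream.Stream;

// Hypothetical sketch of the UUID-prefix commit scheme: every file a query
// writes carries that query's unique prefix, so commit/abort can tell new
// output apart from pre-existing data by name alone, with no rename step.
public class DirectCommitSketch {

    // Each query gets a unique prefix; tasks write files named UUID_NNNNNN.
    static String newQueryPrefix() {
        return UUID.randomUUID().toString();
    }

    static Path taskOutputPath(Path tableDir, String prefix, int taskId) {
        return tableDir.resolve(String.format("%s_%06d", prefix, taskId));
    }

    // On success of an overwrite (e.g. CTAS): delete files that do NOT carry
    // this query's prefix. An INSERT INTO would skip this step entirely,
    // leaving existing files untouched.
    static void commit(Path tableDir, String prefix) throws IOException {
        try (Stream<Path> files = Files.list(tableDir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                if (!p.getFileName().toString().startsWith(prefix)) {
                    Files.delete(p);
                }
            }
        }
    }

    // On failure or kill: delete only this query's files, so data produced
    // by earlier queries survives.
    static void abort(Path tableDir, String prefix) throws IOException {
        try (Stream<Path> files = Files.list(tableDir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                if (p.getFileName().toString().startsWith(prefix)) {
                    Files.delete(p);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path tableDir = Files.createTempDirectory("tajo-table");
        Files.writeString(tableDir.resolve("old-data"), "existing");

        String prefix = newQueryPrefix();
        Files.writeString(taskOutputPath(tableDir, prefix, 0), "new");

        abort(tableDir, prefix); // simulate a failed query
        System.out.println(Files.exists(tableDir.resolve("old-data"))); // true
        try (Stream<Path> s = Files.list(tableDir)) {
            System.out.println(s.count()); // 1: only old-data survives
        }
    }
}
```

Because abort only matches the failed query's prefix, an INSERT INTO that fails leaves pre-existing data intact, and a successful overwrite removes stale files in a single pass after the query completes.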
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)