[
https://issues.apache.org/jira/browse/TAJO-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249497#comment-15249497
]
ASF GitHub Bot commented on TAJO-2087:
--------------------------------------
Github user blrunner commented on the pull request:
https://github.com/apache/tajo/pull/979#issuecomment-212331340
I implemented unit test cases to verify the following cases:
* ``DirectOutputCommitter`` can recover existing files successfully when a
query fails.
* ``DirectOutputCommitter`` can remove output files successfully when a
query fails.
* When executing an ``INSERT INTO`` query, ``DirectOutputCommitter``
maintains existing files.
For reference, I found that the outputs of ``TestInsertQuery`` and
``TestTablePartitions`` with ``DirectOutputCommitter`` were equal to their
outputs without it.
> Support DirectOutputCommitter for AWS S3 file system
> ----------------------------------------------------
>
> Key: TAJO-2087
> URL: https://issues.apache.org/jira/browse/TAJO-2087
> Project: Tajo
> Issue Type: Sub-task
> Components: QueryMaster, S3
> Reporter: Jaehwa Jung
> Assignee: Jaehwa Jung
>
> Currently, the Tajo output committer works as follows:
> * Each task writes its output to a temp directory.
> * {{FileTablespace::commitTable}} renames the first successful task's temp
> directory to the final destination.
> But the above approach can cause a {{FileNotFoundException}} because of the
> eventual consistency of S3. To resolve this, we need to implement a
> DirectOutputCommitter.
> There are three possible ways to implement it.
> The first way is changing the naming scheme for the files Tajo creates.
> Instead of {{part-00000}}, we would use names like {{UUID_000000}}, where all
> files generated by a single INSERT INTO use the same prefix. The prefix
> consists of a UUID and the query id. This guarantees that a new INSERT INTO
> will not stomp on data produced by an earlier query. After a query finishes
> successfully, Tajo will delete all files that don't begin with the same UUID.
> Of course, when executing an INSERT INTO statement, Tajo never deletes
> existing files. But if a query fails or is killed, Tajo will delete all files
> that begin with the same UUID. I was inspired by Qubole's slides
> (http://www.slideshare.net/qubolemarketing/new-york-city-hadoop-meetup-4-232015).
> The second way is storing the inserted file names and the existing file names
> in tables of {{CatalogStore}} or in member variables of
> {{TaskAttemptContext}}. Before inserting files, Tajo would store the existing
> file names in some storage, and whenever a task attempt finishes, Tajo would
> store the inserted file names there as well. Tajo would then delete or keep
> files using the stored file names according to the query's final status.
> The other way is writing the data to local disk. This output committer works
> as follows:
> * Each task writes its output to local disk instead of S3 (for CTAS or
> INSERT statements).
> * Tajo copies the first successful task's temp directory to S3.
> For reference, I was inspired by Netflix's "Integrating Spark at Petabyte
> Scale" slides
> (http://www.slideshare.net/piaozhexiu/netflix-integrating-spark-at-petabyte-scale-53391704).
> I would like to implement DirectOutputCommitter using the first way.
> Please feel free to comment if you have any questions/ideas.
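The first (UUID-prefix) approach described above can be sketched as follows. This is a minimal illustration with hypothetical class and method names, not Tajo's actual API; it simulates the commit/abort behavior on a local filesystem and assumes Java 11+ for ``Files.writeString``.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;
import java.util.stream.Stream;

// Hypothetical sketch of the UUID-prefix commit scheme: every file a query
// writes carries that query's unique prefix, so commit/abort can tell new
// output apart from pre-existing data by name alone, with no rename step.
public class DirectCommitSketch {

    // Each query gets a unique prefix; tasks write files named UUID_NNNNNN.
    static String newQueryPrefix() {
        return UUID.randomUUID().toString();
    }

    static Path taskOutputPath(Path tableDir, String prefix, int taskId) {
        return tableDir.resolve(String.format("%s_%06d", prefix, taskId));
    }

    // On success of an overwrite (e.g. CTAS): delete files that do NOT carry
    // this query's prefix. An INSERT INTO would skip this step entirely,
    // leaving existing files untouched.
    static void commit(Path tableDir, String prefix) throws IOException {
        try (Stream<Path> files = Files.list(tableDir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                if (!p.getFileName().toString().startsWith(prefix)) {
                    Files.delete(p);
                }
            }
        }
    }

    // On failure or kill: delete only this query's files, so data produced
    // by earlier queries survives.
    static void abort(Path tableDir, String prefix) throws IOException {
        try (Stream<Path> files = Files.list(tableDir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                if (p.getFileName().toString().startsWith(prefix)) {
                    Files.delete(p);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path tableDir = Files.createTempDirectory("tajo-table");
        Files.writeString(tableDir.resolve("old-data"), "existing");

        String prefix = newQueryPrefix();
        Files.writeString(taskOutputPath(tableDir, prefix, 0), "new");

        abort(tableDir, prefix); // simulate a failed query
        System.out.println(Files.exists(tableDir.resolve("old-data"))); // true
        try (Stream<Path> s = Files.list(tableDir)) {
            System.out.println(s.count()); // 1: only old-data survives
        }
    }
}
```

Because abort only matches the failed query's prefix, an INSERT INTO that fails leaves pre-existing data intact, and a successful overwrite removes stale files in a single pass after the query completes.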
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)