[ 
https://issues.apache.org/jira/browse/TAJO-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jaehwa Jung updated TAJO-2087:
------------------------------
    Summary: Support DirectOutputCommitter for AWS S3 file system  (was: 
Implement DirectOutputCommitter)

> Support DirectOutputCommitter for AWS S3 file system
> ----------------------------------------------------
>
>                 Key: TAJO-2087
>                 URL: https://issues.apache.org/jira/browse/TAJO-2087
>             Project: Tajo
>          Issue Type: Sub-task
>          Components: QueryMaster, S3
>            Reporter: Jaehwa Jung
>            Assignee: Jaehwa Jung
>
> Currently, the Tajo output committer works as follows:
> * Each task writes its output to a temp directory.
> * {{FileTablespace::commitTable}} renames the first successful task's temp 
> directory to the final destination.
> But this approach can cause a {{FileNotFoundException}} because of the eventual 
> consistency of S3. To resolve it, we need to implement DirectOutputCommitter.
> There are three possible ways to implement it.
> The first way is changing the naming scheme for the files Tajo creates. Instead 
> of {{part-00000}}, we would use names like {{UUID_000000}}, where all files 
> generated by a single INSERT INTO share the same prefix. The prefix consists 
> of a UUID and the query id. This guarantees that a new INSERT INTO will not 
> stomp on data produced by an earlier query. After the query finishes 
> successfully, Tajo deletes all files that don't begin with the same UUID. Of 
> course, when executing an INSERT INTO statement, Tajo never deletes existing 
> files. But if the query fails or is killed, Tajo deletes all files that begin 
> with the same UUID. I was inspired by Qubole's slides 
> (http://www.slideshare.net/qubolemarketing/new-york-city-hadoop-meetup-4-232015).
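The UUID-prefixed naming scheme above might look roughly like the sketch below. This is only an illustration of the idea, not Tajo code: the class name, method names, and the exact prefix format are all assumptions.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical sketch of the UUID-based naming scheme described above.
// All names here are illustrative, not actual Tajo APIs.
public class DirectOutputCommitterSketch {

    // Prefix shared by every file a single INSERT INTO produces:
    // a random UUID plus the query id.
    private final String prefix;

    public DirectOutputCommitterSketch(String queryId) {
        this.prefix = UUID.randomUUID() + "_" + queryId;
    }

    // Instead of part-00000, each task writes <prefix>_<taskId>.
    public String fileNameFor(int taskId) {
        return String.format("%s_%06d", prefix, taskId);
    }

    // On failure or kill: delete only the files this query produced,
    // i.e. exactly those that begin with this query's prefix.
    public void abort(Path outputDir) throws IOException {
        try (DirectoryStream<Path> files =
                 Files.newDirectoryStream(outputDir, prefix + "_*")) {
            for (Path f : files) {
                Files.delete(f);
            }
        }
    }
}
```

Because files written by different queries can never collide, no rename is needed on S3, which is what sidesteps the eventual-consistency problem.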
> The second way is storing the inserted file names and the existing file names 
> in tables of {{CatalogStore}} or in member variables of {{TaskAttemptContext}}. 
> Before inserting files, Tajo stores the existing file names in some storage, 
> and whenever a task attempt finishes, Tajo stores the inserted file names 
> there as well. Tajo then deletes or keeps files using the stored file names, 
> according to the query's final status.
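The bookkeeping in the second approach could be sketched as below. The class and method names are hypothetical; in practice the two name sets would live in {{CatalogStore}} tables or {{TaskAttemptContext}} fields as described above, not in memory.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the file-name bookkeeping approach.
// An in-memory stand-in for the CatalogStore / TaskAttemptContext storage.
public class FileNameTracker {

    private final Set<String> existingFiles = new HashSet<>();
    private final Set<String> insertedFiles = new HashSet<>();

    // Recorded before the query starts writing.
    public void recordExisting(String name) {
        existingFiles.add(name);
    }

    // Recorded whenever a task attempt finishes.
    public void recordInserted(String name) {
        insertedFiles.add(name);
    }

    // On query failure: delete only what this query wrote,
    // never the files that existed beforehand.
    public Set<String> filesToDeleteOnFailure() {
        Set<String> toDelete = new HashSet<>(insertedFiles);
        toDelete.removeAll(existingFiles);
        return toDelete;
    }
}
```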
> The other way is writing the data to local disk first. This output committer 
> works as follows:
> * Each task writes its output to local disk instead of S3 (for CTAS or 
> INSERT statements).
> * The first successful task's temp directory is copied to S3.
> For reference, I was inspired by Netflix's slides on integrating Spark 
> (http://www.slideshare.net/piaozhexiu/netflix-integrating-spark-at-petabyte-scale-53391704).
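The two steps of the local-disk approach can be sketched as follows. This is a minimal illustration using local filesystem copies as a stand-in for the upload to S3; the class and method names are assumptions, not Tajo code.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of the local-disk committer: tasks write locally, and only the
// first successful attempt's directory is pushed to the final destination.
public class LocalDiskCommitter {

    // Step 1: each task gets its own local temp directory.
    public Path taskTempDir(String taskId) throws IOException {
        return Files.createTempDirectory("tajo-task-" + taskId + "-");
    }

    // Step 2: copy the winning attempt's files to the destination in one
    // pass, so no rename ever happens on the (S3-backed) destination.
    public void commit(Path tempDir, Path finalDest) throws IOException {
        Files.createDirectories(finalDest);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(tempDir)) {
            for (Path f : files) {
                Files.copy(f, finalDest.resolve(f.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}
```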
> I would like to implement DirectOutputCommitter using the first way.
> Please feel free to comment if you have any questions/ideas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
