Jaehwa Jung created TAJO-2087:
---------------------------------

             Summary: Implement DirectOutputCommitter
                 Key: TAJO-2087
                 URL: https://issues.apache.org/jira/browse/TAJO-2087
             Project: Tajo
          Issue Type: Sub-task
          Components: QueryMaster, S3
            Reporter: Jaehwa Jung
            Assignee: Jaehwa Jung


Currently, the Tajo output committer works as follows:

* Each task writes its output to a temp directory.
* {{FileTablespace::commitTable}} renames the first successful task's temp 
directory to the final destination.

But the above approach can cause {{FileNotFoundException}} because of the eventual 
consistency of S3. To resolve this, we need to implement a DirectOutputCommitter.
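To make the failure mode concrete, the rename-based commit can be sketched as below. This is an illustrative sketch, not Tajo's actual {{FileTablespace}} code; the class and method names are assumptions.

```java
import java.io.IOException;
import java.nio.file.*;

// Hypothetical sketch of the rename-based commit described above.
public class RenameCommitSketch {
  // Move the task's temp directory to the final destination. On HDFS this
  // is an atomic metadata operation; on S3 a "rename" becomes a listing
  // plus copy+delete per key, and an eventually consistent listing can
  // miss freshly written files, surfacing as FileNotFoundException.
  static void commitTask(Path tempDir, Path finalDir) throws IOException {
    Files.move(tempDir, finalDir, StandardCopyOption.ATOMIC_MOVE);
  }
}
```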

There are three possible ways to implement it.

The first way is to change the naming scheme for the files Tajo creates. Instead of 
{{part-00000}}, we would use names like {{UUID_000000}}, where all files 
generated by a single INSERT INTO share the same prefix. The prefix consists 
of a UUID and the query id. This guarantees that a new INSERT INTO will not 
stomp on data produced by an earlier query. After a query finishes successfully, 
Tajo would delete all files that don't begin with the same UUID. Of course, when 
executing an INSERT INTO statement, Tajo never deletes existing files. But if the 
query fails or is killed, Tajo would delete all files that begin with the same UUID. I 
was inspired by Qubole's slides 
(http://www.slideshare.net/qubolemarketing/new-york-city-hadoop-meetup-4-232015).
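The naming scheme above could be sketched as follows. The method names and the query id format are illustrative assumptions, not existing Tajo APIs.

```java
import java.util.UUID;

// Hypothetical sketch of the UUID-prefixed naming scheme.
public class UuidNaming {
  // Prefix shared by every file a single query writes,
  // e.g. "<uuid>_q_1459068000000_0001".
  static String queryPrefix(String queryId) {
    return UUID.randomUUID().toString() + "_" + queryId;
  }

  // File written by task N of this query, e.g. "<prefix>_000003".
  static String outputFileName(String prefix, int taskId) {
    return String.format("%s_%06d", prefix, taskId);
  }

  // True when a file was written by this query; such files are the ones
  // removed if the query fails or is killed.
  static boolean writtenByQuery(String fileName, String prefix) {
    return fileName.startsWith(prefix);
  }
}
```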

The second way is to store the inserted file names and the existing file names in 
tables of {{CatalogStore}} or in member variables of {{TaskAttemptContext}}. Before 
inserting files, Tajo would record the existing file names in some storage. And 
whenever a task attempt finishes, Tajo would record the inserted file names there 
as well. Tajo would then delete or keep files, using the stored file names, 
according to the final query status.
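The bookkeeping in this approach could look like the sketch below. In Tajo this state would live in {{CatalogStore}} tables or {{TaskAttemptContext}} fields; the class here is illustrative only.

```java
import java.util.*;

// Hypothetical sketch of the file-name bookkeeping for the second approach.
public class InsertFileTracker {
  private final Set<String> existing = new HashSet<>();
  private final Set<String> inserted = new LinkedHashSet<>();

  // Snapshot the destination directory before the INSERT begins.
  void recordExisting(Collection<String> fileNames) { existing.addAll(fileNames); }

  // Called whenever a task attempt finishes writing a file.
  void recordInserted(String fileName) { inserted.add(fileName); }

  // On failure, only this query's files are deleted; pre-existing data is
  // left untouched. On success, nothing is deleted.
  Set<String> filesToDelete(boolean querySucceeded) {
    return querySucceeded ? Collections.emptySet() : new LinkedHashSet<>(inserted);
  }
}
```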

The other way is to write the data to local disk. This output committer works as 
follows:

* Each task writes its output to local disk instead of S3 (in a CTAS statement or 
INSERT statement).
* Copies the first successful task's temp directory to S3.

For reference, I was inspired by Netflix's slides on integrating Spark 
(http://www.slideshare.net/piaozhexiu/netflix-integrating-spark-at-petabyte-scale-53391704).
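The third approach could be sketched as below. {{Files.copy}} stands in for the real S3 upload, and the class and method names are assumptions.

```java
import java.io.IOException;
import java.nio.file.*;

// Hypothetical sketch of the third approach: tasks write to local disk and
// the committer uploads the finished files to the final (S3) destination.
public class LocalThenCopyCommitter {
  static void commitTask(Path localTaskDir, Path finalDir) throws IOException {
    Files.createDirectories(finalDir);
    try (DirectoryStream<Path> outputs = Files.newDirectoryStream(localTaskDir)) {
      for (Path file : outputs) {
        // One upload per completed file; no S3-side rename or listing needed.
        Files.copy(file, finalDir.resolve(file.getFileName().toString()),
                   StandardCopyOption.REPLACE_EXISTING);
      }
    }
  }
}
```

Copying only completed files avoids the list-then-rename step that trips over S3's eventual consistency.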

I would like to implement DirectOutputCommitter using the first approach.
Please feel free to comment if you have any questions/ideas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
