Jaehwa Jung created TAJO-2087:
---------------------------------
Summary: Implement DirectOutputCommitter
Key: TAJO-2087
URL: https://issues.apache.org/jira/browse/TAJO-2087
Project: Tajo
Issue Type: Sub-task
Components: QueryMaster, S3
Reporter: Jaehwa Jung
Assignee: Jaehwa Jung
Currently, Tajo output committer works as following:
* Each task write output to a temp directory.
* {{FileTablespace::commitTable}} renames first successful task's temp
directory to final destination.
But above approach will occurs {{FileNotFoundException}} because of eventual
consistency of S3. To resolve it, we need to implement DirectOutputCommitter.
There may be three different ways for implement it.
First way is changing the name scheme for the files Tajo creates. Instead of
{{part-00000}} we should use names like {{UUID_000000}} where all files
generated by a single insert into use the same prefix. The prefix is consists
of UUID and each query id. It will guarantees that a new insert into will not
stomp on data produced by an earlier query. After finishing query successfully,
Tajo will delete all files that don't begin with same UUID. Of course, when
executing the insert into statement, Tajo never delete existing files. But if
query failed or killed, Tajo will delete all file that begin with same UUID. I
was inspired by Qubole's slide
(http://www.slideshare.net/qubolemarketing/new-york-city-hadoop-meetup-4-232015)
Second way is storing insert file names and existing file names name to tables
of {{CatalogStore}} or member variables of {{TaskAttemptContext}}. Before
inserting files, Tajo will store existing file names to some storage. And
whenever finishing task attempt, Tajo will store insert file names to some
storage. And Tajo will delete or maintain files using stored file names
according to query final status.
Other way is writing the data to local disk. This output committer works as
follows:
* Each task write output to local disk instead of S3 (in CTAS statement or
INERT statement)
* Copies first successful task's temp directory to S3.
For the reference, I was inspired by Netflix integrating spark
slide(http://www.slideshare.net/piaozhexiu/netflix-integrating-spark-at-petabyte-scale-53391704).
I wish to implement DirectOutputCommitter with the first way.
Please feel free to comment if you have any questions/ideas.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)