[
https://issues.apache.org/jira/browse/TAJO-1905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143060#comment-15143060
]
ASF GitHub Bot commented on TAJO-1905:
--------------------------------------
GitHub user blrunner opened a pull request:
https://github.com/apache/tajo/pull/959
TAJO-1905: Insert clause to partitioned table fails on S3
Currently, the Tajo output committer works as follows:
* Each task writes its output to a temp directory.
* ``FileTablespace::commitTable`` renames the first successful task's temp
directory to the final destination.
However, this approach causes a FileNotFoundException because of the eventual
consistency of S3. To resolve it, I implemented an output committer for S3, which
works as follows:
* Each task writes its output to local disk instead of S3 (in a CTAS statement or
INSERT statement)
* ``S3TableSpace::commitTable`` copies the first successful task's temp
directory to S3.
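The two steps above can be sketched roughly as follows. This is only an illustration, not the code in the pull request: the class and method names are hypothetical, and a second local directory stands in for the S3 destination so the sketch needs only the JDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: tasks write to local disk, commit copies to the final destination.
class LocalStagingCommitSketch {

    // Copies every regular file under localTaskDir to the same relative path
    // under finalDir and returns the sorted relative paths that were copied.
    static List<String> commit(Path localTaskDir, Path finalDir) {
        List<String> copied = new ArrayList<>();
        try (Stream<Path> walk = Files.walk(localTaskDir)) {
            List<Path> files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            for (Path src : files) {
                Path rel = localTaskDir.relativize(src);
                Path dst = finalDir.resolve(rel);
                Files.createDirectories(dst.getParent());
                // A copy, not a rename: the source is never expected to already
                // exist on the (assumed eventually consistent) destination store.
                Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                copied.add(rel.toString().replace('\\', '/'));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(copied);
        return copied;
    }

    // Usage example: one task writes a single partition file locally, then commits it.
    static List<String> demo() {
        try {
            Path local = Files.createTempDirectory("task-output");
            Path dest = Files.createTempDirectory("final-dest"); // stand-in for S3
            Path part = local.resolve("l_shipdate=1996-01-30").resolve("part-0");
            Files.createDirectories(part.getParent());
            Files.write(part, "row\n".getBytes(StandardCharsets.UTF_8));
            return commit(local, dest);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the sketch is that the commit step reads only from local disk, so it does not rely on listing or renaming objects that were just written to S3.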
This PR depends on https://github.com/apache/tajo/pull/952. CTAS and INSERT
statements for partitioned tables ran successfully with this PR. For
reference, I was inspired by the Netflix slides on integrating Spark
(http://www.slideshare.net/piaozhexiu/netflix-integrating-spark-at-petabyte-scale-53391704).
To resolve this issue fundamentally, each task needs to write its output to the
final destination, and we need to implement a pluggable output committer. But
that looks like long-term work. I think this PR can serve as an interim step
toward the pluggable output committer.
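As a rough idea of what "pluggable" could mean here, the sketch below selects a commit strategy by the URI scheme of the final destination. Everything in it is an assumption for illustration: the interface, the registry, and the scheme-to-strategy mapping are hypothetical and are not part of Tajo or of this pull request. The committers only return a description of the action they would take.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a committer registry keyed by URI scheme.
class PluggableCommitterSketch {

    interface OutputCommitter {
        // Returns a short description of the action this committer would take.
        String commit(String taskOutput, String finalDestination);
    }

    static final Map<String, OutputCommitter> REGISTRY = new HashMap<>();
    static {
        // Assumed mapping: rename on HDFS, copy from local disk for S3.
        REGISTRY.put("hdfs", (src, dst) -> "rename " + src + " -> " + dst);
        REGISTRY.put("s3a", (src, dst) -> "copy " + src + " -> " + dst);
    }

    // Looks up the committer registered for the scheme of the given URI.
    static OutputCommitter forUri(String uri) {
        String scheme = URI.create(uri).getScheme();
        OutputCommitter committer = REGISTRY.get(scheme);
        if (committer == null) {
            throw new IllegalArgumentException("No committer registered for scheme: " + scheme);
        }
        return committer;
    }
}
```

With such a registry, a new storage backend would only need to register its own committer instead of changing the query-completion code.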
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/blrunner/tajo TAJO-1905
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/tajo/pull/959.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #959
----
> Insert clause to partitioned table fails on S3
> ----------------------------------------------
>
> Key: TAJO-1905
> URL: https://issues.apache.org/jira/browse/TAJO-1905
> Project: Tajo
> Issue Type: Sub-task
> Components: QueryMaster, S3
> Reporter: Jinho Kim
> Assignee: Jaehwa Jung
> Fix For: 0.12.0
>
>
> Here is the error log
> {noformat}
> 2015-10-02 18:54:40,399 ERROR org.apache.hadoop.fs.s3a.S3AFileSystem: rename:
> src not found
> s3a://bucket/tpch-1g-p/lineitem/.staging/q_1443779192380_0001/RESULT/l_shipdate=1996-01-30
> 2015-10-02 18:54:51,357 ERROR org.apache.hadoop.fs.s3a.S3AFileSystem: rename:
> src not found
> s3a://bucket/tpch-1g-p/lineitem/.staging/q_1443779192380_0001/RESULT/l_shipdate=1993-11-09
> 2015-10-02 18:55:03,955 ERROR org.apache.tajo.querymaster.Query: No such file
> or directory: s3a://bucket/lineitem/l_shipdate=1994-02-02
> java.io.FileNotFoundException: No such file or directory:
> s3a://bucket/tpch-1g-p/lineitem/l_shipdate=1994-02-02
> at
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:996)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
> at org.apache.hadoop.fs.FileSystem.getContentSummary(FileSystem.java:1467)
> at
> org.apache.tajo.querymaster.Query$QueryCompletedTransition.getPartitionsWithContentsSummary(Query.java:550)
> at
> org.apache.tajo.querymaster.Query$QueryCompletedTransition.finalizeQuery(Query.java:512)
> at
> org.apache.tajo.querymaster.Query$QueryCompletedTransition.transition(Query.java:446)
> at
> org.apache.tajo.querymaster.Query$QueryCompletedTransition.transition(Query.java:435)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at org.apache.tajo.querymaster.Query.handle(Query.java:874)
> at org.apache.tajo.querymaster.Query.handle(Query.java:63)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
> at java.lang.Thread.run(Thread.java:745)
> 2015-10-02 18:55:03,958 INFO org.apache.tajo.querymaster.Query:
> q_1443779192380_0001 Query Transitioned from QUERY_RUNNING to QUERY_ERROR
> 2015-10-02 18:55:03,958 INFO org.apache.tajo.querymaster.Query: Processing
> q_1443779192380_0001 of type DIAGNOSTIC_UPDATE
> 2015-10-02 18:55:03,958 INFO org.apache.tajo.querymaster.QueryMasterTask:
> Query completion notified from q_1443779192380_0001 final state: QUERY_ERROR
> 2015-10-02 18:55:03,960 INFO org.apache.tajo.querymaster.QueryMasterTask:
> Stopping QueryMasterTask:q_1443779192380_0001
> 2015-10-02 18:55:03,960 INFO org.apache.tajo.querymaster.QueryMasterTask:
> Cleanup resources of all workers. Query: q_1443779192380_0001, workers: 1
> 2015-10-02 18:55:03,962 INFO org.apache.tajo.querymaster.QueryMasterTask:
> Stopped QueryMasterTask:q_1443779192380_0001
> {noformat}