Sergey Shelukhin created TEZ-2587:
-------------------------------------
Summary: Tez should provide attemptId (or some other ways of
linking multiple threads for the same task)
Key: TEZ-2587
URL: https://issues.apache.org/jira/browse/TEZ-2587
Project: Apache Tez
Issue Type: Bug
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth
There are at least two threads calling Hive code for every task. Thread #1
{noformat}
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:303)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:189)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:131)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:97)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:152)
at org.apache.tez.mapreduce.lib.MRReaderMapred.<init>(MRReaderMapred.java:73)
at org.apache.tez.mapreduce.input.MultiMRInput.initFromEvent(MultiMRInput.java:177)
at org.apache.tez.mapreduce.input.MultiMRInput.handleEvents(MultiMRInput.java:146)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.handleEvent(LogicalIOProcessorRuntimeTask.java:650)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.access$600(LogicalIOProcessorRuntimeTask.java:103)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.runInternal(LogicalIOProcessorRuntimeTask.java:720)
at org.apache.tez.common.RunnableWithNdc.run(RunnableWithNdc.java:35)
at java.lang.Thread.run(Thread.java:745)
{noformat}
Thread #2
{noformat}
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:349)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:71)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:60)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:60)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:35)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}
Right now, there is no way for these threads to communicate with each other or
share data.
The thread that calls the processor has access to some context objects, but the
input thread has access to nothing comparable.
Hive has used globals to work around that, but this is ugly and no longer
works when multiple tasks run in the same process.
There should be some way for the threads to talk: either the IO thread should
have access to ProcessorContext somehow, or both threads should have the
attemptId added to the supplied conf. It might then be possible to add a global
method that returns the ProcessorContext for a given attemptId; if not, we can
arrange our own ugly globals keyed by attemptId.
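
To make the last option concrete, here is a minimal sketch of what "globals by attemptId" could look like on the Hive side. It is not existing Tez or Hive code: the class name AttemptContextRegistry and its methods are hypothetical, and it assumes that both threads can obtain the same attemptId string (for example from the supplied conf, under whatever key Tez would end up using).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper, not part of Tez or Hive: a process-wide registry of
// per-attempt shared state, keyed by the attemptId string.
final class AttemptContextRegistry {
    // One shared map per task attempt, so that concurrent tasks running in
    // the same process do not see each other's state.
    private static final ConcurrentMap<String, ConcurrentMap<String, Object>> CONTEXTS =
            new ConcurrentHashMap<>();

    private AttemptContextRegistry() {
    }

    // Returns the shared state for the given attempt, creating it on first
    // use. The input thread and the processor thread both call this with the
    // same attemptId and therefore get the same map.
    static ConcurrentMap<String, Object> forAttempt(String attemptId) {
        return CONTEXTS.computeIfAbsent(attemptId, id -> new ConcurrentHashMap<>());
    }

    // Drops the state for an attempt. Something has to call this when the
    // attempt finishes, otherwise entries accumulate in a long-lived process.
    static void release(String attemptId) {
        CONTEXTS.remove(attemptId);
    }
}
```

The input thread and the processor thread would each call AttemptContextRegistry.forAttempt(attemptId) and read or write entries in the returned map. Unlike plain globals, state stays separate for tasks that share a process, but cleanup becomes the caller's responsibility, which is part of why a Tez-provided lookup would be preferable.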