[jira] [Commented] (TEZ-1421) MRCombiner throws NPE in MapredWordCount on master branch
[ https://issues.apache.org/jira/browse/TEZ-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270711#comment-14270711 ] Tsuyoshi OZAWA commented on TEZ-1421: - Sure, wait a moment. > MRCombiner throws NPE in MapredWordCount on master branch > - > > Key: TEZ-1421 > URL: https://issues.apache.org/jira/browse/TEZ-1421 > Project: Apache Tez > Issue Type: Bug >Reporter: Tsuyoshi OZAWA >Priority: Blocker > > I tested MapredWordCount against 70GB generated by RandomTextWriter. When a > Combiner runs, it throws NPE. It looks like setCombinerClass doesn't work > correctly. > {quote} > Caused by: java.lang.RuntimeException: java.lang.NullPointerException > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131) > at > org.apache.tez.mapreduce.combine.MRCombiner.runOldCombiner(MRCombiner.java:122) > at org.apache.tez.mapreduce.combine.MRCombiner.combine(MRCombiner.java:112) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager.runCombineProcessor(MergeManager.java:472) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager$InMemoryMerger.merge(MergeManager.java:605) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeThread.run(MergeThread.java:89) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
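[Editorial note] The stack trace above points at ReflectionUtils.newInstance being handed a null combiner class, which Hadoop wraps in a RuntimeException. A minimal, self-contained sketch of that failure mode, using plain JDK reflection as a stand-in for Hadoop's ReflectionUtils (names and structure are illustrative, not the actual MRCombiner code):

```java
// Sketch of the failure mode in the stack trace: reflectively instantiating
// a combiner class that was never set (i.e. null). Plain JDK reflection
// stands in for Hadoop's ReflectionUtils.newInstance, which wraps any
// reflection failure in a RuntimeException.
public class CombinerNpeSketch {
    static Object newInstance(Class<?> clazz) {
        try {
            // Throws NullPointerException here when clazz is null.
            return clazz.getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            // Mirrors ReflectionUtils: wrap the cause in a RuntimeException.
            throw new RuntimeException(e);
        }
    }

    static String describe(Class<?> combinerClass) {
        try {
            newInstance(combinerClass);
            return "ok";
        } catch (RuntimeException e) {
            // The NPE from the missing combiner class surfaces as the cause.
            return e.getCause() instanceof NullPointerException ? "NPE" : "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(null));                // NPE: combiner class not set
        System.out.println(describe(StringBuilder.class)); // ok: class resolves
    }
}
```

This matches the reporter's suspicion: if setCombinerClass is not propagated, the class lookup yields null and the first reflective call produces exactly a RuntimeException caused by a NullPointerException.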
[jira] [Commented] (TEZ-1931) Publish tez version info to Timeline
[ https://issues.apache.org/jira/browse/TEZ-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270637#comment-14270637 ] Jonathan Eagles commented on TEZ-1931: -- This will be good to go into 0.6.0, [~hitesh] > Publish tez version info to Timeline > > > Key: TEZ-1931 > URL: https://issues.apache.org/jira/browse/TEZ-1931 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah >Priority: Critical > > We are not publishing any version info to Timeline. This will be useful to > compare different dags/apps over time and also to catch issues if needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1923: -- Fix Version/s: 0.7.0 > FetcherOrderedGrouped gets into infinite loop due to memory pressure > > > Key: TEZ-1923 > URL: https://issues.apache.org/jira/browse/TEZ-1923 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Fix For: 0.7.0 > > Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch, > TEZ-1923.4.patch > > > - Ran a comparatively large job (temp table creation) at 10 TB scale. > - Turned on intermediate mem-to-mem > (tez.runtime.shuffle.memory-to-memory.enable=true and > tez.runtime.shuffle.memory-to-memory.segments=4) > - Some reducers get lots of data and quickly gets into infinite loop > {code} > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... 
> 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms > 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms > 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > {code} > Additional debug/patch statements revealed that InMemoryMerge is not invoked > appropriately and not releasing the memory back for fetchers to proceed. 
e.g > debug/patch messages are given below > {code} > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO > [fetcher [Map_1] #2] orderedgrouped.MergeManager: > Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, > mergeThreshold=708669632 <<=== InMemoryMerge would be started in this case > as commitMemory >= mergeThreshold > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO > [fetcher [Map_1] #2] orderedgrouped.MergeManager: > Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, > mergeThreshold=708669632 <<=== InMemoryMerge would *NOT* be started in this > case as commitMemory < mergeThreshold. But the usedMemory is higher than > memoryLimit. Fetchers would keep waiting indefinitely until memory is > released. InMemoryMerge will not kick in and not release memory. > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO > [fetcher [Map_1] #1] orderedgrouped.MergeManager: > Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, > mergeThreshold=708669632 <<=== InMemoryMerge would *NOT* be started in this > case as commitMemory < mergeThreshold. But the usedMemory is higher than > memoryLimit. Fetchers would keep waiting indefinitely until memory is > released. InMemoryMerge will not kick in and not release memory. > {code} > In MergeManager, in memory merging is invoked under the following condition > {code} > if (!inMemoryMerg
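[Editorial note] The stuck state shown in the debug lines above reduces to one predicate: fetchers hold more memory than the limit, but too little of it is "committed" to fire the in-memory merge, so nothing ever frees memory. A minimal sketch, with illustrative names rather than the actual MergeManager fields:

```java
// Sketch of the merge-trigger gap described in the debug output above.
// Field names are illustrative; this is not the actual MergeManager code.
public class MergeTriggerSketch {
    // Original trigger: merge only once enough data is committed.
    static boolean shouldStartInMemoryMerge(long commitMemory, long mergeThreshold) {
        return commitMemory >= mergeThreshold;
    }

    // The deadlocked state from the logs: fetchers have pushed usedMemory
    // past the limit, but commitMemory is below the threshold, so the merge
    // never starts and no memory is ever released back to the fetchers.
    static boolean isStuck(long usedMemory, long memoryLimit,
                           long commitMemory, long mergeThreshold) {
        return usedMemory > memoryLimit
            && !shouldStartInMemoryMerge(commitMemory, mergeThreshold);
    }

    public static void main(String[] args) {
        // Values taken from the second debug line above: stuck.
        System.out.println(isStuck(1273349784L, 1073741824L, 347296632L, 708669632L)); // true
        // Values from the first debug line: merge fires, so not stuck.
        System.out.println(isStuck(1551867234L, 1073741824L, 883028388L, 708669632L)); // false
    }
}
```

With the second and third debug lines above, isStuck is true: usedMemory exceeds memoryLimit while commitMemory sits below mergeThreshold, which is exactly the indefinite Status.WAIT loop in the fetcher logs.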
[jira] [Updated] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1923: -- Attachment: TEZ-1923.4.patch Right; Incorporated it in the latest patch. Will check in to master asap. > FetcherOrderedGrouped gets into infinite loop due to memory pressure > > > Key: TEZ-1923 > URL: https://issues.apache.org/jira/browse/TEZ-1923 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch, > TEZ-1923.4.patch > > > - Ran a comparatively large job (temp table creation) at 10 TB scale. > - Turned on intermediate mem-to-mem > (tez.runtime.shuffle.memory-to-memory.enable=true and > tez.runtime.shuffle.memory-to-memory.segments=4) > - Some reducers get lots of data and quickly gets into infinite loop > {code} > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... 
> 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms > 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms > 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > {code} > Additional debug/patch statements revealed that InMemoryMerge is not invoked > appropriately and not releasing the memory back for fetchers to proceed. 
e.g > debug/patch messages are given below > {code} > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO > [fetcher [Map_1] #2] orderedgrouped.MergeManager: > Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, > mergeThreshold=708669632 <<=== InMemoryMerge would be started in this case > as commitMemory >= mergeThreshold > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO > [fetcher [Map_1] #2] orderedgrouped.MergeManager: > Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, > mergeThreshold=708669632 <<=== InMemoryMerge would *NOT* be started in this > case as commitMemory < mergeThreshold. But the usedMemory is higher than > memoryLimit. Fetchers would keep waiting indefinitely until memory is > released. InMemoryMerge will not kick in and not release memory. > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO > [fetcher [Map_1] #1] orderedgrouped.MergeManager: > Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, > mergeThreshold=708669632 <<=== InMemoryMerge would *NOT* be started in this > case as commitMemory < mergeThreshold. But the usedMemory is higher than > memoryLimit. Fetchers would keep waiting indefinitely until memory is > released. InMemoryMerge will not kick in and not release memory. > {code} > In MergeManager, in memory merging is invoked under
[jira] [Commented] (TEZ-1932) Add Prakash Ramachandran to team list
[ https://issues.apache.org/jira/browse/TEZ-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270516#comment-14270516 ] Rajesh Balamohan commented on TEZ-1932: --- +1 > Add Prakash Ramachandran to team list > - > > Key: TEZ-1932 > URL: https://issues.apache.org/jira/browse/TEZ-1932 > Project: Apache Tez > Issue Type: Bug >Reporter: Prakash Ramachandran >Assignee: Prakash Ramachandran >Priority: Minor > Attachments: TEZ-1932.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1932) Add Prakash Ramachandran to team list
[ https://issues.apache.org/jira/browse/TEZ-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Ramachandran updated TEZ-1932: -- Attachment: TEZ-1932.1.patch [~rajesh.balamohan] can you review > Add Prakash Ramachandran to team list > - > > Key: TEZ-1932 > URL: https://issues.apache.org/jira/browse/TEZ-1932 > Project: Apache Tez > Issue Type: Bug >Reporter: Prakash Ramachandran >Assignee: Prakash Ramachandran >Priority: Minor > Attachments: TEZ-1932.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1932) Add Prakash Ramachandran to team list
Prakash Ramachandran created TEZ-1932: - Summary: Add Prakash Ramachandran to team list Key: TEZ-1932 URL: https://issues.apache.org/jira/browse/TEZ-1932 Project: Apache Tez Issue Type: Bug Reporter: Prakash Ramachandran Assignee: Prakash Ramachandran Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1931) Publish tez version info to Timeline
[ https://issues.apache.org/jira/browse/TEZ-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270392#comment-14270392 ] Hitesh Shah commented on TEZ-1931: -- [~jeagles] Should have a patch soon. Are you ok with putting this into 0.6.0 ? > Publish tez version info to Timeline > > > Key: TEZ-1931 > URL: https://issues.apache.org/jira/browse/TEZ-1931 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah >Priority: Critical > > We are not publishing any version info to Timeline. This will be useful to > compare different dags/apps over time and also to catch issues if needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-1931) Publish tez version info to Timeline
Hitesh Shah created TEZ-1931: Summary: Publish tez version info to Timeline Key: TEZ-1931 URL: https://issues.apache.org/jira/browse/TEZ-1931 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Priority: Critical We are not publishing any version info to Timeline. This will be useful to compare different dags/apps over time and also to catch issues if needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270281#comment-14270281 ] Siddharth Seth commented on TEZ-1923: - +1. Looks good. Thanks [~rajesh.balamohan] Minor {code}+ LOG.info("Starting inMemoryMerger's merge since commitMemory=" + + commitMemory + " > mergeThreshold=" + mergeThreshold + + ". Current usedMemory=" + usedMemory); {code} This log line can be misleading since startMemToDiskMerge may not start the merge if another is already running. Should be after the condition in startMemToDiskMerge. > FetcherOrderedGrouped gets into infinite loop due to memory pressure > > > Key: TEZ-1923 > URL: https://issues.apache.org/jira/browse/TEZ-1923 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch > > > - Ran a comparatively large job (temp table creation) at 10 TB scale. > - Turned on intermediate mem-to-mem > (tez.runtime.shuffle.memory-to-memory.enable=true and > tez.runtime.shuffle.memory-to-memory.segments=4) > - Some reducers get lots of data and quickly gets into infinite loop > {code} > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... 
> 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms > 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms > 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > {code} > Additional debug/patch statements revealed that InMemoryMerge is not invoked > appropriately and not releasing the memory back for fetchers to proceed. 
e.g > debug/patch messages are given below > {code} > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO > [fetcher [Map_1] #2] orderedgrouped.MergeManager: > Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, > mergeThreshold=708669632 <<=== InMemoryMerge would be started in this case > as commitMemory >= mergeThreshold > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO > [fetcher [Map_1] #2] orderedgrouped.MergeManager: > Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, > mergeThreshold=708669632 <<=== InMemoryMerge would *NOT* be started in this > case as commitMemory < mergeThreshold. But the usedMemory is higher than > memoryLimit. Fetchers would keep waiting indefinitely until memory is > released. InMemoryMerge will not kick in and not release memory. > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO > [fetcher [Map_1] #1] orderedgrouped.MergeManager: > Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, > mergeThreshold=
[jira] [Updated] (TEZ-1274) Remove Key/Value type checks in IFile
[ https://issues.apache.org/jira/browse/TEZ-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1274: -- Fix Version/s: 0.7.0 > Remove Key/Value type checks in IFile > - > > Key: TEZ-1274 > URL: https://issues.apache.org/jira/browse/TEZ-1274 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Rajesh Balamohan > Fix For: 0.7.0 > > Attachments: TEZ-1274.1.patch > > > We check key and value types for each record - this should be removed from > the tight loop. Maybe an assertion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1923: -- Attachment: TEZ-1923.3.patch Addressing review comments. > FetcherOrderedGrouped gets into infinite loop due to memory pressure > > > Key: TEZ-1923 > URL: https://issues.apache.org/jira/browse/TEZ-1923 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch > > > - Ran a comparatively large job (temp table creation) at 10 TB scale. > - Turned on intermediate mem-to-mem > (tez.runtime.shuffle.memory-to-memory.enable=true and > tez.runtime.shuffle.memory-to-memory.segments=4) > - Some reducers get lots of data and quickly gets into infinite loop > {code} > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... 
> 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms > 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms > 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > {code} > Additional debug/patch statements revealed that InMemoryMerge is not invoked > appropriately and not releasing the memory back for fetchers to proceed. 
e.g > debug/patch messages are given below > {code} > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO > [fetcher [Map_1] #2] orderedgrouped.MergeManager: > Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, > mergeThreshold=708669632 <<=== InMemoryMerge would be started in this case > as commitMemory >= mergeThreshold > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO > [fetcher [Map_1] #2] orderedgrouped.MergeManager: > Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, > mergeThreshold=708669632 <<=== InMemoryMerge would *NOT* be started in this > case as commitMemory < mergeThreshold. But the usedMemory is higher than > memoryLimit. Fetchers would keep waiting indefinitely until memory is > released. InMemoryMerge will not kick in and not release memory. > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO > [fetcher [Map_1] #1] orderedgrouped.MergeManager: > Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, > mergeThreshold=708669632 <<=== InMemoryMerge would *NOT* be started in this > case as commitMemory < mergeThreshold. But the usedMemory is higher than > memoryLimit. Fetchers would keep waiting indefinitely until memory is > released. InMemoryMerge will not kick in and not release memory. > {code} > In MergeManager, in memory merging is invoked under the following condition > {code} > if (!inMemoryMerger.isInProgre
[jira] [Commented] (TEZ-1421) MRCombiner throws NPE in MapredWordCount on master branch
[ https://issues.apache.org/jira/browse/TEZ-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270181#comment-14270181 ] Jonathan Eagles commented on TEZ-1421: -- [~ozawa], can you verify this still exists? I have tried to reproduce this using the setup you described but am unable to find this NPE. If you are still able to reproduce, please specify the steps needed for setup. > MRCombiner throws NPE in MapredWordCount on master branch > - > > Key: TEZ-1421 > URL: https://issues.apache.org/jira/browse/TEZ-1421 > Project: Apache Tez > Issue Type: Bug >Reporter: Tsuyoshi OZAWA >Priority: Blocker > > I tested MapredWordCount against 70GB generated by RandomTextWriter. When a > Combiner runs, it throws NPE. It looks like setCombinerClass doesn't work > correctly. > {quote} > Caused by: java.lang.RuntimeException: java.lang.NullPointerException > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131) > at > org.apache.tez.mapreduce.combine.MRCombiner.runOldCombiner(MRCombiner.java:122) > at org.apache.tez.mapreduce.combine.MRCombiner.combine(MRCombiner.java:112) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager.runCombineProcessor(MergeManager.java:472) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager$InMemoryMerger.merge(MergeManager.java:605) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeThread.run(MergeThread.java:89) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1904) Fix findbugs warnings in tez-runtime-library
[ https://issues.apache.org/jira/browse/TEZ-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1904: Attachment: TEZ-1904.3.txt Updated to fix some more inconsistent_sync warnings. > Fix findbugs warnings in tez-runtime-library > > > Key: TEZ-1904 > URL: https://issues.apache.org/jira/browse/TEZ-1904 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Hitesh Shah >Assignee: Siddharth Seth > Attachments: TEZ-1904.1.txt, TEZ-1904.2.txt, TEZ-1904.3.txt > > > https://builds.apache.org/job/PreCommit-Tez-Build/8/artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1929) AM intermittently sending kill signal to running task in heartbeat
[ https://issues.apache.org/jira/browse/TEZ-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269943#comment-14269943 ] Siddharth Seth commented on TEZ-1929: - [~rajesh.balamohan] - do you have the AM logs as well? This could be a result of pre-emption - either because the wrong task is running, or because YARN decided the application is over its resource limit. > AM intermittently sending kill signal to running task in heartbeat > -- > > Key: TEZ-1929 > URL: https://issues.apache.org/jira/browse/TEZ-1929 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan > Attachments: Screen Shot 2015-01-08 at 2.09.11 PM.png, Screen Shot > 2015-01-08 at 2.28.04 PM.png, tasklog.txt > > > Observed this behavior 3 or 4 times > - Ran a hive query with tez (query_17 at 10 TB scale) > - Occasionally, Map_7 task will get into failed state in the middle of > fetching data from other sources (only one task is available in Map_7). > {code} > 2015-01-08 00:19:10,289 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: > Completed fetch for attempt: InputAttemptIdentifier > [inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, > pathComponent=attempt_142126204_0233_1_06_00_0_10003] to MEMORY, > CompressedSize=6757, DecompressedSize=16490,EndTime=1420705150289, > TimeTaken=5, Rate=1.29 MB/s > 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: All > inputs fetched for input vertex : Map 6 > 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: copy(0 > of 1. 
Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s) > 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: > Shutting down FetchScheduler, Was Interrupted: false > 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: > Scheduler thread completed > 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: > Received should die response from AM > 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Asked > to die via task heartbeat > 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Interrupted while > waiting for task to complete. Interrupting task > 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Shutdown requested... > returning > 2015-01-08 00:19:41,987 INFO [main] task.TezChild: Got a shouldDie > notification via hearbeats. Shutting down > 2015-01-08 00:19:41,990 ERROR [TezChild] tez.TezProcessor: > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048) > at > org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120) > at > org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83) > at > org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:106) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:153) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328) > at > org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179) > at > 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171) > {code} > From the initial look, it appears that TaskAttemptListenerImpTezDag.heartbeat > is unable to identify the containerId from registeredContainers. Need to > verify this. > I will attach the sample task log and the tez-ui details. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
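[Editorial note] The reporter's working theory above is that the heartbeat handler cannot find the task's containerId among the registered containers and therefore answers with a kill signal. A minimal sketch of that suspected logic (hypothetical names; not the actual TaskAttemptListenerImpTezDag code, and the cause is unverified in this thread):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the suspected behavior: a heartbeat from a container the AM no
// longer has registered draws a "should die" response, even though the task
// is still running. Hypothetical names and structure.
public class HeartbeatSketch {
    private final Set<String> registeredContainers = new HashSet<>();

    void register(String containerId)   { registeredContainers.add(containerId); }
    void unregister(String containerId) { registeredContainers.remove(containerId); }

    // true means the task is told to die on its next heartbeat.
    boolean shouldDie(String containerId) {
        return !registeredContainers.contains(containerId);
    }

    public static void main(String[] args) {
        HeartbeatSketch am = new HeartbeatSketch();
        String c = "container_0001"; // hypothetical container id
        am.register(c);
        System.out.println(am.shouldDie(c)); // false: container is known
        am.unregister(c);                    // e.g. a racy or premature deregistration
        System.out.println(am.shouldDie(c)); // true: task now receives the kill signal
    }
}
```

If a deregistration races ahead of the task's final heartbeat, the task sees exactly the "Received should die response from AM" line in the log above.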
[jira] [Created] (TEZ-1930) Merger (MemToDiskMerge) can exceed memory limits in a corner case
Siddharth Seth created TEZ-1930: --- Summary: Merger (MemToDiskMerge) can exceed memory limits in a corner case Key: TEZ-1930 URL: https://issues.apache.org/jira/browse/TEZ-1930 Project: Apache Tez Issue Type: Bug Reporter: Siddharth Seth Merger.reserve allows one segment to go over the allocated memory. If the segment size is large, this can be problematic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
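[Editorial note] The corner case described above can be illustrated with a toy reserve: if reserve() admits any request made while usage is still below the limit, a single large segment can overshoot the budget by almost its whole size. A minimal sketch with hypothetical names, not the actual Merger.reserve code:

```java
// Sketch of the over-reservation corner case: a reserve() that admits any
// request while usage is under the limit can exceed the limit by up to one
// segment. Hypothetical names; not the actual Merger code.
public class ReserveSketch {
    long used;
    final long limit;

    ReserveSketch(long limit) { this.limit = limit; }

    // Grants the request whenever current usage is below the limit, even when
    // granting it overshoots -- so one large segment can blow the budget.
    boolean reserve(long segmentSize) {
        if (used >= limit) {
            return false;       // caller must wait for a merge to free memory
        }
        used += segmentSize;    // may now exceed 'limit' substantially
        return true;
    }

    public static void main(String[] args) {
        ReserveSketch m = new ReserveSketch(1024);
        m.reserve(1000);        // used = 1000, still under the limit
        m.reserve(4096);        // admitted, because usage was under the limit
        System.out.println(m.used); // 5096, far above the 1024 limit
    }
}
```

With a small segment followed by one large one, usage lands at 5096 against a limit of 1024, which is the "can be problematic if the segment size is large" case in the description.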
[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269939#comment-14269939 ] Siddharth Seth commented on TEZ-1923: - {code} + if (usedMemory > memoryLimit) { +LOG.info("Starting inMemoryMerger's merge since usedMemory=" + +memoryLimit + " > memoryLimit=" + memoryLimit + +". commitMemory=" + commitMemory + ", mergeThreshold=" + mergeThreshold); +startMemToDiskMerge(); + } {code} This will, at best, attempt to start the memToDiskMerger - there's no guarantee that it'll actually run since one may already be in progress. It ends up not waiting for the MemToMemMerger to complete - which would free up some memory - and potentially trigger another merge based on thresholds. The usedMemory at this point will be determined by a race between the current thread and the memtomemmerge thread (whether the unconditional reserve has been done yet or not). Meanwhile, Fetchers block in any case - since memory isn't available. I think it's better to leave this section of the patch out - to be fixed in the MemToMem merger jiras. {code} + if ((usedMemory + mergeOutputSize) > memoryLimit) { +LOG.info("Not enough memory to carry out mem-to-mem merging. usedMemory=" + usedMemory + +" > memoryLimit=" + memoryLimit); +return; + } {code} usedMemory may not be visible correctly - since it isn't inside the main MergeManager lock. This could also be part of the MemToMemMerger fixes. {code}merger.waitForShuffleToMergeMemory();{code} Would this be a problem in terms of connection timeouts - since this wait is while the connection is established? This could be in the run() method similar to merger.waitForInMemoryMerge() instead. 
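[Editorial note] The first review point above is that logging "Starting ... merge" before checking whether a merge is already in progress can claim a start that never happens. A minimal sketch of the suggested ordering (log only after the in-progress check succeeds); hypothetical structure, not the actual MergeManager patch:

```java
// Sketch of the review comment: emit the "starting merge" log only after
// confirming no merge is already running, so the log reflects reality.
// Hypothetical names; not the actual MergeManager code.
public class MergeStartSketch {
    private boolean mergeInProgress;
    int startsLogged;   // stands in for the number of LOG.info("Starting...") lines

    // Returns true only when a new merge is actually started (and logged).
    synchronized boolean startMemToDiskMerge(long commitMemory, long mergeThreshold) {
        if (mergeInProgress) {
            return false;   // a merge is running: no new start, and crucially no log
        }
        mergeInProgress = true;
        startsLogged++;     // the log line belongs here, after the check
        return true;
    }

    synchronized void mergeFinished() {
        mergeInProgress = false;
    }

    public static void main(String[] args) {
        MergeStartSketch m = new MergeStartSketch();
        m.startMemToDiskMerge(883028388L, 708669632L); // starts; logged once
        m.startMemToDiskMerge(900000000L, 708669632L); // already running; not logged
        m.mergeFinished();
        m.startMemToDiskMerge(900000000L, 708669632L); // starts again; logged
        System.out.println(m.startsLogged);            // 2
    }
}
```

Logging before the check would have reported three starts for two actual merges, which is the misleading behavior the review flags.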
> FetcherOrderedGrouped gets into infinite loop due to memory pressure > > > Key: TEZ-1923 > URL: https://issues.apache.org/jira/browse/TEZ-1923 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch > > > - Ran a comparatively large job (temp table creation) at 10 TB scale. > - Turned on intermediate mem-to-mem > (tez.runtime.shuffle.memory-to-memory.enable=true and > tez.runtime.shuffle.memory-to-memory.segments=4) > - Some reducers get lots of data and quickly gets into infinite loop > {code} > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms > 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... 
> 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms > 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > {code} > Additional debug/patch statements revealed that InMemoryMerge is not invoked > appropriately and not
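The tight retry loop visible in the log above (WAIT answered within 1-5 ms, then an immediate re-fetch of the same map output) can be sketched as follows. All names here are illustrative, not the real FetcherOrderedGrouped code; the sketch only shows the general shape of bounding the retry rate with a backoff instead of spinning.

```java
import java.util.Iterator;

// Hypothetical sketch: a fetch loop that backs off on WAIT rather than
// immediately re-contacting the same host.
public class FetchLoopSketch {
    enum Status { OK, WAIT }

    // Polls until OK, sleeping between WAIT responses instead of spinning.
    // Returns the number of polls made.
    static int fetchWithBackoff(Iterator<Status> responses, long backoffMillis) {
        int polls = 0;
        while (responses.hasNext()) {
            polls++;
            if (responses.next() == Status.OK) {
                break;
            }
            try {
                Thread.sleep(backoffMillis);   // yield instead of hammering the host
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return polls;
    }
}
```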
[jira] [Commented] (TEZ-1274) Remove Key/Value type checks in IFile
[ https://issues.apache.org/jira/browse/TEZ-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269872#comment-14269872 ] Siddharth Seth commented on TEZ-1274: - +1. Looks good. Minor: please remove the unused constants (WRONG_KEY_CLASS, etc.) before commit. > Remove Key/Value type checks in IFile > - > > Key: TEZ-1274 > URL: https://issues.apache.org/jira/browse/TEZ-1274 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Rajesh Balamohan > Attachments: TEZ-1274.1.patch > > > We check key and value types for each record - this should be removed from > the tight loop. Maybe an assertion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1904) Fix findbugs warnings in tez-runtime-library
[ https://issues.apache.org/jira/browse/TEZ-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1904: Attachment: TEZ-1904.2.txt Thanks for taking a look. Updated patch with comments addressed. > Fix findbugs warnings in tez-runtime-library > > > Key: TEZ-1904 > URL: https://issues.apache.org/jira/browse/TEZ-1904 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Hitesh Shah >Assignee: Siddharth Seth > Attachments: TEZ-1904.1.txt, TEZ-1904.2.txt > > > https://builds.apache.org/job/PreCommit-Tez-Build/8/artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1904) Fix findbugs warnings in tez-runtime-library
[ https://issues.apache.org/jira/browse/TEZ-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269485#comment-14269485 ] Rajesh Balamohan commented on TEZ-1904: --- Minor comments:
- In PipelinedSorter, comparator isn't used in SpanMerger's constructor. Should we remove it?
- In SecureShuffleUtils, toHex() is not used anywhere. Should we remove it, if relevant?
- IFileInputStream.getChecksum() is not used anywhere. Should we remove it, if relevant?
> Fix findbugs warnings in tez-runtime-library > > > Key: TEZ-1904 > URL: https://issues.apache.org/jira/browse/TEZ-1904 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Hitesh Shah >Assignee: Siddharth Seth > Attachments: TEZ-1904.1.txt > > > https://builds.apache.org/job/PreCommit-Tez-Build/8/artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1274) Remove Key/Value type checks in IFile
[ https://issues.apache.org/jira/browse/TEZ-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1274: -- Attachment: TEZ-1274.1.patch [~sseth] Please review when you find time. Removed the unwanted checks. It is left to the caller to ensure that proper type checks and length checks are done (since it's all in Tez code, it should be fine removing these checks). We also don't want the length checks (if negative key/value lengths are passed, the output stream would throw IndexOutOfBoundsException anyway). > Remove Key/Value type checks in IFile > - > > Key: TEZ-1274 > URL: https://issues.apache.org/jira/browse/TEZ-1274 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Rajesh Balamohan > Attachments: TEZ-1274.1.patch > > > We check key and value types for each record - this should be removed from > the tight loop. Maybe an assertion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
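The direction suggested in the issue description ("Maybe an assertion") can be sketched like this. IFileSketch and its members are illustrative, not the real IFile.Writer: per-record instanceof checks in the append path are replaced with Java assert statements, which compile to no-ops unless the JVM is run with -ea, so the tight loop pays nothing in production.

```java
// Hypothetical sketch; not the real IFile.Writer.
public class IFileSketch {
    private final Class<?> keyClass;
    private final Class<?> valueClass;
    private int records = 0;

    IFileSketch(Class<?> keyClass, Class<?> valueClass) {
        this.keyClass = keyClass;
        this.valueClass = valueClass;
    }

    void append(Object key, Object value) {
        // Formerly a per-record check that threw IOException on mismatch;
        // as an assertion it only fires when assertions are enabled (-ea).
        assert keyClass.isInstance(key) : "wrong key class: " + key.getClass();
        assert valueClass.isInstance(value) : "wrong value class: " + value.getClass();
        records++;  // ... serialize key/value here ...
    }

    int getRecordCount() { return records; }
}
```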
[jira] [Updated] (TEZ-1929) AM intermittently sending kill signal to running task in heartbeat
[ https://issues.apache.org/jira/browse/TEZ-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1929: -- Attachment: tasklog.txt Screen Shot 2015-01-08 at 2.09.11 PM.png Screen Shot 2015-01-08 at 2.28.04 PM.png > AM intermittently sending kill signal to running task in heartbeat > -- > > Key: TEZ-1929 > URL: https://issues.apache.org/jira/browse/TEZ-1929 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan > Attachments: Screen Shot 2015-01-08 at 2.09.11 PM.png, Screen Shot > 2015-01-08 at 2.28.04 PM.png, tasklog.txt > > > Observed this behavior 3 or 4 times > - Ran a hive query with tez (query_17 at 10 TB scale) > - Occasionally, Map_7 task will get into failed state in the middle of > fetching data from other sources (only one task is available in Map_7). > {code} > 2015-01-08 00:19:10,289 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: > Completed fetch for attempt: InputAttemptIdentifier > [inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, > pathComponent=attempt_142126204_0233_1_06_00_0_10003] to MEMORY, > CompressedSize=6757, DecompressedSize=16490,EndTime=1420705150289, > TimeTaken=5, Rate=1.29 MB/s > 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: All > inputs fetched for input vertex : Map 6 > 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: copy(0 > of 1. 
Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s) > 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: > Shutting down FetchScheduler, Was Interrupted: false > 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: > Scheduler thread completed > 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: > Received should die response from AM > 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Asked > to die via task heartbeat > 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Interrupted while > waiting for task to complete. Interrupting task > 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Shutdown requested... > returning > 2015-01-08 00:19:41,987 INFO [main] task.TezChild: Got a shouldDie > notification via hearbeats. Shutting down > 2015-01-08 00:19:41,990 ERROR [TezChild] tez.TezProcessor: > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048) > at > org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120) > at > org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83) > at > org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:106) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:153) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328) > at > org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179) > at > 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171) > {code} > From the initial look, it appears that TaskAttemptListenerImpTezDag.heartbeat > is unable to identify the containerId from registeredContainers. Need to > verify this. > I will attach the sample task log and the tez-ui details. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
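The suspected behavior - the heartbeat handler failing to find the containerId among registered containers and answering shouldDie - can be sketched as follows. This is a speculative reconstruction for illustration only; the method and collection names are assumptions, not the real TaskAttemptListenerImpTezDag code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the suspected AM-side logic.
public class HeartbeatSketch {
    private final Set<String> registeredContainers = new HashSet<>();

    void registerContainer(String containerId) {
        registeredContainers.add(containerId);
    }

    void unregisterContainer(String containerId) {
        registeredContainers.remove(containerId);
    }

    // An unknown container is assumed stale and told to die. If a race ever
    // drops a live container's registration, a healthy running task gets the
    // shouldDie response seen in the task log above.
    boolean heartbeatShouldDie(String containerId) {
        return !registeredContainers.contains(containerId);
    }
}
```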
[jira] [Created] (TEZ-1929) AM intermittently sending kill signal to running task in heartbeat
Rajesh Balamohan created TEZ-1929: - Summary: AM intermittently sending kill signal to running task in heartbeat Key: TEZ-1929 URL: https://issues.apache.org/jira/browse/TEZ-1929 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Observed this behavior 3 or 4 times - Ran a hive query with tez (query_17 at 10 TB scale) - Occasionally, Map_7 task will get into failed state in the middle of fetching data from other sources (only one task is available in Map_7). {code} 2015-01-08 00:19:10,289 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: Completed fetch for attempt: InputAttemptIdentifier [inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, pathComponent=attempt_142126204_0233_1_06_00_0_10003] to MEMORY, CompressedSize=6757, DecompressedSize=16490,EndTime=1420705150289, TimeTaken=5, Rate=1.29 MB/s 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: All inputs fetched for input vertex : Map 6 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: copy(0 of 1. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s) 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: Shutting down FetchScheduler, Was Interrupted: false 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: Scheduler thread completed 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Received should die response from AM 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Asked to die via task heartbeat 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Interrupted while waiting for task to complete. Interrupting task 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Shutdown requested... returning 2015-01-08 00:19:41,987 INFO [main] task.TezChild: Got a shouldDie notification via hearbeats. 
Shutting down 2015-01-08 00:19:41,990 ERROR [TezChild] tez.TezProcessor: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048) at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120) at org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83) at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:106) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:153) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171) {code} >From the initial look, it appears that TaskAttemptListenerImpTezDag.heartbeat >is unable to identify the containerId from registeredContainers. Need to >verify this. I will attach the sample task log and the tez-ui details. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TEZ-1573) Exception from InputInitializer and VertexManagerPlugin is not propogated to client
[ https://issues.apache.org/jira/browse/TEZ-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang resolved TEZ-1573. - Resolution: Won't Fix Fixed in other jiras. > Exception from InputInitializer and VertexManagerPlugin is not propogated to > client > --- > > Key: TEZ-1573 > URL: https://issues.apache.org/jira/browse/TEZ-1573 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jeff Zhang >Assignee: Jeff Zhang > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1573) Exception from InputInitializer and VertexManagerPlugin is not propogated to client
[ https://issues.apache.org/jira/browse/TEZ-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268966#comment-14268966 ] Jeff Zhang commented on TEZ-1573: - It has been resolved in TEZ-1703 & TEZ-1267. > Exception from InputInitializer and VertexManagerPlugin is not propogated to > client > --- > > Key: TEZ-1573 > URL: https://issues.apache.org/jira/browse/TEZ-1573 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jeff Zhang >Assignee: Jeff Zhang > -- This message was sent by Atlassian JIRA (v6.3.4#6332)