[ 
https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486423#comment-14486423
 ] 

Chris K Wensel commented on TEZ-2237:
-------------------------------------

don't follow, from the logs, cascading logs the start of the outputs. then logs 
the starts of the inputs.

{noformat}
vertex name: E7274DEFA42F4EFD9D950071DD0A9021

2015-03-31 17:36:07,244 INFO [TezChild] tez.FlowProcessor: flow node ordinal: 
62, tez vertex name: E7274DEFA42F4EFD9D950071DD0A9021
2015-03-31 17:36:07,457 INFO [TezChild] tez.FlowProcessor: sourcing from: 
GroupBy(_pipe_346+_pipe_347)[by:[{1}:'key']] streamed: true
2015-03-31 17:36:07,457 INFO [TezChild] tez.FlowProcessor: sourcing from: 
Boundary(79A2D805A47147D087972D1BBB528A97) streamed: false
2015-03-31 17:36:07,457 INFO [TezChild] tez.FlowProcessor: sinking to: 
CoGroup(_pipe_351*_pipe_352)[by: _pipe_351:[{1}:'key0'] _pipe_352:[{1}:'key1']]
2015-03-31 17:36:07,457 INFO [TezChild] element.TezGroupGate: calling 
OrderedPartitionedKVOutput#start() on: CoGroup(_pipe_351*_pipe_352)[by: 
_pipe_351:[{1}:'key0'] _pipe_352:[{1}:'key1']] A111A3C4861D44BE93027D0C30A764DC

2015-03-31 19:02:31,751 INFO [TezChild] element.TezBoundaryStage: calling 
UnorderedKVInput#start() on: Boundary(79A2D805A47147D087972D1BBB528A97) 
79A2D805A47147D087972D1BBB528A97
2015-03-31 19:02:31,752 INFO [TezChild] element.TezGroupGate: calling 
OrderedGroupedKVInput#start() on: GroupBy(_pipe_346+_pipe_347)[by:[{1}:'key']] 
7502F02C33714606AB4B03B7614469D0, for 1 inputs
2015-03-31 19:02:31,752 INFO [TezChild] input.OrderedGroupedKVInput: 
OrderedGroupedKVInput#start(): OrderedGroupedKVInput was already started, not 
starting again!

vertex name: DAD5EEB9962B4A65B1394B186692F7BB

2015-04-01 02:49:09,880 INFO [TezChild] tez.FlowProcessor: flow node ordinal: 
78, tez vertex name: DAD5EEB9962B4A65B1394B186692F7BB
2015-04-01 02:49:10,119 INFO [TezChild] tez.FlowProcessor: sourcing from: 
GroupBy(_pipe_332+_pipe_333)[by:[{1}:'key']] streamed: true
2015-04-01 02:49:10,119 INFO [TezChild] tez.FlowProcessor: sourcing from: 
Boundary(ECCC5DB0C5C04B2EBED0FC3187C8487A) streamed: false
2015-04-01 02:49:10,119 INFO [TezChild] tez.FlowProcessor: sinking to: 
CoGroup(_pipe_336*_pipe_339)[by: _pipe_336:[{1}:'key0'] _pipe_339:[{1}:'key1']]
2015-04-01 02:49:10,119 INFO [TezChild] element.TezGroupGate: calling 
OrderedPartitionedKVOutput#start() on: CoGroup(_pipe_336*_pipe_339)[by: 
_pipe_336:[{1}:'key0'] _pipe_339:[{1}:'key1']] F7789B9D978D44B9803AF2E09AE72663

2015-04-01 02:49:10,171 INFO [TezChild] element.TezBoundaryStage: calling 
UnorderedKVInput#start() on: Boundary(ECCC5DB0C5C04B2EBED0FC3187C8487A) 
ECCC5DB0C5C04B2EBED0FC3187C8487A
2015-04-01 02:49:10,171 INFO [TezChild] element.TezGroupGate: calling 
OrderedGroupedKVInput#start() on: GroupBy(_pipe_332+_pipe_333)[by:[{1}:'key']] 
DEF94DA9BECF4A5BA6C85388B1EAAD41, for 1 inputs
2015-04-01 02:49:10,171 INFO [TezChild] input.OrderedGroupedKVInput: 
OrderedGroupedKVInput#start(): OrderedGroupedKVInput was already started, not 
starting again!
{noformat}


> Complex DAG freezes and fails (was BufferTooSmallException raised in 
> UnorderedPartitionedKVWriter then DAG lingers)
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-2237
>                 URL: https://issues.apache.org/jira/browse/TEZ-2237
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>         Environment: Debian Linux "jessie"
> OpenJDK Runtime Environment (build 1.8.0_40-internal-b27)
> OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode)
> 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system 
> disk + 4*1 or 2 TiB HDD for HDFS & local  (on-prem, dedicated hardware)
> Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220 
> to run Cascading 3.0.0-wip-90 with TEZ 0.6.0
>            Reporter: Cyrille Chépélov
>         Attachments: TEZ-2237-hack.branch6.txt, TEZ-2237-hack.master.txt, 
> TEZ-2237.test.2_branch0.6.txt, all_stacks.lst, alloc_mem.png, 
> alloc_vcores.png, application_1427324000018_1444.yarn-logs.red.txt.gz, 
> application_1427324000018_1908.red.txt.bz2, 
> application_1427964335235_2070.txt.red.txt.bz2, 
> appmaster____syslog_dag_1427282048097_0215_1.red.txt.gz, 
> appmaster____syslog_dag_1427282048097_0237_1.red.txt.gz, 
> gc_count_MRAppMaster.png, mem_free.png, noopexample_2237.txt, 
> ordered-grouped-kv-input-traces.diff, start_containers.png, 
> stop_containers.png, 
> syslog_attempt_1427282048097_0215_1_21_000014_0.red.txt.gz, 
> syslog_attempt_1427282048097_0237_1_70_000028_0.red.txt.gz, yarn_rm_flips.png
>
>
> On a specific DAG with many vertices (actually part of a larger meta-DAG), 
> after about a hour of processing, several BufferTooSmallException are raised 
> in UnorderedPartitionedKVWriter (about one every two or three spills).
> Once these exceptions are raised, the DAG remains indefinitely "active", 
> tying up memory and CPU resources as far as YARN is concerned, while little 
> if any actual processing takes place. 
> It seems two separate issues are at hand:
>   1. BufferTooSmallException are raised even though, small as the actually 
> allocated buffers seem to be (around a couple megabytes were allotted whereas 
> 100MiB were requested), the actual keys and values are never bigger than 24 
> and 1024 bytes respectively.
>   2. In the event BufferTooSmallExceptions are raised, the DAG fails to stop 
> (stop requests appear to be sent 7 hours after the BTSE exceptions are 
> raised, but 9 hours after these stop requests, the DAG was still lingering on 
> with all containers present tying up memory and CPU allocations)
> The emergence of the BTSE prevent the Cascade to complete, preventing from 
> validating the results compared to traditional MR1-based results. The lack of 
> conclusion renders the cluster queue unavailable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to