[ 
https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490392#comment-14490392
 ] 

Siddharth Seth commented on TEZ-2237:
-------------------------------------

The patch generates the correct number of empty events when an Output is not 
started. That essentially allows for an Output to be configured, but if it 
isn't used - the Output will just indicate to the system that nothing was 
generated.

The specific DAG that was hanging will not run through without the patch.

If you look at 
https://issues.apache.org/jira/secure/attachment/12724128/oneOutOfTwoOutputsStarted.txt
 - it show two configured Outputs, out of which only one is being started.

Either Cascading does not need the second output, or it may be losing data by 
not sending anything on this Output. That's something Cascading may want to 
look into.

I'm going to leave this discussion at this point, and finalize this patch over 
the next week or so.

> Complex DAG freezes and fails (was BufferTooSmallException raised in 
> UnorderedPartitionedKVWriter then DAG lingers)
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-2237
>                 URL: https://issues.apache.org/jira/browse/TEZ-2237
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>         Environment: Debian Linux "jessie"
> OpenJDK Runtime Environment (build 1.8.0_40-internal-b27)
> OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode)
> 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system 
> disk + 4*1 or 2 TiB HDD for HDFS & local  (on-prem, dedicated hardware)
> Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220 
> to run Cascading 3.0.0-wip-90 with TEZ 0.6.0
>            Reporter: Cyrille Chépélov
>         Attachments: TEZ-2237-hack.branch6.txt, TEZ-2237-hack.master.txt, 
> TEZ-2237.test.2_branch0.6.txt, all_stacks.lst, alloc_mem.png, 
> alloc_vcores.png, application_1427324000018_1444.yarn-logs.red.txt.gz, 
> application_1427324000018_1908.red.txt.bz2, 
> application_1427964335235_2070.txt.red.txt.bz2, 
> appmaster____syslog_dag_1427282048097_0215_1.red.txt.gz, 
> appmaster____syslog_dag_1427282048097_0237_1.red.txt.gz, 
> gc_count_MRAppMaster.png, mem_free.png, noopexample_2237.txt, 
> oneOutOfTwoOutputsStarted.txt, ordered-grouped-kv-input-traces.diff, 
> output-starts.txt, start_containers.png, stop_containers.png, 
> syslog_attempt_1427282048097_0215_1_21_000014_0.red.txt.gz, 
> syslog_attempt_1427282048097_0237_1_70_000028_0.red.txt.gz, yarn_rm_flips.png
>
>
> On a specific DAG with many vertices (actually part of a larger meta-DAG), 
> after about a hour of processing, several BufferTooSmallException are raised 
> in UnorderedPartitionedKVWriter (about one every two or three spills).
> Once these exceptions are raised, the DAG remains indefinitely "active", 
> tying up memory and CPU resources as far as YARN is concerned, while little 
> if any actual processing takes place. 
> It seems two separate issues are at hand:
>   1. BufferTooSmallException are raised even though, small as the actually 
> allocated buffers seem to be (around a couple megabytes were allotted whereas 
> 100MiB were requested), the actual keys and values are never bigger than 24 
> and 1024 bytes respectively.
>   2. In the event BufferTooSmallExceptions are raised, the DAG fails to stop 
> (stop requests appear to be sent 7 hours after the BTSE exceptions are 
> raised, but 9 hours after these stop requests, the DAG was still lingering on 
> with all containers present tying up memory and CPU allocations)
> The emergence of the BTSE prevent the Cascade to complete, preventing from 
> validating the results compared to traditional MR1-based results. The lack of 
> conclusion renders the cluster queue unavailable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to