[ 
https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382883#comment-14382883
 ] 

Hitesh Shah commented on TEZ-2237:
----------------------------------

Just looking at the 215 attempt log:

{code}
2015-03-25 16:59:07,931 INFO [TezChild] resources.WeightedScalingMemoryDistributor: Scaling Requests. NumRequests: 3, numScaledRequests: 14, TotalRequested: 1652555776, TotalRequestedScaled: 7.260639817142856E8, TotalJVMHeap: 859832320, TotalAvailable: 563190169, TotalRequested/TotalJVMHeap:1,92
2015-03-25 16:59:07,931 INFO [TezChild] resources.MemoryDistributor: Informing: INPUT, A204F8787224475D824CFDE543AC7BB0, org.apache.tez.runtime.library.input.UnorderedKVInput: requested=773849088, allocated=42875422
2015-03-25 16:59:07,931 INFO [TezChild] resources.MemoryDistributor: Informing: INPUT, E8AFAB9CF046437E826DAC7422935041, org.apache.tez.runtime.library.input.OrderedGroupedKVInput: requested=773849088, allocated=514505068
2015-03-25 16:59:07,931 INFO [TezChild] resources.MemoryDistributor: Informing: OUTPUT, F25A57D1DABF413880E7980733748791, org.apache.tez.runtime.library.output.UnorderedPartitionedKVOutput: requested=104857600, allocated=5809677
{code}

The output buffer is being scaled down from the requested 100 MiB 
(requested=104857600) to about 5.5 MiB (allocated=5809677). 
cc [~rajesh.balamohan] [~sseth] [~gopalv]
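
For reference, the allocations in the log can be reproduced with a small sketch. It assumes the three requests carry weights of 1, 12 and 1 (an assumption, chosen because they sum to the logged numScaledRequests of 14) and that TotalAvailable is split in proportion to requested * weight. The function and variable names are illustrative, not taken from the Tez source, and integer arithmetic is used here to keep the numbers exact.

```python
# Figures copied from the TezChild log above.
TOTAL_AVAILABLE = 563190169

# (component, requested bytes, weight).
# The weights 1/12/1 are an ASSUMPTION: they are picked because they add up
# to the logged numScaledRequests value of 14.
REQUESTS = [
    ("UnorderedKVInput", 773849088, 1),
    ("OrderedGroupedKVInput", 773849088, 12),
    ("UnorderedPartitionedKVOutput", 104857600, 1),
]


def scale_requests(requests, total_available):
    """Split total_available in proportion to requested * weight."""
    total_weighted = sum(requested * weight for _, requested, weight in requests)
    allocations = {}
    for name, requested, weight in requests:
        # Floor division keeps the sketch exact; a component is never
        # given more than it asked for.
        share = requested * weight * total_available // total_weighted
        allocations[name] = min(share, requested)
    return allocations


if __name__ == "__main__":
    for name, allocated in scale_requests(REQUESTS, TOTAL_AVAILABLE).items():
        print(f"{name}: allocated={allocated} (~{allocated / 2**20:.1f} MiB)")
```

Under these assumptions the heavily weighted OrderedGroupedKVInput takes roughly 91% of the available memory, which would explain why the 100 MiB output request ends up with only a few MiB.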

> BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG 
> lingers
> -------------------------------------------------------------------------------
>
>                 Key: TEZ-2237
>                 URL: https://issues.apache.org/jira/browse/TEZ-2237
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>         Environment: Debian Linux "jessie"
> OpenJDK Runtime Environment (build 1.8.0_40-internal-b27)
> OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode)
> 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system 
> disk + 4*1 or 2 TiB HDD for HDFS & local  (on-prem, dedicated hardware)
> Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220 
> to run Cascading 3.0.0-wip-90 with TEZ 0.6.0
>            Reporter: Cyrille Chépélov
>         Attachments: appmaster____syslog_dag_1427282048097_0215_1.red.txt.gz, 
> appmaster____syslog_dag_1427282048097_0237_1.red.txt.gz, 
> syslog_attempt_1427282048097_0215_1_21_000014_0.red.txt.gz, 
> syslog_attempt_1427282048097_0237_1_70_000028_0.red.txt.gz
>
>
> On a specific DAG with many vertices (actually part of a larger meta-DAG), 
> after about an hour of processing, several BufferTooSmallExceptions are 
> raised in UnorderedPartitionedKVWriter (about one every two or three spills).
> Once these exceptions are raised, the DAG remains indefinitely "active", 
> tying up memory and CPU resources as far as YARN is concerned, while little 
> if any actual processing takes place. 
> It seems two separate issues are at hand:
>   1. BufferTooSmallExceptions are raised even though the actual keys and 
> values are never bigger than 24 and 1024 bytes respectively, small as the 
> actually allocated buffers seem to be (around a couple of megabytes were 
> allotted whereas 100MiB were requested).
>   2. In the event BufferTooSmallExceptions are raised, the DAG fails to stop 
> (stop requests appear to be sent 7 hours after the BTSE exceptions are 
> raised, but 9 hours after these stop requests, the DAG was still lingering on 
> with all containers present tying up memory and CPU allocations)
> The BTSEs prevent the Cascade from completing, which in turn prevents 
> validating the results against traditional MR1-based results. Because the 
> DAG never terminates, the cluster queue is left unavailable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
