[
https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482945#comment-14482945
]
Cyrille Chépélov edited comment on TEZ-2237 at 4/7/15 10:13 AM:
----------------------------------------------------------------
[~sseth] to reply to your question about "Attempting to close output" messages;
*yes* there have been some emitted within the successful run:
{code}
application_1427964335235_2518.red.txt.xz:2015-04-06 15:06:48,467 WARN
[TezChild] output.OrderedPartitionedKVOutput: Attempting to close output
AAE2905F058D4DF3A82183EA916CFF8D before it was started. Generating empty events.
application_1427964335235_2518.red.txt.xz:2015-04-06 15:06:27,078 WARN
[TezChild] output.OrderedPartitionedKVOutput: Attempting to close output
A5192F6D978D41F4BA9137A9284EE2CF before it was started. Generating empty events.
application_1427964335235_2518.red.txt.xz:2015-04-06 15:06:42,864 WARN
[TezChild] output.OrderedPartitionedKVOutput: Attempting to close output
A5192F6D978D41F4BA9137A9284EE2CF before it was started. Generating empty events.
application_1427964335235_2518.red.txt.xz:2015-04-06 15:06:33,121 WARN
[TezChild] output.OrderedPartitionedKVOutput: Attempting to close output
AAE2905F058D4DF3A82183EA916CFF8D before it was started. Generating empty events.
application_1427964335235_2518.red.txt.xz:2015-04-06 15:06:48,622 WARN
[TezChild] output.OrderedPartitionedKVOutput: Attempting to close output
AAE2905F058D4DF3A82183EA916CFF8D before it was started. Generating empty events.
application_1427964335235_2518.red.txt.xz:2015-04-06 15:06:41,622 WARN
[TezChild] output.OrderedPartitionedKVOutput: Attempting to close output
A5192F6D978D41F4BA9137A9284EE2CF before it was started. Generating empty events.
application_1427964335235_2518.red.txt.xz:2015-04-06 15:06:33,057 WARN
[TezChild] output.OrderedPartitionedKVOutput: Attempting to close output
AAE2905F058D4DF3A82183EA916CFF8D before it was started. Generating empty events.
{code}
70 such messages in total, in one DAG out of the 20 in the cascade. Attaching
application_1427964335235_2518.red.txt.xz here. Same DAG as in the "0432" and
"1444" logs previously attached.
*edit*: can't attach here (even XZ-compressed, the file is > 10MiB). Please
find it here instead:
https://extranet.transparencyrights.com/owncloud/public.php?service=files&t=14e6ddee5313391d51d83037c6d26497
was (Author: cchepelov):
[~sseth] to reply to your question about "Attempting to close output" messages;
*yes* there have been some emitted within the successful run:
{code}
application_1427964335235_2518.red.txt.xz:2015-04-06 15:06:48,467 WARN
[TezChild] output.OrderedPartitionedKVOutput: Attempting to close output
AAE2905F058D4DF3A82183EA916CFF8D before it was started. Generating empty events.
application_1427964335235_2518.red.txt.xz:2015-04-06 15:06:27,078 WARN
[TezChild] output.OrderedPartitionedKVOutput: Attempting to close output
A5192F6D978D41F4BA9137A9284EE2CF before it was started. Generating empty events.
application_1427964335235_2518.red.txt.xz:2015-04-06 15:06:42,864 WARN
[TezChild] output.OrderedPartitionedKVOutput: Attempting to close output
A5192F6D978D41F4BA9137A9284EE2CF before it was started. Generating empty events.
application_1427964335235_2518.red.txt.xz:2015-04-06 15:06:33,121 WARN
[TezChild] output.OrderedPartitionedKVOutput: Attempting to close output
AAE2905F058D4DF3A82183EA916CFF8D before it was started. Generating empty events.
application_1427964335235_2518.red.txt.xz:2015-04-06 15:06:48,622 WARN
[TezChild] output.OrderedPartitionedKVOutput: Attempting to close output
AAE2905F058D4DF3A82183EA916CFF8D before it was started. Generating empty events.
application_1427964335235_2518.red.txt.xz:2015-04-06 15:06:41,622 WARN
[TezChild] output.OrderedPartitionedKVOutput: Attempting to close output
A5192F6D978D41F4BA9137A9284EE2CF before it was started. Generating empty events.
application_1427964335235_2518.red.txt.xz:2015-04-06 15:06:33,057 WARN
[TezChild] output.OrderedPartitionedKVOutput: Attempting to close output
AAE2905F058D4DF3A82183EA916CFF8D before it was started. Generating empty events.
{code}
70 such messages in total, in one DAG out of the 20 in the cascade. Attaching
application_1427964335235_2518.red.txt.xz here. Same DAG as in the "0432" and
"1444" logs previously attached.
> Complex DAG freezes and fails (was BufferTooSmallException raised in
> UnorderedPartitionedKVWriter then DAG lingers)
> -------------------------------------------------------------------------------------------------------------------
>
> Key: TEZ-2237
> URL: https://issues.apache.org/jira/browse/TEZ-2237
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Environment: Debian Linux "jessie"
> OpenJDK Runtime Environment (build 1.8.0_40-internal-b27)
> OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode)
> 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system
> disk + 4*1 or 2 TiB HDD for HDFS & local (on-prem, dedicated hardware)
> Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220
> to run Cascading 3.0.0-wip-90 with TEZ 0.6.0
> Reporter: Cyrille Chépélov
> Attachments: TEZ-2237-hack.branch6.txt, TEZ-2237-hack.master.txt,
> TEZ-2237.test.2_branch0.6.txt, all_stacks.lst, alloc_mem.png,
> alloc_vcores.png, application_1427324000018_1444.yarn-logs.red.txt.gz,
> application_1427324000018_1908.red.txt.bz2,
> application_1427964335235_2070.txt.red.txt.bz2,
> appmaster____syslog_dag_1427282048097_0215_1.red.txt.gz,
> appmaster____syslog_dag_1427282048097_0237_1.red.txt.gz,
> gc_count_MRAppMaster.png, mem_free.png, noopexample_2237.txt,
> ordered-grouped-kv-input-traces.diff, start_containers.png,
> stop_containers.png,
> syslog_attempt_1427282048097_0215_1_21_000014_0.red.txt.gz,
> syslog_attempt_1427282048097_0237_1_70_000028_0.red.txt.gz, yarn_rm_flips.png
>
>
> On a specific DAG with many vertices (actually part of a larger meta-DAG),
> after about a hour of processing, several BufferTooSmallException are raised
> in UnorderedPartitionedKVWriter (about one every two or three spills).
> Once these exceptions are raised, the DAG remains indefinitely "active",
> tying up memory and CPU resources as far as YARN is concerned, while little
> if any actual processing takes place.
> It seems two separate issues are at hand:
> 1. BufferTooSmallException are raised even though, small as the actually
> allocated buffers seem to be (around a couple megabytes were allotted whereas
> 100MiB were requested), the actual keys and values are never bigger than 24
> and 1024 bytes respectively.
> 2. In the event BufferTooSmallExceptions are raised, the DAG fails to stop
> (stop requests appear to be sent 7 hours after the BTSE exceptions are
> raised, but 9 hours after these stop requests, the DAG was still lingering on
> with all containers present tying up memory and CPU allocations)
> The emergence of the BTSE prevent the Cascade to complete, preventing from
> validating the results compared to traditional MR1-based results. The lack of
> conclusion renders the cluster queue unavailable.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)