[ 
https://issues.apache.org/jira/browse/FLINK-15047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987859#comment-16987859
 ] 

Xintong Song commented on FLINK-15047:
--------------------------------------

[~tison], [~hpeter], [~gjy]
I've also looked into this failure case, and have already found the cause.
A solution is already included in the linked PR, and the explanation of the 
cause is as follows.

The problem is caused jointly by FLINK-13983 and FLINK-13184. The test case 
[~hpeter] introduced in FLINK-14033 merely triggers the problem. It seems that 
FLINK-14033 and FLINK-13984 each passed Travis and were merged separately, but 
once both changes were in the master branch, the test failure was triggered.

When I tried to debug this case, I found that it does not output 
'jobmanager.log' and 'taskmanager.log', because it does not specify 
{{YarnConfigOptionsInternal#APPLICATION_LOG_CONFIG_FILE}}. So I specified this 
configuration in the test case and, surprisingly, found that the test then 
passed.

After a few tries, I found that the test case fails when {{configuration}} 
contains fewer than 2 key-value pairs, and succeeds when there are more than 2. 
(This later proved to be not always true, but it helped at the time.) So I 
removed {{AkkaOptions.ASK_TIMEOUT}} from the {{configuration}}, keeping only the 
entry for the log4j config file. In this way I finally got log files for a 
failing run.

Looking into the log files, I found an exception in 'taskmanager.log' saying 
'jobmanager.rpc.address' is null, which caused the task executor to fail. 
Currently, we pass task executor specific configurations through dynamic 
properties in the starting command (see FLINK-13184). So I looked into the 
starting command of the task executors.

This is the starting command in a failure case:
{code:java}
/bin/bash -c 
/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/bin/java 
-Xmx268435450 -Xms268435450 -XX:MaxDirectMemorySize=214748366 
-XX:MaxMetaspaceSize=134217728 
-Dlog.file=/Users/xintongsong/workspace/xintongsong-flink/flink-yarn-tests/target/flink-yarn-tests-with-distributed-cache/flink-yarn-tests-with-distributed-cache-logDir-nm-1_0/application_1575456859645_0001/container_1575456859645_0001_01_000002/taskmanager.log
 -Dlog4j.configuration=file:./log4j.properties 
org.apache.flink.yarn.YarnTaskExecutorRunner -D 
taskmanager.memory.shuffle.max=80530638b -D 
taskmanager.memory.framework.off-heap.size=134217728b -D 
taskmanager.memory.framework.heap.size=134217728b -D 
taskmanager.memory.managed.size=322122552b -D 
taskmanager.memory.task.heap.size=134217722b -D 
taskmanager.memory.task.off-heap.size=0b -D 
taskmanager.memory.managed.off-heap.size=322122552b -D 
taskmanager.memory.shuffle.min=80530638b --configDir . -Dweb.port=0 
-Dtaskmanager.memory.managed.size=322122552 bytes 
-Djobmanager.rpc.address=30.25.94.103 -Drest.address=30.25.94.103 
-Dweb.tmpdir=/var/folders/nk/cj8wv8r97rn7w7dhwqzghpr40000gn/T/flink-web-ceab8aec-cf07-4c83-9cfc-9599d39a5980
 -Djobmanager.rpc.port=53359 1> 
/Users/xintongsong/workspace/xintongsong-flink/flink-yarn-tests/target/flink-yarn-tests-with-distributed-cache/flink-yarn-tests-with-distributed-cache-logDir-nm-1_0/application_1575456859645_0001/container_1575456859645_0001_01_000002/taskmanager.out
 2> 
/Users/xintongsong/workspace/xintongsong-flink/flink-yarn-tests/target/flink-yarn-tests-with-distributed-cache/flink-yarn-tests-with-distributed-cache-logDir-nm-1_0/application_1575456859645_0001/container_1575456859645_0001_01_000002/taskmanager.err{code}

And this is the starting command in a success case:
{code:java}
/bin/bash -c 
/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/bin/java 
-Xmx268435450 -Xms268435450 -XX:MaxDirectMemorySize=214748366 
-XX:MaxMetaspaceSize=134217728 
-Dlog.file=/Users/xintongsong/workspace/xintongsong-flink/flink-yarn-tests/target/flink-yarn-tests-with-distributed-cache/flink-yarn-tests-with-distributed-cache-logDir-nm-1_0/application_1575456859645_0001/container_1575456859645_0001_01_000002/taskmanager.log
 -Dlog4j.configuration=file:./log4j.properties 
org.apache.flink.yarn.YarnTaskExecutorRunner -D 
taskmanager.memory.shuffle.max=80530638b -D 
taskmanager.memory.framework.off-heap.size=134217728b -D 
taskmanager.memory.framework.heap.size=134217728b -D 
taskmanager.memory.managed.size=322122552b -D 
taskmanager.memory.task.heap.size=134217722b -D 
taskmanager.memory.task.off-heap.size=0b -D 
taskmanager.memory.managed.off-heap.size=322122552b -D 
taskmanager.memory.shuffle.min=80530638b --configDir . -Dweb.port=0 
-Dtaskmanager.memory.managed.size=322122552 bytes 
-Djobmanager.rpc.address=30.25.94.103 -Drest.address=30.25.94.103 
-Dweb.tmpdir=/var/folders/nk/cj8wv8r97rn7w7dhwqzghpr40000gn/T/flink-web-ceab8aec-cf07-4c83-9cfc-9599d39a5980
 -Djobmanager.rpc.port=53359 1> 
/Users/xintongsong/workspace/xintongsong-flink/flink-yarn-tests/target/flink-yarn-tests-with-distributed-cache/flink-yarn-tests-with-distributed-cache-logDir-nm-1_0/application_1575456859645_0001/container_1575456859645_0001_01_000002/taskmanager.out
 2> 
/Users/xintongsong/workspace/xintongsong-flink/flink-yarn-tests/target/flink-yarn-tests-with-distributed-cache/flink-yarn-tests-with-distributed-cache-logDir-nm-1_0/application_1575456859645_0001/container_1575456859645_0001_01_000002/taskmanager.err
{code}

You will find "-Dtaskmanager.memory.managed.size=322122552 bytes" in both 
commands. The space between the number and "bytes" prevents parsing of the 
subsequent dynamic properties. In the failure case, the space comes before 
"jobmanager.rpc.address", so the address is not parsed. In the success case, 
the space comes after the address, so the address is not affected.
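
The effect can be reproduced with a small model. This is only a sketch: it assumes the command line is split on whitespace and that option parsing stops at the first token that is not of the form {{-Dkey=value}}. The class {{DynamicPropertySketch}} and its {{parse}} method are hypothetical and are not Flink's actual parser.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified, hypothetical model of dynamic property parsing (not Flink code). */
final class DynamicPropertySketch {

    /** Collects -Dkey=value tokens and stops at the first token of any other shape. */
    static Map<String, String> parse(List<String> tokens) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String token : tokens) {
            int eq = token.indexOf('=');
            if (!token.startsWith("-D") || eq < 0) {
                break; // assumption: the first non-option token ends option parsing
            }
            props.put(token.substring(2, eq), token.substring(eq + 1));
        }
        return props;
    }

    public static void main(String[] args) {
        // Shortened version of the ordering in the failing command.
        String failing = "-Dweb.port=0 "
                + "-Dtaskmanager.memory.managed.size=322122552 bytes "
                + "-Djobmanager.rpc.address=30.25.94.103";
        // Same properties, with the address placed before the value that contains a space.
        String passing = "-Djobmanager.rpc.address=30.25.94.103 "
                + "-Dtaskmanager.memory.managed.size=322122552 bytes "
                + "-Dweb.port=0";

        // Whitespace splitting turns "322122552 bytes" into two tokens.
        Map<String, String> a = parse(Arrays.asList(failing.split("\\s+")));
        Map<String, String> b = parse(Arrays.asList(passing.split("\\s+")));

        System.out.println(a.get("jobmanager.rpc.address")); // null: parsing stopped at "bytes"
        System.out.println(b.get("jobmanager.rpc.address")); // 30.25.94.103
    }
}
```

In this model, every property that follows the stray "bytes" token is silently dropped, so whether the task executor starts depends only on where the address ends up in the ordering.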

The dynamic properties are generated by {{Utils#getDynamicProperties}}, where 
{{stream().flatMap().toArray()}} is used. So the order of the properties 
depends on the internal implementation of Java streams, and is probably related 
to the number of configuration entries.

The cause of the space in this case is that, in FLINK-13983, we use 
{{tmResourceSpec.getManagedMemorySize().toString()}} in 
{{ActiveResourceManagerFactory#createActiveResourceManagerConfiguration}} to 
explicitly set the managed memory size in the {{configuration}}. 
{{MemorySize#toString}} generates strings with spaces. I've verified that 
changing it to {{tmResourceSpec.getManagedMemorySize().getBytes() + "b"}} fixes 
the problem.
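
To make the difference between the two renderings concrete, here is a minimal sketch. {{MemorySizeSketch}} is a hypothetical stand-in, not Flink's {{MemorySize}}; its {{toString}} only mirrors the "322122552 bytes" format seen in the starting command above.

```java
/** Hypothetical stand-in for a memory size value (not Flink's MemorySize class). */
final class MemorySizeSketch {

    private final long bytes;

    MemorySizeSketch(long bytes) {
        this.bytes = bytes;
    }

    long getBytes() {
        return bytes;
    }

    /** Human-readable form; contains a space, as observed in the starting command. */
    @Override
    public String toString() {
        return bytes + " bytes";
    }

    /** Space-free form, following the "getBytes() + \"b\"" change described above. */
    static String toDynamicPropertyValue(MemorySizeSketch size) {
        return size.getBytes() + "b";
    }

    public static void main(String[] args) {
        MemorySizeSketch managed = new MemorySizeSketch(322122552L);
        System.out.println(managed);                         // 322122552 bytes
        System.out.println(toDynamicPropertyValue(managed)); // 322122552b
    }
}
```

The second form matches the other {{-D}} values in the command, such as {{taskmanager.memory.managed.size=322122552b}}, which contain no whitespace.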

I already included the fix in our PR, so the test case should be fixed soon. 

Based on the findings from debugging this case, I would suggest two 
follow-ups.
- {{YarnDistributedCacheITCase}} should generate jobmanager / taskmanager logs. 
{{YarnTestUtils.createClusterDescriptorWithLogging}} may be used here.
- Flink allows config options to have values that contain spaces. E.g., the 
default value of {{AkkaOptions#ASK_TIMEOUT}} is "10 s". To prevent such problems 
in the future, we should also allow spaces in dynamic property values, e.g., by 
surrounding the values with double quotation marks or escaping special 
characters.
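
As a rough illustration of the second follow-up, the quoting could look like the sketch below. The class {{DynamicPropertyQuoting}} and its {{render}} method are hypothetical; the comment only proposes the idea, so this is one possible shape, not an existing Flink utility.

```java
/** Hypothetical helper that renders one dynamic property for a command line. */
final class DynamicPropertyQuoting {

    /**
     * Renders "-Dkey=value". The value is wrapped in double quotes when it is
     * empty or contains whitespace, a double quote, or a backslash; embedded
     * backslashes and double quotes are escaped with a backslash.
     */
    static String render(String key, String value) {
        boolean needsQuoting = value.isEmpty()
                || value.chars().anyMatch(c -> Character.isWhitespace(c) || c == '"' || c == '\\');
        if (!needsQuoting) {
            return "-D" + key + "=" + value;
        }
        // Escape backslashes first so the escapes added for quotes are not doubled.
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return "-D" + key + "=\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        System.out.println(render("akka.ask.timeout", "10 s"));
        System.out.println(render("taskmanager.memory.managed.size", "322122552b"));
    }
}
```

This sketch only handles whitespace, double quotes, and backslashes. Characters that a shell still expands inside double quotes, such as {{$}} or backticks, would need additional escaping, and the receiving parser would have to undo the quoting consistently.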

What do you think?

> YarnDistributedCacheITCase is unstable
> --------------------------------------
>
>                 Key: FLINK-15047
>                 URL: https://issues.apache.org/jira/browse/FLINK-15047
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 1.10.0
>            Reporter: Zili Chen
>            Assignee: Xintong Song
>            Priority: Blocker
>              Labels: test-stability
>             Fix For: 1.10.0
>
>
> See also https://api.travis-ci.com/v3/job/262854881/log.txt
> cc [~ZhenqiuHuang]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
