[ https://issues.apache.org/jira/browse/FLINK-15047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987859#comment-16987859 ]
Xintong Song commented on FLINK-15047:
--------------------------------------

[~tison], [~hpeter], [~gjy]

I've also looked into this failure case and have already found the cause. A solution is already included in the linked PR, and the explanation of the cause is as follows.

The problem is caused jointly by FLINK-13983 and FLINK-13184. The test case [~hpeter] introduced in FLINK-14033 merely triggers the problem. It seems that FLINK-14033 and FLINK-13984 each passed Travis and were merged separately, but once both changes were on the master branch, the test failure was triggered.

When I tried to debug this case, I found that it does not output 'jobmanager.log' and 'taskmanager.log', because it does not specify {{YarnConfigOptionsInternal#APPLICATION_LOG_CONFIG_FILE}}. So I specified this configuration in the case, and was surprised to find that the case was fixed. After a few tries, I found that the test case fails when {{configuration}} contains fewer than 2 key-value pairs, and succeeds when there are more than 2. (This later proved not to be always true, but it helped at the time.) So I removed {{AkkaOptions.ASK_TIMEOUT}} from the {{configuration}} and kept only the entry for the log4j config file. In this way I finally got the log files of a failure case.

Looking into the log files, I found an exception in 'taskmanager.log' saying 'jobmanager.rpc.address' is null, which caused the task executor to fail. Currently, we pass task executor specific configurations through dynamic properties in the starting command (see FLINK-13184). So I looked into the starting command of the task executors.

This is the starting command in a failure case: {code:java}/bin/bash -c /Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/bin/java -Xmx268435450 -Xms268435450 -XX:MaxDirectMemorySize=214748366 -XX:MaxMetaspaceSize=134217728 -Dlog.file=/Users/xintongsong/workspace/xintongsong-flink/flink-yarn-tests/target/flink-yarn-tests-with-distributed-cache/flink-yarn-tests-with-distributed-cache-logDir-nm-1_0/application_1575456859645_0001/container_1575456859645_0001_01_000002/taskmanager.log -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner -D taskmanager.memory.shuffle.max=80530638b -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=322122552b -D taskmanager.memory.task.heap.size=134217722b -D taskmanager.memory.task.off-heap.size=0b -D taskmanager.memory.managed.off-heap.size=322122552b -D taskmanager.memory.shuffle.min=80530638b --configDir . -Dweb.port=0 -Dtaskmanager.memory.managed.size=322122552 bytes -Djobmanager.rpc.address=30.25.94.103 -Drest.address=30.25.94.103 -Dweb.tmpdir=/var/folders/nk/cj8wv8r97rn7w7dhwqzghpr40000gn/T/flink-web-ceab8aec-cf07-4c83-9cfc-9599d39a5980 -Djobmanager.rpc.port=53359 1> /Users/xintongsong/workspace/xintongsong-flink/flink-yarn-tests/target/flink-yarn-tests-with-distributed-cache/flink-yarn-tests-with-distributed-cache-logDir-nm-1_0/application_1575456859645_0001/container_1575456859645_0001_01_000002/taskmanager.out 2> /Users/xintongsong/workspace/xintongsong-flink/flink-yarn-tests/target/flink-yarn-tests-with-distributed-cache/flink-yarn-tests-with-distributed-cache-logDir-nm-1_0/application_1575456859645_0001/container_1575456859645_0001_01_000002/taskmanager.err{code} And this is the starting command in a success case: {code:java} /bin/bash -c /Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/bin/java -Xmx268435450 -Xms268435450 -XX:MaxDirectMemorySize=214748366 
-XX:MaxMetaspaceSize=134217728 -Dlog.file=/Users/xintongsong/workspace/xintongsong-flink/flink-yarn-tests/target/flink-yarn-tests-with-distributed-cache/flink-yarn-tests-with-distributed-cache-logDir-nm-1_0/application_1575456859645_0001/container_1575456859645_0001_01_000002/taskmanager.log -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner -D taskmanager.memory.shuffle.max=80530638b -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=322122552b -D taskmanager.memory.task.heap.size=134217722b -D taskmanager.memory.task.off-heap.size=0b -D taskmanager.memory.managed.off-heap.size=322122552b -D taskmanager.memory.shuffle.min=80530638b --configDir . -Dweb.port=0 -Dtaskmanager.memory.managed.size=322122552 bytes -Djobmanager.rpc.address=30.25.94.103 -Drest.address=30.25.94.103 -Dweb.tmpdir=/var/folders/nk/cj8wv8r97rn7w7dhwqzghpr40000gn/T/flink-web-ceab8aec-cf07-4c83-9cfc-9599d39a5980 -Djobmanager.rpc.port=53359 1> /Users/xintongsong/workspace/xintongsong-flink/flink-yarn-tests/target/flink-yarn-tests-with-distributed-cache/flink-yarn-tests-with-distributed-cache-logDir-nm-1_0/application_1575456859645_0001/container_1575456859645_0001_01_000002/taskmanager.out 2> /Users/xintongsong/workspace/xintongsong-flink/flink-yarn-tests/target/flink-yarn-tests-with-distributed-cache/flink-yarn-tests-with-distributed-cache-logDir-nm-1_0/application_1575456859645_0001/container_1575456859645_0001_01_000002/taskmanager.err {code} You will find "-Dtaskmanager.memory.managed.size=322122552 bytes" in both commands. The space between the number and "bytes" prevents parsing of the subsequent dynamic properties. In the failure case, the space comes before "jobmanager.rpc.address", so the address is not parsed. In the success case, the space comes after the address, so the address is not affected. 
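
To illustrate the effect described above, here is a minimal sketch. It assumes a simplified parser that reads {{-Dkey=value}} tokens and stops at the first token that is not an option. {{DynamicPropertySketch}} and {{parse}} are hypothetical names made up for this illustration and are not Flink's actual parsing code; the property strings are shortened from the commands above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DynamicPropertySketch {

    /**
     * Hypothetical, simplified parser (an assumption, not Flink's real code):
     * collects -Dkey=value tokens and stops at the first non-option token,
     * so everything after that token is ignored.
     */
    static Map<String, String> parse(String[] tokens) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String token : tokens) {
            if (!token.startsWith("-D")) {
                break; // e.g. the stray "bytes" token ends parsing here
            }
            int eq = token.indexOf('=');
            if (eq < 0) {
                break; // malformed option, stop as well
            }
            props.put(token.substring(2, eq), token.substring(eq + 1));
        }
        return props;
    }

    public static void main(String[] args) {
        // The shell splits the command on whitespace, so "322122552 bytes"
        // arrives as two separate tokens.
        String withSpace = "-Dweb.port=0 "
                + "-Dtaskmanager.memory.managed.size=322122552 bytes "
                + "-Djobmanager.rpc.address=30.25.94.103";
        String withoutSpace = "-Dweb.port=0 "
                + "-Dtaskmanager.memory.managed.size=322122552b "
                + "-Djobmanager.rpc.address=30.25.94.103";

        // prints "null": the address comes after the stray token and is lost
        System.out.println(parse(withSpace.split(" ")).get("jobmanager.rpc.address"));
        // prints "30.25.94.103": no stray token, all properties are read
        System.out.println(parse(withoutSpace.split(" ")).get("jobmanager.rpc.address"));
    }
}
```

In this sketch, whether the address survives depends only on whether it appears before or after the value containing the space, which matches the order-dependent behavior observed in the test.
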
The dynamic properties are generated by {{Utils#getDynamicProperties}}, where {{stream().flatMap().toArray()}} is used. So the order of the properties depends on the internal implementation of Java streams, and is probably related to the number of configuration entries.

The space in this case comes from FLINK-13983, where we use {{tmResourceSpec.getManagedMemorySize().toString()}} in {{ActiveResourceManagerFactory#createActiveResourceManagerConfiguration}} to explicitly set the managed memory size in the {{configuration}}. {{MemorySize#toString}} generates strings with spaces, such as "322122552 bytes". I've verified that changing it to {{tmResourceSpec.getManagedMemorySize().getBytes() + "b"}} fixes the problem.

I have already included the fix in our PR, so the test case should be fixed soon.

Based on the findings from debugging this case, I would suggest two follow-ups.
- {{YarnDistributedCacheITCase}} should generate jobmanager / taskmanager logs. {{YarnTestUtils.createClusterDescriptorWithLogging}} may be used here.
- Flink allows config options to have values that contain spaces. E.g., the default value of {{AkkaOptions#ASK_TIMEOUT}} is "10 s". To prevent such problems in the future, we should also allow spaces in dynamic property values, e.g., by surrounding the values with double quotation marks or by escaping special characters.

What do you think?

> YarnDistributedCacheITCase is unstable
> --------------------------------------
>
> Key: FLINK-15047
> URL: https://issues.apache.org/jira/browse/FLINK-15047
> Project: Flink
> Issue Type: Bug
> Components: Tests
> Affects Versions: 1.10.0
> Reporter: Zili Chen
> Assignee: Xintong Song
> Priority: Blocker
> Labels: test-stability
> Fix For: 1.10.0
>
>
> See also https://api.travis-ci.com/v3/job/262854881/log.txt
> cc [~ZhenqiuHuang]

--
This message was sent by Atlassian Jira (v8.3.4#803005)