[ https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731827#comment-17731827 ]
wgcn commented on FLINK-32319:
------------------------------

Hi [~Wencong Liu], thanks for your response. I will try the config. I have a question: I only roughly looked at the meaning of this config. Why does it need to be set so large? Is it related to parallel reading? We are using Flink 1.12 in our production environment, and I had never paid attention to this config before. Is this because some mechanisms were added after version 1.12?

> flink can't the partition of network after restart
> --------------------------------------------------
>
>                 Key: FLINK-32319
>                 URL: https://issues.apache.org/jira/browse/FLINK-32319
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.17.1
>        Environment: centos 7.
> jdk 8.
> flink 1.17.1 application mode on yarn
> flink configuration:
> ```
> $internal.application.program-args sql2
> $internal.deployment.config-dir /data/home/flink/wgcn/flink-1.17.1/conf
> $internal.yarn.log-config-file /data/home/flink/wgcn/flink-1.17.1/conf/log4j.properties
> akka.ask.timeout 100s
> blob.server.port 15402
> classloader.check-leaked-classloader false
> classloader.resolve-order parent-first
> env.java.opts.taskmanager -XX:+UseG1GC -XX:MaxGCPauseMillis=1000
> execution.attached true
> execution.checkpointing.aligned-checkpoint-timeout 10 min
> execution.checkpointing.externalized-checkpoint-retention RETAIN_ON_CANCELLATION
> execution.checkpointing.interval 10 min
> execution.checkpointing.min-pause 10 min
> execution.savepoint-restore-mode NO_CLAIM
> execution.savepoint.ignore-unclaimed-state false
> execution.shutdown-on-attached-exit false
> execution.target embedded
> high-availability zookeeper
> high-availability.cluster-id application_1684133071014_7202676
> high-availability.storageDir hdfs://xxxx/user/flink/recovery
> high-availability.zookeeper.path.root /flink
> high-availability.zookeeper.quorum xxxxx
> internal.cluster.execution-mode NORMAL
> internal.io.tmpdirs.use-local-default true
> io.tmp.dirs /data1/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data3/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data4/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data5/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data6/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data7/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data8/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data9/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data10/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data11/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data12/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676
> jobmanager.execution.failover-strategy region
> jobmanager.memory.heap.size 9261023232b
> jobmanager.memory.jvm-metaspace.size 268435456b
> jobmanager.memory.jvm-overhead.max 1073741824b
> jobmanager.memory.jvm-overhead.min 1073741824b
> jobmanager.memory.off-heap.size 134217728b
> jobmanager.memory.process.size 10240m
> jobmanager.rpc.address xxxx
> jobmanager.rpc.port 31332
> metrics.reporter.promgateway.deleteOnShutdown true
> metrics.reporter.promgateway.factory.class org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporterFactory
> metrics.reporter.promgateway.hostUrl xxxx:9091
> metrics.reporter.promgateway.interval 60 SECONDS
> metrics.reporter.promgateway.jobName join_phase3_v7
> metrics.reporter.promgateway.randomJobNameSuffix false
> parallelism.default 128
> pipeline.classpaths
> pipeline.jars file:/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676/container_e16_1684133071014_7202676_01_000002/frauddetection-0.1.jar
> rest.address xxxx
> rest.bind-address xxxxx
> rest.bind-port 50000-50500
> rest.flamegraph.enabled true
> restart-strategy.failure-rate.delay 10 s
> restart-strategy.failure-rate.failure-rate-interval 1 min
> restart-strategy.failure-rate.max-failures-per-interval 6
> restart-strategy.type exponential-delay
> state.backend.type filesystem
> state.checkpoints.dir hdfs://xxxxxx/user/flink/checkpoints-data/wgcn
> state.checkpoints.num-retained 3
> taskmanager.memory.managed.fraction 0
> taskmanager.memory.network.max 600mb
> taskmanager.memory.process.size 10240m
> taskmanager.memory.segment-size 128kb
> taskmanager.network.memory.buffers-per-channel 8
> taskmanager.network.memory.floating-buffers-per-gate 800
> taskmanager.numberOfTaskSlots 2
> web.port 0
> web.tmpdir /tmp/flink-web-1b87445e-2761-4f16-97a1-8d4fc6fa8534
> yarn.application-attempt-failures-validity-interval 60000
> yarn.application-attempts 3
> yarn.application.name join_phase3_v7
> yarn.heartbeat.container-request-interval 700
> ```
>            Reporter: wgcn
>          Priority: Major
>       Attachments: image-2023-06-13-07-14-48-958.png
>
> Flink can't recover the network partitions after a restart, which leads to the job not being able to restore.
> !image-2023-06-13-07-14-48-958.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
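For context on the buffer-related settings in the configuration above, here is a rough back-of-envelope sketch. It assumes a simplified model of Flink's credit-based flow control (each input channel reserves `buffers-per-channel` exclusive network buffers, and each input gate may additionally borrow up to `floating-buffers-per-gate` shared buffers); the actual accounting in Flink is more involved, so treat the numbers as illustrative only.

```python
# Back-of-envelope network buffer arithmetic for the configuration above.
# Simplified (hypothetical) model: worst-case buffers per input gate =
#   channels * buffers-per-channel + floating-buffers-per-gate.

NETWORK_MEMORY_MAX = 600 * 1024 * 1024   # taskmanager.memory.network.max = 600mb
SEGMENT_SIZE = 128 * 1024                # taskmanager.memory.segment-size = 128kb
BUFFERS_PER_CHANNEL = 8                  # taskmanager.network.memory.buffers-per-channel
FLOATING_BUFFERS_PER_GATE = 800          # taskmanager.network.memory.floating-buffers-per-gate
PARALLELISM = 128                        # parallelism.default

# Total network buffers (memory segments) available per TaskManager.
total_segments = NETWORK_MEMORY_MAX // SEGMENT_SIZE


def buffers_for_gate(num_channels: int) -> int:
    """Worst-case number of buffers one input gate may claim."""
    return num_channels * BUFFERS_PER_CHANNEL + FLOATING_BUFFERS_PER_GATE


# One all-to-all input gate at parallelism 128 has 128 channels.
one_gate = buffers_for_gate(PARALLELISM)

print(f"segments available per TM:       {total_segments}")   # 4800
print(f"worst-case segments, one gate:   {one_gate}")         # 128*8 + 800 = 1824
print(f"such gates fitting per TM:       {total_segments // one_gate}")
```

Under this model, only a couple of fully-loaded 128-channel gates fit into 600 MB of network memory per TaskManager, which is one plausible reason a gate-level buffer config may need to be sized much larger at high parallelism.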