[ https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732799#comment-17732799 ]
wgcn edited comment on FLINK-32319 at 6/14/23 11:26 PM:
--------------------------------------------------------
Hi [~Wencong Liu], I increased taskmanager.network.request-backoff.max from
10000 to 20000 and then to 30000, but the problem keeps occurring. Is this
related to the config "-Dtaskmanager.memory.network.max=600 MB"? I calculated
that the network buffer memory required by the task manager is not very large,
so I decreased it.
was (Author: 1026688210):
Hi [~Wencong Liu], I increased taskmanager.network.request-backoff.max to 20000
and 30000, but the problem keeps occurring. Is this related to the config
"-Dtaskmanager.memory.network.max=600 MB"? I calculated that the network buffer
memory required by the task manager is not very large, so I decreased it.
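
For reference, a rough back-of-envelope version of that calculation, using the
values from the configuration quoted below. This is a sketch only: the per-gate
formula (channels x buffers-per-channel + floating-buffers-per-gate) follows the
Flink network-stack documentation, and the number of shuffle gates per task is
an assumption for illustration, not something stated in this issue.

```
// Rough estimate of network memory needed per TaskManager, using the
// values from the configuration in this issue. Gate count per task is
// an assumption, not taken from the actual job graph.
public class NetworkBufferEstimate {
    public static void main(String[] args) {
        long segmentSize = 128 * 1024;  // taskmanager.memory.segment-size 128kb
        int parallelism = 128;          // parallelism.default
        int buffersPerChannel = 8;      // taskmanager.network.memory.buffers-per-channel
        int floatingPerGate = 800;      // taskmanager.network.memory.floating-buffers-per-gate
        int slots = 2;                  // taskmanager.numberOfTaskSlots
        int gatesPerTask = 1;           // assumption: one shuffle input per task

        // With an all-to-all connection, each input gate has `parallelism` channels.
        long buffersPerGate = (long) parallelism * buffersPerChannel + floatingPerGate;
        long bytesPerGate = buffersPerGate * segmentSize;
        long bytesPerTm = bytesPerGate * gatesPerTask * slots; // ignores output partitions

        System.out.printf("per gate: %d buffers = %d MB%n",
                buffersPerGate, bytesPerGate >> 20);
        System.out.printf("per TM (input side only): %d MB vs 600 MB cap%n",
                bytesPerTm >> 20);
    }
}
```

Under those assumptions the input side alone comes to roughly 456 MB of the
600 MB cap (about 228 MB per gate), so once the result-partition side is added
the total could plausibly exceed the limit during a restart.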
> Flink can't request the network partition after restart
> -------------------------------------------------------
>
> Key: FLINK-32319
> URL: https://issues.apache.org/jira/browse/FLINK-32319
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Affects Versions: 1.17.1
> Environment: CentOS 7.
> JDK 8.
> Flink 1.17.1 in application mode on YARN.
> Flink configuration:
> ```
> $internal.application.program-args sql2
> $internal.deployment.config-dir /data/home/flink/wgcn/flink-1.17.1/conf
> $internal.yarn.log-config-file
> /data/home/flink/wgcn/flink-1.17.1/conf/log4j.properties
> akka.ask.timeout 100s
> blob.server.port 15402
> classloader.check-leaked-classloader false
> classloader.resolve-order parent-first
> env.java.opts.taskmanager -XX:+UseG1GC -XX:MaxGCPauseMillis=1000
> execution.attached true
> execution.checkpointing.aligned-checkpoint-timeout 10 min
> execution.checkpointing.externalized-checkpoint-retention
> RETAIN_ON_CANCELLATION
> execution.checkpointing.interval 10 min
> execution.checkpointing.min-pause 10 min
> execution.savepoint-restore-mode NO_CLAIM
> execution.savepoint.ignore-unclaimed-state false
> execution.shutdown-on-attached-exit false
> execution.target embedded
> high-availability zookeeper
> high-availability.cluster-id application_1684133071014_7202676
> high-availability.storageDir hdfs://xxxx/user/flink/recovery
> high-availability.zookeeper.path.root /flink
> high-availability.zookeeper.quorum xxxxx
> internal.cluster.execution-mode NORMAL
> internal.io.tmpdirs.use-local-default true
> io.tmp.dirs
> /data1/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data3/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data4/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data5/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data6/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data7/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data8/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data9/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data10/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data11/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data12/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676
> jobmanager.execution.failover-strategy region
> jobmanager.memory.heap.size 9261023232b
> jobmanager.memory.jvm-metaspace.size 268435456b
> jobmanager.memory.jvm-overhead.max 1073741824b
> jobmanager.memory.jvm-overhead.min 1073741824b
> jobmanager.memory.off-heap.size 134217728b
> jobmanager.memory.process.size 10240m
> jobmanager.rpc.address xxxx
> jobmanager.rpc.port 31332
> metrics.reporter.promgateway.deleteOnShutdown true
> metrics.reporter.promgateway.factory.class
> org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporterFactory
> metrics.reporter.promgateway.hostUrl xxxx:9091
> metrics.reporter.promgateway.interval 60 SECONDS
> metrics.reporter.promgateway.jobName join_phase3_v7
> metrics.reporter.promgateway.randomJobNameSuffix false
> parallelism.default 128
> pipeline.classpaths
> pipeline.jars
> file:/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676/container_e16_1684133071014_7202676_01_000002/frauddetection-0.1.jar
> rest.address xxxx
> rest.bind-address xxxxx
> rest.bind-port 50000-50500
> rest.flamegraph.enabled true
> restart-strategy.failure-rate.delay 10 s
> restart-strategy.failure-rate.failure-rate-interval 1 min
> restart-strategy.failure-rate.max-failures-per-interval 6
> restart-strategy.type exponential-delay
> state.backend.type filesystem
> state.checkpoints.dir hdfs://xxxxxx/user/flink/checkpoints-data/wgcn
> state.checkpoints.num-retained 3
> taskmanager.memory.managed.fraction 0
> taskmanager.memory.network.max 600mb
> taskmanager.memory.process.size 10240m
> taskmanager.memory.segment-size 128kb
> taskmanager.network.memory.buffers-per-channel 8
> taskmanager.network.memory.floating-buffers-per-gate 800
> taskmanager.numberOfTaskSlots 2
> web.port 0
> web.tmpdir /tmp/flink-web-1b87445e-2761-4f16-97a1-8d4fc6fa8534
> yarn.application-attempt-failures-validity-interval 60000
> yarn.application-attempts 3
> yarn.application.name join_phase3_v7
> yarn.heartbeat.container-request-interval 700
> ```
> Reporter: wgcn
> Priority: Major
> Attachments: image-2023-06-13-07-14-48-958.png
>
>
> Flink can't request the network partition after restart, which prevents the
> job from restoring.
> !image-2023-06-13-07-14-48-958.png!