[ https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731827#comment-17731827 ]
wgcn commented on FLINK-32319:
------------------------------

Hi [~Wencong Liu], thanks for your response. I will try the config. I have a question: I only roughly looked at the meaning of this config. Why does it need to be set so large? Is it related to parallel reading? We are using Flink 1.12 in our production environment, and I had never paid attention to this config before. Is this because some mechanisms were added after version 1.12?

> flink can't the partition of network after restart
> --------------------------------------------------
>
>                 Key: FLINK-32319
>                 URL: https://issues.apache.org/jira/browse/FLINK-32319
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.17.1
>        Environment: centos 7.
> jdk 8.
> flink 1.17.1 application mode on yarn
> flink configuration:
> ```
> $internal.application.program-args sql2
> $internal.deployment.config-dir /data/home/flink/wgcn/flink-1.17.1/conf
> $internal.yarn.log-config-file /data/home/flink/wgcn/flink-1.17.1/conf/log4j.properties
> akka.ask.timeout 100s
> blob.server.port 15402
> classloader.check-leaked-classloader false
> classloader.resolve-order parent-first
> env.java.opts.taskmanager -XX:+UseG1GC -XX:MaxGCPauseMillis=1000
> execution.attached true
> execution.checkpointing.aligned-checkpoint-timeout 10 min
> execution.checkpointing.externalized-checkpoint-retention RETAIN_ON_CANCELLATION
> execution.checkpointing.interval 10 min
> execution.checkpointing.min-pause 10 min
> execution.savepoint-restore-mode NO_CLAIM
> execution.savepoint.ignore-unclaimed-state false
> execution.shutdown-on-attached-exit false
> execution.target embedded
> high-availability zookeeper
> high-availability.cluster-id application_1684133071014_7202676
> high-availability.storageDir hdfs://xxxx/user/flink/recovery
> high-availability.zookeeper.path.root /flink
> high-availability.zookeeper.quorum xxxxx
> internal.cluster.execution-mode NORMAL
> internal.io.tmpdirs.use-local-default true
> io.tmp.dirs /data1/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data3/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data4/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data5/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data6/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data7/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data8/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data9/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data10/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data11/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data12/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676
> jobmanager.execution.failover-strategy region
> jobmanager.memory.heap.size 9261023232b
> jobmanager.memory.jvm-metaspace.size 268435456b
> jobmanager.memory.jvm-overhead.max 1073741824b
> jobmanager.memory.jvm-overhead.min 1073741824b
> jobmanager.memory.off-heap.size 134217728b
> jobmanager.memory.process.size 10240m
> jobmanager.rpc.address xxxx
> jobmanager.rpc.port 31332
> metrics.reporter.promgateway.deleteOnShutdown true
> metrics.reporter.promgateway.factory.class org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporterFactory
> metrics.reporter.promgateway.hostUrl xxxx:9091
> metrics.reporter.promgateway.interval 60 SECONDS
> metrics.reporter.promgateway.jobName join_phase3_v7
> metrics.reporter.promgateway.randomJobNameSuffix false
> parallelism.default 128
> pipeline.classpaths
> pipeline.jars file:/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676/container_e16_1684133071014_7202676_01_000002/frauddetection-0.1.jar
> rest.address xxxx
> rest.bind-address xxxxx
> rest.bind-port 50000-50500
> rest.flamegraph.enabled true
> restart-strategy.failure-rate.delay 10 s
> restart-strategy.failure-rate.failure-rate-interval 1 min
> restart-strategy.failure-rate.max-failures-per-interval 6
> restart-strategy.type exponential-delay
> state.backend.type filesystem
> state.checkpoints.dir hdfs://xxxxxx/user/flink/checkpoints-data/wgcn
> state.checkpoints.num-retained 3
> taskmanager.memory.managed.fraction 0
> taskmanager.memory.network.max 600mb
> taskmanager.memory.process.size 10240m
> taskmanager.memory.segment-size 128kb
> taskmanager.network.memory.buffers-per-channel 8
> taskmanager.network.memory.floating-buffers-per-gate 800
> taskmanager.numberOfTaskSlots 2
> web.port 0
> web.tmpdir /tmp/flink-web-1b87445e-2761-4f16-97a1-8d4fc6fa8534
> yarn.application-attempt-failures-validity-interval 60000
> yarn.application-attempts 3
> yarn.application.name join_phase3_v7
> yarn.heartbeat.container-request-interval 700
> ```
>            Reporter: wgcn
>          Priority: Major
>       Attachments: image-2023-06-13-07-14-48-958.png
>
> Flink can't recover the network partitions after a restart, which leads to the job not being able to restore.
> !image-2023-06-13-07-14-48-958.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
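For context on the buffer-related settings in the configuration above, here is a rough back-of-envelope sketch. It assumes a simplified model of Flink's credit-based flow control (each input channel reserves `buffers-per-channel` exclusive network buffers, and each input gate may additionally borrow up to `floating-buffers-per-gate` shared buffers); the actual accounting in Flink is more involved, so treat the numbers as illustrative only.

```python
# Back-of-envelope network buffer arithmetic for the configuration above.
# Simplified (hypothetical) model: worst-case buffers per input gate =
#   channels * buffers-per-channel + floating-buffers-per-gate.

NETWORK_MEMORY_MAX = 600 * 1024 * 1024   # taskmanager.memory.network.max = 600mb
SEGMENT_SIZE = 128 * 1024                # taskmanager.memory.segment-size = 128kb
BUFFERS_PER_CHANNEL = 8                  # taskmanager.network.memory.buffers-per-channel
FLOATING_BUFFERS_PER_GATE = 800          # taskmanager.network.memory.floating-buffers-per-gate
PARALLELISM = 128                        # parallelism.default

# Total network buffers (memory segments) available per TaskManager.
total_segments = NETWORK_MEMORY_MAX // SEGMENT_SIZE


def buffers_for_gate(num_channels: int) -> int:
    """Worst-case number of buffers one input gate may claim."""
    return num_channels * BUFFERS_PER_CHANNEL + FLOATING_BUFFERS_PER_GATE


# One all-to-all input gate at parallelism 128 has 128 channels.
one_gate = buffers_for_gate(PARALLELISM)

print(f"segments available per TM:       {total_segments}")   # 4800
print(f"worst-case segments, one gate:   {one_gate}")         # 128*8 + 800 = 1824
print(f"such gates fitting per TM:       {total_segments // one_gate}")
```

Under this model, only a couple of fully-loaded 128-channel gates fit into 600 MB of network memory per TaskManager, which is one plausible reason a gate-level buffer config may need to be sized much larger at high parallelism.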