[jira] [Updated] (FLINK-32319) flink can't the partition of network after restart

wgcn (Jira) Mon, 12 Jun 2023 16:22:07 -0700


     [ https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


wgcn updated FLINK-32319:
-------------------------
    Environment: 
CentOS 7.
JDK 8.
Flink 1.17.1, application mode on YARN.

Flink configuration:

```

$internal.application.program-args      sql2
$internal.deployment.config-dir /data/home/flink/wgcn/flink-1.17.1/conf
$internal.yarn.log-config-file  /data/home/flink/wgcn/flink-1.17.1/conf/log4j.properties
akka.ask.timeout        100s
blob.server.port        15402
classloader.check-leaked-classloader    false
classloader.resolve-order       parent-first
env.java.opts.taskmanager       -XX:+UseG1GC -XX:MaxGCPauseMillis=1000 
execution.attached      true
execution.checkpointing.aligned-checkpoint-timeout      10 min
execution.checkpointing.externalized-checkpoint-retention       RETAIN_ON_CANCELLATION
execution.checkpointing.interval        10 min
execution.checkpointing.min-pause       10 min
execution.savepoint-restore-mode        NO_CLAIM
execution.savepoint.ignore-unclaimed-state      false
execution.shutdown-on-attached-exit     false
execution.target        embedded
high-availability       zookeeper
high-availability.cluster-id    application_1684133071014_7202676
high-availability.storageDir    hdfs://xxxx/user/flink/recovery
high-availability.zookeeper.path.root   /flink
high-availability.zookeeper.quorum      xxxxx
internal.cluster.execution-mode NORMAL
internal.io.tmpdirs.use-local-default   true
io.tmp.dirs     /data1/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data3/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data4/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data5/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data6/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data7/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data8/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data9/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data10/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data11/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data12/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676
jobmanager.execution.failover-strategy  region
jobmanager.memory.heap.size     9261023232b
jobmanager.memory.jvm-metaspace.size    268435456b
jobmanager.memory.jvm-overhead.max      1073741824b
jobmanager.memory.jvm-overhead.min      1073741824b
jobmanager.memory.off-heap.size 134217728b
jobmanager.memory.process.size  10240m
jobmanager.rpc.address  xxxx
jobmanager.rpc.port     31332
metrics.reporter.promgateway.deleteOnShutdown   true
metrics.reporter.promgateway.factory.class      org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporterFactory
metrics.reporter.promgateway.hostUrl    xxxx:9091
metrics.reporter.promgateway.interval   60 SECONDS
metrics.reporter.promgateway.jobName    join_phase3_v7
metrics.reporter.promgateway.randomJobNameSuffix        false
parallelism.default     128
pipeline.classpaths     
pipeline.jars   file:/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676/container_e16_1684133071014_7202676_01_000002/frauddetection-0.1.jar
rest.address    xxxx
rest.bind-address       xxxxx
rest.bind-port  50000-50500
rest.flamegraph.enabled true
restart-strategy.failure-rate.delay     10 s
restart-strategy.failure-rate.failure-rate-interval     1 min
restart-strategy.failure-rate.max-failures-per-interval 6
restart-strategy.type   exponential-delay
state.backend.type      filesystem
state.checkpoints.dir   hdfs://xxxxxx/user/flink/checkpoints-data/wgcn
state.checkpoints.num-retained  3
taskmanager.memory.managed.fraction     0
taskmanager.memory.network.max  600mb
taskmanager.memory.process.size 10240m
taskmanager.memory.segment-size 128kb
taskmanager.network.memory.buffers-per-channel  8
taskmanager.network.memory.floating-buffers-per-gate    800
taskmanager.numberOfTaskSlots   2
web.port        0
web.tmpdir      /tmp/flink-web-1b87445e-2761-4f16-97a1-8d4fc6fa8534
yarn.application-attempt-failures-validity-interval     60000
yarn.application-attempts       3
yarn.application.name   join_phase3_v7
yarn.heartbeat.container-request-interval       700
```

  was:
CentOS 7.
JDK 8.
Flink 1.17.1, application mode on YARN.


> flink can't the partition of network after restart
> --------------------------------------------------
>
>                 Key: FLINK-32319
>                 URL: https://issues.apache.org/jira/browse/FLINK-32319
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.17.1
>         Environment: CentOS 7.
> JDK 8.
> Flink 1.17.1, application mode on YARN.
> Flink configuration:
> ```
> $internal.application.program-args    sql2
> $internal.deployment.config-dir       /data/home/flink/wgcn/flink-1.17.1/conf
> $internal.yarn.log-config-file        /data/home/flink/wgcn/flink-1.17.1/conf/log4j.properties
> akka.ask.timeout      100s
> blob.server.port      15402
> classloader.check-leaked-classloader  false
> classloader.resolve-order     parent-first
> env.java.opts.taskmanager     -XX:+UseG1GC -XX:MaxGCPauseMillis=1000 
> execution.attached    true
> execution.checkpointing.aligned-checkpoint-timeout    10 min
> execution.checkpointing.externalized-checkpoint-retention     RETAIN_ON_CANCELLATION
> execution.checkpointing.interval      10 min
> execution.checkpointing.min-pause     10 min
> execution.savepoint-restore-mode      NO_CLAIM
> execution.savepoint.ignore-unclaimed-state    false
> execution.shutdown-on-attached-exit   false
> execution.target      embedded
> high-availability     zookeeper
> high-availability.cluster-id  application_1684133071014_7202676
> high-availability.storageDir  hdfs://xxxx/user/flink/recovery
> high-availability.zookeeper.path.root /flink
> high-availability.zookeeper.quorum    xxxxx
> internal.cluster.execution-mode       NORMAL
> internal.io.tmpdirs.use-local-default true
> io.tmp.dirs   /data1/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data3/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data4/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data5/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data6/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data7/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data8/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data9/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data10/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data11/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data12/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676
> jobmanager.execution.failover-strategy        region
> jobmanager.memory.heap.size   9261023232b
> jobmanager.memory.jvm-metaspace.size  268435456b
> jobmanager.memory.jvm-overhead.max    1073741824b
> jobmanager.memory.jvm-overhead.min    1073741824b
> jobmanager.memory.off-heap.size       134217728b
> jobmanager.memory.process.size        10240m
> jobmanager.rpc.address        xxxx
> jobmanager.rpc.port   31332
> metrics.reporter.promgateway.deleteOnShutdown true
> metrics.reporter.promgateway.factory.class    org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporterFactory
> metrics.reporter.promgateway.hostUrl  xxxx:9091
> metrics.reporter.promgateway.interval 60 SECONDS
> metrics.reporter.promgateway.jobName  join_phase3_v7
> metrics.reporter.promgateway.randomJobNameSuffix      false
> parallelism.default   128
> pipeline.classpaths   
> pipeline.jars file:/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676/container_e16_1684133071014_7202676_01_000002/frauddetection-0.1.jar
> rest.address  xxxx
> rest.bind-address     xxxxx
> rest.bind-port        50000-50500
> rest.flamegraph.enabled       true
> restart-strategy.failure-rate.delay   10 s
> restart-strategy.failure-rate.failure-rate-interval   1 min
> restart-strategy.failure-rate.max-failures-per-interval       6
> restart-strategy.type exponential-delay
> state.backend.type    filesystem
> state.checkpoints.dir hdfs://xxxxxx/user/flink/checkpoints-data/wgcn
> state.checkpoints.num-retained        3
> taskmanager.memory.managed.fraction   0
> taskmanager.memory.network.max        600mb
> taskmanager.memory.process.size       10240m
> taskmanager.memory.segment-size       128kb
> taskmanager.network.memory.buffers-per-channel        8
> taskmanager.network.memory.floating-buffers-per-gate  800
> taskmanager.numberOfTaskSlots 2
> web.port      0
> web.tmpdir    /tmp/flink-web-1b87445e-2761-4f16-97a1-8d4fc6fa8534
> yarn.application-attempt-failures-validity-interval   60000
> yarn.application-attempts     3
> yarn.application.name join_phase3_v7
> yarn.heartbeat.container-request-interval     700
> ```
>            Reporter: wgcn
>            Priority: Major
>         Attachments: image-2023-06-13-07-14-48-958.png
>
>
> Flink can't the partition of network after restart, so the job cannot be 
> restored.
>  !image-2023-06-13-07-14-48-958.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
