[jira] [Updated] (FLINK-37378) Yarn log aggregation fails with Kerberos DT issues

slankka (Jira) Tue, 25 Feb 2025 04:53:07 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-37378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


slankka updated FLINK-37378:
----------------------------
    Description: 
Thanks to [~gaborgsomogyi], who created FLINK-28608. We found it helpful for 
solving log aggregation failures of long-running Flink on YARN applications, 
so I suggest documenting the token provider renewer configuration.

The problem is difficult to prove, but it can be verified by shortening the 
HDFS delegation token lifetimes:
{code:java}
dfs.namenode.delegation.key.update-interval 86400000 (1 day)   # change to 180000 (3 min)
dfs.namenode.delegation.token.max-lifetime 604800000 (7 days)  # change to 360000 (6 min)
dfs.namenode.delegation.token.renew-interval 86400000 (1 day)  # change to 180000 (3 min)
{code}
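
A minimal sketch of how the shortened test values could be rendered as 
hdfs-site.xml property entries. The helper name render_properties is 
hypothetical; only the three property names and the millisecond values come 
from the settings above (360000 ms is 6 minutes).

```python
import xml.etree.ElementTree as ET

# Shortened values in milliseconds, copied from the settings listed above.
TEST_SETTINGS = {
    "dfs.namenode.delegation.key.update-interval": 180000,   # 3 min
    "dfs.namenode.delegation.token.max-lifetime": 360000,    # 6 min
    "dfs.namenode.delegation.token.renew-interval": 180000,  # 3 min
}


def render_properties(settings):
    """Return one <property> element per setting, joined by newlines."""
    chunks = []
    for name, value in settings.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
        chunks.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(chunks)


print(render_properties(TEST_SETTINGS))
```

The printed elements would go inside the <configuration> element of 
hdfs-site.xml on a test cluster only, never on a production NameNode.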
 

Normally, with the default settings, the YARN log aggregation status becomes 
TIMEDOUT once the application has run for 7 days.

This does not depend on the Hadoop release in use (we run Apache Hadoop 
3.3.6).

 

*How we found the problem*

Log example where log aggregation succeeds (Flink-1.13.0):
{code:java}
token for xxxx: HDFS_DELEGATION_TOKEN owner=xxxx/[email protected], renewer=yarn, realUser=, issueDate=1739273095368, maxDate=1739877895368
{code}
Log example where log aggregation fails (Flink-1.17.0):
{code:java}
token for xxxx: HDFS_DELEGATION_TOKEN owner=xxxx/[email protected], renewer=, realUser=, issueDate=1739953940508, maxDate=1739954300508
{code}
Note that the renewer field is empty in the failing case.
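
As a rough illustration, the two log lines can be compared by extracting the 
renewer and the token lifetime (maxDate minus issueDate). The function name 
parse_token is hypothetical and the owner field is shortened to xxxx; the 
timestamps are the ones from the examples above.

```python
import re


def parse_token(line):
    """Return (renewer, lifetime in ms) from a delegation token log line."""
    # Collect every key=value pair; a value ends at a comma or whitespace.
    fields = dict(re.findall(r"(\w+)=([^,\s]*)", line))
    lifetime_ms = int(fields["maxDate"]) - int(fields["issueDate"])
    return fields.get("renewer", ""), lifetime_ms


OK_LINE = ("token for xxxx: HDFS_DELEGATION_TOKEN owner=xxxx, renewer=yarn, "
           "realUser=, issueDate=1739273095368, maxDate=1739877895368")
FAILED_LINE = ("token for xxxx: HDFS_DELEGATION_TOKEN owner=xxxx, renewer=, "
               "realUser=, issueDate=1739953940508, maxDate=1739954300508")

print(parse_token(OK_LINE))      # ('yarn', 604800000)
print(parse_token(FAILED_LINE))  # ('', 360000)
```

The first token has renewer=yarn and a lifetime of 604800000 ms (7 days, the 
default max-lifetime); the second has an empty renewer and a lifetime of 
360000 ms, which matches the shortened max-lifetime used for the test.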
 

*Solution we found*

If Flink is deployed on YARN, this configuration is required to keep YARN log 
aggregation working when a Flink job terminates (FAILED, FINISHED, KILLED) 
after running for 7 days or longer.

It is not configured by default. If Flink runs for 7 days without this 
setting, YARN log aggregation fails.
{code:java}
# available since Flink-1.16
security.kerberos.token.provider.%s.renewer

# when deployed on YARN
security.kerberos.token.provider.hadoopfs.renewer: yarn {code}
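
A small sketch of a pre-deployment check for this setting. It assumes a flat 
"key: value" configuration file (the legacy flink-conf.yaml layout); the 
function name find_renewer is hypothetical, and nested YAML is not handled.

```python
RENEWER_KEY = "security.kerberos.token.provider.hadoopfs.renewer"


def find_renewer(conf_text):
    """Return the configured renewer, or None if the key is absent or empty."""
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        if key.strip() == RENEWER_KEY:
            return value.strip() or None
    return None


sample = """
# when deployed on YARN
security.kerberos.token.provider.hadoopfs.renewer: yarn
"""
print(find_renewer(sample))
```

A result of None would indicate that the renewer is not set, which is the 
situation in which log aggregation fails for long-running jobs as described 
above.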
 

BTW, [dinchamion (Greg 'Dinchamion' Fazekas) · 
GitHub|https://github.com/dinchamion] (not me) at Cloudera has also pointed 
out the importance of this setting in the link below, but no pull request has 
been created for Flink yet.

Proof link:

[https://github.com/cloudera/flink-tutorials/pull/44]

 

  was:
Thanks to [~gaborgsomogyi] , he 

The configuration of the token provider renewer should be documented.

 

If Flink is deployed on YARN, this configuration is required to keep YARN log 
aggregation working when a Flink job terminates (FAILED, FINISHED, KILLED) 
after running for 7 days or longer.

It is not configured by default. If Flink runs for 7 days without this 
setting, YARN log aggregation fails.

 
{code:java}
# since Flink-1.16
security.kerberos.token.provider.%s.renewer

# if deploys on Yarn
security.kerberos.token.provider.hadoopfs.renewer: yarn {code}
Links:

[https://github.com/cloudera/flink-tutorials/pull/44]

 


> Yarn log aggregation fails with Kerberos DT issues
> --------------------------------------------------
>
>                 Key: FLINK-37378
>                 URL: https://issues.apache.org/jira/browse/FLINK-37378
>             Project: Flink
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: slankka
>            Priority: Major
>              Labels: docuentation



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
