[ 
https://issues.apache.org/jira/browse/TEZ-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena resolved TEZ-4600.
-------------------------------
    Resolution: Fixed

> Secret managers in Tez should respect the algorithm set by hadoop
> -----------------------------------------------------------------
>
>                 Key: TEZ-4600
>                 URL: https://issues.apache.org/jira/browse/TEZ-4600
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>             Fix For: 0.10.5
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> After YARN-11738, Hadoop reads the secret manager's key generator algorithm 
> and key length from configuration (e.g. core-site.xml) instead of using fixed values:
> https://github.com/apache/hadoop/commit/b9060fc00df89a4c73d5b98947688b200b79901f
> {code}
>   static {
>     Configuration conf = new Configuration();
>     String algorithm = conf.get(
>       CommonConfigurationKeysPublic.HADOOP_SECURITY_SECRET_MANAGER_KEY_GENERATOR_ALGORITHM_KEY,
>       CommonConfigurationKeysPublic.HADOOP_SECURITY_SECRET_MANAGER_KEY_GENERATOR_ALGORITHM_DEFAULT);
>     LOG.info("Selected hash algorithm: {}", algorithm);
>     SELECTED_ALGORITHM = algorithm;
>     int length = conf.getInt(
>       CommonConfigurationKeysPublic.HADOOP_SECURITY_SECRET_MANAGER_KEY_LENGTH_KEY,
>       CommonConfigurationKeysPublic.HADOOP_SECURITY_SECRET_MANAGER_KEY_LENGTH_DEFAULT);
>     LOG.info("Selected hash key length:{}", length);
>     SELECTED_LENGTH = length;
>   }
> {code}
> If a non-default value is configured, the keys no longer match (Tez uses the 
> hardcoded value from TEZ-1596), and Tez breaks in several places:
> 1. dagclient <-> AM communication
> {code}
> Caused by: org.apache.hadoop.ipc.RemoteException: DIGEST-SHA: digest response format violation. Mismatched response.
> {code}
> This is caused by the ClientToAMTokenSecretManager used in DAGAppMaster, 
> which does not apply the non-default algorithm that arrives in the DAG payload.
> In the Tez AM, new Configuration() is not suitable, especially in a static 
> initializer, because the actual configuration values arrive as a payload 
> from the upstream application (such as HiveServer2).
> 2. secure shuffle: the key is handled by the JobTokenSecretManager, so if 
> the algorithm used by the fetchers differs from the one in the 
> ShuffleHandler, the shuffle fetchers fail.
> On the fetcher side:
> {code}
> 2025-02-25 08:36:51,482 [WARN] [Fetcher_B {Map_1 -> Reducer_2} #0] |shuffle.Fetcher|: Fetch Failure while connecting from ccycloud-2.lbodor-fips.root.comops.site to: ccycloud-2.lbodor-fips.root.comops.site:13562, attempt: InputAttemptIdentifier [inputIdentifier=0, attemptNumber=0, pathComponent=attempt_1740472461199_0001_1_00_000000_0_10002, spillType=0, spillId=-1] Informing ShuffleManager:
> java.io.IOException: Server returned HTTP response code: 401 for URL: https://ccycloud-2.lbodor-fips.root.comops.site:13562/mapOutput?job=job_1740472461199_0001&dag=1&reduce=0&map=attempt_1740472461199_0001_1_00_000000_0_10002&keepAlive=true
>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1900)
>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
>         at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
>         at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:260)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupConnectionInternal(Fetcher.java:565)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupConnection(Fetcher.java:534)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:573)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:492)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:290)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:78)
>         at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>         at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
>         at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
>         at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> On the ShuffleHandler side (this was a Hadoop ShuffleHandler, by the way):
> {code}
> 2025-02-25 08:31:22,781 WARN org.apache.hadoop.mapred.ShuffleHandler: Shuffle failure
> java.io.IOException: Verification of the hashReply failed
>         at org.apache.hadoop.mapreduce.security.SecureShuffleUtils.verifyReply(SecureShuffleUtils.java:106)
>         at org.apache.hadoop.mapred.ShuffleChannelHandler.verifyRequest(ShuffleChannelHandler.java:470)
>         at org.apache.hadoop.mapred.ShuffleChannelHandler.channelRead0(ShuffleChannelHandler.java:259)
>         at org.apache.hadoop.mapred.ShuffleChannelHandler.channelRead0(ShuffleChannelHandler.java:130)
>         at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>         at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436)
>         at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
>         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
>         at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>         at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
>         at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> With a fix, both client <-> AM communication and SSL shuffle should work.
> UPDATE: only 2) applies to upstream Tez; 1) was a downstream-only problem.
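
To illustrate why the shuffle reply verification in 2) fails, here is a minimal, self-contained sketch using only the JDK's javax.crypto API. It is not the Tez or Hadoop code: the class name HmacMismatchDemo, the helper methods, the sample key and message, and the choice of HmacSHA1 versus HmacSHA256 are all assumptions made for illustration. It only shows that two sides holding the same secret still disagree on the hash when they compute the HMAC with different algorithms.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical demo class; not part of Tez or Hadoop.
final class HmacMismatchDemo {

    // Computes an HMAC over msg with the given JCA algorithm name.
    static byte[] hmac(String algorithm, byte[] key, byte[] msg) throws Exception {
        Mac mac = Mac.getInstance(algorithm);
        mac.init(new SecretKeySpec(key, algorithm));
        return mac.doFinal(msg);
    }

    // Recomputes the HMAC with the verifier's algorithm and compares it to the reply.
    static boolean verify(String algorithm, byte[] key, byte[] msg, byte[] reply)
            throws Exception {
        return MessageDigest.isEqual(hmac(algorithm, key, msg), reply);
    }

    public static void main(String[] args) throws Exception {
        // Illustrative values only: both sides share the same secret and message.
        byte[] key = "shared-job-token-secret".getBytes(StandardCharsets.UTF_8);
        byte[] msg = "/mapOutput?job=job_1&reduce=0".getBytes(StandardCharsets.UTF_8);

        // The "fetcher" hashes with one algorithm (assumed hardcoded here).
        byte[] fetcherHash = hmac("HmacSHA1", key, msg);

        // A "handler" using the same algorithm accepts the hash ...
        System.out.println("same algorithm:      "
                + verify("HmacSHA1", key, msg, fetcherHash));
        // ... while one configured with a different algorithm rejects it.
        System.out.println("different algorithm: "
                + verify("HmacSHA256", key, msg, fetcherHash));
    }
}
```

Under these assumptions the first check succeeds and the second fails, which is the same kind of mismatch reported as "Verification of the hashReply failed" above. The secret itself is identical on both sides; only the algorithm differs.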



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
