[ https://issues.apache.org/jira/browse/TEZ-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ayush Saxena resolved TEZ-4600.
-------------------------------
    Resolution: Fixed

> Secret managers in Tez should respect the algorithm set by hadoop
> -----------------------------------------------------------------
>
>                 Key: TEZ-4600
>                 URL: https://issues.apache.org/jira/browse/TEZ-4600
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>             Fix For: 0.10.5
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> After YARN-11738, Hadoop can use a core-site config to set the key generator algorithm (and key length) used by its secret managers:
> https://github.com/apache/hadoop/commit/b9060fc00df89a4c73d5b98947688b200b79901f
> {code}
>   static {
>     Configuration conf = new Configuration();
>     String algorithm = conf.get(
>         CommonConfigurationKeysPublic.HADOOP_SECURITY_SECRET_MANAGER_KEY_GENERATOR_ALGORITHM_KEY,
>         CommonConfigurationKeysPublic.HADOOP_SECURITY_SECRET_MANAGER_KEY_GENERATOR_ALGORITHM_DEFAULT);
>     LOG.info("Selected hash algorithm: {}", algorithm);
>     SELECTED_ALGORITHM = algorithm;
>     int length = conf.getInt(
>         CommonConfigurationKeysPublic.HADOOP_SECURITY_SECRET_MANAGER_KEY_LENGTH_KEY,
>         CommonConfigurationKeysPublic.HADOOP_SECURITY_SECRET_MANAGER_KEY_LENGTH_DEFAULT);
>     LOG.info("Selected hash key length:{}", length);
>     SELECTED_LENGTH = length;
>   }
> {code}
> In case of a non-default value, a key mismatch happens (as Tez uses the hardcoded value from TEZ-1596), and Tez breaks in several places:
> 1. dagclient <-> AM communication
> {code}
> Caused by: org.apache.hadoop.ipc.RemoteException: DIGEST-SHA: digest response format violation. Mismatched response.
> {code}
> This is because of the ClientToAMTokenSecretManager used in DAGAppMaster, which doesn't apply the changed, non-default algorithm coming from the DAG payload.
> In the Tez AM, new Configuration() is not suitable, especially at static initializer time, because the actual configuration values come as a payload from the upstream application (like HiveServer2).
> 2. secure shuffle: the key is handled by the JobTokenSecretManager, so if the algorithm in the fetchers differs from the one in the ShuffleHandler, the shuffle fetchers fail.
> On the fetcher side:
> {code}
> 2025-02-25 08:36:51,482 [WARN] [Fetcher_B {Map_1 -> Reducer_2} #0] |shuffle.Fetcher|: Fetch Failure while connecting from ccycloud-2.lbodor-fips.root.comops.site to: ccycloud-2.lbodor-fips.root.comops.site:13562, attempt: InputAttemptIdentifier [inputIdentifier=0, attemptNumber=0, pathComponent=attempt_1740472461199_0001_1_00_000000_0_10002, spillType=0, spillId=-1] Informing ShuffleManager:
> java.io.IOException: Server returned HTTP response code: 401 for URL: https://ccycloud-2.lbodor-fips.root.comops.site:13562/mapOutput?job=job_1740472461199_0001&dag=1&reduce=0&map=attempt_1740472461199_0001_1_00_000000_0_10002&keepAlive=true
>   at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1900)
>   at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
>   at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
>   at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:260)
>   at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupConnectionInternal(Fetcher.java:565)
>   at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupConnection(Fetcher.java:534)
>   at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:573)
>   at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:492)
>   at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:290)
>   at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:78)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
>   at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
>   at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> On the ShuffleHandler side (this was a Hadoop ShuffleHandler, btw):
> {code}
> 2025-02-25 08:31:22,781 WARN org.apache.hadoop.mapred.ShuffleHandler: Shuffle failure
> java.io.IOException: Verification of the hashReply failed
>   at org.apache.hadoop.mapreduce.security.SecureShuffleUtils.verifyReply(SecureShuffleUtils.java:106)
>   at org.apache.hadoop.mapred.ShuffleChannelHandler.verifyRequest(ShuffleChannelHandler.java:470)
>   at org.apache.hadoop.mapred.ShuffleChannelHandler.channelRead0(ShuffleChannelHandler.java:259)
>   at org.apache.hadoop.mapred.ShuffleChannelHandler.channelRead0(ShuffleChannelHandler.java:130)
>   at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>   at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>   at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>   at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>   at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436)
>   at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
>   at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
>   at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>   at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>   at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>   at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
>   at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
>   at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
>   at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
>   at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
>   at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> With a fix, both client <-> AM communication and SSL shuffle should work.
> UPDATE: only 2) applies to upstream Tez; 1) was a downstream-only problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
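Why a non-default algorithm on only one side breaks verification can be sketched in plain Java. This is a simplified model, not Tez's or Hadoop's code: both sides compute an HMAC over the same message with the same shared key bytes, and the verifier recomputes and compares. The config key name, the `HmacSHA1` fallback, the shared secret, and the message are hypothetical placeholders (the real constants live in Hadoop's `CommonConfigurationKeysPublic`, and the real shuffle hashing in `SecureShuffleUtils`). `resolveAlgorithm` reads the algorithm from a configuration handed in at runtime (for example, one rebuilt from the DAG payload) rather than from a static `new Configuration()`.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Map;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class SecretManagerAlgorithmSketch {

  // Hypothetical config key and fallback, standing in for the Hadoop
  // *_ALGORITHM_KEY / *_ALGORITHM_DEFAULT constants quoted in the issue.
  static final String ALGORITHM_KEY = "example.secret-manager.algorithm";
  static final String FALLBACK_ALGORITHM = "HmacSHA1";

  // Read the algorithm from the configuration actually in effect
  // (e.g. one rebuilt from a payload), not from a static new Configuration().
  static String resolveAlgorithm(Map<String, String> conf) {
    return conf.getOrDefault(ALGORITHM_KEY, FALLBACK_ALGORITHM);
  }

  // HMAC of the message under the shared key, using the given algorithm.
  static byte[] hmac(String algorithm, byte[] key, String message)
      throws GeneralSecurityException {
    Mac mac = Mac.getInstance(algorithm);
    mac.init(new SecretKeySpec(key, algorithm));
    return mac.doFinal(message.getBytes(StandardCharsets.UTF_8));
  }

  // Verifier side: recompute with its own algorithm and compare in constant time.
  static boolean verify(String algorithm, byte[] key, String message, byte[] received)
      throws GeneralSecurityException {
    return MessageDigest.isEqual(hmac(algorithm, key, message), received);
  }

  public static void main(String[] args) throws GeneralSecurityException {
    // Hypothetical shared secret and request message.
    byte[] sharedKey = "shared-job-token-secret".getBytes(StandardCharsets.UTF_8);
    String message = "/mapOutput?job=job_1&dag=1&reduce=0&map=attempt_1";

    // The sender honors the configured, non-default algorithm.
    String senderAlgorithm = resolveAlgorithm(Map.of(ALGORITHM_KEY, "HmacSHA512"));
    byte[] sent = hmac(senderAlgorithm, sharedKey, message);

    // The verifier either honors the same setting or uses a hardcoded value.
    String configuredVerifier = resolveAlgorithm(Map.of(ALGORITHM_KEY, "HmacSHA512"));
    String hardcodedVerifier = FALLBACK_ALGORITHM;

    System.out.println("same algorithm:      " + verify(configuredVerifier, sharedKey, message, sent));
    System.out.println("hardcoded algorithm: " + verify(hardcodedVerifier, sharedKey, message, sent));
  }
}
```

Running `main` prints `true` for the matching pair and `false` for the hardcoded verifier: the same key and message yield a different digest under a different HMAC algorithm. That is the same class of failure as the 401 / "Verification of the hashReply failed" pair above. The point of the sketch is that both ends must resolve the algorithm from the same configuration source.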