[ https://issues.apache.org/jira/browse/TEZ-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100835#comment-17100835 ]
László Bodor edited comment on TEZ-4157 at 5/6/20, 2:11 PM:
------------------------------------------------------------

Thanks [~jeagles], these options saved my life... so I started from that strange HTTP response header {Content-type: unknown/unknown}.

tl;dr: this looks like a netty bug to me, where the HTTP response encoder is reused improperly. The workaround (you can find it in [^TEZ-4157.03.patch]) is:
{code}
if (keepAliveParam || connectionKeepAliveEnabled) {
  pipeline.replace(pipeline.get("encoder"), "encoder", new HttpResponseEncoder());
}
{code}
Deep inside the pipeline there is the encoder, which behaves according to its internal state:
https://github.com/netty/netty/blob/4.1/codec-http/src/main/java/io/netty/handler/codec/http/HttpObjectEncoder.java#L86

While [writing the second response|https://github.com/apache/tez/blob/master/tez-plugins/tez-aux-services/src/main/java/org/apache/tez/auxservices/ShuffleHandler.java#L1090], the same encoder instance is reused (I checked the object hashcode), but only if keepalive is enabled. Its internal state is not ST_INIT (0) on the second use, so it throws that IllegalStateException. The result, somehow silently, is what you also saw in your debug messages: an HTTP response with a completely messed up header:
{code}
{Content-type: unknown/unknown}
{code}
With the workaround it works properly (unfortunately there is no reset() call on that encoder).

I want to double-check my workaround and file a netty bug if needed. In the meantime, could you please take a look at the patch? I'm about to test it on a cluster (with Hive). What else do we need in order to get this change merged?
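To illustrate the failure mode described in this comment, here is a minimal, dependency-free sketch. It is not netty code: ToyEncoder, ToyPipeline and secondResponseFails are hypothetical names, and the state machine is reduced to two states. It also assumes the first response left the encoder outside ST_INIT because its content was never terminated; the comment only observes that the state is not ST_INIT on the second use, not why.

```java
public class EncoderStateSketch {

    /** Toy stand-in for a stateful HTTP response encoder (hypothetical, not netty). */
    static class ToyEncoder {
        static final int ST_INIT = 0;     // ready to encode a new response's headers
        static final int ST_CONTENT = 1;  // headers written, content still expected

        private int state = ST_INIT;

        String encodeHeaders(String statusLine) {
            if (state != ST_INIT) {
                // Mirrors the idea of the check the comment points at:
                // a new message is only accepted in the initial state.
                throw new IllegalStateException("unexpected message, state: " + state);
            }
            state = ST_CONTENT;
            return statusLine + "\r\n\r\n";
        }

        String encodeContent(String body, boolean last) {
            if (state != ST_CONTENT) {
                throw new IllegalStateException("unexpected content, state: " + state);
            }
            if (last) {
                state = ST_INIT; // only the last content chunk returns the encoder to ST_INIT
            }
            return body;
        }
    }

    /** Toy pipeline that holds a single encoder and allows swapping it. */
    static class ToyPipeline {
        private ToyEncoder encoder = new ToyEncoder();

        ToyEncoder encoder() {
            return encoder;
        }

        void replaceEncoder(ToyEncoder fresh) {
            encoder = fresh;
        }
    }

    /** Returns true if writing a second response on the same connection throws. */
    static boolean secondResponseFails(boolean replaceEncoder) {
        ToyPipeline pipeline = new ToyPipeline();

        // First response: headers are written but the content is never finished
        // (assumption made for this sketch), so the encoder stays in ST_CONTENT.
        pipeline.encoder().encodeHeaders("HTTP/1.1 200 OK");

        if (replaceEncoder) {
            // Same idea as the workaround: start the next response with a fresh encoder.
            pipeline.replaceEncoder(new ToyEncoder());
        }

        try {
            pipeline.encoder().encodeHeaders("HTTP/1.1 200 OK");
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("reused encoder fails:   " + secondResponseFails(false));
        System.out.println("replaced encoder fails: " + secondResponseFails(true));
    }
}
```

In this toy model, swapping in a fresh encoder has the same effect a reset() would have, which is why replacing the handler works as a workaround when no reset is available.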
> ShuffleHandler: upgrade to netty4
> ---------------------------------
>
>                 Key: TEZ-4157
>                 URL: https://issues.apache.org/jira/browse/TEZ-4157
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TEZ-4157.01.patch, TEZ-4157.02.patch, TEZ-4157.03.patch
>
>
> -In the dependency tree, there are 2 occurrences of compile scope direct netty dependencies, however, they're not used at all. I compiled locally successfully without them. E.g. when investigating blackduck alerts (complaining about netty deps for current 3.10.5.Final), it would be cleaner to start from a dependency tree where Tez doesn't depend on netty directly in order to eliminate its responsibility (and move the focus to underlying hadoop for instance).-
> Tez depends on netty3 almost only in ShuffleHandler and some related classes. We can eliminate netty3 by upgrading it, but this effort might involve some testing due to fundamental [changes from netty3->netty4|https://netty.io/wiki/new-and-noteworthy-in-4.0.html] + we don't have a reference yet, as [hadoop's ShuffleHandler|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/main/java/org/apache/hadoop/mapred/ShuffleHandler.java] is still on netty3.
> As per the netty documentation, we can also expect some performance improvement (e.g. Pooled buffers).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)