[ https://issues.apache.org/jira/browse/HADOOP-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084469#comment-18084469 ]
Siyao Meng commented on HADOOP-18699:
-------------------------------------
Standalone repro on an AVX-512-capable host:
{code:bash}
# AVX-512 present
$ grep -o -E 'avx512f|avx512bw|avx512vl|vaes' /proc/cpuinfo | sort -u
avx512bw
avx512f
avx512vl
vaes
# Download an affected JDK (Temurin 11.0.17+8)
$ curl -fsSL -o jdk11017.tar.gz \
    'https://github.com/adoptium/temurin11-binaries/releases/download/jdk-11.0.17%2B8/OpenJDK11U-jdk_x64_linux_hotspot_11.0.17_8.tar.gz'
$ tar xzf jdk11017.tar.gz
$ export JAVA_HOME="$PWD/jdk-11.0.17+8"
$ export PATH="$JAVA_HOME/bin:$PATH"
$ which java
/home/ubuntu/jdk-11.0.17+8/bin/java
$ java -version
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment Temurin-11.0.17+8 (build 11.0.17+8)
OpenJDK 64-Bit Server VM Temurin-11.0.17+8 (build 11.0.17+8, mixed mode)
# JVM default UseAVX (expect 3 on AVX-512 hosts)
$ java -XX:+PrintFlagsFinal -version 2>/dev/null | grep -w UseAVX
intx UseAVX = 3 {ARCH product} {default}
{code}
{code}
cat > Repro18699.java <<'EOF'
import java.nio.ByteBuffer;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Random;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
public class Repro18699 {
  private static final String ALGO = "AES/CTR/NoPadding";
  private static final int ITERATIONS = 200000;
  // Every update() call is shorter than one 16-byte AES block.
  private static final int MAX_LEN = 15;

  public static void main(String[] args) throws Exception {
    SecureRandom sr = new SecureRandom();
    byte[] keyBytes = new byte[16];
    byte[] ivBytes = new byte[16];
    sr.nextBytes(keyBytes);
    sr.nextBytes(ivBytes);
    SecretKeySpec key = new SecretKeySpec(keyBytes, "AES");
    IvParameterSpec iv = new IvParameterSpec(ivBytes);
    // Same key and IV on both sides, so the two keystreams must stay in sync.
    Cipher enc = Cipher.getInstance(ALGO);
    Cipher dec = Cipher.getInstance(ALGO);
    enc.init(Cipher.ENCRYPT_MODE, key, iv);
    dec.init(Cipher.DECRYPT_MODE, key, iv);
    ByteBuffer inBuf = ByteBuffer.allocateDirect(MAX_LEN);
    ByteBuffer outBuf = ByteBuffer.allocateDirect(MAX_LEN);
    Random random = new Random(12345L);
    byte[][] plaintexts = new byte[ITERATIONS][];
    byte[][] ciphertexts = new byte[ITERATIONS][];

    // Pass 1: encrypt ITERATIONS chunks of 1..MAX_LEN bytes on one cipher stream.
    for (int i = 0; i < ITERATIONS; i++) {
      int len = (i % MAX_LEN) + 1;
      byte[] original = new byte[len];
      random.nextBytes(original);
      plaintexts[i] = original;
      inBuf.clear();
      inBuf.put(original);
      inBuf.flip();
      outBuf.clear();
      outBuf.limit(len);
      enc.update(inBuf, outBuf);
      outBuf.flip();
      byte[] ct = new byte[len];
      outBuf.get(ct);
      ciphertexts[i] = ct;
    }

    // Pass 2: decrypt the same chunks in order and compare with the plaintexts.
    for (int i = 0; i < ITERATIONS; i++) {
      byte[] ct = ciphertexts[i];
      int len = ct.length;
      inBuf.clear();
      inBuf.put(ct);
      inBuf.flip();
      outBuf.clear();
      outBuf.limit(len);
      dec.update(inBuf, outBuf);
      outBuf.flip();
      byte[] roundTripped = new byte[len];
      outBuf.get(roundTripped);
      if (!Arrays.equals(plaintexts[i], roundTripped)) {
        System.out.println("RESULT: FAIL (HADOOP-18699 reproduced) at i=" + i);
        System.exit(1);
      }
    }
    System.out.println("RESULT: PASS (" + ITERATIONS + " sub-block round trips OK)");
  }
}
EOF
{code}
{code}
$ javac Repro18699.java
$ java Repro18699
RESULT: FAIL (HADOOP-18699 reproduced) at i=8192
$ echo "exit=$?"
exit=1
{code}
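The repro depends on AES/CTR being a stream mode: splitting the input across many short {{update()}} calls must give the same bytes as encrypting it in one call. A minimal sketch of that property follows. The class name {{CtrChunkCheck}}, the fixed seed, and the chunk size of 7 are illustrative assumptions and not part of the Hadoop test. It uses heap arrays and a small input, so it is not expected to trigger the JDK bug by itself.

```java
import java.util.Arrays;
import java.util.Random;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper, not part of Hadoop: checks that chunked CTR equals one-shot CTR.
public class CtrChunkCheck {
  private static final String ALGO = "AES/CTR/NoPadding";

  /** Encrypts the whole input with a single doFinal() call. */
  public static byte[] oneShot(byte[] key, byte[] iv, byte[] data) throws Exception {
    Cipher c = Cipher.getInstance(ALGO);
    c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    return c.doFinal(data);
  }

  /** Encrypts the input with repeated update() calls of at most chunk bytes each. */
  public static byte[] chunked(byte[] key, byte[] iv, byte[] data, int chunk)
      throws Exception {
    Cipher c = Cipher.getInstance(ALGO);
    c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    byte[] out = new byte[data.length];
    int pos = 0;
    while (pos < data.length) {
      int len = Math.min(chunk, data.length - pos);
      // In CTR mode each update() emits exactly as many bytes as it consumes.
      c.update(data, pos, len, out, pos);
      pos += len;
    }
    return out;
  }

  public static void main(String[] args) throws Exception {
    Random random = new Random(12345L); // fixed seed so runs are repeatable
    byte[] key = new byte[16];
    byte[] iv = new byte[16];
    byte[] data = new byte[1000];
    random.nextBytes(key);
    random.nextBytes(iv);
    random.nextBytes(data);
    boolean same = Arrays.equals(oneShot(key, iv, data), chunked(key, iv, data, 7));
    System.out.println(same ? "chunked == one-shot" : "MISMATCH between chunked and one-shot");
  }
}
```

The failure at a specific iteration in the repro above means the cipher state drifted between two {{update()}} calls, which is the same property this check exercises on a much smaller scale.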
With either workaround applied, the repro passes:
{code}
$ java -XX:UseAVX=2 Repro18699
RESULT: PASS (200000 sub-block round trips OK)
$ java -XX:+UnlockDiagnosticVMOptions -XX:-UseAESCTRIntrinsics Repro18699
RESULT: PASS (200000 sub-block round trips OK)
{code}
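To confirm from inside a running client which {{UseAVX}} value is actually in effect (for example after adding {{-XX:UseAVX=2}}), the flag can be read through the HotSpot diagnostic MXBean. This is a sketch under the assumption that the client runs on a HotSpot-based JVM. The class name {{UseAvxProbe}} and the -1 fallback are my own choices.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Hadoop: reports the effective UseAVX setting.
public class UseAvxProbe {
  /** Returns the effective UseAVX value, or -1 if it cannot be determined on this JVM. */
  public static int effectiveUseAvx() {
    try {
      HotSpotDiagnosticMXBean bean =
          ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
      if (bean == null) {
        return -1;
      }
      return Integer.parseInt(bean.getVMOption("UseAVX").getValue());
    } catch (RuntimeException e) {
      // For example IllegalArgumentException when the flag does not exist,
      // which is expected on non-x86 builds.
      return -1;
    }
  }

  public static void main(String[] args) {
    System.out.println("UseAVX=" + effectiveUseAvx());
  }
}
```

Running it as {{java -XX:UseAVX=2 UseAvxProbe}} on an x86 host should report the overridden value rather than the default.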
Alternatively, reproduce with the Hadoop test diff applied:
{code}
mvn -pl hadoop-common-project/hadoop-common -am \
test -Dtest=TestCryptoCodec#testJceAesCtrCryptoCodecHADOOP18699
{code}
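The issue description mentions possibly prompting the user when the affected JDK is in use. A minimal sketch of such a check, covering only the version condition (JDK 11 older than 11.0.18) and not the AVX-512 or native-library conditions, could look like the following. The class name {{Jdk11AesCtrCheck}} and the warning text are hypothetical, and this is not what any Hadoop patch necessarily does.

```java
// Hypothetical helper, not part of Hadoop: flags JDK 11 versions older than 11.0.18.
public class Jdk11AesCtrCheck {
  /** True when the version is 11.0.x with an update number below 18. */
  public static boolean isAffectedVersion(Runtime.Version v) {
    return v.feature() == 11 && v.interim() == 0 && v.update() < 18;
  }

  public static void main(String[] args) {
    Runtime.Version v = Runtime.version();
    if (isAffectedVersion(v)) {
      System.err.println("WARN: JDK " + v + " is older than 11.0.18 and may be affected by"
          + " JDK-8292158 (see HADOOP-18699); consider upgrading or setting -XX:UseAVX=2");
    } else {
      System.out.println("JDK " + v + " is outside the version range named in HADOOP-18699");
    }
  }
}
```

The check is deliberately conservative: it flags every 11.0.x release below 11.0.18, following the condition as stated in the issue, even on hosts without AVX-512 where the bug would not trigger.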
> InvalidProtocolBufferException caused by JDK 11 < 11.0.18 AES-CTR cipher
> state corruption with AVX-512 bug
> ----------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-18699
> URL: https://issues.apache.org/jira/browse/HADOOP-18699
> Project: Hadoop Common
> Issue Type: Bug
> Components: hdfs-client
> Reporter: Siyao Meng
> Priority: Major
>
> This serves as a PSA for a JDK bug; it is not really a bug in Hadoop / HDFS.
> Symptom/Workaround/Solution detailed below.
> [~relek] identified that [JDK-8292158|https://bugs.openjdk.org/browse/JDK-8292158]
> (fix backported to JDK 11 in
> [JDK-8295297|https://bugs.openjdk.org/browse/JDK-8295297]) causes HDFS
> clients to fail with InvalidProtocolBufferException due to a corrupted protobuf
> message in the Hadoop RPC request when all of the conditions below are met:
> 1. The host supports the AVX-512 instruction sets.
> 2. AVX-512 is enabled in the JVM. This should be enabled by default on AVX-512
> capable hosts, equivalent to specifying the JVM arg {{-XX:UseAVX=3}}.
> 3. The Hadoop native library (e.g. libhadoop.so) is not available, so the HDFS
> client falls back to the HotSpot JVM's {{aesctr_encrypt}} implementation for
> AES/CTR/NoPadding.
> 4. The client runs JDK 11 with an OpenJDK version < 11.0.18.
> As a result, the client could print messages like these:
> {code:title=Symptoms on the HDFS client}
> 2023-02-21 15:21:44,380 WARN org.apache.hadoop.hdfs.DFSClient: Connection
> failure: Failed to connect to <HOST/IP:PORT> for file
> /tmp/.cloudera_health_monitoring_canary_files/.canary_file_2023_02_21-15_21_25.b6788e89894a61b5
> for block
> BP-1836197545-10.125.248.11-1672668423261:blk_1073935111_194857:com.google.protobuf.InvalidProtocolBufferException:
> Protocol message tag had invalid wire type.
> com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had
> invalid wire type.
> 2023-02-21 15:21:44,378 WARN org.apache.hadoop.hdfs.DFSClient: Connection
> failure: Failed to connect to <HOST/IP:PORT> for file
> /tmp/.cloudera_health_monitoring_canary_files/.canary_file_2023_02_21-15_21_25.b6788e89894a61b5
> for block
> BP-1836197545-<IP>-1672668423261:blk_1073935111_194857:com.google.protobuf.InvalidProtocolBufferException:
> Protocol message end-group tag did not match expected tag.
> com.google.protobuf.InvalidProtocolBufferException: Protocol message
> end-group tag did not match expected tag.
> 2023-02-21 15:06:55,530 WARN org.apache.hadoop.hdfs.DFSClient: Connection
> failure: Failed to connect to <HOST/IP:PORT> for file
> /tmp/.cloudera_health_monitoring_canary_files/.canary_file_2023_02_21-15_06_55.b4a633a8bde014aa
> for block
> BP-1836197545-<IP>-1672668423261:blk_1073935025_194771:com.google.protobuf.InvalidProtocolBufferException:
> While parsing a protocol message, the input ended unexpectedly in the middle
> of a field. This could mean either than the input has been truncated or that
> an embedded message misreported its own length.
> com.google.protobuf.InvalidProtocolBufferException: While parsing a protocol
> message, the input ended unexpectedly in the middle of a field. This could
> mean either than the input has been truncated or that an embedded message
> misreported its own length.
> {code}
> The error message might mislead devs/users into thinking this is a Hadoop
> Common or HDFS bug (while it is a JDK bug in this case).
> {color:red}Solutions:{color}
> 1. As a workaround, append {{-XX:UseAVX=2}} to client JVM args; or
> 2. Upgrade to OpenJDK >= 11.0.18.
> I might post a repro test case for this, or find a way in the code to prompt
> the user that this could be the potential issue (need to upgrade JDK 11) when
> it occurs.