[ 
https://issues.apache.org/jira/browse/HADOOP-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084469#comment-18084469
 ] 

Siyao Meng commented on HADOOP-18699:
-------------------------------------

Standalone repro on an AVX-512-capable host:

{code:bash}
# AVX-512 present
$ grep -o -E 'avx512f|avx512bw|avx512vl|vaes' /proc/cpuinfo | sort -u
avx512bw
avx512f
avx512vl
vaes

# Download affected JDK (Temurin 11.0.17)
curl -fsSL -o jdk11017.tar.gz \
  'https://github.com/adoptium/temurin11-binaries/releases/download/jdk-11.0.17%2B8/OpenJDK11U-jdk_x64_linux_hotspot_11.0.17_8.tar.gz'
tar xzf jdk11017.tar.gz
export JAVA_HOME="$PWD/jdk-11.0.17+8"
export PATH="$JAVA_HOME/bin:$PATH"

$ which java
/home/ubuntu/jdk-11.0.17+8/bin/java
$ java -version
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment Temurin-11.0.17+8 (build 11.0.17+8)
OpenJDK 64-Bit Server VM Temurin-11.0.17+8 (build 11.0.17+8, mixed mode)

# JVM default UseAVX (expect 3 on AVX-512 hosts)
$ java -XX:+PrintFlagsFinal -version 2>/dev/null | grep -w UseAVX
     intx UseAVX                                   = 3                          {ARCH product} {default}
{code}

{code}
cat > Repro18699.java <<'EOF'
import java.nio.ByteBuffer;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Random;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
public class Repro18699 {
  private static final String ALGO = "AES/CTR/NoPadding";
  private static final int ITERATIONS = 200000;
  private static final int MAX_LEN = 15;
  public static void main(String[] args) throws Exception {
    SecureRandom sr = new SecureRandom();
    byte[] keyBytes = new byte[16];
    byte[] ivBytes = new byte[16];
    sr.nextBytes(keyBytes);
    sr.nextBytes(ivBytes);
    SecretKeySpec key = new SecretKeySpec(keyBytes, "AES");
    IvParameterSpec iv = new IvParameterSpec(ivBytes);
    Cipher enc = Cipher.getInstance(ALGO);
    Cipher dec = Cipher.getInstance(ALGO);
    enc.init(Cipher.ENCRYPT_MODE, key, iv);
    dec.init(Cipher.DECRYPT_MODE, key, iv);
    ByteBuffer inBuf = ByteBuffer.allocateDirect(MAX_LEN);
    ByteBuffer outBuf = ByteBuffer.allocateDirect(MAX_LEN);
    Random random = new Random(12345L);
    byte[][] plaintexts = new byte[ITERATIONS][];
    byte[][] ciphertexts = new byte[ITERATIONS][];
    for (int i = 0; i < ITERATIONS; i++) {
      int len = (i % MAX_LEN) + 1;
      byte[] original = new byte[len];
      random.nextBytes(original);
      plaintexts[i] = original;
      inBuf.clear();
      inBuf.put(original);
      inBuf.flip();
      outBuf.clear();
      outBuf.limit(len);
      enc.update(inBuf, outBuf);
      outBuf.flip();
      byte[] ct = new byte[len];
      outBuf.get(ct);
      ciphertexts[i] = ct;
    }
    for (int i = 0; i < ITERATIONS; i++) {
      byte[] ct = ciphertexts[i];
      int len = ct.length;
      inBuf.clear();
      inBuf.put(ct);
      inBuf.flip();
      outBuf.clear();
      outBuf.limit(len);
      dec.update(inBuf, outBuf);
      outBuf.flip();
      byte[] roundTripped = new byte[len];
      outBuf.get(roundTripped);
      if (!Arrays.equals(plaintexts[i], roundTripped)) {
        System.out.println("RESULT: FAIL (HADOOP-18699 reproduced) at i=" + i);
        System.exit(1);
      }
    }
    System.out.println("RESULT: PASS (" + ITERATIONS + " sub-block round trips OK)");
  }
}
EOF
{code}

{code}
$ javac Repro18699.java
$ java Repro18699
RESULT: FAIL (HADOOP-18699 reproduced) at i=8192
$ echo "exit=$?"
exit=1
{code}

Both workarounds pass:
{code}
$ java -XX:UseAVX=2 Repro18699
RESULT: PASS (200000 sub-block round trips OK)
$ java -XX:+UnlockDiagnosticVMOptions -XX:-UseAESCTRIntrinsics Repro18699
RESULT: PASS (200000 sub-block round trips OK)
{code}
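
To decide whether a given client JVM needs one of these workarounds, the JDK version and the effective UseAVX value can be checked at runtime. Below is a minimal sketch, not part of Hadoop: the class {{Hadoop18699Check}} and its methods are hypothetical names, and it assumes a HotSpot JVM that exposes {{HotSpotDiagnosticMXBean}}. It covers only the JDK 11 < 11.0.18 and UseAVX=3 conditions from the issue description; it does not check whether the Hadoop native library is loaded.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration only; not part of Hadoop.
public class Hadoop18699Check {

  // True when the version is a JDK 11 update older than 11.0.18
  // and AVX-512 code paths are enabled (UseAVX >= 3).
  public static boolean isAffected(Runtime.Version v, int useAvx) {
    boolean unfixedJdk11 =
        v.feature() == 11 && v.interim() == 0 && v.update() < 18;
    return unfixedJdk11 && useAvx >= 3;
  }

  // Reads the effective UseAVX value, or returns -1 if the flag cannot be
  // read (assumption: this happens on JVMs or CPUs without that flag).
  public static int currentUseAvx() {
    try {
      HotSpotDiagnosticMXBean bean =
          ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
      if (bean == null) {
        return -1;
      }
      return Integer.parseInt(bean.getVMOption("UseAVX").getValue());
    } catch (RuntimeException e) {
      return -1;
    }
  }

  public static void main(String[] args) {
    Runtime.Version v = Runtime.version();
    int useAvx = currentUseAvx();
    if (isAffected(v, useAvx)) {
      System.out.println("WARNING: JDK " + v + " with UseAVX=" + useAvx
          + " matches the HADOOP-18699 conditions; upgrade to >= 11.0.18"
          + " or run with -XX:UseAVX=2");
    } else {
      System.out.println("JDK " + v + " with UseAVX=" + useAvx
          + " does not match the HADOOP-18699 conditions");
    }
  }
}
```

Running {{java Hadoop18699Check.java}} with the same JDK and JVM args as the client shows which branch applies. Keeping {{isAffected}} free of JVM lookups makes the version logic easy to test on any host.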

Alternatively, reproduce it with the Hadoop test diff applied:

{code}
mvn -pl hadoop-common-project/hadoop-common -am \
  test -Dtest=TestCryptoCodec#testJceAesCtrCryptoCodecHADOOP18699
{code}

> InvalidProtocolBufferException caused by JDK 11 < 11.0.18 AES-CTR cipher 
> state corruption with AVX-512 bug
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-18699
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18699
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: hdfs-client
>            Reporter: Siyao Meng
>            Priority: Major
>
> This serves as a PSA for a JDK bug; it is not really a bug in Hadoop / HDFS. 
> The symptom, workaround, and solution are detailed below.
> [~relek] identified that 
> [JDK-8292158|https://bugs.openjdk.org/browse/JDK-8292158] (backported to 
> JDK 11 in [JDK-8295297|https://bugs.openjdk.org/browse/JDK-8295297]) causes 
> HDFS clients to fail with InvalidProtocolBufferException, due to a corrupted 
> protobuf message in the Hadoop RPC request, when all of the conditions below 
> are met:
> 1. The host supports the AVX-512 instruction set.
> 2. AVX-512 is enabled in the JVM. This is the default on AVX-512 capable 
> hosts and is equivalent to specifying JVM arg {{-XX:UseAVX=3}}.
> 3. The Hadoop native library (e.g. libhadoop.so) is not available, so the 
> HDFS client falls back to the HotSpot JVM's {{aesctr_encrypt}} implementation 
> for AES/CTR/NoPadding.
> 4. The client runs JDK 11 with an OpenJDK version < 11.0.18.
> As a result, the client could print messages like these:
> {code:title=Symptoms on the HDFS client}
> 2023-02-21 15:21:44,380 WARN org.apache.hadoop.hdfs.DFSClient: Connection 
> failure: Failed to connect to <HOST/IP:PORT> for file 
> /tmp/.cloudera_health_monitoring_canary_files/.canary_file_2023_02_21-15_21_25.b6788e89894a61b5
>  for block 
> BP-1836197545-10.125.248.11-1672668423261:blk_1073935111_194857:com.google.protobuf.InvalidProtocolBufferException:
>  Protocol message tag had invalid wire type.
> com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had 
> invalid wire type.
> 2023-02-21 15:21:44,378 WARN org.apache.hadoop.hdfs.DFSClient: Connection 
> failure: Failed to connect to <HOST/IP:PORT> for file 
> /tmp/.cloudera_health_monitoring_canary_files/.canary_file_2023_02_21-15_21_25.b6788e89894a61b5
>  for block 
> BP-1836197545-<IP>-1672668423261:blk_1073935111_194857:com.google.protobuf.InvalidProtocolBufferException:
>  Protocol message end-group tag did not match expected tag.
> com.google.protobuf.InvalidProtocolBufferException: Protocol message 
> end-group tag did not match expected tag.
> 2023-02-21 15:06:55,530 WARN org.apache.hadoop.hdfs.DFSClient: Connection 
> failure: Failed to connect to <HOST/IP:PORT> for file 
> /tmp/.cloudera_health_monitoring_canary_files/.canary_file_2023_02_21-15_06_55.b4a633a8bde014aa
>  for block 
> BP-1836197545-<IP>-1672668423261:blk_1073935025_194771:com.google.protobuf.InvalidProtocolBufferException:
>  While parsing a protocol message, the input ended unexpectedly in the middle 
> of a field. This could mean either than the input has been truncated or that 
> an embedded message misreported its own length.
> com.google.protobuf.InvalidProtocolBufferException: While parsing a protocol 
> message, the input ended unexpectedly in the middle of a field. This could 
> mean either than the input has been truncated or that an embedded message 
> misreported its own length.
> {code}
> The error message might mislead devs/users into thinking this is a Hadoop 
> Common or HDFS bug (while it is a JDK bug in this case).
> {color:red}Solutions:{color}
> 1. As a workaround, append {{-XX:UseAVX=2}} to client JVM args; or
> 2. Upgrade to OpenJDK >= 11.0.18.
> I might post a repro test case for this, or find a way in the code to prompt 
> the user that this could be the potential issue (need to upgrade JDK 11) when 
> it occurs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
