[ 
https://issues.apache.org/jira/browse/HADOOP-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyao Meng updated HADOOP-18699:
--------------------------------
    Description: 
This is a PSA about a JDK bug; it is not a bug in Hadoop or HDFS. 
The symptoms, workaround, and solution are detailed below.

[~relek] identified that 
[JDK-8292158|https://bugs.openjdk.org/browse/JDK-8292158] (fix backported to 
JDK 11 in [JDK-8295297|https://bugs.openjdk.org/browse/JDK-8295297]) causes 
HDFS clients to fail with InvalidProtocolBufferException, because the protobuf 
message in the Hadoop RPC request is corrupted, when all of the following 
conditions are met:

1. The host CPU supports the AVX-512 instruction set.
2. AVX-512 is enabled in the JVM. This should be the default on AVX-512 
capable hosts, equivalent to specifying the JVM arg {{-XX:UseAVX=3}}.
3. The Hadoop native library (e.g. libhadoop.so) is not available, so the HDFS 
client falls back to the HotSpot JVM's {{aesctr_encrypt}} implementation for 
AES/CTR/NoPadding.
4. The client runs on JDK 11 with an OpenJDK version earlier than 11.0.18.

As a result, the client could print messages like these:

{code:title=Symptoms on the HDFS client}
2023-02-21 15:21:44,380 WARN org.apache.hadoop.hdfs.DFSClient: Connection 
failure: Failed to connect to <HOST/IP:PORT> for file 
/tmp/.cloudera_health_monitoring_canary_files/.canary_file_2023_02_21-15_21_25.b6788e89894a61b5
 for block 
BP-1836197545-10.125.248.11-1672668423261:blk_1073935111_194857:com.google.protobuf.InvalidProtocolBufferException:
 Protocol message tag had invalid wire type.
com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had 
invalid wire type.

2023-02-21 15:21:44,378 WARN org.apache.hadoop.hdfs.DFSClient: Connection 
failure: Failed to connect to <HOST/IP:PORT> for file 
/tmp/.cloudera_health_monitoring_canary_files/.canary_file_2023_02_21-15_21_25.b6788e89894a61b5
 for block 
BP-1836197545-<IP>-1672668423261:blk_1073935111_194857:com.google.protobuf.InvalidProtocolBufferException:
 Protocol message end-group tag did not match expected tag.
com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group 
tag did not match expected tag.

2023-02-21 15:06:55,530 WARN org.apache.hadoop.hdfs.DFSClient: Connection 
failure: Failed to connect to <HOST/IP:PORT> for file 
/tmp/.cloudera_health_monitoring_canary_files/.canary_file_2023_02_21-15_06_55.b4a633a8bde014aa
 for block 
BP-1836197545-<IP>-1672668423261:blk_1073935025_194771:com.google.protobuf.InvalidProtocolBufferException:
 While parsing a protocol message, the input ended unexpectedly in the middle 
of a field. This could mean either than the input has been truncated or that an 
embedded message misreported its own length.
com.google.protobuf.InvalidProtocolBufferException: While parsing a protocol 
message, the input ended unexpectedly in the middle of a field. This could mean 
either than the input has been truncated or that an embedded message 
misreported its own length.
{code}

The error messages might mislead developers and users into thinking this is a 
Hadoop Common or HDFS bug, when in this case it is a JDK bug.


{color:red}Solutions:{color}
1. As a workaround, append {{-XX:UseAVX=2}} to client JVM args; or
2. Upgrade to OpenJDK >= 11.0.18.
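
To see whether a given client JVM matches conditions 2 and 4 above, or whether 
the {{-XX:UseAVX=2}} workaround actually took effect, something like the 
following could be run with the same java binary and JVM args as the client. 
This is only a sketch: the class and method names are made up for illustration, 
it assumes a HotSpot-based JDK 10 or later, and it does not check condition 1 
directly or condition 3 (the Hadoop native library) at all.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Hadoop or the JDK.
final class Jdk8292158Check {

    // True for JDK 11 releases older than 11.0.18 (condition 4).
    static boolean isAffectedJdk11(int feature, int interim, int update) {
        return feature == 11 && interim == 0 && update < 18;
    }

    // True when the UseAVX value selects AVX-512, i.e. level 3 or higher
    // (condition 2). Unknown or unparsable values count as "not AVX-512".
    static boolean isAvx512Level(String useAvxValue) {
        if (useAvxValue == null) {
            return false;
        }
        try {
            return Integer.parseInt(useAvxValue.trim()) >= 3;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Reads the effective UseAVX flag of the running JVM. Returns null when
    // the flag cannot be read (non-HotSpot JVM, or a non-x86 platform).
    static String readUseAvx() {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean == null ? null : bean.getVMOption("UseAVX").getValue();
        } catch (RuntimeException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Runtime.Version v = Runtime.version();
        String useAvx = readUseAvx();
        boolean oldJdk = isAffectedJdk11(v.feature(), v.interim(), v.update());
        boolean avx512 = isAvx512Level(useAvx);

        System.out.println("Java version: " + v);
        System.out.println("UseAVX:       " + (useAvx == null ? "unknown" : useAvx));
        System.out.println("JDK 11 older than 11.0.18: " + oldJdk);
        System.out.println("AVX-512 enabled in JVM:    " + avx512);
        if (oldJdk && avx512) {
            System.out.println("Conditions 2 and 4 are met; if libhadoop is also "
                + "missing, this client may hit JDK-8292158.");
        }
    }
}
```

With the workaround applied, the UseAVX line would be expected to show 2 rather 
than 3.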


I might post a repro test case for this, or find a way in the code to tell the 
user, when the error occurs, that this JDK bug is a potential cause and that 
JDK 11 needs to be upgraded.
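
Until a proper repro exists, a rough standalone check along these lines might 
help. It uses only the JDK's own AES/CTR/NoPadding cipher: the same data is 
encrypted once in a single call and once in irregular chunks, and the two 
results are compared. On a healthy JVM they must be identical. This is a 
sketch under assumptions, not the reproducer from the JDK issue: the class 
name, chunk sizes, and round count are arbitrary, and it is not guaranteed to 
trigger the bug even on an affected JVM.

```java
import java.security.GeneralSecurityException;
import java.util.Arrays;
import java.util.Random;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical consistency check, not part of Hadoop or the JDK.
final class AesCtrChunkCheck {

    static Cipher newCipher(byte[] key, byte[] iv) throws GeneralSecurityException {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
            new IvParameterSpec(iv));
        return c;
    }

    // Encrypts data in one doFinal() call and again in random-sized update()
    // calls; returns true when both ciphertexts are byte-for-byte equal.
    static boolean chunkedMatchesOneShot(byte[] key, byte[] iv, byte[] data, long seed) {
        try {
            byte[] expected = newCipher(key, iv).doFinal(data);

            Cipher chunked = newCipher(key, iv);
            byte[] actual = new byte[data.length + 16]; // slack for safety
            Random rnd = new Random(seed);
            int inPos = 0;
            int outPos = 0;
            while (inPos < data.length) {
                int len = Math.min(1 + rnd.nextInt(1024), data.length - inPos);
                outPos += chunked.update(data, inPos, len, actual, outPos);
                inPos += len;
            }
            outPos += chunked.doFinal(actual, outPos);
            return Arrays.equals(expected, Arrays.copyOf(actual, outPos));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Random rnd = new Random(18699L);
        byte[] key = new byte[16];
        byte[] iv = new byte[16];
        byte[] data = new byte[16 * 1024];
        // Many rounds, so the JIT has a chance to compile the cipher path
        // and switch to the intrinsic implementation.
        for (int round = 0; round < 2000; round++) {
            rnd.nextBytes(key);
            rnd.nextBytes(iv);
            rnd.nextBytes(data);
            if (!chunkedMatchesOneShot(key, iv, data, round)) {
                System.out.println("MISMATCH in round " + round
                    + " on Java " + Runtime.version());
                return;
            }
        }
        System.out.println("No mismatch seen on Java " + Runtime.version());
    }
}
```

Running it once with default JVM args and once with {{-XX:UseAVX=2}} on the 
same host would show whether the workaround changes the outcome.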


> InvalidProtocolBufferException caused by JDK 11 < 11.0.18 AES-CTR cipher 
> state corruption with AVX-512 bug
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-18699
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18699
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: hdfs-client
>            Reporter: Siyao Meng
>            Priority: Major


