[jira] [Assigned] (HDFS-17402) StartupSafeMode should not exit when resources are from low to available
[ https://issues.apache.org/jira/browse/HDFS-17402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zilong Zhu reassigned HDFS-17402: - Assignee: Zilong Zhu > StartupSafeMode should not exit when resources are from low to available > > > Key: HDFS-17402 > URL: https://issues.apache.org/jira/browse/HDFS-17402 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zilong Zhu >Assignee: Zilong Zhu >Priority: Major > Labels: pull-request-available > > After HDFS-17231, the NameNode can exit safemode automatically when resources > recover from low to available. It uses > org.apache.hadoop.hdfs.server.namenode.FSNamesystem#leaveSafeMode, which > changes BMSafeModeStatus. However, a NameNode entering resource-low safemode > doesn't change BMSafeModeStatus in > org.apache.hadoop.hdfs.server.namenode.FSNamesystem#enterSafeMode, so the two > operations are not symmetric. > Now: > a. NN enters StartupSafeMode > b. NN enters ResourceLowSafeMode > c. NN resources recover from low to available > d. NN safemode off > > Expectations: > a. NN enters StartupSafeMode > b. NN enters ResourceLowSafeMode > c. NN resources recover from low to available > d. NN exits ResourceLowSafeMode but remains in StartupSafeMode -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
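The asymmetry described in this report boils down to tracking each safemode reason independently, so that dropping one reason must not clear the others. A minimal, hypothetical sketch in plain Java (not Hadoop's actual BMSafeModeStatus code; all names here are illustrative):

```java
import java.util.EnumSet;

// Hypothetical model of the two safemode conditions in the report:
// leaving one reason for safemode must not clear the remaining reasons.
enum SafeModeReason { STARTUP, RESOURCE_LOW }

class SafeModeTracker {
    private final EnumSet<SafeModeReason> reasons =
        EnumSet.noneOf(SafeModeReason.class);

    void enter(SafeModeReason r) { reasons.add(r); }

    // Expected behavior per the report: dropping RESOURCE_LOW leaves
    // safemode on while STARTUP is still pending.
    void leave(SafeModeReason r) { reasons.remove(r); }

    boolean inSafeMode() { return !reasons.isEmpty(); }
}

public class SafeModeDemo {
    public static void main(String[] args) {
        SafeModeTracker nn = new SafeModeTracker();
        nn.enter(SafeModeReason.STARTUP);       // a. NN enters StartupSafeMode
        nn.enter(SafeModeReason.RESOURCE_LOW);  // b. NN enters ResourceLowSafeMode
        nn.leave(SafeModeReason.RESOURCE_LOW);  // c. resources recover
        // d. NN should still be in StartupSafeMode, not "safemode off"
        System.out.println(nn.inSafeMode());    // prints "true"
    }
}
```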
[jira] [Commented] (HDFS-17503) Unreleased volume references because of OOM
[ https://issues.apache.org/jira/browse/HDFS-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841815#comment-17841815 ] Zilong Zhu commented on HDFS-17503: --- [~Keepromise] It appears to occur when creating the BlockSender object. This is an intermittent issue that occurs in our production environment. If I manually throw an OOM error while creating the BlockSender object, it can cause volume references not to be released. > Unreleased volume references because of OOM > --- > > Key: HDFS-17503 > URL: https://issues.apache.org/jira/browse/HDFS-17503 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zilong Zhu >Assignee: Zilong Zhu >Priority: Major > > When BlockSender throws an error because of OOM, the volume reference obtained > by the thread is not released, which causes the thread trying to remove the > volume to wait and fall into an infinite loop. > I found that HDFS-15963 caught the exception and released the volume reference, > but it did not handle the case where an Error is thrown. I think "catch (Throwable t)" > should be used instead of "catch (IOException ioe)".
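The leak pattern being reported can be reproduced in miniature: a reference-counted resource is acquired before code that may throw an Error, and the release only runs for the caught exception type. This is a standalone sketch, not the actual BlockSender code; RuntimeException stands in for IOException so the demo needs no checked-exception plumbing:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for a DataNode volume reference: counts outstanding holds.
class VolumeReference implements AutoCloseable {
    static final AtomicInteger held = new AtomicInteger();
    VolumeReference() { held.incrementAndGet(); }
    @Override public void close() { held.decrementAndGet(); }
}

public class LeakDemo {
    // Buggy shape: an OutOfMemoryError does not match the catch clause,
    // so close() never runs and the reference count stays elevated.
    static void buggy(boolean failWithError) {
        VolumeReference ref = new VolumeReference();
        try {
            if (failWithError) throw new OutOfMemoryError("simulated");
        } catch (RuntimeException ioe) { // stands in for catch (IOException ioe)
            ref.close();
            throw ioe;
        }
    }

    // Fixed shape: catch Throwable (or use try/finally) so Errors release too.
    static void fixed(boolean failWithError) {
        VolumeReference ref = new VolumeReference();
        try {
            if (failWithError) throw new OutOfMemoryError("simulated");
        } catch (Throwable t) {
            ref.close();
            throw t;
        }
    }

    public static void main(String[] args) {
        try { buggy(true); } catch (Throwable ignored) { }
        System.out.println("held after buggy: " + VolumeReference.held.get()); // 1 (leaked)
        try { fixed(true); } catch (Throwable ignored) { }
        System.out.println("held after fixed: " + VolumeReference.held.get()); // still 1: only the first leak remains
    }
}
```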
[jira] [Assigned] (HDFS-17504) DN process should exit when BPServiceActor exits
[ https://issues.apache.org/jira/browse/HDFS-17504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zilong Zhu reassigned HDFS-17504: - Assignee: Zilong Zhu > DN process should exit when BPServiceActor exits > --- > > Key: HDFS-17504 > URL: https://issues.apache.org/jira/browse/HDFS-17504 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zilong Zhu >Assignee: Zilong Zhu >Priority: Major > > BPServiceActor is a very important thread. In a non-HA cluster, the exit of > the BPServiceActor thread will cause the DN process to exit. However, in an HA > cluster, this is not the case. > I found that HDFS-15651 causes the BPServiceActor thread to exit and sets the > "runningState" from "RunningState.FAILED" to "RunningState.EXITED", which can > be confusing during troubleshooting. > I believe that the DN process should exit when the flag of the BPServiceActor > is set to RunningState.FAILED because at this point, the DN is unable to > recover and establish a heartbeat connection with the ANN on its own.
[jira] [Created] (HDFS-17504) DN process should exit when BPServiceActor exits
Zilong Zhu created HDFS-17504: - Summary: DN process should exit when BPServiceActor exits Key: HDFS-17504 URL: https://issues.apache.org/jira/browse/HDFS-17504 Project: Hadoop HDFS Issue Type: Bug Reporter: Zilong Zhu BPServiceActor is a very important thread. In a non-HA cluster, the exit of the BPServiceActor thread will cause the DN process to exit. However, in an HA cluster, this is not the case. I found that HDFS-15651 causes the BPServiceActor thread to exit and sets the "runningState" from "RunningState.FAILED" to "RunningState.EXITED", which can be confusing during troubleshooting. I believe that the DN process should exit when the flag of the BPServiceActor is set to RunningState.FAILED because at this point, the DN is unable to recover and establish a heartbeat connection with the ANN on its own.
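The proposed policy amounts to a terminal-state check when the actor thread winds down. A toy sketch of that decision (names are illustrative, not Hadoop's actual BPServiceActor API):

```java
// Hypothetical sketch of the behavior proposed in the report: when an actor
// thread ends in FAILED (unrecoverable), the whole process should terminate
// rather than linger without a heartbeat to the active NameNode.
enum RunningState { CONNECTING, RUNNING, EXITED, FAILED }

public class ActorExitDemo {
    // Proposal from the report: FAILED is fatal even in an HA cluster,
    // because the DN cannot re-establish the heartbeat on its own.
    static boolean shouldTerminateProcess(RunningState finalState) {
        return finalState == RunningState.FAILED;
    }

    public static void main(String[] args) {
        System.out.println(shouldTerminateProcess(RunningState.FAILED)); // prints "true"
        System.out.println(shouldTerminateProcess(RunningState.EXITED)); // prints "false"
    }
}
```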
[jira] [Updated] (HDFS-17503) Unreleased volume references because of OOM
[ https://issues.apache.org/jira/browse/HDFS-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zilong Zhu updated HDFS-17503: -- Description: When BlockSender throws an error because of OOM, the volume reference obtained by the thread is not released, which causes the thread trying to remove the volume to wait and fall into an infinite loop. I found that HDFS-15963 caught the exception and released the volume reference, but it did not handle the case where an Error is thrown. I think "catch (Throwable t)" should be used instead of "catch (IOException ioe)". > Unreleased volume references because of OOM > --- > > Key: HDFS-17503 > URL: https://issues.apache.org/jira/browse/HDFS-17503 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zilong Zhu >Priority: Major > > When BlockSender throws an error because of OOM, the volume reference obtained > by the thread is not released, which causes the thread trying to remove the > volume to wait and fall into an infinite loop. > I found that HDFS-15963 caught the exception and released the volume reference, > but it did not handle the case where an Error is thrown. I think "catch (Throwable t)" > should be used instead of "catch (IOException ioe)".
[jira] [Assigned] (HDFS-17503) Unreleased volume references because of OOM
[ https://issues.apache.org/jira/browse/HDFS-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zilong Zhu reassigned HDFS-17503: - Assignee: Zilong Zhu > Unreleased volume references because of OOM > --- > > Key: HDFS-17503 > URL: https://issues.apache.org/jira/browse/HDFS-17503 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zilong Zhu >Assignee: Zilong Zhu >Priority: Major > > When BlockSender throws an error because of OOM, the volume reference obtained > by the thread is not released, which causes the thread trying to remove the > volume to wait and fall into an infinite loop. > I found that HDFS-15963 caught the exception and released the volume reference, > but it did not handle the case where an Error is thrown. I think "catch (Throwable t)" > should be used instead of "catch (IOException ioe)".
[jira] [Updated] (HDFS-17503) Unreleased volume references because of OOM
[ https://issues.apache.org/jira/browse/HDFS-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zilong Zhu updated HDFS-17503: -- Summary: Unreleased volume references because of OOM (was: Unreleased volume references because of) > Unreleased volume references because of OOM > --- > > Key: HDFS-17503 > URL: https://issues.apache.org/jira/browse/HDFS-17503 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zilong Zhu >Priority: Major >
[jira] [Created] (HDFS-17503) Unreleased volume references because of
Zilong Zhu created HDFS-17503: - Summary: Unreleased volume references because of Key: HDFS-17503 URL: https://issues.apache.org/jira/browse/HDFS-17503 Project: Hadoop HDFS Issue Type: Bug Reporter: Zilong Zhu
[jira] [Updated] (HDFS-17402) StartupSafeMode should not exit when resources are from low to available
[ https://issues.apache.org/jira/browse/HDFS-17402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zilong Zhu updated HDFS-17402: -- Description: After HDFS-17231, the NameNode can exit safemode automatically when resources recover from low to available. It uses org.apache.hadoop.hdfs.server.namenode.FSNamesystem#leaveSafeMode, which changes BMSafeModeStatus. However, a NameNode entering resource-low safemode doesn't change BMSafeModeStatus in org.apache.hadoop.hdfs.server.namenode.FSNamesystem#enterSafeMode, so the two operations are not symmetric. Now: a. NN enters StartupSafeMode b. NN enters ResourceLowSafeMode c. NN resources recover from low to available d. NN safemode off Expectations: a. NN enters StartupSafeMode b. NN enters ResourceLowSafeMode c. NN resources recover from low to available d. NN exits ResourceLowSafeMode but remains in StartupSafeMode was: After HDFS-17231, the NameNode can exit safemode automatically when resources recover from low to available. It uses org.apache.hadoop.hdfs.server.namenode.FSNamesystem#leaveSafeMode, which changes BMSafeModeStatus. However, a NameNode entering resource-low safemode doesn't change BMSafeModeStatus in org.apache.hadoop.hdfs.server.namenode.FSNamesystem#enterSafeMode, so the two operations are not symmetric. So, I think StartupSafeMode should not exit when resources recover from low to available > StartupSafeMode should not exit when resources are from low to available > > > Key: HDFS-17402 > URL: https://issues.apache.org/jira/browse/HDFS-17402 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zilong Zhu >Priority: Major > > After HDFS-17231, the NameNode can exit safemode automatically when resources > recover from low to available. It uses > org.apache.hadoop.hdfs.server.namenode.FSNamesystem#leaveSafeMode, which > changes BMSafeModeStatus. However, a NameNode entering resource-low safemode > doesn't change BMSafeModeStatus in > org.apache.hadoop.hdfs.server.namenode.FSNamesystem#enterSafeMode, so the two > operations are not symmetric. > Now: > a. 
NN enters StartupSafeMode > b. NN enters ResourceLowSafeMode > c. NN resources recover from low to available > d. NN safemode off > > Expectations: > a. NN enters StartupSafeMode > b. NN enters ResourceLowSafeMode > c. NN resources recover from low to available > d. NN exits ResourceLowSafeMode but remains in StartupSafeMode
[jira] [Created] (HDFS-17402) StartupSafeMode should not exit when resources are from low to available
Zilong Zhu created HDFS-17402: - Summary: StartupSafeMode should not exit when resources are from low to available Key: HDFS-17402 URL: https://issues.apache.org/jira/browse/HDFS-17402 Project: Hadoop HDFS Issue Type: Bug Reporter: Zilong Zhu After HDFS-17231, the NameNode can exit safemode automatically when resources recover from low to available. It uses org.apache.hadoop.hdfs.server.namenode.FSNamesystem#leaveSafeMode, which changes BMSafeModeStatus. However, a NameNode entering resource-low safemode doesn't change BMSafeModeStatus in org.apache.hadoop.hdfs.server.namenode.FSNamesystem#enterSafeMode, so the two operations are not symmetric. So, I think StartupSafeMode should not exit when resources recover from low to available
[jira] [Created] (HDFS-17368) HA: Standby should exit safemode when resources are from low to available
Zilong Zhu created HDFS-17368: - Summary: HA: Standby should exit safemode when resources are from low to available Key: HDFS-17368 URL: https://issues.apache.org/jira/browse/HDFS-17368 Project: Hadoop HDFS Issue Type: Bug Reporter: Zilong Zhu The NameNodeResourceMonitor automatically enters safemode when it detects that resources are not sufficient. The NNRM runs only on the ANN. If both the ANN and the SNN enter safemode due to low resources, and later the SNN's disk space is restored, the SNN will become the ANN and the ANN will become the SNN. However, at this point, the new SNN will not exit safemode, even after its disk recovers. Consider the following scenario: * Initially, nn-1 is active and nn-2 is standby. Resources are insufficient in both nn-1's and nn-2's dfs.namenode.name.dir, so the NameNodeResourceMonitor detects the resource issue and puts nn-1 into safemode. * At this point, nn-1 is in safemode (ON) and active, while nn-2 is in safemode (OFF) and standby. * After a period of time, the resources in nn-2's dfs.namenode.name.dir recover, triggering failover. * Now, nn-1 is in safe mode (ON) and standby, while nn-2 is in safe mode (OFF) and active. * Afterward, the resources in nn-1's dfs.namenode.name.dir recover. * However, since nn-1 is standby but in safemode (ON), it is unable to exit safe mode automatically. There are two possible ways to fix this issue: # If the SNN is detected to be in safemode because of low resources, let it exit. # Or, since we already have HDFS-17231, we can revert HDFS-2914, bringing the NNRM back to the SNN.
[jira] [Commented] (HDFS-16644) java.io.IOException Invalid token in javax.security.sasl.qop
[ https://issues.apache.org/jira/browse/HDFS-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17772769#comment-17772769 ] Zilong Zhu commented on HDFS-16644: --- [~nishtha11shah] You are right. The [https://github.com/apache/hadoop/pull/5962/files] change can only help the DataNode avoid crashes, but it does not enable the 2.10 hadoop-client to successfully read/write. I believe we should change the client code on the 2.10 branch. As mentioned above, HDFS-13541 was merged into both branch-2.10 and branch-3.2 and added the "handshakeMsg" field. But HDFS-6708 and HDFS-9807 were merged into branch-3.2 only, and they added the "storageTypes" and "storageIds" fields before HDFS-13541. The 2.10 client mistakenly interprets the "storageType" as the "handshakeMsg" and in turn passes the wrong "handshakeMsg". This is where the real issue lies. > java.io.IOException Invalid token in javax.security.sasl.qop > > > Key: HDFS-16644 > URL: https://issues.apache.org/jira/browse/HDFS-16644 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.1 >Reporter: Walter Su >Priority: Major > Labels: pull-request-available > > deployment: > server side: kerberos enabled cluster with jdk 1.8 and hdfs-server 3.2.1 > client side: > I run command hadoop fs -put a test file, with kerberos ticket inited first, > and use identical core-site.xml & hdfs-site.xml configuration. > using client ver 3.2.1, it succeeds. > using client ver 2.8.5, it succeeds. > using client ver 2.10.1, it fails. 
The client side error info is: > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient: > SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = > false > 2022-06-27 01:06:15,781 ERROR > org.apache.hadoop.hdfs.server.datanode.DataNode: > DataNode{data=FSDataset{dirpath='[/mnt/disk1/hdfs, /mnt/***/hdfs, > /mnt/***/hdfs, /mnt/***/hdfs]'}, localName='emr-worker-***.***:9866', > datanodeUuid='b1c7f64a-6389-4739-bddf-***', xmitsInProgress=0}:Exception > transfering block BP-1187699012-10.-***:blk_1119803380_46080919 to mirror > 10.*:9866 > java.io.IOException: Invalid token in javax.security.sasl.qop: D > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessage(DataTransferSaslUtil.java:220) > Once any client ver 2.10.1 connect to hdfs server, the DataNode no longer > accepts any client connection, even client ver 3.2.1 cannot connects to hdfs > server. The DataNode rejects any client connection. For a short time, all > DataNodes rejects client connections. > The problem exists even if I replace DataNode with ver 3.3.0 or replace java > with jdk 11. > The problem is fixed if I replace DataNode with ver 3.2.0. I guess the > problem is related to HDFS-13541
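The field skew described in the comments on this issue can be shown with plain java.io streams: a writer that appends new fields before handshakeMsg, and a reader that expects handshakeMsg immediately after the access modes. This is a simplified sketch (single bytes instead of Hadoop's vints; not the real BlockTokenIdentifier code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class TokenSkewDemo {
    // "New" writer layout: access modes, then storageTypes (a field the old
    // reader doesn't know about), then handshakeMsg.
    static byte[] writeNewLayout() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeByte(1);                 // number of access modes (simplified, not a real vint)
        out.writeByte(0);                 // mode ordinal
        out.writeByte(1);                 // storageTypes.length  <-- new field
        out.writeByte(2);                 // storage type ordinal <-- new field
        byte[] handshake = {'H', 'I'};
        out.writeByte(handshake.length);  // the real handshakeMsg length
        out.write(handshake);
        return buf.toByteArray();
    }

    // "Old" reader: expects handshakeMsg right after the modes, so it parses
    // the storage fields as the handshake message.
    static int oldReaderFirstHandshakeByte(byte[] wire) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
        int modes = in.readByte();
        for (int i = 0; i < modes; i++) in.readByte();
        int msgLen = in.readByte();       // actually storageTypes.length!
        byte[] msg = new byte[msgLen];
        in.readFully(msg);                // reads the storage type ordinal as the message
        return msg[0];
    }

    public static void main(String[] args) throws IOException {
        // The "handshake" the old client sees is the storage type ordinal (2),
        // not 'H' -- the same confusion that yields a garbage QOP value.
        System.out.println(oldReaderFirstHandshakeByte(writeNewLayout())); // prints "2"
    }
}
```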
[jira] [Comment Edited] (HDFS-16644) java.io.IOException Invalid token in javax.security.sasl.qop
[ https://issues.apache.org/jira/browse/HDFS-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754437#comment-17754437 ] Zilong Zhu edited comment on HDFS-16644 at 8/15/23 12:54 PM: - At the same time, we also identified another issue related to this. It looks like the META-INF services file for the BlockTokenIdentifier is in hadoop-hdfs.jar rather than hadoop-hdfs-client.jar. This prevents a client from decoding the identifier because the service loader doesn't find the BlockTokenIdentifier class. This will result in HDFS-13541 not functioning properly on branch-2.10 as well. I created HDFS-17159 to track it. was (Author: JIRAUSER287487): At the same time, we also identified another issue related to this. It looks like the META-INF services file for the BlockTokenIdentifier is in hadoop-hdfs.jar rather than hadoop-hdfs-client.jar. This prevents a client from decoding the identifier because the service loader doesn't find the BlockTokenIdentifier class. This will result in HDFS-13541 not functioning properly on branch-2.10 as well. I created HDFS-17159 to track it. > java.io.IOException Invalid token in javax.security.sasl.qop > > > Key: HDFS-16644 > URL: https://issues.apache.org/jira/browse/HDFS-16644 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.1 >Reporter: Walter Su >Priority: Major > > deployment: > server side: kerberos enabled cluster with jdk 1.8 and hdfs-server 3.2.1 > client side: > I run command hadoop fs -put a test file, with kerberos ticket inited first, > and use identical core-site.xml & hdfs-site.xml configuration. > using client ver 3.2.1, it succeeds. > using client ver 2.8.5, it succeeds. > using client ver 2.10.1, it fails. 
The client side error info is: > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient: > SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = > false > 2022-06-27 01:06:15,781 ERROR > org.apache.hadoop.hdfs.server.datanode.DataNode: > DataNode{data=FSDataset{dirpath='[/mnt/disk1/hdfs, /mnt/***/hdfs, > /mnt/***/hdfs, /mnt/***/hdfs]'}, localName='emr-worker-***.***:9866', > datanodeUuid='b1c7f64a-6389-4739-bddf-***', xmitsInProgress=0}:Exception > transfering block BP-1187699012-10.-***:blk_1119803380_46080919 to mirror > 10.*:9866 > java.io.IOException: Invalid token in javax.security.sasl.qop: D > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessage(DataTransferSaslUtil.java:220) > Once any client ver 2.10.1 connect to hdfs server, the DataNode no longer > accepts any client connection, even client ver 3.2.1 cannot connects to hdfs > server. The DataNode rejects any client connection. For a short time, all > DataNodes rejects client connections. > The problem exists even if I replace DataNode with ver 3.3.0 or replace java > with jdk 11. > The problem is fixed if I replace DataNode with ver 3.2.0. I guess the > problem is related to HDFS-13541
[jira] [Comment Edited] (HDFS-16644) java.io.IOException Invalid token in javax.security.sasl.qop
[ https://issues.apache.org/jira/browse/HDFS-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754420#comment-17754420 ] Zilong Zhu edited comment on HDFS-16644 at 8/15/23 12:54 PM: - We've also encountered this issue. Our NN and DN are Hadoop 3.2.4, and the client version is 2.10.1. For the same code segment, if only "hadoop-client" is included in the pom.xml, it works fine. However, if both "hadoop-client" and "hadoop-hdfs" are included, issues arise. We believe this issue is related to class loading and protocols. It leads to the generation of an abnormal QOP value (e.g. D). The key to this issue lies in the handling of the accessToken's BlockTokenIdentifier. The NN (3.2.4) serializes and sends the accessToken to the client (2.10.1). The client (2.10.1) deserializes the accessToken (3.2.4). At this point, some fields have changed. For BlockTokenIdentifier (3.2.4), org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier#writeLegacy {code:java} void writeLegacy(DataOutput out) throws IOException { WritableUtils.writeVLong(out, expiryDate); WritableUtils.writeVInt(out, keyId); WritableUtils.writeString(out, userId); WritableUtils.writeString(out, blockPoolId); WritableUtils.writeVLong(out, blockId); WritableUtils.writeVInt(out, modes.size()); for (AccessMode aMode : modes) { WritableUtils.writeEnum(out, aMode); } if (storageTypes != null) { /* <-- new field */ WritableUtils.writeVInt(out, storageTypes.length); for (StorageType type : storageTypes) { WritableUtils.writeEnum(out, type); } } if (storageIds != null) { /* <-- new field */ WritableUtils.writeVInt(out, storageIds.length); for (String id : storageIds) { WritableUtils.writeString(out, id); } } if (handshakeMsg != null && handshakeMsg.length > 0) { WritableUtils.writeVInt(out, handshakeMsg.length); out.write(handshakeMsg); } }{code} For BlockTokenIdentifier (2.10.1), org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier#readFields {code:java} public void readFields(DataInput in) throws IOException { 
this.cache = null; if (in instanceof DataInputStream) { final DataInputStream dis = (DataInputStream) in; // this.cache should be assigned the raw bytes from the input data for // upgrading compatibility. If we won't mutate fields and call getBytes() // for something (e.g retrieve password), we should return the raw bytes // instead of serializing the instance self fields to bytes, because we // may lose newly added fields which we can't recognize. this.cache = IOUtils.readFullyToByteArray(dis); dis.reset(); } expiryDate = WritableUtils.readVLong(in); keyId = WritableUtils.readVInt(in); userId = WritableUtils.readString(in); blockPoolId = WritableUtils.readString(in); blockId = WritableUtils.readVLong(in); int length = WritableUtils.readVIntInRange(in, 0, AccessMode.class.getEnumConstants().length); for (int i = 0; i < length; i++) { modes.add(WritableUtils.readEnum(in, AccessMode.class)); } try { int handshakeMsgLen = WritableUtils.readVInt(in); if (handshakeMsgLen != 0) { handshakeMsg = new byte[handshakeMsgLen]; in.readFully(handshakeMsg); } } catch (EOFException eof) { } } {code} So when the client (2.10.1) deserializes the handshakeMsg, it mistakenly deserializes the storageType instead of the handshakeMsg. HDFS-13541 was merged into both branch-2.10 and branch-3.2 and added the "handshakeMsg" field. But HDFS-6708 and HDFS-9807 were merged into branch-3.2 only, and they added the "storageTypes" and "storageIds" fields before HDFS-13541. This is where the real issue lies. I want to fix this issue. Any comments and suggestions would be appreciated. was (Author: JIRAUSER287487): We've also encountered this issue. Our NN and DN are Hadoop 3.2.4, and the client version is 2.10.1. For the same code segment, if only "hadoop-client" is included in the pom.xml, it works fine. However, if both "hadoop-client" and "hadoop-hdfs" are included, issues arise. We believe this issue is related to class loading and protocols. 
It leads to the generation of an abnormal QOP value (e.g. D). The key to this issue lies in the handling of the accessToken's BlockTokenIdentifier. The NN (3.2.4) serializes and sends the accessToken to the client (2.10.1). The client (2.10.1) deserializes the accessToken (3.2.4). At this point, some fields have changed. For BlockTokenIdentifier (3.2.4), org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier#writeLegacy {code:java} void writeLegacy(DataOutput out) throws IOException { WritableUtils.writeVLong(out, expiryDate); WritableUtils.writeVInt(out, keyId); WritableUtils.writeString(out, userId); WritableUtils.writeString(out, blockPoolId); WritableUtils.writeVLong(out, blockId); WritableUtils.writeVInt(out,
[jira] [Comment Edited] (HDFS-16644) java.io.IOException Invalid token in javax.security.sasl.qop
[ https://issues.apache.org/jira/browse/HDFS-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754437#comment-17754437 ] Zilong Zhu edited comment on HDFS-16644 at 8/15/23 7:50 AM: At the same time, we also identified another issue related to this. It looks like the META-INF services file for the BlockTokenIdentifier is in hadoop-hdfs.jar rather than hadoop-hdfs-client.jar. This prevents a client from decoding the identifier because the service loader doesn't find the BlockTokenIdentifier class. This will result in HDFS-13541 not functioning properly on branch-2.10 as well. I created HDFS-17159 to track it. was (Author: JIRAUSER287487): At the same time, we also identified another issue related to this. It looks like the META-INF services file for the BlockTokenIdentifier is in hadoop-hdfs.jar rather than hadoop-hdfs-client.jar. This prevents a client from decoding the identifier because the service loader doesn't find the BlockTokenIdentifier class. This will result in HDFS-13541 not functioning properly on branch-2.10 as well. > java.io.IOException Invalid token in javax.security.sasl.qop > > > Key: HDFS-16644 > URL: https://issues.apache.org/jira/browse/HDFS-16644 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.1 >Reporter: Walter Su >Priority: Major > > deployment: > server side: kerberos enabled cluster with jdk 1.8 and hdfs-server 3.2.1 > client side: > I run command hadoop fs -put a test file, with kerberos ticket inited first, > and use identical core-site.xml & hdfs-site.xml configuration. > using client ver 3.2.1, it succeeds. > using client ver 2.8.5, it succeeds. > using client ver 2.10.1, it fails. 
The client side error info is: > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient: > SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = > false > 2022-06-27 01:06:15,781 ERROR > org.apache.hadoop.hdfs.server.datanode.DataNode: > DataNode{data=FSDataset{dirpath='[/mnt/disk1/hdfs, /mnt/***/hdfs, > /mnt/***/hdfs, /mnt/***/hdfs]'}, localName='emr-worker-***.***:9866', > datanodeUuid='b1c7f64a-6389-4739-bddf-***', xmitsInProgress=0}:Exception > transfering block BP-1187699012-10.-***:blk_1119803380_46080919 to mirror > 10.*:9866 > java.io.IOException: Invalid token in javax.security.sasl.qop: D > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessage(DataTransferSaslUtil.java:220) > Once any client ver 2.10.1 connect to hdfs server, the DataNode no longer > accepts any client connection, even client ver 3.2.1 cannot connects to hdfs > server. The DataNode rejects any client connection. For a short time, all > DataNodes rejects client connections. > The problem exists even if I replace DataNode with ver 3.3.0 or replace java > with jdk 11. > The problem is fixed if I replace DataNode with ver 3.2.0. I guess the > problem is related to HDFS-13541
[jira] [Created] (HDFS-17159) Can't decode Identifier HDFS tokens with only the hdfs client jar
Zilong Zhu created HDFS-17159: - Summary: Can't decode Identifier HDFS tokens with only the hdfs client jar Key: HDFS-17159 URL: https://issues.apache.org/jira/browse/HDFS-17159 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.10.1 Reporter: Zilong Zhu It looks like the META-INF services file for the BlockTokenIdentifier is in hadoop-hdfs.jar rather than hadoop-hdfs-client.jar. This prevents a client from decoding the identifier because the service loader doesn't find the BlockTokenIdentifier class. This will result in HDFS-13541 not functioning properly on branch-2.10 as well.
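The failure mode here is a ServiceLoader lookup that finds no provider because the META-INF/services descriptor lives in a jar the client doesn't ship. A small sketch of that behavior using a stand-in interface (Runnable has no registered providers on a plain JDK classpath, mirroring a client that lacks the descriptor for BlockTokenIdentifier):

```java
import java.util.ServiceLoader;

public class ServiceLoaderDemo {
    public static void main(String[] args) {
        // With no META-INF/services/java.lang.Runnable descriptor on the
        // classpath, the loader yields nothing -- the analogue of the client
        // failing to locate BlockTokenIdentifier when only the client jar
        // (without the descriptor file) is present.
        ServiceLoader<Runnable> loader = ServiceLoader.load(Runnable.class);
        boolean found = loader.iterator().hasNext();
        System.out.println(found); // prints "false"
    }
}
```

The fix direction implied by the report is packaging, not code: ship the descriptor in the same jar as the classes a client actually depends on.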
[jira] [Commented] (HDFS-16644) java.io.IOException Invalid token in javax.security.sasl.qop
[ https://issues.apache.org/jira/browse/HDFS-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754437#comment-17754437 ] Zilong Zhu commented on HDFS-16644: --- At the same time, we also identified another issue related to this. It looks like the META-INF services file for the BlockTokenIdentifier is in hadoop-hdfs.jar rather than hadoop-hdfs-client.jar. This prevents a client from decoding the identifier because the service loader doesn't find the BlockTokenIdentifier class. This will result in HDFS-13541 not functioning properly on branch-2.10 as well. > java.io.IOException Invalid token in javax.security.sasl.qop > > > Key: HDFS-16644 > URL: https://issues.apache.org/jira/browse/HDFS-16644 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.1 >Reporter: Walter Su >Priority: Major > > deployment: > server side: kerberos enabled cluster with jdk 1.8 and hdfs-server 3.2.1 > client side: > I run command hadoop fs -put a test file, with kerberos ticket inited first, > and use identical core-site.xml & hdfs-site.xml configuration. > using client ver 3.2.1, it succeeds. > using client ver 2.8.5, it succeeds. > using client ver 2.10.1, it fails. 
The client side error info is: > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient: > SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = > false > 2022-06-27 01:06:15,781 ERROR > org.apache.hadoop.hdfs.server.datanode.DataNode: > DataNode{data=FSDataset{dirpath='[/mnt/disk1/hdfs, /mnt/***/hdfs, > /mnt/***/hdfs, /mnt/***/hdfs]'}, localName='emr-worker-***.***:9866', > datanodeUuid='b1c7f64a-6389-4739-bddf-***', xmitsInProgress=0}:Exception > transfering block BP-1187699012-10.-***:blk_1119803380_46080919 to mirror > 10.*:9866 > java.io.IOException: Invalid token in javax.security.sasl.qop: D > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessage(DataTransferSaslUtil.java:220) > Once any client ver 2.10.1 connect to hdfs server, the DataNode no longer > accepts any client connection, even client ver 3.2.1 cannot connects to hdfs > server. The DataNode rejects any client connection. For a short time, all > DataNodes rejects client connections. > The problem exists even if I replace DataNode with ver 3.3.0 or replace java > with jdk 11. > The problem is fixed if I replace DataNode with ver 3.2.0. I guess the > problem is related to HDFS-13541
[jira] [Comment Edited] (HDFS-16644) java.io.IOException Invalid token in javax.security.sasl.qop
[ https://issues.apache.org/jira/browse/HDFS-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754420#comment-17754420 ] Zilong Zhu edited comment on HDFS-16644 at 8/15/23 6:03 AM: We've also encountered this issue. Our NN and DN are on Hadoop 3.2.4, and the client is on 2.10.1. For the same code segment, if only "hadoop-client" is included in the pom.xml, it works fine. However, if both "hadoop-client" and "hadoop-hdfs" are included, issues arise. We believe this issue is related to class loading and wire-format skew. It leads to an abnormal QOP value (e.g. "D"). The key to this issue lies in the handling of the accessToken's BlockTokenIdentifier. The NN (3.2.4) serializes the accessToken and sends it to the client (2.10.1); when the client deserializes it, some fields have changed between the versions.

For BlockTokenIdentifier (3.2.4), org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier#writeLegacy:
{code:java}
void writeLegacy(DataOutput out) throws IOException {
  WritableUtils.writeVLong(out, expiryDate);
  WritableUtils.writeVInt(out, keyId);
  WritableUtils.writeString(out, userId);
  WritableUtils.writeString(out, blockPoolId);
  WritableUtils.writeVLong(out, blockId);
  WritableUtils.writeVInt(out, modes.size());
  for (AccessMode aMode : modes) {
    WritableUtils.writeEnum(out, aMode);
  }
  if (storageTypes != null) {                 // <-- new field
    WritableUtils.writeVInt(out, storageTypes.length);
    for (StorageType type : storageTypes) {
      WritableUtils.writeEnum(out, type);
    }
  }
  if (storageIds != null) {                   // <-- new field
    WritableUtils.writeVInt(out, storageIds.length);
    for (String id : storageIds) {
      WritableUtils.writeString(out, id);
    }
  }
  if (handshakeMsg != null && handshakeMsg.length > 0) {
    WritableUtils.writeVInt(out, handshakeMsg.length);
    out.write(handshakeMsg);
  }
}{code}
For BlockTokenIdentifier (2.10.1), org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier#readFields:
{code:java}
public void readFields(DataInput in) throws IOException {
  this.cache = null;
  if (in instanceof DataInputStream) {
    final DataInputStream dis = (DataInputStream) in;
    // this.cache should be assigned the raw bytes from the input data for
    // upgrading compatibility. If we won't mutate fields and call getBytes()
    // for something (e.g. retrieve password), we should return the raw bytes
    // instead of serializing the instance self fields to bytes, because we
    // may lose newly added fields which we can't recognize.
    this.cache = IOUtils.readFullyToByteArray(dis);
    dis.reset();
  }
  expiryDate = WritableUtils.readVLong(in);
  keyId = WritableUtils.readVInt(in);
  userId = WritableUtils.readString(in);
  blockPoolId = WritableUtils.readString(in);
  blockId = WritableUtils.readVLong(in);
  int length = WritableUtils.readVIntInRange(in, 0,
      AccessMode.class.getEnumConstants().length);
  for (int i = 0; i < length; i++) {
    modes.add(WritableUtils.readEnum(in, AccessMode.class));
  }
  try {
    int handshakeMsgLen = WritableUtils.readVInt(in);
    if (handshakeMsgLen != 0) {
      handshakeMsg = new byte[handshakeMsgLen];
      in.readFully(handshakeMsg);
    }
  } catch (EOFException eof) {
  }
}{code}
So when the client (2.10.1) tries to deserialize the handshakeMsg, an error occurs: it mistakenly deserializes the storageTypes field as the handshakeMsg. HDFS-13541 was merged into both branch-2.10 and branch-3.2 and added the "handshakeMsg" field, but HDFS-6708 and HDFS-9807 were merged into branch-3.2 only, adding the "storageTypes" and "storageIds" fields ahead of the one added by HDFS-13541. This is where the real issue lies. I want to fix this issue. Any comments and suggestions would be appreciated.
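The field-order skew described in the comment above can be reproduced outside Hadoop. The sketch below is a simplification, not the real wire format: it uses fixed-width `DataOutputStream` primitives instead of Hadoop's `WritableUtils` varints, and invented sample values. A "new" writer emits the 3.2.4 legacy layout including storageTypes; an "old" reader follows the 2.10.1 layout, so the first value after the access modes, which it expects to be handshakeMsgLen, is really the storageTypes array length:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class TokenSkewDemo {
    // Sketch of the 3.2.4 writeLegacy field order (sample values, fixed-width ints).
    static byte[] writeNew() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeLong(1234L);   // expiryDate
        out.writeInt(7);        // keyId
        out.writeUTF("user");   // userId
        out.writeUTF("bp-1");   // blockPoolId
        out.writeLong(42L);     // blockId
        out.writeInt(1);        // modes.size()
        out.writeInt(0);        // AccessMode ordinal
        out.writeInt(2);        // storageTypes.length  <- field unknown to 2.10.1
        out.writeInt(1);        // StorageType ordinal
        out.writeInt(1);        // StorageType ordinal
        // handshakeMsg is null, so nothing further is written
        return bos.toByteArray();
    }

    // Sketch of the 2.10.1 readFields order: right after the access modes it
    // expects handshakeMsgLen, so it consumes the storageTypes length instead.
    static int readOldHandshakeLen(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        in.readLong();          // expiryDate
        in.readInt();           // keyId
        in.readUTF();           // userId
        in.readUTF();           // blockPoolId
        in.readLong();          // blockId
        int modes = in.readInt();
        for (int i = 0; i < modes; i++) in.readInt();  // access modes
        return in.readInt();    // believed to be handshakeMsgLen; really storageTypes.length
    }

    public static void main(String[] args) throws IOException {
        int len = readOldHandshakeLen(writeNew());
        // The old reader now thinks a 2-byte handshake message follows; the bytes
        // it would read are really StorageType ordinals -- garbage that downstream
        // code (e.g. the SASL QOP check) then misinterprets.
        System.out.println("misread handshakeMsgLen = " + len);  // prints 2
    }
}
```

The takeaway matches the comment: appending optional fields to a length-unprefixed legacy encoding is only safe if every reader that might see the bytes already knows to skip them.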