[jira] [Comment Edited] (HDFS-16644) java.io.IOException Invalid token in javax.security.sasl.qop
[ https://issues.apache.org/jira/browse/HDFS-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683033#comment-17683033 ]

Yiqun Lin edited comment on HDFS-16644 at 2/1/23 2:19 PM:
--

We also hit this issue in our Hadoop 3 cluster. Our DN server runs Hadoop 3.3, and the client version is 2.10.2.

We found that there is a chance that an abnormal QOP value (e.g. DI) can be passed in and overwrite the DataNode SASL props. In the default case (the HDFS-13541 feature not enabled), the secret should not be passed here, so there may be a bug in the 2.10.2 client that still passes the secret. [~vagarychen], could you please check this code on branch-2.10? It is very dangerous: once the DN SASL props are overwritten with an invalid value, all data reads/writes can be impacted. And there is no validation check on the QOP value here.

SaslDataTransferServer#doSaslHandshake
{noformat}
  private IOStreamPair doSaslHandshake(Peer peer, OutputStream underlyingOut,
      InputStream underlyingIn, Map<String, String> saslProps,
      CallbackHandler callbackHandler) throws IOException {

    DataInputStream in = new DataInputStream(underlyingIn);
    DataOutputStream out = new DataOutputStream(underlyingOut);

    int magicNumber = in.readInt();
    if (magicNumber != SASL_TRANSFER_MAGIC_NUMBER) {
      throw new InvalidMagicNumberException(magicNumber,
          dnConf.getEncryptDataTransfer());
    }
    try {
      // step 1
      SaslMessageWithHandshake message = readSaslMessageWithHandshakeSecret(in);
      byte[] secret = message.getSecret();
      String bpid = message.getBpid();
      if (secret != null || bpid != null) {
        // sanity check, if one is null, the other must also not be null
        assert(secret != null && bpid != null);
        String qop = new String(secret, Charsets.UTF_8);
        saslProps.put(Sasl.QOP, qop);  // <= any QOP value can be set here
      }
      ...
{noformat}
[jira] [Commented] (HDFS-16644) java.io.IOException Invalid token in javax.security.sasl.qop
[ https://issues.apache.org/jira/browse/HDFS-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683033#comment-17683033 ]

Yiqun Lin commented on HDFS-16644:
--

We also hit this issue in our Hadoop 3 cluster. Our DN server runs Hadoop 3.3, and the client version is 2.10.2.

We found that there is a chance that an abnormal QOP value (e.g. DI) can be passed in and overwrite the DataNode SASL props. In the default case (the HDFS-13541 feature not enabled), the secret should not be passed here, so there may be a bug in the 2.10.2 client that still passes the secret. [~vagarychen], could you please check this code on branch-2.10? It is very dangerous: once the DN SASL props are overwritten with an invalid value, all data reads/writes can be impacted. And there is no validation check on the QOP value here.

SaslDataTransferServer#doSaslHandshake
{noformat}
  private IOStreamPair doSaslHandshake(Peer peer, OutputStream underlyingOut,
      InputStream underlyingIn, Map<String, String> saslProps,
      CallbackHandler callbackHandler) throws IOException {

    DataInputStream in = new DataInputStream(underlyingIn);
    DataOutputStream out = new DataOutputStream(underlyingOut);

    int magicNumber = in.readInt();
    if (magicNumber != SASL_TRANSFER_MAGIC_NUMBER) {
      throw new InvalidMagicNumberException(magicNumber,
          dnConf.getEncryptDataTransfer());
    }
    try {
      // step 1
      SaslMessageWithHandshake message = readSaslMessageWithHandshakeSecret(in);
      byte[] secret = message.getSecret();
      String bpid = message.getBpid();
      if (secret != null || bpid != null) {
        // sanity check, if one is null, the other must also not be null
        assert(secret != null && bpid != null);
        String qop = new String(secret, Charsets.UTF_8);
        saslProps.put(Sasl.QOP, qop);  // <= any QOP value can be set here
      }
      ...
{noformat}

> java.io.IOException Invalid token in javax.security.sasl.qop
> ---
>
> Key: HDFS-16644
> URL: https://issues.apache.org/jira/browse/HDFS-16644
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.2.1
> Reporter: Walter Su
> Priority: Major
>
> deployment:
> server side: kerberos enabled cluster with jdk 1.8 and hdfs-server 3.2.1
> client side:
> I run command hadoop fs -put a test file, with kerberos ticket inited first, and use identical core-site.xml & hdfs-site.xml configuration.
> using client ver 3.2.1, it succeeds.
> using client ver 2.8.5, it succeeds.
> using client ver 2.10.1, it fails. The client side error info is:
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
> 2022-06-27 01:06:15,781 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode{data=FSDataset{dirpath='[/mnt/disk1/hdfs, /mnt/***/hdfs, /mnt/***/hdfs, /mnt/***/hdfs]'}, localName='emr-worker-***.***:9866', datanodeUuid='b1c7f64a-6389-4739-bddf-***', xmitsInProgress=0}:Exception transfering block BP-1187699012-10.-***:blk_1119803380_46080919 to mirror 10.*:9866
> java.io.IOException: Invalid token in javax.security.sasl.qop: D
>     at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessage(DataTransferSaslUtil.java:220)
> Once any client ver 2.10.1 connects to the hdfs server, the DataNode no longer accepts any client connection; even client ver 3.2.1 cannot connect to the hdfs server. The DataNode rejects any client connection. Within a short time, all DataNodes reject client connections.
> The problem exists even if I replace DataNode with ver 3.3.0 or replace java with jdk 11.
> The problem is fixed if I replace DataNode with ver 3.2.0. I guess the problem is related to HDFS-13541

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
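Since the handshake above applies whatever string arrives in the secret field as the QOP, a defensive check could look like the sketch below. This is a hypothetical helper, not actual HDFS code; the three accepted tokens are the values defined for javax.security.sasl.Sasl.QOP (auth, auth-int, auth-conf).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical validator: accept only the QOP tokens defined by SASL
// (javax.security.sasl.Sasl.QOP) before overwriting the server's SASL props.
public class QopValidator {
    private static final List<String> VALID_QOPS =
        Arrays.asList("auth", "auth-int", "auth-conf");

    public static boolean isValidQop(String qop) {
        // Sasl.QOP may also be a comma-separated preference list;
        // every element must be one of the known tokens.
        for (String token : qop.split(",")) {
            if (!VALID_QOPS.contains(token.trim())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidQop("auth-conf"));      // true
        System.out.println(isValidQop("auth,auth-int"));  // true
        System.out.println(isValidQop("D"));              // false: the bad value from the report
    }
}
```

With a guard like this, an unexpected value such as "D" would be rejected at handshake time instead of poisoning the DataNode's SASL props for all later connections.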
[jira] [Commented] (HDFS-15486) Costly sendResponse operation slows down async editlog handling
[ https://issues.apache.org/jira/browse/HDFS-15486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423086#comment-17423086 ]

Yiqun Lin commented on HDFS-15486:
--

Hi [~functioner],
{quote}Yiqun Lin I reported a similar issue in HDFS-15869 and I had a github pull request there. You can take a look at whether that works, and whether we should resolve that Jira issue and this Jira issue together.
{quote}
I'm afraid I don't have enough time to review that patch at the moment, sorry about that.

> Costly sendResponse operation slows down async editlog handling
> ---
>
> Key: HDFS-15486
> URL: https://issues.apache.org/jira/browse/HDFS-15486
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Yiqun Lin
> Priority: Major
> Attachments: Async-profile-(2).jpg, HDFS-15486_draft.patch, async-profile-(1).jpg
>
> When our cluster NameNode is under very high load, we find it often gets stuck in async-editlog handling.
> We used the async-profiler tool to get the flamegraph.
> !Async-profile-(2).jpg!
> This happens when the async editlog thread consumes an Edit from the queue and triggers the sendResponse call.
> But the sendResponse call is somewhat expensive here, since our cluster enables the security env and does some encode operations when returning the response.
> We often catch moments of costly sendResponse operations when the RPC call queue is full.
> !async-profile-(1).jpg!
> Slowness in consuming Edits in the async editlog makes the Edit pending queue easily fill up, which then blocks its enqueue operation, invoked in write-lock-protected methods in the FSNamesystem class.
> The enhancement here is to use multiple threads to execute the sendResponse call in parallel. sendResponse doesn't need the write lock for protection, so this change is safe.
[jira] [Commented] (HDFS-15486) Costly sendResponse operation slows down async editlog handling
[ https://issues.apache.org/jira/browse/HDFS-15486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423081#comment-17423081 ]

Yiqun Lin commented on HDFS-15486:
--

Some notes on the draft patch above:
* It introduces a switch setting to enable the async response handling.
* The patch is based on branch-2.7, not the latest trunk.
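The parallel-sendResponse idea described in this issue — handing the expensive, encryption-heavy response encoding to a pool of worker threads so the single async-editlog consumer thread is not blocked — can be sketched roughly as follows. The class and method names here are illustrative, not the actual patch:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Rough sketch: the async-editlog thread hands completed calls to a small
// pool, so the costly response encoding no longer serializes edit consumption.
public class ResponseSender {
    private final ExecutorService pool;
    private final AtomicInteger sent = new AtomicInteger();

    public ResponseSender(int numThreads) {
        this.pool = Executors.newFixedThreadPool(numThreads);
    }

    // Called by the editlog-sync thread; returns immediately.
    public void sendResponseAsync(Runnable sendResponse) {
        pool.execute(() -> {
            sendResponse.run();      // expensive encode + socket write
            sent.incrementAndGet();
        });
    }

    // Drains the pool and reports how many responses were sent.
    public int shutdownAndCount() {
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sent.get();
    }

    public static void main(String[] args) {
        ResponseSender sender = new ResponseSender(4);
        for (int i = 0; i < 100; i++) {
            sender.sendResponseAsync(() -> { /* encode + respond */ });
        }
        System.out.println(sender.shutdownAndCount()); // 100
    }
}
```

As the issue notes, this is safe only because sendResponse does not require the FSNamesystem write lock; any work that does need the lock must stay on the consumer thread.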
[jira] [Updated] (HDFS-15486) Costly sendResponse operation slows down async editlog handling
[ https://issues.apache.org/jira/browse/HDFS-15486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yiqun Lin updated HDFS-15486:
-
Attachment: HDFS-15486_draft.patch
[jira] [Commented] (HDFS-15486) Costly sendResponse operation slows down async editlog handling
[ https://issues.apache.org/jira/browse/HDFS-15486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423078#comment-17423078 ]

Yiqun Lin commented on HDFS-15486:
--

[~yuanbo] asked offline whether there is a patch for this JIRA. I have attached a draft patch; it is already applied in our internal Hadoop version and has run well in our production environment. It can increase the RPC throughput of the NameNode.
[jira] [Comment Edited] (HDFS-15660) StorageTypeProto is not compatiable between 3.x and 2.6
[ https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307712#comment-17307712 ]

Yiqun Lin edited comment on HDFS-15660 at 3/24/21, 9:48 AM:
--

Hi [~weichiu], this compatibility issue only happens when an old Hadoop version client doesn't contain the storage type introduced in HDFS-9806. It's a client-side issue, not a server-side one. Since versions 3.1, 3.2 and 3.3 already contain the new storage type, it should be okay to do the upgrade. So I don't cherry-pick to other branches.

was (Author: linyiqun):
Hi [~weichiu], this compatible issue only happened in that old hadoop version client doesn't contain the storage type which introduced in HDFS-9806. It's a client side issue not the server side. As version 3.1, 3.2 and 3.3 already contain the new storage type, it should be okay to do the upgrade. So I only push the fix to trunk.
[jira] [Commented] (HDFS-15660) StorageTypeProto is not compatiable between 3.x and 2.6
[ https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307712#comment-17307712 ]

Yiqun Lin commented on HDFS-15660:
--

Hi [~weichiu], this compatibility issue only happens when an old Hadoop version client doesn't contain the storage type introduced in HDFS-9806. It's a client-side issue, not a server-side one. Since versions 3.1, 3.2 and 3.3 already contain the new storage type, it should be okay to do the upgrade.

> StorageTypeProto is not compatiable between 3.x and 2.6
> ---
>
> Key: HDFS-15660
> URL: https://issues.apache.org/jira/browse/HDFS-15660
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.0.0, 3.0.1, 2.9.2, 2.8.5, 2.7.7, 2.10.1
> Reporter: Ryan Wu
> Assignee: Ryan Wu
> Priority: Major
> Fix For: 2.9.3, 3.4.0, 2.10.2
>
> Attachments: HDFS-15660.002.patch, HDFS-15660.003.patch
>
> In our case, when nn had upgraded to 3.1.3 and dn's version was still 2.6, we found hive calling the getContentSummary method; the client and server were not compatible because hadoop3 added the new PROVIDED storage type.
> {code:java}
> // code placeholder
> 20/04/15 14:28:35 INFO retry.RetryInvocationHandler---main: Exception while invoking getContentSummary of class ClientNamenodeProtocolTranslatorPB over x/x:8020. Trying to fail over immediately.
> java.io.IOException: com.google.protobuf.ServiceException: com.google.protobuf.UninitializedMessageException: Message missing required fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>     at org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:819)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>     at com.sun.proxy.$Proxy11.getContentSummary(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSClient.getContentSummary(DFSClient.java:3144)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:706)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:702)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getContentSummary(DistributedFileSystem.java:713)
>     at org.apache.hadoop.fs.shell.Count.processPath(Count.java:109)
>     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
>     at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
>     at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>     at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
>     at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>     at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>     at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
> Caused by: com.google.protobuf.ServiceException: com.google.protobuf.UninitializedMessageException: Message missing required fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:272)
>     at com.sun.proxy.$Proxy10.getContentSummary(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:816)
>     ... 23 more
> Caused by: com.google.protobuf.UninitializedMessageException: Message missing required fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>     at com.google.protobuf.AbstractMessage$Builder.newUninitializedMessageException(AbstractMessage.java:770)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetContentSummaryResponseProto$Builder.build(ClientNamenodeProtocolProtos.java:65392)
>     at
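The failure mode above — a reply carrying a storage type the old client's protobuf schema doesn't know, in a required field, so the whole message fails to parse — can be illustrated with plain Java enums. This is not the actual HDFS-15660 fix (which, per the comment, lands on the client side); it just sketches one general mitigation, filtering a reply to types an older peer understands. The class, enum, and set of "old client" types are all illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Illustrative model of the incompatibility (plain enums, not the real
// protobuf classes): a 2.6-era client only knows the first four storage
// types, so a required field carrying PROVIDED fails to decode on its side.
// A version-aware sender could filter the reply down to types the peer knows.
public class StorageTypeCompat {
    enum StorageType { DISK, SSD, ARCHIVE, RAM_DISK, PROVIDED }

    // Types a hypothetical pre-HDFS-9806 client can decode.
    static final Set<StorageType> OLD_CLIENT_TYPES =
        EnumSet.of(StorageType.DISK, StorageType.SSD,
                   StorageType.ARCHIVE, StorageType.RAM_DISK);

    static List<StorageType> filterForOldClient(List<StorageType> quotaTypes) {
        List<StorageType> out = new ArrayList<>();
        for (StorageType t : quotaTypes) {
            if (OLD_CLIENT_TYPES.contains(t)) {
                out.add(t);  // drop entries the old client would reject
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(filterForOldClient(
            List.of(StorageType.DISK, StorageType.PROVIDED)));  // [DISK]
    }
}
```

The underlying lesson is a proto2 one: required fields make it impossible for an old decoder to skip a value it doesn't recognize, which is why later schema evolution guidance favors optional fields.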
[jira] [Commented] (HDFS-14558) RBF: Isolation/Fairness documentation
[ https://issues.apache.org/jira/browse/HDFS-14558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263438#comment-17263438 ]

Yiqun Lin commented on HDFS-14558:
--

LGTM, +1.

> RBF: Isolation/Fairness documentation
> ---
>
> Key: HDFS-14558
> URL: https://issues.apache.org/jira/browse/HDFS-14558
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: CR Hota
> Assignee: Fengnan Li
> Priority: Major
> Attachments: HDFS-14558.001.patch, HDFS-14558.002.patch, HDFS-14558.003.patch
>
> Documentation is needed to make users aware of this feature HDFS-14090.
[jira] [Updated] (HDFS-14558) RBF: Isolation/Fairness documentation
[ https://issues.apache.org/jira/browse/HDFS-14558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yiqun Lin updated HDFS-14558:
-
Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Resolution: Fixed
Status: Resolved (was: Patch Available)

Committed this to trunk. Thanks [~fengnanli] for the contribution.
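For reference, the per-nameservice handler-count property discussed in this documentation change is set in hdfs-site.xml. The fragment below is illustrative only: the `EXAMPLENAMESERVICE` suffix and the value are placeholders, and the description text is taken from the review comment in this thread.

```xml
<!-- Illustrative hdfs-site.xml fragment for RBF fairness (HDFS-14090).
     Replace EXAMPLENAMESERVICE with a real nameservice ID; the value
     is an example handler count. -->
<property>
  <name>dfs.federation.router.fairness.handler.count.EXAMPLENAMESERVICE</name>
  <value>10</value>
  <description>Dedicated handlers assigned to a specific nameservice.</description>
</property>
```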
[jira] [Comment Edited] (HDFS-14558) RBF: Isolation/Fairness documentation
[ https://issues.apache.org/jira/browse/HDFS-14558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262526#comment-17262526 ] Yiqun Lin edited comment on HDFS-14558 at 1/11/21, 9:46 AM: The mvnsite compilation failed because an unexpected '<>' was used in dfs.federation.router.fairness.handler.count.. and in 'Dedicated handler assigned to a specific '. [~fengnanli], could you please use the lines below instead? {noformat} dfs.federation.router.fairness.handler.count.*EXAMPLENAMESERVICE* Dedicated handler assigned to a specific nameservice {noformat} Others look good to me. was (Author: linyiqun): The mvnsite compilation failed because an unexpected '<>' was used in dfs.federation.router.fairness.handler.count.. and in 'Dedicated handler assigned to a specific '. [~fengnanli], could you please use the lines below instead? {noformat} dfs.federation.router.fairness.handler.count.*EXAMPLENAMESERVICE* Dedicated handler assigned to a specific nameservice {noformat} Others look good to me. > RBF: Isolation/Fairness documentation > - > > Key: HDFS-14558 > URL: https://issues.apache.org/jira/browse/HDFS-14558 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: CR Hota >Assignee: Fengnan Li >Priority: Major > Attachments: HDFS-14558.001.patch, HDFS-14558.002.patch > > > Documentation is needed to make users aware of this feature HDFS-14090.
[jira] [Comment Edited] (HDFS-14558) RBF: Isolation/Fairness documentation
[ https://issues.apache.org/jira/browse/HDFS-14558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262526#comment-17262526 ] Yiqun Lin edited comment on HDFS-14558 at 1/11/21, 9:46 AM: The mvnsite compilation failed because an unexpected '<>' was used in dfs.federation.router.fairness.handler.count.. and in 'Dedicated handler assigned to a specific '. [~fengnanli], could you please use the lines below instead? {noformat} dfs.federation.router.fairness.handler.count.*EXAMPLENAMESERVICE* Dedicated handler assigned to a specific nameservice {noformat} Others look good to me. was (Author: linyiqun): The mvnsite compilation failed because an unexpected '<>' was used in dfs.federation.router.fairness.handler.count.. [~fengnanli], could you use the line below instead? {noformat} dfs.federation.router.fairness.handler.count.*EXAMPLENAMESERVICE* {noformat} Others look good to me. > RBF: Isolation/Fairness documentation > - > > Key: HDFS-14558 > URL: https://issues.apache.org/jira/browse/HDFS-14558 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: CR Hota >Assignee: Fengnan Li >Priority: Major > Attachments: HDFS-14558.001.patch, HDFS-14558.002.patch > > > Documentation is needed to make users aware of this feature HDFS-14090.
[jira] [Commented] (HDFS-14558) RBF: Isolation/Fairness documentation
[ https://issues.apache.org/jira/browse/HDFS-14558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262526#comment-17262526 ] Yiqun Lin commented on HDFS-14558: -- The mvnsite compilation failed because an unexpected '<>' was used in dfs.federation.router.fairness.handler.count.. [~fengnanli], could you use the line below instead? {noformat} dfs.federation.router.fairness.handler.count.*EXAMPLENAMESERVICE* {noformat} Others look good to me. > RBF: Isolation/Fairness documentation > - > > Key: HDFS-14558 > URL: https://issues.apache.org/jira/browse/HDFS-14558 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: CR Hota >Assignee: Fengnan Li >Priority: Major > Attachments: HDFS-14558.001.patch, HDFS-14558.002.patch > > > Documentation is needed to make users aware of this feature HDFS-14090.
[jira] [Commented] (HDFS-14558) RBF: Isolation/Fairness documentation
[ https://issues.apache.org/jira/browse/HDFS-14558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262202#comment-17262202 ] Yiqun Lin commented on HDFS-14558: -- Hi [~fengnanli], do you have time to address the above review comments? It would be good to complete this documentation as well; HDFS-14090 was merged some time ago, but this JIRA has remained blocked. > RBF: Isolation/Fairness documentation > - > > Key: HDFS-14558 > URL: https://issues.apache.org/jira/browse/HDFS-14558 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: CR Hota >Assignee: Fengnan Li >Priority: Major > Attachments: HDFS-14558.001.patch > > > Documentation is needed to make users aware of this feature HDFS-14090.
[jira] [Commented] (HDFS-14558) RBF: Isolation/Fairness documentation
[ https://issues.apache.org/jira/browse/HDFS-14558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248545#comment-17248545 ] Yiqun Lin commented on HDFS-14558: -- [~fengnanli], thanks for updating the patch; most of it looks great. Just a few comments from me: {noformat} +| dfs.federation.router.fairness.policy.controller.class | `org.apache.hadoop.hdfs.server.federation.fairness.DefaultFairnessPolicyController` | Default handler allocation model to be used if isolation feature is enabled. | {noformat} Here the default value should be org.apache.hadoop.hdfs.server.federation.fairness.NoRouterRpcFairnessPolicyController. We can mention DefaultFairnessPolicyController in the description as the model to use when the isolation feature is enabled. {noformat} +### Isolation + +Isolation and dedicated assignment of RPC handlers across all configured downstream nameservices. + {noformat} Can we additionally mention that the sum of all configured handler count values must be strictly smaller than the router handler count (configured by dfs.federation.router.handler.count)? Please also fix this whitespace line warning: {noformat} hadoop-hdfs-project/hadoop-hdfs-rbf/src/site/markdown/HDFSRouterFederation.md:193:Overall the isolation feature is exposed via a configuration dfs.federation.router.handler.isolation.enable. The default value of this feature will be "false". Users can also introduce their own fairness policy controller for custom allocation of handlers to various nameservices. {noformat} > RBF: Isolation/Fairness documentation > - > > Key: HDFS-14558 > URL: https://issues.apache.org/jira/browse/HDFS-14558 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: CR Hota >Assignee: Fengnan Li >Priority: Major > Attachments: HDFS-14558.001.patch > > > Documentation is needed to make users aware of this feature HDFS-14090.
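To make the handler-count constraint above concrete, a configuration consistent with the review comments might look like the following sketch. This is an illustrative hdfs-rbf-site.xml fragment, not taken from the patch: the nameservice names ns1 and ns2 are hypothetical, and the values are chosen only so that the dedicated counts (5 + 4) stay strictly below dfs.federation.router.handler.count.

```xml
<!-- Hypothetical sketch: enable handler isolation and dedicate -->
<!-- handlers per downstream nameservice (names ns1/ns2 assumed). -->
<property>
  <name>dfs.federation.router.handler.count</name>
  <value>10</value>
</property>
<property>
  <name>dfs.federation.router.handler.isolation.enable</name>
  <value>true</value>
</property>
<!-- Per-nameservice dedicated handlers; the sum (5 + 4) must be -->
<!-- strictly smaller than the router handler count above (10). -->
<property>
  <name>dfs.federation.router.fairness.handler.count.ns1</name>
  <value>5</value>
</property>
<property>
  <name>dfs.federation.router.fairness.handler.count.ns2</name>
  <value>4</value>
</property>
```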
[jira] [Updated] (HDFS-15660) StorageTypeProto is not compatiable between 3.x and 2.6
[ https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15660: - Description: In our case, when the NN had been upgraded to 3.1.3 while the DN version was still 2.6, we found that when Hive called the getContentSummary method, the client and server were not compatible, because Hadoop 3 added the new PROVIDED storage type.
{code:java}
20/04/15 14:28:35 INFO retry.RetryInvocationHandler---main: Exception while invoking getContentSummary of class ClientNamenodeProtocolTranslatorPB over x/x:8020. Trying to fail over immediately.
java.io.IOException: com.google.protobuf.ServiceException: com.google.protobuf.UninitializedMessageException: Message missing required fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
    at org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:819)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
    at com.sun.proxy.$Proxy11.getContentSummary(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getContentSummary(DFSClient.java:3144)
    at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:706)
    at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:702)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getContentSummary(DistributedFileSystem.java:713)
    at org.apache.hadoop.fs.shell.Count.processPath(Count.java:109)
    at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
    at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
    at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
    at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
    at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
    at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
    at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
Caused by: com.google.protobuf.ServiceException: com.google.protobuf.UninitializedMessageException: Message missing required fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:272)
    at com.sun.proxy.$Proxy10.getContentSummary(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:816)
    ... 23 more
Caused by: com.google.protobuf.UninitializedMessageException: Message missing required fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
    at com.google.protobuf.AbstractMessage$Builder.newUninitializedMessageException(AbstractMessage.java:770)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetContentSummaryResponseProto$Builder.build(ClientNamenodeProtocolProtos.java:65392)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetContentSummaryResponseProto$Builder.build(ClientNamenodeProtocolProtos.java:65331)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:263)
    ... 25 more
{code}
This compatibility issue only occurs when the StorageType feature is used in the cluster.
[jira] [Updated] (HDFS-15660) StorageTypeProto is not compatiable between 3.x and 2.6
[ https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15660: - Fix Version/s: 2.10.2 3.4.0 2.9.3 Hadoop Flags: Reviewed The new storage type was introduced in HDFS-9806, and that feature shipped in the 3.1 release, so the versions before 3.1 contain this compatibility issue and need this fix applied. Committed this to branch-2.9, branch-2.10, branch-3.0 and trunk. Thanks [~jianliang.wu] for the contribution and others for the review. > StorageTypeProto is not compatiable between 3.x and 2.6 > --- > > Key: HDFS-15660 > URL: https://issues.apache.org/jira/browse/HDFS-15660 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.0.0, 3.0.1, 2.9.2, 2.8.5, 2.7.7, 2.10.1 >Reporter: Ryan Wu >Assignee: Ryan Wu >Priority: Major > Fix For: 2.9.3, 3.4.0, 2.10.2 > > Attachments: HDFS-15660.002.patch, HDFS-15660.003.patch
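The mechanism behind the incompatibility can be sketched with an illustrative proto2 fragment. This is not the actual hdfs.proto from the Hadoop tree (the field names and numbers here are assumptions); it only shows the failure mode: a 2.6-era client that does not know the PROVIDED enum value cannot store the unrecognized wire value in a `required` field, so the parsed message is left uninitialized and build() throws UninitializedMessageException, matching the missing field summary.typeQuotaInfos.typeQuotaInfo[3].type in the stack trace above.

```proto
// Illustrative proto2 sketch (assumed names/numbers, not the real hdfs.proto).
enum StorageTypeProto {
  DISK = 1;
  SSD = 2;
  ARCHIVE = 3;
  RAM_DISK = 4;
  PROVIDED = 5;  // added by HDFS-9806 in 3.x; unknown to 2.6-era clients
}

message StorageTypeQuotaInfoProto {
  // In proto2, an unrecognized enum value cannot populate a `required`
  // field; the receiving 2.6 client's message stays uninitialized and
  // build() fails with UninitializedMessageException.
  required StorageTypeProto type = 1;
  required uint64 quota = 2;
  required uint64 consumed = 3;
}
```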
[jira] [Updated] (HDFS-15660) StorageTypeProto is not compatiable between 3.x and 2.6
[ https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15660: - Resolution: Fixed Status: Resolved (was: Patch Available) > StorageTypeProto is not compatiable between 3.x and 2.6 > --- > > Key: HDFS-15660 > URL: https://issues.apache.org/jira/browse/HDFS-15660 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.0.0, 3.0.1, 2.9.2, 2.8.5, 2.7.7, 2.10.1 >Reporter: Ryan Wu >Assignee: Ryan Wu >Priority: Major > Fix For: 2.9.3, 3.4.0, 2.10.2 > > Attachments: HDFS-15660.002.patch, HDFS-15660.003.patch
[jira] [Updated] (HDFS-15660) StorageTypeProto is not compatiable between 3.x and 2.6
[ https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15660: - Target Version/s: 2.9.3, 2.10.2 (was: 2.9.3, 3.3.1, 3.4.0, 3.1.5, 2.10.2, 3.2.3) Affects Version/s: (was: 3.1.3) (was: 3.2.0) 2.9.2 2.8.5 2.7.7 2.10.1 Issue Type: Bug (was: Improvement) > StorageTypeProto is not compatiable between 3.x and 2.6 > --- > > Key: HDFS-15660 > URL: https://issues.apache.org/jira/browse/HDFS-15660 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.0.0, 3.0.1, 2.9.2, 2.8.5, 2.7.7, 2.10.1 >Reporter: Ryan Wu >Assignee: Ryan Wu >Priority: Major > Attachments: HDFS-15660.002.patch, HDFS-15660.003.patch
[jira] [Commented] (HDFS-15660) StorageTypeProto is not compatiable between 3.x and 2.6
[ https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243715#comment-17243715 ] Yiqun Lin commented on HDFS-15660: -- Thanks for providing the test results for this change, [~jianliang.wu]. LGTM, +1. I think this is a safe change; I will hold off committing it until next week. > StorageTypeProto is not compatiable between 3.x and 2.6 > --- > > Key: HDFS-15660 > URL: https://issues.apache.org/jira/browse/HDFS-15660 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 3.2.0, 3.1.3 >Reporter: Ryan Wu >Assignee: Ryan Wu >Priority: Major > Attachments: HDFS-15660.002.patch, HDFS-15660.003.patch
[jira] [Updated] (HDFS-15660) StorageTypeProto is not compatiable between 3.x and 2.6
[ https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15660:

    Attachment: HDFS-15660.002.patch

> StorageTypeProto is not compatiable between 3.x and 2.6
>
> Key: HDFS-15660
> URL: https://issues.apache.org/jira/browse/HDFS-15660
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 3.2.0, 3.1.3
> Reporter: Ryan Wu
> Assignee: Ryan Wu
> Priority: Major
> Attachments: HDFS-15660.001.patch, HDFS-15660.002.patch
>
> In our case, after the NameNode was upgraded to 3.1.3 while the DataNodes were still on version 2.6, we found that when Hive called the getContentSummary method, the client and server were incompatible because Hadoop 3 added the new PROVIDED storage type.
> {code:java}
> 20/04/15 14:28:35 INFO retry.RetryInvocationHandler---main: Exception while invoking getContentSummary of class ClientNamenodeProtocolTranslatorPB over x/x:8020. Trying to fail over immediately.
> java.io.IOException: com.google.protobuf.ServiceException: com.google.protobuf.UninitializedMessageException: Message missing required fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>     at org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:819)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>     at com.sun.proxy.$Proxy11.getContentSummary(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSClient.getContentSummary(DFSClient.java:3144)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:706)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:702)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getContentSummary(DistributedFileSystem.java:713)
>     at org.apache.hadoop.fs.shell.Count.processPath(Count.java:109)
>     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
>     at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
>     at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>     at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
>     at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>     at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>     at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
> Caused by: com.google.protobuf.ServiceException: com.google.protobuf.UninitializedMessageException: Message missing required fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:272)
>     at com.sun.proxy.$Proxy10.getContentSummary(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:816)
>     ... 23 more
> Caused by: com.google.protobuf.UninitializedMessageException: Message missing required fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>     at com.google.protobuf.AbstractMessage$Builder.newUninitializedMessageException(AbstractMessage.java:770)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetContentSummaryResponseProto$Builder.build(ClientNamenodeProtocolProtos.java:65392)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetContentSummaryResponseProto$Builder.build(ClientNamenodeProtocolProtos.java:65331)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:263)
>     ... 25 more
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
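The incompatibility can be pictured without protobuf itself: in proto2, an enum number the parser does not recognize is moved to the unknown-field set, so a *required* enum field (here the StorageTypeProto inside the type quota info) is left unset and building the message throws UninitializedMessageException. A minimal Java sketch of that behavior; the enum numbers and helper names are illustrative, not the real generated code:

```java
import java.util.Arrays;
import java.util.List;

// Conceptual sketch (NOT the real generated protobuf code) of why the
// 2.6 client fails: an unknown enum number on the wire is treated as an
// unknown field, so a required enum field ends up unset and the
// required-field check rejects the whole message.
public class EnumCompatSketch {

    // Storage type numbers a 2.6 client knows about; the numbers are
    // illustrative (PROVIDED, added in 3.x, is assumed to be 5 here).
    public static final List<Integer> KNOWN_TYPES = Arrays.asList(1, 2, 3, 4);

    // Simulates parsing a required StorageTypeProto field: an unknown
    // enum number leaves the field unset, so parsing fails.
    public static boolean oldClientCanParse(int wireEnumNumber) {
        return KNOWN_TYPES.contains(wireEnumNumber);
    }

    public static void main(String[] args) {
        System.out.println(oldClientCanParse(1)); // known type: parses fine
        System.out.println(oldClientCanParse(5)); // PROVIDED: required field left unset
    }
}
```

This is exactly what the trace above reports for summary.typeQuotaInfos.typeQuotaInfo[3].type: the fourth quota entry carries the new PROVIDED type, the old client cannot map it, and the required field is reported missing.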
[jira] [Commented] (HDFS-15660) StorageTypeProto is not compatiable between 3.x and 2.6
[ https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239094#comment-17239094 ] Yiqun Lin commented on HDFS-15660:

Attach the same patch to trigger Jenkins.
[jira] [Updated] (HDFS-15660) StorageTypeProto is not compatiable between 3.x and 2.6
[ https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15660:

    Status: Patch Available (was: Open)
[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}
[ https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231451#comment-17231451 ] Yiqun Lin commented on HDFS-14090:

Thanks for addressing the comments, [~fengnanli]. LGTM, +1.

> RBF: Improved isolation for downstream name nodes. {Static}
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: CR Hota
> Assignee: Fengnan Li
> Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, HDFS-14090.012.patch, HDFS-14090.013.patch, HDFS-14090.014.patch, HDFS-14090.015.patch, HDFS-14090.016.patch, HDFS-14090.017.patch, HDFS-14090.018.patch, HDFS-14090.019.patch, HDFS-14090.020.patch, HDFS-14090.021.patch, HDFS-14090.022.patch, HDFS-14090.023.patch, HDFS-14090.024.patch, HDFS-14090.025.patch, RBF_ Isolation design.pdf
>
> The Router is a gateway to the underlying name nodes. A gateway architecture should help minimize the impact on clients connecting to healthy clusters versus unhealthy clusters.
> For example, if there are two downstream name nodes and one of them is heavily loaded, with calls spiking its RPC queue times, back pressure will start reflecting on the router. As a result, clients connecting to healthy/faster name nodes will also slow down, since the same RPC queue is maintained for all calls at the router layer. Essentially the same IPC thread pool is used by the router to connect to all name nodes.
> Currently the router uses a single RPC queue for all calls. Let's discuss how we can change the architecture and add some throttling logic for unhealthy/slow/overloaded name nodes.
> One way could be to read from the current call queue, immediately identify the downstream name node, and maintain a separate queue for each underlying name node. Another, simpler way is to maintain some sort of rate limiter configured for each name node and let routers drop/reject/fail requests after a certain threshold.
> This won't be a simple change, as the router's 'Server' layer would need a redesign and reimplementation. Currently this layer is the same as the name node's.
> Opening this ticket to discuss, design and implement this feature.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}
[ https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17230612#comment-17230612 ] Yiqun Lin edited comment on HDFS-14090 at 11/12/20, 2:38 PM:

Hi [~fengnanli], three nits for the latest patch:

1. It would look better to rename dfs.federation.router.fairness.handler.count.NS to dfs.federation.router.fairness.handler.count.EXAMPLENAMESERVICE.

2. {noformat}
smaller or equal to the total number of router handlers; if the special
*concurrent* is not specified, the sum of all configured values must be
strictly smaller than the router handlers thus the left will be allocated
to the concurrent calls.
{noformat}
Can we mention the related setting: "strictly smaller than the router handlers (dfs.federation.router.handler.count)"...

3. Can you fix the related failed unit test?
|hadoop.hdfs.server.federation.router.TestRBFConfigFields|

Others look good to me.
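For reference, the handler-count configuration discussed in these review comments would look roughly like the following hdfs-rbf-site.xml fragment. This is a hedged sketch: the property names follow the pattern quoted above (dfs.federation.router.fairness.handler.count.&lt;nameservice&gt;), and "ns1"/"ns2" are example nameservice IDs; the exact keys and defaults should be verified against hdfs-rbf-default.xml of the release in question.

```xml
<configuration>
  <!-- Total router handler threads. -->
  <property>
    <name>dfs.federation.router.handler.count</name>
    <value>20</value>
  </property>
  <!-- Dedicated handlers per downstream nameservice. Their sum (16) is
       strictly smaller than the total, so the remaining 4 handlers are
       left for concurrent (fan-out) calls. -->
  <property>
    <name>dfs.federation.router.fairness.handler.count.ns1</name>
    <value>8</value>
  </property>
  <property>
    <name>dfs.federation.router.fairness.handler.count.ns2</name>
    <value>8</value>
  </property>
</configuration>
```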
[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}
[ https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17230612#comment-17230612 ] Yiqun Lin commented on HDFS-14090:

Hi [~fengnanli], two nits for the latest patch:
{noformat}
smaller or equal to the total number of router handlers; if the special
*concurrent* is not specified, the sum of all configured values must be
strictly smaller than the router handlers thus the left will be allocated
to the concurrent calls.
{noformat}
Can we mention related setting ''strictly smaller than the router handlers (dfs.federation.router.handler.count)...
Can you fix related failed unit test?
Others look good to me.
[jira] [Comment Edited] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}
[ https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17230612#comment-17230612 ] Yiqun Lin edited comment on HDFS-14090 at 11/12/20, 1:07 PM:

Hi [~fengnanli], two nits for the latest patch:
{noformat}
smaller or equal to the total number of router handlers; if the special
*concurrent* is not specified, the sum of all configured values must be
strictly smaller than the router handlers thus the left will be allocated
to the concurrent calls.
{noformat}
Can we mention related setting ''strictly smaller than the router handlers (dfs.federation.router.handler.count)...
Can you fix related failed unit test?
|hadoop.hdfs.server.federation.router.TestRBFConfigFields|
Others look good to me.
[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}
[ https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17230347#comment-17230347 ] Yiqun Lin commented on HDFS-14090:

Sounds good to me, let's address the #2 comment, [~fengnanli].
[jira] [Comment Edited] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}
[ https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227189#comment-17227189 ] Yiqun Lin edited comment on HDFS-14090 at 11/6/20, 6:58 AM:

Hi [~fengnanli], some minor comments from me:

1. I see we introduce CONCURRENT_NS for concurrent calls; why not acquire the permit from the corresponding ns instead?

2. The current description of the setting in hdfs-rbf-default.xml could say more. At a minimum, we should mention:
* the setting name for configuring the handler count for each ns, including the CONCURRENT_NS ns;
* that the sum of the dedicated handler counts should be less than the value of dfs.federation.router.handler.count.

3. It would be better to document this improvement in HDFSRouterFederation.md.

Comments #2 and #3 can be addressed in a follow-up JIRA, :).
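The permit-based isolation being reviewed can be pictured as one bounded handler pool per nameservice. The sketch below models that idea with java.util.concurrent.Semaphore; the class and method names are illustrative, not the actual router fairness API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hedged sketch of per-nameservice handler isolation: each downstream
// nameservice gets a bounded number of permits, so one overloaded
// nameservice cannot consume every router handler.
public class FairnessSketch {

    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    public FairnessSketch(Map<String, Integer> handlerCounts) {
        handlerCounts.forEach((ns, n) -> permits.put(ns, new Semaphore(n)));
    }

    /** Try to reserve a handler for this nameservice; false means overloaded. */
    public boolean acquire(String ns) {
        Semaphore s = permits.get(ns);
        return s != null && s.tryAcquire();
    }

    /** Return the handler once the downstream call completes. */
    public void release(String ns) {
        Semaphore s = permits.get(ns);
        if (s != null) {
            s.release();
        }
    }
}
```

A dedicated pool for concurrent (fan-out) calls, like the CONCURRENT_NS discussed above, would simply be one more entry in the map.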
[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}
[ https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227189#comment-17227189 ] Yiqun Lin commented on HDFS-14090: -- Hi [~fengnanli], some minor comments from me: 1. I see here we introduce CONCURRENT_NS for concurrent call, why not acquire permit to corresponding ns instead of? 2. Current description of setting hdfs-rbf-default.xml can describe more. At least, we need to mention: * The setting name for how to configure handler count for each ns, also include CONCURRENT_NS ns. * The sum of dedicated handler count should be less than value of dfs.federation.router.handler.count 3. It would be better to add this improvement in HDFSRouterFederation.md. > RBF: Improved isolation for downstream name nodes. {Static} > --- > > Key: HDFS-14090 > URL: https://issues.apache.org/jira/browse/HDFS-14090 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: CR Hota >Assignee: Fengnan Li >Priority: Major > Attachments: HDFS-14090-HDFS-13891.001.patch, > HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, > HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, > HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, > HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, > HDFS-14090.012.patch, HDFS-14090.013.patch, HDFS-14090.014.patch, > HDFS-14090.015.patch, HDFS-14090.016.patch, HDFS-14090.017.patch, > HDFS-14090.018.patch, HDFS-14090.019.patch, HDFS-14090.020.patch, > HDFS-14090.021.patch, HDFS-14090.022.patch, HDFS-14090.023.patch, RBF_ > Isolation design.pdf > > > Router is a gateway to underlying name nodes. Gateway architectures, should > help minimize impact of clients connecting to healthy clusters vs unhealthy > clusters. > For example - If there are 2 name nodes downstream, and one of them is > heavily loaded with calls spiking rpc queue times, due to back pressure the > same with start reflecting on the router. 
As a result of this, clients > connecting to healthy/faster name nodes will also slow down, as the same rpc queue > is maintained for all calls at the router layer. Essentially the same IPC > thread pool is used by the router to connect to all name nodes. > Currently the router uses a single rpc queue for all calls. Let's discuss how we > can change the architecture and add some throttling logic for > unhealthy/slow/overloaded name nodes. > One way could be to read from the current call queue, immediately identify the > downstream name node, and maintain a separate queue for each underlying name > node. Another, simpler way is to maintain some sort of rate limiter configured > for each name node and let routers drop/reject/return errors for requests past a > certain threshold. > This won’t be a simple change, as the router’s ‘Server’ layer would need a redesign > and reimplementation. Currently this layer is the same as the name node's. > Opening this ticket to discuss, design and implement this feature. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
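The isolation idea discussed in this thread (a dedicated handler budget per downstream nameservice, plus a shared CONCURRENT_NS pool for fan-out calls) can be sketched with plain JDK semaphores. This is a minimal illustration under the assumption of a static handler split; it is not the actual Router fair-call-queue code, and the class and method names below are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/**
 * Illustrative per-nameservice permit limiter (not the real Router code).
 * Each downstream ns gets a dedicated permit pool; the sum of the pools
 * should stay below dfs.federation.router.handler.count.
 */
public class NsPermitLimiter {
  private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

  public NsPermitLimiter(Map<String, Integer> handlersPerNs) {
    handlersPerNs.forEach((ns, n) -> permits.put(ns, new Semaphore(n)));
  }

  /** Take a handler permit for the target ns; false means "throttle". */
  public boolean tryAcquire(String ns) {
    Semaphore s = permits.get(ns);
    return s != null && s.tryAcquire();
  }

  /** Return the permit once the downstream call finishes. */
  public void release(String ns) {
    Semaphore s = permits.get(ns);
    if (s != null) {
      s.release();
    }
  }
}
```

Under this sketch, a concurrent call (one fanned out to every nameservice) would draw from a "CONCURRENT_NS" pool rather than the per-ns pools, which is exactly the trade-off questioned in the first review comment above.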
[jira] [Commented] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit
[ https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17225839#comment-17225839 ] Yiqun Lin commented on HDFS-15651: -- Thanks [~Aiphag0] for the quick fix. LGTM. +1. > Client could not obtain block when DN CommandProcessingThread exit > -- > > Key: HDFS-15651 > URL: https://issues.apache.org/jira/browse/HDFS-15651 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Yiqun Lin >Assignee: Aiphago >Priority: Major > Attachments: HDFS-15651.001.patch, HDFS-15651.002.patch, > HDFS-15651.patch > > > In our cluster, we applied the HDFS-14997 improvement. > We hit one case where the CommandProcessingThread exits due to an OOM error. > The OOM error was caused by an abnormal application running on this DN > node. > {noformat} > 2020-10-18 10:27:12,604 ERROR > org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor > encountered fatal exception and exit. > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247) > at > 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208) > {noformat} > The main point here is that a crash of the CommandProcessingThread has a very > bad impact: none of the NN response commands will be processed on the DN side. > We enabled block tokens to access the data, but here the DN command > DNA_ACCESSKEYUPDATE is not processed in time by the DN. And then we see lots of > Sasl errors due to key expiration in the DN log: > {noformat} > javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password > [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't > re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, > userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the > required block key (keyID=xxx) doesn't exist.] > {noformat} > > On the client side, our users receive lots of 'could not obtain > block' errors with BlockMissingException. > CommandProcessingThread is a critical thread; it should always be running. > {code:java} > /** >* CommandProcessingThread that process commands asynchronously. >*/ > class CommandProcessingThread extends Thread { > private final BPServiceActor actor; > private final BlockingQueue<Runnable> queue; > ... 
> @Override > public void run() { > try { > processQueue(); > } catch (Throwable t) { > LOG.error("{} encountered fatal exception and exit.", getName(), t); > <=== should not exit this thread > } > } > {code} > Once an unexpected error happens, better handling would be: > * catch the exception, deal with the error appropriately, and let > processQueue continue to run > or > * exit the DN process so that the admin can investigate -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
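The first option above (catch the exception and let processQueue continue) can be sketched as below. This is a simplified, illustrative stand-in rather than the actual BPServiceActor patch; the Runnable queue and the processed counter are assumptions made for the sketch.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: a command loop that survives unexpected per-command failures. */
class ResilientCommandThread extends Thread {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  final AtomicInteger processed = new AtomicInteger();

  void enqueue(Runnable cmd) {
    queue.add(cmd);
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        Runnable cmd = queue.take();
        cmd.run();
        processed.incrementAndGet();
      } catch (InterruptedException ie) {
        // Shutdown requested: restore the interrupt flag and leave the loop.
        Thread.currentThread().interrupt();
      } catch (Throwable t) {
        // Unlike the quoted run() above, a failing command is logged and the
        // loop keeps draining the queue, so later commands (e.g.
        // DNA_ACCESSKEYUPDATE) are still processed.
        System.err.println(getName() + " command failed, continuing: " + t);
      }
    }
  }
}
```

The design choice here mirrors the ticket's conclusion: a single bad command (or a transient OOM while handling it) should not silently disable all future NN-to-DN command processing.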
[jira] [Commented] (HDFS-15294) Federation balance tool
[ https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17224606#comment-17224606 ] Yiqun Lin commented on HDFS-15294: -- Hi [~coconut_icecream], as FedBalance is a completely new feature and hasn't been released in the latest hadoop version, I'm not sure whether there are other potential issues. I'd prefer to backport this feature later, once it has proven stable enough after release. > Federation balance tool > --- > > Key: HDFS-15294 > URL: https://issues.apache.org/jira/browse/HDFS-15294 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Fix For: 3.4.0 > > Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, > HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, > HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, > HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf > > > This jira introduces a new HDFS federation balance tool to balance data > across different federation namespaces. It uses Distcp to copy data from the > source path to the target path. > The process is: > 1. Use distcp and snapshot diff to sync data between src and dst until they > are the same. > 2. Update the mount table in the Router if RBF mode is specified. > 3. Deal with the src data: move it to trash, delete it, or skip it. > The design of the fedbalance tool comes from the discussion in HDFS-15087. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
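The three-step process in the issue description can be sketched as a tiny stage machine. This is purely illustrative pseudostructure under assumed names; it is not the real BalanceProcedure/DistCpProcedure API from the patches.

```java
/**
 * Illustrative stage flow for the balance process described above
 * (assumed names, not the actual fedbalance classes).
 */
public class FedBalanceFlow {
  enum Stage { SYNC, UPDATE_MOUNT_TABLE, HANDLE_SRC, DONE }

  /** Decide the next stage from the current one and the remaining diff. */
  static Stage next(Stage s, int remainingDiff, boolean rbfMode) {
    switch (s) {
      case SYNC:
        // Step 1: keep running distcp + snapshot diff until src == dst.
        if (remainingDiff > 0) {
          return Stage.SYNC;
        }
        return rbfMode ? Stage.UPDATE_MOUNT_TABLE : Stage.HANDLE_SRC;
      case UPDATE_MOUNT_TABLE:
        // Step 2 (RBF mode only): repoint the Router mount entry.
        return Stage.HANDLE_SRC;
      case HANDLE_SRC:
        // Step 3: move src to trash, delete it, or skip it.
        return Stage.DONE;
      default:
        return Stage.DONE;
    }
  }
}
```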
[jira] [Comment Edited] (HDFS-15640) Add diff threshold to FedBalance
[ https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221089#comment-17221089 ] Yiqun Lin edited comment on HDFS-15640 at 10/27/20, 2:48 AM: - Committed this to trunk. Thanks [~LiJinglun] for the contribution. BTW, [~LiJinglun], since HDFS-15294 is already a closed feature JIRA, next time we could add a related link to the HDFS-15294 JIRA instead of reopening it when we find a further bug or enhancement for FedBalance. was (Author: linyiqun): Committed this to trunk. Thanks [~LiJinglun] for the contribution. > Add diff threshold to FedBalance > > > Key: HDFS-15640 > URL: https://issues.apache.org/jira/browse/HDFS-15640 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch, > HDFS-15640.003.patch, HDFS-15640.004.patch > > > Currently in the DistCpProcedure it must submit distcp round by round until > there is no diff to go to the final distcp stage. The condition is very > strict. During incremental copy stage, if the diff size is under the given > threshold scope then we don't need to wait for no diff. We can start the > final distcp directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15640) Add diff threshold to FedBalance
[ https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15640: - Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed Status: Resolved (was: Patch Available) Committed this to trunk. Thanks [~LiJinglun] for the contribution. > Add diff threshold to FedBalance > > > Key: HDFS-15640 > URL: https://issues.apache.org/jira/browse/HDFS-15640 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch, > HDFS-15640.003.patch, HDFS-15640.004.patch > > > Currently in the DistCpProcedure it must submit distcp round by round until > there is no diff to go to the final distcp stage. The condition is very > strict. During incremental copy stage, if the diff size is under the given > threshold scope then we don't need to wait for no diff. We can start the > final distcp directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-15294) Federation balance tool
[ https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin resolved HDFS-15294. -- Resolution: Fixed > Federation balance tool > --- > > Key: HDFS-15294 > URL: https://issues.apache.org/jira/browse/HDFS-15294 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Fix For: 3.4.0 > > Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, > HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, > HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, > HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf > > > This jira introduces a new HDFS federation balance tool to balance data > across different federation namespaces. It uses Distcp to copy data from the > source path to the target path. > The process is: > 1. Use distcp and snapshot diff to sync data between src and dst until they > are the same. > 2. Update mount table in Router if we specified RBF mode. > 3. Deal with src data, move to trash, delete or skip them. > The design of fedbalance tool comes from the discussion in HDFS-15087. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15640) Add diff threshold to FedBalance
[ https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15640: - Summary: Add diff threshold to FedBalance (was: Add snapshot diff threshold to FedBalance) > Add diff threshold to FedBalance > > > Key: HDFS-15640 > URL: https://issues.apache.org/jira/browse/HDFS-15640 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch, > HDFS-15640.003.patch, HDFS-15640.004.patch > > > Currently in the DistCpProcedure it must submit distcp round by round until > there is no diff to go to the final distcp stage. The condition is very > strict. During incremental copy stage, if the diff size is under the given > threshold scope then we don't need to wait for no diff. We can start the > final distcp directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15640) Add snapshot diff threshold to FedBalance
[ https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15640: - Description: Currently in the DistCpProcedure it must submit distcp round by round until there is no diff to go to the final distcp stage. The condition is very strict. During incremental copy stage, if the diff size is under the given threshold scope then we don't need to wait for no diff. We can start the final distcp directly. (was: Currently in the DistCpProcedure it must submit distcp round by round until there is no diff to go to the final distcp stage. The condition is very strict. If the distcp could finish in an acceptable period then we don't need to wait for no diff. For example if 3 consecutive distcp jobs all finish within 10 minutes then we can predict the final distcp could also finish within 10 minutes. So we can start the final distcp directly.) > Add snapshot diff threshold to FedBalance > - > > Key: HDFS-15640 > URL: https://issues.apache.org/jira/browse/HDFS-15640 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch, > HDFS-15640.003.patch, HDFS-15640.004.patch > > > Currently in the DistCpProcedure it must submit distcp round by round until > there is no diff to go to the final distcp stage. The condition is very > strict. During incremental copy stage, if the diff size is under the given > threshold scope then we don't need to wait for no diff. We can start the > final distcp directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
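The relaxation described in the updated description amounts to replacing the strict "no diff" exit check of the incremental copy stage with a "diff at or under a threshold" check. A one-method sketch follows; the names are illustrative, not the DistCpProcedure API.

```java
/** Sketch of the incremental-copy exit check with a diff threshold. */
public class DiffThresholdCheck {
  /**
   * Previously the final distcp could start only once the snapshot diff was
   * empty; with a configured threshold, a small remaining diff is enough.
   */
  static boolean canStartFinalDistcp(int diffSize, int diffThreshold) {
    return diffSize <= diffThreshold;
  }
}
```

With a threshold of 0 this degenerates to the original strict condition, so the change is backward compatible by default.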
[jira] [Updated] (HDFS-15640) Add snapshot diff threshold to FedBalance
[ https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15640: - Summary: Add snapshot diff threshold to FedBalance (was: RBF: Add fast distcp threshold to FedBalance.) > Add snapshot diff threshold to FedBalance > - > > Key: HDFS-15640 > URL: https://issues.apache.org/jira/browse/HDFS-15640 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch, > HDFS-15640.003.patch, HDFS-15640.004.patch > > > Currently in the DistCpProcedure it must submit distcp round by round until > there is no diff to go to the final distcp stage. The condition is very > strict. If the distcp could finish in an acceptable period then we don't need > to wait for no diff. For example if 3 consecutive distcp jobs all finish > within 10 minutes then we can predict the final distcp could also finish > within 10 minutes. So we can start the final distcp directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit
[ https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220794#comment-17220794 ] Yiqun Lin edited comment on HDFS-15651 at 10/26/20, 4:21 PM: - Thanks for the comments, [~hexiaoqiao]. {quote}Catch the error and loop forever could not resolve this issue in my opinion because DataNode still service but without the correct blockToken key. {quote} The block token key is updated every keyUpdateInterval (dfs.block.access.key.update.interval). Once we recover the CommandProcessingThread, the DN will get the new key from the NN in the next keyUpdateInterval (by default, 10 hours). [~Aiphag0], feel free to attach your fix here, :). was (Author: linyiqun): Thanks for the comments, [~hexiaoqiao]. {quote}Catch the error and loop forever could not resolve this issue in my opinion because DataNode still service but without the correct blockToken key. {quote} The blocktoken key will be updated for every keyUpdateInterval. Once we recover the CommandProcessingThread, DN will get the new key from NN in the next keyUpdateInterval (by default is 10 hours). [~Aiphag0], feel free to attach your fix here, :). > Client could not obtain block when DN CommandProcessingThread exit > -- > > Key: HDFS-15651 > URL: https://issues.apache.org/jira/browse/HDFS-15651 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Yiqun Lin >Priority: Major > > In our cluster, we applied the HDFS-14997 improvement. > We find one case that CommandProcessingThread will exit due to OOM error. > OOM error was caused by our one abnormal application that running on this DN > node. > {noformat} > 2020-10-18 10:27:12,604 ERROR > org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor > encountered fatal exception and exit. 
> java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208) > {noformat} > Here the main point is that CommandProcessingThread crashed will lead a very > bad impact. All the NN response commands will not be processed by DN side. > We enabled the block token to access the data, but here the DN command > DNA_ACCESSKEYUPDATE is not processed on time by DN. 
And then we see lots of > Sasl error due to key expiration in DN log: > {noformat} > javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password > [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't > re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, > userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the > required block key (keyID=xxx) doesn't exist.] > {noformat} > > For the impact in client side, our users receive lots of 'could not obtain > block' error with BlockMissingException. > CommandProcessingThread is a critical thread, it should always be running. > {code:java} > /** >* CommandProcessingThread that process commands asynchronously. >*/ > class CommandProcessingThread extends Thread { > private final BPServiceActor actor; > private final BlockingQueue queue; > ... > @Override > public void run() { > try { > processQueue(); > } catch (Throwable t) { >
[jira] [Commented] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit
[ https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220794#comment-17220794 ] Yiqun Lin commented on HDFS-15651: -- Thanks for the comments, [~hexiaoqiao]. {quote}Catch the error and loop forever could not resolve this issue in my opinion because DataNode still service but without the correct blockToken key. {quote} The blocktoken key will be updated for every keyUpdateInterval. Once we recover the CommandProcessingThread, DN will get the new key from NN in the next keyUpdateInterval (by default is 10 hours). [~Aiphag0], feel free to attach your fix here, :). > Client could not obtain block when DN CommandProcessingThread exit > -- > > Key: HDFS-15651 > URL: https://issues.apache.org/jira/browse/HDFS-15651 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Yiqun Lin >Priority: Major > > In our cluster, we applied the HDFS-14997 improvement. > We find one case that CommandProcessingThread will exit due to OOM error. > OOM error was caused by our one abnormal application that running on this DN > node. > {noformat} > 2020-10-18 10:27:12,604 ERROR > org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor > encountered fatal exception and exit. 
> java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208) > {noformat} > Here the main point is that CommandProcessingThread crashed will lead a very > bad impact. All the NN response commands will not be processed by DN side. > We enabled the block token to access the data, but here the DN command > DNA_ACCESSKEYUPDATE is not processed on time by DN. 
And then we see lots of > Sasl error due to key expiration in DN log: > {noformat} > javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password > [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't > re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, > userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the > required block key (keyID=xxx) doesn't exist.] > {noformat} > > For the impact in client side, our users receive lots of 'could not obtain > block' error with BlockMissingException. > CommandProcessingThread is a critical thread, it should always be running. > {code:java} > /** >* CommandProcessingThread that process commands asynchronously. >*/ > class CommandProcessingThread extends Thread { > private final BPServiceActor actor; > private final BlockingQueue queue; > ... > @Override > public void run() { > try { > processQueue(); > } catch (Throwable t) { > LOG.error("{} encountered fatal exception and exit.", getName(), t); > <=== should not exit this thread > } > } > {code} > Once a unexpected error happened, a better handing should be: > * catch the exception, appropriately deal with the error and let > processQueue continue to run > or > * exit the DN process to let admin user investigate this -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional
[jira] [Comment Edited] (HDFS-15640) RBF: Add fast distcp threshold to FedBalance.
[ https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220611#comment-17220611 ] Yiqun Lin edited comment on HDFS-15640 at 10/26/20, 10:33 AM: -- Latest patch LGTM, +1. Will commit this tomorrow if there are no further comments from others. was (Author: linyiqun): Latest patch LGTM, +1. Will commit this tomorrow once there is further comments from others. > RBF: Add fast distcp threshold to FedBalance. > - > > Key: HDFS-15640 > URL: https://issues.apache.org/jira/browse/HDFS-15640 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch, > HDFS-15640.003.patch, HDFS-15640.004.patch > > > Currently in the DistCpProcedure it must submit distcp round by round until > there is no diff to go to the final distcp stage. The condition is very > strict. If the distcp could finish in an acceptable period then we don't need > to wait for no diff. For example if 3 consecutive distcp jobs all finish > within 10 minutes then we can predict the final distcp could also finish > within 10 minutes. So we can start the final distcp directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15640) RBF: Add fast distcp threshold to FedBalance.
[ https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220611#comment-17220611 ] Yiqun Lin commented on HDFS-15640: -- Latest patch LGTM, +1. Will commit this tomorrow once there is further comments from others. > RBF: Add fast distcp threshold to FedBalance. > - > > Key: HDFS-15640 > URL: https://issues.apache.org/jira/browse/HDFS-15640 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch, > HDFS-15640.003.patch, HDFS-15640.004.patch > > > Currently in the DistCpProcedure it must submit distcp round by round until > there is no diff to go to the final distcp stage. The condition is very > strict. If the distcp could finish in an acceptable period then we don't need > to wait for no diff. For example if 3 consecutive distcp jobs all finish > within 10 minutes then we can predict the final distcp could also finish > within 10 minutes. So we can start the final distcp directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit
[ https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15651: - Description: In our cluster, we applied the HDFS-14997 improvement. We find one case that CommandProcessingThread will exit due to OOM error. OOM error was caused by our one abnormal application that running on this DN node. {noformat} 2020-10-18 10:27:12,604 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor encountered fatal exception and exit. java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:717) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208) 
{noformat} Here the main point is that CommandProcessingThread crashed will lead a very bad impact. All the NN response commands will not be processed by DN side. We enabled the block token to access the data, but here the DN command DNA_ACCESSKEYUPDATE is not processed on time by DN. And then we see lots of Sasl error due to key expiration in DN log: {noformat} javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the required block key (keyID=xxx) doesn't exist.] {noformat} For the impact in client side, our users receive lots of 'could not obtain block' error with BlockMissingException. CommandProcessingThread is a critical thread, it should always be running. {code:java} /** * CommandProcessingThread that process commands asynchronously. */ class CommandProcessingThread extends Thread { private final BPServiceActor actor; private final BlockingQueue queue; ... @Override public void run() { try { processQueue(); } catch (Throwable t) { LOG.error("{} encountered fatal exception and exit.", getName(), t); <=== should not exit this thread } } {code} Once a unexpected error happened, a better handing should be: * catch the exception, appropriately deal with the error and let processQueue continue to run or * exit the DN process to let admin user investigate this was: In our cluster, we applied the HDFS-14997 improvement. We find one case that CommandProcessingThread will exit due to OOM error. OOM error was caused by our one abnormal application that running on this DN node. {noformat} 2020-10-18 10:27:12,604 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor encountered fatal exception and exit. 
java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:717) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005) at
[jira] [Updated] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit
[ https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15651: - Description: In our cluster, we applied the HDFS-14997 improvement. We found one case where CommandProcessingThread exits due to an OOM error. The OOM error was caused by an abnormal application running on this DN node. {noformat} 2020-10-18 10:27:12,604 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor encountered fatal exception and exit. java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:717) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
{noformat} The main point here is that a crashed CommandProcessingThread has a very bad impact: none of the NN response commands will be processed on the DN side. We enabled block tokens to access the data, but the DN command DNA_ACCESSKEYUPDATE is not processed in time by the DN. We then see lots of SASL errors due to key expiration in the DN log: {noformat} javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the required block key (keyID=xxx) doesn't exist.] {noformat} On the client side, our users receive lots of 'could not obtain block' errors with BlockMissingException. CommandProcessingThread is a critical thread; it should always be running. {code:java} /** * CommandProcessingThread that processes commands asynchronously. */ class CommandProcessingThread extends Thread { private final BPServiceActor actor; private final BlockingQueue<Runnable> queue; ... @Override public void run() { try { processQueue(); } catch (Throwable t) { LOG.error("{} encountered fatal exception and exit.", getName(), t); <=== should not exit this thread } } {code} Once an unexpected error happens, better handling would be to either: * catch the exception, deal with the error appropriately, and let processQueue continue to run, or * exit the DN process so that the admin can investigate was: In our cluster, we applied the HDFS-14997 improvement. We find one case that CommandProcessingThread will exit due to OOM error. OOM error was caused by our one abnormal application that running on this DN node. {noformat} 2020-10-18 10:27:12,604 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor encountered fatal exception and exit.
java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:717) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208) {noformat}
[jira] [Created] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit
Yiqun Lin created HDFS-15651: Summary: Client could not obtain block when DN CommandProcessingThread exit Key: HDFS-15651 URL: https://issues.apache.org/jira/browse/HDFS-15651 Project: Hadoop HDFS Issue Type: Bug Reporter: Yiqun Lin In our cluster, we applied the HDFS-14997 improvement. We found one case where CommandProcessingThread exits due to an OOM error. The OOM error was caused by an abnormal application running on this DN node. {noformat} 2020-10-18 10:27:12,604 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor encountered fatal exception and exit. java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:717) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221) at
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208) {noformat} The main point here is that a crashed CommandProcessingThread has a very bad impact: none of the NN response commands will be processed on the DN side. Since we enabled block tokens to access the data, the DN command DNA_ACCESSKEYUPDATE is not processed in time. We then see lots of SASL errors due to key expiration: {noformat} javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the required block key (keyID=xxx) doesn't exist.] {noformat} On the client side, our users receive lots of 'could not obtain block' errors with BlockMissingException. CommandProcessingThread is a critical thread; it should always be running. Once an unexpected error happens, better handling would be to either: * catch the exception, or * exit the DN process so that the admin can investigate -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
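The first option above — keep the thread alive and handle each command's failure locally — can be sketched in plain Java. This is only an illustration of the pattern, not BPServiceActor's actual code: the class and method names here are hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical command loop that survives a failing command. */
public class ResilientCommandLoop {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final AtomicInteger processed = new AtomicInteger();
  private volatile boolean running = true;

  public void enqueue(Runnable command) { queue.add(command); }
  public void stop() { running = false; }
  public int processedCount() { return processed.get(); }

  /**
   * Drains the queue; a failing command is logged and skipped, so the
   * loop itself never dies the way the original run() does.
   */
  public void processQueue() {
    while (running || !queue.isEmpty()) {
      try {
        Runnable cmd = queue.poll(10, TimeUnit.MILLISECONDS);
        if (cmd == null) continue;
        try {
          cmd.run();
          processed.incrementAndGet();
        } catch (RuntimeException e) {
          // Deal with the error and keep consuming instead of exiting.
          System.err.println("Command failed, continuing: " + e);
        }
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  public static void main(String[] args) {
    ResilientCommandLoop loop = new ResilientCommandLoop();
    loop.enqueue(() -> {});
    loop.enqueue(() -> { throw new RuntimeException("boom"); });
    loop.enqueue(() -> {});
    loop.stop();
    loop.processQueue();
    System.out.println("processed=" + loop.processedCount()); // prints processed=2
  }
}
```

Note the sketch only catches RuntimeException: for Errors such as OutOfMemoryError it is debatable whether continuing is safe at all, which is why the JIRA also lists exiting the DN process as the alternative.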
[jira] [Commented] (HDFS-15640) RBF: Add fast distcp threshold to FedBalance.
[ https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220448#comment-17220448 ] Yiqun Lin commented on HDFS-15640: -- Thanks for updating the patch, [~LiJinglun]! Looks great now. I caught one outdated comment: {code:java} + * @return true if moving to the next stage. false if the conditions are not + * satisfied. + * @throws RetryException if the conditions are not satisfied and there is no + * diff needed to be copied.x + */ + @VisibleForTesting + boolean diffDistCpStageDone() throws IOException, RetryException { {code} Please update {noformat} ...and there is no diff needed to be copied.. {noformat} to {noformat} ...and the diff size is under the given threshold scope.. {noformat} +1 once this is addressed. > RBF: Add fast distcp threshold to FedBalance. > - > > Key: HDFS-15640 > URL: https://issues.apache.org/jira/browse/HDFS-15640 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch, > HDFS-15640.003.patch > > > Currently in the DistCpProcedure it must submit distcp round by round until > there is no diff to go to the final distcp stage. The condition is very > strict. If the distcp could finish in an acceptable period then we don't need > to wait for no diff. For example if 3 consecutive distcp jobs all finish > within 10 minutes then we can predict the final distcp could also finish > within 10 minutes. So we can start the final distcp directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15640) RBF: Add fast distcp threshold to FedBalance.
[ https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220003#comment-17220003 ] Yiqun Lin commented on HDFS-15640: -- [~LiJinglun], the latest patch almost looks good to me. Minor comments from me: *DistCpProcedure.java* For the logic below: {code:java} + boolean diffDistCpStageDone() throws IOException, RetryException { +   int diffSize = getDiffSize(); +   if (diffSize <= diffThreshold && (forceCloseOpenFiles +       || !verifyOpenFiles())) { +     return true; +   } +   if (diffSize == 0) { +     throw new RetryException(); +   } else { +     return false; +   } + } {code} When diffSize is not 0 but is no greater than diffThreshold, and (forceCloseOpenFiles || !verifyOpenFiles()) returns false, we should also throw RetryException. So the above logic would become the following, which is consistent with the original logic: {code:java} boolean diffDistCpStageDone() throws IOException, RetryException { int diffSize = getDiffSize(); if (diffSize <= diffThreshold) { if (forceCloseOpenFiles || !verifyOpenFiles()) { return true; } else { throw new RetryException(); } } return false; } {code} *FedBalanceOptions.java* Please update the description of the DIFF_THRESHOLD option; I made a minor rewrite to make it easier to understand: {code:java} final static Option DIFF_THRESHOLD = new Option("diffThreshold", true, "This specifies the threshold of the diff entries that is used in the incremental copy stage. If the diff entries" + " size is no greater than this threshold and the open files check is satisfied (no open files or force" + " close all open files), the fedBalance will go to the final round" + " of distcp. Default value is 0, which means waiting until there is no diff."); {code} *HDFSFederationBalance.md* Can we update 'Specify the threshold of the diff entries.' to 'Specify the threshold of the diff entries used in the incremental copy stage.'?
*TestDistCpProcedure.java* # Please add a cleanup operation in testDiffThreshold like the other test methods in this class do. # We can add a new method buildContext(Path src, Path dst, String mount, int diffThreshold) without changing the existing method. Changing the existing one would require some unnecessary updates. > RBF: Add fast distcp threshold to FedBalance. > - > > Key: HDFS-15640 > URL: https://issues.apache.org/jira/browse/HDFS-15640 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch > > > Currently in the DistCpProcedure it must submit distcp round by round until > there is no diff to go to the final distcp stage. The condition is very > strict. If the distcp could finish in an acceptable period then we don't need > to wait for no diff. For example if 3 consecutive distcp jobs all finish > within 10 minutes then we can predict the final distcp could also finish > within 10 minutes. So we can start the final distcp directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
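The corrected stage-transition decision discussed above can be captured as a small pure function, which makes the three possible outcomes explicit and easy to unit-test. The names below are illustrative stand-ins, not the actual DistCpProcedure API.

```java
/** Illustrative, self-contained version of the stage-transition decision. */
public class DiffStageCheck {
  public enum Outcome { FINAL_DISTCP, RETRY, KEEP_COPYING }

  public static Outcome check(int diffSize, int diffThreshold,
      boolean forceCloseOpenFiles, boolean hasOpenFiles) {
    if (diffSize <= diffThreshold) {
      // Small enough diff: move to the final distcp only if the open-file
      // condition holds; otherwise wait and retry later.
      return (forceCloseOpenFiles || !hasOpenFiles)
          ? Outcome.FINAL_DISTCP : Outcome.RETRY;
    }
    // Still too much diff: run another incremental copy round.
    return Outcome.KEEP_COPYING;
  }

  public static void main(String[] args) {
    System.out.println(check(0, 0, false, false)); // FINAL_DISTCP
    System.out.println(check(0, 0, false, true));  // RETRY
    System.out.println(check(5, 0, false, false)); // KEEP_COPYING
  }
}
```

The point of the review comment is visible in the second case: a small diff with open files still present must map to RETRY, not to "not done".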
[jira] [Commented] (HDFS-15640) RBF: Add fast distcp threshold to FedBalance.
[ https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218221#comment-17218221 ] Yiqun Lin commented on HDFS-15640: -- {quote} Only a little problem is it might not be easy to know how much time will the diffs cost. {quote} Actually, the current logic already gets the latest snapshot diff, and we can just reuse that result. So it won't add additional cost compared with the current logic. {code} /** * Verify whether the src has changed since CURRENT_SNAPSHOT_NAME snapshot. * * @return true if the src has changed. */ private boolean verifyDiff() throws IOException { SnapshotDiffReport diffReport = srcFs.getSnapshotDiffReport(src, CURRENT_SNAPSHOT_NAME, ""); return diffReport.getDiffList().size() > 0; } {code} Just depending on the last 3 consecutive distcp execution times is not 100% accurate. As an extreme example, the final distcp should run very fast but actually finishes slowly due to something unexpected, like an abnormal node. So I still prefer to use the diff number. > RBF: Add fast distcp threshold to FedBalance. > - > > Key: HDFS-15640 > URL: https://issues.apache.org/jira/browse/HDFS-15640 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15640.001.patch > > > Currently in the DistCpProcedure it must submit distcp round by round until > there is no diff to go to the final distcp stage. The condition is very > strict. If the distcp could finish in an acceptable period then we don't need > to wait for no diff. For example if 3 consecutive distcp jobs all finish > within 10 minutes then we can predict the final distcp could also finish > within 10 minutes. So we can start the final distcp directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
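The reuse argument above — one snapshot-diff fetch can answer both "is there any diff?" and "is the diff under the threshold?" — can be illustrated in plain Java. Here a List<String> stands in for the SnapshotDiffReport entry list, so this is a sketch of the idea rather than HDFS code.

```java
import java.util.List;

/** Sketch: derive both answers from a single snapshot-diff fetch. */
public class DiffReuse {
  /** Original check: has the source changed at all? */
  public static boolean hasDiff(List<String> diffEntries) {
    return !diffEntries.isEmpty();
  }

  /** New threshold check, answered from the same in-memory report. */
  public static boolean underThreshold(List<String> diffEntries, int threshold) {
    return diffEntries.size() <= threshold;
  }

  public static void main(String[] args) {
    // One fetch of the diff entries answers both questions; no second RPC.
    List<String> diff = List.of("M ./dir1/file1", "+ ./dir2/file2");
    System.out.println(hasDiff(diff));           // true
    System.out.println(underThreshold(diff, 5)); // true
  }
}
```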
[jira] [Commented] (HDFS-15640) RBF: Add fast distcp threshold to FedBalance.
[ https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217678#comment-17217678 ] Yiqun Lin commented on HDFS-15640: -- [~LiJinglun], using distcp execution time as the fedbalance threshold is not appropriate. The execution time can be impacted by other factors, like not enough resources to schedule tasks or slow RPC calls. I prefer to use the number of snapshot diff entries as the threshold here. We could use the getSnapshotDiffReport API to get this info. If the number of snapshot diff entries drops to a very low value, that means only a few files/dirs need to be synced, and we can then prepare to do the final distcp copy. > RBF: Add fast distcp threshold to FedBalance. > - > > Key: HDFS-15640 > URL: https://issues.apache.org/jira/browse/HDFS-15640 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15640.001.patch > > > Currently in the DistCpProcedure it must submit distcp round by round until > there is no diff to go to the final distcp stage. The condition is very > strict. If the distcp could finish in an acceptable period then we don't need > to wait for no diff. For example if 3 consecutive distcp jobs all finish > within 10 minutes then we can predict the final distcp could also finish > within 10 minutes. So we can start the final distcp directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15486) Costly sendResponse operation slows down async editlog handling
[ https://issues.apache.org/jira/browse/HDFS-15486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164173#comment-17164173 ] Yiqun Lin commented on HDFS-15486: -- Hi [~yuanbo], thanks for the comment. We haven't changed the CentOS version in our cluster, so this seems not really related. [~John Smith], the place you pointed out is exactly what we want to improve. > Costly sendResponse operation slows down async editlog handling > --- > > Key: HDFS-15486 > URL: https://issues.apache.org/jira/browse/HDFS-15486 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Yiqun Lin >Priority: Major > Attachments: Async-profile-(2).jpg, async-profile-(1).jpg > > > When our cluster NameNode in a very high load, we find it often stuck in > Async-editlog handling. > We use async-profile tool to get the flamegraph. > !Async-profile-(2).jpg! > This happened in that async editlog thread consumes Edit from the queue and > triggers the sendResponse call. > But here the sendResponse call is a little expensive since our cluster > enabled the security env and will do some encode operations when doing the > return response operation. > We often catch some moments of costly sendResponse operation when rpc call > queue is fulled. > !async-profile-(1).jpg! > Slowness on consuming Edit in async editlog will make Edit pending Queue > easily become the fulled state, then block its enqueue operation that is > invoked in writeLock type methods in FSNamesystem class. > Here the enhancement is that we can use multiple thread to parallel execute > sendResponse call. sendResponse doesn't need use the write lock to do > protection, so this change is safe. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15486) Costly sendResponse operation slows down async editlog handling
[ https://issues.apache.org/jira/browse/HDFS-15486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15486: - Description: When our cluster NameNode is under very high load, we find it often stuck in async-editlog handling. We used the async-profile tool to get the flamegraph. !Async-profile-(2).jpg! This happens when the async editlog thread consumes an Edit from the queue and triggers the sendResponse call. But here the sendResponse call is a little expensive, since our cluster has the security environment enabled and does some encode operations when returning the response. We often catch moments of costly sendResponse operations when the RPC call queue is full. !async-profile-(1).jpg! Slowness in consuming Edits in the async editlog easily makes the Edit pending queue become full, which then blocks its enqueue operation, invoked in write-lock methods of the FSNamesystem class. The enhancement here is that we can use multiple threads to execute the sendResponse call in parallel. sendResponse doesn't need the write lock for protection, so this change is safe. was: When our cluster NameNode in a very high load, we find it often stuck in Async-editlog handling. We use async-profile tool to get the flamegraph. !Async-profile-(2).jpg! This happened in that async editlog thread consumes Edit from the queue and triggers the sendResponse call. But here the sendResponse call is a little expensive since our cluster enabled the security env and will do some encode operations when doing the return response operation. We often catch some moments of costly sendResponse operation when rpc call queue is fulled. !async-profile-(1).jpg! Slowness on consuming Edit in async editlog will make Edit pending Queue in the fulled state, then block its enqueue operation that is invoked in writeLock type methods in FSNamesystem class. Here the enhancement is that we can use multiple thread to parallel execute sendResponse call.
sendResponse doesn't need use the write lock to do protection, so this change is safe. > Costly sendResponse operation slows down async editlog handling > --- > > Key: HDFS-15486 > URL: https://issues.apache.org/jira/browse/HDFS-15486 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Yiqun Lin >Priority: Major > Attachments: Async-profile-(2).jpg, async-profile-(1).jpg > > > When our cluster NameNode in a very high load, we find it often stuck in > Async-editlog handling. > We use async-profile tool to get the flamegraph. > !Async-profile-(2).jpg! > This happened in that async editlog thread consumes Edit from the queue and > triggers the sendResponse call. > But here the sendResponse call is a little expensive since our cluster > enabled the security env and will do some encode operations when doing the > return response operation. > We often catch some moments of costly sendResponse operation when rpc call > queue is fulled. > !async-profile-(1).jpg! > Slowness on consuming Edit in async editlog will make Edit pending Queue > easily become the fulled state, then block its enqueue operation that is > invoked in writeLock type methods in FSNamesystem class. > Here the enhancement is that we can use multiple thread to parallel execute > sendResponse call. sendResponse doesn't need use the write lock to do > protection, so this change is safe. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15486) Costly sendResponse operation slows down async editlog handling
Yiqun Lin created HDFS-15486: Summary: Costly sendResponse operation slows down async editlog handling Key: HDFS-15486 URL: https://issues.apache.org/jira/browse/HDFS-15486 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.7.0 Reporter: Yiqun Lin Attachments: Async-profile-(2).jpg, async-profile-(1).jpg When our cluster NameNode is under very high load, we find it often stuck in async-editlog handling. We used the async-profile tool to get the flamegraph. !Async-profile-(2).jpg! This happens when the async editlog thread consumes an Edit from the queue and triggers the sendResponse call. But here the sendResponse call is a little expensive, since our cluster has the security environment enabled and does some encode operations when returning the response. We often catch moments of costly sendResponse operations when the RPC call queue is full. !async-profile-(1).jpg! Slowness in consuming Edits in the async editlog makes the Edit pending queue become full, which then blocks its enqueue operation, invoked in write-lock methods of the FSNamesystem class. The enhancement here is that we can use multiple threads to execute the sendResponse call in parallel. sendResponse doesn't need the write lock for protection, so this change is safe. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
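The proposed enhancement — moving the expensive sendResponse encode off the single editlog-sync thread onto a small thread pool — can be sketched in plain Java. The class below is illustrative only (Hadoop's actual RPC/response code is different); it shows the shape of the change: the sync thread merely enqueues the response work and returns immediately.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative async responder: the editlog thread only enqueues work. */
public class ParallelResponder {
  private final ExecutorService responders;
  private final AtomicInteger sent = new AtomicInteger();

  public ParallelResponder(int threads) {
    this.responders = Executors.newFixedThreadPool(threads);
  }

  /**
   * Called from the single editlog thread; returns immediately, so the
   * costly encode (which needs no write lock) no longer stalls Edit
   * consumption.
   */
  public void sendResponseAsync(Runnable encodeAndSend) {
    responders.execute(() -> {
      encodeAndSend.run();
      sent.incrementAndGet();
    });
  }

  /** Drain the pool and return how many responses were sent. */
  public int shutdownAndCount() {
    responders.shutdown();
    try {
      responders.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
    return sent.get();
  }

  public static void main(String[] args) {
    ParallelResponder r = new ParallelResponder(4);
    for (int i = 0; i < 100; i++) {
      r.sendResponseAsync(() -> {});
    }
    System.out.println("sent=" + r.shutdownAndCount()); // prints sent=100
  }
}
```

The safety argument mirrors the JIRA's: because sendResponse does not touch state protected by the namesystem write lock, fanning it out to worker threads changes only throughput, not correctness.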
[jira] [Comment Edited] (HDFS-15448) When starting a DataNode, call BlockPoolManager#startAll() twice.
[ https://issues.apache.org/jira/browse/HDFS-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149396#comment-17149396 ] Yiqun Lin edited comment on HDFS-15448 at 7/1/20, 12:28 PM: Not sure it's the right behavior to remove startAll() in DataNode#runDatanodeDaemon. The method BlockPoolManager#startAll is invoked in different places; see the attached screenshot. The BlockPoolManager#startAll invocation in runDatanodeDaemon seems to be used for tests. was (Author: linyiqun): Not sure if it's a right behavior to remove startAll() in DataNode#runDatanodeDaemon. The method BlockPoolManager#startAll is invoked in different places, see attached screenshot. > When starting a DataNode, call BlockPoolManager#startAll() twice. > - > > Key: HDFS-15448 > URL: https://issues.apache.org/jira/browse/HDFS-15448 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 3.1.1 >Reporter: jianghua zhu >Assignee: jianghua zhu >Priority: Major > Attachments: HDFS-15448.001.patch, HDFS-15448.002.patch, > method_invoke_path.jpg > > > When starting a DataNode, call BlockPoolManager#startAll() twice. > The first call: > BlockPoolManager#doRefreshNamenodes() > private void doRefreshNamenodes( > Map> addrMap, > Map> lifelineAddrMap) > throws IOException { > ... > startAll(); > ... > } > The second call: > DataNode#runDatanodeDaemon() > public void runDatanodeDaemon() throws IOException { > blockPoolManager.startAll(); > ... > } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15448) When starting a DataNode, call BlockPoolManager#startAll() twice.
[ https://issues.apache.org/jira/browse/HDFS-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149396#comment-17149396 ] Yiqun Lin commented on HDFS-15448: -- Not sure if it's a right behavior to remove startAll() in DataNode#runDatanodeDaemon. The method BlockPoolManager#startAll is invoked in different places, see attached screenshot. > When starting a DataNode, call BlockPoolManager#startAll() twice. > - > > Key: HDFS-15448 > URL: https://issues.apache.org/jira/browse/HDFS-15448 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 3.1.1 >Reporter: jianghua zhu >Assignee: jianghua zhu >Priority: Major > Attachments: HDFS-15448.001.patch, HDFS-15448.002.patch, > method_invoke_path.jpg > > > When starting a DataNode, call BlockPoolManager#startAll() twice. > The first call: > BlockPoolManager#doRefreshNamenodes() > private void doRefreshNamenodes( > Map> addrMap, > Map> lifelineAddrMap) > throws IOException { > ... > startAll(); > ... > } > The second call: > DataNode#runDatanodeDaemon() > public void runDatanodeDaemon() throws IOException { > blockPoolManager.startAll(); > ... > } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
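One generic way to make such a double invocation harmless, regardless of which call site runs first, is an idempotent start guard. This is a hypothetical sketch of the pattern, not how BlockPoolManager actually synchronizes.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Generic idempotent-start guard: the second startAll() is a no-op. */
public class IdempotentStarter {
  private final AtomicBoolean started = new AtomicBoolean(false);
  private int startCount = 0;

  /** Returns true only for the call that actually performed the start. */
  public boolean startAll() {
    if (!started.compareAndSet(false, true)) {
      return false; // already started, e.g. by an earlier refresh path
    }
    startCount++; // stands in for starting the actor threads
    return true;
  }

  public int startCount() { return startCount; }

  public static void main(String[] args) {
    IdempotentStarter s = new IdempotentStarter();
    System.out.println(s.startAll());   // true  (first caller wins)
    System.out.println(s.startAll());   // false (second call is a no-op)
    System.out.println(s.startCount()); // 1
  }
}
```

With this shape, keeping the call in runDatanodeDaemon for tests would be safe even when doRefreshNamenodes has already started everything.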
[jira] [Updated] (HDFS-15448) When starting a DataNode, call BlockPoolManager#startAll() twice.
[ https://issues.apache.org/jira/browse/HDFS-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15448: - Attachment: method_invoke_path.jpg > When starting a DataNode, call BlockPoolManager#startAll() twice. > - > > Key: HDFS-15448 > URL: https://issues.apache.org/jira/browse/HDFS-15448 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 3.1.1 >Reporter: jianghua zhu >Assignee: jianghua zhu >Priority: Major > Attachments: HDFS-15448.001.patch, HDFS-15448.002.patch, > method_invoke_path.jpg > > > When starting a DataNode, call BlockPoolManager#startAll() twice. > The first call: > BlockPoolManager#doRefreshNamenodes() > private void doRefreshNamenodes( > Map> addrMap, > Map> lifelineAddrMap) > throws IOException { > ... > startAll(); > ... > } > The second call: > DataNode#runDatanodeDaemon() > public void runDatanodeDaemon() throws IOException { > blockPoolManager.startAll(); > ... > } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
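If the second invocation is intentional for tests but redundant during normal startup, one way to make the duplicate call harmless is an idempotent guard. This is a hypothetical sketch only, not the actual BlockPoolManager implementation (the real class manages BPOfferService instances and synchronizes on an internal refresh lock):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: an idempotent startAll() so that the second call from
// DataNode#runDatanodeDaemon becomes a no-op instead of starting things twice.
class BlockPoolManagerSketch {
    private final AtomicBoolean started = new AtomicBoolean(false);
    final AtomicInteger startWorkCount = new AtomicInteger();

    void startAll() {
        // Only the first caller performs the actual startup work.
        if (!started.compareAndSet(false, true)) {
            return; // already started, e.g. by doRefreshNamenodes()
        }
        // Stands in for starting all BPOfferService threads.
        startWorkCount.incrementAndGet();
    }
}
```

With this guard, calling startAll() from both doRefreshNamenodes() and runDatanodeDaemon() performs the startup work exactly once.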
[jira] [Comment Edited] (HDFS-15294) Federation balance tool
[ https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149160#comment-17149160 ] Yiqun Lin edited comment on HDFS-15294 at 7/1/20, 6:44 AM: --- I updated the description of this JIRA. [~LiJinglun], can you update the descriptions of the two subtasks HDFS-15340 and HDFS-15346? That will make them easier to understand. All the subtasks of this feature have been done by [~LiJinglun]. If you are interested in the details of this tool, please see the documentation JIRA HDFS-15374. Thanks [~LiJinglun] for the hard work and the great contribution! And also thanks [~elgoiri], [~ayushtkn] and others for the discussion and reviews! Any further improvements or bug fixes for this feature are very welcome, :). was (Author: linyiqun): I updated the description of this JIRA. [~LiJinglun], can you update the descriptions of the two subtasks HDFS-15340 and HDFS-15346? That will make them easier to understand. All the subtasks of this feature have been done by [~LiJinglun]. If you are interested in the details of this tool, please see the documentation JIRA HDFS-15374. Thanks [~LiJinglun] for the hard work and the great contribution! And also thanks [~elgoiri], [~ayushtkn] and others for the discussion and reviews! > Federation balance tool > --- > > Key: HDFS-15294 > URL: https://issues.apache.org/jira/browse/HDFS-15294 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Fix For: 3.4.0 > > Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, > HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, > HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, > HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf > > > This jira introduces a new HDFS federation balance tool to balance data > across different federation namespaces. It uses Distcp to copy data from the > source path to the target path. > The process is: > 1. 
Use distcp and snapshot diff to sync data between src and dst until they > are the same. > 2. Update mount table in Router if we specified RBF mode. > 3. Deal with src data, move to trash, delete or skip them. > The design of fedbalance tool comes from the discussion in HDFS-15087.
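The three-step balance flow described above can be sketched as code. Everything here is illustrative: the class, method, and field names are invented for the sketch and are not the real FedBalance API, which drives actual DistCp jobs and Router RPCs.

```java
// Hypothetical outline of the fedbalance flow: (1) sync src and dst with
// repeated distcp/snapshot-diff rounds until they match, (2) update the
// Router mount table if RBF mode was specified, (3) deal with the src data.
enum SrcPolicy { TRASH, DELETE, SKIP }

class FedBalanceFlowSketch {
    private int pendingDiffs = 3; // stub: pretend 3 incremental syncs are needed
    int rounds = 0;
    boolean mountTableUpdated = false;

    void run(boolean rbfMode, SrcPolicy policy) {
        // 1. Use distcp and snapshot diff to sync data until src == dst.
        while (pendingDiffs > 0) {
            pendingDiffs--; // one incremental distcp round from a snapshot diff
            rounds++;
        }
        // 2. Update the mount table in the Router if RBF mode was specified.
        if (rbfMode) {
            mountTableUpdated = true; // stands in for pointing the mount entry at dst
        }
        // 3. Deal with the src data: move to trash, delete, or skip.
        switch (policy) {
            case TRASH:  /* move src to trash */ break;
            case DELETE: /* delete src */ break;
            case SKIP:   /* leave src in place */ break;
        }
    }
}
```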
[jira] [Updated] (HDFS-15294) Federation balance tool
[ https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15294: - Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed Status: Resolved (was: Patch Available) I update the description of this JIRA. [~LiJinglun] , can you update the description of two subtask HDFS-15340 and HDFS-15346. That will be better understanding. All the subtasks of this feature have been done by [~LiJinglun]. If you are interested in detailed of this tool, please see the documentation JIRA HDFS-15374. Thanks [~LiJinglun] for hard working and making the great contribution! And also thanks [~elgoiri], [~ayushtkn] and others for the discussion and reviews! > Federation balance tool > --- > > Key: HDFS-15294 > URL: https://issues.apache.org/jira/browse/HDFS-15294 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Fix For: 3.4.0 > > Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, > HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, > HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, > HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf > > > This jira introduces a new HDFS federation balance tool to balance data > across different federation namespaces. It uses Distcp to copy data from the > source path to the target path. > The process is: > 1. Use distcp and snapshot diff to sync data between src and dst until they > are the same. > 2. Update mount table in Router if we specified RBF mode. > 3. Deal with src data, move to trash, delete or skip them. > The design of fedbalance tool comes from the discussion in HDFS-15087. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15294) Federation balance tool
[ https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15294: - Description: This jira introduces a new HDFS federation balance tool to balance data across different federation namespaces. It uses Distcp to copy data from the source path to the target path. The process is: 1. Use distcp and snapshot diff to sync data between src and dst until they are the same. 2. Update mount table in Router if we specified RBF mode. 3. Deal with src data, move to trash, delete or skip them. The design of fedbalance tool comes from the discussion in HDFS-15087. was: This jira introduces a new HDFS federation balance tool to balance data across different federation namespaces. It uses Distcp to copy data from the source path to the target path. The process is: 1. Use distcp and snapshot diff to sync data between src and dst until they are the same. 2. Update mount table in Router if we specified RBF mode. 3. Deal with src data, move to trash, delete or skip them. This The patch is too big to review, so I split it into 2 patches: Phase 1 / The State Machine(BalanceProcedureScheduler): Including the abstraction of job and scheduler model. {code:java} org.apache.hadoop.hdfs.procedure.BalanceProcedureScheduler; org.apache.hadoop.hdfs.procedure.BalanceProcedureConfigKeys; org.apache.hadoop.hdfs.procedure.BalanceProcedure; org.apache.hadoop.hdfs.procedure.BalanceJob; org.apache.hadoop.hdfs.procedure.BalanceJournal; org.apache.hadoop.hdfs.procedure.HDFSJournal; {code} Phase 2 / The DistCpFedBalance: It's an implementation of BalanceJob. 
{code:java} org.apache.hadoop.hdfs.server.federation.procedure.MountTableProcedure; org.apache.hadoop.tools.DistCpFedBalance; org.apache.hadoop.tools.DistCpProcedure; org.apache.hadoop.tools.FedBalance; org.apache.hadoop.tools.FedBalanceConfigs; org.apache.hadoop.tools.FedBalanceContext; org.apache.hadoop.tools.TrashProcedure; {code} > Federation balance tool > --- > > Key: HDFS-15294 > URL: https://issues.apache.org/jira/browse/HDFS-15294 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, > HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, > HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, > HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf > > > This jira introduces a new HDFS federation balance tool to balance data > across different federation namespaces. It uses Distcp to copy data from the > source path to the target path. > The process is: > 1. Use distcp and snapshot diff to sync data between src and dst until they > are the same. > 2. Update mount table in Router if we specified RBF mode. > 3. Deal with src data, move to trash, delete or skip them. > The design of fedbalance tool comes from the discussion in HDFS-15087. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15294) Federation balance tool
[ https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15294: - Description: This jira introduces a new HDFS federation balance tool to balance data across different federation namespaces. It uses Distcp to copy data from the source path to the target path. The process is: 1. Use distcp and snapshot diff to sync data between src and dst until they are the same. 2. Update mount table in Router if we specified RBF mode. 3. Deal with src data, move to trash, delete or skip them. This The patch is too big to review, so I split it into 2 patches: Phase 1 / The State Machine(BalanceProcedureScheduler): Including the abstraction of job and scheduler model. {code:java} org.apache.hadoop.hdfs.procedure.BalanceProcedureScheduler; org.apache.hadoop.hdfs.procedure.BalanceProcedureConfigKeys; org.apache.hadoop.hdfs.procedure.BalanceProcedure; org.apache.hadoop.hdfs.procedure.BalanceJob; org.apache.hadoop.hdfs.procedure.BalanceJournal; org.apache.hadoop.hdfs.procedure.HDFSJournal; {code} Phase 2 / The DistCpFedBalance: It's an implementation of BalanceJob. {code:java} org.apache.hadoop.hdfs.server.federation.procedure.MountTableProcedure; org.apache.hadoop.tools.DistCpFedBalance; org.apache.hadoop.tools.DistCpProcedure; org.apache.hadoop.tools.FedBalance; org.apache.hadoop.tools.FedBalanceConfigs; org.apache.hadoop.tools.FedBalanceContext; org.apache.hadoop.tools.TrashProcedure; {code} was: This jira introduces a new balance command 'fedbalance' that is ran by the administrator. The process is: 1. Use distcp and snapshot diff to sync data between src and dst until they are the same. 2. Update mount table in Router. 3. Delete the src to trash. The patch is too big to review, so I split it into 2 patches: Phase 1 / The State Machine(BalanceProcedureScheduler): Including the abstraction of job and scheduler model. 
{code:java} org.apache.hadoop.hdfs.procedure.BalanceProcedureScheduler; org.apache.hadoop.hdfs.procedure.BalanceProcedureConfigKeys; org.apache.hadoop.hdfs.procedure.BalanceProcedure; org.apache.hadoop.hdfs.procedure.BalanceJob; org.apache.hadoop.hdfs.procedure.BalanceJournal; org.apache.hadoop.hdfs.procedure.HDFSJournal; {code} Phase 2 / The DistCpFedBalance: It's an implementation of BalanceJob. {code:java} org.apache.hadoop.hdfs.server.federation.procedure.MountTableProcedure; org.apache.hadoop.tools.DistCpFedBalance; org.apache.hadoop.tools.DistCpProcedure; org.apache.hadoop.tools.FedBalance; org.apache.hadoop.tools.FedBalanceConfigs; org.apache.hadoop.tools.FedBalanceContext; org.apache.hadoop.tools.TrashProcedure; {code} > Federation balance tool > --- > > Key: HDFS-15294 > URL: https://issues.apache.org/jira/browse/HDFS-15294 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, > HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, > HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, > HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf > > > This jira introduces a new HDFS federation balance tool to balance data > across different federation namespaces. It uses Distcp to copy data from the > source path to the target path. > The process is: > 1. Use distcp and snapshot diff to sync data between src and dst until they > are the same. > 2. Update mount table in Router if we specified RBF mode. > 3. Deal with src data, move to trash, delete or skip them. > This > The patch is too big to review, so I split it into 2 patches: > Phase 1 / The State Machine(BalanceProcedureScheduler): Including the > abstraction of job and scheduler model. 
> {code:java} > org.apache.hadoop.hdfs.procedure.BalanceProcedureScheduler; > org.apache.hadoop.hdfs.procedure.BalanceProcedureConfigKeys; > org.apache.hadoop.hdfs.procedure.BalanceProcedure; > org.apache.hadoop.hdfs.procedure.BalanceJob; > org.apache.hadoop.hdfs.procedure.BalanceJournal; > org.apache.hadoop.hdfs.procedure.HDFSJournal; > {code} > Phase 2 / The DistCpFedBalance: It's an implementation of BalanceJob. HDFS-15346> > {code:java} > org.apache.hadoop.hdfs.server.federation.procedure.MountTableProcedure; > org.apache.hadoop.tools.DistCpFedBalance; > org.apache.hadoop.tools.DistCpProcedure; > org.apache.hadoop.tools.FedBalance; > org.apache.hadoop.tools.FedBalanceConfigs; > org.apache.hadoop.tools.FedBalanceContext; > org.apache.hadoop.tools.TrashProcedure; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail:
[jira] [Updated] (HDFS-15374) Add documentation for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15374: - Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed Status: Resolved (was: Patch Available) Commit this to trunk. Thanks [~LiJinglun] for the contribution and thanks [~elgoiri] for the review. > Add documentation for fedbalance tool > - > > Key: HDFS-15374 > URL: https://issues.apache.org/jira/browse/HDFS-15374 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Fix For: 3.4.0 > > Attachments: BalanceProcedureScheduler.png, > FedBalance_Screenshot1.jpg, FedBalance_Screenshot2.jpg, > FedBalance_Screenshot3.jpg, HDFS-15374.001.patch, HDFS-15374.002.patch, > HDFS-15374.003.patch, HDFS-15374.004.patch, HDFS-15374.005.patch > > > Add documentation for fedbalance tool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15374) Add documentation for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15374: - Description: Add documentation for fedbalance tool. > Add documentation for fedbalance tool > - > > Key: HDFS-15374 > URL: https://issues.apache.org/jira/browse/HDFS-15374 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: BalanceProcedureScheduler.png, > FedBalance_Screenshot1.jpg, FedBalance_Screenshot2.jpg, > FedBalance_Screenshot3.jpg, HDFS-15374.001.patch, HDFS-15374.002.patch, > HDFS-15374.003.patch, HDFS-15374.004.patch, HDFS-15374.005.patch > > > Add documentation for fedbalance tool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15410) Add separated config file hdfs-fedbalance-default.xml for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15410: - Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed Status: Resolved (was: Patch Available) Commit this to trunk. Thanks [~elgoiri] for the review and thanks [~LiJinglun] for the contribution! > Add separated config file hdfs-fedbalance-default.xml for fedbalance tool > - > > Key: HDFS-15410 > URL: https://issues.apache.org/jira/browse/HDFS-15410 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15410.001.patch, HDFS-15410.002.patch, > HDFS-15410.003.patch, HDFS-15410.004.patch, HDFS-15410.005.patch > > > Add a separated config file named hdfs-fedbalance-default.xml for fedbalance > tool configs. It's like the ditcp-default.xml for distcp tool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15410) Add separated config file hdfs-fedbalance-default.xml for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15410: - Description: Add a separated config file named hdfs-fedbalance-default.xml for fedbalance tool configs. It's like the ditcp-default.xml for distcp tool. (was: Add a separated config file named fedbalance-default.xml for fedbalance tool configs. It's like the ditcp-default.xml for distcp tool.) > Add separated config file hdfs-fedbalance-default.xml for fedbalance tool > - > > Key: HDFS-15410 > URL: https://issues.apache.org/jira/browse/HDFS-15410 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15410.001.patch, HDFS-15410.002.patch, > HDFS-15410.003.patch, HDFS-15410.004.patch, HDFS-15410.005.patch > > > Add a separated config file named hdfs-fedbalance-default.xml for fedbalance > tool configs. It's like the ditcp-default.xml for distcp tool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15410) Add separated config file hdfs-fedbalance-default.xml for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15410: - Summary: Add separated config file hdfs-fedbalance-default.xml for fedbalance tool (was: Add separated config file fedbalance-default.xml for fedbalance tool) > Add separated config file hdfs-fedbalance-default.xml for fedbalance tool > - > > Key: HDFS-15410 > URL: https://issues.apache.org/jira/browse/HDFS-15410 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15410.001.patch, HDFS-15410.002.patch, > HDFS-15410.003.patch, HDFS-15410.004.patch, HDFS-15410.005.patch > > > Add a separated config file named fedbalance-default.xml for fedbalance tool > configs. It's like the ditcp-default.xml for distcp tool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15410) Add separated config file fedbalance-default.xml for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147742#comment-17147742 ] Yiqun Lin commented on HDFS-15410: -- [~inigoiri], would you mind having a quick review of this JIRA and HDFS-15374? [~LiJinglun], I will hold off committing for one day to let [~inigoiri] have a quick review once he gets the time. > Add separated config file fedbalance-default.xml for fedbalance tool > > > Key: HDFS-15410 > URL: https://issues.apache.org/jira/browse/HDFS-15410 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15410.001.patch, HDFS-15410.002.patch, > HDFS-15410.003.patch, HDFS-15410.004.patch > > > Add a separated config file named fedbalance-default.xml for fedbalance tool > configs. It's like the ditcp-default.xml for distcp tool.
[jira] [Commented] (HDFS-15374) Add documentation for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142178#comment-17142178 ] Yiqun Lin commented on HDFS-15374: -- I generated the markdown documentation page locally, and it renders well now. Thanks for addressing the comments, +1. [~elgoiri], any further comments on this? I will hold off the commit in case you have other comments. > Add documentation for fedbalance tool > - > > Key: HDFS-15374 > URL: https://issues.apache.org/jira/browse/HDFS-15374 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: BalanceProcedureScheduler.png, > FedBalance_Screenshot1.jpg, FedBalance_Screenshot2.jpg, > FedBalance_Screenshot3.jpg, HDFS-15374.001.patch, HDFS-15374.002.patch, > HDFS-15374.003.patch, HDFS-15374.004.patch > >
[jira] [Commented] (HDFS-15410) Add separated config file fedbalance-default.xml for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142164#comment-17142164 ] Yiqun Lin commented on HDFS-15410: -- LGTM , +1. [~elgoiri], Does the latest patch also look good to you? > Add separated config file fedbalance-default.xml for fedbalance tool > > > Key: HDFS-15410 > URL: https://issues.apache.org/jira/browse/HDFS-15410 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15410.001.patch, HDFS-15410.002.patch, > HDFS-15410.003.patch, HDFS-15410.004.patch > > > Add a separated config file named fedbalance-default.xml for fedbalance tool > configs. It's like the ditcp-default.xml for distcp tool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15410) Add separated config file fedbalance-default.xml for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141672#comment-17141672 ] Yiqun Lin commented on HDFS-15410: -- [~LiJinglun], one minor comment: can you add more description for the settings hdfs.fedbalance.procedure.scheduler.journal.uri and hdfs.fedbalance.procedure.work.thread.num? For example, we can add definitions along these lines: hdfs.fedbalance.procedure.scheduler.journal.uri: The uri of the journal; the journal file is used for handling job persistence and recovery. hdfs.fedbalance.procedure.work.thread.num: The number of worker threads of the BalanceProcedureScheduler. BalanceProcedureScheduler is responsible for scheduling a balance job, including submit, run, delay and recover. Also please update the above new descriptions in the FederationBalance.md configuration options section, which is tracked in HDFS-15374. Thanks. > Add separated config file fedbalance-default.xml for fedbalance tool > > > Key: HDFS-15410 > URL: https://issues.apache.org/jira/browse/HDFS-15410 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15410.001.patch, HDFS-15410.002.patch, > HDFS-15410.003.patch > > > Add a separated config file named fedbalance-default.xml for fedbalance tool > configs. It's like the ditcp-default.xml for distcp tool.
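The two settings discussed in this comment would look roughly as follows in hdfs-fedbalance-default.xml. This is a hedged sketch: the property names and description wording come from the comment above, while the values are illustrative assumptions, not the tool's actual defaults.

```xml
<configuration>
  <property>
    <name>hdfs.fedbalance.procedure.scheduler.journal.uri</name>
    <!-- illustrative value -->
    <value>hdfs://localhost:8020/tmp/procedure</value>
    <description>The uri of the journal. The journal file is used for
      handling job persistence and recovery.</description>
  </property>
  <property>
    <name>hdfs.fedbalance.procedure.work.thread.num</name>
    <!-- illustrative value -->
    <value>10</value>
    <description>The number of worker threads of the
      BalanceProcedureScheduler, which is responsible for scheduling a
      balance job, including submit, run, delay and recover.</description>
  </property>
</configuration>
```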
[jira] [Updated] (HDFS-15374) Add documentation for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15374: - Attachment: FedBalance_Screenshot3.jpg FedBalance_Screenshot2.jpg FedBalance_Screenshot1.jpg > Add documentation for fedbalance tool > - > > Key: HDFS-15374 > URL: https://issues.apache.org/jira/browse/HDFS-15374 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: BalanceProcedureScheduler.png, > FedBalance_Screenshot1.jpg, FedBalance_Screenshot2.jpg, > FedBalance_Screenshot3.jpg, HDFS-15374.001.patch, HDFS-15374.002.patch, > HDFS-15374.003.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15374) Add documentation for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141662#comment-17141662 ] Yiqun Lin edited comment on HDFS-15374 at 6/22/20, 3:27 AM: The patch looks almost great now; I found one problem when I used mvn site:site to generate the html page. We lack the css file here. Can you copy the css directory from the distcp module (../site/resources/css) to the same place in the fedbalance module? Attaching a screenshot of the html page generated locally. BTW, can you answer my question in the previous comment? {quote} I have a question here, can we support the full path like hdfs://my-ns01/src-folder instead of above specific nn port address now? In the local config, we often have the nn address configured in the hdfs-site.xml {quote} was (Author: linyiqun): The patch looks almost great now; I found one problem when I used mvn site:site to generate the html page. We lack the css file here. Can you copy the css directory from the distcp module (../site/resources/css) to the same place in the fedbalance module? Attaching a screenshot of the html page generated locally. > Add documentation for fedbalance tool > - > > Key: HDFS-15374 > URL: https://issues.apache.org/jira/browse/HDFS-15374 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: BalanceProcedureScheduler.png, HDFS-15374.001.patch, > HDFS-15374.002.patch, HDFS-15374.003.patch > >
[jira] [Commented] (HDFS-15374) Add documentation for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141662#comment-17141662 ] Yiqun Lin commented on HDFS-15374: -- The patch almost looks great now, I find one problem when I use mvn site:site to generate html page. We lack css file here. Can you copy css directory from distp module(../site/resources/css) to same place in fed balance module? Attach html page screenshot generated in my local. > Add documentation for fedbalance tool > - > > Key: HDFS-15374 > URL: https://issues.apache.org/jira/browse/HDFS-15374 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: BalanceProcedureScheduler.png, HDFS-15374.001.patch, > HDFS-15374.002.patch, HDFS-15374.003.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15374) Add documentation for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140990#comment-17140990 ] Yiqun Lin commented on HDFS-15374: -- [~LiJinglun], thanks for updating the patch! Minor comments from me: {code:java} Finally when the source and the target are the same, it + updates the mount table in Router and moves the source to trash. {code} It would be better to mention this for both the normal federation mode and the rbf mode. {code:java} In normal federation mode the source path must includes the source cluster. {code} This can be updated to {code:java} In normal federation mode the source path must include the path schema. {code} I have a question here: can we support a full path like hdfs://my-ns01/src-folder instead of the specific nn port address above? In the local config, we often have the nn address configured in hdfs-site.xml. The name {{DistCpFedBalance}} should be updated to FedBalance in the doc since it has been renamed now. I also found some redundant whitespaces; please remove them, as they lead to checkstyle warnings: {noformat} + This will scan the journal to find all the unfinished jobs, recover and + continue to execute them. + <--- whitespaces + If we want to balance in a normal federation cluster, use the command below. + +bash$ /bin/hadoop fedbalance submit hdfs://nn0:8020/foo/src hdfs://nn1:8020/foo/dst +<--- whitespaces + In normal federation mode the source path must includes the source cluster. + +### RBF Mode And Normal Federation Mode + + The federation balance tool has 2 modes: <---whitespaces +<---whitespaces + * the router-based federation mode (RBF mode). + * the normal federation mode. +<---whitespaces + By default the command runs in the normal federation mode. You can specify the + rbf mode by using the option `-router`. +<---whitespaces + In the rbf mode the first parameter is taken as the mount point. It disables + write by setting the mount point readonly. +<---whitespaces + In the normal federation mode the first parameter is taken as the full path of + the source. The first parameter must include the source cluster. It disables + write by cancelling all the permissions of the source path. +<---whitespaces + Details about disabling write see [DistCpFedBalance](#DistCpFedBalance). ... when there is no diff and no open files. <---whitespaces +* FINAL_DISTCP: Force close all the open files and submit the final distcp. +* FINISH: Do the cleanup works. In normal federation mode the finish stage + also restores the permission of the dst path. +your patch name {noformat} > Add documentation for fedbalance tool > - > > Key: HDFS-15374 > URL: https://issues.apache.org/jira/browse/HDFS-15374 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: BalanceProcedureScheduler.png, HDFS-15374.001.patch, > HDFS-15374.002.patch > >
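The stage names mentioned in the quoted documentation text above suggest a simple state machine for DistCpProcedure. A hedged sketch follows; only FINAL_DISTCP and FINISH (and the "no diff and no open files" condition) appear in the source, and the earlier stages are assumptions added purely for illustration.

```java
// Hedged sketch of a DistCpProcedure-style stage sequence. Only FINAL_DISTCP
// and FINISH are named in the quoted text; the other stages are assumed.
enum DistCpStage {
    INIT_DISTCP,   // assumed: initial full copy from a snapshot of src
    DIFF_DISTCP,   // assumed: repeated incremental copies from snapshot diffs
                   // until there is no diff and no open files
    FINAL_DISTCP,  // force close all the open files and submit the final distcp
    FINISH         // cleanup; in normal federation mode also restores
                   // the permission of the dst path
}
```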
[jira] [Commented] (HDFS-15410) Add separated config file fedbalance-default.xml for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140960#comment-17140960 ]

Yiqun Lin commented on HDFS-15410:
----------------------------------

Besides [~elgoiri]'s review comments, some more review comments from me:

I don't fully understand why we need to define the implementation class in config and use reflection to get the instance. Currently there is no other implementation class, so why not just create new FedBalance/BalanceJournalInfoHDFS instances directly in the code? From my understanding, these two config settings can be removed.
{code:java}
federation.balance.class
hadoop.hdfs.procedure.journal.class

// init journal.
Class clazz = (Class) conf
    .getClass(JOURNAL_CLASS, BalanceJournalInfoHDFS.class);
journal = ReflectionUtils.newInstance(clazz, conf);

Class balanceClazz = (Class) conf
    .getClass(FEDERATION_BALANCE_CLASS, FedBalance.class);
Tool balancer = ReflectionUtils.newInstance(balanceClazz, conf);
{code}
Can we rename the class {{DistCpBalanceOptions}} to {{FedBalanceOptions}}? That reads better, since it makes clear these options belong to the fedbalance tool.

Can we rename the config prefix from {{hadoop.hdfs.procedure.work.thread.num}} to {{hdfs.fedbalance.procedure.work.thread.num}}?

The following description needs to be updated, since the -router option no longer requires true or false as an input parameter.
{noformat}
final static Option ROUTER = new Option("router", false,
    "If `true` the command runs in router mode. The source path is "
        + "taken as a mount point. It will disable write by setting the mount"
        + " point readonly. Otherwise the command works in normal federation"
        + " mode. The source path is taken as the full path. It will disable"
        + " write by cancelling all permissions of the source path. The"
        + " default value is `true`.");
{noformat}

> Add separated config file fedbalance-default.xml for fedbalance tool
> --------------------------------------------------------------------
>
>          Key: HDFS-15410
>          URL: https://issues.apache.org/jira/browse/HDFS-15410
>      Project: Hadoop HDFS
>   Issue Type: Sub-task
>     Reporter: Jinglun
>     Assignee: Jinglun
>     Priority: Major
>  Attachments: HDFS-15410.001.patch, HDFS-15410.002.patch
>
> Add a separated config file named fedbalance-default.xml for fedbalance tool
> configs. It's like the distcp-default.xml for the distcp tool.
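The review question above contrasts reflection-based instantiation driven by config against direct construction. A stdlib-only sketch of the pattern that Hadoop's `conf.getClass(...)` plus `ReflectionUtils.newInstance(...)` implements — the class and config names below are illustrative stand-ins, not the real fedbalance classes:

```java
import java.util.HashMap;
import java.util.Map;

public class ReflectionConfigDemo {
    // A hypothetical journal interface with one default implementation.
    public interface BalanceJournal { String name(); }

    public static class HdfsJournal implements BalanceJournal {
        public String name() { return "hdfs-journal"; }
    }

    // Simplified stand-in for Hadoop's Configuration.getClass(key, default).
    public static final Map<String, String> CONF = new HashMap<>();

    public static BalanceJournal newJournal() throws Exception {
        String clazzName = CONF.getOrDefault(
            "hadoop.hdfs.procedure.journal.class", HdfsJournal.class.getName());
        Class<?> clazz = Class.forName(clazzName);
        // ReflectionUtils.newInstance boils down to reflective construction.
        return (BalanceJournal) clazz.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // With no override configured, the default implementation is created,
        // which is why the reviewer asks whether a plain `new` would suffice.
        System.out.println(newJournal().name()); // prints hdfs-journal
    }
}
```

The reflective indirection only pays off once a second implementation exists that operators actually need to select via config; until then, direct `new` is simpler and fails at compile time rather than at runtime.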
[jira] [Commented] (HDFS-15374) Add documentation for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139105#comment-17139105 ] Yiqun Lin commented on HDFS-15374: -- Hi [~LiJinglun], can you attach the latest patch here? I am more accustomed to review the patch file way, :D. Thank you. > Add documentation for fedbalance tool > - > > Key: HDFS-15374 > URL: https://issues.apache.org/jira/browse/HDFS-15374 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15374.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15346) FedBalance tool implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15346: - Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed Status: Resolved (was: Patch Available) Committed this to trunk. Thanks [~LiJinglun] for the great contribution! > FedBalance tool implementation > -- > > Key: HDFS-15346 > URL: https://issues.apache.org/jira/browse/HDFS-15346 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, > HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, > HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, > HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch, > HDFS-15346.012.patch > > > Patch in HDFS-15294 is too big to review so we split it into 2 patches. This > is the second one. Detail can be found at HDFS-15294. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15346) FedBalance tool implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15346: - Summary: FedBalance tool implementation (was: DistCpFedBalance implementation) > FedBalance tool implementation > -- > > Key: HDFS-15346 > URL: https://issues.apache.org/jira/browse/HDFS-15346 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, > HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, > HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, > HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch, > HDFS-15346.012.patch > > > Patch in HDFS-15294 is too big to review so we split it into 2 patches. This > is the second one. Detail can be found at HDFS-15294. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15346) DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136480#comment-17136480 ] Yiqun Lin commented on HDFS-15346: -- LGTM, +1. Will commit this the day after tomorrow once there is no other comment. > DistCpFedBalance implementation > --- > > Key: HDFS-15346 > URL: https://issues.apache.org/jira/browse/HDFS-15346 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, > HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, > HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, > HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch, > HDFS-15346.012.patch > > > Patch in HDFS-15294 is too big to review so we split it into 2 patches. This > is the second one. Detail can be found at HDFS-15294. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15294) Federation balance tool
[ https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136263#comment-17136263 ] Yiqun Lin commented on HDFS-15294: -- As this feature tool is designed as a common tool like distcp, I removed all RBF label in uncommitted subtask. > Federation balance tool > --- > > Key: HDFS-15294 > URL: https://issues.apache.org/jira/browse/HDFS-15294 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, > HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, > HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, > HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf > > > This jira introduces a new balance command 'fedbalance' that is ran by the > administrator. The process is: > 1. Use distcp and snapshot diff to sync data between src and dst until they > are the same. > 2. Update mount table in Router. > 3. Delete the src to trash. > > The patch is too big to review, so I split it into 2 patches: > Phase 1 / The State Machine(BalanceProcedureScheduler): Including the > abstraction of job and scheduler model. > {code:java} > org.apache.hadoop.hdfs.procedure.BalanceProcedureScheduler; > org.apache.hadoop.hdfs.procedure.BalanceProcedureConfigKeys; > org.apache.hadoop.hdfs.procedure.BalanceProcedure; > org.apache.hadoop.hdfs.procedure.BalanceJob; > org.apache.hadoop.hdfs.procedure.BalanceJournal; > org.apache.hadoop.hdfs.procedure.HDFSJournal; > {code} > Phase 2 / The DistCpFedBalance: It's an implementation of BalanceJob. 
HDFS-15346> > {code:java} > org.apache.hadoop.hdfs.server.federation.procedure.MountTableProcedure; > org.apache.hadoop.tools.DistCpFedBalance; > org.apache.hadoop.tools.DistCpProcedure; > org.apache.hadoop.tools.FedBalance; > org.apache.hadoop.tools.FedBalanceConfigs; > org.apache.hadoop.tools.FedBalanceContext; > org.apache.hadoop.tools.TrashProcedure; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
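The three-step process described in the issue above (sync src and dst with distcp plus snapshot diff, update the mount table in the Router, move src to trash) is driven as a resumable sequence of procedures journaled to HDFS. A stdlib-only sketch of that shape — stage and field names here are illustrative, not the actual HDFS-15346 classes:

```java
import java.util.ArrayList;
import java.util.List;

public class BalanceJobSketch {
    // Each stage mirrors one step of the fedbalance flow.
    public enum Stage { DISTCP_SYNC, UPDATE_MOUNT_TABLE, TRASH_SRC, DONE }

    private Stage stage = Stage.DISTCP_SYNC;
    // Stand-in for the HDFS-backed journal the scheduler writes after
    // each procedure, so a restarted job resumes from the last stage.
    public final List<String> journal = new ArrayList<>();

    public Stage run() {
        while (stage != Stage.DONE) {
            journal.add("finished " + stage);
            stage = Stage.values()[stage.ordinal() + 1];
        }
        return stage;
    }

    public static void main(String[] args) {
        BalanceJobSketch job = new BalanceJobSketch();
        System.out.println(job.run());   // DONE
        System.out.println(job.journal); // the recorded stage transitions
    }
}
```

Journaling after every stage is what makes the design in BalanceProcedureScheduler safe to interrupt: a crashed job is reloaded from the journal and continues from the first unfinished procedure instead of restarting the distcp from scratch.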
[jira] [Updated] (HDFS-15410) Add separated config file fedbalance-default.xml for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15410: - Summary: Add separated config file fedbalance-default.xml for fedbalance tool (was: RBF: Add separated config file fedbalance-default.xml for fedbalance tool) > Add separated config file fedbalance-default.xml for fedbalance tool > > > Key: HDFS-15410 > URL: https://issues.apache.org/jira/browse/HDFS-15410 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > > Add a separated config file named fedbalance-default.xml for fedbalance tool > configs. It's like the ditcp-default.xml for distcp tool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15374) Add documentation for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15374: - Summary: Add documentation for fedbalance tool (was: RBF: Add documentation for fedbalance tool) > Add documentation for fedbalance tool > - > > Key: HDFS-15374 > URL: https://issues.apache.org/jira/browse/HDFS-15374 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15374.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15346) DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15346: - Summary: DistCpFedBalance implementation (was: RBF: DistCpFedBalance implementation) > DistCpFedBalance implementation > --- > > Key: HDFS-15346 > URL: https://issues.apache.org/jira/browse/HDFS-15346 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, > HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, > HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, > HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch > > > Patch in HDFS-15294 is too big to review so we split it into 2 patches. This > is the second one. Detail can be found at HDFS-15294. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15294) Federation balance tool
[ https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15294: - Summary: Federation balance tool (was: RBF: Balance data across federation namespaces with DistCp and snapshot diff) > Federation balance tool > --- > > Key: HDFS-15294 > URL: https://issues.apache.org/jira/browse/HDFS-15294 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, > HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, > HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, > HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf > > > This jira introduces a new balance command 'fedbalance' that is ran by the > administrator. The process is: > 1. Use distcp and snapshot diff to sync data between src and dst until they > are the same. > 2. Update mount table in Router. > 3. Delete the src to trash. > > The patch is too big to review, so I split it into 2 patches: > Phase 1 / The State Machine(BalanceProcedureScheduler): Including the > abstraction of job and scheduler model. > {code:java} > org.apache.hadoop.hdfs.procedure.BalanceProcedureScheduler; > org.apache.hadoop.hdfs.procedure.BalanceProcedureConfigKeys; > org.apache.hadoop.hdfs.procedure.BalanceProcedure; > org.apache.hadoop.hdfs.procedure.BalanceJob; > org.apache.hadoop.hdfs.procedure.BalanceJournal; > org.apache.hadoop.hdfs.procedure.HDFSJournal; > {code} > Phase 2 / The DistCpFedBalance: It's an implementation of BalanceJob. 
HDFS-15346> > {code:java} > org.apache.hadoop.hdfs.server.federation.procedure.MountTableProcedure; > org.apache.hadoop.tools.DistCpFedBalance; > org.apache.hadoop.tools.DistCpProcedure; > org.apache.hadoop.tools.FedBalance; > org.apache.hadoop.tools.FedBalanceConfigs; > org.apache.hadoop.tools.FedBalanceContext; > org.apache.hadoop.tools.TrashProcedure; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15410) RBF: Add separated config file fedbalance-default.xml for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Lin updated HDFS-15410: - Summary: RBF: Add separated config file fedbalance-default.xml for fedbalance tool (was: Add separated config file fedbalance-default.xml for fedbalance tool.) > RBF: Add separated config file fedbalance-default.xml for fedbalance tool > - > > Key: HDFS-15410 > URL: https://issues.apache.org/jira/browse/HDFS-15410 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > > Add a separated config file named fedbalance-default.xml for fedbalance tool > configs. It's like the ditcp-default.xml for distcp tool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15346) RBF: DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135819#comment-17135819 ]

Yiqun Lin commented on HDFS-15346:
----------------------------------

[~LiJinglun], the refactor looks great. I find you decreased the timeout value; the new value seems too small and will lead to timeout errors. Can you adjust all the timeout values to 3 (@Test(timeout = 3)) in TestDistCpProcedure? This value works well in my local environment.

Finally, can we add 'fedbalance' to the current package names under the fedbalance module? Under the module paths src/test/java and src/main/java, update
{noformat}
org.apache.hadoop.tools
org.apache.hadoop.tools.procedure
{noformat}
to
{noformat}
org.apache.hadoop.tools.fedbalance
org.apache.hadoop.tools.fedbalance.procedure
{noformat}
Then please check and update the old class paths used in the module, e.g. in hadoop-federation-balance.sh, pom.xml, and other places.

The rest looks good to me now. Thanks [~LiJinglun] for working so patiently on this. Once the above are addressed, I will still hold off the commit for a few days in case there are other comments.

> RBF: DistCpFedBalance implementation
> ------------------------------------
>
>          Key: HDFS-15346
>          URL: https://issues.apache.org/jira/browse/HDFS-15346
>      Project: Hadoop HDFS
>   Issue Type: Sub-task
>     Reporter: Jinglun
>     Assignee: Jinglun
>     Priority: Major
>  Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch,
>               HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch,
>               HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch,
>               HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch
>
> Patch in HDFS-15294 is too big to review so we split it into 2 patches. This
> is the second one. Detail can be found at HDFS-15294.
[jira] [Comment Edited] (HDFS-15346) RBF: DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135024#comment-17135024 ] Yiqun Lin edited comment on HDFS-15346 at 6/14/20, 5:05 AM: [~LiJinglun], thanks for addressing remaining comments. These two days, I am trying to improve the efficiency of the unit test, current unit test is too slow. I find another way that we don't have to depend on mini yarn cluster in test running. The job can be submitted and executed in LocalJobRunner when there is no mini yarn cluster env. But we need to make an adjustment in getting job status from job client. I do some refactor in getCurrent method and apply them in DistCpProcedure. Following are part of some necessary change we need to update. {noformat} @VisibleForTesting private Job runningJob; static boolean ENABLED_FOR_TEST = false; ... private String submitDistCpJob(String srcParam, String dstParam, boolean useSnapshotDiff) throws IOException { ... try { LOG.info("Submit distcp job={}", job); runningJob = job; <--- need to reset there return job.getJobID().toString(); } catch (Exception e) { throw new IOException("Submit job failed.", e); } } private RunningJobStatus getCurrentJob() throws IOException { if (jobId != null) { if (ENABLED_FOR_TEST) { if (this.runningJob != null) { Job latestJob = null; try { latestJob = this.runningJob.getCluster() .getJob(JobID.forName(jobId)); } catch (InterruptedException e) { throw new IOException(e); } return latestJob == null ? null : new RunningJobStatus(latestJob, null); } } else { RunningJob latestJob = client.getJob(JobID.forName(jobId)); return latestJob == null ? null : new RunningJobStatus(null, latestJob); } } return null; } class RunningJobStatus { Job testJob; RunningJob job; public RunningJobStatus(Job testJob, RunningJob job) { this.testJob = testJob; this.job = job; } String getJobID() { return ENABLED_FOR_TEST ? 
testJob.getJobID().toString() : job.getID().toString(); } boolean isComplete() throws IOException { return ENABLED_FOR_TEST ? testJob.isComplete() : job.isComplete(); } boolean isSuccessful() throws IOException { return ENABLED_FOR_TEST ? testJob.isSuccessful() : job.isSuccessful(); } String getFailureInfo() throws IOException { try { return ENABLED_FOR_TEST ? testJob.getStatus().getFailureInfo() : job.getFailureInfo(); } catch (InterruptedException e) { throw new IOException(e); } } } {noformat} And mini yarn cluster related code lines can all be removed (include two pom dependencies mentioned above) {code:java} +mrCluster = new MiniMRYarnCluster(TestDistCpProcedure.class.getName(), 3); +conf.set(MRJobConfig.MR_AM_STAGING_DIR, "/apps_staging_dir"); +mrCluster.init(conf); +mrCluster.start(); +conf = mrCluster.getConfig(); {code} We need additionally set test enabled flag. {code:java} public static void beforeClass() throws IOException { DistCpProcedure.ENABLED_FOR_TEST = true; ... } {code} After this improvement, the whole test runs very faster than before, it totally costs less than 1 min. In additional, we need to have a cleanup at the end of each test method. like {code:java} fs.delete(new Path(testRoot), true); {code} or {code:java} dcProcedure.finish(); (soemtimes need to call this since some case has snapshot created and cannot be deleted) fs.delete(new Path(testRoot), true); {code} Also I catch some places still needed to update. # Can you update following description in router option? I update this content as well but seems this was not addressed in the latest patch. {noformat} It will disable read and write by cancelling all permissions of the source path. The default value is `false`." {noformat} # Method name cleanUpBeforeInitDistcp can be renamed to pathCheckBeforeInitDistcp since we don't do any cleanup operation now. was (Author: linyiqun): [~LiJinglun], thanks for addressing remaining comments. 
These two days, I am trying to improve the efficiency of the unit test, current unit test is too slow. I find another way that we don't have to depend on mini yarn cluster in test running. The job can be submitted and executed in LocalJobRunner when there is no mini yarn cluster env. But we need to make an adjustment in getting job status from job client. I do some refactor in getCurrent method and apply them in DistCpProcedure. Following are part of some necessary change we need to update. {noformat} @VisibleForTesting private Job runningJob; static boolean ENABLED_FOR_TEST = false; ... private String
[jira] [Comment Edited] (HDFS-15346) RBF: DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135024#comment-17135024 ] Yiqun Lin edited comment on HDFS-15346 at 6/14/20, 4:59 AM: [~LiJinglun], thanks for addressing remaining comments. These two days, I am trying to improve the efficiency of the unit test, current unit test is too slow. I find another way that we don't have to depend on mini yarn cluster in test running. The job can be submitted and executed in LocalJobRunner when there is no mini yarn cluster env. But we need to make an adjustment in getting job status from job client. I do some refactor in getCurrent method and apply them in DistCpProcedure. Following are part of some necessary change we need to update. {noformat} @VisibleForTesting private Job runningJob; static boolean ENABLED_FOR_TEST = false; ... private String submitDistCpJob(String srcParam, String dstParam, boolean useSnapshotDiff) throws IOException { ... try { LOG.info("Submit distcp job={}", job); runningJob = job; <--- need to reset there return job.getJobID().toString(); } catch (Exception e) { throw new IOException("Submit job failed.", e); } } private RunningJobStatus getCurrentJob() throws IOException { if (jobId != null) { if (ENABLED_FOR_TEST) { if (this.runningJob != null) { Job latestJob = null; try { latestJob = this.runningJob.getCluster() .getJob(JobID.forName(jobId)); } catch (InterruptedException e) { throw new IOException(e); } return latestJob == null ? null : new RunningJobStatus(latestJob, null); } } else { RunningJob latestJob = client.getJob(JobID.forName(jobId)); return latestJob == null ? null : new RunningJobStatus(null, latestJob); } } return null; } class RunningJobStatus { Job testJob; RunningJob job; public RunningJobStatus(Job testJob, RunningJob job) { this.testJob = testJob; this.job = job; } String getJobID() { return ENABLED_FOR_TEST ? 
testJob.getJobID().toString() : job.getID().toString(); } boolean isComplete() throws IOException { return ENABLED_FOR_TEST ? testJob.isComplete() : job.isComplete(); } boolean isSuccessful() throws IOException { return ENABLED_FOR_TEST ? testJob.isSuccessful() : job.isSuccessful(); } String getFailureInfo() throws IOException { try { return ENABLED_FOR_TEST ? testJob.getStatus().getFailureInfo() : job.getFailureInfo(); } catch (InterruptedException e) { throw new IOException(e); } } } {noformat} And mini yarn cluster related code lines can all be removed (include two pom dependencies mentioned above) {code:java} +mrCluster = new MiniMRYarnCluster(TestDistCpProcedure.class.getName(), 3); +conf.set(MRJobConfig.MR_AM_STAGING_DIR, "/apps_staging_dir"); +mrCluster.init(conf); +mrCluster.start(); +conf = mrCluster.getConfig(); {code} We need additionally set test enabled flag. {code:java} public static void beforeClass() throws IOException { DistCpProcedure.ENABLED_FOR_TEST = true; ... } {code} After this improvement, the whole test runs very faster than before, it totally costs less than 1 min. Also I catch some places still needed to update. # Can you update following description in router option? I update this content as well but seems this was not addressed in the latest patch. {noformat} It will disable read and write by cancelling all permissions of the source path. The default value is `false`." {noformat} # Method name cleanUpBeforeInitDistcp can be renamed to pathCheckBeforeInitDistcp since we don't do any cleanup operation now. was (Author: linyiqun): [~LiJinglun], thanks for addressing remaining comments. These two days, I am trying to improve the efficiency of the unit test, current unit test is too slow. I find another way that we don't have to depend on mini yarn cluster in test running. The job can submitted and executed in LocalJobRunner way. But we need to make an adjustment in getting job status from job client. 
I do some refactor in getCurrent method and apply them in DistCpProcedure. Following are part of some necessary change we need to update. {noformat} @VisibleForTesting private Job runningJob; static boolean ENABLED_FOR_TEST = false; ... private String submitDistCpJob(String srcParam, String dstParam, boolean useSnapshotDiff) throws IOException { ... try { LOG.info("Submit distcp job={}", job); runningJob = job; <--- need to reset there return job.getJobID().toString(); } catch (Exception e) { throw new IOException("Submit job failed.", e); } } private
[jira] [Commented] (HDFS-15346) RBF: DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135024#comment-17135024 ] Yiqun Lin commented on HDFS-15346: -- [~LiJinglun], thanks for addressing remaining comments. These two days, I am trying to improve the efficiency of the unit test, current unit test is too slow. I find another way that we don't have to depend on mini yarn cluster in test running. The job can submitted and executed in LocalJobRunner way. But we need to make an adjustment in getting job status from job client. I do some refactor in getCurrent method and apply them in DistCpProcedure. Following are part of some necessary change we need to update. {noformat} @VisibleForTesting private Job runningJob; static boolean ENABLED_FOR_TEST = false; ... private String submitDistCpJob(String srcParam, String dstParam, boolean useSnapshotDiff) throws IOException { ... try { LOG.info("Submit distcp job={}", job); runningJob = job; <--- need to reset there return job.getJobID().toString(); } catch (Exception e) { throw new IOException("Submit job failed.", e); } } private RunningJobStatus getCurrentJob() throws IOException { if (jobId != null) { if (ENABLED_FOR_TEST) { if (this.runningJob != null) { Job latestJob = null; try { latestJob = this.runningJob.getCluster() .getJob(JobID.forName(jobId)); } catch (InterruptedException e) { throw new IOException(e); } return latestJob == null ? null : new RunningJobStatus(latestJob, null); } } else { RunningJob latestJob = client.getJob(JobID.forName(jobId)); return latestJob == null ? null : new RunningJobStatus(null, latestJob); } } return null; } class RunningJobStatus { Job testJob; RunningJob job; public RunningJobStatus(Job testJob, RunningJob job) { this.testJob = testJob; this.job = job; } String getJobID() { return ENABLED_FOR_TEST ? testJob.getJobID().toString() : job.getID().toString(); } boolean isComplete() throws IOException { return ENABLED_FOR_TEST ? 
testJob.isComplete() : job.isComplete(); } boolean isSuccessful() throws IOException { return ENABLED_FOR_TEST ? testJob.isSuccessful() : job.isSuccessful(); } String getFailureInfo() throws IOException { try { return ENABLED_FOR_TEST ? testJob.getStatus().getFailureInfo() : job.getFailureInfo(); } catch (InterruptedException e) { throw new IOException(e); } } } {noformat} And mini yarn cluster related code lines can all be removed (include two pom dependencies mentioned above) {code:java} +mrCluster = new MiniMRYarnCluster(TestDistCpProcedure.class.getName(), 3); +conf.set(MRJobConfig.MR_AM_STAGING_DIR, "/apps_staging_dir"); +mrCluster.init(conf); +mrCluster.start(); +conf = mrCluster.getConfig(); {code} We need additionally set test enabled flag. {code:java} public static void beforeClass() throws IOException { DistCpProcedure.ENABLED_FOR_TEST = true; ... } {code} After this improvement, the whole test runs very faster than before, it totally costs less than 1 min. Also I catch some places still needed to update. # Can you update following description in router option? I update this content as well but seems this was not addressed in the latest patch. {noformat} It will disable read and write by cancelling all permissions of the source path. The default value is `false`." {noformat} # Method name cleanUpBeforeInitDistcp can be renamed to pathCheckBeforeInitDistcp since we don't do any cleanup operation now. > RBF: DistCpFedBalance implementation > > > Key: HDFS-15346 > URL: https://issues.apache.org/jira/browse/HDFS-15346 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, > HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, > HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, > HDFS-15346.009.patch, HDFS-15346.010.patch > > > Patch in HDFS-15294 is too big to review so we split it into 2 patches. This > is the second one. 
Detail can be found at HDFS-15294. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
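The ENABLED_FOR_TEST refactor in the comment above hides two job-status backends (the real cluster JobClient and the in-process test Job) behind one small status wrapper, so tests can run against LocalJobRunner without a MiniMRYarnCluster. The pattern reduced to plain Java — backend behavior here is illustrative, not the actual DistCpProcedure code:

```java
public class RunningJobStatusSketch {
    // One tiny backend interface, mirroring the RunningJob-vs-Job split
    // in the real patch's RunningJobStatus class.
    public interface StatusSource { boolean isComplete(); }

    public static boolean enabledForTest = false;

    // Hypothetical backends: the cluster path is unavailable in unit tests,
    // while the in-process test path answers immediately.
    public static StatusSource clusterSource =
        () -> { throw new IllegalStateException("no YARN cluster in tests"); };
    public static StatusSource testSource = () -> true;

    // Exactly one backend is consulted depending on the flag, so production
    // code paths never touch the test-only backend and vice versa.
    public static boolean isComplete() {
        return (enabledForTest ? testSource : clusterSource).isComplete();
    }

    public static void main(String[] args) {
        enabledForTest = true; // as beforeClass() sets DistCpProcedure.ENABLED_FOR_TEST
        System.out.println(isComplete()); // true, from the in-process source
    }
}
```

Flipping a single static flag in beforeClass() is what lets the whole TestDistCpProcedure suite drop the MiniMRYarnCluster dependency and finish in under a minute, as the comment reports.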
[jira] [Comment Edited] (HDFS-15087) RBF: Balance/Rename across federation namespaces
[ https://issues.apache.org/jira/browse/HDFS-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134832#comment-17134832 ]

Yiqun Lin edited comment on HDFS-15087 at 6/13/20, 3:16 PM:
------------------------------------------------------------

Hi [~umamaheswararao] and others, I'd like to share the current status of HDFS-15294, since others may also want to know about this feature. [~LiJinglun] has now completed the majority of the implementation and we are actively working on the one core subtask, HDFS-15346. Some work remains, such as documentation. After discussing the feature design with [~LiJinglun], we agreed to let this tool become a common balance tool that is not only used in RBF mode but can also be used in normal federation clusters.
{quote}If it's mandatory, would it be possible to think and make it as optional and have alternative thoughts to get the diff?
{quote}
Good idea, we could make this diff pluggable in a future improvement.

Please share your thoughts/comments on HDFS-15294 if you are interested in this, :).

was (Author: linyiqun):
Hi [~umamaheswararao] and others, I'd like to share current status about HDFS-15294 and maybe some others also want to know this feature. Now [~LiJinglun] almost completed the majority implementation and we are actively working for the one core subtask HDFS-15346. There are still some remaining work like documentation. And after discussed with [~LiJinglun] about this feature design, we are agreed that let this tool become a common balance tool not only used for RBF mode, but also can be used in normal federation clusters. Please share your wonderful thoughts/comments on HDFS-15294 if you are interested in this, :).
> RBF: Balance/Rename across federation namespaces > > > Key: HDFS-15087 > URL: https://issues.apache.org/jira/browse/HDFS-15087 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15087.initial.patch, HFR_Rename Across Federation > Namespaces.pdf > > > The Xiaomi storage team has developed a new feature called HFR(HDFS > Federation Rename) that enables us to do balance/rename across federation > namespaces. The idea is to first move the meta to the dst NameNode and then > link all the replicas. It has been working in our largest production cluster > for 2 months. We use it to balance the namespaces. It turns out HFR is fast > and flexible. The detail could be found in the design doc. > Looking forward to a lively discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15087) RBF: Balance/Rename across federation namespaces
[ https://issues.apache.org/jira/browse/HDFS-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134832#comment-17134832 ] Yiqun Lin commented on HDFS-15087: -- Hi [~umamaheswararao] and others, I'd like to share current status about HDFS-15294 and maybe some others also want to know this feature. Now [~LiJinglun] almost completed the majority implementation and we are actively working for the one core subtask HDFS-15346. There are still some remaining work like documentation. And after discussed with [~LiJinglun] about this feature design, we are agreed that let this tool become a common balance tool not only used for RBF mode, but also can be used in normal federation clusters. Please share your wonderful thoughts/comments on HDFS-15294 if you are interested in this, :). > RBF: Balance/Rename across federation namespaces > > > Key: HDFS-15087 > URL: https://issues.apache.org/jira/browse/HDFS-15087 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15087.initial.patch, HFR_Rename Across Federation > Namespaces.pdf > > > The Xiaomi storage team has developed a new feature called HFR(HDFS > Federation Rename) that enables us to do balance/rename across federation > namespaces. The idea is to first move the meta to the dst NameNode and then > link all the replicas. It has been working in our largest production cluster > for 2 months. We use it to balance the namespaces. It turns out HFR is fast > and flexible. The detail could be found in the design doc. > Looking forward to a lively discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15346) RBF: DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130903#comment-17130903 ] Yiqun Lin commented on HDFS-15346: -- [~LiJinglun] , thanks for addressing the comments, this almost looks good now. {quote}Agree with you ! Using a fedbalance-default.xml is much better. {quote} Would you create a subtask JIRA for this? Let's try to complete it at a later time. {quote}I'll try to figure it out. But it might be quite tricky as the unit tests use both MiniDFSCluster and MiniMRYarnCluster. And there are many rounds of distcp. Please tell me if you have any suggestions, thanks {quote} I will take a further look at this later. But anyway, all the unit tests currently pass, so it's okay for me. Still some remaining minor comments:
*hadoop-federation-balance/pom.xml*
{noformat}
+    <dependency>
+      <groupId>org.bouncycastle</groupId>
+      <artifactId>bcprov-jdk15on</artifactId>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.bouncycastle</groupId>
+      <artifactId>bcpkix-jdk15on</artifactId>
+      <scope>test</scope>
+    </dependency>
{noformat}
These two dependencies seem unrelated; can we remove them?
*DistCpFedBalance.java/FedBalance.java* I don't know why we define another class FedBalance; it can simply be combined into DistCpFedBalance. I'd prefer to override the main method in DistCpFedBalance and then rename DistCpFedBalance to FedBalance.
*DistCpBalanceOptions.java* I found two places that could be described more clearly:
# I'd prefer to move the detailed comment message into the option description, so that users can learn the details of the option from it.
{code:java}
/**
 * Run in router-based federation mode.
 */
final static Option ROUTER = new Option("router", false,
    "If `true` the command runs in router mode. The source path is taken as a mount point. It will disable write by setting the mount point readonly. Otherwise the command works in normal federation mode. The source path is taken as the full path. It will disable read and write by cancelling all permissions of the source path. The default value is `false`.");
{code}
# The description of the delay option is hard to understand.
I made a minor change for this. [~LiJinglun], if you have a better description for this option, feel free to update it in your change.
{code:java}
/* Specify the delay duration (in milliseconds) used to recover the job. */
final static Option DELAY_DURATION = new Option("delay", true,
    "The delay duration in milliseconds. When the job is detected as unfinished, it waits for this duration before continuing to run.");
{code}
*DistCpProcedure.java*
# Move {{srcFs.allowSnapshot(src);}} to the end of the method. Only after the snapshot check should we do the allowSnapshot operation.
{code:java}
+
+  private void cleanUpBeforeInitDistcp() throws IOException {
+    if (dstFs.exists(dst)) { // clean up.
+      throw new IOException("The dst path=" + dst + " already exists. The admin"
+          + " should delete it before submitting the initial distcp job.");
+    }
+    Path snapshotPath = new Path(src,
+        HdfsConstants.DOT_SNAPSHOT_DIR_SEPARATOR + CURRENT_SNAPSHOT_NAME);
+    if (srcFs.exists(snapshotPath)) {
+      throw new IOException("The src snapshot=" + snapshotPath
+          + " already exists. The admin should delete the snapshot before"
+          + " submitting the initial distcp.");
+    }
     srcFs.allowSnapshot(src); <--- move to here
+  }
{code}
*FedBalanceContext.java*
# Please add the necessary separators in the toString method, like this:
{code:java}
public String toString() {
  StringBuilder builder = new StringBuilder("FedBalance context:");
  builder.append(" src=").append(src);
  builder.append(", dst=").append(dst);
  if (useMountReadOnly) {
    builder.append(", router-mode=true");
    builder.append(", mount-point=").append(mount);
  } else {
    builder.append(", router-mode=false");
  }
  builder.append(", forceCloseOpenFiles=").append(forceCloseOpenFiles);
  builder.append(", trash=").append(trashOpt.name());
  builder.append(", map=").append(mapNum);
  builder.append(", bandwidth=").append(bandwidthLimit);
  return builder.toString();
}
{code}
# Can you add the newly added delayDuration option to this class?
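The ordering point above (run every validation first, perform the allowSnapshot side effect last) can be sketched independently of HDFS. The following is a simplified, hypothetical stand-in: the `PreCheck` class and its boolean parameters replace the real dstFs/srcFs calls, so it illustrates only the control flow, not the actual DistCpProcedure API:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the suggested cleanUpBeforeInitDistcp ordering. */
class PreCheck {

  /**
   * Runs all validation first and only then records the allowSnapshot
   * side effect, mirroring the review suggestion to move allowSnapshot
   * to the end of the method.
   */
  static List<String> run(boolean dstExists, boolean snapshotExists)
      throws IOException {
    List<String> actions = new ArrayList<>();
    if (dstExists) {
      throw new IOException("The dst path already exists. The admin should"
          + " delete it before submitting the initial distcp job.");
    }
    if (snapshotExists) {
      throw new IOException("The src snapshot already exists. The admin"
          + " should delete it before submitting the initial distcp.");
    }
    // The side effect happens only after every check has passed.
    actions.add("allowSnapshot");
    return actions;
  }

  public static void main(String[] args) throws IOException {
    System.out.println("actions=" + run(false, false));
  }
}
```

Because no side effect happens until all checks pass, a failed pre-check leaves the source untouched and the admin can simply fix the state and resubmit.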
> RBF: DistCpFedBalance implementation > > > Key: HDFS-15346 > URL: https://issues.apache.org/jira/browse/HDFS-15346 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, > HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, > HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, > HDFS-15346.009.patch > > > Patch in HDFS-15294 is too big to review so we split it into 2 patches. This > is the second one. Detail can be found at HDFS-15294.
[jira] [Comment Edited] (HDFS-15346) RBF: DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127489#comment-17127489 ] Yiqun Lin edited comment on HDFS-15346 at 6/7/20, 4:35 AM: --- Some more detailed review comments:
*HdfsConstants.java* Can we rename DOT_SNAPSHOT_SEPARATOR_DIR to the more readable name DOT_SNAPSHOT_DIR_SEPARATOR?
*DistCpFedBalance.java*
# It would be good to print the fed context created from the input options, so that we know the final options we passed in.
{noformat}
+    ...  // --> print fed balancer context
+    // Construct the balance job.
+    BalanceJob.Builder builder = new BalanceJob.Builder<>();
+    DistCpProcedure dcp =
+        new DistCpProcedure(DISTCP_PROCEDURE, null, delayDuration, context);
+    builder.nextProcedure(dcp);
{noformat}
# We can replace this System.out with the LOG instance:
{noformat}
+    for (BalanceJob job : jobs) {
+      if (!job.isJobDone()) {
+        unfinished++;
+      }
+      System.out.println(job);
+    }
{noformat}
*DistCpProcedure.java*
# The message in IOException(src + " doesn't exist.") is not described correctly; it should be 'src + " should be a directory."'
# For each stage change, can we add an additional log line, like this:
{noformat}
+    if (srcFs.exists(new Path(src, HdfsConstants.DOT_SNAPSHOT_DIR))) {
+      throw new IOException(src + " shouldn't enable snapshot.");
+    }
     LOG.info("Stage updated from {} to {}.", stage.name(), Stage.INIT_DISTCP.name());
+    stage = Stage.INIT_DISTCP;
+  }
{noformat}
# Here we reset the permission to 0; does that mean no operation at all is allowed? Is this expected? Why not 400 (read only)? The comment saying 'cancelling the x permission of the source path.' confuses me.
{noformat}
srcFs.setPermission(src, FsPermission.createImmutable((short) 0));
{noformat}
# I'd prefer to throw an IOException rather than doing a delete operation in cleanUpBeforeInitDistcp. cleanUpBeforeInitDistcp is expected to be the final pre-check function before submitting the distcp job.
Let the admin users check and do the delete operation manually themselves.
{noformat}
+  private void initialCheckBeforeInitDistcp() throws IOException {
+    if (dstFs.exists(dst)) {
+      throw new IOException();
+    }
+    srcFs.allowSnapshot(src);
+    if (srcFs.exists(new Path(src,
+        HdfsConstants.DOT_SNAPSHOT_SEPARATOR_DIR + CURRENT_SNAPSHOT_NAME))) {
+      throw new IOException();
+    }
{noformat}
*FedBalanceConfigs.java* Can we move all the keys from BalanceProcedureConfigKeys into this class? We don't need two duplicated config classes. One follow-up task I am thinking of: we can have a separate config file named something like fedbalance-default.xml for the fedbalance tool, similar to distcp-default.xml for the distcp tool today. I'd prefer not to add all the tool config settings into hdfs-default.xml.
*FedBalanceContext.java* Override the toString method in FedBalanceContext to help us know the input options that are actually used.
*MountTableProcedure.java* The for loop can just break once we find the first source path that matches.
{noformat}
+    for (MountTable result : results) {
+      if (mount.equals(result.getSourcePath())) {
+        existingEntry = result;
+        break;
+      }
+    }
{noformat}
*TrashProcedure.java*
{noformat}
+  /**
+   * Move the source path to trash or delete it.
+   */
+  void moveToTrash() throws IOException {
+    Path src = context.getSrc();
+    if (srcFs.exists(src)) {
+      switch (context.getTrashOpt()) {
+      case TRASH:
+        conf.setFloat(FS_TRASH_INTERVAL_KEY, 1);
+        if (!Trash.moveToAppropriateTrash(srcFs, src, conf)) {
+          throw new IOException("Failed move " + src + " to trash.");
+        }
+        break;
+      case DELETE:
+        if (!srcFs.delete(src, true)) {
+          throw new IOException("Failed delete " + src);
+        }
+        LOG.info("{} is deleted.", src);
+        break;
+      default:
+        break;
+      }
+    }
+  }
{noformat}
For the above lines, two review comments:
# Can we add a SKIP option check as well and throw an error for unexpected options?
{noformat}
      case SKIP:
        break;
+     default:
+       throw new IOException("Unexpected trash option=" + context.getTrashOpt());
      }
{noformat}
# FS_TRASH_INTERVAL_KEY defined as 1 is too small; it means the trash will be deleted after 1 minute. Can you increase this to 60? Also please add a comment in the trash option description explaining the default trash behavior: when trash is disabled on the server side, the client-side value will be used.
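Taken together, the two trash comments suggest dispatch logic like the following self-contained sketch. The enum, the `dispatch` method, and the 60-minute constant are illustrative stand-ins for this review thread, not the actual TrashProcedure code, which operates on a FileSystem and a Trash instance:

```java
import java.io.IOException;

/** Sketch of the trash-option dispatch with an explicit SKIP case. */
class TrashDispatch {
  enum TrashOption { TRASH, DELETE, SKIP }

  // 60 minutes, per the review comment that an interval of 1 is too small.
  static final float TRASH_INTERVAL_MINUTES = 60f;

  /** Returns the action to perform; unexpected options fail loudly. */
  static String dispatch(TrashOption opt) throws IOException {
    switch (opt) {
      case TRASH:
        return "moveToTrash(interval=" + TRASH_INTERVAL_MINUTES + ")";
      case DELETE:
        return "delete";
      case SKIP:
        return "skip";
      default:
        // Guards against future enum values being silently ignored.
        throw new IOException("Unexpected trash option=" + opt);
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(dispatch(TrashOption.TRASH));
  }
}
```

Handling SKIP explicitly and throwing in the default branch means a newly added option can never fall through silently.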
[jira] [Commented] (HDFS-15346) RBF: DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127498#comment-17127498 ] Yiqun Lin commented on HDFS-15346: -- Review comments for the unit tests:
*TestDistCpProcedure.java*
# Use CommonConfigurationKeysPublic.FS_DEFAULT_NAME_KEY to replace 'fs.defaultFS'.
# In {{TestDistCpProcedure#testSuccessfulDistCpProcedure}}, can we add an additional file length check between the src file and the dst file?
# Please complete the javadoc comments for the methods executeProcedure and createFiles.
# The method sede can be renamed to the more readable serializeProcedure.
# I think we are missing a corner-case test that disables write behavior in non-RBF mode.
# The whole test takes quite a long time to execute. From the Jenkins test result:
{noformat}
testDiffDistCp                  1 min 18 sec  Passed
testInitDistCp                  22 sec        Passed
testRecoveryByStage             55 sec        Passed
testShutdown                    8.9 sec       Passed
testStageFinalDistCp            47 sec        Passed
testStageFinish                 0.22 sec      Passed
testSuccessfulDistCpProcedure   38 sec        Passed
{noformat}
Can we look into why some unit tests take so long? Increasing the timeout value is a quick fix but not the best way.
*TestMountTableProcedure.java* Please rename testSeDe to testSeDeserialize.
*TestTrashProcedure.java* Can we also add a test method testSeDeserialize like TestMountTableProcedure does? > RBF: DistCpFedBalance implementation > > > Key: HDFS-15346 > URL: https://issues.apache.org/jira/browse/HDFS-15346 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, > HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, > HDFS-15346.006.patch, HDFS-15346.007.patch > > > Patch in HDFS-15294 is too big to review so we split it into 2 patches. This > is the second one. Detail can be found at HDFS-15294.
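The extra length check suggested for testSuccessfulDistCpProcedure could look like the following local-filesystem sketch. `assertSameLength` is a hypothetical helper name, and java.nio stands in for the MiniDFSCluster FileSystem the real test uses:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch of the extra file-length assertion suggested for the tests. */
class LengthCheck {

  /** Fails if the copied file's length differs from the source's. */
  static void assertSameLength(Path src, Path dst) throws IOException {
    long srcLen = Files.size(src);
    long dstLen = Files.size(dst);
    if (srcLen != dstLen) {
      throw new AssertionError(
          "Length mismatch: src=" + srcLen + " dst=" + dstLen);
    }
  }

  public static void main(String[] args) throws IOException {
    // Two temp files with the same content length pass the check.
    Path src = Files.write(Files.createTempFile("src", null), new byte[]{1, 2, 3});
    Path dst = Files.write(Files.createTempFile("dst", null), new byte[]{4, 5, 6});
    assertSameLength(src, dst);
    System.out.println("lengths match");
  }
}
```

A length check is cheap and catches truncated copies that an existence check alone would miss; a full checksum comparison would be stronger but slower on large files.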
[jira] [Commented] (HDFS-15346) RBF: DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126332#comment-17126332 ] Yiqun Lin commented on HDFS-15346: -- I will give a detailed review this weekend, [~LiJinglun]. > RBF: DistCpFedBalance implementation > > > Key: HDFS-15346 > URL: https://issues.apache.org/jira/browse/HDFS-15346 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, > HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, > HDFS-15346.006.patch, HDFS-15346.007.patch > > > Patch in HDFS-15294 is too big to review so we split it into 2 patches. This > is the second one. Detail can be found at HDFS-15294.
[jira] [Comment Edited] (HDFS-15346) RBF: DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123467#comment-17123467 ] Yiqun Lin edited comment on HDFS-15346 at 6/2/20, 7:56 AM: --- [~LiJinglun], can you fix the related failing unit tests and the newly generated checkstyle warnings? The patch generated 19 new + 2 unchanged - 0 fixed = 21 total (was 2) [https://builds.apache.org/job/PreCommit-HDFS-Build/29395/artifact/out/diff-checkstyle-root.txt] > RBF: DistCpFedBalance implementation > > > Key: HDFS-15346 > URL: https://issues.apache.org/jira/browse/HDFS-15346 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, > HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch > > > Patch in HDFS-15294 is too big to review so we split it into 2 patches. This > is the second one. Detail can be found at HDFS-15294.
[jira] [Comment Edited] (HDFS-15346) RBF: DistCpFedBalance implementation
[ https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120981#comment-17120981 ] Yiqun Lin edited comment on HDFS-15346 at 6/1/20, 12:23 PM: Hi [~LiJinglun] , some initial review comments from me:
*DistCpFedBalance.java*
# line 77 I suggest extracting 'submit' as a static variable in this class.
# line 85 the same extraction comment applies.
# line 127 Can you complete the javadoc of this method?
# line 132: Why is the default bandwidth only 1 for fedbalance? Won't that be too small?
# line 137, 140, 150 We can use the method CommandLine#hasOption to extract Boolean-type input values.
# line 178 Can you complete the javadoc of the constructor?
# line 199, 206, 210, 215 I also suggest using static variables rather than hard-coded values in these places.
# line 228 rClient is not closed after it's used.
*DistCpProcedure.java*
# line 191 We can use HdfsConstants.SEPARATOR_DOT_SNAPSHOT_DIR_SEPARATOR to replace '/.snapshot/'
# line 306 It would be better to add some description of the steps of the diff distcp job submission.
# line 374 Can we replace '.snapshot' with HdfsConstants.DOT_SNAPSHOT_DIR in all the other places in this class?
*TestDistCpProcedure.java* Can you use HdfsConstants.DOT_SNAPSHOT_DIR to replace '.snapshot' in this class as well?
*TestTrashProcedure.java*
{quote}Path src = new Path(nnUri + "/" + getMethodName() + "-src");
Path dst = new Path(nnUri + "/" + getMethodName() + "-dst");
{quote}
We don't need to use nnUri here because we already have the Filesystem instance. If we don't want to specify a particular namespace, the URI prefix can be omitted and the default fs will be used. We can simplify this to
{quote}Path src = new Path("/" + getMethodName() + "-src");
Path dst = new Path("/" + getMethodName() + "-dst");
{quote}
> RBF: DistCpFedBalance implementation > > > Key: HDFS-15346 > URL: https://issues.apache.org/jira/browse/HDFS-15346 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, > HDFS-15346.003.patch, HDFS-15346.004.patch > > > Patch in HDFS-15294 is too big to review so we split it into 2 patches. This > is the second one. Detail can be found at HDFS-15294.
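The CommandLine#hasOption point above is about reading boolean flags by presence rather than parsing their string values. A dependency-free sketch of the same idea (the `Flags` helper is hypothetical; the real code would call Commons CLI's `CommandLine#hasOption` directly):

```java
import java.util.Arrays;

/** Minimal stand-in for CommandLine#hasOption: a flag is true iff present. */
class Flags {

  /** Returns true if "-name" appears among the arguments. */
  static boolean hasOption(String[] args, String name) {
    return Arrays.asList(args).contains("-" + name);
  }

  public static void main(String[] args) {
    System.out.println(hasOption(new String[]{"-router"}, "router"));
  }
}
```

With Commons CLI, `cmd.hasOption("router")` gives the same presence check without any hand-rolled parsing of "true"/"false" strings.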