[jira] [Comment Edited] (HDFS-16644) java.io.IOException Invalid token in javax.security.sasl.qop

2023-02-01 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683033#comment-17683033
 ] 

Yiqun Lin edited comment on HDFS-16644 at 2/1/23 2:19 PM:
--

We also hit this issue in our Hadoop 3 cluster. Our DN server runs Hadoop 3.3, 
and the client version is 2.10.2.

We found that there is a chance that an abnormal QOP value (e.g. DI) can be 
passed in and overwrite the DataNode SASL props.
But in the default case (the HDFS-13541 feature not enabled), the secret should 
not be passed here. There may be some bug in the 2.10.2 version that still 
passes the secret here.

[~vagarychen], could you please check this code on branch-2.10? It's very 
dangerous: once the DN SASL props are overwritten with an invalid value, all 
data reads/writes could be impacted. And we don't do any validation of the QOP 
value here.

SaslDataTransferServer#doSaslHandshake
{noformat}
  private IOStreamPair doSaslHandshake(Peer peer, OutputStream underlyingOut,
      InputStream underlyingIn, Map<String, String> saslProps,
      CallbackHandler callbackHandler) throws IOException {

    DataInputStream in = new DataInputStream(underlyingIn);
    DataOutputStream out = new DataOutputStream(underlyingOut);

    int magicNumber = in.readInt();
    if (magicNumber != SASL_TRANSFER_MAGIC_NUMBER) {
      throw new InvalidMagicNumberException(magicNumber,
          dnConf.getEncryptDataTransfer());
    }
    try {
      // step 1
      SaslMessageWithHandshake message = readSaslMessageWithHandshakeSecret(in);
      byte[] secret = message.getSecret();
      String bpid = message.getBpid();
      if (secret != null || bpid != null) {
        // sanity check, if one is null, the other must also not be null
        assert(secret != null && bpid != null);
        String qop = new String(secret, Charsets.UTF_8);
        saslProps.put(Sasl.QOP, qop); // <= any QOP value could be set here
      }
    ...
{noformat}
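For illustration, a minimal validation sketch (my own hedged example, not Hadoop code) that rejects anything other than the three QOP tokens defined for javax.security.sasl.Sasl.QOP:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch only: reject any negotiated QOP token that is not one of the three
// values defined for javax.security.sasl.Sasl.QOP. A truncated or corrupted
// secret (e.g. the single character "D" from the stack trace below) would then
// fail fast instead of poisoning the DataNode's SASL properties.
public class QopValidator {
  private static final Set<String> VALID_QOPS =
      new HashSet<>(Arrays.asList("auth", "auth-int", "auth-conf"));

  /** Returns true only for QOP tokens defined by the SASL API. */
  public static boolean isValidQop(String qop) {
    return qop != null && VALID_QOPS.contains(qop);
  }

  public static void main(String[] args) {
    System.out.println(isValidQop("auth-conf")); // valid token
    System.out.println(isValidQop("D"));         // invalid truncated token
  }
}
```

A check like this before the saslProps.put call would turn the invalid value into an immediate handshake failure instead of a poisoned server state.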


was (Author: linyiqun):
We also meet this issue in our Hadoop3 cluster. Our DN server is Hadoop 3.3 
version, and client version is 2.10.2.

We find that there is one chance that abnormal QOP value(e.g. DI) can be passed 
and overwrite for DataNode sasl props.
But by default case(HDFS-13541 feature not enabled), the secret should not be 
passed here. Somehow that there maybe some bug on 2.10.2 version that still 
pass the secret here.

[~vagarychen], could you please check for this code on branch-2.10. It's very 
dangerous that once DN sasl props is overwrite with an invalid value. All the 
data read/write could be impacted. And also here we don't do any validation 
check for QOP value.

SaslDataTransferServer#doSaslHandshake
{noformat}
  private IOStreamPair doSaslHandshake(Peer peer, OutputStream underlyingOut,
  InputStream underlyingIn, Map saslProps,
  CallbackHandler callbackHandler) throws IOException {

DataInputStream in = new DataInputStream(underlyingIn);
DataOutputStream out = new DataOutputStream(underlyingOut);

int magicNumber = in.readInt();
if (magicNumber != SASL_TRANSFER_MAGIC_NUMBER) {
  throw new InvalidMagicNumberException(magicNumber, 
  dnConf.getEncryptDataTransfer());
}
try {
  // step 1
  SaslMessageWithHandshake message = readSaslMessageWithHandshakeSecret(in);
  byte[] secret = message.getSecret();
  String bpid = message.getBpid();
  if (secret != null || bpid != null) {
// sanity check, if one is null, the other must also not be null
assert(secret != null && bpid != null);
String qop = new String(secret, Charsets.UTF_8);
saslProps.put(Sasl.QOP, qop);   <= here any QOP value could be set 
here
  }
...
{noformat}

> java.io.IOException Invalid token in javax.security.sasl.qop
> 
>
> Key: HDFS-16644
> URL: https://issues.apache.org/jira/browse/HDFS-16644
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.1
>Reporter: Walter Su
>Priority: Major
>
> deployment:
> server side: kerberos enabled cluster with jdk 1.8 and hdfs-server 3.2.1
> client side:
> I ran the command hadoop fs -put on a test file, with a Kerberos ticket 
> initialized first, and used identical core-site.xml & hdfs-site.xml 
> configuration.
>  using client ver 3.2.1, it succeeds.
>  using client ver 2.8.5, it succeeds.
>  using client ver 2.10.1, it fails. The client side error info is:
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient: 
> SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = 
> false
> 2022-06-27 01:06:15,781 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> DataNode{data=FSDataset{dirpath='[/mnt/disk1/hdfs, /mnt/***/hdfs, 
> /mnt/***/hdfs, /mnt/***/hdfs]'}, localName='emr-worker-***.***:9866', 
> 

[jira] [Commented] (HDFS-16644) java.io.IOException Invalid token in javax.security.sasl.qop

2023-02-01 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683033#comment-17683033
 ] 

Yiqun Lin commented on HDFS-16644:
--

We also hit this issue in our Hadoop 3 cluster. Our DN server runs Hadoop 3.3, 
and the client version is 2.10.2.

We found that there is a chance that an abnormal QOP value (e.g. DI) can be 
passed in and overwrite the DataNode SASL props.
But in the default case (the HDFS-13541 feature not enabled), the secret should 
not be passed here. There may be some bug in the 2.10.2 version that still 
passes the secret here.

[~vagarychen], could you please check this code on branch-2.10? It's very 
dangerous: once the DN SASL props are overwritten with an invalid value, all 
data reads/writes could be impacted. And we don't do any validation of the QOP 
value here.

SaslDataTransferServer#doSaslHandshake
{noformat}
  private IOStreamPair doSaslHandshake(Peer peer, OutputStream underlyingOut,
      InputStream underlyingIn, Map<String, String> saslProps,
      CallbackHandler callbackHandler) throws IOException {

    DataInputStream in = new DataInputStream(underlyingIn);
    DataOutputStream out = new DataOutputStream(underlyingOut);

    int magicNumber = in.readInt();
    if (magicNumber != SASL_TRANSFER_MAGIC_NUMBER) {
      throw new InvalidMagicNumberException(magicNumber,
          dnConf.getEncryptDataTransfer());
    }
    try {
      // step 1
      SaslMessageWithHandshake message = readSaslMessageWithHandshakeSecret(in);
      byte[] secret = message.getSecret();
      String bpid = message.getBpid();
      if (secret != null || bpid != null) {
        // sanity check, if one is null, the other must also not be null
        assert(secret != null && bpid != null);
        String qop = new String(secret, Charsets.UTF_8);
        saslProps.put(Sasl.QOP, qop); // <= any QOP value could be set here
      }
    ...
{noformat}

> java.io.IOException Invalid token in javax.security.sasl.qop
> 
>
> Key: HDFS-16644
> URL: https://issues.apache.org/jira/browse/HDFS-16644
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.1
>Reporter: Walter Su
>Priority: Major
>
> deployment:
> server side: kerberos enabled cluster with jdk 1.8 and hdfs-server 3.2.1
> client side:
> I ran the command hadoop fs -put on a test file, with a Kerberos ticket 
> initialized first, and used identical core-site.xml & hdfs-site.xml 
> configuration.
>  using client ver 3.2.1, it succeeds.
>  using client ver 2.8.5, it succeeds.
>  using client ver 2.10.1, it fails. The client side error info is:
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient: 
> SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = 
> false
> 2022-06-27 01:06:15,781 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> DataNode{data=FSDataset{dirpath='[/mnt/disk1/hdfs, /mnt/***/hdfs, 
> /mnt/***/hdfs, /mnt/***/hdfs]'}, localName='emr-worker-***.***:9866', 
> datanodeUuid='b1c7f64a-6389-4739-bddf-***', xmitsInProgress=0}:Exception 
> transfering block BP-1187699012-10.-***:blk_1119803380_46080919 to mirror 
> 10.*:9866
> java.io.IOException: Invalid token in javax.security.sasl.qop: D
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessage(DataTransferSaslUtil.java:220)
> Once any client of ver 2.10.1 connects to the HDFS server, the DataNode no 
> longer accepts any client connection; even a client of ver 3.2.1 cannot 
> connect. For a short time, all DataNodes reject client connections. 
> The problem exists even if I replace the DataNode with ver 3.3.0 or replace 
> java with jdk 11.
> The problem is fixed if I replace the DataNode with ver 3.2.0. I guess the 
> problem is related to HDFS-13541



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15486) Costly sendResponse operation slows down async editlog handling

2021-09-30 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423086#comment-17423086
 ] 

Yiqun Lin commented on HDFS-15486:
--

Hi [~functioner],
{quote}Yiqun Lin I reported a similar issue in HDFS-15869 and I had a github 
pull request there. You can take a look at whether that works, and whether we 
should resolve that Jira issue and this Jira issue together.
{quote}
 I'm afraid I don't have enough time to review that patch at the moment, sorry 
about that.

> Costly sendResponse operation slows down async editlog handling
> ---
>
> Key: HDFS-15486
> URL: https://issues.apache.org/jira/browse/HDFS-15486
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Yiqun Lin
>Priority: Major
> Attachments: Async-profile-(2).jpg, HDFS-15486_draft.patch, 
> async-profile-(1).jpg
>
>
> When our cluster NameNode is under very high load, we find it often stuck in 
> async-editlog handling.
> We used the async-profiler tool to get the flamegraph.
> !Async-profile-(2).jpg!
> This happens when the async editlog thread consumes an Edit from the queue and 
> triggers the sendResponse call.
> Here the sendResponse call is a little expensive since our cluster enables the 
> security env and performs some encoding operations when returning the 
> response.
> We often catch moments of costly sendResponse operations when the RPC call 
> queue is full.
> !async-profile-(1).jpg!
> Slowness in consuming Edits in the async editlog easily makes the Edit pending 
> queue full, which then blocks its enqueue operation invoked in writeLock-type 
> methods in the FSNamesystem class.
> The enhancement here is that we can use multiple threads to execute the 
> sendResponse call in parallel. sendResponse doesn't need the write lock for 
> protection, so this change is safe.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15486) Costly sendResponse operation slows down async editlog handling

2021-09-30 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423081#comment-17423081
 ] 

Yiqun Lin commented on HDFS-15486:
--

Some notes on the draft patch above:
 * We introduce a switch setting to enable the async response handling.
 * The patch is based on the branch-2.7 branch, not the latest trunk.
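As a rough illustration of the approach (a hypothetical sketch, not the attached patch), the single async-editlog consumer hands each costly sendResponse call to a small executor so it never blocks on response encoding:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: offload the costly sendResponse work to a fixed thread pool so
// the editlog consumer thread is never blocked by response
// encoding/encryption. The thread count would come from the switch setting
// mentioned above (names here are illustrative, not from the patch).
public class AsyncResponder {
  private final ExecutorService responsePool;
  final AtomicInteger responsesSent = new AtomicInteger(); // demo counter

  public AsyncResponder(int numThreads) {
    this.responsePool = Executors.newFixedThreadPool(numThreads);
  }

  /** Called from the editlog consumer thread; returns immediately. */
  public void sendResponseAsync(Runnable costlySendResponse) {
    responsePool.execute(costlySendResponse);
  }

  /** Drain outstanding responses on shutdown. */
  public void shutdown() throws InterruptedException {
    responsePool.shutdown();
    responsePool.awaitTermination(10, TimeUnit.SECONDS);
  }

  public static void main(String[] args) throws InterruptedException {
    AsyncResponder responder = new AsyncResponder(4);
    for (int i = 0; i < 100; i++) {
      responder.sendResponseAsync(responder.responsesSent::incrementAndGet);
    }
    responder.shutdown();
    System.out.println(responder.responsesSent.get()); // 100
  }
}
```

Because sendResponse does not need the FSNamesystem write lock, moving it off the consumer thread keeps the Edit pending queue draining at full speed.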

> Costly sendResponse operation slows down async editlog handling
> ---
>
> Key: HDFS-15486
> URL: https://issues.apache.org/jira/browse/HDFS-15486
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Yiqun Lin
>Priority: Major
> Attachments: Async-profile-(2).jpg, HDFS-15486_draft.patch, 
> async-profile-(1).jpg
>
>
> When our cluster NameNode is under very high load, we find it often stuck in 
> async-editlog handling.
> We used the async-profiler tool to get the flamegraph.
> !Async-profile-(2).jpg!
> This happens when the async editlog thread consumes an Edit from the queue and 
> triggers the sendResponse call.
> Here the sendResponse call is a little expensive since our cluster enables the 
> security env and performs some encoding operations when returning the 
> response.
> We often catch moments of costly sendResponse operations when the RPC call 
> queue is full.
> !async-profile-(1).jpg!
> Slowness in consuming Edits in the async editlog easily makes the Edit pending 
> queue full, which then blocks its enqueue operation invoked in writeLock-type 
> methods in the FSNamesystem class.
> The enhancement here is that we can use multiple threads to execute the 
> sendResponse call in parallel. sendResponse doesn't need the write lock for 
> protection, so this change is safe.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15486) Costly sendResponse operation slows down async editlog handling

2021-09-30 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15486:
-
Attachment: HDFS-15486_draft.patch

> Costly sendResponse operation slows down async editlog handling
> ---
>
> Key: HDFS-15486
> URL: https://issues.apache.org/jira/browse/HDFS-15486
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Yiqun Lin
>Priority: Major
> Attachments: Async-profile-(2).jpg, HDFS-15486_draft.patch, 
> async-profile-(1).jpg
>
>
> When our cluster NameNode is under very high load, we find it often stuck in 
> async-editlog handling.
> We used the async-profiler tool to get the flamegraph.
> !Async-profile-(2).jpg!
> This happens when the async editlog thread consumes an Edit from the queue and 
> triggers the sendResponse call.
> Here the sendResponse call is a little expensive since our cluster enables the 
> security env and performs some encoding operations when returning the 
> response.
> We often catch moments of costly sendResponse operations when the RPC call 
> queue is full.
> !async-profile-(1).jpg!
> Slowness in consuming Edits in the async editlog easily makes the Edit pending 
> queue full, which then blocks its enqueue operation invoked in writeLock-type 
> methods in the FSNamesystem class.
> The enhancement here is that we can use multiple threads to execute the 
> sendResponse call in parallel. sendResponse doesn't need the write lock for 
> protection, so this change is safe.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15486) Costly sendResponse operation slows down async editlog handling

2021-09-30 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423078#comment-17423078
 ] 

Yiqun Lin commented on HDFS-15486:
--

[~yuanbo] asked offline whether there is a patch for this JIRA. I have attached 
the draft patch; it has already been applied in our internal Hadoop version and 
runs well in our production environment. It can increase the RPC throughput of 
the NameNode.

> Costly sendResponse operation slows down async editlog handling
> ---
>
> Key: HDFS-15486
> URL: https://issues.apache.org/jira/browse/HDFS-15486
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Yiqun Lin
>Priority: Major
> Attachments: Async-profile-(2).jpg, async-profile-(1).jpg
>
>
> When our cluster NameNode is under very high load, we find it often stuck in 
> async-editlog handling.
> We used the async-profiler tool to get the flamegraph.
> !Async-profile-(2).jpg!
> This happens when the async editlog thread consumes an Edit from the queue and 
> triggers the sendResponse call.
> Here the sendResponse call is a little expensive since our cluster enables the 
> security env and performs some encoding operations when returning the 
> response.
> We often catch moments of costly sendResponse operations when the RPC call 
> queue is full.
> !async-profile-(1).jpg!
> Slowness in consuming Edits in the async editlog easily makes the Edit pending 
> queue full, which then blocks its enqueue operation invoked in writeLock-type 
> methods in the FSNamesystem class.
> The enhancement here is that we can use multiple threads to execute the 
> sendResponse call in parallel. sendResponse doesn't need the write lock for 
> protection, so this change is safe.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15660) StorageTypeProto is not compatiable between 3.x and 2.6

2021-03-24 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307712#comment-17307712
 ] 

Yiqun Lin edited comment on HDFS-15660 at 3/24/21, 9:48 AM:


Hi [~weichiu], this compatibility issue only happens when an old Hadoop version 
client doesn't contain the storage type introduced in HDFS-9806. It's a 
client-side issue, not a server-side one. As versions 3.1, 3.2 and 3.3 already 
contain the new storage type, it should be okay to do the upgrade. So I don't 
cherry-pick to other branches.


was (Author: linyiqun):
Hi [~weichiu], this compatible issue only happened in that old hadoop version 
client doesn't contain the storage type which introduced in HDFS-9806. It's a 
client side issue not the server side. As version 3.1, 3.2 and 3.3 already 
contain the new storage type, it should be okay to do the upgrade. So I only 
push the fix to trunk.

> StorageTypeProto is not compatiable between 3.x and 2.6
> ---
>
> Key: HDFS-15660
> URL: https://issues.apache.org/jira/browse/HDFS-15660
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0, 3.0.1, 2.9.2, 2.8.5, 2.7.7, 2.10.1
>Reporter: Ryan Wu
>Assignee: Ryan Wu
>Priority: Major
> Fix For: 2.9.3, 3.4.0, 2.10.2
>
> Attachments: HDFS-15660.002.patch, HDFS-15660.003.patch
>
>
> In our case, when the NN had been upgraded to 3.1.3 and the DN version was 
> still 2.6, we found that when Hive called the getContentSummary method, the 
> client and server were not compatible because Hadoop 3 added the new PROVIDED 
> storage type.
> {code:java}
> // code placeholder
> 20/04/15 14:28:35 INFO retry.RetryInvocationHandler---main: Exception while 
> invoking getContentSummary of class ClientNamenodeProtocolTranslatorPB over 
> x/x:8020. Trying to fail over immediately.
> java.io.IOException: com.google.protobuf.ServiceException: 
> com.google.protobuf.UninitializedMessageException: Message missing required 
> fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:819)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>         at com.sun.proxy.$Proxy11.getContentSummary(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.DFSClient.getContentSummary(DFSClient.java:3144)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:706)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:702)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getContentSummary(DistributedFileSystem.java:713)
>         at org.apache.hadoop.fs.shell.Count.processPath(Count.java:109)
>         at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
>         at 
> org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
>         at 
> org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>         at 
> org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>         at 
> org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
>         at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>         at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>         at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
> Caused by: com.google.protobuf.ServiceException: 
> com.google.protobuf.UninitializedMessageException: Message missing required 
> fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:272)
>         at com.sun.proxy.$Proxy10.getContentSummary(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:816)
>         ... 23 more
> Caused by: 

[jira] [Comment Edited] (HDFS-15660) StorageTypeProto is not compatiable between 3.x and 2.6

2021-03-24 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307712#comment-17307712
 ] 

Yiqun Lin edited comment on HDFS-15660 at 3/24/21, 9:47 AM:


Hi [~weichiu], this compatibility issue only happens when an old Hadoop version 
client doesn't contain the storage type introduced in HDFS-9806. It's a 
client-side issue, not a server-side one. As versions 3.1, 3.2 and 3.3 already 
contain the new storage type, it should be okay to do the upgrade. So I only 
push the fix to trunk.


was (Author: linyiqun):
Hi [~weichiu], this compatible issue only happened in that old hadoop version 
client doesn't contain the storage type which introduced in HDFS-9806. It's a 
client side issue not the server side. As version 3.1, 3.2 and 3.3 already 
contain the new storage type, it should be okay to do the upgrade.

> StorageTypeProto is not compatiable between 3.x and 2.6
> ---
>
> Key: HDFS-15660
> URL: https://issues.apache.org/jira/browse/HDFS-15660
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0, 3.0.1, 2.9.2, 2.8.5, 2.7.7, 2.10.1
>Reporter: Ryan Wu
>Assignee: Ryan Wu
>Priority: Major
> Fix For: 2.9.3, 3.4.0, 2.10.2
>
> Attachments: HDFS-15660.002.patch, HDFS-15660.003.patch
>
>
> In our case, when the NN had been upgraded to 3.1.3 and the DN version was 
> still 2.6, we found that when Hive called the getContentSummary method, the 
> client and server were not compatible because Hadoop 3 added the new PROVIDED 
> storage type.
> {code:java}
> // code placeholder
> 20/04/15 14:28:35 INFO retry.RetryInvocationHandler---main: Exception while 
> invoking getContentSummary of class ClientNamenodeProtocolTranslatorPB over 
> x/x:8020. Trying to fail over immediately.
> java.io.IOException: com.google.protobuf.ServiceException: 
> com.google.protobuf.UninitializedMessageException: Message missing required 
> fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:819)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>         at com.sun.proxy.$Proxy11.getContentSummary(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.DFSClient.getContentSummary(DFSClient.java:3144)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:706)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:702)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getContentSummary(DistributedFileSystem.java:713)
>         at org.apache.hadoop.fs.shell.Count.processPath(Count.java:109)
>         at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
>         at 
> org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
>         at 
> org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>         at 
> org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>         at 
> org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
>         at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>         at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>         at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
> Caused by: com.google.protobuf.ServiceException: 
> com.google.protobuf.UninitializedMessageException: Message missing required 
> fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:272)
>         at com.sun.proxy.$Proxy10.getContentSummary(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:816)
>         ... 23 more
> Caused by: com.google.protobuf.UninitializedMessageException: Message missing 
> 

[jira] [Commented] (HDFS-15660) StorageTypeProto is not compatiable between 3.x and 2.6

2021-03-24 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307712#comment-17307712
 ] 

Yiqun Lin commented on HDFS-15660:
--

Hi [~weichiu], this compatibility issue only happens when an old Hadoop version 
client doesn't contain the storage type introduced in HDFS-9806. It's a 
client-side issue, not a server-side one. As versions 3.1, 3.2 and 3.3 already 
contain the new storage type, it should be okay to do the upgrade.
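The hazard can be illustrated with a plain-Java sketch (my own hypothetical example, not HDFS or protobuf code): a server that strips storage types an old client cannot decode, instead of letting the client fail on an unknown required enum value.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch only: a 2.6 client predates HDFS-9806, so it cannot decode the newer
// PROVIDED storage type. One defensive option on the server side is to strip
// entries the client's protocol version does not understand rather than send
// an undecodable response. Names and type lists here are illustrative.
public class StorageTypeCompat {
  enum StorageType { DISK, SSD, ARCHIVE, RAM_DISK, PROVIDED }

  // Types assumed known to pre-HDFS-9806 clients (illustrative assumption).
  static final List<StorageType> OLD_CLIENT_TYPES = Arrays.asList(
      StorageType.DISK, StorageType.SSD, StorageType.ARCHIVE,
      StorageType.RAM_DISK);

  /** Drop entries an old client cannot decode instead of failing the RPC. */
  static List<StorageType> filterForOldClient(List<StorageType> reply) {
    return reply.stream()
        .filter(OLD_CLIENT_TYPES::contains)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<StorageType> reply = Arrays.asList(
        StorageType.DISK, StorageType.PROVIDED, StorageType.SSD);
    System.out.println(filterForOldClient(reply)); // [DISK, SSD]
  }
}
```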

> StorageTypeProto is not compatiable between 3.x and 2.6
> ---
>
> Key: HDFS-15660
> URL: https://issues.apache.org/jira/browse/HDFS-15660
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0, 3.0.1, 2.9.2, 2.8.5, 2.7.7, 2.10.1
>Reporter: Ryan Wu
>Assignee: Ryan Wu
>Priority: Major
> Fix For: 2.9.3, 3.4.0, 2.10.2
>
> Attachments: HDFS-15660.002.patch, HDFS-15660.003.patch
>
>
> In our case, when the NN had been upgraded to 3.1.3 and the DN version was 
> still 2.6, we found that when Hive called the getContentSummary method, the 
> client and server were not compatible because Hadoop 3 added the new PROVIDED 
> storage type.
> {code:java}
> // code placeholder
> 20/04/15 14:28:35 INFO retry.RetryInvocationHandler---main: Exception while 
> invoking getContentSummary of class ClientNamenodeProtocolTranslatorPB over 
> x/x:8020. Trying to fail over immediately.
> java.io.IOException: com.google.protobuf.ServiceException: 
> com.google.protobuf.UninitializedMessageException: Message missing required 
> fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:819)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>         at com.sun.proxy.$Proxy11.getContentSummary(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.DFSClient.getContentSummary(DFSClient.java:3144)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:706)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:702)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getContentSummary(DistributedFileSystem.java:713)
>         at org.apache.hadoop.fs.shell.Count.processPath(Count.java:109)
>         at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
>         at 
> org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
>         at 
> org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>         at 
> org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>         at 
> org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
>         at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>         at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>         at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
> Caused by: com.google.protobuf.ServiceException: 
> com.google.protobuf.UninitializedMessageException: Message missing required 
> fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:272)
>         at com.sun.proxy.$Proxy10.getContentSummary(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:816)
>         ... 23 more
> Caused by: com.google.protobuf.UninitializedMessageException: Message missing 
> required fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> com.google.protobuf.AbstractMessage$Builder.newUninitializedMessageException(AbstractMessage.java:770)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetContentSummaryResponseProto$Builder.build(ClientNamenodeProtocolProtos.java:65392)
>         at 
> 

[jira] [Commented] (HDFS-14558) RBF: Isolation/Fairness documentation

2021-01-12 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263438#comment-17263438
 ] 

Yiqun Lin commented on HDFS-14558:
--

LGTM, +1.

> RBF: Isolation/Fairness documentation
> -
>
> Key: HDFS-14558
> URL: https://issues.apache.org/jira/browse/HDFS-14558
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: Fengnan Li
>Priority: Major
> Attachments: HDFS-14558.001.patch, HDFS-14558.002.patch, 
> HDFS-14558.003.patch
>
>
> Documentation is needed to make users aware of this feature HDFS-14090.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14558) RBF: Isolation/Fairness documentation

2021-01-12 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-14558:
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed this to trunk.

Thanks [~fengnanli] for the contribution.

> RBF: Isolation/Fairness documentation
> -
>
> Key: HDFS-14558
> URL: https://issues.apache.org/jira/browse/HDFS-14558
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: Fengnan Li
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: HDFS-14558.001.patch, HDFS-14558.002.patch, 
> HDFS-14558.003.patch
>
>
> Documentation is needed to make users aware of this feature HDFS-14090.






[jira] [Comment Edited] (HDFS-14558) RBF: Isolation/Fairness documentation

2021-01-11 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262526#comment-17262526
 ] 

Yiqun Lin edited comment on HDFS-14558 at 1/11/21, 9:46 AM:


The mvnsite compilation failed because an unexpected '<>' was used in 
dfs.federation.router.fairness.handler.count.. and in 'Dedicated 
handler assigned to a specific '.
[~fengnanli], could you please use the lines below instead?
{noformat}
dfs.federation.router.fairness.handler.count.*EXAMPLENAMESERVICE*
Dedicated handler assigned to a specific nameservice
{noformat}
Others look good to me.


was (Author: linyiqun):
The mvnsite compilation failed because an unexpected '<>' was used in 
dfs.federation.router.fairness.handler.count.. and in 'Dedicated 
handler assigned to a specific '.
[~fengnanli], could you please use the lines below instead?
{noformat}
dfs.federation.router.fairness.handler.count.*EXAMPLENAMESERVICE*
Dedicated handler assigned to a specific nameservice
{noformat}
Others look good to me.

> RBF: Isolation/Fairness documentation
> -
>
> Key: HDFS-14558
> URL: https://issues.apache.org/jira/browse/HDFS-14558
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: Fengnan Li
>Priority: Major
> Attachments: HDFS-14558.001.patch, HDFS-14558.002.patch
>
>
> Documentation is needed to make users aware of this feature HDFS-14090.






[jira] [Comment Edited] (HDFS-14558) RBF: Isolation/Fairness documentation

2021-01-11 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262526#comment-17262526
 ] 

Yiqun Lin edited comment on HDFS-14558 at 1/11/21, 9:46 AM:


The mvnsite compilation failed because an unexpected '<>' was used in 
dfs.federation.router.fairness.handler.count.. and in 'Dedicated 
handler assigned to a specific '.
[~fengnanli], could you please use the lines below instead?
{noformat}
dfs.federation.router.fairness.handler.count.*EXAMPLENAMESERVICE*
Dedicated handler assigned to a specific nameservice
{noformat}
Others look good to me.


was (Author: linyiqun):
The mvnsite compilation failed because an unexpected '<>' was used in 
dfs.federation.router.fairness.handler.count..
[~fengnanli], could you use the line below instead?
{noformat}
dfs.federation.router.fairness.handler.count.*EXAMPLENAMESERVICE*
{noformat}
Others look good to me.

> RBF: Isolation/Fairness documentation
> -
>
> Key: HDFS-14558
> URL: https://issues.apache.org/jira/browse/HDFS-14558
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: Fengnan Li
>Priority: Major
> Attachments: HDFS-14558.001.patch, HDFS-14558.002.patch
>
>
> Documentation is needed to make users aware of this feature HDFS-14090.






[jira] [Commented] (HDFS-14558) RBF: Isolation/Fairness documentation

2021-01-11 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262526#comment-17262526
 ] 

Yiqun Lin commented on HDFS-14558:
--

The mvnsite compilation failed because an unexpected '<>' was used in 
dfs.federation.router.fairness.handler.count..
[~fengnanli], could you use the line below instead?
{noformat}
dfs.federation.router.fairness.handler.count.*EXAMPLENAMESERVICE*
{noformat}
Others look good to me.

> RBF: Isolation/Fairness documentation
> -
>
> Key: HDFS-14558
> URL: https://issues.apache.org/jira/browse/HDFS-14558
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: Fengnan Li
>Priority: Major
> Attachments: HDFS-14558.001.patch, HDFS-14558.002.patch
>
>
> Documentation is needed to make users aware of this feature HDFS-14090.






[jira] [Commented] (HDFS-14558) RBF: Isolation/Fairness documentation

2021-01-10 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262202#comment-17262202
 ] 

Yiqun Lin commented on HDFS-14558:
--

Hi [~fengnanli], do you have time to address the review comments above? It 
would be good to complete this documentation as well. HDFS-14090 was merged 
some time ago, but this JIRA is still blocked.

> RBF: Isolation/Fairness documentation
> -
>
> Key: HDFS-14558
> URL: https://issues.apache.org/jira/browse/HDFS-14558
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: Fengnan Li
>Priority: Major
> Attachments: HDFS-14558.001.patch
>
>
> Documentation is needed to make users aware of this feature HDFS-14090.






[jira] [Commented] (HDFS-14558) RBF: Isolation/Fairness documentation

2020-12-13 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248545#comment-17248545
 ] 

Yiqun Lin commented on HDFS-14558:
--

[~fengnanli], thanks for updating the patch; most of it looks great. 
Just a few comments from me:
{noformat}
+| dfs.federation.router.fairness.policy.controller.class | 
`org.apache.hadoop.hdfs.server.federation.fairness.DefaultFairnessPolicyController`
 | Default handler allocation model to be used if isolation feature is enabled. 
|
{noformat}
Here the default value should be 
org.apache.hadoop.hdfs.server.federation.fairness.NoRouterRpcFairnessPolicyController.
We can mention DefaultFairnessPolicyController in the description as the 
controller to use when the isolation feature is enabled.
{noformat}
+### Isolation
+
+Isolation and dedicated assignment of RPC handlers across all configured 
downstream nameservices.
+
{noformat}
Can we additionally mention that the sum of all configured handler count 
values must be strictly smaller than the total number of router handlers 
(configured by dfs.federation.router.handler.count)?

Please also fix the whitespace line warning:
{noformat}
hadoop-hdfs-project/hadoop-hdfs-rbf/src/site/markdown/HDFSRouterFederation.md:193:Overall
 the isolation feature is exposed via a configuration 
dfs.federation.router.handler.isolation.enable. The default value of this 
feature will be “false”. Users can also introduce their own fairness policy 
controller for custom allocation of handlers to various nameservices. 
{noformat}
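To make the handler-count constraint above concrete, here is a minimal illustrative sketch. This is not Hadoop source code; the class and method names are invented. It only demonstrates the rule that the dedicated per-nameservice handler counts must sum to strictly less than dfs.federation.router.handler.count, leaving some handlers for the shared pool.

```java
// Illustrative sketch only -- not Hadoop code. Checks that per-nameservice
// dedicated handler counts sum to strictly less than the total router
// handler count (dfs.federation.router.handler.count).
public class FairnessConfigCheck {

    /** Returns the handlers left for the shared pool, or throws if invalid. */
    static int validate(int totalHandlers, int[] dedicatedPerNameservice) {
        int dedicated = 0;
        for (int count : dedicatedPerNameservice) {
            dedicated += count;
        }
        if (dedicated >= totalHandlers) {
            throw new IllegalArgumentException(
                "dedicated handlers (" + dedicated + ") must be strictly "
                + "smaller than the router handler count (" + totalHandlers + ")");
        }
        return totalHandlers - dedicated;
    }

    public static void main(String[] args) {
        // e.g. 10 router handlers, two nameservices with 3 and 4 dedicated
        System.out.println(validate(10, new int[]{3, 4})); // leaves 3 shared
    }
}
```

With 10 router handlers and dedicated counts of 3 and 4, the check passes and 3 handlers remain for the shared pool; with 5 and 5 it fails, since no handlers would be left.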

> RBF: Isolation/Fairness documentation
> -
>
> Key: HDFS-14558
> URL: https://issues.apache.org/jira/browse/HDFS-14558
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: Fengnan Li
>Priority: Major
> Attachments: HDFS-14558.001.patch
>
>
> Documentation is needed to make users aware of this feature HDFS-14090.






[jira] [Updated] (HDFS-15660) StorageTypeProto is not compatible between 3.x and 2.6

2020-12-07 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15660:
-
Description: 
In our case, after the NameNode was upgraded to 3.1.3 while the DataNodes were 
still on 2.6, we found that when Hive called the getContentSummary method, the 
client and server were not compatible because Hadoop 3 added the new PROVIDED 
storage type.
{code:java}
// code placeholder
20/04/15 14:28:35 INFO retry.RetryInvocationHandler---main: Exception while 
invoking getContentSummary of class ClientNamenodeProtocolTranslatorPB over 
x/x:8020. Trying to fail over immediately.
java.io.IOException: com.google.protobuf.ServiceException: 
com.google.protobuf.UninitializedMessageException: Message missing required 
fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
        at 
org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:819)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
        at com.sun.proxy.$Proxy11.getContentSummary(Unknown Source)
        at 
org.apache.hadoop.hdfs.DFSClient.getContentSummary(DFSClient.java:3144)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:706)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:702)
        at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.getContentSummary(DistributedFileSystem.java:713)
        at org.apache.hadoop.fs.shell.Count.processPath(Count.java:109)
        at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
        at 
org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
        at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
        at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
        at 
org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
        at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
Caused by: com.google.protobuf.ServiceException: 
com.google.protobuf.UninitializedMessageException: Message missing required 
fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:272)
        at com.sun.proxy.$Proxy10.getContentSummary(Unknown Source)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:816)
        ... 23 more
Caused by: com.google.protobuf.UninitializedMessageException: Message missing 
required fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
        at 
com.google.protobuf.AbstractMessage$Builder.newUninitializedMessageException(AbstractMessage.java:770)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetContentSummaryResponseProto$Builder.build(ClientNamenodeProtocolProtos.java:65392)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetContentSummaryResponseProto$Builder.build(ClientNamenodeProtocolProtos.java:65331)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:263)
        ... 25 more
{code}

This compatibility issue only occurs when the StorageType feature is used in the cluster.
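To illustrate why the stack trace above reports "Message missing required fields", here is a minimal sketch. This is not real protobuf code; the class, the decoding logic, and the wire numbers are invented for illustration. In proto2, a decoder that does not recognize a new enum value treats that field as unset, and since the "type" field is required, building the message fails its initialization check.

```java
// Illustrative sketch only -- not protobuf source. An old (2.6-era) client
// that has never seen a new storage-type enum value leaves the required
// "type" field unset, so message construction fails, mirroring the
// UninitializedMessageException in the stack trace.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StorageTypeDecodeSketch {
    // Enum values a 2.6-era client knows (wire numbers are illustrative).
    static final Map<Integer, String> OLD_CLIENT_TYPES = new HashMap<>();
    static {
        OLD_CLIENT_TYPES.put(1, "DISK");
        OLD_CLIENT_TYPES.put(2, "SSD");
        OLD_CLIENT_TYPES.put(3, "ARCHIVE");
        OLD_CLIENT_TYPES.put(4, "RAM_DISK");
    }

    /** Decodes the per-type quota list; throws if a required type is unknown. */
    static List<String> decodeTypeQuotaInfos(int[] wireValues) {
        List<String> types = new ArrayList<>();
        for (int i = 0; i < wireValues.length; i++) {
            String type = OLD_CLIENT_TYPES.get(wireValues[i]);
            if (type == null) {
                // required enum field left unset -> initialization check fails
                throw new IllegalStateException("Message missing required fields: "
                    + "summary.typeQuotaInfos.typeQuotaInfo[" + i + "].type");
            }
            types.add(type);
        }
        return types;
    }

    public static void main(String[] args) {
        try {
            // 5 stands in for the new PROVIDED type the old client never saw
            decodeTypeQuotaInfos(new int[]{1, 2, 3, 5});
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Note how the unknown value at index 3 produces exactly the field path reported in the exception, which is why the error only appears once the NameNode starts returning the new storage type in quota summaries.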

  was:
In our case, after the NameNode was upgraded to 3.1.3 while the DataNodes were 
still on 2.6, we found that when Hive called the getContentSummary method, the 
client and server were not compatible because Hadoop 3 added the new PROVIDED 
storage type.
{code:java}
// code placeholder
20/04/15 14:28:35 INFO retry.RetryInvocationHandler---main: Exception while 
invoking getContentSummary of class ClientNamenodeProtocolTranslatorPB over 
x/x:8020. Trying to fail over immediately.
java.io.IOException: com.google.protobuf.ServiceException: 
com.google.protobuf.UninitializedMessageException: Message missing required 
fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
        at 

[jira] [Updated] (HDFS-15660) StorageTypeProto is not compatible between 3.x and 2.6

2020-12-07 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15660:
-
Fix Version/s: 2.10.2
   3.4.0
   2.9.3
 Hadoop Flags: Reviewed

The new storage type was introduced in HDFS-9806, and that feature was 
implemented in version 3.1. So branches before 3.1 contain this compatibility 
issue and need this fix applied.

Committed this to branch-2.9, branch-2.10, branch-3.0 and trunk.
Thanks [~jianliang.wu] for the contribution and others for the review.

> StorageTypeProto is not compatible between 3.x and 2.6
> ---
>
> Key: HDFS-15660
> URL: https://issues.apache.org/jira/browse/HDFS-15660
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0, 3.0.1, 2.9.2, 2.8.5, 2.7.7, 2.10.1
>Reporter: Ryan Wu
>Assignee: Ryan Wu
>Priority: Major
> Fix For: 2.9.3, 3.4.0, 2.10.2
>
> Attachments: HDFS-15660.002.patch, HDFS-15660.003.patch
>
>
> In our case, after the NameNode was upgraded to 3.1.3 while the DataNodes were 
> still on 2.6, we found that when Hive called the getContentSummary method, the 
> client and server were not compatible because Hadoop 3 added the new PROVIDED 
> storage type.
> {code:java}
> // code placeholder
> 20/04/15 14:28:35 INFO retry.RetryInvocationHandler---main: Exception while 
> invoking getContentSummary of class ClientNamenodeProtocolTranslatorPB over 
> x/x:8020. Trying to fail over immediately.
> java.io.IOException: com.google.protobuf.ServiceException: 
> com.google.protobuf.UninitializedMessageException: Message missing required 
> fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:819)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>         at com.sun.proxy.$Proxy11.getContentSummary(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.DFSClient.getContentSummary(DFSClient.java:3144)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:706)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:702)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getContentSummary(DistributedFileSystem.java:713)
>         at org.apache.hadoop.fs.shell.Count.processPath(Count.java:109)
>         at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
>         at 
> org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
>         at 
> org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>         at 
> org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>         at 
> org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
>         at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>         at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>         at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
> Caused by: com.google.protobuf.ServiceException: 
> com.google.protobuf.UninitializedMessageException: Message missing required 
> fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:272)
>         at com.sun.proxy.$Proxy10.getContentSummary(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:816)
>         ... 23 more
> Caused by: com.google.protobuf.UninitializedMessageException: Message missing 
> required fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> com.google.protobuf.AbstractMessage$Builder.newUninitializedMessageException(AbstractMessage.java:770)
>         at 
> 

[jira] [Updated] (HDFS-15660) StorageTypeProto is not compatible between 3.x and 2.6

2020-12-07 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15660:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> StorageTypeProto is not compatible between 3.x and 2.6
> ---
>
> Key: HDFS-15660
> URL: https://issues.apache.org/jira/browse/HDFS-15660
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0, 3.0.1, 2.9.2, 2.8.5, 2.7.7, 2.10.1
>Reporter: Ryan Wu
>Assignee: Ryan Wu
>Priority: Major
> Fix For: 2.9.3, 3.4.0, 2.10.2
>
> Attachments: HDFS-15660.002.patch, HDFS-15660.003.patch
>
>
> In our case, after the NameNode was upgraded to 3.1.3 while the DataNodes were 
> still on 2.6, we found that when Hive called the getContentSummary method, the 
> client and server were not compatible because Hadoop 3 added the new PROVIDED 
> storage type.
> {code:java}
> // code placeholder
> 20/04/15 14:28:35 INFO retry.RetryInvocationHandler---main: Exception while 
> invoking getContentSummary of class ClientNamenodeProtocolTranslatorPB over 
> x/x:8020. Trying to fail over immediately.
> java.io.IOException: com.google.protobuf.ServiceException: 
> com.google.protobuf.UninitializedMessageException: Message missing required 
> fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:819)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>         at com.sun.proxy.$Proxy11.getContentSummary(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.DFSClient.getContentSummary(DFSClient.java:3144)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:706)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:702)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getContentSummary(DistributedFileSystem.java:713)
>         at org.apache.hadoop.fs.shell.Count.processPath(Count.java:109)
>         at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
>         at 
> org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
>         at 
> org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>         at 
> org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>         at 
> org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
>         at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>         at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>         at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
> Caused by: com.google.protobuf.ServiceException: 
> com.google.protobuf.UninitializedMessageException: Message missing required 
> fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:272)
>         at com.sun.proxy.$Proxy10.getContentSummary(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:816)
>         ... 23 more
> Caused by: com.google.protobuf.UninitializedMessageException: Message missing 
> required fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> com.google.protobuf.AbstractMessage$Builder.newUninitializedMessageException(AbstractMessage.java:770)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetContentSummaryResponseProto$Builder.build(ClientNamenodeProtocolProtos.java:65392)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetContentSummaryResponseProto$Builder.build(ClientNamenodeProtocolProtos.java:65331)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:263)
>         ... 25 more
> {code}




[jira] [Updated] (HDFS-15660) StorageTypeProto is not compatible between 3.x and 2.6

2020-12-04 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15660:
-
 Target Version/s: 2.9.3, 2.10.2  (was: 2.9.3, 3.3.1, 3.4.0, 3.1.5, 2.10.2, 
3.2.3)
Affects Version/s: (was: 3.1.3)
   (was: 3.2.0)
   2.9.2
   2.8.5
   2.7.7
   2.10.1
   Issue Type: Bug  (was: Improvement)

> StorageTypeProto is not compatible between 3.x and 2.6
> ---
>
> Key: HDFS-15660
> URL: https://issues.apache.org/jira/browse/HDFS-15660
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0, 3.0.1, 2.9.2, 2.8.5, 2.7.7, 2.10.1
>Reporter: Ryan Wu
>Assignee: Ryan Wu
>Priority: Major
> Attachments: HDFS-15660.002.patch, HDFS-15660.003.patch
>
>
> In our case, after the NameNode was upgraded to 3.1.3 while the DataNodes were 
> still on 2.6, we found that when Hive called the getContentSummary method, the 
> client and server were not compatible because Hadoop 3 added the new PROVIDED 
> storage type.
> {code:java}
> // code placeholder
> 20/04/15 14:28:35 INFO retry.RetryInvocationHandler---main: Exception while 
> invoking getContentSummary of class ClientNamenodeProtocolTranslatorPB over 
> x/x:8020. Trying to fail over immediately.
> java.io.IOException: com.google.protobuf.ServiceException: 
> com.google.protobuf.UninitializedMessageException: Message missing required 
> fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:819)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>         at com.sun.proxy.$Proxy11.getContentSummary(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.DFSClient.getContentSummary(DFSClient.java:3144)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:706)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:702)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getContentSummary(DistributedFileSystem.java:713)
>         at org.apache.hadoop.fs.shell.Count.processPath(Count.java:109)
>         at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
>         at 
> org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
>         at 
> org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>         at 
> org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>         at 
> org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
>         at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>         at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>         at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
> Caused by: com.google.protobuf.ServiceException: 
> com.google.protobuf.UninitializedMessageException: Message missing required 
> fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:272)
>         at com.sun.proxy.$Proxy10.getContentSummary(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:816)
>         ... 23 more
> Caused by: com.google.protobuf.UninitializedMessageException: Message missing 
> required fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> com.google.protobuf.AbstractMessage$Builder.newUninitializedMessageException(AbstractMessage.java:770)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetContentSummaryResponseProto$Builder.build(ClientNamenodeProtocolProtos.java:65392)
>         at 
> 

[jira] [Commented] (HDFS-15660) StorageTypeProto is not compatible between 3.x and 2.6

2020-12-03 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243715#comment-17243715
 ] 

Yiqun Lin commented on HDFS-15660:
--

Thanks for providing the test result for this change, [~jianliang.wu].

LGTM, +1. 

I think this is a safe change; I will hold off committing until next week.

> StorageTypeProto is not compatible between 3.x and 2.6
> ---
>
> Key: HDFS-15660
> URL: https://issues.apache.org/jira/browse/HDFS-15660
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.2.0, 3.1.3
>Reporter: Ryan Wu
>Assignee: Ryan Wu
>Priority: Major
> Attachments: HDFS-15660.002.patch, HDFS-15660.003.patch
>
>
> In our case, after the NameNode was upgraded to 3.1.3 while the DataNodes were 
> still on 2.6, we found that when Hive called the getContentSummary method, the 
> client and server were not compatible because Hadoop 3 added the new PROVIDED 
> storage type.
> {code:java}
> // code placeholder
> 20/04/15 14:28:35 INFO retry.RetryInvocationHandler---main: Exception while 
> invoking getContentSummary of class ClientNamenodeProtocolTranslatorPB over 
> x/x:8020. Trying to fail over immediately.
> java.io.IOException: com.google.protobuf.ServiceException: 
> com.google.protobuf.UninitializedMessageException: Message missing required 
> fields: summary.typeQuotaInfos.typeQuotaInfo[3].type

[jira] [Updated] (HDFS-15660) StorageTypeProto is not compatible between 3.x and 2.6

2020-11-25 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15660:
-
Attachment: HDFS-15660.002.patch

> StorageTypeProto is not compatible between 3.x and 2.6
> ---
>
> Key: HDFS-15660
> URL: https://issues.apache.org/jira/browse/HDFS-15660
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.2.0, 3.1.3
>Reporter: Ryan Wu
>Assignee: Ryan Wu
>Priority: Major
> Attachments: HDFS-15660.001.patch, HDFS-15660.002.patch
>
>
> In our case, after the NameNode was upgraded to 3.1.3 while the DataNode
> version was still 2.6, we found that when Hive called the getContentSummary
> method, the client and server were not compatible, because Hadoop 3 added the
> new PROVIDED storage type.
> {code:java}
> // code placeholder
> 20/04/15 14:28:35 INFO retry.RetryInvocationHandler---main: Exception while 
> invoking getContentSummary of class ClientNamenodeProtocolTranslatorPB over 
> x/x:8020. Trying to fail over immediately.
> java.io.IOException: com.google.protobuf.ServiceException: 
> com.google.protobuf.UninitializedMessageException: Message missing required 
> fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:819)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>         at com.sun.proxy.$Proxy11.getContentSummary(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.DFSClient.getContentSummary(DFSClient.java:3144)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:706)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:702)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getContentSummary(DistributedFileSystem.java:713)
>         at org.apache.hadoop.fs.shell.Count.processPath(Count.java:109)
>         at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
>         at 
> org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
>         at 
> org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>         at 
> org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>         at 
> org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
>         at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>         at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>         at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
> Caused by: com.google.protobuf.ServiceException: 
> com.google.protobuf.UninitializedMessageException: Message missing required 
> fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:272)
>         at com.sun.proxy.$Proxy10.getContentSummary(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getContentSummary(ClientNamenodeProtocolTranslatorPB.java:816)
>         ... 23 more
> Caused by: com.google.protobuf.UninitializedMessageException: Message missing 
> required fields: summary.typeQuotaInfos.typeQuotaInfo[3].type
>         at 
> com.google.protobuf.AbstractMessage$Builder.newUninitializedMessageException(AbstractMessage.java:770)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetContentSummaryResponseProto$Builder.build(ClientNamenodeProtocolProtos.java:65392)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetContentSummaryResponseProto$Builder.build(ClientNamenodeProtocolProtos.java:65331)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:263)
>         ... 25 more
> {code}
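The failure above is a direct consequence of proto2 semantics around unknown enum values in required fields. The sketch below illustrates the mechanism; the enum values match Hadoop's StorageTypeProto, but the message layout and field numbers are illustrative assumptions rather than the verbatim hdfs.proto:

```protobuf
// Hedged sketch, not the verbatim Hadoop .proto. In proto2, when a peer
// receives an enum value its own schema does not define (here, PROVIDED),
// the whole field is treated as an unknown field. If that field is marked
// "required", Builder.build() on the receiving side then throws
// UninitializedMessageException, as in the stack trace above.
enum StorageTypeProto {
  DISK = 1;
  SSD = 2;
  ARCHIVE = 3;
  RAM_DISK = 4;
  PROVIDED = 5;  // added in Hadoop 3.x; unknown to a 2.6 peer
}

message StorageTypeQuotaInfoProto {
  required StorageTypeProto type = 1;  // "required" turns the unknown value
  required uint64 quota = 2;           // into a hard build failure instead of
  required uint64 consumed = 3;        // a silently skipped field
}
```

Relaxing such a field to optional, or never sending values an older peer cannot decode, avoids the build-time failure; which route the attached patch takes is not shown in this thread.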



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HDFS-15660) StorageTypeProto is not compatible between 3.x and 2.6

2020-11-25 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239094#comment-17239094
 ] 

Yiqun Lin commented on HDFS-15660:
--

Attach the same patch to trigger Jenkins.






[jira] [Updated] (HDFS-15660) StorageTypeProto is not compatible between 3.x and 2.6

2020-11-25 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15660:
-
Status: Patch Available  (was: Open)





[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2020-11-13 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231451#comment-17231451
 ] 

Yiqun Lin commented on HDFS-14090:
--

Thanks for addressing the comments, [~fengnanli],

LGTM, +1.

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: Fengnan Li
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, 
> HDFS-14090.012.patch, HDFS-14090.013.patch, HDFS-14090.014.patch, 
> HDFS-14090.015.patch, HDFS-14090.016.patch, HDFS-14090.017.patch, 
> HDFS-14090.018.patch, HDFS-14090.019.patch, HDFS-14090.020.patch, 
> HDFS-14090.021.patch, HDFS-14090.022.patch, HDFS-14090.023.patch, 
> HDFS-14090.024.patch, HDFS-14090.025.patch, RBF_ Isolation design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures should 
> help minimize the impact on clients of connecting to healthy clusters vs 
> unhealthy clusters.
> For example, if there are 2 name nodes downstream and one of them is 
> heavily loaded, with calls spiking RPC queue times, then due to back 
> pressure the same will start reflecting on the router. As a result, clients 
> connecting to healthy/faster name nodes will also slow down, since the same 
> RPC queue is maintained for all calls at the router layer. Essentially the 
> same IPC thread pool is used by the router to connect to all name nodes.
> Currently the router uses one single RPC queue for all calls. Let's discuss 
> how we can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from the current call queue, immediately identify 
> the downstream name node, and maintain a separate queue for each underlying 
> name node. Another, simpler way is to maintain some sort of rate limiter 
> configured for each name node and let routers drop/reject/send error 
> requests after a certain threshold.
> This won't be a simple change, as the router's 'Server' layer would need 
> redesign and implementation. Currently this layer is the same as the name 
> node's.
> Opening this ticket to discuss, design and implement this feature.
>  
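The rate-limiter idea in the description above can be illustrated with per-nameservice semaphores. This is a hedged toy sketch, not the actual Router code; the class and method names (FairnessSketch, tryInvoke) are invented for illustration:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

public class FairnessSketch {
  // One permit pool per downstream nameservice, sized from configuration.
  private final Map<String, Semaphore> permitsPerNs = new ConcurrentHashMap<>();

  public FairnessSketch(Map<String, Integer> handlersPerNs) {
    handlersPerNs.forEach((ns, n) -> permitsPerNs.put(ns, new Semaphore(n)));
  }

  /**
   * Try the call against the given nameservice; reject immediately instead of
   * queueing when that nameservice's handlers are saturated, so a slow
   * downstream cluster cannot exhaust the shared pool.
   */
  public boolean tryInvoke(String ns, Runnable call) {
    Semaphore s = permitsPerNs.get(ns);
    if (s == null || !s.tryAcquire()) {
      return false;  // a real router would throw an overload exception here
    }
    try {
      call.run();
      return true;
    } finally {
      s.release();
    }
  }
}
```

The key design point matching the discussion: acquisition is non-blocking (tryAcquire), so backpressure from one overloaded name node turns into fast rejections for that nameservice only, rather than queue growth shared by all callers.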




-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2020-11-12 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17230612#comment-17230612
 ] 

Yiqun Lin edited comment on HDFS-14090 at 11/12/20, 2:38 PM:
-

Hi [~fengnanli], three nits for the latest patch:

1. It would look better to rename dfs.federation.router.fairness.handler.count.NS to 
dfs.federation.router.fairness.handler.count.EXAMPLENAMESERVICE.

2. {noformat}
smaller or equal to the total number of router handlers; if the special
  *concurrent* is not specified, the sum of all configured values must be
  strictly smaller than the router handlers thus the left will be allocated
  to the concurrent calls.
{noformat}
Could we mention the related setting: "strictly smaller than the router handlers 
(dfs.federation.router.handler.count)"...

3. Could you fix the related failed unit test?
|hadoop.hdfs.server.federation.router.TestRBFConfigFields|

Others look good to me.
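For context, the naming discussed in nit #1 would look roughly like the following hdfs-rbf-site.xml fragment; the nameservice names (ns0, ns1) and the counts are illustrative assumptions, not values from the patch:

```xml
<!-- Hedged sketch; property names follow the pattern discussed in this
     thread, values and nameservice names are illustrative. -->
<property>
  <name>dfs.federation.router.handler.count</name>
  <value>20</value>
</property>
<property>
  <name>dfs.federation.router.fairness.handler.count.ns0</name>
  <value>8</value>
</property>
<property>
  <name>dfs.federation.router.fairness.handler.count.ns1</name>
  <value>8</value>
</property>
<!-- No explicit "concurrent" entry: the remaining 20 - 8 - 8 = 4 handlers
     are left for the special concurrent pool, per the javadoc quoted in
     the comment above. -->
```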


was (Author: linyiqun):
Hi [~fengnanli], two nits for the latest patch:
{noformat}
smaller or equal to the total number of router handlers; if the special
  *concurrent* is not specified, the sum of all configured values must be
  strictly smaller than the router handlers thus the left will be allocated
  to the concurrent calls.
{noformat}
Can we mention related setting ''strictly smaller than the router handlers 
(dfs.federation.router.handler.count)...

Can you fix related failed unit test?
|hadoop.hdfs.server.federation.router.TestRBFConfigFields|

Others look good to me.







[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2020-11-12 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17230612#comment-17230612
 ] 

Yiqun Lin commented on HDFS-14090:
--

Hi [~fengnanli], two nits for the latest patch:
{noformat}
smaller or equal to the total number of router handlers; if the special
  *concurrent* is not specified, the sum of all configured values must be
  strictly smaller than the router handlers thus the left will be allocated
  to the concurrent calls.
{noformat}
Could we mention the related setting: "strictly smaller than the router handlers 
(dfs.federation.router.handler.count)"...

Could you fix the related failed unit test?

Others look good to me.







[jira] [Comment Edited] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2020-11-12 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17230612#comment-17230612
 ] 

Yiqun Lin edited comment on HDFS-14090 at 11/12/20, 1:07 PM:
-

Hi [~fengnanli], two nits for the latest patch:
{noformat}
smaller or equal to the total number of router handlers; if the special
  *concurrent* is not specified, the sum of all configured values must be
  strictly smaller than the router handlers thus the left will be allocated
  to the concurrent calls.
{noformat}
Could we mention the related setting: "strictly smaller than the router handlers 
(dfs.federation.router.handler.count)"...

Could you fix the related failed unit test?
|hadoop.hdfs.server.federation.router.TestRBFConfigFields|

Others look good to me.


was (Author: linyiqun):
Hi [~fengnanli], two nits for the latest patch:
{noformat}
smaller or equal to the total number of router handlers; if the special
  *concurrent* is not specified, the sum of all configured values must be
  strictly smaller than the router handlers thus the left will be allocated
  to the concurrent calls.
{noformat}
Can we mention related setting ''strictly smaller than the router handlers 
(dfs.federation.router.handler.count)...

Can you fix related failed unit test?

Others look good to me.







[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2020-11-11 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17230347#comment-17230347
 ] 

Yiqun Lin commented on HDFS-14090:
--

Sounds good to me; let's address comment #2, [~fengnanli].

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: Fengnan Li
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, 
> HDFS-14090.012.patch, HDFS-14090.013.patch, HDFS-14090.014.patch, 
> HDFS-14090.015.patch, HDFS-14090.016.patch, HDFS-14090.017.patch, 
> HDFS-14090.018.patch, HDFS-14090.019.patch, HDFS-14090.020.patch, 
> HDFS-14090.021.patch, HDFS-14090.022.patch, HDFS-14090.023.patch, RBF_ 
> Isolation design.pdf
>
>
> Router is a gateway to the underlying name nodes. Gateway architectures should 
> help minimize the impact on clients of connecting to healthy vs. unhealthy 
> clusters.
> For example, if there are 2 name nodes downstream and one of them is 
> heavily loaded with calls spiking rpc queue times, back pressure will make the 
> same start reflecting on the router. As a result, clients 
> connecting to healthy/faster name nodes will also slow down, since the same rpc 
> queue is maintained for all calls at the router layer. Essentially, the same IPC 
> thread pool is used by the router to connect to all name nodes.
> Currently the router uses one single rpc queue for all calls. Let's discuss how we 
> can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from current call queue, immediately identify 
> downstream name node and maintain a separate queue for each underlying name 
> node. Another simpler way is to maintain some sort of rate limiter configured 
> for each name node and let routers drop/reject/send error requests after 
> certain threshold. 
> This won’t be a simple change as router’s ‘Server’ layer would need redesign 
> and implementation. Currently this layer is the same as name node.
> Opening this ticket to discuss, design and implement this feature.
>  
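One way to realize the per-nameservice isolation discussed above is to guard each downstream nameservice with its own pool of handler permits, so a slow nameservice can exhaust only its own permits rather than the shared pool. The sketch below is illustrative only; the class and method names (NsFairnessController, acquirePermit) are invented, not the actual HDFS-14090 implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/**
 * Hypothetical sketch: each downstream nameservice gets a dedicated
 * pool of handler permits. A request for an overloaded nameservice
 * fails fast instead of occupying a shared handler.
 */
class NsFairnessController {
  private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

  NsFairnessController(Map<String, Integer> handlerCounts) {
    // One semaphore per nameservice, sized by its configured handler count.
    handlerCounts.forEach((ns, n) -> permits.put(ns, new Semaphore(n)));
  }

  /** Returns true if a handler permit for the nameservice was acquired. */
  boolean acquirePermit(String ns) {
    Semaphore s = permits.get(ns);
    return s != null && s.tryAcquire();
  }

  /** Returns the permit after the downstream call completes. */
  void releasePermit(String ns) {
    Semaphore s = permits.get(ns);
    if (s != null) {
      s.release();
    }
  }
}
```

With this shape, a nameservice whose permits are exhausted causes an immediate reject (which the router can translate into a retriable error) instead of queue growth that slows every client.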






[jira] [Comment Edited] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2020-11-05 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227189#comment-17227189
 ] 

Yiqun Lin edited comment on HDFS-14090 at 11/6/20, 6:58 AM:


Hi [~fengnanli], some minor comments from me:


 1. I see that we introduce CONCURRENT_NS here for concurrent calls; why not acquire a 
permit from the corresponding ns instead?

2. The current description of the hdfs-rbf-default.xml setting could say more. At 
least, we should mention: 
 * The setting name for configuring the handler count for each ns, including the 
CONCURRENT_NS ns.
 * The sum of the dedicated handler counts should be less than the value of 
dfs.federation.router.handler.count.

3. It would be better to document this improvement in HDFSRouterFederation.md.

Comments #2 and #3 can be addressed in a follow-up JIRA,  :).


was (Author: linyiqun):
Hi [~fengnanli], some minor comments from me:


 1. I see that we introduce CONCURRENT_NS here for concurrent calls; why not acquire a 
permit from the corresponding ns instead?

2. The current description of the hdfs-rbf-default.xml setting could say more. At 
least, we should mention: 
 * The setting name for configuring the handler count for each ns, including the 
CONCURRENT_NS ns.
 * The sum of the dedicated handler counts should be less than the value of 
dfs.federation.router.handler.count.

3. It would be better to document this improvement in HDFSRouterFederation.md.

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: Fengnan Li
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, 
> HDFS-14090.012.patch, HDFS-14090.013.patch, HDFS-14090.014.patch, 
> HDFS-14090.015.patch, HDFS-14090.016.patch, HDFS-14090.017.patch, 
> HDFS-14090.018.patch, HDFS-14090.019.patch, HDFS-14090.020.patch, 
> HDFS-14090.021.patch, HDFS-14090.022.patch, HDFS-14090.023.patch, RBF_ 
> Isolation design.pdf
>
>
> Router is a gateway to the underlying name nodes. Gateway architectures should 
> help minimize the impact on clients of connecting to healthy vs. unhealthy 
> clusters.
> For example, if there are 2 name nodes downstream and one of them is 
> heavily loaded with calls spiking rpc queue times, back pressure will make the 
> same start reflecting on the router. As a result, clients 
> connecting to healthy/faster name nodes will also slow down, since the same rpc 
> queue is maintained for all calls at the router layer. Essentially, the same IPC 
> thread pool is used by the router to connect to all name nodes.
> Currently the router uses one single rpc queue for all calls. Let's discuss how we 
> can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from current call queue, immediately identify 
> downstream name node and maintain a separate queue for each underlying name 
> node. Another simpler way is to maintain some sort of rate limiter configured 
> for each name node and let routers drop/reject/send error requests after 
> certain threshold. 
> This won’t be a simple change as router’s ‘Server’ layer would need redesign 
> and implementation. Currently this layer is the same as name node.
> Opening this ticket to discuss, design and implement this feature.
>  






[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2020-11-05 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227189#comment-17227189
 ] 

Yiqun Lin commented on HDFS-14090:
--

Hi [~fengnanli], some minor comments from me:


 1. I see that we introduce CONCURRENT_NS here for concurrent calls; why not acquire a 
permit from the corresponding ns instead?

2. The current description of the hdfs-rbf-default.xml setting could say more. At 
least, we should mention: 
 * The setting name for configuring the handler count for each ns, including the 
CONCURRENT_NS ns.
 * The sum of the dedicated handler counts should be less than the value of 
dfs.federation.router.handler.count.

3. It would be better to document this improvement in HDFSRouterFederation.md.

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: Fengnan Li
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, 
> HDFS-14090.012.patch, HDFS-14090.013.patch, HDFS-14090.014.patch, 
> HDFS-14090.015.patch, HDFS-14090.016.patch, HDFS-14090.017.patch, 
> HDFS-14090.018.patch, HDFS-14090.019.patch, HDFS-14090.020.patch, 
> HDFS-14090.021.patch, HDFS-14090.022.patch, HDFS-14090.023.patch, RBF_ 
> Isolation design.pdf
>
>
> Router is a gateway to the underlying name nodes. Gateway architectures should 
> help minimize the impact on clients of connecting to healthy vs. unhealthy 
> clusters.
> For example, if there are 2 name nodes downstream and one of them is 
> heavily loaded with calls spiking rpc queue times, back pressure will make the 
> same start reflecting on the router. As a result, clients 
> connecting to healthy/faster name nodes will also slow down, since the same rpc 
> queue is maintained for all calls at the router layer. Essentially, the same IPC 
> thread pool is used by the router to connect to all name nodes.
> Currently the router uses one single rpc queue for all calls. Let's discuss how we 
> can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from current call queue, immediately identify 
> downstream name node and maintain a separate queue for each underlying name 
> node. Another simpler way is to maintain some sort of rate limiter configured 
> for each name node and let routers drop/reject/send error requests after 
> certain threshold. 
> This won’t be a simple change as router’s ‘Server’ layer would need redesign 
> and implementation. Currently this layer is the same as name node.
> Opening this ticket to discuss, design and implement this feature.
>  






[jira] [Commented] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit

2020-11-03 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17225839#comment-17225839
 ] 

Yiqun Lin commented on HDFS-15651:
--

Thanks [~Aiphag0] for the quick fix.

LGTM. +1.

> Client could not obtain block when DN CommandProcessingThread exit
> --
>
> Key: HDFS-15651
> URL: https://issues.apache.org/jira/browse/HDFS-15651
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Aiphago
>Priority: Major
> Attachments: HDFS-15651.001.patch, HDFS-15651.002.patch, 
> HDFS-15651.patch
>
>
> In our cluster, we applied the HDFS-14997 improvement.
>  We found one case where the CommandProcessingThread exits due to an OOM error. 
> The OOM error was caused by an abnormal application running on this DN node.
> {noformat}
> 2020-10-18 10:27:12,604 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor 
> encountered fatal exception and exit.
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
> {noformat}
> The main point here is that a crash of the CommandProcessingThread has a very 
> bad impact: none of the NN response commands will be processed on the DN side.
> We enabled block tokens for data access, but the DN command 
> DNA_ACCESSKEYUPDATE is not processed in time by the DN. We then see lots of 
> Sasl errors due to key expiration in the DN log:
> {noformat}
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, 
> userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the 
> required block key (keyID=xxx) doesn't exist.]
> {noformat}
>  
> As for the client-side impact, our users receive lots of 'could not obtain 
> block' errors with BlockMissingException.
> CommandProcessingThread is a critical thread; it should always be running.
> {code:java}
>   /**
>* CommandProcessingThread that process commands asynchronously.
>*/
>   class CommandProcessingThread extends Thread {
> private final BPServiceActor actor;
> private final BlockingQueue queue;
> ...
> @Override
> public void run() {
>   try {
> processQueue();
>   } catch (Throwable t) {
> LOG.error("{} encountered fatal exception and exit.", getName(), t);  
>  <=== should not exit this thread
>   }
> }
> {code}
> Once an unexpected error happens, a better handling would be to either:
>  * catch the exception, deal with the error appropriately, and let 
> processQueue continue to run, or
>  * exit the DN process so that an admin user can investigate.
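The first handling option above (catch, deal with the error, and let the queue-processing loop continue) can be sketched as a drain loop that survives per-command failures. This is an illustrative sketch under invented names (ResilientProcessor), not the actual BPServiceActor code:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of a command-processing loop that keeps running after a
 * command fails, so one bad command (e.g. an OOM while dispatching a
 * delete) cannot stop later commands such as DNA_ACCESSKEYUPDATE.
 */
class ResilientProcessor implements Runnable {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private volatile boolean running = true;

  void submit(Runnable command) { queue.add(command); }
  void shutdown() { running = false; }

  @Override
  public void run() {
    while (running) {
      try {
        Runnable cmd = queue.poll(100, TimeUnit.MILLISECONDS);
        if (cmd != null) {
          cmd.run();
        }
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return;  // exit cleanly on interrupt
      } catch (Throwable t) {
        // Log and keep the loop alive instead of letting the thread die.
        System.err.println("Command failed: " + t);
      }
    }
  }
}
```

The catch clause is inside the loop, so the thread stays alive; the original code catches outside processQueue(), which is why a single Throwable kills the whole thread.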






[jira] [Commented] (HDFS-15294) Federation balance tool

2020-11-02 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17224606#comment-17224606
 ] 

Yiqun Lin commented on HDFS-15294:
--

Hi [~coconut_icecream], since FedBalance is a completely new feature and hasn't 
been released in a Hadoop version yet, I'm not sure whether there are other potential 
issues. I'd prefer to backport this feature later, once it has proven stable 
enough after release.

> Federation balance tool
> ---
>
> Key: HDFS-15294
> URL: https://issues.apache.org/jira/browse/HDFS-15294
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, 
> HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, 
> HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, 
> HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf
>
>
> This jira introduces a new HDFS federation balance tool to balance data 
> across different federation namespaces. It uses Distcp to copy data from the 
> source path to the target path.
> The process is:
>  1. Use distcp and snapshot diff to sync data between src and dst until they 
> are the same.
>  2. Update mount table in Router if we specified RBF mode.
>  3. Deal with src data, move to trash, delete or skip them.
> The design of fedbalance tool comes from the discussion in HDFS-15087.
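The three-step process above can be sketched as a small stage machine. The names here (BalanceStage, BalanceProcedure.next) are hypothetical, not the actual fedbalance classes:

```java
/** Hypothetical stages of one balance job, following the three steps above. */
enum BalanceStage { SYNC_DATA, UPDATE_MOUNT_TABLE, HANDLE_SOURCE, DONE }

class BalanceProcedure {
  /** Returns the stage that follows s; the mount-table step applies only in RBF mode. */
  static BalanceStage next(BalanceStage s, boolean rbfMode) {
    switch (s) {
      case SYNC_DATA:
        // Step 1 done: update the Router mount table only if RBF mode was specified.
        return rbfMode ? BalanceStage.UPDATE_MOUNT_TABLE : BalanceStage.HANDLE_SOURCE;
      case UPDATE_MOUNT_TABLE:
        // Step 2 done: go deal with the source data (trash, delete, or skip).
        return BalanceStage.HANDLE_SOURCE;
      case HANDLE_SOURCE:
      default:
        return BalanceStage.DONE;
    }
  }
}
```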






[jira] [Comment Edited] (HDFS-15640) Add diff threshold to FedBalance

2020-10-26 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221089#comment-17221089
 ] 

Yiqun Lin edited comment on HDFS-15640 at 10/27/20, 2:48 AM:
-

Committed this to trunk.
Thanks [~LiJinglun] for the contribution.
BTW, [~LiJinglun], since HDFS-15294 is already a closed feature JIRA, next time we 
should add a related link to the HDFS-15294 JIRA instead of reopening it once we find a 
further bug or enhancement for FedBalance. 


was (Author: linyiqun):
Committed this to trunk.
Thanks [~LiJinglun] for the contribution.

> Add diff threshold to FedBalance
> 
>
> Key: HDFS-15640
> URL: https://issues.apache.org/jira/browse/HDFS-15640
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch, 
> HDFS-15640.003.patch, HDFS-15640.004.patch
>
>
> Currently, the DistCpProcedure must submit distcp rounds one after another until 
> there is no diff before it can go to the final distcp stage. This condition is very 
> strict. During the incremental copy stage, if the diff size is within the given 
> threshold, we don't need to wait for a zero diff; we can start the 
> final distcp directly.
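The threshold idea above amounts to relaxing the "no diff" condition to "diff at or below a configured threshold". A minimal illustrative sketch follows (not the actual DistCpProcedure code; the names are invented):

```java
class DiffThresholdCheck {
  /** True when the snapshot diff is already small enough to start the final distcp. */
  static boolean readyForFinalDistcp(int diffSize, int diffThreshold) {
    // A threshold of 0 preserves the old strict behaviour: wait until there is no diff.
    return diffSize <= diffThreshold;
  }
}
```

Keeping 0 as the default means the change is backward-compatible: existing jobs still wait for a zero diff unless a threshold is explicitly configured.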






[jira] [Updated] (HDFS-15640) Add diff threshold to FedBalance

2020-10-26 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15640:
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed this to trunk.
Thanks [~LiJinglun] for the contribution.

> Add diff threshold to FedBalance
> 
>
> Key: HDFS-15640
> URL: https://issues.apache.org/jira/browse/HDFS-15640
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch, 
> HDFS-15640.003.patch, HDFS-15640.004.patch
>
>
> Currently, the DistCpProcedure must submit distcp rounds one after another until 
> there is no diff before it can go to the final distcp stage. This condition is very 
> strict. During the incremental copy stage, if the diff size is within the given 
> threshold, we don't need to wait for a zero diff; we can start the 
> final distcp directly.






[jira] [Resolved] (HDFS-15294) Federation balance tool

2020-10-26 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin resolved HDFS-15294.
--
Resolution: Fixed

> Federation balance tool
> ---
>
> Key: HDFS-15294
> URL: https://issues.apache.org/jira/browse/HDFS-15294
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, 
> HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, 
> HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, 
> HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf
>
>
> This jira introduces a new HDFS federation balance tool to balance data 
> across different federation namespaces. It uses Distcp to copy data from the 
> source path to the target path.
> The process is:
>  1. Use distcp and snapshot diff to sync data between src and dst until they 
> are the same.
>  2. Update mount table in Router if we specified RBF mode.
>  3. Deal with src data, move to trash, delete or skip them.
> The design of fedbalance tool comes from the discussion in HDFS-15087.






[jira] [Updated] (HDFS-15640) Add diff threshold to FedBalance

2020-10-26 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15640:
-
Summary: Add diff threshold to FedBalance  (was: Add snapshot diff 
threshold to FedBalance)

> Add diff threshold to FedBalance
> 
>
> Key: HDFS-15640
> URL: https://issues.apache.org/jira/browse/HDFS-15640
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch, 
> HDFS-15640.003.patch, HDFS-15640.004.patch
>
>
> Currently, the DistCpProcedure must submit distcp rounds one after another until 
> there is no diff before it can go to the final distcp stage. This condition is very 
> strict. During the incremental copy stage, if the diff size is within the given 
> threshold, we don't need to wait for a zero diff; we can start the 
> final distcp directly.






[jira] [Updated] (HDFS-15640) Add snapshot diff threshold to FedBalance

2020-10-26 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15640:
-
Description: Currently in the DistCpProcedure it must submit distcp round 
by round until there is no diff to go to the final distcp stage. The condition 
is very strict. During incremental copy stage, if the diff size is under the 
given threshold scope then we don't need to wait for no diff. We can start the 
final distcp directly.  (was: Currently in the DistCpProcedure it must submit 
distcp round by round until there is no diff to go to the final distcp stage. 
The condition is very strict. If the distcp could finish in an acceptable 
period then we don't need to wait for no diff. For example if 3 consecutive 
distcp jobs all finish within 10 minutes then we can predict the final distcp 
could also finish within 10 minutes. So we can start the final distcp directly.)

> Add snapshot diff threshold to FedBalance
> -
>
> Key: HDFS-15640
> URL: https://issues.apache.org/jira/browse/HDFS-15640
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch, 
> HDFS-15640.003.patch, HDFS-15640.004.patch
>
>
> Currently, the DistCpProcedure must submit distcp rounds one after another until 
> there is no diff before it can go to the final distcp stage. This condition is very 
> strict. During the incremental copy stage, if the diff size is within the given 
> threshold, we don't need to wait for a zero diff; we can start the 
> final distcp directly.






[jira] [Updated] (HDFS-15640) Add snapshot diff threshold to FedBalance

2020-10-26 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15640:
-
Summary: Add snapshot diff threshold to FedBalance  (was: RBF: Add fast 
distcp threshold to FedBalance.)

> Add snapshot diff threshold to FedBalance
> -
>
> Key: HDFS-15640
> URL: https://issues.apache.org/jira/browse/HDFS-15640
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch, 
> HDFS-15640.003.patch, HDFS-15640.004.patch
>
>
> Currently, the DistCpProcedure must submit distcp rounds one after another until 
> there is no diff before it can go to the final distcp stage. This condition is very 
> strict. If the distcp could finish in an acceptable period, then we don't need 
> to wait for no diff. For example if 3 consecutive distcp jobs all finish 
> within 10 minutes then we can predict the final distcp could also finish 
> within 10 minutes. So we can start the final distcp directly.






[jira] [Comment Edited] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit

2020-10-26 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220794#comment-17220794
 ] 

Yiqun Lin edited comment on HDFS-15651 at 10/26/20, 4:21 PM:
-

Thanks for the comments, [~hexiaoqiao].
{quote}Catch the error and loop forever could not resolve this issue in my 
opinion because DataNode still service but without the correct blockToken key.
{quote}
The block token key is updated every keyUpdateInterval 
(dfs.block.access.key.update.interval). Once we recover the 
CommandProcessingThread, the DN will get the new key from the NN within the next 
keyUpdateInterval (10 hours by default).

[~Aiphag0], feel free to attach your fix here, :).


was (Author: linyiqun):
Thanks for the comments, [~hexiaoqiao].
{quote}Catch the error and loop forever could not resolve this issue in my 
opinion because DataNode still service but without the correct blockToken key.
{quote}
The block token key is updated every keyUpdateInterval. Once we recover 
the CommandProcessingThread, the DN will get the new key from the NN within the next 
keyUpdateInterval (10 hours by default).

[~Aiphag0], feel free to attach your fix here, :).

> Client could not obtain block when DN CommandProcessingThread exit
> --
>
> Key: HDFS-15651
> URL: https://issues.apache.org/jira/browse/HDFS-15651
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Priority: Major
>
> In our cluster, we applied the HDFS-14997 improvement.
>  We found one case where the CommandProcessingThread exits due to an OOM error. 
> The OOM error was caused by an abnormal application running on this DN node.
> {noformat}
> 2020-10-18 10:27:12,604 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor 
> encountered fatal exception and exit.
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
> {noformat}
> The main point here is that a crash of the CommandProcessingThread has a very 
> bad impact: none of the NN response commands will be processed on the DN side.
> We enabled block tokens for data access, but the DN command 
> DNA_ACCESSKEYUPDATE is not processed in time by the DN. We then see lots of 
> Sasl errors due to key expiration in the DN log:
> {noformat}
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, 
> userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the 
> required block key (keyID=xxx) doesn't exist.]
> {noformat}
>  
> As for the client-side impact, our users receive lots of 'could not obtain 
> block' errors with BlockMissingException.
> CommandProcessingThread is a critical thread; it should always be running.
> {code:java}
>   /**
>* CommandProcessingThread that process commands asynchronously.
>*/
>   class CommandProcessingThread extends Thread {
> private final BPServiceActor actor;
> private final BlockingQueue queue;
> ...
> @Override
> public void run() {
>   try {
> processQueue();
>   } catch (Throwable t) {
> 

[jira] [Commented] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit

2020-10-26 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220794#comment-17220794
 ] 

Yiqun Lin commented on HDFS-15651:
--

Thanks for the comments, [~hexiaoqiao].
{quote}Catch the error and loop forever could not resolve this issue in my 
opinion because DataNode still service but without the correct blockToken key.
{quote}
The block token key is updated every keyUpdateInterval. Once we recover 
the CommandProcessingThread, the DN will get the new key from the NN within the next 
keyUpdateInterval (10 hours by default).

[~Aiphag0], feel free to attach your fix here, :).

> Client could not obtain block when DN CommandProcessingThread exit
> --
>
> Key: HDFS-15651
> URL: https://issues.apache.org/jira/browse/HDFS-15651
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Priority: Major
>
> In our cluster, we applied the HDFS-14997 improvement.
>  We found one case where the CommandProcessingThread exits due to an OOM error. 
> The OOM error was caused by an abnormal application running on this DN node.
> {noformat}
> 2020-10-18 10:27:12,604 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor 
> encountered fatal exception and exit.
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
> {noformat}
> The main point here is that a crash of the CommandProcessingThread has a very 
> bad impact: none of the NN response commands will be processed on the DN side.
> We enabled block tokens for data access, but the DN command 
> DNA_ACCESSKEYUPDATE is not processed in time by the DN. We then see lots of 
> Sasl errors due to key expiration in the DN log:
> {noformat}
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, 
> userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the 
> required block key (keyID=xxx) doesn't exist.]
> {noformat}
>  
> As for the client-side impact, our users receive lots of 'could not obtain 
> block' errors with BlockMissingException.
> CommandProcessingThread is a critical thread; it should always be running.
> {code:java}
>   /**
>* CommandProcessingThread that process commands asynchronously.
>*/
>   class CommandProcessingThread extends Thread {
> private final BPServiceActor actor;
> private final BlockingQueue queue;
> ...
> @Override
> public void run() {
>   try {
> processQueue();
>   } catch (Throwable t) {
> LOG.error("{} encountered fatal exception and exit.", getName(), t);  
>  <=== should not exit this thread
>   }
> }
> {code}
> Once an unexpected error happens, better handling would be to:
>  * catch the exception, deal with the error appropriately, and let 
> processQueue continue to run,
>  or
>  * exit the DN process to let the admin investigate.
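As a standalone illustration of the first option (catch the error and keep the processing loop alive), here is a minimal sketch; the class and all names are hypothetical, not the actual Hadoop code:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch, not the HDFS code: a command-processing thread that
// survives a failing command instead of exiting on the first Throwable.
public class ResilientProcessor {
    public static void main(String[] args) throws Exception {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        CountDownLatch done = new CountDownLatch(1);

        Thread processor = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    queue.take().run();          // process one NN command
                } catch (InterruptedException ie) {
                    return;                      // exit only on deliberate shutdown
                } catch (Throwable t) {
                    // A single bad command must not kill the thread.
                    System.out.println("command failed, continuing: " + t.getMessage());
                }
            }
        });
        processor.start();

        queue.put(() -> { throw new RuntimeException("boom"); }); // faulty command
        queue.put(() -> {                                         // next one still runs
            System.out.println("second command processed");
            done.countDown();
        });

        done.await();
        processor.interrupt();
        processor.join();
    }
}
```

The second command is still processed after the first one throws, which is the behavior the first bullet asks for.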



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Comment Edited] (HDFS-15640) RBF: Add fast distcp threshold to FedBalance.

2020-10-26 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220611#comment-17220611
 ] 

Yiqun Lin edited comment on HDFS-15640 at 10/26/20, 10:33 AM:
--

Latest patch LGTM, +1.

Will commit this tomorrow if there are no further comments from others.


was (Author: linyiqun):
Latest patch LGTM, +1.

Will commit this tomorrow once there is further comments from others.

> RBF: Add fast distcp threshold to FedBalance.
> -
>
> Key: HDFS-15640
> URL: https://issues.apache.org/jira/browse/HDFS-15640
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch, 
> HDFS-15640.003.patch, HDFS-15640.004.patch
>
>
> Currently the DistCpProcedure must submit distcp jobs round by round until 
> there is no diff before going to the final distcp stage. This condition is very 
> strict. If the distcp can finish in an acceptable period, then we don't need 
> to wait for an empty diff. For example, if 3 consecutive distcp jobs all finish 
> within 10 minutes, then we can predict that the final distcp will also finish 
> within 10 minutes, so we can start the final distcp directly.






[jira] [Commented] (HDFS-15640) RBF: Add fast distcp threshold to FedBalance.

2020-10-26 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220611#comment-17220611
 ] 

Yiqun Lin commented on HDFS-15640:
--

Latest patch LGTM, +1.

Will commit this tomorrow if there are no further comments from others.

> RBF: Add fast distcp threshold to FedBalance.
> -
>
> Key: HDFS-15640
> URL: https://issues.apache.org/jira/browse/HDFS-15640
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch, 
> HDFS-15640.003.patch, HDFS-15640.004.patch
>
>
> Currently the DistCpProcedure must submit distcp jobs round by round until 
> there is no diff before going to the final distcp stage. This condition is very 
> strict. If the distcp can finish in an acceptable period, then we don't need 
> to wait for an empty diff. For example, if 3 consecutive distcp jobs all finish 
> within 10 minutes, then we can predict that the final distcp will also finish 
> within 10 minutes, so we can start the final distcp directly.






[jira] [Updated] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit

2020-10-26 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15651:
-
Description: 
In our cluster, we applied the HDFS-14997 improvement.
 We found one case where the CommandProcessingThread exits due to an OOM error. The OOM 
error was caused by an abnormal application running on this DN node.
{noformat}
2020-10-18 10:27:12,604 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
Command processor encountered fatal exception and exit.
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
{noformat}
The main point here is that a crashed CommandProcessingThread has a very 
bad impact: none of the commands returned by the NN are processed on the DN side.

We enabled block tokens for data access, but the DN command 
DNA_ACCESSKEYUPDATE was not processed in time by the DN. We then see lots of 
SASL errors due to key expiration in the DN log:
{noformat}
javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
[Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, 
userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the 
required block key (keyID=xxx) doesn't exist.]
{noformat}
 

On the client side, our users receive lots of 'could not obtain 
block' errors with BlockMissingException.

CommandProcessingThread is a critical thread; it should always be running.
{code:java}
  /**
   * CommandProcessingThread that process commands asynchronously.
   */
  class CommandProcessingThread extends Thread {
private final BPServiceActor actor;
private final BlockingQueue queue;

...

@Override
public void run() {
  try {
processQueue();
  } catch (Throwable t) {
LOG.error("{} encountered fatal exception and exit.", getName(), t);   
<=== should not exit this thread
  }
}
{code}
Once an unexpected error happens, better handling would be to:
 * catch the exception, deal with the error appropriately, and let processQueue 
continue to run,
 or
 * exit the DN process to let the admin investigate.

  was:
In our cluster, we applied the HDFS-14997 improvement.
 We find one case that CommandProcessingThread will exit due to OOM error. OOM 
error was caused by our one abnormal application that running on this DN node.
{noformat}
2020-10-18 10:27:12,604 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
Command processor encountered fatal exception and exit.
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
at 

[jira] [Updated] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit

2020-10-26 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15651:
-
Description: 
In our cluster, we applied the HDFS-14997 improvement.
 We found one case where the CommandProcessingThread exits due to an OOM error. The OOM 
error was caused by an abnormal application running on this DN node.
{noformat}
2020-10-18 10:27:12,604 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
Command processor encountered fatal exception and exit.
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
{noformat}
The main point here is that a crashed CommandProcessingThread has a very 
bad impact: none of the commands returned by the NN are processed on the DN side.

We enabled block tokens for data access, but the DN command 
DNA_ACCESSKEYUPDATE was not processed in time by the DN. We then see lots of 
SASL errors due to key expiration in the DN log:
{noformat}
javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
[Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, 
userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the 
required block key (keyID=xxx) doesn't exist.]
{noformat}
 

On the client side, our users receive lots of 'could not obtain 
block' errors with BlockMissingException.

CommandProcessingThread is a critical thread; it should always be running.
{code:java}
  /**
   * CommandProcessingThread that process commands asynchronously.
   */
  class CommandProcessingThread extends Thread {
private final BPServiceActor actor;
private final BlockingQueue queue;

...

@Override
public void run() {
  try {
processQueue();
  } catch (Throwable t) {
LOG.error("{} encountered fatal exception and exit.", getName(), t);   
<=== should not exit this thread
  }
}
{code}
Once an unexpected error happens, better handling would be to:
 * catch the exception,
 or
 * exit the DN process to let the admin investigate.

  was:
In our cluster, we applied the HDFS-14997 improvement.
 We find one case that CommandProcessingThread will exit due to OOM error. OOM 
error was caused by our one abnormal application that running on this DN node.
{noformat}
2020-10-18 10:27:12,604 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
Command processor encountered fatal exception and exit.
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
at 

[jira] [Updated] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit

2020-10-26 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15651:
-
Description: 
In our cluster, we applied the HDFS-14997 improvement.
 We found one case where the CommandProcessingThread exits due to an OOM error. The OOM 
error was caused by an abnormal application running on this DN node.
{noformat}
2020-10-18 10:27:12,604 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
Command processor encountered fatal exception and exit.
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
{noformat}
The main point here is that a crashed CommandProcessingThread has a very 
bad impact: none of the commands returned by the NN are processed on the DN side.

We enabled block tokens for data access, but the DN command 
DNA_ACCESSKEYUPDATE was not processed in time by the DN. We then see lots of 
SASL errors due to key expiration in the DN log:
{noformat}
javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
[Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, 
userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the 
required block key (keyID=xxx) doesn't exist.]
{noformat}
 

On the client side, our users receive lots of 'could not obtain 
block' errors with BlockMissingException.

CommandProcessingThread is a critical thread; it should always be running. Once 
an unexpected error happens, better handling would be to:
 * catch the exception,
 or
 * exit the DN process to let the admin investigate.

  was:
In our cluster, we applied the HDFS-14997 improvement.
 We find one case that CommandProcessingThread will exit due to OOM error. OOM 
error was caused by our one abnormal application that running on this DN node.
{noformat}
2020-10-18 10:27:12,604 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
Command processor encountered fatal exception and exit.
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
at 

[jira] [Created] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit

2020-10-26 Thread Yiqun Lin (Jira)
Yiqun Lin created HDFS-15651:


 Summary: Client could not obtain block when DN 
CommandProcessingThread exit
 Key: HDFS-15651
 URL: https://issues.apache.org/jira/browse/HDFS-15651
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Yiqun Lin


In our cluster, we applied the HDFS-14997 improvement.
 We found one case where the CommandProcessingThread exits due to an OOM error. The OOM 
error was caused by an abnormal application running on this DN node.
{noformat}
2020-10-18 10:27:12,604 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
Command processor encountered fatal exception and exit.
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
{noformat}
The main point here is that a crashed CommandProcessingThread has a very 
bad impact: none of the commands returned by the NN are processed on the DN side.

We enabled block tokens for data access, but the DN command 
DNA_ACCESSKEYUPDATE was not processed in time. We then see lots of SASL 
errors due to key expiration:
{noformat}
javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
[Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, 
userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the 
required block key (keyID=xxx) doesn't exist.]
{noformat}
 

On the client side, our users receive lots of 'could not obtain 
block' errors with BlockMissingException.

CommandProcessingThread is a critical thread; it should always be running. Once 
an unexpected error happens, better handling would be to:
 * catch the exception,
 or
 * exit the DN process to let the admin investigate.






[jira] [Commented] (HDFS-15640) RBF: Add fast distcp threshold to FedBalance.

2020-10-25 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220448#comment-17220448
 ] 

Yiqun Lin commented on HDFS-15640:
--

Thanks for updating the patch, [~LiJinglun]! Looks great now.

I caught one outdated comment:
{code:java}
+   * @return true if moving to the next stage. false if the conditions are not
+   * satisfied.
+   * @throws RetryException if the conditions are not satisfied and there is no
+   * diff needed to be copied.x
+   */
+  @VisibleForTesting
+  boolean diffDistCpStageDone() throws IOException, RetryException {
{code}
Please update
{noformat}
...and there is no diff needed to be copied..
{noformat}
to
{noformat}
...and the diff size is under the given threshold scope..
{noformat}

+1 once this is addressed.

> RBF: Add fast distcp threshold to FedBalance.
> -
>
> Key: HDFS-15640
> URL: https://issues.apache.org/jira/browse/HDFS-15640
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch, 
> HDFS-15640.003.patch
>
>
> Currently the DistCpProcedure must submit distcp jobs round by round until 
> there is no diff before going to the final distcp stage. This condition is very 
> strict. If the distcp can finish in an acceptable period, then we don't need 
> to wait for an empty diff. For example, if 3 consecutive distcp jobs all finish 
> within 10 minutes, then we can predict that the final distcp will also finish 
> within 10 minutes, so we can start the final distcp directly.






[jira] [Commented] (HDFS-15640) RBF: Add fast distcp threshold to FedBalance.

2020-10-23 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220003#comment-17220003
 ] 

Yiqun Lin commented on HDFS-15640:
--

[~LiJinglun], the latest patch looks almost good to me. 
 Minor comments from me:

*DistCpProcedure.java*
For below logic:
{code:java}
+  boolean diffDistCpStageDone() throws IOException, RetryException {
+int diffSize = getDiffSize();
+if (diffSize <= diffThreshold && (forceCloseOpenFiles
+|| !verifyOpenFiles())) {
+  return true;
+}
+if (diffSize == 0) {
+  throw new RetryException();
+} else {
+  return false;
+}
+  }
{code}
When diffSize is not 0 but is no greater than diffThreshold, and 
(forceCloseOpenFiles || !verifyOpenFiles()) returns false, we should also throw a 
RetryException.
 So the above logic would look like below, which is consistent with the original 
logic.
{code:java}
 boolean diffDistCpStageDone() throws IOException, RetryException {
  int diffSize = getDiffSize();
  if (diffSize <= diffThreshold) {
if (forceCloseOpenFiles || !verifyOpenFiles()) {
  return true;
} else {
  throw new RetryException();
}
  }

  return false;
}
{code}
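To illustrate the corrected branching, here is a standalone sketch (a hypothetical class, not the real DistCpProcedure) that encodes the three outcomes as integers: go to the final stage, retry later (the RetryException case), or run another incremental round:

```java
// Standalone sketch of the corrected diffDistCpStageDone decision logic.
// Returns +1 (go to the final distcp stage), 0 (retry later, i.e. the
// RetryException case), or -1 (diff too large: another incremental round).
public class StageDecision {
    static int decide(int diffSize, int diffThreshold,
                      boolean forceCloseOpenFiles, boolean hasOpenFiles) {
        if (diffSize <= diffThreshold) {
            // Diff is small enough; the open-files check must also pass.
            if (forceCloseOpenFiles || !hasOpenFiles) {
                return 1;   // move to the final distcp stage
            }
            return 0;       // retry: wait for open files to close
        }
        return -1;          // keep doing incremental copies
    }

    public static void main(String[] args) {
        System.out.println(decide(0, 0, false, false));   // 1: no diff, no open files
        System.out.println(decide(3, 10, false, true));   // 0: small diff, but open files block us
        System.out.println(decide(50, 10, false, false)); // -1: diff above threshold
    }
}
```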

*FedBalanceOptions.java*
Please update the description of the DIFF_THRESHOLD option; I made a minor 
rewrite to make it easier to understand.
{code:java}
final static Option DIFF_THRESHOLD = new Option("diffThreshold", true,
    "This specifies the threshold of the diff entries used in the incremental"
        + " copy stage. If the diff entry count is no greater than this"
        + " threshold and the open files check is satisfied (no open files, or"
        + " force close all open files), the fedBalance will go to the final"
        + " round of distcp. The default value is 0, which means waiting until"
        + " there is no diff.");
{code}
 

 
*HDFSFederationBalance.md*
Can we update 'Specify the threshold of the diff entries.' to 'Specify the 
threshold of the diff entries used in the incremental copy stage.'?

*TestDistCpProcedure.java*
 # Please add a cleanup operation in testDiffThreshold like the other test methods 
in this class do.
 # We can add a new method buildContext(Path src, Path dst, String mount, int 
diffThreshold) without changing the existing method. Changing the existing one 
would require unnecessary changes elsewhere.

> RBF: Add fast distcp threshold to FedBalance.
> -
>
> Key: HDFS-15640
> URL: https://issues.apache.org/jira/browse/HDFS-15640
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15640.001.patch, HDFS-15640.002.patch
>
>
> Currently the DistCpProcedure must submit distcp jobs round by round until 
> there is no diff before going to the final distcp stage. This condition is very 
> strict. If the distcp can finish in an acceptable period, then we don't need 
> to wait for an empty diff. For example, if 3 consecutive distcp jobs all finish 
> within 10 minutes, then we can predict that the final distcp will also finish 
> within 10 minutes, so we can start the final distcp directly.






[jira] [Commented] (HDFS-15640) RBF: Add fast distcp threshold to FedBalance.

2020-10-21 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218221#comment-17218221
 ] 

Yiqun Lin commented on HDFS-15640:
--

{quote}
Only a little problem is it might not be easy to know how much time will the 
diffs cost.
{quote}
Actually, the current logic already gets the latest snapshot diff, and we can just 
reuse that result. So it won't add any additional cost compared with the current 
logic.
{code}
  /**
   * Verify whether the src has changed since CURRENT_SNAPSHOT_NAME snapshot.
   *
   * @return true if the src has changed.
   */
  private boolean verifyDiff() throws IOException {
SnapshotDiffReport diffReport =
srcFs.getSnapshotDiffReport(src, CURRENT_SNAPSHOT_NAME, "");
return diffReport.getDiffList().size() > 0;
  }
{code}

Depending only on the last 3 consecutive distcp execution times is not a 100% 
accurate approach. As an extreme example, the final distcp should run very fast 
but could actually finish slowly due to something unexpected, like an abnormal 
node. So I still prefer to use the diff number.

> RBF: Add fast distcp threshold to FedBalance.
> -
>
> Key: HDFS-15640
> URL: https://issues.apache.org/jira/browse/HDFS-15640
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15640.001.patch
>
>
> Currently the DistCpProcedure must submit distcp jobs round by round until 
> there is no diff before going to the final distcp stage. This condition is very 
> strict. If the distcp can finish in an acceptable period, then we don't need 
> to wait for an empty diff. For example, if 3 consecutive distcp jobs all finish 
> within 10 minutes, then we can predict that the final distcp will also finish 
> within 10 minutes, so we can start the final distcp directly.






[jira] [Commented] (HDFS-15640) RBF: Add fast distcp threshold to FedBalance.

2020-10-20 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217678#comment-17217678
 ] 

Yiqun Lin commented on HDFS-15640:
--

[~LiJinglun], using distcp execution time as the FedBalance threshold is not an 
appropriate way. The execution time can be impacted by other factors, like not 
enough resources to schedule tasks or slow RPC calls.

I prefer to use the number of snapshot diff entries as the threshold here. We 
could use the getSnapshotDiffReport API to get this info. If the snapshot diff 
entries are reduced to a very low number, that means only a few files/dirs need 
to be synced, and then we can prepare to do the final distcp copy.

> RBF: Add fast distcp threshold to FedBalance.
> -
>
> Key: HDFS-15640
> URL: https://issues.apache.org/jira/browse/HDFS-15640
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15640.001.patch
>
>
> Currently the DistCpProcedure must submit distcp jobs round by round until 
> there is no diff before going to the final distcp stage. This condition is very 
> strict. If the distcp can finish in an acceptable period, then we don't need 
> to wait for an empty diff. For example, if 3 consecutive distcp jobs all finish 
> within 10 minutes, then we can predict that the final distcp will also finish 
> within 10 minutes, so we can start the final distcp directly.






[jira] [Commented] (HDFS-15486) Costly sendResponse operation slows down async editlog handling

2020-07-24 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164173#comment-17164173
 ] 

Yiqun Lin commented on HDFS-15486:
--

Hi [~yuanbo], thanks for the comment. We didn't change the CentOS version in 
our cluster, so this seems not really related.

[~John Smith], the place you pointed out is exactly what we want to improve.

> Costly sendResponse operation slows down async editlog handling
> ---
>
> Key: HDFS-15486
> URL: https://issues.apache.org/jira/browse/HDFS-15486
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Yiqun Lin
>Priority: Major
> Attachments: Async-profile-(2).jpg, async-profile-(1).jpg
>
>
> When our cluster NameNode is under very high load, we find it often gets 
> stuck in async editlog handling.
> We used the async-profiler tool to get the flamegraph.
> !Async-profile-(2).jpg!
> This happens when the async editlog thread consumes an Edit from the queue 
> and triggers the sendResponse call.
> The sendResponse call is a little expensive here because our cluster has the 
> security environment enabled and performs some encoding operations while 
> returning the response.
> We often observe costly sendResponse operations when the RPC call queue is 
> full.
> !async-profile-(1).jpg!
> Slowness in consuming Edits in the async editlog easily makes the Edit 
> pending queue become full, which then blocks the enqueue operations invoked 
> in writeLock-protected methods of the FSNamesystem class.
> The enhancement here is to use multiple threads to execute the sendResponse 
> calls in parallel. sendResponse doesn't need the write lock for protection, 
> so this change is safe.
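The proposed enhancement can be sketched roughly as below. `ResponseSender` and its method names are illustrative only, not the actual FSEditLogAsync code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch: offload the costly sendResponse work to a small
 *  thread pool so the single async editlog thread only dequeues edits
 *  and never blocks on the response-encoding (SASL/crypto) work. */
public class ResponseSender {
  private final ExecutorService responsePool;

  public ResponseSender(int numThreads) {
    this.responsePool = Executors.newFixedThreadPool(numThreads);
  }

  /** Called by the editlog sync thread once an edit is durable.
   *  sendResponse only touches per-call state, not the namesystem
   *  write lock, so running it off-thread is safe. */
  public void sendResponseAsync(Runnable sendResponseCall) {
    responsePool.execute(sendResponseCall);
  }

  /** Stop accepting work and wait for queued responses to be sent. */
  public void shutdown() throws InterruptedException {
    responsePool.shutdown();
    responsePool.awaitTermination(10, TimeUnit.SECONDS);
  }
}
```

The key design point matches the comment above: the editlog thread returns to dequeuing the next Edit immediately, and response ordering across different RPC calls does not matter because each response belongs to an independent call.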






[jira] [Updated] (HDFS-15486) Costly sendResponse operation slows down async editlog handling

2020-07-21 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15486:
-
Description: 
When our cluster NameNode is under very high load, we find it often gets stuck 
in async editlog handling.

We used the async-profiler tool to get the flamegraph.

!Async-profile-(2).jpg!

This happens when the async editlog thread consumes an Edit from the queue and 
triggers the sendResponse call.

The sendResponse call is a little expensive here because our cluster has the 
security environment enabled and performs some encoding operations while 
returning the response.

We often observe costly sendResponse operations when the RPC call queue is 
full.

!async-profile-(1).jpg!

Slowness in consuming Edits in the async editlog easily makes the Edit pending 
queue become full, which then blocks the enqueue operations invoked in 
writeLock-protected methods of the FSNamesystem class.

The enhancement here is to use multiple threads to execute the sendResponse 
calls in parallel. sendResponse doesn't need the write lock for protection, so 
this change is safe.

  was:
When our cluster NameNode is under very high load, we find it often gets stuck 
in async editlog handling.

We used the async-profiler tool to get the flamegraph.

!Async-profile-(2).jpg!

This happens when the async editlog thread consumes an Edit from the queue and 
triggers the sendResponse call.

The sendResponse call is a little expensive here because our cluster has the 
security environment enabled and performs some encoding operations while 
returning the response.

We often observe costly sendResponse operations when the RPC call queue is 
full.

!async-profile-(1).jpg!

Slowness in consuming Edits in the async editlog makes the Edit pending queue 
stay in the full state, which then blocks the enqueue operations invoked in 
writeLock-protected methods of the FSNamesystem class.

The enhancement here is to use multiple threads to execute the sendResponse 
calls in parallel. sendResponse doesn't need the write lock for protection, so 
this change is safe.


> Costly sendResponse operation slows down async editlog handling
> ---
>
> Key: HDFS-15486
> URL: https://issues.apache.org/jira/browse/HDFS-15486
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Yiqun Lin
>Priority: Major
> Attachments: Async-profile-(2).jpg, async-profile-(1).jpg
>
>
> When our cluster NameNode is under very high load, we find it often gets 
> stuck in async editlog handling.
> We used the async-profiler tool to get the flamegraph.
> !Async-profile-(2).jpg!
> This happens when the async editlog thread consumes an Edit from the queue 
> and triggers the sendResponse call.
> The sendResponse call is a little expensive here because our cluster has the 
> security environment enabled and performs some encoding operations while 
> returning the response.
> We often observe costly sendResponse operations when the RPC call queue is 
> full.
> !async-profile-(1).jpg!
> Slowness in consuming Edits in the async editlog easily makes the Edit 
> pending queue become full, which then blocks the enqueue operations invoked 
> in writeLock-protected methods of the FSNamesystem class.
> The enhancement here is to use multiple threads to execute the sendResponse 
> calls in parallel. sendResponse doesn't need the write lock for protection, 
> so this change is safe.






[jira] [Created] (HDFS-15486) Costly sendResponse operation slows down async editlog handling

2020-07-21 Thread Yiqun Lin (Jira)
Yiqun Lin created HDFS-15486:


 Summary: Costly sendResponse operation slows down async editlog 
handling
 Key: HDFS-15486
 URL: https://issues.apache.org/jira/browse/HDFS-15486
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Yiqun Lin
 Attachments: Async-profile-(2).jpg, async-profile-(1).jpg

When our cluster NameNode is under very high load, we find it often gets stuck 
in async editlog handling.

We used the async-profiler tool to get the flamegraph.

!Async-profile-(2).jpg!

This happens when the async editlog thread consumes an Edit from the queue and 
triggers the sendResponse call.

The sendResponse call is a little expensive here because our cluster has the 
security environment enabled and performs some encoding operations while 
returning the response.

We often observe costly sendResponse operations when the RPC call queue is 
full.

!async-profile-(1).jpg!

Slowness in consuming Edits in the async editlog makes the Edit pending queue 
stay in the full state, which then blocks the enqueue operations invoked in 
writeLock-protected methods of the FSNamesystem class.

The enhancement here is to use multiple threads to execute the sendResponse 
calls in parallel. sendResponse doesn't need the write lock for protection, so 
this change is safe.






[jira] [Comment Edited] (HDFS-15448) When starting a DataNode, call BlockPoolManager#startAll() twice.

2020-07-01 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149396#comment-17149396
 ] 

Yiqun Lin edited comment on HDFS-15448 at 7/1/20, 12:28 PM:


Not sure whether it's the right behavior to remove startAll() in 
DataNode#runDatanodeDaemon.
The method BlockPoolManager#startAll is invoked in different places, see the 
attached screenshot. The BlockPoolManager#startAll invoked in 
runDatanodeDaemon seems to be used only for tests.
 



 


was (Author: linyiqun):
Not sure whether it's the right behavior to remove startAll() in 
DataNode#runDatanodeDaemon.
The method BlockPoolManager#startAll is invoked in different places, see the 
attached screenshot.
 



 

> When starting a DataNode, call BlockPoolManager#startAll() twice.
> -
>
> Key: HDFS-15448
> URL: https://issues.apache.org/jira/browse/HDFS-15448
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.1.1
>Reporter: jianghua zhu
>Assignee: jianghua zhu
>Priority: Major
> Attachments: HDFS-15448.001.patch, HDFS-15448.002.patch, 
> method_invoke_path.jpg
>
>
> When starting a DataNode, call BlockPoolManager#startAll() twice.
> The first call:
> BlockPoolManager#doRefreshNamenodes()
> private void doRefreshNamenodes(
>  Map<String, Map<String, InetSocketAddress>> addrMap,
>  Map<String, Map<String, InetSocketAddress>> lifelineAddrMap)
>  throws IOException {
>  ...
> startAll();
> ...
> }
> The second call:
> DataNode#runDatanodeDaemon()
> public void runDatanodeDaemon() throws IOException {
> blockPoolManager.startAll();
> ...
> }
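One alternative to removing a call site is to make the duplicate call harmless with an idempotency guard. The `StartOnce` class below is a hypothetical illustration of that pattern, not the actual BlockPoolManager code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical guard: makes a startAll()-style method idempotent so
 *  that calling it from both doRefreshNamenodes() and
 *  runDatanodeDaemon() starts the block-pool actor threads only once. */
public class StartOnce {
  private final AtomicBoolean started = new AtomicBoolean(false);

  /** Runs startActors only on the first invocation; returns whether it ran. */
  public boolean startAll(Runnable startActors) {
    // compareAndSet guarantees exactly one caller wins, even under races.
    if (started.compareAndSet(false, true)) {
      startActors.run();
      return true;
    }
    return false;
  }
}
```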






[jira] [Commented] (HDFS-15448) When starting a DataNode, call BlockPoolManager#startAll() twice.

2020-07-01 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149396#comment-17149396
 ] 

Yiqun Lin commented on HDFS-15448:
--

Not sure whether it's the right behavior to remove startAll() in 
DataNode#runDatanodeDaemon.
The method BlockPoolManager#startAll is invoked in different places, see the 
attached screenshot.
 



 

> When starting a DataNode, call BlockPoolManager#startAll() twice.
> -
>
> Key: HDFS-15448
> URL: https://issues.apache.org/jira/browse/HDFS-15448
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.1.1
>Reporter: jianghua zhu
>Assignee: jianghua zhu
>Priority: Major
> Attachments: HDFS-15448.001.patch, HDFS-15448.002.patch, 
> method_invoke_path.jpg
>
>
> When starting a DataNode, call BlockPoolManager#startAll() twice.
> The first call:
> BlockPoolManager#doRefreshNamenodes()
> private void doRefreshNamenodes(
>  Map<String, Map<String, InetSocketAddress>> addrMap,
>  Map<String, Map<String, InetSocketAddress>> lifelineAddrMap)
>  throws IOException {
>  ...
> startAll();
> ...
> }
> The second call:
> DataNode#runDatanodeDaemon()
> public void runDatanodeDaemon() throws IOException {
> blockPoolManager.startAll();
> ...
> }






[jira] [Updated] (HDFS-15448) When starting a DataNode, call BlockPoolManager#startAll() twice.

2020-07-01 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15448:
-
Attachment: method_invoke_path.jpg

> When starting a DataNode, call BlockPoolManager#startAll() twice.
> -
>
> Key: HDFS-15448
> URL: https://issues.apache.org/jira/browse/HDFS-15448
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.1.1
>Reporter: jianghua zhu
>Assignee: jianghua zhu
>Priority: Major
> Attachments: HDFS-15448.001.patch, HDFS-15448.002.patch, 
> method_invoke_path.jpg
>
>
> When starting a DataNode, call BlockPoolManager#startAll() twice.
> The first call:
> BlockPoolManager#doRefreshNamenodes()
> private void doRefreshNamenodes(
>  Map<String, Map<String, InetSocketAddress>> addrMap,
>  Map<String, Map<String, InetSocketAddress>> lifelineAddrMap)
>  throws IOException {
>  ...
> startAll();
> ...
> }
> The second call:
> DataNode#runDatanodeDaemon()
> public void runDatanodeDaemon() throws IOException {
> blockPoolManager.startAll();
> ...
> }






[jira] [Comment Edited] (HDFS-15294) Federation balance tool

2020-07-01 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149160#comment-17149160
 ] 

Yiqun Lin edited comment on HDFS-15294 at 7/1/20, 6:44 AM:
---

I updated the description of this JIRA. [~LiJinglun], can you update the 
descriptions of the two subtasks HDFS-15340 and HDFS-15346? That will make 
them easier to understand.

All the subtasks of this feature have been done by [~LiJinglun]. If you are 
interested in the details of this tool, please see the documentation JIRA 
HDFS-15374.

Thanks [~LiJinglun] for the hard work and the great contribution! Also thanks 
[~elgoiri], [~ayushtkn] and others for the discussion and reviews!

Any further improvements or bug fixes for this feature are very welcome, :).


was (Author: linyiqun):
I updated the description of this JIRA. [~LiJinglun], can you update the 
descriptions of the two subtasks HDFS-15340 and HDFS-15346? That will make 
them easier to understand.

All the subtasks of this feature have been done by [~LiJinglun]. If you are 
interested in the details of this tool, please see the documentation JIRA 
HDFS-15374.

Thanks [~LiJinglun] for the hard work and the great contribution! Also thanks 
[~elgoiri], [~ayushtkn] and others for the discussion and reviews!

 

> Federation balance tool
> ---
>
> Key: HDFS-15294
> URL: https://issues.apache.org/jira/browse/HDFS-15294
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, 
> HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, 
> HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, 
> HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf
>
>
> This jira introduces a new HDFS federation balance tool to balance data 
> across different federation namespaces. It uses Distcp to copy data from the 
> source path to the target path.
> The process is:
>  1. Use distcp and snapshot diff to sync data between src and dst until they 
> are the same.
>  2. Update the mount table in the Router if RBF mode is specified.
>  3. Deal with the src data: move it to trash, delete it, or skip it.
> The design of fedbalance tool comes from the discussion in HDFS-15087.
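The three-step process quoted above can be sketched as a simple stage loop. The `SyncOps` interface and the stage names below are illustrative stand-ins, not the real BalanceProcedure classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative three-stage fedbalance flow. SyncOps stands in for the
 *  real distcp/snapshot-diff, Router mount-table, and trash operations. */
public class FedBalanceSketch {
  interface SyncOps {
    boolean hasSnapshotDiff();   // any pending diff between src and dst?
    void runIncrementalDistcp(); // distcp driven by the snapshot diff
    void updateMountTable();     // step 2: only when RBF mode is specified
    void handleSrcData();        // step 3: trash, delete, or skip src
  }

  public static void balance(SyncOps ops, boolean rbfMode) {
    // Step 1: sync with distcp + snapshot diff until src and dst match.
    while (ops.hasSnapshotDiff()) {
      ops.runIncrementalDistcp();
    }
    // Step 2: point the mount entry at the target namespace.
    if (rbfMode) {
      ops.updateMountTable();
    }
    // Step 3: deal with the source data.
    ops.handleSrcData();
  }
}
```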






[jira] [Updated] (HDFS-15294) Federation balance tool

2020-07-01 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15294:
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

I updated the description of this JIRA. [~LiJinglun], can you update the 
descriptions of the two subtasks HDFS-15340 and HDFS-15346? That will make 
them easier to understand.

All the subtasks of this feature have been done by [~LiJinglun]. If you are 
interested in the details of this tool, please see the documentation JIRA 
HDFS-15374.

Thanks [~LiJinglun] for the hard work and the great contribution! Also thanks 
[~elgoiri], [~ayushtkn] and others for the discussion and reviews!

 

> Federation balance tool
> ---
>
> Key: HDFS-15294
> URL: https://issues.apache.org/jira/browse/HDFS-15294
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, 
> HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, 
> HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, 
> HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf
>
>
> This jira introduces a new HDFS federation balance tool to balance data 
> across different federation namespaces. It uses Distcp to copy data from the 
> source path to the target path.
> The process is:
>  1. Use distcp and snapshot diff to sync data between src and dst until they 
> are the same.
>  2. Update the mount table in the Router if RBF mode is specified.
>  3. Deal with the src data: move it to trash, delete it, or skip it.
> The design of fedbalance tool comes from the discussion in HDFS-15087.






[jira] [Updated] (HDFS-15294) Federation balance tool

2020-07-01 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15294:
-
Description: 
This jira introduces a new HDFS federation balance tool to balance data across 
different federation namespaces. It uses Distcp to copy data from the source 
path to the target path.

The process is:
 1. Use distcp and snapshot diff to sync data between src and dst until they 
are the same.
 2. Update the mount table in the Router if RBF mode is specified.
 3. Deal with the src data: move it to trash, delete it, or skip it.

The design of fedbalance tool comes from the discussion in HDFS-15087.

  was:
This jira introduces a new HDFS federation balance tool to balance data across 
different federation namespaces. It uses Distcp to copy data from the source 
path to the target path.

The process is:
 1. Use distcp and snapshot diff to sync data between src and dst until they 
are the same.
 2. Update the mount table in the Router if RBF mode is specified.
 3. Deal with the src data: move it to trash, delete it, or skip it.

This  

The patch is too big to review, so I split it into 2 patches:

Phase 1 / The State Machine(BalanceProcedureScheduler): Including the 
abstraction of job and scheduler model.   
{code:java}
org.apache.hadoop.hdfs.procedure.BalanceProcedureScheduler;
org.apache.hadoop.hdfs.procedure.BalanceProcedureConfigKeys;
org.apache.hadoop.hdfs.procedure.BalanceProcedure;
org.apache.hadoop.hdfs.procedure.BalanceJob;
org.apache.hadoop.hdfs.procedure.BalanceJournal;
org.apache.hadoop.hdfs.procedure.HDFSJournal;
{code}
Phase 2 / The DistCpFedBalance: It's an implementation of BalanceJob.    
{code:java}
org.apache.hadoop.hdfs.server.federation.procedure.MountTableProcedure;
org.apache.hadoop.tools.DistCpFedBalance;
org.apache.hadoop.tools.DistCpProcedure;
org.apache.hadoop.tools.FedBalance;
org.apache.hadoop.tools.FedBalanceConfigs;
org.apache.hadoop.tools.FedBalanceContext;
org.apache.hadoop.tools.TrashProcedure;
{code}


> Federation balance tool
> ---
>
> Key: HDFS-15294
> URL: https://issues.apache.org/jira/browse/HDFS-15294
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, 
> HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, 
> HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, 
> HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf
>
>
> This jira introduces a new HDFS federation balance tool to balance data 
> across different federation namespaces. It uses Distcp to copy data from the 
> source path to the target path.
> The process is:
>  1. Use distcp and snapshot diff to sync data between src and dst until they 
> are the same.
>  2. Update the mount table in the Router if RBF mode is specified.
>  3. Deal with the src data: move it to trash, delete it, or skip it.
> The design of fedbalance tool comes from the discussion in HDFS-15087.






[jira] [Updated] (HDFS-15294) Federation balance tool

2020-07-01 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15294:
-
Description: 
This jira introduces a new HDFS federation balance tool to balance data across 
different federation namespaces. It uses Distcp to copy data from the source 
path to the target path.

The process is:
 1. Use distcp and snapshot diff to sync data between src and dst until they 
are the same.
 2. Update the mount table in the Router if RBF mode is specified.
 3. Deal with the src data: move it to trash, delete it, or skip it.

This  

The patch is too big to review, so I split it into 2 patches:

Phase 1 / The State Machine(BalanceProcedureScheduler): Including the 
abstraction of job and scheduler model.   
{code:java}
org.apache.hadoop.hdfs.procedure.BalanceProcedureScheduler;
org.apache.hadoop.hdfs.procedure.BalanceProcedureConfigKeys;
org.apache.hadoop.hdfs.procedure.BalanceProcedure;
org.apache.hadoop.hdfs.procedure.BalanceJob;
org.apache.hadoop.hdfs.procedure.BalanceJournal;
org.apache.hadoop.hdfs.procedure.HDFSJournal;
{code}
Phase 2 / The DistCpFedBalance: It's an implementation of BalanceJob.    
{code:java}
org.apache.hadoop.hdfs.server.federation.procedure.MountTableProcedure;
org.apache.hadoop.tools.DistCpFedBalance;
org.apache.hadoop.tools.DistCpProcedure;
org.apache.hadoop.tools.FedBalance;
org.apache.hadoop.tools.FedBalanceConfigs;
org.apache.hadoop.tools.FedBalanceContext;
org.apache.hadoop.tools.TrashProcedure;
{code}

  was:
This jira introduces a new balance command 'fedbalance' that is run by the 
administrator. The process is:
 1. Use distcp and snapshot diff to sync data between src and dst until they 
are the same.
 2. Update mount table in Router.
 3. Delete the src to trash.

 

The patch is too big to review, so I split it into 2 patches:

Phase 1 / The State Machine(BalanceProcedureScheduler): Including the 
abstraction of job and scheduler model.   
{code:java}
org.apache.hadoop.hdfs.procedure.BalanceProcedureScheduler;
org.apache.hadoop.hdfs.procedure.BalanceProcedureConfigKeys;
org.apache.hadoop.hdfs.procedure.BalanceProcedure;
org.apache.hadoop.hdfs.procedure.BalanceJob;
org.apache.hadoop.hdfs.procedure.BalanceJournal;
org.apache.hadoop.hdfs.procedure.HDFSJournal;
{code}
Phase 2 / The DistCpFedBalance: It's an implementation of BalanceJob.    
{code:java}
org.apache.hadoop.hdfs.server.federation.procedure.MountTableProcedure;
org.apache.hadoop.tools.DistCpFedBalance;
org.apache.hadoop.tools.DistCpProcedure;
org.apache.hadoop.tools.FedBalance;
org.apache.hadoop.tools.FedBalanceConfigs;
org.apache.hadoop.tools.FedBalanceContext;
org.apache.hadoop.tools.TrashProcedure;
{code}


> Federation balance tool
> ---
>
> Key: HDFS-15294
> URL: https://issues.apache.org/jira/browse/HDFS-15294
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, 
> HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, 
> HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, 
> HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf
>
>
> This jira introduces a new HDFS federation balance tool to balance data 
> across different federation namespaces. It uses Distcp to copy data from the 
> source path to the target path.
> The process is:
>  1. Use distcp and snapshot diff to sync data between src and dst until they 
> are the same.
>  2. Update the mount table in the Router if RBF mode is specified.
>  3. Deal with the src data: move it to trash, delete it, or skip it.
> This  
> The patch is too big to review, so I split it into 2 patches:
> Phase 1 / The State Machine(BalanceProcedureScheduler): Including the 
> abstraction of job and scheduler model.   
> {code:java}
> org.apache.hadoop.hdfs.procedure.BalanceProcedureScheduler;
> org.apache.hadoop.hdfs.procedure.BalanceProcedureConfigKeys;
> org.apache.hadoop.hdfs.procedure.BalanceProcedure;
> org.apache.hadoop.hdfs.procedure.BalanceJob;
> org.apache.hadoop.hdfs.procedure.BalanceJournal;
> org.apache.hadoop.hdfs.procedure.HDFSJournal;
> {code}
> Phase 2 / The DistCpFedBalance: It's an implementation of BalanceJob.     HDFS-15346>
> {code:java}
> org.apache.hadoop.hdfs.server.federation.procedure.MountTableProcedure;
> org.apache.hadoop.tools.DistCpFedBalance;
> org.apache.hadoop.tools.DistCpProcedure;
> org.apache.hadoop.tools.FedBalance;
> org.apache.hadoop.tools.FedBalanceConfigs;
> org.apache.hadoop.tools.FedBalanceContext;
> org.apache.hadoop.tools.TrashProcedure;
> {code}




[jira] [Updated] (HDFS-15374) Add documentation for fedbalance tool

2020-07-01 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15374:
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed this to trunk.

Thanks [~LiJinglun] for the contribution and thanks [~elgoiri] for the review.

> Add documentation for fedbalance tool
> -
>
> Key: HDFS-15374
> URL: https://issues.apache.org/jira/browse/HDFS-15374
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: BalanceProcedureScheduler.png, 
> FedBalance_Screenshot1.jpg, FedBalance_Screenshot2.jpg, 
> FedBalance_Screenshot3.jpg, HDFS-15374.001.patch, HDFS-15374.002.patch, 
> HDFS-15374.003.patch, HDFS-15374.004.patch, HDFS-15374.005.patch
>
>
> Add documentation for fedbalance tool.






[jira] [Updated] (HDFS-15374) Add documentation for fedbalance tool

2020-07-01 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15374:
-
Description: Add documentation for fedbalance tool.

> Add documentation for fedbalance tool
> -
>
> Key: HDFS-15374
> URL: https://issues.apache.org/jira/browse/HDFS-15374
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: BalanceProcedureScheduler.png, 
> FedBalance_Screenshot1.jpg, FedBalance_Screenshot2.jpg, 
> FedBalance_Screenshot3.jpg, HDFS-15374.001.patch, HDFS-15374.002.patch, 
> HDFS-15374.003.patch, HDFS-15374.004.patch, HDFS-15374.005.patch
>
>
> Add documentation for fedbalance tool.






[jira] [Updated] (HDFS-15410) Add separated config file hdfs-fedbalance-default.xml for fedbalance tool

2020-07-01 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15410:
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed this to trunk.

Thanks [~elgoiri] for the review and thanks [~LiJinglun] for the contribution!

> Add separated config file hdfs-fedbalance-default.xml for fedbalance tool
> -
>
> Key: HDFS-15410
> URL: https://issues.apache.org/jira/browse/HDFS-15410
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: HDFS-15410.001.patch, HDFS-15410.002.patch, 
> HDFS-15410.003.patch, HDFS-15410.004.patch, HDFS-15410.005.patch
>
>
> Add a separated config file named hdfs-fedbalance-default.xml for fedbalance 
> tool configs. It's like distcp-default.xml for the distcp tool.






[jira] [Updated] (HDFS-15410) Add separated config file hdfs-fedbalance-default.xml for fedbalance tool

2020-06-30 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15410:
-
Description: Add a separated config file named hdfs-fedbalance-default.xml 
for fedbalance tool configs. It's like distcp-default.xml for the distcp tool.  
(was: Add a separated config file named fedbalance-default.xml for fedbalance 
tool configs. It's like distcp-default.xml for the distcp tool.)

> Add separated config file hdfs-fedbalance-default.xml for fedbalance tool
> -
>
> Key: HDFS-15410
> URL: https://issues.apache.org/jira/browse/HDFS-15410
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15410.001.patch, HDFS-15410.002.patch, 
> HDFS-15410.003.patch, HDFS-15410.004.patch, HDFS-15410.005.patch
>
>
> Add a separated config file named hdfs-fedbalance-default.xml for fedbalance 
> tool configs. It's like distcp-default.xml for the distcp tool.






[jira] [Updated] (HDFS-15410) Add separated config file hdfs-fedbalance-default.xml for fedbalance tool

2020-06-30 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15410:
-
Summary: Add separated config file hdfs-fedbalance-default.xml for 
fedbalance tool  (was: Add separated config file fedbalance-default.xml for 
fedbalance tool)

> Add separated config file hdfs-fedbalance-default.xml for fedbalance tool
> -
>
> Key: HDFS-15410
> URL: https://issues.apache.org/jira/browse/HDFS-15410
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15410.001.patch, HDFS-15410.002.patch, 
> HDFS-15410.003.patch, HDFS-15410.004.patch, HDFS-15410.005.patch
>
>
> Add a separated config file named fedbalance-default.xml for fedbalance tool 
> configs. It's like distcp-default.xml for the distcp tool.






[jira] [Commented] (HDFS-15410) Add separated config file fedbalance-default.xml for fedbalance tool

2020-06-29 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147742#comment-17147742
 ] 

Yiqun Lin commented on HDFS-15410:
--

[~inigoiri], would you mind having a quick review of this JIRA and HDFS-15374?

[~LiJinglun], I will hold off committing for one day so that [~inigoiri] can 
have a quick review once he gets the time.

 

> Add separated config file fedbalance-default.xml for fedbalance tool
> 
>
> Key: HDFS-15410
> URL: https://issues.apache.org/jira/browse/HDFS-15410
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15410.001.patch, HDFS-15410.002.patch, 
> HDFS-15410.003.patch, HDFS-15410.004.patch
>
>
> Add a separated config file named fedbalance-default.xml for fedbalance tool 
> configs. It's like distcp-default.xml for the distcp tool.






[jira] [Commented] (HDFS-15374) Add documentation for fedbalance tool

2020-06-22 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142178#comment-17142178
 ] 

Yiqun Lin commented on HDFS-15374:
--

I generated the markdown documentation page locally, and it renders well now.

Thanks for addressing the comments, +1.

[~elgoiri], any further comment for this? I will hold off the commit in case 
you have other comments.

> Add documentation for fedbalance tool
> -
>
> Key: HDFS-15374
> URL: https://issues.apache.org/jira/browse/HDFS-15374
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: BalanceProcedureScheduler.png, 
> FedBalance_Screenshot1.jpg, FedBalance_Screenshot2.jpg, 
> FedBalance_Screenshot3.jpg, HDFS-15374.001.patch, HDFS-15374.002.patch, 
> HDFS-15374.003.patch, HDFS-15374.004.patch
>
>







[jira] [Commented] (HDFS-15410) Add separated config file fedbalance-default.xml for fedbalance tool

2020-06-22 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142164#comment-17142164
 ] 

Yiqun Lin commented on HDFS-15410:
--

LGTM, +1.

[~elgoiri], Does the latest patch also look good to you?

> Add separated config file fedbalance-default.xml for fedbalance tool
> 
>
> Key: HDFS-15410
> URL: https://issues.apache.org/jira/browse/HDFS-15410
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15410.001.patch, HDFS-15410.002.patch, 
> HDFS-15410.003.patch, HDFS-15410.004.patch
>
>
> Add a separated config file named fedbalance-default.xml for fedbalance tool 
> configs. It's like the distcp-default.xml for the distcp tool.






[jira] [Commented] (HDFS-15410) Add separated config file fedbalance-default.xml for fedbalance tool

2020-06-21 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141672#comment-17141672
 ] 

Yiqun Lin commented on HDFS-15410:
--

[~LiJinglun], one minor comment: can you add more description for the settings 
hdfs.fedbalance.procedure.scheduler.journal.uri and 
hdfs.fedbalance.procedure.work.thread.num?

For example, we could add definitions like these:
 hdfs.fedbalance.procedure.scheduler.journal.uri: The URI of the journal. The 
journal file is used for handling job persistence and recovery.
 hdfs.fedbalance.procedure.work.thread.num: The number of worker threads of the 
BalanceProcedureScheduler. The BalanceProcedureScheduler is responsible for 
scheduling a balance job, including submit, run, delay and recover.

Also please add the new descriptions above to the configuration options section 
of FederationBalance.md, which is tracked in HDFS-15374. Thanks.
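To make the suggestion concrete, the two settings with the proposed descriptions might look like the following in fedbalance-default.xml (a sketch only; the values shown are illustrative placeholders, not taken from the patch):

```xml
<configuration>
  <property>
    <name>hdfs.fedbalance.procedure.scheduler.journal.uri</name>
    <!-- Placeholder value for illustration. -->
    <value>hdfs://localhost:8020/tmp/procedure-journal</value>
    <description>The URI of the journal. The journal file is used for
      handling job persistence and recovery.</description>
  </property>
  <property>
    <name>hdfs.fedbalance.procedure.work.thread.num</name>
    <!-- Placeholder value for illustration. -->
    <value>10</value>
    <description>The number of worker threads of the
      BalanceProcedureScheduler, which is responsible for scheduling a
      balance job, including submit, run, delay and recover.</description>
  </property>
</configuration>
```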

> Add separated config file fedbalance-default.xml for fedbalance tool
> 
>
> Key: HDFS-15410
> URL: https://issues.apache.org/jira/browse/HDFS-15410
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15410.001.patch, HDFS-15410.002.patch, 
> HDFS-15410.003.patch
>
>
> Add a separated config file named fedbalance-default.xml for fedbalance tool 
> configs. It's like the ditcp-default.xml for distcp tool.






[jira] [Updated] (HDFS-15374) Add documentation for fedbalance tool

2020-06-21 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15374:
-
Attachment: FedBalance_Screenshot3.jpg
FedBalance_Screenshot2.jpg
FedBalance_Screenshot1.jpg

> Add documentation for fedbalance tool
> -
>
> Key: HDFS-15374
> URL: https://issues.apache.org/jira/browse/HDFS-15374
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: BalanceProcedureScheduler.png, 
> FedBalance_Screenshot1.jpg, FedBalance_Screenshot2.jpg, 
> FedBalance_Screenshot3.jpg, HDFS-15374.001.patch, HDFS-15374.002.patch, 
> HDFS-15374.003.patch
>
>







[jira] [Comment Edited] (HDFS-15374) Add documentation for fedbalance tool

2020-06-21 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141662#comment-17141662
 ] 

Yiqun Lin edited comment on HDFS-15374 at 6/22/20, 3:27 AM:


The patch looks almost great now, but I found one problem when using mvn 
site:site to generate the HTML page: the css file is missing.

Can you copy the css directory from the distcp module (../site/resources/css) 
to the same place in the fedbalance module?

Attached are screenshots of the HTML pages generated locally.

BTW, can you answer the question from my previous comment?
{quote}
I have a question here: can we support a full path like 
hdfs://my-ns01/src-folder instead of the specific NN port address above? In the 
local config, we often have the NN address configured in hdfs-site.xml.
{quote}

 

 


was (Author: linyiqun):
The patch looks almost great now, but I found one problem when using mvn 
site:site to generate the HTML page: the css file is missing.

Can you copy the css directory from the distcp module (../site/resources/css) 
to the same place in the fedbalance module?

Attached are screenshots of the HTML pages generated locally.

 

 

> Add documentation for fedbalance tool
> -
>
> Key: HDFS-15374
> URL: https://issues.apache.org/jira/browse/HDFS-15374
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: BalanceProcedureScheduler.png, HDFS-15374.001.patch, 
> HDFS-15374.002.patch, HDFS-15374.003.patch
>
>







[jira] [Commented] (HDFS-15374) Add documentation for fedbalance tool

2020-06-21 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141662#comment-17141662
 ] 

Yiqun Lin commented on HDFS-15374:
--

The patch looks almost great now, but I found one problem when using mvn 
site:site to generate the HTML page: the css file is missing.

Can you copy the css directory from the distcp module (../site/resources/css) 
to the same place in the fedbalance module?

Attached are screenshots of the HTML pages generated locally.

 

 

> Add documentation for fedbalance tool
> -
>
> Key: HDFS-15374
> URL: https://issues.apache.org/jira/browse/HDFS-15374
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: BalanceProcedureScheduler.png, HDFS-15374.001.patch, 
> HDFS-15374.002.patch, HDFS-15374.003.patch
>
>







[jira] [Commented] (HDFS-15374) Add documentation for fedbalance tool

2020-06-20 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140990#comment-17140990
 ] 

Yiqun Lin commented on HDFS-15374:
--

[~LiJinglun], thanks for updating the patch!

Minor comments from me:
{code:java}
Finally when the source and the target are the same, it
+  updates the mount table in Router and moves the source to trash.
{code}
It would be better to mention both the normal federation mode and the RBF mode.
{code:java}
In normal federation mode the source path must includes the source cluster.
{code}
This can be updated to:
{code:java}
In normal federation mode the source path must include the path schema.
{code}
I have a question here: can we support a full path like 
hdfs://my-ns01/src-folder instead of the specific NN port address above? In the 
local config, we often have the NN address configured in hdfs-site.xml.

The name {{DistCpFedBalance}} should be updated to FedBalance in the doc since 
it has been renamed.

I also found some trailing whitespace; please remove the redundant whitespace, 
as it leads to checkstyle warnings:
{noformat}
+  This will scan the journal to find all the unfinished jobs, recover and
+  continue to execute them.
+  <--- whitespaces
+  If we want to balance in a normal federation cluster, use the command below.
+
+bash$ /bin/hadoop fedbalance submit hdfs://nn0:8020/foo/src 
hdfs://nn1:8020/foo/dst
+<--- whitespaces
+  In normal federation mode the source path must includes the source cluster.
+
+### RBF Mode And Normal Federation Mode
+
+  The federation balance tool has 2 modes: <---whitespaces
+<---whitespaces
+  * the router-based federation mode (RBF mode).
+  * the normal federation mode.
+<---whitespaces
+  By default the command runs in the normal federation mode. You can specify 
the
+  rbf mode by using the option `-router`.
+<---whitespaces
+  In the rbf mode the first parameter is taken as the mount point. It disables
+  write by setting the mount point readonly.
+<---whitespaces
+  In the normal federation mode the first parameter is taken as the full path 
of
+  the source. The first parameter must include the source cluster. It disables
+  write by cancelling all the permissions of the source path.
+<---whitespaces
+  Details about disabling write see [DistCpFedBalance](#DistCpFedBalance).

...

when there is no diff and no open files. <---whitespaces

+* FINAL_DISTCP: Force close all the open files and submit the final distcp.
+* FINISH: Do the cleanup works. In normal federation mode the finish stage
+  also restores the permission of the dst path.
+  your patch name
{noformat}

> Add documentation for fedbalance tool
> -
>
> Key: HDFS-15374
> URL: https://issues.apache.org/jira/browse/HDFS-15374
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: BalanceProcedureScheduler.png, HDFS-15374.001.patch, 
> HDFS-15374.002.patch
>
>







[jira] [Commented] (HDFS-15410) Add separated config file fedbalance-default.xml for fedbalance tool

2020-06-19 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140960#comment-17140960
 ] 

Yiqun Lin commented on HDFS-15410:
--

Besides [~elgoiri]'s review comments, some more review comments from me:

I don't fully understand why we need to define the implementation class in the 
config and obtain the instance via reflection. Currently there is no other 
implementation class, so why not just create a new 
FedBalance/BalanceJournalInfoHDFS instance in the code? From my understanding, 
these two config settings can be removed.
{code:java}
federation.balance.class
hadoop.hdfs.procedure.journal.class

// init journal.
Class clazz = (Class) conf
.getClass(JOURNAL_CLASS, BalanceJournalInfoHDFS.class);
journal = ReflectionUtils.newInstance(clazz, conf);

Class balanceClazz = (Class) conf
.getClass(FEDERATION_BALANCE_CLASS, FedBalance.class);
Tool balancer = ReflectionUtils.newInstance(balanceClazz, conf);
{code}
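To illustrate the suggestion, the contrast between the reflection-based lookup and direct instantiation could look like the self-contained sketch below (the tiny interface and class here are minimal stand-ins for the Hadoop types, not the real API):

```java
import java.util.HashMap;
import java.util.Map;

public class InstantiationSketch {
  // Stand-in for the BalanceJournal abstraction; not the Hadoop interface.
  interface BalanceJournal { String name(); }

  // Stand-in for BalanceJournalInfoHDFS, currently the only implementation.
  static class BalanceJournalInfoHDFS implements BalanceJournal {
    public String name() { return "hdfs-journal"; }
  }

  // Reflection path: the implementation class is looked up from config,
  // mirroring conf.getClass(JOURNAL_CLASS, ...) + ReflectionUtils.newInstance.
  static BalanceJournal fromConf(Map<String, String> conf) throws Exception {
    String clazz = conf.getOrDefault("hadoop.hdfs.procedure.journal.class",
        BalanceJournalInfoHDFS.class.getName());
    return (BalanceJournal) Class.forName(clazz)
        .getDeclaredConstructor().newInstance();
  }

  public static void main(String[] args) throws Exception {
    // Direct path: with a single known implementation, no config key and no
    // reflection are needed.
    BalanceJournal direct = new BalanceJournalInfoHDFS();
    BalanceJournal reflected = fromConf(new HashMap<>());
    System.out.println(direct.name().equals(reflected.name())); // prints "true"
  }
}
```

Both paths yield the same instance type here, which is the point of the comment: the reflection indirection only pays off once a second implementation exists.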

Can we rename the class {{DistCpBalanceOptions}} to {{FedBalanceOptions}}? That 
would be more readable, making it clear that these options belong to the 
fedbalance tool.

Can we rename config prefix from {{hadoop.hdfs.procedure.work.thread.num}} to 
{{hdfs.fedbalance.procedure.work.thread.num}}?

The following description needs to be updated, since the -router option no 
longer takes true or false as a parameter.
{noformat}
  final static Option ROUTER = new Option("router", false,
  "If `true` the command runs in router mode. The source path is "
  + "taken as a mount point. It will disable write by setting the mount"
  + " point readonly. Otherwise the command works in normal federation"
  + " mode. The source path is taken as the full path. It will disable"
  + " write by cancelling all permissions of the source path. The"
  + " default value is `true`.");
{noformat}

> Add separated config file fedbalance-default.xml for fedbalance tool
> 
>
> Key: HDFS-15410
> URL: https://issues.apache.org/jira/browse/HDFS-15410
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15410.001.patch, HDFS-15410.002.patch
>
>
> Add a separated config file named fedbalance-default.xml for fedbalance tool 
> configs. It's like the distcp-default.xml for the distcp tool.






[jira] [Commented] (HDFS-15374) Add documentation for fedbalance tool

2020-06-17 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139105#comment-17139105
 ] 

Yiqun Lin commented on HDFS-15374:
--

Hi [~LiJinglun], can you attach the latest patch here? I am more accustomed to 
reviewing patch files, :D. Thank you.

> Add documentation for fedbalance tool
> -
>
> Key: HDFS-15374
> URL: https://issues.apache.org/jira/browse/HDFS-15374
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15374.001.patch
>
>







[jira] [Updated] (HDFS-15346) FedBalance tool implementation

2020-06-17 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15346:
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed this to trunk.

Thanks [~LiJinglun] for the great contribution!

> FedBalance tool implementation
> --
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, 
> HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, 
> HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch, 
> HDFS-15346.012.patch
>
>
> Patch in HDFS-15294 is too big to review so we split it into 2 patches. This 
> is the second one. Detail can be found at HDFS-15294.






[jira] [Updated] (HDFS-15346) FedBalance tool implementation

2020-06-17 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15346:
-
Summary: FedBalance tool implementation  (was: DistCpFedBalance 
implementation)

> FedBalance tool implementation
> --
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, 
> HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, 
> HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch, 
> HDFS-15346.012.patch
>
>
> Patch in HDFS-15294 is too big to review so we split it into 2 patches. This 
> is the second one. Detail can be found at HDFS-15294.






[jira] [Commented] (HDFS-15346) DistCpFedBalance implementation

2020-06-16 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136480#comment-17136480
 ] 

Yiqun Lin commented on HDFS-15346:
--

LGTM, +1. I will commit this the day after tomorrow if there are no other 
comments.

> DistCpFedBalance implementation
> ---
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, 
> HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, 
> HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch, 
> HDFS-15346.012.patch
>
>
> Patch in HDFS-15294 is too big to review so we split it into 2 patches. This 
> is the second one. Detail can be found at HDFS-15294.






[jira] [Commented] (HDFS-15294) Federation balance tool

2020-06-15 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136263#comment-17136263
 ] 

Yiqun Lin commented on HDFS-15294:
--

As this feature is designed as a common tool like distcp, I removed the RBF 
label from all uncommitted subtasks.

> Federation balance tool
> ---
>
> Key: HDFS-15294
> URL: https://issues.apache.org/jira/browse/HDFS-15294
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, 
> HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, 
> HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, 
> HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf
>
>
> This jira introduces a new balance command 'fedbalance' that is run by the 
> administrator. The process is:
>  1. Use distcp and snapshot diff to sync data between src and dst until they 
> are the same.
>  2. Update mount table in Router.
>  3. Delete the src to trash.
>  
> The patch is too big to review, so I split it into 2 patches:
> Phase 1 / The State Machine(BalanceProcedureScheduler): Including the 
> abstraction of job and scheduler model.   
> {code:java}
> org.apache.hadoop.hdfs.procedure.BalanceProcedureScheduler;
> org.apache.hadoop.hdfs.procedure.BalanceProcedureConfigKeys;
> org.apache.hadoop.hdfs.procedure.BalanceProcedure;
> org.apache.hadoop.hdfs.procedure.BalanceJob;
> org.apache.hadoop.hdfs.procedure.BalanceJournal;
> org.apache.hadoop.hdfs.procedure.HDFSJournal;
> {code}
> Phase 2 / The DistCpFedBalance: It's an implementation of BalanceJob. (HDFS-15346)
> {code:java}
> org.apache.hadoop.hdfs.server.federation.procedure.MountTableProcedure;
> org.apache.hadoop.tools.DistCpFedBalance;
> org.apache.hadoop.tools.DistCpProcedure;
> org.apache.hadoop.tools.FedBalance;
> org.apache.hadoop.tools.FedBalanceConfigs;
> org.apache.hadoop.tools.FedBalanceContext;
> org.apache.hadoop.tools.TrashProcedure;
> {code}






[jira] [Updated] (HDFS-15410) Add separated config file fedbalance-default.xml for fedbalance tool

2020-06-15 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15410:
-
Summary: Add separated config file fedbalance-default.xml for fedbalance 
tool  (was: RBF: Add separated config file fedbalance-default.xml for 
fedbalance tool)

> Add separated config file fedbalance-default.xml for fedbalance tool
> 
>
> Key: HDFS-15410
> URL: https://issues.apache.org/jira/browse/HDFS-15410
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
>
> Add a separated config file named fedbalance-default.xml for fedbalance tool 
> configs. It's like the distcp-default.xml for the distcp tool.






[jira] [Updated] (HDFS-15374) Add documentation for fedbalance tool

2020-06-15 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15374:
-
Summary: Add documentation for fedbalance tool  (was: RBF: Add 
documentation for fedbalance tool)

> Add documentation for fedbalance tool
> -
>
> Key: HDFS-15374
> URL: https://issues.apache.org/jira/browse/HDFS-15374
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15374.001.patch
>
>







[jira] [Updated] (HDFS-15346) DistCpFedBalance implementation

2020-06-15 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15346:
-
Summary: DistCpFedBalance implementation  (was: RBF: DistCpFedBalance 
implementation)

> DistCpFedBalance implementation
> ---
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, 
> HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, 
> HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch
>
>
> Patch in HDFS-15294 is too big to review so we split it into 2 patches. This 
> is the second one. Detail can be found at HDFS-15294.






[jira] [Updated] (HDFS-15294) Federation balance tool

2020-06-15 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15294:
-
Summary: Federation balance tool  (was: RBF: Balance data across federation 
namespaces with DistCp and snapshot diff)

> Federation balance tool
> ---
>
> Key: HDFS-15294
> URL: https://issues.apache.org/jira/browse/HDFS-15294
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, 
> HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, 
> HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, 
> HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf
>
>
> This jira introduces a new balance command 'fedbalance' that is run by the 
> administrator. The process is:
>  1. Use distcp and snapshot diff to sync data between src and dst until they 
> are the same.
>  2. Update mount table in Router.
>  3. Delete the src to trash.
>  
> The patch is too big to review, so I split it into 2 patches:
> Phase 1 / The State Machine(BalanceProcedureScheduler): Including the 
> abstraction of job and scheduler model.   
> {code:java}
> org.apache.hadoop.hdfs.procedure.BalanceProcedureScheduler;
> org.apache.hadoop.hdfs.procedure.BalanceProcedureConfigKeys;
> org.apache.hadoop.hdfs.procedure.BalanceProcedure;
> org.apache.hadoop.hdfs.procedure.BalanceJob;
> org.apache.hadoop.hdfs.procedure.BalanceJournal;
> org.apache.hadoop.hdfs.procedure.HDFSJournal;
> {code}
> Phase 2 / The DistCpFedBalance: It's an implementation of BalanceJob. (HDFS-15346)
> {code:java}
> org.apache.hadoop.hdfs.server.federation.procedure.MountTableProcedure;
> org.apache.hadoop.tools.DistCpFedBalance;
> org.apache.hadoop.tools.DistCpProcedure;
> org.apache.hadoop.tools.FedBalance;
> org.apache.hadoop.tools.FedBalanceConfigs;
> org.apache.hadoop.tools.FedBalanceContext;
> org.apache.hadoop.tools.TrashProcedure;
> {code}






[jira] [Updated] (HDFS-15410) RBF: Add separated config file fedbalance-default.xml for fedbalance tool

2020-06-15 Thread Yiqun Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15410:
-
Summary: RBF: Add separated config file fedbalance-default.xml for 
fedbalance tool  (was: Add separated config file fedbalance-default.xml for 
fedbalance tool.)

> RBF: Add separated config file fedbalance-default.xml for fedbalance tool
> -
>
> Key: HDFS-15410
> URL: https://issues.apache.org/jira/browse/HDFS-15410
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
>
> Add a separated config file named fedbalance-default.xml for fedbalance tool 
> configs. It's like the distcp-default.xml for the distcp tool.






[jira] [Commented] (HDFS-15346) RBF: DistCpFedBalance implementation

2020-06-15 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135819#comment-17135819
 ] 

Yiqun Lin commented on HDFS-15346:
--

[~LiJinglun], the refactoring looks great. I noticed you decreased the timeout 
value; the new value seems too small and will lead to timeout errors.

Can you adjust all these timeout values to 3 (@Test(timeout = 3)) in 
TestDistCpProcedure? This value works well in my local environment.

Finally, can we add 'fedbalance' to the current package names under the 
fedbalance module?

Under the module paths src/test/java and src/main/java, update
{noformat}
org.apache.hadoop.tools
org.apache.hadoop.tools.procedure
{noformat}
to
{noformat}
org.apache.hadoop.tools.fedbalance
org.apache.hadoop.tools.fedbalance.procedure
{noformat}
Then please check and update the old class paths used in the module, e.g. in 
hadoop-federation-balance.sh, pom.xml, and any other places.

Everything else looks good to me now. Thanks [~LiJinglun] for working so 
patiently on this. 
Once the above are addressed, I will hold off the commit for a few days in case 
there are other comments.

> RBF: DistCpFedBalance implementation
> 
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, 
> HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, 
> HDFS-15346.009.patch, HDFS-15346.010.patch, HDFS-15346.011.patch
>
>
> Patch in HDFS-15294 is too big to review so we split it into 2 patches. This 
> is the second one. Detail can be found at HDFS-15294.






[jira] [Comment Edited] (HDFS-15346) RBF: DistCpFedBalance implementation

2020-06-13 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135024#comment-17135024
 ] 

Yiqun Lin edited comment on HDFS-15346 at 6/14/20, 5:05 AM:


[~LiJinglun], thanks for addressing the remaining comments.

Over the past two days, I have been trying to improve the efficiency of the 
unit test; the current unit test is too slow.

I found another way so that we don't have to depend on a mini YARN cluster when 
running the test. The job can be submitted and executed in LocalJobRunner when 
there is no mini YARN cluster environment, but we need to adjust how the job 
status is obtained from the job client.

I did some refactoring of the getCurrentJob method and applied it in 
DistCpProcedure.

Below is part of the necessary change:
{noformat}
  @VisibleForTesting
  private Job runningJob;
  static boolean ENABLED_FOR_TEST = false;
...
  private String submitDistCpJob(String srcParam, String dstParam,
  boolean useSnapshotDiff) throws IOException {
...
try {
  LOG.info("Submit distcp job={}", job);
  runningJob = job;   <--- need to reset there
  return job.getJobID().toString();
} catch (Exception e) {
  throw new IOException("Submit job failed.", e);
}
  }

  private RunningJobStatus getCurrentJob() throws IOException {
if (jobId != null) {
  if (ENABLED_FOR_TEST) {
if (this.runningJob != null) {
  Job latestJob = null;
  try {
latestJob = this.runningJob.getCluster()
.getJob(JobID.forName(jobId));
  } catch (InterruptedException e) {
throw new IOException(e);
  }
  return latestJob == null ? null
  : new RunningJobStatus(latestJob, null);
}
  } else {
RunningJob latestJob = client.getJob(JobID.forName(jobId));
return latestJob == null ? null :
  new RunningJobStatus(null, latestJob);
  }
}
return null;
  }

  class RunningJobStatus {
Job testJob;
RunningJob job;

public RunningJobStatus(Job testJob, RunningJob job) {
  this.testJob = testJob;
  this.job = job;
}

String getJobID() {
  return ENABLED_FOR_TEST ? testJob.getJobID().toString()
  : job.getID().toString();
}

boolean isComplete() throws IOException {
  return ENABLED_FOR_TEST ? testJob.isComplete() : job.isComplete();
}

boolean isSuccessful() throws IOException {
  return ENABLED_FOR_TEST ? testJob.isSuccessful() : job.isSuccessful();
}

String getFailureInfo() throws IOException {
  try {
return ENABLED_FOR_TEST ? testJob.getStatus().getFailureInfo()
: job.getFailureInfo();
  } catch (InterruptedException e) {
throw new IOException(e);
  }
}
  }
{noformat}
All the mini YARN cluster related code lines can be removed (including the two 
pom dependencies mentioned above):
{code:java}
+mrCluster = new MiniMRYarnCluster(TestDistCpProcedure.class.getName(), 3);
+conf.set(MRJobConfig.MR_AM_STAGING_DIR, "/apps_staging_dir");
+mrCluster.init(conf);
+mrCluster.start();
+conf = mrCluster.getConfig();
{code}
We additionally need to set the test-enabled flag.
{code:java}
 public static void beforeClass() throws IOException {
DistCpProcedure.ENABLED_FOR_TEST = true;
...
}
{code}
After this improvement, the whole test runs much faster than before; in total 
it takes less than one minute.

In addition, we need to do a cleanup at the end of each test method.

like 
{code:java}
fs.delete(new Path(testRoot), true);
{code}
or
{code:java}
dcProcedure.finish(); (sometimes we need to call this since some cases have a 
snapshot created that cannot otherwise be deleted)
fs.delete(new Path(testRoot), true);
{code}
I also caught some places that still need to be updated.
 # Can you update the following description in the router option? I mentioned 
updating this content before, but it seems it was not addressed in the latest 
patch.
{noformat}
It will disable read and write by cancelling all permissions of the source 
path. The default value  is `false`."
{noformat}

 # The method cleanUpBeforeInitDistcp can be renamed to 
pathCheckBeforeInitDistcp, since we no longer do any cleanup operation there.



[jira] [Comment Edited] (HDFS-15346) RBF: DistCpFedBalance implementation

2020-06-13 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135024#comment-17135024
 ] 

Yiqun Lin edited comment on HDFS-15346 at 6/14/20, 4:59 AM:


[~LiJinglun], thanks for addressing remaining comments.

Over the past two days, I have been trying to improve the efficiency of the unit 
test; the current unit test is too slow.

I found another way so that we don't have to depend on the mini YARN cluster when 
running the test. The job can be submitted and executed in LocalJobRunner when there 
is no mini YARN cluster env, but we need to adjust how the job status is obtained 
from the job client.

I did some refactoring of the getCurrentJob method and applied it in DistCpProcedure.

Following are some of the necessary changes:
{noformat}
  @VisibleForTesting
  private Job runningJob;
  static boolean ENABLED_FOR_TEST = false;
...
  private String submitDistCpJob(String srcParam, String dstParam,
  boolean useSnapshotDiff) throws IOException {
...
try {
  LOG.info("Submit distcp job={}", job);
  runningJob = job;   <--- needs to be set here
  return job.getJobID().toString();
} catch (Exception e) {
  throw new IOException("Submit job failed.", e);
}
  }

  private RunningJobStatus getCurrentJob() throws IOException {
if (jobId != null) {
  if (ENABLED_FOR_TEST) {
if (this.runningJob != null) {
  Job latestJob = null;
  try {
latestJob = this.runningJob.getCluster()
.getJob(JobID.forName(jobId));
  } catch (InterruptedException e) {
throw new IOException(e);
  }
  return latestJob == null ? null
  : new RunningJobStatus(latestJob, null);
}
  } else {
RunningJob latestJob = client.getJob(JobID.forName(jobId));
return latestJob == null ? null :
  new RunningJobStatus(null, latestJob);
  }
}
return null;
  }

  class RunningJobStatus {
Job testJob;
RunningJob job;

public RunningJobStatus(Job testJob, RunningJob job) {
  this.testJob = testJob;
  this.job = job;
}

String getJobID() {
  return ENABLED_FOR_TEST ? testJob.getJobID().toString()
  : job.getID().toString();
}

boolean isComplete() throws IOException {
  return ENABLED_FOR_TEST ? testJob.isComplete() : job.isComplete();
}

boolean isSuccessful() throws IOException {
  return ENABLED_FOR_TEST ? testJob.isSuccessful() : job.isSuccessful();
}

String getFailureInfo() throws IOException {
  try {
return ENABLED_FOR_TEST ? testJob.getStatus().getFailureInfo()
: job.getFailureInfo();
  } catch (InterruptedException e) {
throw new IOException(e);
  }
}
  }
{noformat}
All the mini YARN cluster related code lines can be removed (including the two pom 
dependencies mentioned above):
{code:java}
+mrCluster = new MiniMRYarnCluster(TestDistCpProcedure.class.getName(), 3);
+conf.set(MRJobConfig.MR_AM_STAGING_DIR, "/apps_staging_dir");
+mrCluster.init(conf);
+mrCluster.start();
+conf = mrCluster.getConfig();
{code}
We additionally need to set the test-enabled flag:
{code:java}
 public static void beforeClass() throws IOException {
DistCpProcedure.ENABLED_FOR_TEST = true;
...
}
{code}
After this improvement, the whole test runs much faster than before; in total it 
costs less than 1 minute.

I also caught some places that still need updating.
 # Can you update the following description in the router option? I raised this point 
before as well, but it seems it was not addressed in the latest patch.
{noformat}
It will disable read and write by cancelling all permissions of the source 
path. The default value  is `false`."
{noformat}

 # The method cleanUpBeforeInitDistcp can be renamed to 
pathCheckBeforeInitDistcp, since we no longer do any cleanup operation there.



[jira] [Commented] (HDFS-15346) RBF: DistCpFedBalance implementation

2020-06-13 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135024#comment-17135024
 ] 

Yiqun Lin commented on HDFS-15346:
--

[~LiJinglun], thanks for addressing the remaining comments.

Over the past two days, I have been trying to improve the efficiency of the unit 
test; the current unit test is too slow.

I found another way so that we don't have to depend on the mini YARN cluster when 
running the test. The job can be submitted and executed in LocalJobRunner when there 
is no mini YARN cluster env, but we need to adjust how the job status is obtained 
from the job client.

I did some refactoring of the getCurrentJob method and applied it in DistCpProcedure.

Following are some of the necessary changes:
{noformat}
  @VisibleForTesting
  private Job runningJob;
  static boolean ENABLED_FOR_TEST = false;
...
  private String submitDistCpJob(String srcParam, String dstParam,
  boolean useSnapshotDiff) throws IOException {
...
try {
  LOG.info("Submit distcp job={}", job);
  runningJob = job;   <--- needs to be set here
  return job.getJobID().toString();
} catch (Exception e) {
  throw new IOException("Submit job failed.", e);
}
  }

  private RunningJobStatus getCurrentJob() throws IOException {
if (jobId != null) {
  if (ENABLED_FOR_TEST) {
if (this.runningJob != null) {
  Job latestJob = null;
  try {
latestJob = this.runningJob.getCluster()
.getJob(JobID.forName(jobId));
  } catch (InterruptedException e) {
throw new IOException(e);
  }
  return latestJob == null ? null
  : new RunningJobStatus(latestJob, null);
}
  } else {
RunningJob latestJob = client.getJob(JobID.forName(jobId));
return latestJob == null ? null :
  new RunningJobStatus(null, latestJob);
  }
}
return null;
  }

  class RunningJobStatus {
Job testJob;
RunningJob job;

public RunningJobStatus(Job testJob, RunningJob job) {
  this.testJob = testJob;
  this.job = job;
}

String getJobID() {
  return ENABLED_FOR_TEST ? testJob.getJobID().toString()
  : job.getID().toString();
}

boolean isComplete() throws IOException {
  return ENABLED_FOR_TEST ? testJob.isComplete() : job.isComplete();
}

boolean isSuccessful() throws IOException {
  return ENABLED_FOR_TEST ? testJob.isSuccessful() : job.isSuccessful();
}

String getFailureInfo() throws IOException {
  try {
return ENABLED_FOR_TEST ? testJob.getStatus().getFailureInfo()
: job.getFailureInfo();
  } catch (InterruptedException e) {
throw new IOException(e);
  }
}
  }
{noformat}
All the mini YARN cluster related code lines can be removed (including the two pom 
dependencies mentioned above):
{code:java}
+mrCluster = new MiniMRYarnCluster(TestDistCpProcedure.class.getName(), 3);
+conf.set(MRJobConfig.MR_AM_STAGING_DIR, "/apps_staging_dir");
+mrCluster.init(conf);
+mrCluster.start();
+conf = mrCluster.getConfig();
{code}
We additionally need to set the test-enabled flag:
{code:java}
 public static void beforeClass() throws IOException {
DistCpProcedure.ENABLED_FOR_TEST = true;
...
}
{code}
After this improvement, the whole test runs much faster than before; in total it 
costs less than 1 minute.

I also caught some places that still need updating.
 # Can you update the following description in the router option? I raised this point 
before as well, but it seems it was not addressed in the latest patch.
{noformat}
It will disable read and write by cancelling all permissions of the source 
path. The default value  is `false`."
{noformat}

 # The method cleanUpBeforeInitDistcp can be renamed to 
pathCheckBeforeInitDistcp, since we no longer do any cleanup operation there.

> RBF: DistCpFedBalance implementation
> 
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, 
> HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, 
> HDFS-15346.009.patch, HDFS-15346.010.patch
>
>
> Patch in HDFS-15294 is too big to review so we split it into 2 patches. This 
> is the second one. Detail can be found at HDFS-15294.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15087) RBF: Balance/Rename across federation namespaces

2020-06-13 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134832#comment-17134832
 ] 

Yiqun Lin edited comment on HDFS-15087 at 6/13/20, 3:16 PM:


Hi [~umamaheswararao] and others, I'd like to share the current status of 
HDFS-15294, since some others may also want to know about this feature.

Now [~LiJinglun] has almost completed the majority of the implementation, and we are 
actively working on the core subtask HDFS-15346. There is still some remaining work, 
such as documentation. 
 After discussing this feature design with [~LiJinglun], we agreed to let this tool 
become a common balance tool that is not only used in RBF mode, but can also be used 
in normal federation clusters.
{quote}If it's mandatory, would it be possible to think and make it as optional 
and have alternative thoughts to get the diff?
{quote}
Good idea; we could make this diff pluggable in a future improvement.

Please share your wonderful thoughts/comments on HDFS-15294 if you are 
interested in this, :).



> RBF: Balance/Rename across federation namespaces
> 
>
> Key: HDFS-15087
> URL: https://issues.apache.org/jira/browse/HDFS-15087
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15087.initial.patch, HFR_Rename Across Federation 
> Namespaces.pdf
>
>
> The Xiaomi storage team has developed a new feature called HFR(HDFS 
> Federation Rename) that enables us to do balance/rename across federation 
> namespaces. The idea is to first move the meta to the dst NameNode and then 
> link all the replicas. It has been working in our largest production cluster 
> for 2 months. We use it to balance the namespaces. It turns out HFR is fast 
> and flexible. The detail could be found in the design doc. 
> Looking forward to a lively discussion.






[jira] [Commented] (HDFS-15087) RBF: Balance/Rename across federation namespaces

2020-06-13 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134832#comment-17134832
 ] 

Yiqun Lin commented on HDFS-15087:
--

Hi [~umamaheswararao] and others, I'd like to share the current status of 
HDFS-15294, since some others may also want to know about this feature.

Now [~LiJinglun] has almost completed the majority of the implementation, and we are 
actively working on the core subtask HDFS-15346. There is still some remaining work, 
such as documentation. 
 After discussing this feature design with [~LiJinglun], we agreed to let this tool 
become a common balance tool that is not only used in RBF mode, but can also be used 
in normal federation clusters.

Please share your wonderful thoughts/comments on HDFS-15294 if you are 
interested in this, :).

> RBF: Balance/Rename across federation namespaces
> 
>
> Key: HDFS-15087
> URL: https://issues.apache.org/jira/browse/HDFS-15087
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15087.initial.patch, HFR_Rename Across Federation 
> Namespaces.pdf
>
>
> The Xiaomi storage team has developed a new feature called HFR(HDFS 
> Federation Rename) that enables us to do balance/rename across federation 
> namespaces. The idea is to first move the meta to the dst NameNode and then 
> link all the replicas. It has been working in our largest production cluster 
> for 2 months. We use it to balance the namespaces. It turns out HFR is fast 
> and flexible. The detail could be found in the design doc. 
> Looking forward to a lively discussion.






[jira] [Commented] (HDFS-15346) RBF: DistCpFedBalance implementation

2020-06-10 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130903#comment-17130903
 ] 

Yiqun Lin commented on HDFS-15346:
--

[~LiJinglun], thanks for addressing the comments; it almost looks good now.
{quote}Agree with you ! Using a fedbalance-default.xml is much better.
{quote}
Would you create a subtask JIRA for this? Let's try to complete it at a later 
time.
{quote}I'll try to figure it out. But it might be quite tricky as the unit 
tests use both MiniDFSCluster and MiniMRYarnCluster. And there are many rounds 
of distcp. Please tell me if you have any suggestions, thanks
{quote}
I will take a further look at this later. Anyway, currently all the unit tests 
pass, so it's okay for me.

Still some remaining minor comments:

*hadoop-federation-balance/pom.xml*
{noformat}
+
+  org.bouncycastle
+  bcprov-jdk15on
+  test
+
+
+  org.bouncycastle
+  bcpkix-jdk15on
+  test
+
{noformat}
These two dependencies seem unrelated; can we remove them?
 *DistCpFedBalance.java/FedBalance.java*
 I don't know why we define another class FedBalance. FedBalance can just be 
combined into DistCpFedBalance. I prefer to override the main method in 
DistCpFedBalance and then rename DistCpFedBalance to FedBalance.

*DistCpBalanceOptions.java*
 I find two places that can be described more clearly:
 # I prefer to move the detailed comment message into the option description, so 
users can learn the details of this option.
{code:java}
/**
 * Run in router-based federation mode.
 */
final static Option ROUTER = new Option("router", false,
    "If `true` the command runs in router mode. The source path is taken"
        + " as a mount point. It will disable write by setting the mount"
        + " point readonly. Otherwise the command works in normal federation"
        + " mode. The source path is taken as the full path. It will disable"
        + " read and write by cancelling all permissions of the source path."
        + " The default value is `false`.");
{code}
 

 # The description of the delay option is hard to understand; I made a minor change 
to it. [~LiJinglun], if you have a better description for this option, feel 
free to update it.
{code:java}
/* Specify the delay duration (in milliseconds) before recovering the job. */
final static Option DELAY_DURATION = new Option("delay", true,
    "The delay duration (in milliseconds) before the job is recovered and"
        + " continues to run, when the job is detected as unfinished.");
{code}
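To make the intended semantics concrete (wait the configured delay, then re-check whether the unfinished job has completed), here is a standalone polling sketch; SlowTask is a made-up stand-in for the real job client and is not part of the patch:

```java
// Standalone sketch of a delayed-recovery polling loop: re-check an
// unfinished job every `delayMillis` until it reports completion.
// SlowTask is a hypothetical stand-in for the real job client.
public class DelayPoll {
    static class SlowTask {
        private int checksUntilDone;
        SlowTask(int checks) { this.checksUntilDone = checks; }
        boolean isComplete() { return --checksUntilDone <= 0; }
    }

    /** Returns the number of polls it took for the task to finish. */
    public static int waitForCompletion(SlowTask task, long delayMillis) {
        int polls = 0;
        while (true) {
            polls++;
            if (task.isComplete()) {
                return polls;
            }
            try {
                Thread.sleep(delayMillis); // the "delay" option's role
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while polling", e);
            }
        }
    }
}
```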

*DistCpProcedure.java*
 # Move {{srcFs.allowSnapshot(src);}} to the end of the method. Only after the 
snapshot check should we do the allow-snapshot operation.
{code:java}
+
+  private void cleanUpBeforeInitDistcp() throws IOException {
+if (dstFs.exists(dst)) { // clean up.
+  throw new IOException("The dst path=" + dst + " already exists. The 
admin"
+  + " should delete it before submitting the initial distcp job.");
+}
+Path snapshotPath = new Path(src,
+HdfsConstants.DOT_SNAPSHOT_DIR_SEPARATOR + CURRENT_SNAPSHOT_NAME);
+if (srcFs.exists(snapshotPath)) {
+  throw new IOException("The src snapshot=" + snapshotPath +
+  " already exists. The admin should delete the snapshot before"
+  + " submitting the initial distcp.");
+}
 srcFs.allowSnapshot(src); <--- move to here 
+  }
{code}

*FedBalanceContext.java*
 # Please add the necessary dots in the toString method, like this:
{code:java}
  public String toString() {
StringBuilder builder = new StringBuilder("FedBalance context:");
builder.append(" src=").append(src);
builder.append(", dst=").append(dst);
if (useMountReadOnly) {
  builder.append(", router-mode=true");
  builder.append(", mount-point=").append(mount);
} else {
  builder.append(", router-mode=false");
}
builder.append(", forceCloseOpenFiles=").append(forceCloseOpenFiles);
builder.append(", trash=").append(trashOpt.name());
builder.append(", map=").append(mapNum);
builder.append(", bandwidth=").append(bandwidthLimit);
return builder.toString();
  }
{code}

 # Can you add the newly added delayDuration option to this class?

> RBF: DistCpFedBalance implementation
> 
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, 
> HDFS-15346.006.patch, HDFS-15346.007.patch, HDFS-15346.008.patch, 
> HDFS-15346.009.patch
>
>
> Patch in HDFS-15294 is too big to review so we split it into 2 patches. This 
> is the second one. Detail can be found at HDFS-15294.




[jira] [Comment Edited] (HDFS-15346) RBF: DistCpFedBalance implementation

2020-06-06 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127489#comment-17127489
 ] 

Yiqun Lin edited comment on HDFS-15346 at 6/7/20, 4:35 AM:
---

Some more detailed review comments:

*HdfsConstants.java*
 Can we rename DOT_SNAPSHOT_SEPARATOR_DIR to the more readable name 
DOT_SNAPSHOT_DIR_SEPARATOR?

*DistCpFedBalance.java*
 # It would be good to print the fed context created from the input options, so 
that we know the final options that were passed in.
{noformat}
+. // -->  print fed balancer context
+  // Construct the balance job.
+  BalanceJob.Builder builder = new 
BalanceJob.Builder<>();
+  DistCpProcedure dcp =
+  new DistCpProcedure(DISTCP_PROCEDURE, null, delayDuration, context);
+  builder.nextProcedure(dcp);
{noformat}

 # We can replace this System.out with the LOG instance:
{noformat}
+for (BalanceJob job : jobs) {
+  if (!job.isJobDone()) {
+unfinished++;
+  }
+  System.out.println(job);
+}
{noformat}

*DistCpProcedure.java*
 # The message in IOException(src + " doesn't exist.") is not correctly described; 
it should be 'src + " should be the directory."'
 # For each stage change, can we add an additional output log, like this:
{noformat}
+if (srcFs.exists(new Path(src, HdfsConstants.DOT_SNAPSHOT_DIR))) {
+  throw new IOException(src + " shouldn't enable snapshot.");
+}
 LOG.info("Stage updated from {} to {}.", stage.name(), 
Stage.INIT_DISTCP.name())
+stage = Stage.INIT_DISTCP;
+  }
{noformat}

 # Here we reset the permission to 0, which means no operation is allowed at all? 
Is this expected? Why not 400 (only allow read)? The comment saying 
'cancelling the x permission of the source path.' makes me confused.
{noformat}
srcFs.setPermission(src, FsPermission.createImmutable((short) 0));
{noformat}

 # I prefer to throw an IOException rather than doing the delete operation in 
cleanUpBeforeInitDistcp. cleanUpBeforeInitDistcp is expected to be the final 
pre-check function before submitting the distcp job, and lets admin users check 
and do the delete operation manually by themselves.
{noformat}
+  private void initialCheckBeforeInitDistcp() throws IOException {
+    if (dstFs.exists(dst)) {
+      throw new IOException("The dst path=" + dst + " already exists.");
+    }
+    srcFs.allowSnapshot(src);
+    if (srcFs.exists(new Path(src,
+        HdfsConstants.DOT_SNAPSHOT_SEPARATOR_DIR + CURRENT_SNAPSHOT_NAME))) {
+      throw new IOException("The src snapshot already exists.");
+    }
+  }
{noformat}
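On the permission question above: the short passed to FsPermission.createImmutable encodes the standard rwx bits, so 0 revokes everything while octal 0400 would keep owner read. A dependency-free decoder makes the difference visible (plain Java, no Hadoop classes required):

```java
// Decode a POSIX-style permission short (the kind passed to Hadoop's
// FsPermission.createImmutable) into the familiar rwxrwxrwx string.
public class PermBits {
    public static String decode(short mode) {
        char[] flags = {'r', 'w', 'x'};
        StringBuilder sb = new StringBuilder(9);
        // Walk bits 8..0: owner rwx, group rwx, other rwx.
        for (int bit = 8; bit >= 0; bit--) {
            boolean set = ((mode >> bit) & 1) == 1;
            sb.append(set ? flags[(8 - bit) % 3] : '-');
        }
        return sb.toString();
    }
}
```

So (short) 0 yields "---------" (no operation allowed) while (short) 0400 yields "r--------" (owner read only), which is exactly the distinction the review question raises.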

*FedBalanceConfigs.java*
 Can we move all keys from BalanceProcedureConfigKeys to this class? We don't 
need two duplicated config classes. One follow-up task I am thinking of: we can 
have a separate config file named fedbalance-default.xml for the fedbalance tool, 
like distcp-default.xml for the distcp tool now. I don't prefer to 
add all tool config settings into hdfs-default.xml.

*FedBalanceContext.java*
 Override the toString method in FedBalanceContext to help us know the input 
options that are actually used.

*MountTableProcedure.java*
 The for loop can just break once we find the first source path that matches.
{noformat}
+for (MountTable result : results) {
+  if (mount.equals(result.getSourcePath())) {
+  existingEntry = result;
   break;   
+  }
+}
{noformat}
*TrashProcedure.java*
{noformat}
+  /**
+   * Move the source path to trash or delete it.
+   */
+  void moveToTrash() throws IOException {
+Path src = context.getSrc();
+if (srcFs.exists(src)) {
+  switch (context.getTrashOpt()) {
+  case TRASH:
+conf.setFloat(FS_TRASH_INTERVAL_KEY, 1);
+if (!Trash.moveToAppropriateTrash(srcFs, src, conf)) {
+  throw new IOException("Failed move " + src + " to trash.");
+}
+break;
+  case DELETE:
+if (!srcFs.delete(src, true)) {
+  throw new IOException("Failed delete " + src);
+}
+LOG.info("{} is deleted.", src);
+break;
+  default:
+break;
+  }
+}
+  }
{noformat}
For the above lines, two review comments:
 # Can we add the SKIP option check as well and throw an unexpected-option error?
{noformat}
case SKIP:
break;
+  default:
+  throw new IOException("Unexpected trash option=" + 
context.getTrashOpt());
+  }
{noformat}

 # FS_TRASH_INTERVAL_KEY set to 1 is too small; that means the trash 
will be deleted after 1 minute. Can you increase this to 60? Also please add a 
necessary comment in the trash option description explaining the default trash 
behavior: when trash is disabled on the server side, the client-side value will be 
used.



[jira] [Commented] (HDFS-15346) RBF: DistCpFedBalance implementation

2020-06-06 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127498#comment-17127498
 ] 

Yiqun Lin commented on HDFS-15346:
--

Review comments for unit tests:

*TestDistCpProcedure.java*
# Use CommonConfigurationKeysPublic.FS_DEFAULT_NAME_KEY to replace 
'fs.defaultFS'.
# In {{TestDistCpProcedure#testSuccessfulDistCpProcedure}}, can we add an 
additional file-length check between the src file and the dst file?
# Please complete the javadoc comments for the methods executeProcedure and 
createFiles.
# The method sede can be renamed to the more readable serializeProcedure.
# I think we are missing a corner-case test that disables write behavior in 
non-RBF mode.
# The whole test needs quite a long time to execute.
From the Jenkins test result:
{noformat}
testDiffDistCp  1 min 18 secPassed
testInitDistCp  22 sec  Passed
testRecoveryByStage 55 sec  Passed
testShutdown8.9 sec Passed
testStageFinalDistCp47 sec  Passed
testStageFinish 0.22 secPassed
testSuccessfulDistCpProcedure   38 sec  Passed
{noformat}
Can we look into why some unit tests spend so much time? Increasing the timeout 
value is a quick fix but not the best way.

*TestMountTableProcedure.java*
Please update testSeDe to testSeDeserialize.

*TestTrashProcedure.java*
Can we also add a test method testSeDeserialize like TestMountTableProcedure 
does?

> RBF: DistCpFedBalance implementation
> 
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, 
> HDFS-15346.006.patch, HDFS-15346.007.patch
>
>
> Patch in HDFS-15294 is too big to review so we split it into 2 patches. This 
> is the second one. Detail can be found at HDFS-15294.






[jira] [Commented] (HDFS-15346) RBF: DistCpFedBalance implementation

2020-06-06 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127489#comment-17127489
 ] 

Yiqun Lin commented on HDFS-15346:
--

Some more detailed review comments:

*HdfsConstants.java*
 Can we rename DOT_SNAPSHOT_SEPARATOR_DIR to the more readable name 
DOT_SNAPSHOT_DIR_SEPARATOR?

*DistCpFedBalance.java*
 # It would be good to print the fed context created from the input options, so 
that we know the final options that were passed in.
{noformat}
+. // -->  print fed balancer context
+  // Construct the balance job.
+  BalanceJob.Builder builder = new 
BalanceJob.Builder<>();
+  DistCpProcedure dcp =
+  new DistCpProcedure(DISTCP_PROCEDURE, null, delayDuration, context);
+  builder.nextProcedure(dcp);
{noformat}
 # We can replace this System.out with the LOG instance:
{noformat}
+for (BalanceJob job : jobs) {
+  if (!job.isJobDone()) {
+unfinished++;
+  }
+  System.out.println(job);
+}
{noformat}

*DistCpProcedure.java*
 # The message in IOException(src + " doesn't exist.") is not correctly described; 
it should be 'src + " should be the directory."'
 # For each stage change, can we add an additional output log, like this:
{noformat}
+if (srcFs.exists(new Path(src, HdfsConstants.DOT_SNAPSHOT_DIR))) {
+  throw new IOException(src + " shouldn't enable snapshot.");
+}
 LOG.info("Stage updated from {} to {}.", stage.name(), 
Stage.INIT_DISTCP.name())
+stage = Stage.INIT_DISTCP;
+  }
{noformat}
 # Here we reset the permission to 0, which means no operation is allowed at all? 
Is this expected? Why not 400 (only allow read)? The comment saying 
'cancelling the x permission of the source path.' makes me confused.
{noformat}
srcFs.setPermission(src, FsPermission.createImmutable((short) 0));
{noformat}
 # I prefer to throw an IOException rather than doing the delete operation in 
cleanUpBeforeInitDistcp. cleanUpBeforeInitDistcp is expected to be the final 
pre-check function before submitting the distcp job.
{noformat}
+  private void initialCheckBeforeInitDistcp() throws IOException {
+    if (dstFs.exists(dst)) {
+      throw new IOException("The dst path=" + dst + " already exists.");
+    }
+    srcFs.allowSnapshot(src);
+    if (srcFs.exists(new Path(src,
+        HdfsConstants.DOT_SNAPSHOT_SEPARATOR_DIR + CURRENT_SNAPSHOT_NAME))) {
+      throw new IOException("The src snapshot already exists.");
+    }
+  }
{noformat}

*FedBalanceConfigs.java*
 Can we move all keys from BalanceProcedureConfigKeys to this class? We don't 
need two duplicated config classes. One follow-up task I am thinking of: we can 
have a separate config file named fedbalance-default.xml for the fedbalance tool, 
like distcp-default.xml for the distcp tool now. I don't prefer to 
add all tool config settings into hdfs-default.xml.

*FedBalanceContext.java*
 Override the toString method in FedBalanceContext to help us know the input 
options that are actually used.

*MountTableProcedure.java*
 The for loop can just break once we find the first source path that matches.
{noformat}
+for (MountTable result : results) {
+  if (mount.equals(result.getSourcePath())) {
+  existingEntry = result;
   break;   
+  }
+}
{noformat}

*TrashProcedure.java*
{noformat}
+  /**
+   * Move the source path to trash or delete it.
+   */
+  void moveToTrash() throws IOException {
+Path src = context.getSrc();
+if (srcFs.exists(src)) {
+  switch (context.getTrashOpt()) {
+  case TRASH:
+conf.setFloat(FS_TRASH_INTERVAL_KEY, 1);
+if (!Trash.moveToAppropriateTrash(srcFs, src, conf)) {
+  throw new IOException("Failed move " + src + " to trash.");
+}
+break;
+  case DELETE:
+if (!srcFs.delete(src, true)) {
+  throw new IOException("Failed delete " + src);
+}
+LOG.info("{} is deleted.", src);
+break;
+  default:
+break;
+  }
+}
+  }
{noformat}
For the above lines, two review comments:
# Can we add the SKIP option check as well and throw an unexpected-option error?
{noformat}
case SKIP:
break;
+  default:
+  throw new IOException("Unexpected trash option=" + 
context.getTrashOpt());
+  }
{noformat}
# FS_TRASH_INTERVAL_KEY set to 1 is too small; that means the trash 
will be deleted after 1 minute. Can you increase this to 60? Also please add a 
necessary comment in the trash option description explaining the default trash 
behavior: when trash is disabled on the server side, the client-side value will be 
used.

> RBF: DistCpFedBalance implementation
> 
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> 

[jira] [Commented] (HDFS-15346) RBF: DistCpFedBalance implementation

2020-06-04 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126332#comment-17126332
 ] 

Yiqun Lin commented on HDFS-15346:
--

Will give a detailed review this weekend, [~LiJinglun]. 

> RBF: DistCpFedBalance implementation
> 
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch, 
> HDFS-15346.006.patch, HDFS-15346.007.patch
>
>
> Patch in HDFS-15294 is too big to review so we split it into 2 patches. This 
> is the second one. Detail can be found at HDFS-15294.






[jira] [Comment Edited] (HDFS-15346) RBF: DistCpFedBalance implementation

2020-06-02 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123467#comment-17123467
 ] 

Yiqun Lin edited comment on HDFS-15346 at 6/2/20, 7:56 AM:
---

[~LiJinglun], can you fix the related failed unit tests and the generated checkstyle warnings?

The patch generated 19 new + 2 unchanged - 0 fixed = 21 total (was 2)

[https://builds.apache.org/job/PreCommit-HDFS-Build/29395/artifact/out/diff-checkstyle-root.txt]


was (Author: linyiqun):
[~LiJinglun], can you fix the related failed unit tests and the generated checkstyle warnings?

The patch generated 19 new + 2 unchanged - 0 fixed = 21 total (was 2)

> RBF: DistCpFedBalance implementation
> 
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> HDFS-15346.003.patch, HDFS-15346.004.patch, HDFS-15346.005.patch
>
>
> Patch in HDFS-15294 is too big to review so we split it into 2 patches. This 
> is the second one. Detail can be found at HDFS-15294.






[jira] [Comment Edited] (HDFS-15346) RBF: DistCpFedBalance implementation

2020-06-01 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120981#comment-17120981
 ] 

Yiqun Lin edited comment on HDFS-15346 at 6/1/20, 12:23 PM:


Hi [~LiJinglun], some initial review comments from me:

*DistCpFedBalance.java*
 # line 77: I suggest extracting 'submit' as a static variable in this class.
 # line 85: the same comment, please extract it as well.
 # line 127: Can you complete the javadoc of this method?
 # line 132: Why is the default bandwidth only 1 for fedbalance? Won't that be 
too small?
 # line 137, 140, 150: We can use the method CommandLine#hasOption to extract 
boolean type input values.
 # line 178: Can you complete the javadoc of the constructor?
 # line 199, 206, 210, 215: Also suggest using static variables rather than 
hard-coded values in these places.
 # line 228: rClient is not closed after it's used.

*DistCpProcedure.java*
 # line 191: We can use HdfsConstants.SEPARATOR_DOT_SNAPSHOT_DIR_SEPARATOR to 
replace '/.snapshot/'.
 # line 306: It would be better to add some necessary description of the 
steps of the diff distcp job submission.
 # line 374: Can we replace '.snapshot' with HdfsConstants.DOT_SNAPSHOT_DIR in 
all other places in this class?

*TestDistCpProcedure.java*
 Can you replace '.snapshot' with HdfsConstants.DOT_SNAPSHOT_DIR in 
this class?

*TestTrashProcedure.java*
{quote}Path src = new Path(nnUri + "/"+getMethodName()+"-src");
 Path dst = new Path(nnUri + "/"+getMethodName()+"-dst");
{quote}
We don't need to use nnUri here because we have already got the FileSystem 
instance. If we don't want to specify one namespace, the URI prefix can be 
omitted and the default fs will be used.
 We can simplify this to
{quote}Path src = new Path("/" + getMethodName() + "-src");
 Path dst = new Path("/" + getMethodName() + "-dst");
{quote}


was (Author: linyiqun):
Hi [~LiJinglun], some initial review comments from me:

*DistCpFedBalance.java*
 # line 77: I suggest extracting 'submit' as a static variable in this class.
 # line 85: the same comment, please extract it as well.
 # line 127: Can you complete the javadoc of this method?
 # line 132: Why is the default bandwidth only 1 for fedbalance? Won't that be 
too small?
 # line 137, 140, 150: We can use the method CommandLine#hasOption to extract 
boolean type input values.
 # line 178: Can you complete the javadoc of the constructor?
 # line 199, 206, 210, 215: Also suggest using static variables rather than 
hard-coded values in these places.
 # line 228: rClient is not closed after it's used.

*DistCpProcedure.java*
 # line 191: We can use HdfsConstants.SEPARATOR_DOT_SNAPSHOT_DIR_SEPARATOR to 
replace '/.snapshot/'.
 # line 306: It would be better to add some necessary description of the 
steps of the diff distcp job submission.
 # line 374: Can we replace '.snapshot' with HdfsConstants.DOT_SNAPSHOT_DIR in 
all other places in this class?

*TestDistCpProcedure.java*
Can you replace '.snapshot' with HdfsConstants.DOT_SNAPSHOT_DIR in 
this class?

*TestTrashProcedure.java*
{quote}Path src = new Path(nnUri + "/"+getMethodName()+"-src");
 Path dst = new Path(nnUri + "/"+getMethodName()+"-dst");
{quote}
We don't need to use nnUri here because we have already got the FileSystem 
instance. If we don't want to specify one namespace, the URI prefix can be 
omitted and the default fs will be used.
 We can simplify this to
{quote}Path src = new Path("/"+getMethodName()+"-src");
 Path dst = new Path("/"+getMethodName()+"-dst");
{quote}
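The CommandLine#hasOption suggestion above can be sketched as follows, assuming Apache Commons CLI is on the classpath; the `-forceCloseOpen` flag name and `parseForceClose` helper are illustrative, not the actual fedbalance options.

```java
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;

public class HasOptionDemo {
  // Sketch of the review suggestion: for a boolean flag, use
  // CommandLine#hasOption instead of reading and parsing an option value.
  public static boolean parseForceClose(String[] args) throws ParseException {
    Options options = new Options();
    // false = the option takes no argument; its mere presence means "true".
    options.addOption("forceCloseOpen", false, "Force close all open files.");
    CommandLine cmd = new DefaultParser().parse(options, args);
    return cmd.hasOption("forceCloseOpen");
  }

  public static void main(String[] args) throws ParseException {
    System.out.println(parseForceClose(new String[] {"-forceCloseOpen"}));
    System.out.println(parseForceClose(new String[] {}));
  }
}
```

This avoids `Boolean.parseBoolean` on a flag value and makes the flag presence itself the boolean, which is what `hasOption` is designed for.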

> RBF: DistCpFedBalance implementation
> 
>
> Key: HDFS-15346
> URL: https://issues.apache.org/jira/browse/HDFS-15346
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15346.001.patch, HDFS-15346.002.patch, 
> HDFS-15346.003.patch, HDFS-15346.004.patch
>
>
> Patch in HDFS-15294 is too big to review so we split it into 2 patches. This 
> is the second one. Detail can be found at HDFS-15294.





