[jira] [Commented] (HDFS-15442) Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative value

2020-09-21 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199350#comment-17199350
 ] 

AMC-team commented on HDFS-15442:
-

[~ayushtkn] Thanks for the reminder.

Uploading the patch again.

> Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative 
> value
> ---
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Assignee: AMC-team
>Priority: Major
> Attachments: HDFS-15442.000.patch
>
>
> In the current implementation of checkpoint image transfer, if the file length
> is larger than the configured value of dfs.image.transfer.chunksize, chunked
> streaming mode is used to avoid internal buffering. This mode should be used
> only if more than chunkSize bytes of data are present to upload; otherwise the
> upload may not happen.
> {code:java}
> // TransferFsImage.java
> int chunkSize = (int) conf.getLongBytes(
>     DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
>     DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
> if (imageFile.length() > chunkSize) {
>   // using chunked streaming mode to support upload of 2GB+ files and to
>   // avoid internal buffering.
>   // this mode should be used only if more than chunkSize data is present
>   // to upload. otherwise upload may not happen sometimes.
>   connection.setChunkedStreamingMode(chunkSize);
> }
> {code}
> There is no validation of this parameter, so a user may accidentally set it to
> an invalid value. If the user sets chunkSize to a negative value, chunked
> streaming mode will always be used, because imageFile.length() > chunkSize then
> holds for every image. Inside setChunkedStreamingMode(chunkSize) there is
> correction code: if the chunk length is <= 0, it is changed to
> DEFAULT_CHUNK_SIZE.
> {code:java}
> public void setChunkedStreamingMode(int chunklen) {
>   if (connected) {
>     throw new IllegalStateException(
>         "Can't set streaming mode: already connected");
>   }
>   if (fixedContentLength != -1 || fixedContentLengthLong != -1) {
>     throw new IllegalStateException("Fixed length streaming mode set");
>   }
>   chunkLength = chunklen <= 0 ? DEFAULT_CHUNK_SIZE : chunklen;
> }
> {code}
> However,
> *if the user sets dfs.image.transfer.chunksize to a value <= 0, even images
> with imageFile.length() < DEFAULT_CHUNK_SIZE will be uploaded in chunked
> streaming mode, which may fail the upload as mentioned above.* *(This scenario
> may not be common, but we can prevent users from setting this parameter to an
> extremely small value.)*
> *How to fix:*
> Add checking or correction code right after parsing the config value, before
> the value is actually used in setChunkedStreamingMode(); see the sketch below.
>   
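> Hypothetical sketch (reusing the names from the TransferFsImage.java snippet
> above; the actual patch may differ):
> {code:java}
> // Fall back to the default when the configured chunk size is non-positive,
> // before the value is ever compared against the image length.
> long configuredChunkSize = conf.getLongBytes(
>     DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
>     DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
> if (configuredChunkSize <= 0) {
>   configuredChunkSize = DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT;
> }
> int chunkSize = (int) configuredChunkSize;
> if (imageFile.length() > chunkSize) {
>   // chunked streaming mode is now entered only for genuinely large images
>   connection.setChunkedStreamingMode(chunkSize);
> }
> {code}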



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15442) Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative value

2020-09-21 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15442:

Status: Open  (was: Patch Available)

> Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative 
> value
> ---
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15442.000.patch
>
>






[jira] [Updated] (HDFS-15442) Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative value

2020-09-21 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15442:

Attachment: HDFS-15442.000.patch
  Assignee: AMC-team
Status: Patch Available  (was: Open)

> Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative 
> value
> ---
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Assignee: AMC-team
>Priority: Major
> Attachments: HDFS-15442.000.patch
>
>






[jira] [Updated] (HDFS-15442) Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative value

2020-09-21 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15442:

Attachment: (was: HDFS-15442.000.patch)

> Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative 
> value
> ---
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15442.000.patch
>
>






[jira] [Updated] (HDFS-15442) Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative value

2020-09-21 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15442:

Attachment: HDFS-15442.000.patch

> Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative 
> value
> ---
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15442.000.patch
>
>






[jira] [Updated] (HDFS-15442) Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative value

2020-09-21 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15442:

Attachment: (was: HDFS-15442.000.patch)

> Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative 
> value
> ---
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15442.000.patch
>
>






[jira] [Commented] (HDFS-15438) Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-09-03 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189950#comment-17189950
 ] 

AMC-team commented on HDFS-15438:
-

[~ayushtkn]

Sorry for the late reply. I checked the related tests and they passed.

!Screen Shot 2020-09-03 at 4.33.53 PM.png!

Please let me know if I need to check more failed tests.

> Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15438.000.patch, HDFS-15438.001.patch, Screen Shot 
> 2020-09-03 at 4.33.53 PM.png
>
>
> In the HDFS disk balancer, the config parameter
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we
> can ignore for a specific move between two disks before the move is abandoned.
> The parameter accepts any value >= 0, and setting it to 0 should mean no error
> tolerance at all. However, setting the value to 0 simply skips the block copy,
> even when no disk error occurs, because the while loop condition
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>     DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
>     try {
>       ... // get the block
>     } catch (IOException e) {
>       item.incErrorCount();
>     }
>   }
>   if (item.getErrorCount() >= getMaxError(item)) {
>     item.setErrMsg("Error count exceeded.");
>     LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
>         item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition to support the value 0, as sketched below.
>   
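> One possible adjustment (hypothetical sketch, not the actual patch): treat a
> maximum of 0 as "abandon after the first error" rather than "never copy", by
> letting the loop run as long as no error has occurred yet:
> {code:java}
> // With max errors = 0, the first iteration still runs (error count is 0),
> // and the loop stops as soon as the first error is recorded.
> while (!iter.atEnd()
>     && (item.getErrorCount() == 0
>         || item.getErrorCount() < getMaxError(item))) {
>   ... // get the block, as before
> }
> {code}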






[jira] [Updated] (HDFS-15438) Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-09-03 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15438:

Attachment: Screen Shot 2020-09-03 at 4.33.53 PM.png

> Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15438.000.patch, HDFS-15438.001.patch, Screen Shot 
> 2020-09-03 at 4.33.53 PM.png
>
>






[jira] [Commented] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-08-15 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178301#comment-17178301
 ] 

AMC-team commented on HDFS-15439:
-

Thanks for the review and suggestions!!! [~ayushtkn]

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Assignee: AMC-team
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: HDFS-15439.000.patch, HDFS-15439.001.patch, 
> HDFS-15439.002.patch
>
>
> The configuration parameter "dfs.mover.retry.max.attempts" defines the
> maximum number of retries before the mover considers the move failed. There is
> no validation code, so this parameter accepts any int value.
> Theoretically, setting this value to <= 0 should mean no retry at all.
> However, if you set it to a negative value, the retry-failed condition is
> never satisfied, because the if statement is "*if
> (retryCount.get() == retryMaxAttempts)*": the retry count is incremented by
> retryCount.incrementAndGet() after each failure, but starting from 0 it never
> *equals* a negative *retryMaxAttempts*.
> {code:java}
> private Result processNamespace() throws IOException {
>   ... // wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
>     if (retryCount.get() == retryMaxAttempts) {
>       result.setRetryFailed();
>       LOG.error("Failed to move some block's after "
>           + retryMaxAttempts + " retries.");
>       return result;
>     } else {
>       retryCount.incrementAndGet();
>     }
>   } else {
>     // Reset retry count if no failure.
>     retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add validation so that "dfs.mover.retry.max.attempts" accepts only
> non-negative values, or change the if statement condition so it triggers when
> the retry count reaches or exceeds the max attempts; both options are
> sketched below.
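> Hypothetical sketch of both options (the DFSConfigKeys constant names are
> assumed; the actual patch may differ):
> {code:java}
> // Option 1: sanitize the value right after parsing, so any negative
> // setting behaves like "no retries at all".
> int retryMaxAttempts = conf.getInt(
>     DFSConfigKeys.DFS_MOVER_RETRY_MAX_ATTEMPTS_KEY,
>     DFSConfigKeys.DFS_MOVER_RETRY_MAX_ATTEMPTS_DEFAULT);
> if (retryMaxAttempts < 0) {
>   retryMaxAttempts = 0;
> }
>
> // Option 2: compare with >= instead of ==, so a retry count that starts
> // at 0 always reaches the threshold, even if the threshold is negative.
> if (retryCount.get() >= retryMaxAttempts) {
>   result.setRetryFailed();
>   ...
> }
> {code}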






[jira] [Updated] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-08-10 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Attachment: HDFS-15439.002.patch

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch, HDFS-15439.001.patch, 
> HDFS-15439.002.patch
>
>






[jira] [Commented] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-08-10 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175158#comment-17175158
 ] 

AMC-team commented on HDFS-15439:
-

upload the new patch

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch, HDFS-15439.001.patch, 
> HDFS-15439.002.patch
>
>






[jira] [Updated] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-08-10 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Attachment: (was: HDFS-15439.002.patch)

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch, HDFS-15439.001.patch
>
>






[jira] [Commented] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-08-10 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175150#comment-17175150
 ] 

AMC-team commented on HDFS-15439:
-

upload the new patch

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch, HDFS-15439.001.patch, 
> HDFS-15439.002.patch
>
>






[jira] [Issue Comment Deleted] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-08-10 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Comment: was deleted

(was: upload the new patch)

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch, HDFS-15439.001.patch, 
> HDFS-15439.002.patch
>
>






[jira] [Updated] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-08-09 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Attachment: HDFS-15439.002.patch

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch, HDFS-15439.001.patch, 
> HDFS-15439.002.patch
>
>






[jira] [Updated] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-08-09 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Attachment: HDFS-15439.001.patch

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch, HDFS-15439.001.patch
>
>






[jira] [Commented] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-08-09 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173891#comment-17173891
 ] 

AMC-team commented on HDFS-15439:
-

Added a new patch to check and correct the misconfigured value.

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch, HDFS-15439.001.patch
>
>






[jira] [Commented] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-08-09 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173794#comment-17173794
 ] 

AMC-team commented on HDFS-15439:
-

Thanks! [~ayushtkn] Sorry, I've been tied up with some other things.
Now I will try to solve this issue, and if there is any problem I may need your
help.
Thanks again!


> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch
>
>






[jira] [Commented] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-07-27 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165538#comment-17165538
 ] 

AMC-team commented on HDFS-15439:
-

Sorry about the mis-operation; I will upload a valid patch later.

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch
>
>






[jira] [Updated] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-07-27 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Attachment: (was: HDFS-15439.001.patch)

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch
>
>






[jira] [Issue Comment Deleted] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-07-27 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Comment: was deleted

(was: Upload a patch based on [~ayushtkn]'s suggestion. Thanks!)

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch
>
>






[jira] [Comment Edited] (HDFS-15442) Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative value

2020-07-24 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164724#comment-17164724
 ] 

AMC-team edited comment on HDFS-15442 at 7/25/20, 3:05 AM:
---

Uploaded a patch to fall back from an invalid chunksize value to the default
right after parsing.


was (Author: amc-team):
Uploaded a patch to fall back from an invalid chunksize value to the default.

> Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative 
> value
> ---
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15442.000.patch
>
>






[jira] [Commented] (HDFS-15442) Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative value

2020-07-24 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164724#comment-17164724
 ] 

AMC-team commented on HDFS-15442:
-

Uploaded a patch to fall back from an invalid chunksize value to the default.

> Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative 
> value
> ---
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15442.000.patch
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15442) Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative value

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15442:

Attachment: (was: HDFS-15442.000.patch)

> Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative 
> value
> ---
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15442.000.patch
>
>
> In the current implementation of checkpoint image transfer, if the file 
> length is bigger than the configured value of dfs.image.transfer.chunksize, 
> chunked streaming mode is used to avoid internal buffering. This mode should 
> be used only if more than chunkSize data is present to upload; otherwise the 
> upload may sometimes not happen.
> {code:java}
> //TransferFsImage.java
> int chunkSize = (int) conf.getLongBytes(
> DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
> DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
> if (imageFile.length() > chunkSize) {
>   // using chunked streaming mode to support upload of 2GB+ files and to
>   // avoid internal buffering.
>   // this mode should be used only if more than chunkSize data is present
>   // to upload. otherwise upload may not happen sometimes.
>   connection.setChunkedStreamingMode(chunkSize);
> }
> {code}
> There is no check code for this parameter, so a user may accidentally set it 
> to a wrong value. If the user sets chunkSize to a negative value, chunked 
> streaming mode will always be used. setChunkedStreamingMode(chunkSize) does 
> contain correction code: if the chunkSize is <= 0, it is changed to 
> DEFAULT_CHUNK_SIZE.
> {code:java}
> public void setChunkedStreamingMode (int chunklen) {
> if (connected) {
> throw new IllegalStateException ("Can't set streaming mode: already 
> connected");
> }
> if (fixedContentLength != -1 || fixedContentLengthLong != -1) {
> throw new IllegalStateException ("Fixed length streaming mode set");
> }
> chunkLength = chunklen <=0? DEFAULT_CHUNK_SIZE : chunklen;
> }
> {code}
> However,
> *if the user sets dfs.image.transfer.chunksize to a value <= 0, even images 
> whose imageFile.length() < DEFAULT_CHUNK_SIZE will use chunked streaming 
> mode, and the upload may fail as described above.* *(This scenario may not 
> be common, but we can prevent users from setting this param to an extremely 
> small value.)*
> *How to fix:*
> Add checking or correction code right after parsing the config value, before 
> the value is actually used in setChunkedStreamingMode().
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15442) Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative value

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15442:

Attachment: HDFS-15442.000.patch
Status: Patch Available  (was: Open)

> Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative 
> value
> ---
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15442.000.patch
>
>
> In the current implementation of checkpoint image transfer, if the file 
> length is bigger than the configured value of dfs.image.transfer.chunksize, 
> chunked streaming mode is used to avoid internal buffering. This mode should 
> be used only if more than chunkSize data is present to upload; otherwise the 
> upload may sometimes not happen.
> {code:java}
> //TransferFsImage.java
> int chunkSize = (int) conf.getLongBytes(
> DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
> DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
> if (imageFile.length() > chunkSize) {
>   // using chunked streaming mode to support upload of 2GB+ files and to
>   // avoid internal buffering.
>   // this mode should be used only if more than chunkSize data is present
>   // to upload. otherwise upload may not happen sometimes.
>   connection.setChunkedStreamingMode(chunkSize);
> }
> {code}
> There is no check code for this parameter, so a user may accidentally set it 
> to a wrong value. If the user sets chunkSize to a negative value, chunked 
> streaming mode will always be used. setChunkedStreamingMode(chunkSize) does 
> contain correction code: if the chunkSize is <= 0, it is changed to 
> DEFAULT_CHUNK_SIZE.
> {code:java}
> public void setChunkedStreamingMode (int chunklen) {
> if (connected) {
> throw new IllegalStateException ("Can't set streaming mode: already 
> connected");
> }
> if (fixedContentLength != -1 || fixedContentLengthLong != -1) {
> throw new IllegalStateException ("Fixed length streaming mode set");
> }
> chunkLength = chunklen <=0? DEFAULT_CHUNK_SIZE : chunklen;
> }
> {code}
> However,
> *if the user sets dfs.image.transfer.chunksize to a value <= 0, even images 
> whose imageFile.length() < DEFAULT_CHUNK_SIZE will use chunked streaming 
> mode, and the upload may fail as described above.* *(This scenario may not 
> be common, but we can prevent users from setting this param to an extremely 
> small value.)*
> *How to fix:*
> Add checking or correction code right after parsing the config value, before 
> the value is actually used in setChunkedStreamingMode().
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15442) Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative value

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15442:

Attachment: HDFS-15442.000.patch

> Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative 
> value
> ---
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15442.000.patch
>
>
> In the current implementation of checkpoint image transfer, if the file 
> length is bigger than the configured value of dfs.image.transfer.chunksize, 
> chunked streaming mode is used to avoid internal buffering. This mode should 
> be used only if more than chunkSize data is present to upload; otherwise the 
> upload may sometimes not happen.
> {code:java}
> //TransferFsImage.java
> int chunkSize = (int) conf.getLongBytes(
> DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
> DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
> if (imageFile.length() > chunkSize) {
>   // using chunked streaming mode to support upload of 2GB+ files and to
>   // avoid internal buffering.
>   // this mode should be used only if more than chunkSize data is present
>   // to upload. otherwise upload may not happen sometimes.
>   connection.setChunkedStreamingMode(chunkSize);
> }
> {code}
> There is no check code for this parameter, so a user may accidentally set it 
> to a wrong value. If the user sets chunkSize to a negative value, chunked 
> streaming mode will always be used. setChunkedStreamingMode(chunkSize) does 
> contain correction code: if the chunkSize is <= 0, it is changed to 
> DEFAULT_CHUNK_SIZE.
> {code:java}
> public void setChunkedStreamingMode (int chunklen) {
> if (connected) {
> throw new IllegalStateException ("Can't set streaming mode: already 
> connected");
> }
> if (fixedContentLength != -1 || fixedContentLengthLong != -1) {
> throw new IllegalStateException ("Fixed length streaming mode set");
> }
> chunkLength = chunklen <=0? DEFAULT_CHUNK_SIZE : chunklen;
> }
> {code}
> However,
> *if the user sets dfs.image.transfer.chunksize to a value <= 0, even images 
> whose imageFile.length() < DEFAULT_CHUNK_SIZE will use chunked streaming 
> mode, and the upload may fail as described above.* *(This scenario may not 
> be common, but we can prevent users from setting this param to an extremely 
> small value.)*
> *How to fix:*
> Add checking or correction code right after parsing the config value, before 
> the value is actually used in setChunkedStreamingMode().
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15440) The usage of dfs.disk.balancer.block.tolerance.percent is not intuitive.

2020-07-24 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164721#comment-17164721
 ] 

AMC-team commented on HDFS-15440:
-

Uploaded a patch that changes the current logic and refines the parameter check.
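
As a rough sketch of the intended semantics (reusing the method and getter 
names from the snippet quoted below; not necessarily what the attached patch 
does), the check could compare bytesCopied against a threshold derived from 
bytesToCopy instead of inflating bytesCopied:
{code:java}
// "Close enough" per the documented semantics: copying at least
// bytesToCopy * (1 - tolerance/100) counts as a good-enough move.
private boolean isCloseEnough(DiskBalancerWorkItem item) {
  long tolerance = getBlockTolerancePercentage(item);
  long threshold = item.getBytesToCopy()
      - ((item.getBytesToCopy() * tolerance) / 100);
  return item.getBytesCopied() >= threshold;
}
{code}
With bytesToCopy = 20GB and tolerance = 10, the threshold is 18GB, so copying 
18GB now counts as successful, matching the example in the official doc.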

> The usage of dfs.disk.balancer.block.tolerance.percent is not intuitive.  
> --
>
> Key: HDFS-15440
> URL: https://issues.apache.org/jira/browse/HDFS-15440
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15440.000.patch
>
>
> In the HDFS disk balancer, the configuration parameter 
> "dfs.disk.balancer.block.tolerance.percent" sets a percentage (e.g. 10 means 
> 10%) that defines a good-enough move.
> The description in hdfs-default.xml does not make clear to me how the value 
> is actually calculated and used:
> {quote}When a disk balancer copy operation is proceeding, the datanode is 
> still active. So it might not be possible to move the exactly specified 
> amount of data. So tolerance allows us to define a percentage which defines a 
> good enough move.
> {quote}
> So I refer to the [official doc of HDFS disk 
> balancer|https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html]
>  and the description is:
> {quote}The tolerance percent specifies when we have reached a good enough 
> value for any copy step. For example, if you specify 10 then getting close to 
> 10% of the target value is good enough. It is to say if the move operation is 
> 20GB in size, if we can move 18GB (20 * (1-10%)) that operation is considered 
> successful.
> {quote}
> However, from the source code in DiskBalancer.java:
> {code:java}
> // Inflates bytesCopied and returns true or false. This allows us to stop
> // copying if we have reached close enough.
> private boolean isCloseEnough(DiskBalancerWorkItem item) {
>   long temp = item.getBytesCopied() +
>  ((item.getBytesCopied() * getBlockTolerancePercentage(item)) / 100);
>   return (item.getBytesToCopy() >= temp) ? false : true;
> }
> {code}
> Here, if item.getBytesToCopy() = 20GB, then item.getBytesCopied() = 18GB is 
> still not enough, because 20 > 18 + 18*0.1. Instead, we should check whether 
> 18 >= 20*(1-0.1).
> The calculation in isLessThanNeeded() (which checks whether a given block is 
> less than the needed size to meet our goal) is unintuitive in the same way.
> Also, this parameter has no upper-bound check, which means you can even set 
> it to 100%, which is an obviously wrong value.
> *How to fix*
> Although this may not lead to severe failure, it is better to make the doc 
> and code consistent, and to refine the description in hdfs-default.xml to 
> make it more precise and clear.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15440) The usage of dfs.disk.balancer.block.tolerance.percent is not intuitive.

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15440:

Attachment: HDFS-15440.000.patch
Status: Patch Available  (was: Open)

> The usage of dfs.disk.balancer.block.tolerance.percent is not intuitive.  
> --
>
> Key: HDFS-15440
> URL: https://issues.apache.org/jira/browse/HDFS-15440
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15440.000.patch
>
>
> In the HDFS disk balancer, the configuration parameter 
> "dfs.disk.balancer.block.tolerance.percent" sets a percentage (e.g. 10 means 
> 10%) that defines a good-enough move.
> The description in hdfs-default.xml does not make clear to me how the value 
> is actually calculated and used:
> {quote}When a disk balancer copy operation is proceeding, the datanode is 
> still active. So it might not be possible to move the exactly specified 
> amount of data. So tolerance allows us to define a percentage which defines a 
> good enough move.
> {quote}
> So I refer to the [official doc of HDFS disk 
> balancer|https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html]
>  and the description is:
> {quote}The tolerance percent specifies when we have reached a good enough 
> value for any copy step. For example, if you specify 10 then getting close to 
> 10% of the target value is good enough. It is to say if the move operation is 
> 20GB in size, if we can move 18GB (20 * (1-10%)) that operation is considered 
> successful.
> {quote}
> However, from the source code in DiskBalancer.java:
> {code:java}
> // Inflates bytesCopied and returns true or false. This allows us to stop
> // copying if we have reached close enough.
> private boolean isCloseEnough(DiskBalancerWorkItem item) {
>   long temp = item.getBytesCopied() +
>  ((item.getBytesCopied() * getBlockTolerancePercentage(item)) / 100);
>   return (item.getBytesToCopy() >= temp) ? false : true;
> }
> {code}
> Here, if item.getBytesToCopy() = 20GB, then item.getBytesCopied() = 18GB is 
> still not enough, because 20 > 18 + 18*0.1. Instead, we should check whether 
> 18 >= 20*(1-0.1).
> The calculation in isLessThanNeeded() (which checks whether a given block is 
> less than the needed size to meet our goal) is unintuitive in the same way.
> Also, this parameter has no upper-bound check, which means you can even set 
> it to 100%, which is an obviously wrong value.
> *How to fix*
> Although this may not lead to severe failure, it is better to make the doc 
> and code consistent, and to refine the description in hdfs-default.xml to 
> make it more precise and clear.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15440) The usage of dfs.disk.balancer.block.tolerance.percent is not intuitive.

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15440:

Description: 
In the HDFS disk balancer, the configuration parameter 
"dfs.disk.balancer.block.tolerance.percent" sets a percentage (e.g. 10 means 
10%) that defines a good-enough move.

The description in hdfs-default.xml does not make clear to me how the value is 
actually calculated and used:
{quote}When a disk balancer copy operation is proceeding, the datanode is still 
active. So it might not be possible to move the exactly specified amount of 
data. So tolerance allows us to define a percentage which defines a good enough 
move.
{quote}
So I refer to the [official doc of HDFS disk 
balancer|https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html]
 and the description is:
{quote}The tolerance percent specifies when we have reached a good enough value 
for any copy step. For example, if you specify 10 then getting close to 10% of 
the target value is good enough. It is to say if the move operation is 20GB in 
size, if we can move 18GB (20 * (1-10%)) that operation is considered 
successful.
{quote}
However, from the source code in DiskBalancer.java:
{code:java}
// Inflates bytesCopied and returns true or false. This allows us to stop
// copying if we have reached close enough.
private boolean isCloseEnough(DiskBalancerWorkItem item) {
  long temp = item.getBytesCopied() +
 ((item.getBytesCopied() * getBlockTolerancePercentage(item)) / 100);
  return (item.getBytesToCopy() >= temp) ? false : true;
}
{code}
Here, if item.getBytesToCopy() = 20GB, then item.getBytesCopied() = 18GB is 
still not enough, because 20 > 18 + 18*0.1. Instead, we should check whether 
18 >= 20*(1-0.1).
The calculation in isLessThanNeeded() (which checks whether a given block is 
less than the needed size to meet our goal) is unintuitive in the same way.

Also, this parameter has no upper-bound check, which means you can even set it 
to 100%, which is an obviously wrong value.

*How to fix*

Although this may not lead to severe failure, it is better to make the doc and 
code consistent, and to refine the description in hdfs-default.xml to make it 
more precise and clear.

  was:
In the HDFS disk balancer, the configuration parameter 
"dfs.disk.balancer.block.tolerance.percent" sets a percentage (e.g. 10 means 
10%) that defines a good-enough move.

The description in hdfs-default.xml does not make clear to me how the value is 
actually calculated and used:
{quote}When a disk balancer copy operation is proceeding, the datanode is still 
active. So it might not be possible to move the exactly specified amount of 
data. So tolerance allows us to define a percentage which defines a good enough 
move.
{quote}
So I refer to the [official doc of HDFS disk 
balancer|https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html]
 and the description is:
{quote}The tolerance percent specifies when we have reached a good enough value 
for any copy step. For example, if you specify 10 then getting close to 10% of 
the target value is good enough. It is to say if the move operation is 20GB in 
size, if we can move 18GB (20 * (1-10%)) that operation is considered 
successful.
{quote}
However, from the source code in DiskBalancer.java:
{code:java}
// Inflates bytesCopied and returns true or false. This allows us to stop
// copying if we have reached close enough.
private boolean isCloseEnough(DiskBalancerWorkItem item) {
  long temp = item.getBytesCopied() +
 ((item.getBytesCopied() * getBlockTolerancePercentage(item)) / 100);
  return (item.getBytesToCopy() >= temp) ? false : true;
}
{code}
Here, if item.getBytesToCopy() = 20GB, then item.getBytesCopied() = 18GB is 
still not enough, because 20 > 18 + 18*0.1.
The calculation in isLessThanNeeded() (which checks whether a given block is 
less than the needed size to meet our goal) is unintuitive in the same way.

*How to fix*

Although this may not lead to severe failure, it is better to make the doc and 
code consistent, and to refine the description in hdfs-default.xml to make it 
more precise and clear.


> The usage of dfs.disk.balancer.block.tolerance.percent is not intuitive.  
> --
>
> Key: HDFS-15440
> URL: https://issues.apache.org/jira/browse/HDFS-15440
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
>
> In the HDFS disk balancer, the configuration parameter 
> "dfs.disk.balancer.block.tolerance.percent" sets a percentage (e.g. 10 means 
> 10%) that defines a good-enough move.
> The description in hdfs-default.xml does not make clear to me how the value 
> is actually calculated and used:
> {quote}When a disk balancer copy 

[jira] [Updated] (HDFS-15440) The usage of dfs.disk.balancer.block.tolerance.percent is not intuitive.

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15440:

Summary: The usage of dfs.disk.balancer.block.tolerance.percent is not 
intuitive.  (was: The doc of dfs.disk.balancer.block.tolerance.percent is 
misleading)

> The usage of dfs.disk.balancer.block.tolerance.percent is not intuitive.  
> --
>
> Key: HDFS-15440
> URL: https://issues.apache.org/jira/browse/HDFS-15440
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
>
> In the HDFS disk balancer, the configuration parameter 
> "dfs.disk.balancer.block.tolerance.percent" sets a percentage (e.g. 10 means 
> 10%) that defines a good-enough move.
> The description in hdfs-default.xml does not make clear to me how the value 
> is actually calculated and used:
> {quote}When a disk balancer copy operation is proceeding, the datanode is 
> still active. So it might not be possible to move the exactly specified 
> amount of data. So tolerance allows us to define a percentage which defines a 
> good enough move.
> {quote}
> So I refer to the [official doc of HDFS disk 
> balancer|https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html]
>  and the description is:
> {quote}The tolerance percent specifies when we have reached a good enough 
> value for any copy step. For example, if you specify 10 then getting close to 
> 10% of the target value is good enough. It is to say if the move operation is 
> 20GB in size, if we can move 18GB (20 * (1-10%)) that operation is considered 
> successful.
> {quote}
> However, from the source code in DiskBalancer.java:
> {code:java}
> // Inflates bytesCopied and returns true or false. This allows us to stop
> // copying if we have reached close enough.
> private boolean isCloseEnough(DiskBalancerWorkItem item) {
>   long temp = item.getBytesCopied() +
>  ((item.getBytesCopied() * getBlockTolerancePercentage(item)) / 100);
>   return (item.getBytesToCopy() >= temp) ? false : true;
> }
> {code}
> Here, if item.getBytesToCopy() = 20GB, then item.getBytesCopied() = 18GB is 
> still not enough, because 20 > 18 + 18*0.1.
> The calculation in isLessThanNeeded() (which checks whether a given block is 
> less than the needed size to meet our goal) is unintuitive in the same way.
> *How to fix*
> Although this may not lead to severe failure, it is better to make the doc 
> and code consistent, and to refine the description in hdfs-default.xml to 
> make it more precise and clear.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-24 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164706#comment-17164706
 ] 

AMC-team edited comment on HDFS-15443 at 7/25/20, 1:52 AM:
---

Thanks [~ayushtkn] for the great feedback.

I refined the patch (changed maxXceiverCount to this.maxXceiverCount).

I will check the failed test.


was (Author: amc-team):
Thanks [~ayushtkn] for the great feedback.

I refined the patch (changed maxXceiverCount to this.maxXceiverCount).

I also checked the standard output of the consistently failing test 
hadoop.fs.contract.hdfs.TestHDFSContractMultipartUploader, and I think it is 
not related to this patch:
{quote}java.lang.IllegalArgumentException: Path /test is not under 
hdfs://localhost:36213/test
 at com.google.common.base.Preconditions.checkArgument(Preconditions.java:440)
 at 
org.apache.hadoop.fs.impl.AbstractMultipartUploader.checkPath(AbstractMultipartUploader.java:73)
 at 
org.apache.hadoop.fs.impl.AbstractMultipartUploader.abortUploadsUnderPath(AbstractMultipartUploader.java:136)
 at 
org.apache.hadoop.fs.contract.AbstractContractMultipartUploaderTest.teardown(AbstractContractMultipartUploaderTest.java:107)
{quote}

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch, 
> HDFS-15443.002.patch, HDFS-15443.003.patch
>
>
> The configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> Many issues are caused by not setting this param to an appropriate value, 
> yet there is no check code to restrict the parameter. Although a 
> hard-and-fast rule is difficult because we need to consider the number of 
> cores, main memory, etc., *we can prevent users from accidentally setting 
> this value to an absolutely wrong value* (e.g. a negative value that totally 
> breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15438) Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-24 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164713#comment-17164713
 ] 

AMC-team commented on HDFS-15438:
-

Thanks [~ayushtkn] for the feedback. I uploaded a patch that changes the while 
loop condition and the if condition to support the value 0.

What's more, IMHO, the patched logic may be more intuitive: previously, if we 
set dfs.disk.balancer.max.disk.errors to n, it could actually tolerate only 
n-1 errors. Now it can tolerate n errors, which is more consistent with the 
parameter's documentation:
{quote}During a block move from a source to destination disk, we might 
encounter various errors. *This defines how many errors we can tolerate* before 
we declare a move between 2 disks (or a step) has failed.
{quote}
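
In other words, the change amounts to something like the following sketch 
(based on the loop quoted in the issue below; not necessarily the exact 
patch):
{code:java}
// "<=" lets the copy start even when getMaxError(item) is 0; the move is
// abandoned only once the error count actually exceeds the tolerance.
while (!iter.atEnd() && item.getErrorCount() <= getMaxError(item)) {
  // ... fetch and copy the next block, incrementing the error count on
  // IOException, exactly as in the existing loop body ...
}
if (item.getErrorCount() > getMaxError(item)) {
  item.setErrMsg("Error count exceeded.");
}
{code}
So with max errors = n, up to n errors are tolerated and the (n+1)-th abandons 
the move, while n = 0 still allows the copy to start.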

> Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15438.000.patch, HDFS-15438.001.patch
>
>
> In the HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is 
> abandoned.
> The parameter accepts any value >= 0, and setting it to 0 should mean no 
> error tolerance. However, setting the value to 0 simply skips the block copy 
> even when no disk error occurs, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>  DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
> try {
>   ... //get the block
> }  catch (IOException e) {
> item.incErrorCount();
> }
>if (item.getErrorCount() >= getMaxError(item)) {
> item.setErrMsg("Error count exceeded.");
> LOG.info("Maximum error count exceeded. Error count: {} Max error:{} 
> ",
> item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition to support value 0.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15438) Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15438:

Attachment: HDFS-15438.001.patch

> Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15438.000.patch, HDFS-15438.001.patch
>
>
> In the HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is 
> abandoned.
> The parameter accepts any value >= 0, and setting it to 0 should mean no 
> error tolerance. However, setting the value to 0 simply skips the block copy 
> even when no disk error occurs, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>  DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
> try {
>   ... //get the block
> }  catch (IOException e) {
> item.incErrorCount();
> }
>if (item.getErrorCount() >= getMaxError(item)) {
> item.setErrMsg("Error count exceeded.");
> LOG.info("Maximum error count exceeded. Error count: {} Max error:{} 
> ",
> item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition to support value 0.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-07-24 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164710#comment-17164710
 ] 

AMC-team commented on HDFS-15439:
-

Uploaded a patch based on [~ayushtkn]'s suggestion. Thanks!
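
For reference, a sketch of what the suggested condition change looks like in 
processNamespace() (mirroring the snippet quoted below; not necessarily the 
exact patch):
{code:java}
if (hasFailed && !hasSuccess) {
  // ">=" instead of "==": any retryMaxAttempts <= 0 now means "no retries"
  // rather than retrying forever on a negative value.
  if (retryCount.get() >= retryMaxAttempts) {
    result.setRetryFailed();
    LOG.error("Failed to move some blocks after "
        + retryMaxAttempts + " retries.");
    return result;
  } else {
    retryCount.incrementAndGet();
  }
} else {
  // Reset retry count if no failure.
  retryCount.set(0);
}
{code}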

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch, HDFS-15439.001.patch
>
>
> Configuration parameter "dfs.mover.retry.max.attempts" is to define the 
> maximum number of retries before the mover consider the move failed. There is 
> no checking code so this parameter can accept any int value.
> Theoratically, setting this value to <=0 should mean that no retry at all. 
> However, if you set the value to negative value. The checking condition for 
> retry failed will never satisfied because the if statement is "*if 
> (retryCount.get() == retryMaxAttempts)*". The retry count will always +1 by 
> retryCount.incrementAndGet() after failed but never *=* *retryMaxAttempts.* 
> {code:java}
> private Result processNamespace() throws IOException {
>   ... //wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
> if (retryCount.get() == retryMaxAttempts) {
>   result.setRetryFailed();
>   LOG.error("Failed to move some block's after "
>   + retryMaxAttempts + " retries.");
>   return result;
> } else {
>   retryCount.incrementAndGet();
> }
>   } else {
> // Reset retry count if no failure.
> retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
> non-negative values, or change the if statement to trigger once the retry 
> count reaches or exceeds the max attempts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Attachment: HDFS-15439.001.patch

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch, HDFS-15439.001.patch
>
>
> Configuration parameter "dfs.mover.retry.max.attempts" is to define the 
> maximum number of retries before the mover consider the move failed. There is 
> no checking code so this parameter can accept any int value.
> Theoratically, setting this value to <=0 should mean that no retry at all. 
> However, if you set the value to negative value. The checking condition for 
> retry failed will never satisfied because the if statement is "*if 
> (retryCount.get() == retryMaxAttempts)*". The retry count will always +1 by 
> retryCount.incrementAndGet() after failed but never *=* *retryMaxAttempts.* 
> {code:java}
> private Result processNamespace() throws IOException {
>   ... //wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
> if (retryCount.get() == retryMaxAttempts) {
>   result.setRetryFailed();
>   LOG.error("Failed to move some block's after "
>   + retryMaxAttempts + " retries.");
>   return result;
> } else {
>   retryCount.incrementAndGet();
> }
>   } else {
> // Reset retry count if no failure.
> retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
> non-negative values, or change the if statement to trigger once the retry 
> count reaches or exceeds the max attempts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-24 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164706#comment-17164706
 ] 

AMC-team commented on HDFS-15443:
-

Thanks [~ayushtkn] for the great feedback.

I refined the patch (changed maxXceiverCount to this.maxXceiverCount).

I also checked the standard output of the consistently failing test 
hadoop.fs.contract.hdfs.TestHDFSContractMultipartUploader, and I think it is 
not related to this patch:
{quote}java.lang.IllegalArgumentException: Path /test is not under 
hdfs://localhost:36213/test
 at com.google.common.base.Preconditions.checkArgument(Preconditions.java:440)
 at 
org.apache.hadoop.fs.impl.AbstractMultipartUploader.checkPath(AbstractMultipartUploader.java:73)
 at 
org.apache.hadoop.fs.impl.AbstractMultipartUploader.abortUploadsUnderPath(AbstractMultipartUploader.java:136)
 at 
org.apache.hadoop.fs.contract.AbstractContractMultipartUploaderTest.teardown(AbstractContractMultipartUploaderTest.java:107)
{quote}
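
For context, a minimal sketch of the kind of startup check this is about, 
assuming the DFSConfigKeys constants backing dfs.datanode.max.transfer.threads 
(this is not the attached patch):
{code:java}
// Read the configured limit, then reject an obviously wrong value early
// instead of letting every transfer fail later with an IOException.
this.maxXceiverCount = conf.getInt(
    DFSConfigKeys.DFS_DATANODE_MAX_RECEIVER_THREADS_KEY,
    DFSConfigKeys.DFS_DATANODE_MAX_RECEIVER_THREADS_DEFAULT);
if (this.maxXceiverCount <= 0) {
  throw new IllegalArgumentException(
      DFSConfigKeys.DFS_DATANODE_MAX_RECEIVER_THREADS_KEY
          + " must be a positive value, but was " + this.maxXceiverCount);
}
{code}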

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch, 
> HDFS-15443.002.patch, HDFS-15443.003.patch
>
>
> The configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> Many issues are caused by not setting this param to an appropriate value, 
> yet there is no check code to restrict the parameter. Although a 
> hard-and-fast rule is difficult because we need to consider the number of 
> cores, main memory, etc., *we can prevent users from accidentally setting 
> this value to an absolutely wrong value* (e.g. a negative value that totally 
> breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15443:

Attachment: HDFS-15443.003.patch

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch, 
> HDFS-15443.002.patch, HDFS-15443.003.patch
>
>
> The configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> Many issues are caused by not setting this param to an appropriate value, 
> yet there is no check code to restrict the parameter. Although a 
> hard-and-fast rule is difficult because we need to consider the number of 
> cores, main memory, etc., *we can prevent users from accidentally setting 
> this value to an absolutely wrong value* (e.g. a negative value that totally 
> breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15443:

Attachment: (was: HDFS-15443.003.patch)

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch, 
> HDFS-15443.002.patch
>
>
> The configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> Many issues are caused by not setting this param to an appropriate value, 
> yet there is no check code to restrict the parameter. Although a 
> hard-and-fast rule is difficult because we need to consider the number of 
> cores, main memory, etc., *we can prevent users from accidentally setting 
> this value to an absolutely wrong value* (e.g. a negative value that totally 
> breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15443:

Attachment: HDFS-15443.003.patch

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch, 
> HDFS-15443.002.patch, HDFS-15443.003.patch
>
>
> The configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> Many issues are caused by not setting this param to an appropriate value, 
> yet there is no check code to restrict the parameter. Although a 
> hard-and-fast rule is difficult because we need to consider the number of 
> cores, main memory, etc., *we can prevent users from accidentally setting 
> this value to an absolutely wrong value* (e.g. a negative value that totally 
> breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (HDFS-15438) Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15438:

Comment: was deleted

(was: I uploaded a new patch based on [~ayushtkn]'s suggestion.

IMHO, the patched logic may be better: previously, if we set 
"dfs.disk.balancer.max.disk.errors" to n, it could actually tolerate only n-1 
errors because of the while loop condition. Now it can tolerate n errors, 
which is more consistent with the documentation:

{quote}During a block move from a source to destination disk, we might 
encounter various errors. This defines how many errors we can tolerate before 
we declare a move between 2 disks (or a step) has failed.{quote})

> Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15438.000.patch
>
>
> In the HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is 
> abandoned.
> The parameter accepts any value >= 0, and setting it to 0 should mean no 
> error tolerance. However, setting the value to 0 simply skips the block copy 
> even when no disk error occurs, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>  DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
> try {
>   ... //get the block
> }  catch (IOException e) {
> item.incErrorCount();
> }
>if (item.getErrorCount() >= getMaxError(item)) {
> item.setErrMsg("Error count exceeded.");
> LOG.info("Maximum error count exceeded. Error count: {} Max error:{} 
> ",
> item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition to support value 0.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15443:

Comment: was deleted

(was: Thanks [~ayushtkn] for the great feedback. 
 I refined the patch (changed *maxXceiverCount* to *this.maxXceiverCount*).

 I also checked the standard output of the consistently failing test 
*hadoop.fs.contract.hdfs.TestHDFSContractMultipartUploader*, and I think it is 
not related to this patch:

{quote}(AbstractContractMultipartUploaderTest.java:teardown(110)) - Exeception 
in teardown
java.lang.IllegalArgumentException: Path /test is not under 
hdfs://localhost:36213/test
at 
com.google.common.base.Preconditions.checkArgument(Preconditions.java:440)
at 
org.apache.hadoop.fs.impl.AbstractMultipartUploader.checkPath(AbstractMultipartUploader.java:73)
at 
org.apache.hadoop.fs.impl.AbstractMultipartUploader.abortUploadsUnderPath(AbstractMultipartUploader.java:136)
at 
org.apache.hadoop.fs.contract.AbstractContractMultipartUploaderTest.teardown(AbstractContractMultipartUploaderTest.java:107)
{quote})

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch, 
> HDFS-15443.002.patch
>
>
> The configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> Many issues are caused by not setting this param to an appropriate value, 
> yet there is no check code to restrict the parameter. Although a 
> hard-and-fast rule is difficult because we need to consider the number of 
> cores, main memory, etc., *we can prevent users from accidentally setting 
> this value to an absolutely wrong value* (e.g. a negative value that totally 
> breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Comment: was deleted

(was: Uploaded a new patch based on [~ayushtkn]'s feedback)

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch
>
>
> Configuration parameter "dfs.mover.retry.max.attempts" is to define the 
> maximum number of retries before the mover consider the move failed. There is 
> no checking code so this parameter can accept any int value.
> Theoratically, setting this value to <=0 should mean that no retry at all. 
> However, if you set the value to negative value. The checking condition for 
> retry failed will never satisfied because the if statement is "*if 
> (retryCount.get() == retryMaxAttempts)*". The retry count will always +1 by 
> retryCount.incrementAndGet() after failed but never *=* *retryMaxAttempts.* 
> {code:java}
> private Result processNamespace() throws IOException {
>   ... //wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
> if (retryCount.get() == retryMaxAttempts) {
>   result.setRetryFailed();
>   LOG.error("Failed to move some block's after "
>   + retryMaxAttempts + " retries.");
>   return result;
> } else {
>   retryCount.incrementAndGet();
> }
>   } else {
> // Reset retry count if no failure.
> retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
> non-negative values, or change the if statement to trigger once the retry 
> count reaches or exceeds the max attempts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15438) Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15438:

Attachment: (was: HDFS-15438.001.patch)

> Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15438.000.patch
>
>
> In the HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is 
> abandoned.
> The parameter accepts any value >= 0, and setting it to 0 should mean no 
> error tolerance. However, setting the value to 0 simply skips the block copy 
> even when no disk error occurs, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>  DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
> try {
>   ... //get the block
> }  catch (IOException e) {
> item.incErrorCount();
> }
>if (item.getErrorCount() >= getMaxError(item)) {
> item.setErrMsg("Error count exceeded.");
> LOG.info("Maximum error count exceeded. Error count: {} Max error:{} 
> ",
> item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition to support value 0.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Attachment: (was: HDFS-15439.001.patch)

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch
>
>
> Configuration parameter "dfs.mover.retry.max.attempts" is to define the 
> maximum number of retries before the mover consider the move failed. There is 
> no checking code so this parameter can accept any int value.
> Theoratically, setting this value to <=0 should mean that no retry at all. 
> However, if you set the value to negative value. The checking condition for 
> retry failed will never satisfied because the if statement is "*if 
> (retryCount.get() == retryMaxAttempts)*". The retry count will always +1 by 
> retryCount.incrementAndGet() after failed but never *=* *retryMaxAttempts.* 
> {code:java}
> private Result processNamespace() throws IOException {
>   ... //wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
> if (retryCount.get() == retryMaxAttempts) {
>   result.setRetryFailed();
>   LOG.error("Failed to move some block's after "
>   + retryMaxAttempts + " retries.");
>   return result;
> } else {
>   retryCount.incrementAndGet();
> }
>   } else {
> // Reset retry count if no failure.
> retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
> non-negative values, or change the if statement to trigger once the retry 
> count reaches or exceeds the max attempts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15443:

Attachment: (was: HDFS-15443.003.patch)

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch, 
> HDFS-15443.002.patch
>
>
> The configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> Many issues are caused by not setting this param to an appropriate value, 
> yet there is no check code to restrict the parameter. Although a 
> hard-and-fast rule is difficult because we need to consider the number of 
> cores, main memory, etc., *we can prevent users from accidentally setting 
> this value to an absolutely wrong value* (e.g. a negative value that totally 
> breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.
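
As one possible shape of such check code, a minimal sketch that validates the 
already-parsed value (an illustration only, not a committed patch; later 
messages in this thread discuss falling back to the default instead of 
failing fast):

{code:java}
// Sketch: reject non-positive values outright with a clear message.
if (maxXceiverCount <= 0) {
  throw new IllegalArgumentException(
      "dfs.datanode.max.transfer.threads must be positive, but was "
      + maxXceiverCount);
}
{code}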



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15438) Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-24 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164690#comment-17164690
 ] 

AMC-team commented on HDFS-15438:
-

I uploaded a new patch based on [~ayushtkn]'s suggestion.

IMHO, the current logic may be better because previously, if we set 
"dfs.disk.balancer.max.disk.errors" to n, it could actually tolerate only n-1 
errors because of the while loop condition. Now it can tolerate n errors, which 
is more consistent with the documentation:

{quote}During a block move from a source to destination disk, we might 
encounter various errors. This defines how many errors we can tolerate before 
we declare a move between 2 disks (or a step) has failed.{quote}
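
For readers skimming the thread, the shape of that change is roughly the 
following sketch (an illustration of the idea, not the attached 001 patch 
verbatim):

{code:java}
// With <= the loop still runs while the error count has not exceeded the
// limit, so max.disk.errors = 0 still attempts the copy and a setting of n
// tolerates n errors before the move is abandoned.
while (!iter.atEnd() && item.getErrorCount() <= getMaxError(item)) {
  ... // get the block as before
}
if (item.getErrorCount() > getMaxError(item)) {
  item.setErrMsg("Error count exceeded.");
}
{code}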

> Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15438.000.patch, HDFS-15438.001.patch
>
>
> In HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is 
> abandoned.
> The parameter can accept any value >= 0, and setting the value to 0 should 
> mean no error tolerance. However, setting the value to 0 simply skips the 
> block copy even when no disk error occurs, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>     DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
>     try {
>       ... // get the block
>     } catch (IOException e) {
>       item.incErrorCount();
>     }
>   }
>   if (item.getErrorCount() >= getMaxError(item)) {
>     item.setErrMsg("Error count exceeded.");
>     LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
>         item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition to support the value 0.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15438) Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15438:

Attachment: HDFS-15438.001.patch

> Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15438.000.patch, HDFS-15438.001.patch
>
>
> In HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is 
> abandoned.
> The parameter can accept any value >= 0, and setting the value to 0 should 
> mean no error tolerance. However, setting the value to 0 simply skips the 
> block copy even when no disk error occurs, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>     DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
>     try {
>       ... // get the block
>     } catch (IOException e) {
>       item.incErrorCount();
>     }
>   }
>   if (item.getErrorCount() >= getMaxError(item)) {
>     item.setErrMsg("Error count exceeded.");
>     LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
>         item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition to support the value 0.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-07-24 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164684#comment-17164684
 ] 

AMC-team commented on HDFS-15439:
-

Uploaded a new patch based on [~ayushtkn]'s feedback.

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch, HDFS-15439.001.patch
>
>
> Configuration parameter "dfs.mover.retry.max.attempts" is to define the 
> maximum number of retries before the mover consider the move failed. There is 
> no checking code so this parameter can accept any int value.
> Theoratically, setting this value to <=0 should mean that no retry at all. 
> However, if you set the value to negative value. The checking condition for 
> retry failed will never satisfied because the if statement is "*if 
> (retryCount.get() == retryMaxAttempts)*". The retry count will always +1 by 
> retryCount.incrementAndGet() after failed but never *=* *retryMaxAttempts.* 
> {code:java}
> private Result processNamespace() throws IOException {
>   ... //wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
> if (retryCount.get() == retryMaxAttempts) {
>   result.setRetryFailed();
>   LOG.error("Failed to move some block's after "
>   + retryMaxAttempts + " retries.");
>   return result;
> } else {
>   retryCount.incrementAndGet();
> }
>   } else {
> // Reset retry count if no failure.
> retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
> non-negative values, or change the if statement condition to fire when the 
> retry count equals or exceeds max attempts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Attachment: HDFS-15439.001.patch

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch, HDFS-15439.001.patch
>
>
> Configuration parameter "dfs.mover.retry.max.attempts" is to define the 
> maximum number of retries before the mover consider the move failed. There is 
> no checking code so this parameter can accept any int value.
> Theoratically, setting this value to <=0 should mean that no retry at all. 
> However, if you set the value to negative value. The checking condition for 
> retry failed will never satisfied because the if statement is "*if 
> (retryCount.get() == retryMaxAttempts)*". The retry count will always +1 by 
> retryCount.incrementAndGet() after failed but never *=* *retryMaxAttempts.* 
> {code:java}
> private Result processNamespace() throws IOException {
>   ... //wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
> if (retryCount.get() == retryMaxAttempts) {
>   result.setRetryFailed();
>   LOG.error("Failed to move some block's after "
>   + retryMaxAttempts + " retries.");
>   return result;
> } else {
>   retryCount.incrementAndGet();
> }
>   } else {
> // Reset retry count if no failure.
> retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
> non-negative values, or change the if statement condition to fire when the 
> retry count equals or exceeds max attempts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-24 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164677#comment-17164677
 ] 

AMC-team edited comment on HDFS-15443 at 7/24/20, 11:34 PM:


Thanks [~ayushtkn] for the great feedback. 
 I refined the patch (changed *maxXceiverCount* to *this.maxXceiverCount*).

 I also checked the standard output of the consistently failing test 
*hadoop.fs.contract.hdfs.TestHDFSContractMultipartUploader*, and I think it is 
not related to this patch.

{quote}(AbstractContractMultipartUploaderTest.java:teardown(110)) - Exeception 
in teardown
java.lang.IllegalArgumentException: Path /test is not under 
hdfs://localhost:36213/test
at 
com.google.common.base.Preconditions.checkArgument(Preconditions.java:440)
at 
org.apache.hadoop.fs.impl.AbstractMultipartUploader.checkPath(AbstractMultipartUploader.java:73)
at 
org.apache.hadoop.fs.impl.AbstractMultipartUploader.abortUploadsUnderPath(AbstractMultipartUploader.java:136)
at 
org.apache.hadoop.fs.contract.AbstractContractMultipartUploaderTest.teardown(AbstractContractMultipartUploaderTest.java:107)
{quote}


was (Author: amc-team):
Thanks [~ayushtkn] for the great feedback. 
 I refined the patch (changed *maxXceiverCount* to *this.maxXceiverCount*).
 I also checked the standard output of the consistently failing test 
*hadoop.fs.contract.hdfs.TestHDFSContractMultipartUploader*, and I think it is 
not related to this patch.

{quote}(AbstractContractMultipartUploaderTest.java:teardown(110)) - Exeception 
in teardown
java.lang.IllegalArgumentException: Path /test is not under 
hdfs://localhost:36213/test
at 
com.google.common.base.Preconditions.checkArgument(Preconditions.java:440)
at 
org.apache.hadoop.fs.impl.AbstractMultipartUploader.checkPath(AbstractMultipartUploader.java:73)
at 
org.apache.hadoop.fs.impl.AbstractMultipartUploader.abortUploadsUnderPath(AbstractMultipartUploader.java:136)
at 
org.apache.hadoop.fs.contract.AbstractContractMultipartUploaderTest.teardown(AbstractContractMultipartUploaderTest.java:107)
{quote}

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch, 
> HDFS-15443.002.patch, HDFS-15443.003.patch
>
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> There are many issues caused by not setting this param to an appropriate 
> value. However, there is no check code to restrict the parameter. Although 
> having a hard-and-fast rule is difficult because we need to consider the 
> number of cores, main memory, etc., *we can prevent users from setting this 
> value to an absolutely wrong value by accident* (e.g. a negative value that 
> totally breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-24 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164677#comment-17164677
 ] 

AMC-team commented on HDFS-15443:
-

Thanks [~ayushtkn] for the great feedback. 
 I refined the patch (changed *maxXceiverCount* to *this.maxXceiverCount*).
 I also checked the standard output of the consistently failing test 
*hadoop.fs.contract.hdfs.TestHDFSContractMultipartUploader*, and I think it is 
not related to this patch.

{quote}(AbstractContractMultipartUploaderTest.java:teardown(110)) - Exeception 
in teardown
java.lang.IllegalArgumentException: Path /test is not under 
hdfs://localhost:36213/test
at 
com.google.common.base.Preconditions.checkArgument(Preconditions.java:440)
at 
org.apache.hadoop.fs.impl.AbstractMultipartUploader.checkPath(AbstractMultipartUploader.java:73)
at 
org.apache.hadoop.fs.impl.AbstractMultipartUploader.abortUploadsUnderPath(AbstractMultipartUploader.java:136)
at 
org.apache.hadoop.fs.contract.AbstractContractMultipartUploaderTest.teardown(AbstractContractMultipartUploaderTest.java:107)
{quote}

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch, 
> HDFS-15443.002.patch, HDFS-15443.003.patch
>
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> There are many issues caused by not setting this param to an appropriate 
> value. However, there is no check code to restrict the parameter. Although 
> having a hard-and-fast rule is difficult because we need to consider the 
> number of cores, main memory, etc., *we can prevent users from setting this 
> value to an absolutely wrong value by accident* (e.g. a negative value that 
> totally breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15443:

Attachment: HDFS-15443.003.patch

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch, 
> HDFS-15443.002.patch, HDFS-15443.003.patch
>
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> There are many issues caused by not setting this param to an appropriate 
> value. However, there is no check code to restrict the parameter. Although 
> having a hard-and-fast rule is difficult because we need to consider the 
> number of cores, main memory, etc., *we can prevent users from setting this 
> value to an absolutely wrong value by accident* (e.g. a negative value that 
> totally breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15443:

Attachment: (was: HDFS-15443.003.patch)

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch, 
> HDFS-15443.002.patch
>
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> There are many issues caused by not setting this param to an appropriate 
> value. However, there is no check code to restrict the parameter. Although 
> having a hard-and-fast rule is difficult because we need to consider the 
> number of cores, main memory, etc., *we can prevent users from setting this 
> value to an absolutely wrong value by accident* (e.g. a negative value that 
> totally breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-24 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15443:

Attachment: HDFS-15443.003.patch

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch, 
> HDFS-15443.002.patch, HDFS-15443.003.patch
>
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> There are many issues caused by not setting this param to an appropriate 
> value. However, there is no check code to restrict the parameter. Although 
> having a hard-and-fast rule is difficult because we need to consider the 
> number of cores, main memory, etc., *we can prevent users from setting this 
> value to an absolutely wrong value by accident* (e.g. a negative value that 
> totally breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-24 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163966#comment-17163966
 ] 

AMC-team edited comment on HDFS-15443 at 7/24/20, 8:40 AM:
---

Sure, thanks for the reminder!

Before that, I'm wondering whether we can fall back to the parameter's 
default value (4096) and emit a log message, just like what [~ayushtkn] 
suggested in [HDFS-15439|https://issues.apache.org/jira/browse/HDFS-15439]. 
Since this is a sanity check, falling back to the default value would be a 
safe and conservative choice.

Do you have any suggestions? [~elgoiri] [~ayushtkn] [~jianghuazhu]
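
A minimal sketch of that fallback, assuming it happens where the datanode 
parses the config (the DFSConfigKeys constant names are the existing ones for 
this key; the exact placement here is illustrative, not the patch itself):

{code:java}
// Warn and fall back to the default (4096) instead of letting a
// non-positive value through.
int maxXceiverCount = conf.getInt(
    DFSConfigKeys.DFS_DATANODE_MAX_RECEIVER_THREADS_KEY,
    DFSConfigKeys.DFS_DATANODE_MAX_RECEIVER_THREADS_DEFAULT);
if (maxXceiverCount <= 0) {
  LOG.warn("Invalid value {} for {}; falling back to default {}",
      maxXceiverCount,
      DFSConfigKeys.DFS_DATANODE_MAX_RECEIVER_THREADS_KEY,
      DFSConfigKeys.DFS_DATANODE_MAX_RECEIVER_THREADS_DEFAULT);
  maxXceiverCount = DFSConfigKeys.DFS_DATANODE_MAX_RECEIVER_THREADS_DEFAULT;
}
{code}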


was (Author: amc-team):
Sure.

But before that, I'm wondering whether we can fall back to the parameter's 
default value (4096) and emit a log message, just like what [~ayushtkn] 
suggested in [HDFS-15439|https://issues.apache.org/jira/browse/HDFS-15439]. 
Since this is a sanity check, falling back to the default value would be a 
safe and conservative choice.

What do you think? [~elgoiri] [~ayushtkn] [~jianghuazhu]

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch, 
> HDFS-15443.002.patch
>
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> There are many issues caused by not setting this param to an appropriate 
> value. However, there is no check code to restrict the parameter. Although 
> having a hard-and-fast rule is difficult because we need to consider the 
> number of cores, main memory, etc., *we can prevent users from setting this 
> value to an absolutely wrong value by accident* (e.g. a negative value that 
> totally breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-07-24 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163974#comment-17163974
 ] 

AMC-team edited comment on HDFS-15439 at 7/24/20, 8:39 AM:
---

Thanks [~ayushtkn] for the great suggestion! That's actually what I wanted to 
do initially: correct the parameter value at the beginning and not let an 
invalid value go through the program. 

I will try to upload a patch soon.
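
For illustration, such an up-front correction might look like the sketch 
below (assuming the usual DFSConfigKeys constants for this key; this is not 
the attached patch verbatim):

{code:java}
// Clamp a negative dfs.mover.retry.max.attempts to 0 (no retries) when
// the config is parsed, so the bad value never reaches the retry loop.
int retryMaxAttempts = conf.getInt(
    DFSConfigKeys.DFS_MOVER_RETRY_MAX_ATTEMPTS_KEY,
    DFSConfigKeys.DFS_MOVER_RETRY_MAX_ATTEMPTS_DEFAULT);
if (retryMaxAttempts < 0) {
  LOG.warn("{} is set to {}; using 0 (no retries) instead",
      DFSConfigKeys.DFS_MOVER_RETRY_MAX_ATTEMPTS_KEY, retryMaxAttempts);
  retryMaxAttempts = 0;
}
{code}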


was (Author: amc-team):
Thanks [~ayushtkn] for the great suggestion! That's actually what I wanted to 
do initially: correct the parameter value at the beginning and not let an 
invalid value go through the program. 

 

I will try to upload a patch soon.

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch
>
>
> Configuration parameter "dfs.mover.retry.max.attempts" is to define the 
> maximum number of retries before the mover consider the move failed. There is 
> no checking code so this parameter can accept any int value.
> Theoratically, setting this value to <=0 should mean that no retry at all. 
> However, if you set the value to negative value. The checking condition for 
> retry failed will never satisfied because the if statement is "*if 
> (retryCount.get() == retryMaxAttempts)*". The retry count will always +1 by 
> retryCount.incrementAndGet() after failed but never *=* *retryMaxAttempts.* 
> {code:java}
> private Result processNamespace() throws IOException {
>   ... //wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
> if (retryCount.get() == retryMaxAttempts) {
>   result.setRetryFailed();
>   LOG.error("Failed to move some block's after "
>   + retryMaxAttempts + " retries.");
>   return result;
> } else {
>   retryCount.incrementAndGet();
> }
>   } else {
> // Reset retry count if no failure.
> retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
> non-negative values, or change the if statement condition to fire when the 
> retry count equals or exceeds max attempts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-07-23 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163974#comment-17163974
 ] 

AMC-team commented on HDFS-15439:
-

Thanks [~ayushtkn] for the great suggestion! That's actually what I wanted to 
do initially: correct the parameter value at the beginning and not let an 
invalid value go through the program. 

 

I will try to upload a patch soon.

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch
>
>
> Configuration parameter "dfs.mover.retry.max.attempts" is to define the 
> maximum number of retries before the mover consider the move failed. There is 
> no checking code so this parameter can accept any int value.
> Theoratically, setting this value to <=0 should mean that no retry at all. 
> However, if you set the value to negative value. The checking condition for 
> retry failed will never satisfied because the if statement is "*if 
> (retryCount.get() == retryMaxAttempts)*". The retry count will always +1 by 
> retryCount.incrementAndGet() after failed but never *=* *retryMaxAttempts.* 
> {code:java}
> private Result processNamespace() throws IOException {
>   ... //wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
> if (retryCount.get() == retryMaxAttempts) {
>   result.setRetryFailed();
>   LOG.error("Failed to move some block's after "
>   + retryMaxAttempts + " retries.");
>   return result;
> } else {
>   retryCount.incrementAndGet();
> }
>   } else {
> // Reset retry count if no failure.
> retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
> non-negative values, or change the if statement condition to fire when the 
> retry count equals or exceeds max attempts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-23 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163966#comment-17163966
 ] 

AMC-team commented on HDFS-15443:
-

Sure.

But before that, I'm wondering whether we can fall back to the parameter's 
default value (4096) and emit a log message, just like what [~ayushtkn] 
suggested in [HDFS-15439|https://issues.apache.org/jira/browse/HDFS-15439]. 
Since this is a sanity check, falling back to the default value would be a 
safe and conservative choice.

What do you think? [~elgoiri] [~ayushtkn] [~jianghuazhu]

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch, 
> HDFS-15443.002.patch
>
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> There are many issues caused by not setting this param to an appropriate 
> value. However, there is no check code to restrict the parameter. Although 
> having a hard-and-fast rule is difficult because we need to consider the 
> number of cores, main memory, etc., *we can prevent users from setting this 
> value to an absolutely wrong value by accident* (e.g. a negative value that 
> totally breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-17 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160235#comment-17160235
 ] 

AMC-team commented on HDFS-15443:
-

Uploaded the .002 patch to fix the whitespace problem.

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch, 
> HDFS-15443.002.patch
>
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> There are many issues caused by not setting this param to an appropriate 
> value. However, there is no check code to restrict the parameter. Although 
> having a hard-and-fast rule is difficult because we need to consider the 
> number of cores, main memory, etc., *we can prevent users from setting this 
> value to an absolutely wrong value by accident* (e.g. a negative value that 
> totally breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-17 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15443:

Attachment: HDFS-15443.002.patch

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch, 
> HDFS-15443.002.patch
>
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> There are many issues caused by not setting this param to an appropriate 
> value. However, there is no check code to restrict the parameter. Although 
> having a hard-and-fast rule is difficult because we need to consider the 
> number of cores, main memory, etc., *we can prevent users from setting this 
> value to an absolutely wrong value by accident* (e.g. a negative value that 
> totally breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-16 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159657#comment-17159657
 ] 

AMC-team commented on HDFS-15443:
-

[~elgoiri]

Sorry about my late feedback. I have uploaded a new patch. Thanks for reviewing 
it.

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch
>
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> There are many issues caused by not setting this param to an appropriate 
> value. However, there is no check code to restrict the parameter. Although 
> having a hard-and-fast rule is difficult because we need to consider the 
> number of cores, main memory, etc., *we can prevent users from setting this 
> value to an absolutely wrong value by accident* (e.g. a negative value that 
> totally breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-16 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15443:

Attachment: HDFS-15443.001.patch
Status: Patch Available  (was: Open)

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch, HDFS-15443.001.patch
>
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> There are many issues caused by not setting this param to an appropriate 
> value. However, there is no check code to restrict the parameter. Although 
> having a hard-and-fast rule is difficult because we need to consider the 
> number of cores, main memory, etc., *we can prevent users from setting this 
> value to an absolutely wrong value by accident* (e.g. a negative value that 
> totally breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-07-16 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Attachment: HDFS-15439.000.patch
Status: Patch Available  (was: Open)

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch
>
>
> Configuration parameter "dfs.mover.retry.max.attempts" is to define the 
> maximum number of retries before the mover consider the move failed. There is 
> no checking code so this parameter can accept any int value.
> Theoratically, setting this value to <=0 should mean that no retry at all. 
> However, if you set the value to negative value. The checking condition for 
> retry failed will never satisfied because the if statement is "*if 
> (retryCount.get() == retryMaxAttempts)*". The retry count will always +1 by 
> retryCount.incrementAndGet() after failed but never *=* *retryMaxAttempts.* 
> {code:java}
> private Result processNamespace() throws IOException {
>   ... //wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
> if (retryCount.get() == retryMaxAttempts) {
>   result.setRetryFailed();
>   LOG.error("Failed to move some block's after "
>   + retryMaxAttempts + " retries.");
>   return result;
> } else {
>   retryCount.incrementAndGet();
> }
>   } else {
> // Reset retry count if no failure.
> retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
> non-negative values, or change the if statement condition to fire when the 
> retry count equals or exceeds max attempts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-07-16 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Attachment: (was: HDFS-15439.000.patch)

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch
>
>
> Configuration parameter "dfs.mover.retry.max.attempts" is to define the 
> maximum number of retries before the mover consider the move failed. There is 
> no checking code so this parameter can accept any int value.
> Theoratically, setting this value to <=0 should mean that no retry at all. 
> However, if you set the value to negative value. The checking condition for 
> retry failed will never satisfied because the if statement is "*if 
> (retryCount.get() == retryMaxAttempts)*". The retry count will always +1 by 
> retryCount.incrementAndGet() after failed but never *=* *retryMaxAttempts.* 
> {code:java}
> private Result processNamespace() throws IOException {
>   ... //wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
> if (retryCount.get() == retryMaxAttempts) {
>   result.setRetryFailed();
>   LOG.error("Failed to move some block's after "
>   + retryMaxAttempts + " retries.");
>   return result;
> } else {
>   retryCount.incrementAndGet();
> }
>   } else {
> // Reset retry count if no failure.
> retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
> non-negative values, or change the if statement condition to fire when the 
> retry count equals or exceeds max attempts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15438) Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-16 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15438:

Attachment: (was: HDFS-15438.000.patch)

> Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15438.000.patch
>
>
> In HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is 
> abandoned.
> The parameter can accept any value >= 0, and setting the value to 0 should 
> mean no error tolerance. However, setting the value to 0 simply skips the 
> block copy even when no disk error occurs, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>     DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
>     try {
>       ... // get the block
>     } catch (IOException e) {
>       item.incErrorCount();
>     }
>   }
>   if (item.getErrorCount() >= getMaxError(item)) {
>     item.setErrMsg("Error count exceeded.");
>     LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
>         item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition to support the value 0.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15438) Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-16 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15438:

Attachment: HDFS-15438.000.patch
Status: Patch Available  (was: Open)

> Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15438.000.patch
>
>
> In HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is 
> abandoned.
> The parameter can accept any value >= 0, and setting the value to 0 should 
> mean no error tolerance. However, setting the value to 0 simply skips the 
> block copy even when no disk error occurs, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>     DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
>     try {
>       ... // get the block
>     } catch (IOException e) {
>       item.incErrorCount();
>     }
>   }
>   if (item.getErrorCount() >= getMaxError(item)) {
>     item.setErrMsg("Error count exceeded.");
>     LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
>         item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition to support the value 0.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15438) Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-05 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15438:

Attachment: HDFS-15438.000.patch

> Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15438.000.patch
>
>
> In HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is 
> abandoned.
> The parameter can accept any value >= 0, and setting the value to 0 should 
> mean no error tolerance. However, setting the value to 0 simply skips the 
> block copy even when no disk error occurs, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>     DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
>     try {
>       ... // get the block
>     } catch (IOException e) {
>       item.incErrorCount();
>     }
>   }
>   if (item.getErrorCount() >= getMaxError(item)) {
>     item.setErrMsg("Error count exceeded.");
>     LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
>         item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition to support the value 0.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15438) Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-05 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15438:

Attachment: (was: HDFS-15438.000.patch)

> Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
>
> In HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is 
> abandoned.
> The parameter can accept any value >= 0, and setting the value to 0 should 
> mean no error tolerance. However, setting the value to 0 simply skips the 
> block copy even when no disk error occurs, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>     DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
>     try {
>       ... // get the block
>     } catch (IOException e) {
>       item.incErrorCount();
>     }
>   }
>   if (item.getErrorCount() >= getMaxError(item)) {
>     item.setErrMsg("Error count exceeded.");
>     LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
>         item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition to support the value 0.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15438) Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-05 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15438:

Attachment: (was: HDFS-15438.000.patch)

> Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15438.000.patch
>
>
> In HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is 
> abandoned.
> The parameter can accept any value >= 0, and setting the value to 0 should 
> mean no error tolerance. However, setting the value to 0 simply skips the 
> block copy even when no disk error occurs, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>     DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
>     try {
>       ... // get the block
>     } catch (IOException e) {
>       item.incErrorCount();
>     }
>   }
>   if (item.getErrorCount() >= getMaxError(item)) {
>     item.setErrMsg("Error count exceeded.");
>     LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
>         item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition to support the value 0.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15438) Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-05 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15438:

Attachment: HDFS-15438.000.patch

> Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15438.000.patch
>
>
> In HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is 
> abandoned.
> The parameter can accept any value >= 0, and setting the value to 0 should 
> mean no error tolerance. However, setting the value to 0 simply skips the 
> block copy even when no disk error occurs, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>     DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
>     try {
>       ... // get the block
>     } catch (IOException e) {
>       item.incErrorCount();
>     }
>   }
>   if (item.getErrorCount() >= getMaxError(item)) {
>     item.setErrMsg("Error count exceeded.");
>     LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
>         item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition to support the value 0.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15438) Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-04 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151394#comment-17151394
 ] 

AMC-team commented on HDFS-15438:
-

I added a patch that changes the while loop condition to handle max disk 
errors = 0.

> Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer  mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15438.000.patch
>
>
> In HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is abandoned.
> The parameter accepts any value >= 0, and setting it to 0 should mean no 
> error tolerance. However, setting the value to 0 simply skips the block copy 
> even when no disk error occurs, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>  DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
> try {
>   ... //get the block
> }  catch (IOException e) {
> item.incErrorCount();
> }
>if (item.getErrorCount() >= getMaxError(item)) {
> item.setErrMsg("Error count exceeded.");
> LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
> item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition so that the value 0 is handled correctly.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-07-04 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151393#comment-17151393
 ] 

AMC-team commented on HDFS-15439:
-

I added a patch that changes the if condition, thus handling negative values 
of dfs.mover.retry.max.attempts.

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer  mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch
>
>
> Configuration parameter "dfs.mover.retry.max.attempts" is to define the 
> maximum number of retries before the mover consider the move failed. There is 
> no checking code so this parameter can accept any int value.
> Theoratically, setting this value to <=0 should mean that no retry at all. 
> However, if you set the value to negative value. The checking condition for 
> retry failed will never satisfied because the if statement is "*if 
> (retryCount.get() == retryMaxAttempts)*". The retry count will always +1 by 
> retryCount.incrementAndGet() after failed but never *=* *retryMaxAttempts.* 
> {code:java}
> private Result processNamespace() throws IOException {
>   ... //wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
> if (retryCount.get() == retryMaxAttempts) {
>   result.setRetryFailed();
>   LOG.error("Failed to move some block's after "
>   + retryMaxAttempts + " retries.");
>   return result;
> } else {
>   retryCount.incrementAndGet();
> }
>   } else {
> // Reset retry count if no failure.
> retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
> non-negative values, or change the if statement condition so that it triggers 
> once the retry count reaches the maximum attempts.
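
As an illustration of the second option, a minimal sketch (hypothetical; it 
may not match the attached HDFS-15439.000.patch): replacing the exact-equality 
test with '>=' makes any retryMaxAttempts <= 0 behave as "fail immediately, 
no retry".

{code:java}
// Hypothetical sketch: '>=' instead of '==', so a non-positive
// retryMaxAttempts is caught on the first failed pass rather than the
// counter incrementing past it forever.
if (hasFailed && !hasSuccess) {
  if (retryCount.get() >= retryMaxAttempts) {
    result.setRetryFailed();
    LOG.error("Failed to move some blocks after "
        + retryMaxAttempts + " retries.");
    return result;
  } else {
    retryCount.incrementAndGet();
  }
} else {
  // Reset retry count if no failure.
  retryCount.set(0);
}
{code}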



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-07-04 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Attachment: HDFS-15439.000.patch

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer  mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15439.000.patch
>
>
> Configuration parameter "dfs.mover.retry.max.attempts" defines the maximum 
> number of retries before the mover considers the move failed. There is no 
> checking code, so this parameter can accept any int value.
> Theoretically, setting this value to <= 0 should mean no retry at all. 
> However, if you set it to a negative value, the retry-failed condition is 
> never satisfied, because the if statement is "*if 
> (retryCount.get() == retryMaxAttempts)*". The retry count is always 
> incremented by retryCount.incrementAndGet() after a failure but never 
> *equals* *retryMaxAttempts.* 
> {code:java}
> private Result processNamespace() throws IOException {
>   ... //wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
> if (retryCount.get() == retryMaxAttempts) {
>   result.setRetryFailed();
>   LOG.error("Failed to move some block's after "
>   + retryMaxAttempts + " retries.");
>   return result;
> } else {
>   retryCount.incrementAndGet();
> }
>   } else {
> // Reset retry count if no failure.
> retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
> non-negative values, or change the if statement condition so that it triggers 
> once the retry count reaches the maximum attempts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-07-04 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Description: 
Configuration parameter "dfs.mover.retry.max.attempts" is to define the maximum 
number of retries before the mover consider the move failed. There is no 
checking code so this parameter can accept any int value.

Theoratically, setting this value to <=0 should mean that no retry at all. 
However, if you set the value to negative value. The checking condition for 
retry failed will never satisfied because the if statement is "*if 
(retryCount.get() == retryMaxAttempts)*". The retry count will always +1 by 
retryCount.incrementAndGet() after failed but never *=* *retryMaxAttempts.* 
{code:java}
private Result processNamespace() throws IOException {
  ... //wait for pending move to finish and retry the failed migration
  if (hasFailed && !hasSuccess) {
if (retryCount.get() == retryMaxAttempts) {
  result.setRetryFailed();
  LOG.error("Failed to move some block's after "
  + retryMaxAttempts + " retries.");
  return result;
} else {
  retryCount.incrementAndGet();
}
  } else {
// Reset retry count if no failure.
retryCount.set(0);
  }
  ...
}
{code}
*How to fix*

Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
non-negative values, or change the if statement condition so that it triggers 
once the retry count reaches the maximum attempts.

  was:
Configuration parameter "dfs.mover.retry.max.attempts" is to define the maximum 
number of retries before the mover consider the move failed. There is no 
checking code so this parameter can accept any int value.

Theoratically, setting this value to <=0 should mean that no retry at all. 
However, if you set the value to negative value. The checking condition for 
retry failed will never satisfied because the if statement is "*if 
(retryCount.get() == retryMaxAttempts)*". The retry count will always +1 by 
retryCount.incrementAndGet() after failed but never *=* *retryMaxAttempts.* 
{code:java}
private Result processNamespace() throws IOException {
  ... //wait for pending move to finish and retry the failed migration
  if (hasFailed && !hasSuccess) {
if (retryCount.get() == retryMaxAttempts) { // retryMaxAttempts < 0
  result.setRetryFailed();
  LOG.error("Failed to move some block's after "
  + retryMaxAttempts + " retries.");
  return result;
} else {
  retryCount.incrementAndGet();
}
  } else {
// Reset retry count if no failure.
retryCount.set(0);
  }
  ...
}
{code}
*How to fix*

Add checking code of "dfs.mover.retry.max.attempts" to accept only non-negative 
value or change the if statement condition when retry count exceeds max 
attempts.


> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer  mover
>Reporter: AMC-team
>Priority: Major
>
> Configuration parameter "dfs.mover.retry.max.attempts" defines the maximum 
> number of retries before the mover considers the move failed. There is no 
> checking code, so this parameter can accept any int value.
> Theoretically, setting this value to <= 0 should mean no retry at all. 
> However, if you set it to a negative value, the retry-failed condition is 
> never satisfied, because the if statement is "*if 
> (retryCount.get() == retryMaxAttempts)*". The retry count is always 
> incremented by retryCount.incrementAndGet() after a failure but never 
> *equals* *retryMaxAttempts.* 
> {code:java}
> private Result processNamespace() throws IOException {
>   ... //wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
> if (retryCount.get() == retryMaxAttempts) {
>   result.setRetryFailed();
>   LOG.error("Failed to move some block's after "
>   + retryMaxAttempts + " retries.");
>   return result;
> } else {
>   retryCount.incrementAndGet();
> }
>   } else {
> // Reset retry count if no failure.
> retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
> non-negative values, or change the if statement condition so that it triggers 
> once the retry count reaches the maximum attempts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, 

[jira] [Updated] (HDFS-15438) Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-04 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15438:

Attachment: HDFS-15438.000.patch

> Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer  mover
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15438.000.patch
>
>
> In HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is abandoned.
> The parameter accepts any value >= 0, and setting it to 0 should mean no 
> error tolerance. However, setting the value to 0 simply skips the block copy 
> even when no disk error occurs, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>  DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
> try {
>   ... //get the block
> }  catch (IOException e) {
> item.incErrorCount();
> }
>if (item.getErrorCount() >= getMaxError(item)) {
> item.setErrMsg("Error count exceeded.");
> LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
> item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition so that the value 0 is handled correctly.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15438) Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-04 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15438:

Summary: Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block 
copy  (was: dfs.disk.balancer.max.disk.errors = 0 will fail the block copy)

> Setting dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer  mover
>Reporter: AMC-team
>Priority: Major
>
> In HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is abandoned.
> The parameter accepts any value >= 0, and setting it to 0 should mean no 
> error tolerance. However, setting the value to 0 simply skips the block copy 
> even when no disk error occurs, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>  DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
> try {
>   ... //get the block
> }  catch (IOException e) {
> item.incErrorCount();
> }
>if (item.getErrorCount() >= getMaxError(item)) {
> item.setErrMsg("Error count exceeded.");
> LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
> item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition so that the value 0 is handled correctly.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-04 Thread AMC-team (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151386#comment-17151386
 ] 

AMC-team commented on HDFS-15443:
-

[~jianghuazhu], [~elgoiri]

I added a simple patch to do the sanity check. I think it may at least help 
prevent some accidental misconfigurations. 

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch
>
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> There are many issues caused by not setting this param to an appropriate 
> value. However, there is no check code to restrict the parameter. 
> Although having a hard-and-fast rule is difficult because we need to consider 
> the number of cores, main memory, etc., *we can prevent users from 
> accidentally setting this value to an absolutely wrong value* (e.g. a 
> negative value that totally breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.
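
A minimal sketch of such a check (hypothetical; the DFSConfigKeys constant 
names are assumed and this may differ from the attached patch): reject 
non-positive values at startup instead of failing strangely later.

{code:java}
// Hypothetical sanity check at the point where the DataNode reads the value:
int maxXceiverCount = conf.getInt(
    DFSConfigKeys.DFS_DATANODE_MAX_RECEIVER_THREADS_KEY,
    DFSConfigKeys.DFS_DATANODE_MAX_RECEIVER_THREADS_DEFAULT);
if (maxXceiverCount < 1) {
  throw new HadoopIllegalArgumentException(
      "Invalid value configured for dfs.datanode.max.transfer.threads: "
          + maxXceiverCount + " (must be a positive integer)");
}
{code}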



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-04 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15443:

Attachment: HDFS-15443.000.patch

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch
>
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> There are many issues caused by not setting this param to an appropriate 
> value. However, there is no check code to restrict the parameter. 
> Although having a hard-and-fast rule is difficult because we need to consider 
> the number of cores, main memory, etc., *we can prevent users from 
> accidentally setting this value to an absolutely wrong value* (e.g. a 
> negative value that totally breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-04 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15443:

Attachment: (was: HDFS-15443.000.patch)

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> There are many issues caused by not setting this param to an appropriate 
> value. However, there is no check code to restrict the parameter. 
> Although having a hard-and-fast rule is difficult because we need to consider 
> the number of cores, main memory, etc., *we can prevent users from 
> accidentally setting this value to an absolutely wrong value* (e.g. a 
> negative value that totally breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-07-04 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15443:

Attachment: HDFS-15443.000.patch

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
> Attachments: HDFS-15443.000.patch
>
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital param that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> There are many issues caused by not setting this param to an appropriate 
> value. However, there is no check code to restrict the parameter. 
> Although having a hard-and-fast rule is difficult because we need to consider 
> the number of cores, main memory, etc., *we can prevent users from 
> accidentally setting this value to an absolutely wrong value* (e.g. a 
> negative value that totally breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever.

2020-07-03 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Summary: Setting dfs.mover.retry.max.attempts to negative value will retry 
forever.  (was: Setting dfs.mover.retry.max.attempts to negative value will 
retry forever (theoratcially).)

> Setting dfs.mover.retry.max.attempts to negative value will retry forever.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer  mover
>Reporter: AMC-team
>Priority: Major
>
> Configuration parameter "dfs.mover.retry.max.attempts" defines the maximum 
> number of retries before the mover considers the move failed. There is no 
> checking code, so this parameter can accept any int value.
> Theoretically, setting this value to <= 0 should mean no retry at all. 
> However, if you set it to a negative value, the retry-failed condition is 
> never satisfied, because the if statement is "*if 
> (retryCount.get() == retryMaxAttempts)*". The retry count is always 
> incremented by retryCount.incrementAndGet() after a failure but never 
> *equals* *retryMaxAttempts.* 
> {code:java}
> private Result processNamespace() throws IOException {
>   ... //wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
> if (retryCount.get() == retryMaxAttempts) { // retryMaxAttempts < 0
>   result.setRetryFailed();
>   LOG.error("Failed to move some block's after "
>   + retryMaxAttempts + " retries.");
>   return result;
> } else {
>   retryCount.incrementAndGet();
> }
>   } else {
> // Reset retry count if no failure.
> retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
> non-negative values, or change the if statement condition so that it triggers 
> once the retry count reaches the maximum attempts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15442) Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative value

2020-07-01 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15442:

Description: 
In the current implementation of checkpoint image transfer, if the file length 
is bigger than the configured value of dfs.image.transfer.chunksize, it will 
use chunked streaming mode to avoid internal buffering. This mode should be 
used only if more than chunkSize data is present to upload; otherwise the 
upload may sometimes not happen.
{code:java}
//TransferFsImage.java
int chunkSize = (int) conf.getLongBytes(
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
if (imageFile.length() > chunkSize) {
  // using chunked streaming mode to support upload of 2GB+ files and to
  // avoid internal buffering.
  // this mode should be used only if more than chunkSize data is present
  // to upload. otherwise upload may not happen sometimes.
  connection.setChunkedStreamingMode(chunkSize);
}
{code}
There is no check code for this parameter, so a user may accidentally set it 
to a wrong value. If the user sets chunkSize to a negative value, chunked 
streaming mode will always be used. In setChunkedStreamingMode(chunkSize) 
there is correction code: if the chunkSize is <= 0, it is changed to 
DEFAULT_CHUNK_SIZE.
{code:java}
public void setChunkedStreamingMode (int chunklen) {
if (connected) {
throw new IllegalStateException ("Can't set streaming mode: already 
connected");
}
if (fixedContentLength != -1 || fixedContentLengthLong != -1) {
throw new IllegalStateException ("Fixed length streaming mode set");
}
chunkLength = chunklen <=0? DEFAULT_CHUNK_SIZE : chunklen;
}
{code}
However,
 *if the user sets dfs.image.transfer.chunksize to a value <= 0, chunked 
streaming mode will be used even for images whose imageFile.length() < 
DEFAULT_CHUNK_SIZE, and the upload may fail as mentioned above.* *(This 
scenario may not be common, but we can prevent users from setting this param 
to an extremely small value.)*

*How to fix:*

Add checking or correction code right after parsing the config value, before 
the value is actually used (setChunkedStreamingMode). 
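
A minimal sketch of that correction (hypothetical, not necessarily the 
eventual patch): fall back to the default right after parsing, so a 
non-positive chunk size never reaches the length comparison.

{code:java}
//TransferFsImage.java (hypothetical correction)
int chunkSize = (int) conf.getLongBytes(
    DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
    DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
if (chunkSize <= 0) {
  // Correct the bad value here, before the imageFile.length() check,
  // instead of relying on setChunkedStreamingMode's late correction.
  chunkSize = (int) DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT;
}
if (imageFile.length() > chunkSize) {
  connection.setChunkedStreamingMode(chunkSize);
}
{code}

With this guard, an image smaller than the (now valid) chunk size keeps using 
fixed-length upload, which is what the surrounding comment asks for.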
  

  was:
In current implementation of checkpoint image transfer, if the file length is 
bigger than the configured value dfs.image.transfer.chunksize, it will use 
chunked streaming mode to avoid internal buffering. This mode should be used 
only if more than chunkSize data is present to upload, otherwise upload may not 
happen sometimes.
{code:java}
//TransferFsImage.java
int chunkSize = (int) conf.getLongBytes(
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
if (imageFile.length() > chunkSize) {
  // using chunked streaming mode to support upload of 2GB+ files and to
  // avoid internal buffering.
  // this mode should be used only if more than chunkSize data is present
  // to upload. otherwise upload may not happen sometimes.
  connection.setChunkedStreamingMode(chunkSize);
}
{code}
There is no check code for this parameter. User may accidentally set this value 
to a wrong value. Here, if the user set chunkSize to a negative value. Chunked 
streaming mode will always be used. In setChunkedStreamingMode(chunkSize), 
there is a correction code that if the chunkSize is <=0, it will be change to 
DEFAULT_CHUNK_SIZE.
{code:java}
public void setChunkedStreamingMode (int chunklen) {
if (connected) {
throw new IllegalStateException ("Can't set streaming mode: already 
connected");
}
if (fixedContentLength != -1 || fixedContentLengthLong != -1) {
throw new IllegalStateException ("Fixed length streaming mode set");
}
chunkLength = chunklen <=0? DEFAULT_CHUNK_SIZE : chunklen;
}
{code}
However, the correction may be too late. 
 If the user set dfs.image.transfer.chunksize to value that <= 0, even for 
images whose imageFile.length() < DEFAULT_CHUNK_SIZE will use chunked streaming 
mode and may fail the upload as mentioned above. *(This scenario may not be 
common, but* *we can prevent users setting this param to an extremely small 
value.**)*

*How to fix:*

Add checking code or correction code right after parsing the config value 
before really use the value (if statement and setChunkedStreamingMode). 
  


> Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative 
> value
> ---
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Priority: Major
>
> In the current implementation of checkpoint image transfer, if the file 
> length is bigger than the configured value of dfs.image.transfer.chunksize, 
> it will use chunked streaming 

[jira] [Updated] (HDFS-15442) Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative value

2020-07-01 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15442:

Description: 
In the current implementation of checkpoint image transfer, if the file length 
is bigger than the configured value of dfs.image.transfer.chunksize, it will 
use chunked streaming mode to avoid internal buffering. This mode should be 
used only if more than chunkSize data is present to upload; otherwise the 
upload may sometimes not happen.
{code:java}
//TransferFsImage.java
int chunkSize = (int) conf.getLongBytes(
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
if (imageFile.length() > chunkSize) {
  // using chunked streaming mode to support upload of 2GB+ files and to
  // avoid internal buffering.
  // this mode should be used only if more than chunkSize data is present
  // to upload. otherwise upload may not happen sometimes.
  connection.setChunkedStreamingMode(chunkSize);
}
{code}
There is no check code for this parameter, so a user may accidentally set it 
to a wrong value. If the user sets chunkSize to a negative value, chunked 
streaming mode will always be used. In setChunkedStreamingMode(chunkSize) 
there is correction code: if the chunkSize is <= 0, it is changed to 
DEFAULT_CHUNK_SIZE.
{code:java}
public void setChunkedStreamingMode (int chunklen) {
if (connected) {
throw new IllegalStateException ("Can't set streaming mode: already 
connected");
}
if (fixedContentLength != -1 || fixedContentLengthLong != -1) {
throw new IllegalStateException ("Fixed length streaming mode set");
}
chunkLength = chunklen <=0? DEFAULT_CHUNK_SIZE : chunklen;
}
{code}
However, the correction may be too late. 
 If the user sets dfs.image.transfer.chunksize to a value <= 0, chunked 
streaming mode will be used even for images whose imageFile.length() < 
DEFAULT_CHUNK_SIZE, and the upload may fail as mentioned above. *(This 
scenario may not be common, but we can prevent users from setting this param 
to an extremely small value.)*

*How to fix:*

Add checking or correction code right after parsing the config value, before 
the value is actually used (the if statement and setChunkedStreamingMode). 
  

  was:
In current implementation of checkpoint image transfer, if the file length is 
bigger than the configured value dfs.image.transfer.chunksize, it will use 
chunked streaming mode to avoid internal buffering. This mode should be used 
only if more than chunkSize data is present to upload, otherwise upload may not 
happen sometimes.
{code:java}
//TransferFsImage.java
int chunkSize = (int) conf.getLongBytes(
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
if (imageFile.length() > chunkSize) {
  // using chunked streaming mode to support upload of 2GB+ files and to
  // avoid internal buffering.
  // this mode should be used only if more than chunkSize data is present
  // to upload. otherwise upload may not happen sometimes.
  connection.setChunkedStreamingMode(chunkSize);
}
{code}
There is no any check code for this parameter. User may accidentally set this 
value to a wrong value. Here, if the user set chunkSize to a negative value. 
Chunked streaming mode will always be used. In 
setChunkedStreamingMode(chunkSize), there is a correction code that if the 
chunkSize is <=0, it will be change to DEFAULT_CHUNK_SIZE.
{code:java}
public void setChunkedStreamingMode (int chunklen) {
if (connected) {
throw new IllegalStateException ("Can't set streaming mode: already 
connected");
}
if (fixedContentLength != -1 || fixedContentLengthLong != -1) {
throw new IllegalStateException ("Fixed length streaming mode set");
}
chunkLength = chunklen <=0? DEFAULT_CHUNK_SIZE : chunklen;
}
{code}
However, the correction may be too late. 
 If the user set dfs.image.transfer.chunksize to value that <= 0, even for 
images whose imageFile.length() < DEFAULT_CHUNK_SIZE will use chunked streaming 
mode and may fail the upload as mentioned above. *(This scenario may not be 
common, but* *we can prevent users setting this param to an extremely small 
value.**)*

*How to fix:*

Add checking code or correction code right after parsing the config value 
before really use the value (if statement and setChunkedStreamingMode). 
  


> Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative 
> value
> ---
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Priority: Major
>
> In the current implementation of checkpoint image transfer, if the file 
> length is bigger than the configured value 

[jira] [Updated] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever (theoratcially).

2020-07-01 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Description: 
Configuration parameter "dfs.mover.retry.max.attempts" is to define the maximum 
number of retries before the mover consider the move failed. There is no 
checking code so this parameter can accept any int value.

Theoratically, setting this value to <=0 should mean that no retry at all. 
However, if you set the value to negative value. The checking condition for 
retry failed will never satisfied because the if statement is "*if 
(retryCount.get() == retryMaxAttempts)*". The retry count will always +1 by 
retryCount.incrementAndGet() after failed but never *=* *retryMaxAttempts.* 
{code:java}
private Result processNamespace() throws IOException {
  ... //wait for pending move to finish and retry the failed migration
  if (hasFailed && !hasSuccess) {
if (retryCount.get() == retryMaxAttempts) { // retryMaxAttempts < 0
  result.setRetryFailed();
  LOG.error("Failed to move some block's after "
  + retryMaxAttempts + " retries.");
  return result;
} else {
  retryCount.incrementAndGet();
}
  } else {
// Reset retry count if no failure.
retryCount.set(0);
  }
  ...
}
{code}
*How to fix*

Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
non-negative values, or change the if statement condition so that it triggers 
once the retry count reaches the maximum attempts.

  was:
Configuration parameter "dfs.mover.retry.max.attempts" is to define the maximum 
number of retries before the mover consider the move failed. There is no 
checking code so this parameter can accept any int value.

Theoratically, setting this value to <=0 should mean that no retry at all. 
However, if you set the value to negative value. The checking condition for 
retry failed will never satisfied because the if statement is "*if 
(retryCount.get() == retryMaxAttempts)*". The code is in Mover.java.
{code:java}
private Result processNamespace() throws IOException {
  ... //wait for pending move to finish and retry the failed migration
  if (hasFailed && !hasSuccess) {
if (retryCount.get() == retryMaxAttempts) { // retryMaxAttempts < 0
  result.setRetryFailed();
  LOG.error("Failed to move some block's after "
  + retryMaxAttempts + " retries.");
  return result;
} else {
  retryCount.incrementAndGet();
}
  } else {
// Reset retry count if no failure.
retryCount.set(0);
  }
  ...
}
{code}

*How to fix*

Add checking code of "dfs.mover.retry.max.attempts" to accept only non-negative 
value or change the if statement condition when retry count exceeds max 
attempts. 


> Setting dfs.mover.retry.max.attempts to negative value will retry forever 
> (theoratcially).
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer  mover
>Reporter: AMC-team
>Priority: Major
>
> Configuration parameter "dfs.mover.retry.max.attempts" defines the maximum 
> number of retries before the mover considers the move failed. There is no 
> checking code, so this parameter can accept any int value.
> Theoretically, setting this value to <= 0 should mean no retry at all. 
> However, if you set it to a negative value, the retry-failed condition is 
> never satisfied, because the if statement is "*if 
> (retryCount.get() == retryMaxAttempts)*". The retry count is always 
> incremented by retryCount.incrementAndGet() after a failure but never 
> *equals* *retryMaxAttempts.* 
> {code:java}
> private Result processNamespace() throws IOException {
>   ... //wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
> if (retryCount.get() == retryMaxAttempts) { // retryMaxAttempts < 0
>   result.setRetryFailed();
>   LOG.error("Failed to move some block's after "
>   + retryMaxAttempts + " retries.");
>   return result;
> } else {
>   retryCount.incrementAndGet();
> }
>   } else {
> // Reset retry count if no failure.
> retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
> non-negative values, or change the if statement condition so that it triggers 
> once the retry count reaches the maximum attempts.
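
As an illustration of the first option, a minimal sketch (hypothetical; the 
key/default constant names and the use of Guava's Preconditions are 
assumptions): validate the value where the Mover parses its configuration.

{code:java}
import com.google.common.base.Preconditions;

// Hypothetical validation at config-parsing time:
final int retryMaxAttempts = conf.getInt(
    DFSConfigKeys.DFS_MOVER_RETRY_MAX_ATTEMPTS_KEY,
    DFSConfigKeys.DFS_MOVER_RETRY_MAX_ATTEMPTS_DEFAULT);
Preconditions.checkArgument(retryMaxAttempts >= 0,
    "dfs.mover.retry.max.attempts must be non-negative, but was %s",
    retryMaxAttempts);
{code}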



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, 

[jira] [Updated] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will retry forever (theoratcially).

2020-07-01 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Summary: Setting dfs.mover.retry.max.attempts to negative value will retry 
forever (theoratcially).  (was: Setting dfs.mover.retry.max.attempts to 
negative value will fail the retry policy.)

> Setting dfs.mover.retry.max.attempts to negative value will retry forever 
> (theoratcially).
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer  mover
>Reporter: AMC-team
>Priority: Major
>
> Configuration parameter "dfs.mover.retry.max.attempts" defines the maximum 
> number of retries before the mover considers the move failed. There is no 
> checking code, so this parameter can accept any int value.
> Theoretically, setting this value to <= 0 should mean no retry at all. 
> However, if you set it to a negative value, the retry-failed condition is 
> never satisfied, because the if statement is "*if 
> (retryCount.get() == retryMaxAttempts)*". The code is in Mover.java.
> {code:java}
> private Result processNamespace() throws IOException {
>   ... //wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
> if (retryCount.get() == retryMaxAttempts) { // retryMaxAttempts < 0
>   result.setRetryFailed();
>   LOG.error("Failed to move some block's after "
>   + retryMaxAttempts + " retries.");
>   return result;
> } else {
>   retryCount.incrementAndGet();
> }
>   } else {
> // Reset retry count if no failure.
> retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add checking code for "dfs.mover.retry.max.attempts" so that it accepts only 
> non-negative values, or change the if statement condition so that it triggers 
> once the retry count reaches the maximum attempts. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15438) dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-01 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15438:

Description: 
In HDFS disk balancer, the config parameter "dfs.disk.balancer.max.disk.errors" 
controls the maximum number of errors we can ignore for a specific move 
between two disks before the move is abandoned.

The parameter accepts any value >= 0, and setting it to 0 should mean no error 
tolerance. However, setting the value to 0 simply skips the block copy even 
when no disk error occurs, because the while loop condition 
*item.getErrorCount() < getMaxError(item)* is never satisfied.
{code:java}
// Gets the next block that we can copy
private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
 DiskBalancerWorkItem item) {
  while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
try {
  ... //get the block
}  catch (IOException e) {
item.incErrorCount();
}
   if (item.getErrorCount() >= getMaxError(item)) {
item.setErrMsg("Error count exceeded.");
LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
item.getErrorCount(), item.getMaxDiskErrors());
  }
{code}
*How to fix*

Change the while loop condition so that the value 0 is handled correctly.
  

  was:
In HDFS disk balancer, the config parameter "dfs.disk.balancer.max.disk.errors" 
is to control the value of maximum number of errors we can ignore for a 
specific move between two disks before it is abandoned.

The parameter can accept value that >= 0. And setting the value to 0 should 
mean no error tolerance. However, setting the value to 0 will simply don't do 
the block copy even there is no disk error occur because the while loop 
condition *item.getErrorCount() < getMaxError(item)* will not satisfied.
{code:java}
// Gets the next block that we can copy
private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
 DiskBalancerWorkItem item) {
  while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) { 
//getMaxError = 0
try {
  ... //get the block
}  catch (IOException e) {
item.incErrorCount();
}
   if (item.getErrorCount() >= getMaxError(item)) {
item.setErrMsg("Error count exceeded.");
LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
item.getErrorCount(), item.getMaxDiskErrors());
  }
{code}
*How to fix*

Change the while loop condition to support value 0.
  


> dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer  mover
>Reporter: AMC-team
>Priority: Major
>
> In HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is abandoned.
> The parameter accepts any value >= 0, and setting it to 0 should mean no 
> error tolerance. However, setting the value to 0 simply skips the block copy 
> even when no disk error occurs, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>  DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) {
> try {
>   ... //get the block
> }  catch (IOException e) {
> item.incErrorCount();
> }
>if (item.getErrorCount() >= getMaxError(item)) {
> item.setErrMsg("Error count exceeded.");
> LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
> item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition so that the value 0 is handled correctly.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15438) dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-01 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15438:

Description: 
In HDFS disk balancer, the config parameter "dfs.disk.balancer.max.disk.errors" 
controls the maximum number of errors we can ignore for a specific move 
between two disks before the move is abandoned.

The parameter accepts any value >= 0, and setting it to 0 should mean no error 
tolerance. However, setting the value to 0 simply skips the block copy even 
when no disk error occurs, because the while loop condition 
*item.getErrorCount() < getMaxError(item)* is never satisfied.
{code:java}
// Gets the next block that we can copy
private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
 DiskBalancerWorkItem item) {
  while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) { //getMaxError = 0
try {
  ... //get the block
}  catch (IOException e) {
item.incErrorCount();
}
   if (item.getErrorCount() >= getMaxError(item)) {
item.setErrMsg("Error count exceeded.");
LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
item.getErrorCount(), item.getMaxDiskErrors());
  }
{code}
*How to fix*

Change the while loop condition so that the value 0 is handled correctly.
  

  was:
In HDFS disk balancer, the config parameter "dfs.disk.balancer.max.disk.errors" 
is to control the value of maximum number of errors we can ignore for a 
specific move between two disks before it is abandoned.

The parameter can accept value that >= 0. And setting the value to 0 should 
mean no error tolerance. However, setting the value to 0 will simple don't do 
the block copy because the while loop condition *item.getErrorCount() < 
getMaxError(item)* will not satisfied.
{code:java}
// Gets the next block that we can copy
private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
 DiskBalancerWorkItem item) {
  while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) { 
//getMaxError = 0
try {
  ... //get the block
}  catch (IOException e) {
item.incErrorCount();
}
   if (item.getErrorCount() >= getMaxError(item)) {
item.setErrMsg("Error count exceeded.");
LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
item.getErrorCount(), item.getMaxDiskErrors());
  }
{code}
*How to fix*

Change the while loop condition to support value 0.
  


> dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer  mover
>Reporter: AMC-team
>Priority: Major
>
> In HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is abandoned.
> The parameter accepts any value >= 0, and setting it to 0 should mean no 
> error tolerance. However, setting the value to 0 simply skips the block copy 
> even when no disk error occurs, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>  DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) { //getMaxError = 0
> try {
>   ... //get the block
> }  catch (IOException e) {
> item.incErrorCount();
> }
>if (item.getErrorCount() >= getMaxError(item)) {
> item.setErrMsg("Error count exceeded.");
> LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
> item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition so that the value 0 is handled correctly.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15438) dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-01 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15438:

Description: 
In HDFS disk balancer, the config parameter "dfs.disk.balancer.max.disk.errors" 
controls the maximum number of errors we can ignore for a specific move 
between two disks before the move is abandoned.

The parameter accepts any value >= 0, and setting it to 0 should mean no error 
tolerance. However, setting the value to 0 will simply not do the block copy, 
because the while loop condition *item.getErrorCount() < getMaxError(item)* is 
never satisfied.
{code:java}
// Gets the next block that we can copy
private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
 DiskBalancerWorkItem item) {
  while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) { //getMaxError = 0
try {
  ... //get the block
}  catch (IOException e) {
item.incErrorCount();
}
   if (item.getErrorCount() >= getMaxError(item)) {
item.setErrMsg("Error count exceeded.");
LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
item.getErrorCount(), item.getMaxDiskErrors());
  }
{code}
*How to fix*

Change the while loop condition so that the value 0 is handled correctly.
  

  was:
In HDFS disk balancer, the config parameter "dfs.disk.balancer.max.disk.errors" 
is to control the value of maximum number of errors we can ignore for a 
specific move between two disks before it is abandoned.

The parameter can accept value that >= 0. And setting the value to 0 should 
mean no error tolerance. However, setting the value to 0 will simple don't do 
the block copy because the while loop condition *item.getErrorCount() < 
getMaxError(item)* will not satisfied.
{code:java}
// Gets the next block that we can copy
private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
 DiskBalancerWorkItem item) {
  while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) { 
//getMaxError = 0
try {
  ... //get the block
}  catch (IOException e) {
item.incErrorCount();
}
   if (item.getErrorCount() >= getMaxError(item)) {
item.setErrMsg("Error count exceeded.");
LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
item.getErrorCount(), item.getMaxDiskErrors());
  }
{code}
*How to fix*

Change the while loop condition and the following if statement condition to 
support value 0.
  


> dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer  mover
>Reporter: AMC-team
>Priority: Major
>
> In HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is abandoned.
> The parameter accepts any value >= 0, and setting it to 0 should mean no 
> error tolerance. However, setting the value to 0 will simply not do the 
> block copy, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>  DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) { //getMaxError = 0
> try {
>   ... //get the block
> }  catch (IOException e) {
> item.incErrorCount();
> }
>if (item.getErrorCount() >= getMaxError(item)) {
> item.setErrMsg("Error count exceeded.");
> LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
> item.getErrorCount(), item.getMaxDiskErrors());
>   }
> {code}
> *How to fix*
> Change the while loop condition so that the value 0 is handled correctly.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15438) dfs.disk.balancer.max.disk.errors = 0 will fail the block copy

2020-07-01 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15438:

Description: 
In HDFS disk balancer, the config parameter "dfs.disk.balancer.max.disk.errors" 
controls the maximum number of errors we can ignore for a specific move 
between two disks before the move is abandoned.

The parameter accepts any value >= 0, and setting it to 0 should mean no error 
tolerance. However, setting the value to 0 will simply not do the block copy, 
because the while loop condition *item.getErrorCount() < getMaxError(item)* is 
never satisfied.
{code:java}
// Gets the next block that we can copy
private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
 DiskBalancerWorkItem item) {
  while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) { //getMaxError = 0
try {
  ... //get the block
}  catch (IOException e) {
item.incErrorCount();
}
   if (item.getErrorCount() >= getMaxError(item)) {
item.setErrMsg("Error count exceeded.");
LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
item.getErrorCount(), item.getMaxDiskErrors());
  }
{code}
*How to fix*

Change the while loop condition and the following if statement condition so 
that the value 0 is handled correctly.
  

  was:
In HDFS disk balancer, the config parameter "dfs.disk.balancer.max.disk.errors" 
is to control the value of maximum number of errors we can ignore for a 
specific move between two disks before it is abandoned.

>From the checking code, the parameter can accept value that >= 0. And setting 
>the value to 0 should mean no error tolerance (IIUC). However, in current code 
>implementation in DiskBalancer.java. Setting the value to 0 will simple don't 
>do the block copy because the while loop condition will not satisfied.

{code:java}
// Gets the next block that we can copy
private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
 DiskBalancerWorkItem item) {
  while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) { 
//getMaxError = 0
try {
  ... //get the block
}  catch (IOException e) {
item.incErrorCount();
}
   if (item.getErrorCount() >= getMaxError(item)) {
item.setErrMsg("Error count exceeded.");
LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
item.getErrorCount(), item.getMaxDiskErrors());
  }
{code}

*How to fix*

Change the while loop condition and the following if statement condition to 
support value 0.
 


> dfs.disk.balancer.max.disk.errors = 0 will fail the block copy
> --
>
> Key: HDFS-15438
> URL: https://issues.apache.org/jira/browse/HDFS-15438
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer  mover
>Reporter: AMC-team
>Priority: Major
>
> In HDFS disk balancer, the config parameter 
> "dfs.disk.balancer.max.disk.errors" controls the maximum number of errors we 
> can ignore for a specific move between two disks before the move is abandoned.
> The parameter accepts any value >= 0, and setting it to 0 should mean no 
> error tolerance. However, setting the value to 0 will simply not do the 
> block copy, because the while loop condition 
> *item.getErrorCount() < getMaxError(item)* is never satisfied.
> {code:java}
> // Gets the next block that we can copy
> private ExtendedBlock getBlockToCopy(FsVolumeSpi.BlockIterator iter,
>     DiskBalancerWorkItem item) {
>   while (!iter.atEnd() && item.getErrorCount() < getMaxError(item)) { // getMaxError = 0
>     try {
>       ... // get the block
>     } catch (IOException e) {
>       item.incErrorCount();
>     }
>     if (item.getErrorCount() >= getMaxError(item)) {
>       item.setErrMsg("Error count exceeded.");
>       LOG.info("Maximum error count exceeded. Error count: {} Max error:{} ",
>           item.getErrorCount(), item.getMaxDiskErrors());
>     }
> {code}
> *How to fix*
> Change the while loop condition and the following if statement condition to 
> support value 0.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-06-28 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15443:

Description: 
Configuration parameter dfs.datanode.max.transfer.threads specifies the 
maximum number of threads to use for transferring data in and out of the DN. 
This is a vital parameter that needs to be tuned carefully. 
{code:java}
// DataXceiverServer.java
// Make sure the xceiver count is not exceeded
int curXceiverCount = datanode.getXceiverCount();
if (curXceiverCount > maxXceiverCount) {
  throw new IOException("Xceiver count " + curXceiverCount
      + " exceeds the limit of concurrent xceivers: "
      + maxXceiverCount);
}
{code}
Many issues are caused by not setting this parameter to an appropriate value. 
However, there is no check code to restrict it. Although a hard-and-fast rule 
is difficult because we need to consider the number of cores, main memory, 
etc., *we can prevent users from accidentally setting this value to an 
absolutely wrong one* (e.g. a negative value that totally breaks the 
availability of the datanode).

*How to fix:*

Add proper check code for the parameter.
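
A minimal sketch of such a check (assuming the existing DFSConfigKeys 
constants for this key; the exact lower bound and failure mode are open 
questions):
{code:java}
// Sketch: validate dfs.datanode.max.transfer.threads at startup and fail
// fast, instead of letting a non-positive value reject every transfer later.
int maxXceiverCount = conf.getInt(
    DFSConfigKeys.DFS_DATANODE_MAX_RECEIVER_THREADS_KEY,
    DFSConfigKeys.DFS_DATANODE_MAX_RECEIVER_THREADS_DEFAULT);
if (maxXceiverCount <= 0) {
  throw new HadoopIllegalArgumentException(
      DFSConfigKeys.DFS_DATANODE_MAX_RECEIVER_THREADS_KEY
          + " must be positive, but was " + maxXceiverCount);
}
{code}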

  was:
Configuration parameter dfs.datanode.max.transfer.threads specifies the 
maximum number of threads to use for transferring data in and out of the DN. 
This is a vital parameter that needs to be tuned carefully. There are many bad 
cases caused by not setting this parameter to an appropriate value.

{code:java}
// DataXceiverServer.java
// Make sure the xceiver count is not exceeded
int curXceiverCount = datanode.getXceiverCount();
if (curXceiverCount > maxXceiverCount) {
  throw new IOException("Xceiver count " + curXceiverCount
      + " exceeds the limit of concurrent xceivers: "
      + maxXceiverCount);
}
{code}

However, there is no check code to restrict the parameter. Although a 
hard-and-fast rule is difficult because we need to consider the number of 
cores, main memory, etc., *we should at least prevent users from accidentally 
setting this value to an absolutely wrong one* (e.g. a negative value that 
totally breaks the availability of the datanode).

*How to fix:*

Add proper check code for the parameter.


> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital parameter that needs to be tuned carefully. 
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> Many issues are caused by not setting this parameter to an appropriate 
> value. However, there is no check code to restrict it. Although a 
> hard-and-fast rule is difficult because we need to consider the number of 
> cores, main memory, etc., *we can prevent users from accidentally setting 
> this value to an absolutely wrong one* (e.g. a negative value that totally 
> breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15443) Setting dfs.datanode.max.transfer.threads to a very small value can cause strange failure.

2020-06-28 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15443:

Summary: Setting dfs.datanode.max.transfer.threads to a very small value 
can cause strange failure.  (was: Setting )

> Setting dfs.datanode.max.transfer.threads to a very small value can cause 
> strange failure.
> --
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital parameter that needs to be tuned carefully. There are many 
> bad cases caused by not setting this parameter to an appropriate value.
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> However, there is no check code to restrict the parameter. Although a 
> hard-and-fast rule is difficult because we need to consider the number of 
> cores, main memory, etc., *we should at least prevent users from 
> accidentally setting this value to an absolutely wrong one* (e.g. a negative 
> value that totally breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15443) Setting

2020-06-28 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15443:

Summary: Setting   (was: dfs.datanode.max.transfer.threads should have 
check code)

> Setting 
> 
>
> Key: HDFS-15443
> URL: https://issues.apache.org/jira/browse/HDFS-15443
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: AMC-team
>Priority: Major
>
> Configuration parameter dfs.datanode.max.transfer.threads specifies the 
> maximum number of threads to use for transferring data in and out of the DN. 
> This is a vital parameter that needs to be tuned carefully. There are many 
> bad cases caused by not setting this parameter to an appropriate value.
> {code:java}
> // DataXceiverServer.java
> // Make sure the xceiver count is not exceeded
> int curXceiverCount = datanode.getXceiverCount();
> if (curXceiverCount > maxXceiverCount) {
>   throw new IOException("Xceiver count " + curXceiverCount
>       + " exceeds the limit of concurrent xceivers: "
>       + maxXceiverCount);
> }
> {code}
> However, there is no check code to restrict the parameter. Although a 
> hard-and-fast rule is difficult because we need to consider the number of 
> cores, main memory, etc., *we should at least prevent users from 
> accidentally setting this value to an absolutely wrong one* (e.g. a negative 
> value that totally breaks the availability of the datanode).
> *How to fix:*
> Add proper check code for the parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15442) Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative value

2020-06-28 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15442:

Summary: Image upload may fail if dfs.image.transfer.chunksize wrongly set 
to negative value  (was: Image upload will fail if dfs.image.transfer.chunksize 
wrongly set to negative value)

> Image upload may fail if dfs.image.transfer.chunksize wrongly set to negative 
> value
> ---
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Priority: Major
>
> In the current implementation of checkpoint image transfer, if the file 
> length is bigger than the configured value dfs.image.transfer.chunksize, it 
> will use chunked streaming mode to avoid internal buffering. This mode 
> should be used only if more than chunkSize data is present to upload; 
> otherwise the upload may sometimes not happen.
> {code:java}
> //TransferFsImage.java
> int chunkSize = (int) conf.getLongBytes(
> DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
> DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
> if (imageFile.length() > chunkSize) {
>   // using chunked streaming mode to support upload of 2GB+ files and to
>   // avoid internal buffering.
>   // this mode should be used only if more than chunkSize data is present
>   // to upload. otherwise upload may not happen sometimes.
>   connection.setChunkedStreamingMode(chunkSize);
> }
> {code}
> There is no check code for this parameter, so a user may accidentally set 
> it to a wrong value. If the user sets chunkSize to a negative value, chunked 
> streaming mode will always be used. In setChunkedStreamingMode(chunkSize), 
> there is correction code: if the chunkSize is <= 0, it will be changed to 
> DEFAULT_CHUNK_SIZE.
> {code:java}
> public void setChunkedStreamingMode (int chunklen) {
> if (connected) {
> throw new IllegalStateException ("Can't set streaming mode: already 
> connected");
> }
> if (fixedContentLength != -1 || fixedContentLengthLong != -1) {
> throw new IllegalStateException ("Fixed length streaming mode set");
> }
> chunkLength = chunklen <=0? DEFAULT_CHUNK_SIZE : chunklen;
> }
> {code}
> However, the correction may be too late. If the user sets 
> dfs.image.transfer.chunksize to a value <= 0, even images whose 
> imageFile.length() < DEFAULT_CHUNK_SIZE will use chunked streaming mode, and 
> the upload may fail as mentioned above. *(This scenario may not be common, 
> but we can prevent users from setting this param to an extremely small 
> value.)*
> *How to fix:*
> Add checking or correction code right after parsing the config value, 
> before actually using it (in the if statement and setChunkedStreamingMode). 
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15442) Image upload will fail if dfs.image.transfer.chunksize wrongly set to negative value

2020-06-28 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15442:

Summary: Image upload will fail if dfs.image.transfer.chunksize wrongly set 
to negative value  (was: Image upload will fail if )

> Image upload will fail if dfs.image.transfer.chunksize wrongly set to 
> negative value
> 
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Priority: Major
>
> In the current implementation of checkpoint image transfer, if the file 
> length is bigger than the configured value dfs.image.transfer.chunksize, it 
> will use chunked streaming mode to avoid internal buffering. This mode 
> should be used only if more than chunkSize data is present to upload; 
> otherwise the upload may sometimes not happen.
> {code:java}
> //TransferFsImage.java
> int chunkSize = (int) conf.getLongBytes(
> DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
> DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
> if (imageFile.length() > chunkSize) {
>   // using chunked streaming mode to support upload of 2GB+ files and to
>   // avoid internal buffering.
>   // this mode should be used only if more than chunkSize data is present
>   // to upload. otherwise upload may not happen sometimes.
>   connection.setChunkedStreamingMode(chunkSize);
> }
> {code}
> There is no check code for this parameter, so a user may accidentally set 
> it to a wrong value. If the user sets chunkSize to a negative value, chunked 
> streaming mode will always be used. In setChunkedStreamingMode(chunkSize), 
> there is correction code: if the chunkSize is <= 0, it will be changed to 
> DEFAULT_CHUNK_SIZE.
> {code:java}
> public void setChunkedStreamingMode (int chunklen) {
> if (connected) {
> throw new IllegalStateException ("Can't set streaming mode: already 
> connected");
> }
> if (fixedContentLength != -1 || fixedContentLengthLong != -1) {
> throw new IllegalStateException ("Fixed length streaming mode set");
> }
> chunkLength = chunklen <=0? DEFAULT_CHUNK_SIZE : chunklen;
> }
> {code}
> However, the correction may be too late. If the user sets 
> dfs.image.transfer.chunksize to a value <= 0, even images whose 
> imageFile.length() < DEFAULT_CHUNK_SIZE will use chunked streaming mode, and 
> the upload may fail as mentioned above. *(This scenario may not be common, 
> but we can prevent users from setting this param to an extremely small 
> value.)*
> *How to fix:*
> Add checking or correction code right after parsing the config value, 
> before actually using it (in the if statement and setChunkedStreamingMode). 
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15442) Image upload will fail if

2020-06-28 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15442:

Summary: Image upload will fail if   (was: dfs.image.transfer.chunksize 
should have check code before use)

> Image upload will fail if 
> --
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Priority: Major
>
> In the current implementation of checkpoint image transfer, if the file 
> length is bigger than the configured value dfs.image.transfer.chunksize, it 
> will use chunked streaming mode to avoid internal buffering. This mode 
> should be used only if more than chunkSize data is present to upload; 
> otherwise the upload may sometimes not happen.
> {code:java}
> //TransferFsImage.java
> int chunkSize = (int) conf.getLongBytes(
> DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
> DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
> if (imageFile.length() > chunkSize) {
>   // using chunked streaming mode to support upload of 2GB+ files and to
>   // avoid internal buffering.
>   // this mode should be used only if more than chunkSize data is present
>   // to upload. otherwise upload may not happen sometimes.
>   connection.setChunkedStreamingMode(chunkSize);
> }
> {code}
> There is no check code for this parameter, so a user may accidentally set 
> it to a wrong value. If the user sets chunkSize to a negative value, chunked 
> streaming mode will always be used. In setChunkedStreamingMode(chunkSize), 
> there is correction code: if the chunkSize is <= 0, it will be changed to 
> DEFAULT_CHUNK_SIZE.
> {code:java}
> public void setChunkedStreamingMode (int chunklen) {
> if (connected) {
> throw new IllegalStateException ("Can't set streaming mode: already 
> connected");
> }
> if (fixedContentLength != -1 || fixedContentLengthLong != -1) {
> throw new IllegalStateException ("Fixed length streaming mode set");
> }
> chunkLength = chunklen <=0? DEFAULT_CHUNK_SIZE : chunklen;
> }
> {code}
> However, the correction may be too late. If the user sets 
> dfs.image.transfer.chunksize to a value <= 0, even images whose 
> imageFile.length() < DEFAULT_CHUNK_SIZE will use chunked streaming mode, and 
> the upload may fail as mentioned above. *(This scenario may not be common, 
> but we can prevent users from setting this param to an extremely small 
> value.)*
> *How to fix:*
> Add checking or correction code right after parsing the config value, 
> before actually using it (in the if statement and setChunkedStreamingMode). 
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15440) The doc of dfs.disk.balancer.block.tolerance.percent is misleading

2020-06-28 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15440:

Summary: The doc of dfs.disk.balancer.block.tolerance.percent is misleading 
 (was: The using of dfs.disk.balancer.block.tolerance.percent is inconsistent 
with doc)

> The doc of dfs.disk.balancer.block.tolerance.percent is misleading
> --
>
> Key: HDFS-15440
> URL: https://issues.apache.org/jira/browse/HDFS-15440
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
>
> In HDFS disk balancer, the configuration parameter 
> "dfs.disk.balancer.block.tolerance.percent" sets a percentage (e.g. 10 means 
> 10%) which defines a good enough move.
> The description in hdfs-default.xml does not make clear how the value is 
> actually calculated and used:
> {quote}When a disk balancer copy operation is proceeding, the datanode is 
> still active. So it might not be possible to move the exactly specified 
> amount of data. So tolerance allows us to define a percentage which defines a 
> good enough move.
> {quote}
> So I referred to the [official doc of HDFS disk 
> balancer|https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html],
> and its description is:
> {quote}The tolerance percent specifies when we have reached a good enough 
> value for any copy step. For example, if you specify 10 then getting close to 
> 10% of the target value is good enough. It is to say if the move operation is 
> 20GB in size, if we can move 18GB (20 * (1-10%)) that operation is considered 
> successful.
> {quote}
> However, from the source code in DiskBalancer.java:
> {code:java}
> // Inflates bytesCopied and returns true or false. This allows us to stop
> // copying if we have reached close enough.
> private boolean isCloseEnough(DiskBalancerWorkItem item) {
>   long temp = item.getBytesCopied() +
>       ((item.getBytesCopied() * getBlockTolerancePercentage(item)) / 100);
>   return (item.getBytesToCopy() >= temp) ? false : true;
> }
> {code}
> Here, if item.getBytesToCopy() = 20GB, then item.getBytesCopied() = 18GB is 
> still not enough, because 20 > 18 + 18*0.1 = 19.8.
> The calculation in isLessThanNeeded() (which checks whether a given block is 
> smaller than the size needed to meet our goal) is unintuitive in the same 
> way.
> *How to fix*
> Although this may not lead to severe failure, it is better to make the doc 
> and the code consistent, and also to refine the description in 
> hdfs-default.xml to make it more precise and clear.
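> To make the mismatch concrete, a doc-consistent version of the check would 
> look roughly like this (a sketch only, reusing the accessors from the 
> snippet above; not a proposed patch):
> {code:java}
> // Doc semantics: copying at least (100 - tolerance)% of bytesToCopy is
> // good enough, so an 18GB copy of a 20GB move passes with tolerance = 10.
> private boolean isCloseEnough(DiskBalancerWorkItem item) {
>   long threshold = (item.getBytesToCopy()
>       * (100 - getBlockTolerancePercentage(item))) / 100;
>   return item.getBytesCopied() >= threshold;
> }
> {code}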



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15442) dfs.image.transfer.chunksize should have check code before use

2020-06-28 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15442:

Description: 
In the current implementation of checkpoint image transfer, if the file length 
is bigger than the configured value dfs.image.transfer.chunksize, it will use 
chunked streaming mode to avoid internal buffering. This mode should be used 
only if more than chunkSize data is present to upload; otherwise the upload 
may sometimes not happen.
{code:java}
//TransferFsImage.java
int chunkSize = (int) conf.getLongBytes(
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
if (imageFile.length() > chunkSize) {
  // using chunked streaming mode to support upload of 2GB+ files and to
  // avoid internal buffering.
  // this mode should be used only if more than chunkSize data is present
  // to upload. otherwise upload may not happen sometimes.
  connection.setChunkedStreamingMode(chunkSize);
}
{code}
There is no check code for this parameter, so a user may accidentally set it 
to a wrong value. If the user sets chunkSize to a negative value, chunked 
streaming mode will always be used. In setChunkedStreamingMode(chunkSize), 
there is correction code: if the chunkSize is <= 0, it will be changed to 
DEFAULT_CHUNK_SIZE.
{code:java}
public void setChunkedStreamingMode (int chunklen) {
if (connected) {
throw new IllegalStateException ("Can't set streaming mode: already 
connected");
}
if (fixedContentLength != -1 || fixedContentLengthLong != -1) {
throw new IllegalStateException ("Fixed length streaming mode set");
}
chunkLength = chunklen <=0? DEFAULT_CHUNK_SIZE : chunklen;
}
{code}
However, the correction may be too late. If the user sets 
dfs.image.transfer.chunksize to a value <= 0, even images whose 
imageFile.length() < DEFAULT_CHUNK_SIZE will use chunked streaming mode, and 
the upload may fail as mentioned above. *(This scenario may not be common, 
but we can prevent users from setting this param to an extremely small 
value.)*

*How to fix:*

Add checking or correction code right after parsing the config value, before 
actually using it (in the if statement and setChunkedStreamingMode). 
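
A minimal correction sketch in TransferFsImage.java (clamping to the default 
as one option; rejecting the value outright would be the stricter 
alternative):
{code:java}
int chunkSize = (int) conf.getLongBytes(
    DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
    DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
// Sketch: correct a non-positive setting before it is compared against
// imageFile.length(), so small images keep using fixed-length mode.
if (chunkSize <= 0) {
  chunkSize = (int) DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT;
}
if (imageFile.length() > chunkSize) {
  connection.setChunkedStreamingMode(chunkSize);
}
{code}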
  

  was:
In the current implementation of checkpoint image transfer, if the file length 
is bigger than the configured value dfs.image.transfer.chunksize, it will use 
chunked streaming mode to avoid internal buffering. This mode should be used 
only if more than chunkSize data is present to upload; otherwise the upload 
may sometimes not happen.
{code:java}
//TransferFsImage.java
int chunkSize = (int) conf.getLongBytes(
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
if (imageFile.length() > chunkSize) {
  // using chunked streaming mode to support upload of 2GB+ files and to
  // avoid internal buffering.
  // this mode should be used only if more than chunkSize data is present
  // to upload. otherwise upload may not happen sometimes.
  connection.setChunkedStreamingMode(chunkSize);
}
{code}
There is no check code for this parameter, so a user may accidentally set it 
to a wrong value. If the user sets chunkSize to a negative value, chunked 
streaming mode will always be used. In setChunkedStreamingMode(chunkSize), 
there is correction code: if the chunkSize is <= 0, it will be changed to 
DEFAULT_CHUNK_SIZE.
{code:java}
public void setChunkedStreamingMode (int chunklen) {
if (connected) {
throw new IllegalStateException ("Can't set streaming mode: already 
connected");
}
if (fixedContentLength != -1 || fixedContentLengthLong != -1) {
throw new IllegalStateException ("Fixed length streaming mode set");
}
chunkLength = chunklen <=0? DEFAULT_CHUNK_SIZE : chunklen;
}
{code}
However, the correction may be too late. If the user sets 
dfs.image.transfer.chunksize to a value <= 0, even images whose 
imageFile.length() < DEFAULT_CHUNK_SIZE will use chunked streaming mode, and 
the upload may fail as mentioned above. *(This scenario may not be common, 
but at least we can prevent users from setting this param to an extremely 
small value.)*

*How to fix:*

Add checking or correction code right after parsing the config value, before 
actually using it (in the if statement and setChunkedStreamingMode). 
  


> dfs.image.transfer.chunksize should have check code before use
> --
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Priority: Major
>
> In the current implementation of checkpoint image transfer, if the file 
> length is bigger than the configured value dfs.image.transfer.chunksize, it 
> will use 

[jira] [Updated] (HDFS-15439) Setting dfs.mover.retry.max.attempts to negative value will fail the retry policy.

2020-06-28 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15439:

Summary: Setting dfs.mover.retry.max.attempts to negative value will fail 
the retry policy.  (was: dfs.mover.retry.max.attempts should not accept 
negative value )

> Setting dfs.mover.retry.max.attempts to negative value will fail the retry 
> policy.
> --
>
> Key: HDFS-15439
> URL: https://issues.apache.org/jira/browse/HDFS-15439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Reporter: AMC-team
>Priority: Major
>
> Configuration parameter "dfs.mover.retry.max.attempts" is to define the 
> maximum number of retries before the mover consider the move failed. There is 
> no checking code so this parameter can accept any int value.
> Theoratically, setting this value to <=0 should mean that no retry at all. 
> However, if you set the value to negative value. The checking condition for 
> retry failed will never satisfied because the if statement is "*if 
> (retryCount.get() == retryMaxAttempts)*". The code is in Mover.java.
> {code:java}
> private Result processNamespace() throws IOException {
>   ... // wait for pending move to finish and retry the failed migration
>   if (hasFailed && !hasSuccess) {
>     if (retryCount.get() == retryMaxAttempts) { // retryMaxAttempts < 0
>       result.setRetryFailed();
>       LOG.error("Failed to move some block's after "
>           + retryMaxAttempts + " retries.");
>       return result;
>     } else {
>       retryCount.incrementAndGet();
>     }
>   } else {
>     // Reset retry count if no failure.
>     retryCount.set(0);
>   }
>   ...
> }
> {code}
> *How to fix*
> Add checking code for "dfs.mover.retry.max.attempts" to accept only 
> non-negative values, or change the if statement condition to trigger when 
> the retry count meets or exceeds the max attempts. 
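> For illustration, the condition variant could look like this (a sketch 
> only, not an actual patch):
> {code:java}
> // Sketch: with >=, a non-positive retryMaxAttempts fails immediately (no
> // retry), since retryCount starts at 0 and 0 >= retryMaxAttempts holds;
> // positive values keep their current behavior.
> if (retryCount.get() >= retryMaxAttempts) {
>   result.setRetryFailed();
>   LOG.error("Failed to move some blocks after "
>       + retryMaxAttempts + " retries.");
>   return result;
> } else {
>   retryCount.incrementAndGet();
> }
> {code}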



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15442) dfs.image.transfer.chunksize should have check code before use

2020-06-27 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15442:

Description: 
In the current implementation of checkpoint image transfer, if the file length 
is bigger than the configured value dfs.image.transfer.chunksize, it will use 
chunked streaming mode to avoid internal buffering. This mode should be used 
only if more than chunkSize data is present to upload; otherwise the upload 
may sometimes not happen.
{code:java}
//TransferFsImage.java
int chunkSize = (int) conf.getLongBytes(
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
if (imageFile.length() > chunkSize) {
  // using chunked streaming mode to support upload of 2GB+ files and to
  // avoid internal buffering.
  // this mode should be used only if more than chunkSize data is present
  // to upload. otherwise upload may not happen sometimes.
  connection.setChunkedStreamingMode(chunkSize);
}
{code}
There is no check code for this parameter, so a user may accidentally set it 
to a wrong value. If the user sets chunkSize to a negative value, chunked 
streaming mode will always be used. In setChunkedStreamingMode(chunkSize), 
there is correction code: if the chunkSize is <= 0, it will be changed to 
DEFAULT_CHUNK_SIZE.
{code:java}
public void setChunkedStreamingMode (int chunklen) {
if (connected) {
throw new IllegalStateException ("Can't set streaming mode: already 
connected");
}
if (fixedContentLength != -1 || fixedContentLengthLong != -1) {
throw new IllegalStateException ("Fixed length streaming mode set");
}
chunkLength = chunklen <=0? DEFAULT_CHUNK_SIZE : chunklen;
}
{code}
However, the correction may be too late. If the user sets 
dfs.image.transfer.chunksize to a value <= 0, even images whose 
imageFile.length() < DEFAULT_CHUNK_SIZE will use chunked streaming mode, and 
the upload may fail as mentioned above. *(This scenario may not be common, 
but at least we can prevent users from setting this param to an extremely 
small value.)*

*How to fix:*

Add checking or correction code right after parsing the config value, before 
actually using it (in the if statement and setChunkedStreamingMode). 
  

  was:
In the current implementation of checkpoint image transfer, if the file length 
is bigger than the configured value dfs.image.transfer.chunksize, it will use 
chunked streaming mode to avoid internal buffering. This mode should be used 
only if more than chunkSize data is present to upload; otherwise the upload 
may sometimes not happen.
{code:java}
//TransferFsImage.java
int chunkSize = (int) conf.getLongBytes(
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
if (imageFile.length() > chunkSize) {
  // using chunked streaming mode to support upload of 2GB+ files and to
  // avoid internal buffering.
  // this mode should be used only if more than chunkSize data is present
  // to upload. otherwise upload may not happen sometimes.
  connection.setChunkedStreamingMode(chunkSize);
}
{code}
There is no check code for this parameter, so a user may accidentally set it 
to a wrong value. If the user sets chunkSize to a negative value, chunked 
streaming mode will always be used. In setChunkedStreamingMode(chunkSize), 
there is correction code: if the chunkSize is <= 0, it will be changed to 
DEFAULT_CHUNK_SIZE.
{code:java}
public void setChunkedStreamingMode (int chunklen) {
if (connected) {
throw new IllegalStateException ("Can't set streaming mode: already 
connected");
}
if (fixedContentLength != -1 || fixedContentLengthLong != -1) {
throw new IllegalStateException ("Fixed length streaming mode set");
}
chunkLength = chunklen <=0? DEFAULT_CHUNK_SIZE : chunklen;
}
{code}
However, the correction may be too late. If the user sets 
dfs.image.transfer.chunksize to a value <= 0, even images whose 
imageFile.length() < DEFAULT_CHUNK_SIZE will use chunked streaming mode, and 
the upload may fail as mentioned above. *(This scenario may be common, but at 
least we can prevent users from setting this param to an extremely small 
value.)*

*How to fix:*

Add checking or correction code right after parsing the config value, before 
actually using it (in the if statement and setChunkedStreamingMode). 
  


> dfs.image.transfer.chunksize should have check code before use
> --
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Priority: Major
>
> In the current implementation of checkpoint image transfer, if the file 
> length is bigger than the configured value dfs.image.transfer.chunksize, it 
> will 

[jira] [Updated] (HDFS-15442) dfs.image.transfer.chunksize should have check code before use

2020-06-27 Thread AMC-team (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AMC-team updated HDFS-15442:

Description: 
In the current implementation of checkpoint image transfer, if the file length 
is bigger than the configured value dfs.image.transfer.chunksize, it will use 
chunked streaming mode to avoid internal buffering. This mode should be used 
only if more than chunkSize data is present to upload; otherwise the upload 
may sometimes not happen.
{code:java}
//TransferFsImage.java
int chunkSize = (int) conf.getLongBytes(
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
if (imageFile.length() > chunkSize) {
  // using chunked streaming mode to support upload of 2GB+ files and to
  // avoid internal buffering.
  // this mode should be used only if more than chunkSize data is present
  // to upload. otherwise upload may not happen sometimes.
  connection.setChunkedStreamingMode(chunkSize);
}
{code}
There is no check code for this parameter, so a user may accidentally set it 
to a wrong value. If the user sets chunkSize to a negative value, chunked 
streaming mode will always be used. In setChunkedStreamingMode(chunkSize), 
there is correction code: if the chunkSize is <= 0, it will be changed to 
DEFAULT_CHUNK_SIZE.
{code:java}
public void setChunkedStreamingMode (int chunklen) {
if (connected) {
throw new IllegalStateException ("Can't set streaming mode: already 
connected");
}
if (fixedContentLength != -1 || fixedContentLengthLong != -1) {
throw new IllegalStateException ("Fixed length streaming mode set");
}
chunkLength = chunklen <=0? DEFAULT_CHUNK_SIZE : chunklen;
}
{code}
However, the correction may be too late. If the user sets 
dfs.image.transfer.chunksize to a value <= 0, even images whose 
imageFile.length() < DEFAULT_CHUNK_SIZE will use chunked streaming mode, and 
the upload may fail as mentioned above. *(This scenario may be common, but at 
least we can prevent users from setting this param to an extremely small 
value.)*

*How to fix:*

Add checking or correction code right after parsing the config value, before 
actually using it (in the if statement and setChunkedStreamingMode). 
  

  was:
In the current implementation of checkpoint image transfer, if the file length 
is bigger than the configured value dfs.image.transfer.chunksize, it will use 
chunked streaming mode to avoid internal buffering. This mode should be used 
only if more than chunkSize data is present to upload; otherwise the upload 
may sometimes not happen.
{code:java}
//TransferFsImage.java
int chunkSize = (int) conf.getLongBytes(
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_KEY,
DFSConfigKeys.DFS_IMAGE_TRANSFER_CHUNKSIZE_DEFAULT);
if (imageFile.length() > chunkSize) {
  // using chunked streaming mode to support upload of 2GB+ files and to
  // avoid internal buffering.
  // this mode should be used only if more than chunkSize data is present
  // to upload. otherwise upload may not happen sometimes.
  connection.setChunkedStreamingMode(chunkSize);
}
{code}
There is no check code for this parameter, so a user may accidentally set it 
to a wrong value. If the user sets chunkSize to a negative value, chunked 
streaming mode will always be used. In setChunkedStreamingMode(chunkSize), 
there is correction code: if the chunkSize is <= 0, it will be changed to 
DEFAULT_CHUNK_SIZE.
{code:java}
public void setChunkedStreamingMode (int chunklen) {
if (connected) {
throw new IllegalStateException ("Can't set streaming mode: already 
connected");
}
if (fixedContentLength != -1 || fixedContentLengthLong != -1) {
throw new IllegalStateException ("Fixed length streaming mode set");
}
chunkLength = chunklen <=0? DEFAULT_CHUNK_SIZE : chunklen;
}
{code}
However, the correction may be too late. If the user sets 
dfs.image.transfer.chunksize to a value <= 0, even images whose 
imageFile.length() < DEFAULT_CHUNK_SIZE will use chunked streaming mode, and 
the upload may fail as mentioned above. *(This scenario may be rare or even 
impossible, but at least we can prevent users from setting this param to an 
extremely small value.)*

*How to fix:*

Add checking or correction code right after parsing the config value, before 
actually using it (in the if statement and setChunkedStreamingMode). 
  


> dfs.image.transfer.chunksize should have check code before use
> --
>
> Key: HDFS-15442
> URL: https://issues.apache.org/jira/browse/HDFS-15442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: AMC-team
>Priority: Major
>
> In the current implementation of checkpoint image transfer, if the file 
> length is bigger than the configured value 
