[jira] [Updated] (HDDS-2032) Ozone client should retry writes in case of any ratis/stateMachine exceptions

2019-08-25 Thread Shashikant Banerjee (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shashikant Banerjee updated HDDS-2032:
--
Description: Currently, the Ozone client retries writes on a different pipeline 
or container only for some specific exceptions. But when it sees an exception 
such as DISK_FULL, CONTAINER_UNHEALTHY or any corruption, it just aborts the 
write. In general, every such exception should be retriable in the Ozone 
client, and for some specific exceptions it should take more specific action, 
such as excluding certain containers or pipelines while retrying, or informing 
SCM of a corrupt replica.  (was: 
Currently, Ozone client retry writes on a different pipeline or container in 
case of some specific exceptions. But in case, it sees exception such as 
DISK_FULL, CONTAINER_UNHEALTHY or any corruption , it just aborts the right. In 
general, the every such exception on the client should be a retriable  
exception in ozone client and on some specific exceptions, it should take some 
more specific exception like excluding certain containers or pipelines while 
retrying or informing SCM of a corrupt replica etc.)

> Ozone client should retry writes in case of any ratis/stateMachine exceptions
> -
>
> Key: HDDS-2032
> URL: https://issues.apache.org/jira/browse/HDDS-2032
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Client
>Affects Versions: 0.5.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
> Fix For: 0.5.0
>
>
> Currently, the Ozone client retries writes on a different pipeline or container 
> only for some specific exceptions. But when it sees an exception such as 
> DISK_FULL, CONTAINER_UNHEALTHY or any corruption, it just aborts the write. In 
> general, every such exception should be retriable in the Ozone client, and for 
> some specific exceptions it should take more specific action, such as excluding 
> certain containers or pipelines while retrying, or informing SCM of a corrupt 
> replica.
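
An illustrative sketch of the retry behaviour described above, using only hypothetical names (WriteAttempt, WriteFailedException; none of these come from the Ozone client code): classify the failure, exclude the failing container/pipeline, and retry instead of aborting.

{code:java}
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class RetryWriteSketch {

  /** A single write attempt that skips the given excluded locations. */
  interface WriteAttempt {
    void write(Set<String> excludedLocations) throws WriteFailedException;
  }

  /** Carries the container/pipeline id that should be excluded on retry. */
  static class WriteFailedException extends IOException {
    final String failedLocation;
    WriteFailedException(String failedLocation, String reason) {
      super(reason);
      this.failedLocation = failedLocation;
    }
  }

  static void writeWithRetry(WriteAttempt attempt, int maxRetries)
      throws IOException {
    Set<String> excluded = new HashSet<>();
    IOException last = null;
    for (int i = 0; i <= maxRetries; i++) {
      try {
        attempt.write(excluded);        // retry, skipping excluded locations
        return;                         // success
      } catch (WriteFailedException e) {
        excluded.add(e.failedLocation); // do not pick this location again
        last = e;
      }
    }
    throw last;                         // give up only after maxRetries attempts
  }
}
{code}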



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2032) Ozone client should retry writes in case of any ratis/stateMachine exceptions

2019-08-25 Thread Shashikant Banerjee (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shashikant Banerjee updated HDDS-2032:
--
Summary: Ozone client should retry writes in case of any ratis/stateMachine 
exceptions  (was: Ozone client retry writes in case of any ratis/stateMachine 
exceptions)

> Ozone client should retry writes in case of any ratis/stateMachine exceptions
> -
>
> Key: HDDS-2032
> URL: https://issues.apache.org/jira/browse/HDDS-2032
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Client
>Affects Versions: 0.5.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
> Fix For: 0.5.0
>
>
> Currently, the Ozone client retries writes on a different pipeline or container 
> only for some specific exceptions. But when it sees an exception such as 
> DISK_FULL, CONTAINER_UNHEALTHY or any corruption, it just aborts the write. In 
> general, every such exception should be retriable in the Ozone client, and for 
> some specific exceptions it should take more specific action, such as excluding 
> certain containers or pipelines while retrying, or informing SCM of a corrupt 
> replica.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2032) Ozone client retry writes in case of any ratis/stateMachine exceptions

2019-08-25 Thread Shashikant Banerjee (Jira)
Shashikant Banerjee created HDDS-2032:
-

 Summary: Ozone client retry writes in case of any 
ratis/stateMachine exceptions
 Key: HDDS-2032
 URL: https://issues.apache.org/jira/browse/HDDS-2032
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
  Components: Ozone Client
Affects Versions: 0.5.0
Reporter: Shashikant Banerjee
Assignee: Shashikant Banerjee
 Fix For: 0.5.0


Currently, the Ozone client retries writes on a different pipeline or container 
only for some specific exceptions. But when it sees an exception such as 
DISK_FULL, CONTAINER_UNHEALTHY or any corruption, it just aborts the write. In 
general, every such exception should be retriable in the Ozone client, and for 
some specific exceptions it should take more specific action, such as excluding 
certain containers or pipelines while retrying, or informing SCM of a corrupt 
replica.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12904) Add DataTransferThrottler to the Datanode transfers

2019-08-25 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915466#comment-16915466
 ] 

Lisheng Sun commented on HDFS-12904:


I confirmed that the two UTs pass locally, so the UT failures are unrelated to 
this patch.

[~elgoiri] Could you help review the patch? Thank you.

> Add DataTransferThrottler to the Datanode transfers
> ---
>
> Key: HDFS-12904
> URL: https://issues.apache.org/jira/browse/HDFS-12904
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode
>Reporter: Íñigo Goiri
>Assignee: Lisheng Sun
>Priority: Minor
> Attachments: HDFS-12904.000.patch, HDFS-12904.001.patch, 
> HDFS-12904.002.patch, HDFS-12904.003.patch
>
>
> The {{DataXceiverServer}} already uses throttling for the balancing. The 
> Datanode should also allow throttling the regular data transfers.
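
For illustration, a self-contained sketch of the throttling idea (not the actual DataTransferThrottler implementation): cap the bytes sent per period and sleep once the period's budget is used up, with the transfer loop calling throttle() after each buffer it writes.

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class SimpleThrottlerSketch {
  private final long bytesPerPeriod;  // budget per period
  private final long periodMillis;    // period length
  private long periodStart;           // start of the current period
  private long bytesUsed;             // bytes already counted in this period

  public SimpleThrottlerSketch(long bandwidthBytesPerSec) {
    this.periodMillis = 500;
    this.bytesPerPeriod = Math.max(1, bandwidthBytesPerSec * periodMillis / 1000);
    this.periodStart = System.currentTimeMillis();
  }

  /** Call after sending numBytes; blocks when the period's budget is exhausted. */
  public synchronized void throttle(long numBytes) throws InterruptedException {
    bytesUsed += numBytes;
    while (bytesUsed >= bytesPerPeriod) {
      long elapsed = System.currentTimeMillis() - periodStart;
      if (elapsed < periodMillis) {
        Thread.sleep(periodMillis - elapsed);  // wait out the rest of the period
      }
      periodStart = System.currentTimeMillis();
      bytesUsed -= bytesPerPeriod;             // carry excess into the next period
    }
  }

  /** Example transfer loop paced by the throttler (buffer size is arbitrary). */
  static void copyThrottled(InputStream in, OutputStream out,
                            SimpleThrottlerSketch throttler)
      throws IOException, InterruptedException {
    byte[] buf = new byte[64 * 1024];
    int n;
    while ((n = in.read(buf)) > 0) {
      out.write(buf, 0, n);
      throttler.throttle(n);  // keep the transfer within the configured bandwidth
    }
  }
}
{code}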



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14748) Make DataNodePeerMetrics#minOutlierDetectionSamples configurable

2019-08-25 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915464#comment-16915464
 ] 

Lisheng Sun commented on HDFS-14748:


Hi [~xkrogen] [~jojochuang], could you help review this patch? Thank you.

> Make DataNodePeerMetrics#minOutlierDetectionSamples configurable
> 
>
> Key: HDFS-14748
> URL: https://issues.apache.org/jira/browse/HDFS-14748
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.9.0, 3.0.0-alpha4
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14748.001.patch, HDFS-14748.002.patch, 
> HDFS-14748.003.patch
>
>
> Slow node monitoring requires 1000 packets to be transferred between DataNodes 
> within three hours before a node is eligible to calculate and upload 
> transmission delays to the NameNode.
> But if the written data is very small and the number of packets is less than 
> 1000, the slow node will not be reported to the NameNode, so 
> DataNodePeerMetrics#minOutlierDetectionSamples should be made configurable.
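
A minimal sketch of how such a threshold is typically made configurable. The config key and default below are placeholders for illustration, not the actual names introduced by this patch; Configuration#getLong is the standard Hadoop accessor.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class PeerMetricsConfigSketch {
  // Hypothetical key and default, for illustration only.
  static final String MIN_OUTLIER_DETECTION_SAMPLES_KEY =
      "dfs.datanode.min.outlier.detection.samples";
  static final long MIN_OUTLIER_DETECTION_SAMPLES_DEFAULT = 1000L;

  private final long minOutlierDetectionSamples;

  public PeerMetricsConfigSketch(Configuration conf) {
    // Fall back to the previously hard-coded value (1000 packets) when unset.
    this.minOutlierDetectionSamples = conf.getLong(
        MIN_OUTLIER_DETECTION_SAMPLES_KEY,
        MIN_OUTLIER_DETECTION_SAMPLES_DEFAULT);
  }

  /** A peer is only considered for outlier detection once it has enough samples. */
  public boolean eligibleForOutlierDetection(long sampleCount) {
    return sampleCount >= minOutlierDetectionSamples;
  }
}
{code}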



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11246) FSNameSystem#logAuditEvent should be called outside the read or write locks

2019-08-25 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915460#comment-16915460
 ] 

Wei-Chiu Chuang commented on HDFS-11246:


+1

Anyone else like to review/comment? Otherwise I'll commit the v011 patch in a 
few days, before it goes stale again.

> FSNameSystem#logAuditEvent should be called outside the read or write locks
> ---
>
> Key: HDFS-11246
> URL: https://issues.apache.org/jira/browse/HDFS-11246
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Kuhu Shukla
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-11246.001.patch, HDFS-11246.002.patch, 
> HDFS-11246.003.patch, HDFS-11246.004.patch, HDFS-11246.005.patch, 
> HDFS-11246.006.patch, HDFS-11246.007.patch, HDFS-11246.008.patch, 
> HDFS-11246.009.patch, HDFS-11246.010.patch, HDFS-11246.011.patch
>
>
> {code}
> readLock();
> boolean success = true;
> ContentSummary cs;
> try {
>   checkOperation(OperationCategory.READ);
>   cs = FSDirStatAndListingOp.getContentSummary(dir, src);
> } catch (AccessControlException ace) {
>   success = false;
>   logAuditEvent(success, operationName, src);
>   throw ace;
> } finally {
>   readUnlock(operationName);
> }
> {code}
> It would be nice to have audit logging outside the lock esp. in scenarios 
> where applications hammer a given operation several times. 
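
A sketch of the reordering idea only (not the committed patch), mirroring the fragment quoted above: remember the outcome under the lock and emit the audit log only after the lock has been released.

{code:java}
readLock();
boolean success = true;
ContentSummary cs;
try {
  checkOperation(OperationCategory.READ);
  cs = FSDirStatAndListingOp.getContentSummary(dir, src);
} catch (AccessControlException ace) {
  success = false;
  throw ace;
} finally {
  readUnlock(operationName);
  if (!success) {
    // The audit logger now runs outside the read lock, so a slow or
    // synchronous logger no longer extends lock hold time.
    logAuditEvent(false, operationName, src);
  }
}
{code}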



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11246) FSNameSystem#logAuditEvent should be called outside the read or write locks

2019-08-25 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915452#comment-16915452
 ] 

He Xiaoqiao commented on HDFS-11246:


I checked the failed unit tests and it seems that most of them are related to 
OOM. I tried to run them locally and they all passed. Please help check if you 
have time. Thanks.

> FSNameSystem#logAuditEvent should be called outside the read or write locks
> ---
>
> Key: HDFS-11246
> URL: https://issues.apache.org/jira/browse/HDFS-11246
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Kuhu Shukla
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-11246.001.patch, HDFS-11246.002.patch, 
> HDFS-11246.003.patch, HDFS-11246.004.patch, HDFS-11246.005.patch, 
> HDFS-11246.006.patch, HDFS-11246.007.patch, HDFS-11246.008.patch, 
> HDFS-11246.009.patch, HDFS-11246.010.patch, HDFS-11246.011.patch
>
>
> {code}
> readLock();
> boolean success = true;
> ContentSummary cs;
> try {
>   checkOperation(OperationCategory.READ);
>   cs = FSDirStatAndListingOp.getContentSummary(dir, src);
> } catch (AccessControlException ace) {
>   success = false;
>   logAuditEvent(success, operationName, src);
>   throw ace;
> } finally {
>   readUnlock(operationName);
> }
> {code}
> It would be nice to have audit logging outside the lock esp. in scenarios 
> where applications hammer a given operation several times. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1930) Test Topology Aware Job scheduling with Ozone Topology

2019-08-25 Thread Sammi Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sammi Chen updated HDDS-1930:
-
Description: 
My initial results with Terasort do not seem to report the counters properly. 
Most of the requests are handled as rack local but none as node local. This 
ticket is opened to add more system testing to validate the feature. 

Total Allocated Containers: 3778
Each table cell represents the number of NodeLocal/RackLocal/OffSwitch 
containers satisfied by NodeLocal/RackLocal/OffSwitch resource requests.
                                          Node Local Request  Rack Local Request  Off Switch Request
Num Node Local Containers (satisfied by)          0
Num Rack Local Containers (satisfied by)          0                 3648
Num Off Switch Containers (satisfied by)          0                  96                  34

  was:
My initial results with Terasort does not seem to report the counter properly. 
Most of the requests are handled by rack locl but no node local. This ticket is 
opened to add more system testing to validate the feature. 

Total Allocated Containers: 3778
Each table cell represents the number of NodeLocal/RackLocal/OffSwitch 
containers satisfied by NodeLocal/RackLocal/OffSwitch resource requests.
Node Local Request  Rack Local Request  Off Switch Request
Num Node Local Containers (satisfied by)0   
Num Rack Local Containers (satisfied by)0   3648
Num Off Switch Containers (satisfied by)0   96  34


> Test Topology Aware Job scheduling with Ozone Topology
> --
>
> Key: HDDS-1930
> URL: https://issues.apache.org/jira/browse/HDDS-1930
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Xiaoyu Yao
>Priority: Major
>
> My initial results with Terasort do not seem to report the counters properly. 
> Most of the requests are handled as rack local but none as node local. This 
> ticket is opened to add more system testing to validate the feature. 
> Total Allocated Containers: 3778
> Each table cell represents the number of NodeLocal/RackLocal/OffSwitch 
> containers satisfied by NodeLocal/RackLocal/OffSwitch resource requests.
>                                           Node Local Request  Rack Local Request  Off Switch Request
> Num Node Local Containers (satisfied by)          0
> Num Rack Local Containers (satisfied by)          0                 3648
> Num Off Switch Containers (satisfied by)          0                  96                  34



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2031) Choose datanode for pipeline creation based on network topology

2019-08-25 Thread Sammi Chen (Jira)
Sammi Chen created HDDS-2031:


 Summary: Choose datanode for pipeline creation based on network 
topology
 Key: HDDS-2031
 URL: https://issues.apache.org/jira/browse/HDDS-2031
 Project: Hadoop Distributed Data Store
  Issue Type: Sub-task
Reporter: Sammi Chen
Assignee: Sammi Chen


There are regular heartbeats between the datanodes in a pipeline. Choose 
datanodes based on network topology to guarantee data reliability and to reduce 
the latency of heartbeat network traffic.
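
For illustration, a self-contained sketch of topology-aware selection (not SCM's actual placement code; the node model and cost values are assumptions): pick an anchor datanode and prefer the candidates closest to it in the network topology.

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopologyAwarePickSketch {

  /** Minimal node model: a datanode id plus its network location (e.g. /rack1). */
  static class Node {
    final String id;
    final String networkLocation;
    Node(String id, String networkLocation) {
      this.id = id;
      this.networkLocation = networkLocation;
    }
  }

  /** Simplified cost: 0 = same node, 2 = same rack, 4 = different racks. */
  static int distance(Node a, Node b) {
    if (a.id.equals(b.id)) {
      return 0;
    }
    return a.networkLocation.equals(b.networkLocation) ? 2 : 4;
  }

  /** Choose an anchor plus the (count - 1) healthy nodes closest to it. */
  static List<Node> choosePipelineNodes(List<Node> healthy, Node anchor, int count) {
    List<Node> candidates = new ArrayList<>(healthy);
    candidates.remove(anchor);
    candidates.sort(Comparator.comparingInt(n -> distance(anchor, n)));
    List<Node> chosen = new ArrayList<>();
    chosen.add(anchor);
    chosen.addAll(candidates.subList(0, Math.min(count - 1, candidates.size())));
    return chosen;
  }
}
{code}

With 3-node Ratis pipelines, this kind of selection keeps heartbeat traffic within a rack whenever enough healthy nodes are available on the same rack.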



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14568) setStoragePolicy should check quota and update consume on storage type quota.

2019-08-25 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-14568:
---
Attachment: HDFS-14568.004.patch
Status: Patch Available  (was: Open)

Re-uploaded patch-003 as patch-004 to trigger Jenkins.

> setStoragePolicy should check quota and update consume on storage type quota.
> -
>
> Key: HDFS-14568
> URL: https://issues.apache.org/jira/browse/HDFS-14568
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.1.0
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-14568-001.patch, HDFS-14568-unit-test.patch, 
> HDFS-14568.002.patch, HDFS-14568.003.patch, HDFS-14568.004.patch
>
>
> The quota and consume of the file's ancestors are not handled when the 
> storage policy of the file is changed. For example:
>  1. Set quota StorageType.SSD fileSpace-1 on the parent dir;
>  2. Create a file of size fileSpace with storage policy \{DISK,DISK,DISK} 
> under it;
>  3. Change the storage policy of the file to ALLSSD_STORAGE_POLICY_NAME and 
> expect a QuotaByStorageTypeExceededException.
> Because the quota and consume are not handled, the expected exception is not 
> thrown.
>  
> There are 3 reasons why we should handle the consume and the quota.
> 1. Replication uses the new storage policy. Consider a file with BlockType 
> CONTIGUOUS whose replication factor is 3 and whose storage policy is "HOT". 
> Now we change the policy to "ONE_SSD". If a DN goes down and the file needs 
> replication, the NN will choose storages according to policy "ONE_SSD" and 
> replicate the block to an SSD storage.
> 2. We actually have a cluster storing both HOT and COLD data. We have a 
> background process searching all the files to find those that have not been 
> accessed for a period of time. Then we set them to COLD and start a mover to 
> move the replicas. After moving, all the replicas are consistent with the 
> storage policy.
> 3. The NameNode manages the global state of the cluster. If there is any 
> inconsistent situation, such as the replicas not matching the storage policy 
> of the file, we should take the NameNode as the standard and make the cluster 
> match the NameNode. Block replication is a good example of this rule: when we 
> count the consume of a CONTIGUOUS file, we multiply the replication factor by 
> the file's length, no matter whether the file is under-replicated or has 
> excess replicas. The same applies to the storage type quota and consume.
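
For illustration, a sketch of the quota bookkeeping argued for above, with simplified types (not the NameNode's QuotaCounts/DirectoryWithQuotaFeature code): compute the per-storage-type consume delta between the old and the new policy, verify it against the type quota, and only then update the recorded consume.

{code:java}
import java.util.EnumMap;
import java.util.Map;

public class StorageTypeQuotaSketch {

  enum StorageType { DISK, SSD, ARCHIVE }

  /** consume(type) = fileLength * number of replicas the policy places on type. */
  static Map<StorageType, Long> consume(long fileLength, StorageType[] policy) {
    Map<StorageType, Long> c = new EnumMap<>(StorageType.class);
    for (StorageType t : policy) {
      c.merge(t, fileLength, Long::sum);
    }
    return c;
  }

  /** Throws if switching from oldPolicy to newPolicy would exceed a type quota. */
  static void verifyAndUpdate(long fileLength,
                              StorageType[] oldPolicy, StorageType[] newPolicy,
                              Map<StorageType, Long> quota,
                              Map<StorageType, Long> used) {
    Map<StorageType, Long> before = consume(fileLength, oldPolicy);
    Map<StorageType, Long> after = consume(fileLength, newPolicy);
    // First pass: verify every storage type stays within its quota.
    for (StorageType t : StorageType.values()) {
      long delta = after.getOrDefault(t, 0L) - before.getOrDefault(t, 0L);
      Long limit = quota.get(t);
      if (limit != null && used.getOrDefault(t, 0L) + delta > limit) {
        throw new IllegalStateException("Quota by storage type " + t
            + " exceeded after policy change");
      }
    }
    // Second pass: only after all checks pass, update the recorded consume.
    for (StorageType t : StorageType.values()) {
      long delta = after.getOrDefault(t, 0L) - before.getOrDefault(t, 0L);
      if (delta != 0) {
        used.merge(t, delta, Long::sum);
      }
    }
  }
}
{code}

In the example above, an SSD quota of fileSpace-1 with a 3-replica \{DISK,DISK,DISK} file switched to an all-SSD policy would fail this check, producing the expected QuotaByStorageTypeExceededException-style error.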



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14568) setStoragePolicy should check quota and update consume on storage type quota.

2019-08-25 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-14568:
---
Status: Open  (was: Patch Available)

> setStoragePolicy should check quota and update consume on storage type quota.
> -
>
> Key: HDFS-14568
> URL: https://issues.apache.org/jira/browse/HDFS-14568
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.1.0
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-14568-001.patch, HDFS-14568-unit-test.patch, 
> HDFS-14568.002.patch, HDFS-14568.003.patch
>
>
> The quota and consume of the file's ancestors are not handled when the 
> storage policy of the file is changed. For example:
>  1. Set quota StorageType.SSD fileSpace-1 on the parent dir;
>  2. Create a file of size fileSpace with storage policy \{DISK,DISK,DISK} 
> under it;
>  3. Change the storage policy of the file to ALLSSD_STORAGE_POLICY_NAME and 
> expect a QuotaByStorageTypeExceededException.
> Because the quota and consume are not handled, the expected exception is not 
> thrown.
>  
> There are 3 reasons why we should handle the consume and the quota.
> 1. Replication uses the new storage policy. Consider a file with BlockType 
> CONTIGUOUS whose replication factor is 3 and whose storage policy is "HOT". 
> Now we change the policy to "ONE_SSD". If a DN goes down and the file needs 
> replication, the NN will choose storages according to policy "ONE_SSD" and 
> replicate the block to an SSD storage.
> 2. We actually have a cluster storing both HOT and COLD data. We have a 
> background process searching all the files to find those that have not been 
> accessed for a period of time. Then we set them to COLD and start a mover to 
> move the replicas. After moving, all the replicas are consistent with the 
> storage policy.
> 3. The NameNode manages the global state of the cluster. If there is any 
> inconsistent situation, such as the replicas not matching the storage policy 
> of the file, we should take the NameNode as the standard and make the cluster 
> match the NameNode. Block replication is a good example of this rule: when we 
> count the consume of a CONTIGUOUS file, we multiply the replication factor by 
> the file's length, no matter whether the file is under-replicated or has 
> excess replicas. The same applies to the storage type quota and consume.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2013) Add flag gdprEnabled for BucketInfo in OzoneManager proto

2019-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2013?focusedWorklogId=300985=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-300985
 ]

ASF GitHub Bot logged work on HDDS-2013:


Author: ASF GitHub Bot
Created on: 26/Aug/19 02:45
Start Date: 26/Aug/19 02:45
Worklog Time Spent: 10m 
  Work Description: dineshchitlangia commented on issue #1345: HDDS-2013. 
Add flag gdprEnabled for BucketInfo in OzoneManager proto
URL: https://github.com/apache/hadoop/pull/1345#issuecomment-524695251
 
 
   Thanks @bharatviswa504 for review. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 300985)
Time Spent: 0.5h  (was: 20m)

> Add flag gdprEnabled for BucketInfo in OzoneManager proto
> -
>
> Key: HDDS-2013
> URL: https://issues.apache.org/jira/browse/HDDS-2013
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Dinesh Chitlangia
>Assignee: Dinesh Chitlangia
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14633) The StorageType quota and consume in QuotaFeature is not handled when rename and setStoragePolicy etc.

2019-08-25 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-14633:
---
Attachment: HDFS-14633.007.patch

> The StorageType quota and consume in QuotaFeature is not handled when rename 
> and setStoragePolicy etc. 
> ---
>
> Key: HDFS-14633
> URL: https://issues.apache.org/jira/browse/HDFS-14633
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-14633-testcases-explanation, HDFS-14633.002.patch, 
> HDFS-14633.003.patch, HDFS-14633.004.patch, HDFS-14633.005.patch, 
> HDFS-14633.006.patch, HDFS-14633.007.patch
>
>
> The NameNode manages the global state of the cluster. We should always take 
> the NameNode's records as the sole criterion because, no matter what 
> inconsistency arises, the NameNode should finally make everything right based 
> on its records. Let's call it rule NSC (NameNode is the Sole Criterion). That 
> means when we handle quota-related RPCs, we do the quota check according to 
> the NameNode's records regardless of any inconsistent situation, such as the 
> replicas not matching the storage policy of the file, or the replica count 
> not matching the file's set replication.
>  The SPS work deals with the wrongly placed replicas. There is a thought 
> about putting off the consume update of the DirectoryQuota until all replicas 
> are re-placed by SPS. I can't agree with that, because if we do so we abandon 
> letting the NameNode's records be the sole criterion. Block replication is a 
> good example of rule NSC: when we count the consume of a CONTIGUOUS file, we 
> multiply the replication factor by the file's length, no matter whether the 
> blocks are under-replicated or have excess replicas. We should do the same 
> for the storage type quota.
>  Another concern is that the change will let setStoragePolicy throw 
> QuotaByStorageTypeExceededException, which it didn't before. I don't think 
> it's a big problem since setStoragePolicy already throws IOException. We 
> could wrap the QuotaByStorageTypeExceededException in an IOException, but I 
> wouldn't recommend that because it's ugly.
>  To make the storage type consume follow rule NSC, we need to change rename 
> (moving a file whose storage policy is inherited from its parent) and 
> setStoragePolicy. 
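
A small sketch of the rule-NSC accounting described above (hypothetical helpers, not FSDirectory code): consume is derived purely from the NameNode's records, i.e. file length, replication factor and recorded storage policy, regardless of where the replicas currently sit.

{code:java}
public class NscConsumeSketch {

  /** Disk space consume of a CONTIGUOUS file: length * replication factor. */
  static long diskSpaceConsumed(long fileLength, short replication) {
    return fileLength * replication;
  }

  /**
   * Space charged against one storage type: length * number of replicas the
   * file's recorded policy places on that type, even if some replicas are
   * currently mis-placed and still waiting to be moved (e.g. by SPS).
   */
  static long storageTypeConsumed(long fileLength, String[] policyTypes, String type) {
    long replicasOnType = 0;
    for (String t : policyTypes) {
      if (t.equals(type)) {
        replicasOnType++;
      }
    }
    return fileLength * replicasOnType;
  }
}
{code}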



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14633) The StorageType quota and consume in QuotaFeature is not handled when rename and setStoragePolicy etc.

2019-08-25 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915427#comment-16915427
 ] 

Jinglun commented on HDFS-14633:


Uploaded patch-007 to fix the unit test failures.

> The StorageType quota and consume in QuotaFeature is not handled when rename 
> and setStoragePolicy etc. 
> ---
>
> Key: HDFS-14633
> URL: https://issues.apache.org/jira/browse/HDFS-14633
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-14633-testcases-explanation, HDFS-14633.002.patch, 
> HDFS-14633.003.patch, HDFS-14633.004.patch, HDFS-14633.005.patch, 
> HDFS-14633.006.patch, HDFS-14633.007.patch
>
>
> The NameNode manages the global state of the cluster. We should always take 
> the NameNode's records as the sole criterion because, no matter what 
> inconsistency arises, the NameNode should finally make everything right based 
> on its records. Let's call it rule NSC (NameNode is the Sole Criterion). That 
> means when we handle quota-related RPCs, we do the quota check according to 
> the NameNode's records regardless of any inconsistent situation, such as the 
> replicas not matching the storage policy of the file, or the replica count 
> not matching the file's set replication.
>  The SPS work deals with the wrongly placed replicas. There is a thought 
> about putting off the consume update of the DirectoryQuota until all replicas 
> are re-placed by SPS. I can't agree with that, because if we do so we abandon 
> letting the NameNode's records be the sole criterion. Block replication is a 
> good example of rule NSC: when we count the consume of a CONTIGUOUS file, we 
> multiply the replication factor by the file's length, no matter whether the 
> blocks are under-replicated or have excess replicas. We should do the same 
> for the storage type quota.
>  Another concern is that the change will let setStoragePolicy throw 
> QuotaByStorageTypeExceededException, which it didn't before. I don't think 
> it's a big problem since setStoragePolicy already throws IOException. We 
> could wrap the QuotaByStorageTypeExceededException in an IOException, but I 
> wouldn't recommend that because it's ugly.
>  To make the storage type consume follow rule NSC, we need to change rename 
> (moving a file whose storage policy is inherited from its parent) and 
> setStoragePolicy. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2013) Add flag gdprEnabled for BucketInfo in OzoneManager proto

2019-08-25 Thread Dinesh Chitlangia (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dinesh Chitlangia updated HDDS-2013:

Status: Patch Available  (was: Open)

> Add flag gdprEnabled for BucketInfo in OzoneManager proto
> -
>
> Key: HDDS-2013
> URL: https://issues.apache.org/jira/browse/HDDS-2013
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Dinesh Chitlangia
>Assignee: Dinesh Chitlangia
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corrupted when they are reconstructed.

2019-08-25 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh updated HDFS-14768:
-
Summary: In some cases, erasure blocks are corrupted when they are 
reconstructed.  (was: In some cases, erasure blocks are corrupted when they are 
rebuilt.)

> In some cases, erasure blocks are corrupted when they are reconstructed.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: HDFS-14768.000.patch
>
>
> The policy is RS-6-3-1024K and the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8], we decommission 
> indices [3,4], and we increase the index-6 datanode's 
> pendingReplicationWithoutTargets so that it becomes larger than 
> replicationStreamsHardLimit (we set 14). Then, after BlockManager's 
> chooseSourceDatanodes, the liveBlockIndices are [0,1,2,3,4,5,7,8] and the 
> block counters are Live: 7, Decommissioning: 2.
> In BlockManager's scheduleReconstruction, additionalReplRequired is 
> 9 - 7 = 2. After the NameNode chooses two target datanodes, it will assign 
> an erasure coding reconstruction task to the target datanode.
> When the datanode gets the task, it builds targetIndices from 
> liveBlockIndices and the target length. The code is below.
> {code:java}
> targetIndices = new short[targets.length];
> 
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] always stays 0 from its initial 
> value.
> The StripedReader always creates readers from the first 6 block indices, i.e. 
> [0,1,2,3,4,5].
> Using indices [0,1,2,3,4,5] to build target indices [6,0] will trigger the 
> ISA-L bug: block index 6's data is corrupted (all data is zero).
> I wrote a unit test that can reproduce this reliably.
> {code:java}
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   // 
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   for (int i = 0; i < 100; i++) {
> datanodeDescriptor.incrementPendingReplicationWithoutTargets();
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   //assertNull(checkFile(dfs, ecFile, 9, decommisionNodes, numDNs));
>   // Ensure decommissioned datanode is not automatically shutdown
>   DFSClient client = getDfsClient(cluster.getNameNode(0), conf);
>   assertEquals("All datanodes must be alive", numDNs,
>   client.datanodeReport(DatanodeReportType.LIVE).length);
>   FileChecksum fileChecksum2 = dfs.getFileChecksum(ecFile, writeBytes);
>   Assert.assertTrue("Checksum mismatches!",
>   fileChecksum1.equals(fileChecksum2));
>   StripedFileTestUtil.checkData(dfs, ecFile, writeBytes, decommisionNodes,
>   null, blockGroupSize);
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14646) Standby NameNode should not upload fsimage to an inappropriate NameNode.

2019-08-25 Thread Xudong Cao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915412#comment-16915412
 ] 

Xudong Cao edited comment on HDFS-14646 at 8/26/19 2:08 AM:


Thanks [~csun], our production environment uses Hadoop version 2.7.2, but with 
the multi-SBN feature merged. We have about 20,000 machines (divided into 
dozens of HDFS clusters). Not long ago, we frequently encountered the problem 
described by this jira, and after this patch was merged, the errors are no 
longer reported.


was (Author: xudongcao):
Thanks [~csun] , our production environment uses Hadoop version 2.7.2, but 
merged multi-SBN feature. We have about 20,000 machines (divided into dozens of 
HDFS clusters).  Not long ago, we frequently encountered the problem described 
by the jira.  and after this patch was merged, the errors are no longer 
reported.

> Standby NameNode should not upload fsimage to an inappropriate NameNode.
> 
>
> Key: HDFS-14646
> URL: https://issues.apache.org/jira/browse/HDFS-14646
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.2
>Reporter: Xudong Cao
>Assignee: Xudong Cao
>Priority: Major
> Attachments: HDFS-14646.000.patch, HDFS-14646.001.patch, 
> HDFS-14646.002.patch, HDFS-14646.003.patch
>
>
> *Problem Description:*
>  In the multi-NameNode scenario, when a SNN uploads a FsImage, it will put 
> the image to all other NNs (whether the peer NN is an ANN or not), and even 
> if the peer NN immediately replies an error (such as 
> TransferResult.NOT_ACTIVE_NAMENODE_FAILURE, TransferResult 
> .OLD_TRANSACTION_ID_FAILURE, etc.), the local SNN will not terminate the put 
> process immediately, but will put the FsImage completely to the peer NN, and 
> will not read the peer NN's reply until the put is completed.
> Depending on the version of Jetty, this behavior can lead to different 
> consequences, I tested it under 2.7.2 and trunk version. 
> *1. In Hadoop 2.7.2 (with Jetty 6.1.26)*
>  After the peer NN calls HttpServletResponse.sendError(), the underlying TCP 
> connection will still be established, and the data the SNN sends will be read 
> by the Jetty framework itself on the peer NN side, so the SNN will pointlessly 
> keep sending the FsImage to the peer NN, wasting time and bandwidth. In a 
> relatively large HDFS cluster, the FsImage can often reach about 30GB, which 
> is indeed a big waste.
> *2. In the trunk version (with Jetty 9.3.27)*
>  After the peer NN calls HttpServletResponse.sendError(), the underlying TCP 
> connection will be automatically closed, and the SNN will then directly get an 
> "Error writing request body to server" exception, as below; note this test 
> needs a relatively big FSImage (e.g. 10MB level):
> {code:java}
> 2019-08-17 03:59:25,413 INFO namenode.TransferFsImage: Sending fileName: 
> /tmp/hadoop-root/dfs/name/current/fsimage_3364240, fileSize: 
> 9864721. Sent total: 524288 bytes. Size of last segment intended to send: 
> 4096 bytes.
>  java.io.IOException: Error writing request body to server
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:396)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:340)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:314)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:249)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:277)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:272)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
>  2019-08-17 03:59:25,422 INFO namenode.TransferFsImage: Sending fileName: 
> /tmp/hadoop-root/dfs/name/current/fsimage_3364240, fileSize: 
> 9864721. Sent total: 851968 bytes. Size of last segment intended to send: 
> 4096 bytes.
>  java.io.IOException: Error writing request body to server
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
>  at 
> 

[jira] [Comment Edited] (HDFS-14646) Standby NameNode should not upload fsimage to an inappropriate NameNode.

2019-08-25 Thread Xudong Cao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915412#comment-16915412
 ] 

Xudong Cao edited comment on HDFS-14646 at 8/26/19 2:07 AM:


Thanks [~csun] , our production environment uses Hadoop version 2.7.2, but 
merged multi-SBN feature. We have about 20,000 machines (divided into dozens of 
HDFS clusters).  Not long ago, we frequently encountered the problem described 
by the jira.  and after this patch was merged, the errors are no longer 
reported.


was (Author: xudongcao):
Thanks [~csun] , our production environment uses Hadoop version 2.7.2, but 
merged multi-SBN feature. We have about 20,000 machines (divided into dozens of 
clusters).  Not long ago, we frequently encountered the problem described by 
the jira.  and after this patch was merged, the errors are no longer reported.

> Standby NameNode should not upload fsimage to an inappropriate NameNode.
> 
>
> Key: HDFS-14646
> URL: https://issues.apache.org/jira/browse/HDFS-14646
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.2
>Reporter: Xudong Cao
>Assignee: Xudong Cao
>Priority: Major
> Attachments: HDFS-14646.000.patch, HDFS-14646.001.patch, 
> HDFS-14646.002.patch, HDFS-14646.003.patch
>
>
> *Problem Description:*
>  In the multi-NameNode scenario, when a SNN uploads a FsImage, it will put 
> the image to all other NNs (whether the peer NN is an ANN or not), and even 
> if the peer NN immediately replies an error (such as 
> TransferResult.NOT_ACTIVE_NAMENODE_FAILURE, TransferResult 
> .OLD_TRANSACTION_ID_FAILURE, etc.), the local SNN will not terminate the put 
> process immediately, but will put the FsImage completely to the peer NN, and 
> will not read the peer NN's reply until the put is completed.
> Depending on the version of Jetty, this behavior can lead to different 
> consequences, I tested it under 2.7.2 and trunk version. 
> *1. In Hadoop 2.7.2 (with Jetty 6.1.26)*
>  After the peer NN calls HttpServletResponse.sendError(), the underlying TCP 
> connection will still be established, and the data the SNN sends will be read 
> by the Jetty framework itself on the peer NN side, so the SNN will pointlessly 
> keep sending the FsImage to the peer NN, wasting time and bandwidth. In a 
> relatively large HDFS cluster, the FsImage can often reach about 30GB, which 
> is indeed a big waste.
> *2. In the trunk version (with Jetty 9.3.27)*
>  After the peer NN calls HttpServletResponse.sendError(), the underlying TCP 
> connection will be automatically closed, and the SNN will then directly get an 
> "Error writing request body to server" exception, as below; note this test 
> needs a relatively big FSImage (e.g. 10MB level):
> {code:java}
> 2019-08-17 03:59:25,413 INFO namenode.TransferFsImage: Sending fileName: 
> /tmp/hadoop-root/dfs/name/current/fsimage_3364240, fileSize: 
> 9864721. Sent total: 524288 bytes. Size of last segment intended to send: 
> 4096 bytes.
>  java.io.IOException: Error writing request body to server
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:396)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:340)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:314)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:249)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:277)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:272)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
>  2019-08-17 03:59:25,422 INFO namenode.TransferFsImage: Sending fileName: 
> /tmp/hadoop-root/dfs/name/current/fsimage_3364240, fileSize: 
> 9864721. Sent total: 851968 bytes. Size of last segment intended to send: 
> 4096 bytes.
>  java.io.IOException: Error writing request body to server
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
>  at 
> 

[jira] [Comment Edited] (HDFS-14646) Standby NameNode should not upload fsimage to an inappropriate NameNode.

2019-08-25 Thread Xudong Cao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915412#comment-16915412
 ] 

Xudong Cao edited comment on HDFS-14646 at 8/26/19 1:58 AM:


Thanks [~csun] , our production environment uses Hadoop version 2.7.2, but 
merged multi-SBN feature. We have about 20,000 machines (divided into dozens of 
clusters).  Not long ago, we frequently encountered the problem described by 
the jira.  and after this patch was merged, the errors are no longer reported.


was (Author: xudongcao):
Thanks [~csun] , our production environment uses Hadoop version 2.7.2, but 
merged multi-SBN feature. We have about 20,000 machines (divided into dozens of 
clusters),  not long ago, we frequently encountered the problem described by 
the jira.  and after this patch was merged, the errors are no longer reported.

> Standby NameNode should not upload fsimage to an inappropriate NameNode.
> 
>
> Key: HDFS-14646
> URL: https://issues.apache.org/jira/browse/HDFS-14646
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.2
>Reporter: Xudong Cao
>Assignee: Xudong Cao
>Priority: Major
> Attachments: HDFS-14646.000.patch, HDFS-14646.001.patch, 
> HDFS-14646.002.patch, HDFS-14646.003.patch
>
>
> *Problem Description:*
>  In the multi-NameNode scenario, when a SNN uploads a FsImage, it will put 
> the image to all other NNs (whether the peer NN is an ANN or not), and even 
> if the peer NN immediately replies an error (such as 
> TransferResult.NOT_ACTIVE_NAMENODE_FAILURE, TransferResult 
> .OLD_TRANSACTION_ID_FAILURE, etc.), the local SNN will not terminate the put 
> process immediately, but will put the FsImage completely to the peer NN, and 
> will not read the peer NN's reply until the put is completed.
> Depending on the version of Jetty, this behavior can lead to different 
> consequences, I tested it under 2.7.2 and trunk version. 
> *1. In Hadoop 2.7.2 (with Jetty 6.1.26)*
>  After the peer NN calls HttpServletResponse.sendError(), the underlying TCP 
> connection will still be established, and the data the SNN sends will be read 
> by the Jetty framework itself on the peer NN side, so the SNN will pointlessly 
> keep sending the FsImage to the peer NN, wasting time and bandwidth. In a 
> relatively large HDFS cluster, the FsImage can often reach about 30GB, which 
> is indeed a big waste.
> *2. In the trunk version (with Jetty 9.3.27)*
>  After the peer NN calls HttpServletResponse.sendError(), the underlying TCP 
> connection will be automatically closed, and the SNN will then directly get an 
> "Error writing request body to server" exception, as below; note this test 
> needs a relatively big FSImage (e.g. 10MB level):
> {code:java}
> 2019-08-17 03:59:25,413 INFO namenode.TransferFsImage: Sending fileName: 
> /tmp/hadoop-root/dfs/name/current/fsimage_3364240, fileSize: 
> 9864721. Sent total: 524288 bytes. Size of last segment intended to send: 
> 4096 bytes.
>  java.io.IOException: Error writing request body to server
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:396)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:340)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:314)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:249)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:277)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:272)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
>  2019-08-17 03:59:25,422 INFO namenode.TransferFsImage: Sending fileName: 
> /tmp/hadoop-root/dfs/name/current/fsimage_3364240, fileSize: 
> 9864721. Sent total: 851968 bytes. Size of last segment intended to send: 
> 4096 bytes.
>  java.io.IOException: Error writing request body to server
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
>  

[jira] [Commented] (HDFS-14646) Standby NameNode should not upload fsimage to an inappropriate NameNode.

2019-08-25 Thread Xudong Cao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915412#comment-16915412
 ] 

Xudong Cao commented on HDFS-14646:
---

Thanks [~csun] , our production environment uses Hadoop version 2.7.2, but 
merged multi-SBN feature. We have about 20,000 machines (divided into dozens of 
clusters),  not long ago, we frequently encountered the problem described by 
the jira.  and after this patch was merged, the errors are no longer reported.

> Standby NameNode should not upload fsimage to an inappropriate NameNode.
> 
>
> Key: HDFS-14646
> URL: https://issues.apache.org/jira/browse/HDFS-14646
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.2
>Reporter: Xudong Cao
>Assignee: Xudong Cao
>Priority: Major
> Attachments: HDFS-14646.000.patch, HDFS-14646.001.patch, 
> HDFS-14646.002.patch, HDFS-14646.003.patch
>
>
> *Problem Description:*
>  In the multi-NameNode scenario, when a SNN uploads a FsImage, it will put 
> the image to all other NNs (whether the peer NN is an ANN or not), and even 
> if the peer NN immediately replies an error (such as 
> TransferResult.NOT_ACTIVE_NAMENODE_FAILURE, TransferResult 
> .OLD_TRANSACTION_ID_FAILURE, etc.), the local SNN will not terminate the put 
> process immediately, but will put the FsImage completely to the peer NN, and 
> will not read the peer NN's reply until the put is completed.
> Depending on the version of Jetty, this behavior can lead to different 
> consequences, I tested it under 2.7.2 and trunk version. 
> *1. In Hadoop 2.7.2 (with Jetty 6.1.26)*
>  After the peer NN calls HttpServletResponse.sendError(), the underlying TCP 
> connection will still be established, and the data the SNN sends will be read 
> by the Jetty framework itself on the peer NN side, so the SNN will pointlessly 
> keep sending the FsImage to the peer NN, wasting time and bandwidth. In a 
> relatively large HDFS cluster, the FsImage can often reach about 30GB, which 
> is indeed a big waste.
> *2. In the trunk version (with Jetty 9.3.27)*
>  After the peer NN calls HttpServletResponse.sendError(), the underlying TCP 
> connection will be automatically closed, and the SNN will then directly get an 
> "Error writing request body to server" exception, as below; note this test 
> needs a relatively big FSImage (e.g. 10MB level):
> {code:java}
> 2019-08-17 03:59:25,413 INFO namenode.TransferFsImage: Sending fileName: 
> /tmp/hadoop-root/dfs/name/current/fsimage_3364240, fileSize: 
> 9864721. Sent total: 524288 bytes. Size of last segment intended to send: 
> 4096 bytes.
>  java.io.IOException: Error writing request body to server
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:396)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:340)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:314)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:249)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:277)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:272)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
>  2019-08-17 03:59:25,422 INFO namenode.TransferFsImage: Sending fileName: 
> /tmp/hadoop-root/dfs/name/current/fsimage_3364240, fileSize: 
> 9864721. Sent total: 851968 bytes. Size of last segment intended to send: 
> 4096 bytes.
>  java.io.IOException: Error writing request body to server
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:396)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:340)
>   {code}
>                   
> *Solution:*
>  A standby NameNode should not upload fsimage to an inappropriate NameNode, 
> when he plans to put a FsImage to the peer NN, 

[jira] [Commented] (HDFS-14745) Backport HDFS persistent memory read cache support to branch-3.1

2019-08-25 Thread Feilong He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915404#comment-16915404
 ] 

Feilong He commented on HDFS-14745:
---

Thanks [~rakeshr] for your comment. I have renamed the patch. Yes, backports to 
the other branch-3.x lines will be done one by one.

> Backport HDFS persistent memory read cache support to branch-3.1
> 
>
> Key: HDFS-14745
> URL: https://issues.apache.org/jira/browse/HDFS-14745
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Feilong He
>Assignee: Feilong He
>Priority: Major
>  Labels: cache, datanode
> Fix For: 3.3.0
>
> Attachments: HDFS-14745-branch-3.1-000.patch
>
>
> We are proposing to backport the patches for HDFS-13762, HDFS persistent 
> memory read cache support, to branch-3.1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14745) Backport HDFS persistent memory read cache support to branch-3.1

2019-08-25 Thread Feilong He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feilong He updated HDFS-14745:
--
Attachment: (was: HDFS-14745.002.patch)

> Backport HDFS persistent memory read cache support to branch-3.1
> 
>
> Key: HDFS-14745
> URL: https://issues.apache.org/jira/browse/HDFS-14745
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Feilong He
>Assignee: Feilong He
>Priority: Major
>  Labels: cache, datanode
> Fix For: 3.3.0
>
> Attachments: HDFS-14745-branch-3.1-000.patch
>
>
> We are proposing to backport the patches for HDFS-13762, HDFS persistent 
> memory read cache support, to branch-3.1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14745) Backport HDFS persistent memory read cache support to branch-3.1

2019-08-25 Thread Feilong He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feilong He updated HDFS-14745:
--
Attachment: (was: HDFS-14745-branch-3.1.002.patch)

> Backport HDFS persistent memory read cache support to branch-3.1
> 
>
> Key: HDFS-14745
> URL: https://issues.apache.org/jira/browse/HDFS-14745
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Feilong He
>Assignee: Feilong He
>Priority: Major
>  Labels: cache, datanode
> Fix For: 3.3.0
>
> Attachments: HDFS-14745-branch-3.1-000.patch
>
>
> We are proposing to backport the patches for HDFS-13762, HDFS persistent 
> memory read cache support, to branch-3.1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14745) Backport HDFS persistent memory read cache support to branch-3.1

2019-08-25 Thread Feilong He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feilong He updated HDFS-14745:
--
Attachment: (was: HDFS-14745.000.patch)

> Backport HDFS persistent memory read cache support to branch-3.1
> 
>
> Key: HDFS-14745
> URL: https://issues.apache.org/jira/browse/HDFS-14745
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Feilong He
>Assignee: Feilong He
>Priority: Major
>  Labels: cache, datanode
> Fix For: 3.3.0
>
> Attachments: HDFS-14745-branch-3.1-000.patch
>
>
> We are proposing to backport the patches for HDFS-13762, HDFS persistent 
> memory read cache support, to branch-3.1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14745) Backport HDFS persistent memory read cache support to branch-3.1

2019-08-25 Thread Feilong He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feilong He updated HDFS-14745:
--
Attachment: HDFS-14745-branch-3.1-000.patch

> Backport HDFS persistent memory read cache support to branch-3.1
> 
>
> Key: HDFS-14745
> URL: https://issues.apache.org/jira/browse/HDFS-14745
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Feilong He
>Assignee: Feilong He
>Priority: Major
>  Labels: cache, datanode
> Fix For: 3.3.0
>
> Attachments: HDFS-14745-branch-3.1-000.patch
>
>
> We are proposing to backport the patches for HDFS-13762, HDFS persistent 
> memory read cache support, to branch-3.1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14745) Backport HDFS persistent memory read cache support to branch-3.1

2019-08-25 Thread Feilong He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feilong He updated HDFS-14745:
--
Attachment: (was: HDFS-14745.001.patch)

> Backport HDFS persistent memory read cache support to branch-3.1
> 
>
> Key: HDFS-14745
> URL: https://issues.apache.org/jira/browse/HDFS-14745
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Feilong He
>Assignee: Feilong He
>Priority: Major
>  Labels: cache, datanode
> Fix For: 3.3.0
>
> Attachments: HDFS-14745-branch-3.1-000.patch
>
>
> We are proposing to backport the patches for HDFS-13762, HDFS persistent 
> memory read cache support, to branch-3.1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14745) Backport HDFS persistent memory read cache support to branch-3.1

2019-08-25 Thread Feilong He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feilong He updated HDFS-14745:
--
Attachment: HDFS-14745-branch-3.1.002.patch

> Backport HDFS persistent memory read cache support to branch-3.1
> 
>
> Key: HDFS-14745
> URL: https://issues.apache.org/jira/browse/HDFS-14745
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Feilong He
>Assignee: Feilong He
>Priority: Major
>  Labels: cache, datanode
> Fix For: 3.3.0
>
> Attachments: HDFS-14745-branch-3.1.002.patch, HDFS-14745.000.patch, 
> HDFS-14745.001.patch, HDFS-14745.002.patch
>
>
> We are proposing to backport the patches for HDFS-13762, HDFS persistent 
> memory read cache support, to branch-3.1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13220) Change lastCheckpointTime to use fsimage mostRecentCheckpointTime

2019-08-25 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915401#comment-16915401
 ] 

Hadoop QA commented on HDFS-13220:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 12m  
7s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 28m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
19s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 54s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
1s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
10s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 2 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 29s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 56m 11s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 
36s{color} | {color:red} The patch generated 7 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}142m 48s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestFileCreation |
|   | hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup |
|   | hadoop.hdfs.TestBlockTokenWrappingQOP |
|   | hadoop.hdfs.TestBlocksScheduledCounter |
|   | hadoop.hdfs.TestMaintenanceState |
|   | hadoop.hdfs.TestDFSStripedInputStream |
|   | hadoop.hdfs.server.balancer.TestBalancerService |
|   | hadoop.hdfs.server.datanode.TestDataNodeInitStorage |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure |
|   | hadoop.hdfs.server.datanode.TestBPOfferService |
|   | hadoop.hdfs.server.datanode.TestDataNodeMetrics |
|   | hadoop.hdfs.TestErasureCodingExerciseAPIs |
|   | hadoop.hdfs.TestPread |
|   | hadoop.hdfs.TestErasureCodingPolicies |
|   | hadoop.hdfs.TestErasureCodingPoliciesWithRandomECPolicy |
|   | hadoop.hdfs.TestHDFSFileSystemContract |
|   | hadoop.hdfs.server.balancer.TestBalancerWithSaslDataTransfer |
|   | hadoop.hdfs.TestFileChecksumCompositeCrc |
|   | hadoop.hdfs.server.balancer.TestBalancerRPCDelay |
|   | hadoop.hdfs.TestQuota |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 

[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are rebuilt.

2019-08-25 Thread guojh (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915395#comment-16915395
 ] 

guojh commented on HDFS-14768:
--

[~drankye] Can we talk about this issue?

> In some cases, erasure blocks are corruption  when they are rebuilt.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: HDFS-14768.000.patch
>
>
> The policy is RS-6-3-1024K and the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8], we decommission 
> indices [3,4], and we increase the index-6 DataNode's 
> pendingReplicationWithoutTargets so that it is larger than 
> replicationStreamsHardLimit (we set 14). Then, after the chooseSourceDatanodes 
> method of BlockManager, liveBlockIndices is [0,1,2,3,4,5,7,8] and the block 
> counter is Live: 7, Decommission: 2.
> In the scheduleReconstruction method of BlockManager, additionalReplRequired 
> is 9 - 7 = 2. After the NameNode chooses two target DataNodes, it will assign 
> an erasure coding reconstruction task to the target DataNode.
> When the DataNode gets the task, it builds targetIndices from liveBlockIndices 
> and the target length. The code is below.
> {code:java}
> targetIndices = new short[targets.length];
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] stays 0 from its initial value.
> The StripedReader always creates readers from the first 6 indexed blocks, 
> i.e. [0,1,2,3,4,5].
> Using indices [0,1,2,3,4,5] to rebuild target indices [6,0] triggers the ISA-L 
> bug: block index 6's data is corrupted (all of its data is zero).
> I wrote a unit test that can stably reproduce it.
> {code:java}
> // code placeholder
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   // 
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   for (int i = 0; i < 100; i++) {
> datanodeDescriptor.incrementPendingReplicationWithoutTargets();
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   //assertNull(checkFile(dfs, ecFile, 9, decommisionNodes, numDNs));
>   // Ensure decommissioned datanode is not automatically shutdown
>   DFSClient client = getDfsClient(cluster.getNameNode(0), conf);
>   assertEquals("All datanodes must be alive", numDNs,
>   client.datanodeReport(DatanodeReportType.LIVE).length);
>   FileChecksum fileChecksum2 = dfs.getFileChecksum(ecFile, writeBytes);
>   Assert.assertTrue("Checksum mismatches!",
>   fileChecksum1.equals(fileChecksum2));
>   StripedFileTestUtil.checkData(dfs, ecFile, writeBytes, decommisionNodes,
>   null, blockGroupSize);
> }
> {code}
>  
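For illustration only, here is a minimal, self-contained sketch of how the target indices could be computed so that a stale default value of 0 never leaks into the reconstruction targets. This is an editorial sketch, not the actual HDFS fix; the class, method and parameter names are made up for the example.

{code:java}
import java.util.Arrays;
import java.util.BitSet;

/**
 * Illustrative helper (not the real HDFS code): compute the indices of the
 * internal blocks that really need reconstruction, returning only as many
 * entries as were actually found instead of leaving stale zeros behind.
 */
public final class TargetIndexHelper {

  private TargetIndexHelper() { }

  static short[] computeTargetIndices(BitSet liveBitSet, long[] blockLens,
      int maxTargets) {
    short[] indices = new short[maxTargets];
    int m = 0;
    for (int i = 0; i < blockLens.length && m < maxTargets; i++) {
      // Only blocks that are missing from the live set and have data to
      // rebuild are valid reconstruction targets.
      if (!liveBitSet.get(i) && blockLens[i] > 0) {
        indices[m++] = (short) i;
      }
    }
    // Trim to the valid entries so callers never see a bogus default 0.
    return Arrays.copyOf(indices, m);
  }

  public static void main(String[] args) {
    BitSet live = new BitSet();
    for (int i : new int[] {0, 1, 2, 3, 4, 5, 7, 8}) {
      live.set(i);
    }
    long[] lens = new long[9];
    Arrays.fill(lens, 1L);
    // Only index 6 is really missing, so asking for 2 targets yields [6]
    // rather than [6, 0] as in the buggy initialization described above.
    System.out.println(Arrays.toString(computeTargetIndices(live, lens, 2)));
  }
}
{code}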



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are rebuilt.

2019-08-25 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh updated HDFS-14768:
-
Attachment: (was: HDFS-14768.000.patch)

> In some cases, erasure blocks are corruption  when they are rebuilt.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: HDFS-14768.000.patch
>
>
> The policy is RS-6-3-1024K and the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8], we decommission 
> indices [3,4], and we increase the index-6 DataNode's 
> pendingReplicationWithoutTargets so that it is larger than 
> replicationStreamsHardLimit (we set 14). Then, after the chooseSourceDatanodes 
> method of BlockManager, liveBlockIndices is [0,1,2,3,4,5,7,8] and the block 
> counter is Live: 7, Decommission: 2.
> In the scheduleReconstruction method of BlockManager, additionalReplRequired 
> is 9 - 7 = 2. After the NameNode chooses two target DataNodes, it will assign 
> an erasure coding reconstruction task to the target DataNode.
> When the DataNode gets the task, it builds targetIndices from liveBlockIndices 
> and the target length. The code is below.
> {code:java}
> targetIndices = new short[targets.length];
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] stays 0 from its initial value.
> The StripedReader always creates readers from the first 6 indexed blocks, 
> i.e. [0,1,2,3,4,5].
> Using indices [0,1,2,3,4,5] to rebuild target indices [6,0] triggers the ISA-L 
> bug: block index 6's data is corrupted (all of its data is zero).
> I wrote a unit test that can stably reproduce it.
> {code:java}
> // code placeholder
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   // 
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   for (int i = 0; i < 100; i++) {
> datanodeDescriptor.incrementPendingReplicationWithoutTargets();
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   //assertNull(checkFile(dfs, ecFile, 9, decommisionNodes, numDNs));
>   // Ensure decommissioned datanode is not automatically shutdown
>   DFSClient client = getDfsClient(cluster.getNameNode(0), conf);
>   assertEquals("All datanodes must be alive", numDNs,
>   client.datanodeReport(DatanodeReportType.LIVE).length);
>   FileChecksum fileChecksum2 = dfs.getFileChecksum(ecFile, writeBytes);
>   Assert.assertTrue("Checksum mismatches!",
>   fileChecksum1.equals(fileChecksum2));
>   StripedFileTestUtil.checkData(dfs, ecFile, writeBytes, decommisionNodes,
>   null, blockGroupSize);
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are rebuilt.

2019-08-25 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh updated HDFS-14768:
-
  Attachment: HDFS-14768.000.patch
   Fix Version/s: 3.3.0
Target Version/s: 3.3.0
  Labels: patch  (was: )
  Status: Patch Available  (was: Open)

fixed the block swell and build bug

> In some cases, erasure blocks are corruption  when they are rebuilt.
> 
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Priority: Major
>  Labels: patch
> Fix For: 3.3.0
>
> Attachments: HDFS-14768.000.patch, HDFS-14768.000.patch
>
>
> The policy is RS-6-3-1024K and the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8], we decommission 
> indices [3,4], and we increase the index-6 DataNode's 
> pendingReplicationWithoutTargets so that it is larger than 
> replicationStreamsHardLimit (we set 14). Then, after the chooseSourceDatanodes 
> method of BlockManager, liveBlockIndices is [0,1,2,3,4,5,7,8] and the block 
> counter is Live: 7, Decommission: 2.
> In the scheduleReconstruction method of BlockManager, additionalReplRequired 
> is 9 - 7 = 2. After the NameNode chooses two target DataNodes, it will assign 
> an erasure coding reconstruction task to the target DataNode.
> When the DataNode gets the task, it builds targetIndices from liveBlockIndices 
> and the target length. The code is below.
> {code:java}
> targetIndices = new short[targets.length];
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] stays 0 from its initial value.
> The StripedReader always creates readers from the first 6 indexed blocks, 
> i.e. [0,1,2,3,4,5].
> Using indices [0,1,2,3,4,5] to rebuild target indices [6,0] triggers the ISA-L 
> bug: block index 6's data is corrupted (all of its data is zero).
> I wrote a unit test that can stably reproduce it.
> {code:java}
> // code placeholder
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   // 
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   for (int i = 0; i < 100; i++) {
> datanodeDescriptor.incrementPendingReplicationWithoutTargets();
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   //assertNull(checkFile(dfs, ecFile, 9, decommisionNodes, numDNs));
>   // Ensure decommissioned datanode is not automatically shutdown
>   DFSClient client = getDfsClient(cluster.getNameNode(0), conf);
>   assertEquals("All datanodes must be alive", numDNs,
>   client.datanodeReport(DatanodeReportType.LIVE).length);
>   FileChecksum fileChecksum2 = dfs.getFileChecksum(ecFile, writeBytes);
>   Assert.assertTrue("Checksum mismatches!",
>   fileChecksum1.equals(fileChecksum2));
>   StripedFileTestUtil.checkData(dfs, ecFile, writeBytes, decommisionNodes,
>   null, blockGroupSize);
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: 

[jira] [Commented] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-08-25 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915364#comment-16915364
 ] 

Hadoop QA commented on HDFS-14771:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  9m 
47s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} branch-2 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  9m 
19s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
52s{color} | {color:green} branch-2 passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
50s{color} | {color:green} branch-2 passed with JDK v1.8.0_222 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
35s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
56s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
4s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
7s{color} | {color:green} branch-2 passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
45s{color} | {color:green} branch-2 passed with JDK v1.8.0_222 {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
48s{color} | {color:green} the patch passed with JDK v1.8.0_222 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
48s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 34s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 3 new + 535 unchanged - 3 fixed = 538 total (was 538) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
4s{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed with JDK v1.8.0_222 {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 61m 31s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
29s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 98m  9s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.datanode.TestDirectoryScanner |
|   | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:da675796017 |
| JIRA Issue | HDFS-14771 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12978529/HDFS-14771.branch-2.001.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  xml  |
| uname | Linux 0d8425cc7cb9 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 
10:55:24 UTC 

[jira] [Commented] (HDFS-11246) FSNameSystem#logAuditEvent should be called outside the read or write locks

2019-08-25 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915355#comment-16915355
 ] 

Hadoop QA commented on HDFS-11246:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 10m 
26s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 
10s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
10s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 23s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
54s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
54s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
37s{color} | {color:green} hadoop-hdfs-project/hadoop-hdfs: The patch generated 
0 new + 164 unchanged - 2 fixed = 164 total (was 166) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 16s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 92m  5s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
36s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}161m 45s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestEncryptionZonesWithKMS |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailureWithRandomECPolicy |
|   | hadoop.hdfs.TestDFSStripedInputStreamWithRandomECPolicy |
|   | hadoop.hdfs.tools.TestDFSAdmin |
|   | hadoop.hdfs.server.namenode.TestNamenodeRetryCache |
|   | hadoop.hdfs.tools.TestECAdmin |
|   | hadoop.hdfs.TestFileChecksum |
|   | hadoop.hdfs.server.balancer.TestBalancerRPCDelay |
|   | hadoop.hdfs.TestMultipleNNPortQOP |
|   | hadoop.hdfs.server.namenode.TestFSEditLogLoader |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:bdbca0e53b4 |
| JIRA Issue | HDFS-11246 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12978525/HDFS-11246.011.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 177306f0b57a 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 

[jira] [Commented] (HDFS-12904) Add DataTransferThrottler to the Datanode transfers

2019-08-25 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915350#comment-16915350
 ] 

Hadoop QA commented on HDFS-12904:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 10m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 
39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m  3s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
51s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m  7s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 86m 45s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
32s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}157m 20s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.snapshot.TestSnapshot |
|   | hadoop.hdfs.server.namenode.TestBlockPlacementPolicyRackFaultTolerant |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:bdbca0e53b4 |
| JIRA Issue | HDFS-12904 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12978524/HDFS-12904.003.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 88d2a2940b18 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / d2225c8 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/27666/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/27666/testReport/ |
| Max. process+thread count | 3945 (vs. ulimit of 5500) |
| modules | C: 

[jira] [Resolved] (HDFS-14776) Log more detail for slow RPC

2019-08-25 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang resolved HDFS-14776.
---
Resolution: Abandoned

> Log more detail for slow RPC
> 
>
> Key: HDFS-14776
> URL: https://issues.apache.org/jira/browse/HDFS-14776
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Priority: Major
>
> The current implementation only logs the processing time:
> {code:java}
> if ((rpcMetrics.getProcessingSampleCount() > minSampleSize) &&
> (processingTime > threeSigma)) {
>   LOG.warn("Slow RPC : {} took {} {} to process from client {}",
>   methodName, processingTime, RpcMetrics.TIMEUNIT, call);
>   rpcMetrics.incrSlowRpc();
> }
> {code}
> We need to log more details to help us locate the problem (e.g. how long it 
> takes to acquire the lock, hold the lock, or do other work).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14776) Log more detail for slow RPC

2019-08-25 Thread Chen Zhang (Jira)
Chen Zhang created HDFS-14776:
-

 Summary: Log more detail for slow RPC
 Key: HDFS-14776
 URL: https://issues.apache.org/jira/browse/HDFS-14776
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Chen Zhang


The current implementation only logs the processing time:
{code:java}
if ((rpcMetrics.getProcessingSampleCount() > minSampleSize) &&
(processingTime > threeSigma)) {
  LOG.warn("Slow RPC : {} took {} {} to process from client {}",
  methodName, processingTime, RpcMetrics.TIMEUNIT, call);
  rpcMetrics.incrSlowRpc();
}
{code}
We need to log more details to help us locate the problem (e.g. how long it 
takes to acquire the lock, hold the lock, or do other work).
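As a rough illustration of the kind of breakdown asked for here, the sketch below accumulates per-phase timings for a single call and includes them in the slow-RPC warning. It is only a sketch under the assumption that the caller marks each phase explicitly; the class and method names are hypothetical and are not part of Hadoop's RPC code.

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Illustrative only: accumulate per-phase timings for one call so a slow-RPC
 * warning can say where the time went, not just the total.
 */
class CallPhaseTimer {
  private static final Logger LOG = LoggerFactory.getLogger(CallPhaseTimer.class);

  private final Map<String, Long> phaseNanos = new LinkedHashMap<>();
  private final long startNanos = System.nanoTime();
  private long lastMark = startNanos;

  /** Record the time elapsed since the previous mark under the given label. */
  void mark(String phase) {
    long now = System.nanoTime();
    phaseNanos.merge(phase, now - lastMark, Long::sum);
    lastMark = now;
  }

  /** Emit one warning line with the total and the per-phase breakdown (in ns). */
  void logIfSlow(String methodName, long thresholdMillis) {
    long totalMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    if (totalMillis > thresholdMillis) {
      LOG.warn("Slow RPC: {} took {} ms, breakdown (ns): {}",
          methodName, totalMillis, phaseNanos);
    }
  }
}
{code}

Typical usage would be to call {{mark("waitForLock")}} right after the lock is acquired, {{mark("holdLock")}} right after it is released, and so on, then call {{logIfSlow}} at the end of the handler.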



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14775) Add Timestamp for longest FSN write/read lock held log

2019-08-25 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915284#comment-16915284
 ] 

Chen Zhang commented on HDFS-14775:
---

cc [~xkrogen] and [~linyiqun].

> Add Timestamp for longest FSN write/read lock held log
> --
>
> Key: HDFS-14775
> URL: https://issues.apache.org/jira/browse/HDFS-14775
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
>
> HDFS-13946 improved the log for the longest read/write lock held time; it's a 
> very useful improvement.
> In some conditions, we need to locate the detailed call information (user, IP, 
> path, etc.) for the longest lock holder, but the default throttle interval 
> (10s) is too long to find the corresponding audit log entry. I think we should 
> add the timestamp for the {{longestWriteLockHeldStackTrace}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14775) Add Timestamp for longest FSN write/read lock held log

2019-08-25 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14775:
--
Description: 
HDFS-13946 improved the log for the longest read/write lock held time; it's a 
very useful improvement.

In some conditions, we need to locate the detailed call information (user, IP, 
path, etc.) for the longest lock holder, but the default throttle interval (10s) 
is too long to find the corresponding audit log entry. I think we should add the 
timestamp for the {{longestWriteLockHeldStackTrace}}.

  was:
HDFS-13946 improved the log for the longest read/write lock held time; it's a 
very useful improvement.

In some conditions, we need to locate the detailed call information (user, IP, 
path, etc.) for the longest lock holder, but the default throttle interval (10s) 
is too long to find the corresponding audit log entry. We need to add the 
timestamp for the {{longestWriteLockHeldStackTrace}}.


> Add Timestamp for longest FSN write/read lock held log
> --
>
> Key: HDFS-14775
> URL: https://issues.apache.org/jira/browse/HDFS-14775
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
>
> HDFS-13946 improved the log for the longest read/write lock held time; it's a 
> very useful improvement.
> In some conditions, we need to locate the detailed call information (user, IP, 
> path, etc.) for the longest lock holder, but the default throttle interval 
> (10s) is too long to find the corresponding audit log entry. I think we should 
> add the timestamp for the {{longestWriteLockHeldStackTrace}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-14775) Add Timestamp for longest FSN write/read lock held log

2019-08-25 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang reassigned HDFS-14775:
-

Assignee: Chen Zhang

> Add Timestamp for longest FSN write/read lock held log
> --
>
> Key: HDFS-14775
> URL: https://issues.apache.org/jira/browse/HDFS-14775
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
>
> HDFS-13946 improved the log for the longest read/write lock held time; it's a 
> very useful improvement.
> In some conditions, we need to locate the detailed call information (user, IP, 
> path, etc.) for the longest lock holder, but the default throttle interval 
> (10s) is too long to find the corresponding audit log entry. We need to add the 
> timestamp for the {{longestWriteLockHeldStackTrace}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14775) Add Timestamp for longest FSN write/read lock held log

2019-08-25 Thread Chen Zhang (Jira)
Chen Zhang created HDFS-14775:
-

 Summary: Add Timestamp for longest FSN write/read lock held log
 Key: HDFS-14775
 URL: https://issues.apache.org/jira/browse/HDFS-14775
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Chen Zhang


HDFS-13946 improved the log for the longest read/write lock held time; it's a 
very useful improvement.

In some conditions, we need to locate the detailed call information (user, IP, 
path, etc.) for the longest lock holder, but the default throttle interval (10s) 
is too long to find the corresponding audit log entry. We need to add the 
timestamp for the {{longestWriteLockHeldStackTrace}}.
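A minimal sketch of the idea, assuming the lock instrumentation calls a recorder when a hold completes; the class and field names below are illustrative and are not the actual FSNamesystemLock code.

{code:java}
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

/**
 * Illustrative only: remember when the longest lock hold happened so the
 * periodic report can be correlated with audit-log entries.
 */
class LongestLockHoldInfo {
  private static final DateTimeFormatter FMT = DateTimeFormatter
      .ofPattern("yyyy-MM-dd HH:mm:ss,SSS").withZone(ZoneId.systemDefault());

  private long longestHeldIntervalMs;
  private long longestHeldTimestampMs;   // wall-clock time when the hold ended
  private String longestHeldStackTrace = "";

  /** Called by the lock instrumentation each time a hold completes. */
  synchronized void record(long heldIntervalMs, String stackTrace) {
    if (heldIntervalMs > longestHeldIntervalMs) {
      longestHeldIntervalMs = heldIntervalMs;
      longestHeldTimestampMs = System.currentTimeMillis();
      longestHeldStackTrace = stackTrace;
    }
  }

  /** Produce the throttled report line, now including the timestamp. */
  synchronized String report() {
    return String.format("Longest write-lock held for %d ms at %s:%n%s",
        longestHeldIntervalMs,
        FMT.format(Instant.ofEpochMilli(longestHeldTimestampMs)),
        longestHeldStackTrace);
  }
}
{code}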



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-08-25 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915281#comment-16915281
 ] 

Lisheng Sun commented on HDFS-14648:


Thanks [~zhangchen] for the good suggestion. I will upload a design doc that 
describes this patch in detail later.

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch
>
>
> This Jira constructs the DeadNodeDetector state machine model. The functions it 
> implements are as follows:
>  # When a DFSInputStream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to be inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode. HDFS-14649 will optimize this part, because when a 
> DataNode is not accessible it is likely that the replica has been removed from 
> the DataNode; therefore it needs to be confirmed by re-probing and requires 
> higher-priority processing.
>  # DeadNodeDetector will periodically probe the nodes in 
> DeadNodeDetector#deadnode. If the access is successful, the node will be 
> removed from DeadNodeDetector#deadnode. Continuous detection of dead nodes is 
> necessary, since a DataNode may rejoin the cluster after a service 
> restart/machine repair; without such a probe mechanism the DataNode could be 
> permanently excluded.
>  # DeadNodeDetector#dfsInputStreamNodes records which DataNodes each 
> DFSInputStream is using. When a DFSInputStream is closed, it is removed from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time the global deadnode set is fetched, DeadNodeDetector#deadnode is 
> updated. The new DeadNodeDetector#deadnode equals the intersection of the old 
> DeadNodeDetector#deadnode and the DataNodes referenced by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is off, 
> each DFSInputStream still uses its own local deadnode set.
>  # This feature has been used in the XIAOMI production environment for a long 
> time. It reduced HBase read stalls caused by node hangs.
>  # Just turn on the DeadNodeDetector switch and you can use it directly; there 
> are no other restrictions. If you don't want to use DeadNodeDetector, just 
> turn it off.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}
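For readers unfamiliar with the model, the following conceptual sketch shows the core probe loop described above: a shared dead-node set plus a background thread that re-probes suspected nodes and removes the ones that answer again. It is not the HDFS-14648 implementation; the probe callback and names are assumptions made for the example.

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

/**
 * Conceptual sketch only (not the HDFS-14648 code): a shared set of suspected
 * dead nodes and a loop that periodically re-probes them so a node that comes
 * back after a restart is removed from the set again.
 */
class SimpleDeadNodeDetector implements Runnable {
  private final Set<String> deadNodes = ConcurrentHashMap.newKeySet();
  private final Predicate<String> probe;   // returns true if the node is reachable
  private final long probeIntervalMs;

  SimpleDeadNodeDetector(Predicate<String> probe, long probeIntervalMs) {
    this.probe = probe;
    this.probeIntervalMs = probeIntervalMs;
  }

  /** Called by a stream when a DataNode looks inaccessible. */
  void markDead(String node) { deadNodes.add(node); }

  /** Shared view used by all streams of the same client. */
  Set<String> getDeadNodes() { return deadNodes; }

  @Override
  public void run() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        for (String node : deadNodes) {
          if (probe.test(node)) {
            // The node answered again: let streams use it once more.
            deadNodes.remove(node);
          }
        }
        Thread.sleep(probeIntervalMs);
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();   // shut down quietly
    }
  }
}
{code}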



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-08-25 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915273#comment-16915273
 ] 

He Xiaoqiao commented on HDFS-14771:


Thanks [~linyiqun] for your feedback. [^HDFS-14771.branch-2.001.patch] is a 
demo patch that is exactly the same as the patch merged to trunk, with no other 
changes.
{quote}I prefer to convert this JIRA to an independent JIRA and add a link to 
HDFS-14617, since HDFS-14617 has been done and closed.{quote}
+1. Considering this feature is not completely ready yet (for example, the OIV 
support mentioned in HDFS-14617), I think we may need another über-JIRA? Thanks 
again.

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14771.branch-2.001.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: fsimage load time by 
> writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-08-25 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915262#comment-16915262
 ] 

He Xiaoqiao commented on HDFS-14648:


Thanks [~leosun08] for your contributions. I went through 
[^HDFS-14648.004.patch] and have some minor comments:
1. There are some checkstyle issues reported by Jenkins above; please take a look.
2. It is better to group configuration items of the same module together in 
hdfs-default.xml so that related items are easy to find; I suggest defining 
`dfs.client.deadnode.detect.enabled` together with the other `dfs.client.*` items.
3. Some method names are open to different interpretations. For instance, for 
`DeadNodeDetector#removeNodeFromDetect` my first impression was that it removes 
the node from `deadNodes`, which is shared by all DFSInputStreams in the same 
DFSClient; however, it actually just removes the node from `localNodes`, which 
is only visible to a single DFSInputStream. Right?
4. About DeadNodeDetector: I do not understand why different STATEs are defined 
for `DeadNodeDetector`. Will they be used by the following implementations?
5. About DeadNodeDetector#run: does it need to catch InterruptedException 
outside the while loop and return?
6. Some constants such as '5000'/'1' are used without an explanation. I think we 
should define these constants at the beginning of the class and add some 
annotation.
7. IIUC, once some nodes are detected as dead, they should not be in the next 
pipeline, right? But I do not see anywhere that these deadNodes are added to 
`excludeNodes`.
Thanks [~leosun08] again.

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch
>
>
> This Jira constructs the DeadNodeDetector state machine model. The functions it 
> implements are as follows:
>  # When a DFSInputStream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to be inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode. HDFS-14649 will optimize this part, because when a 
> DataNode is not accessible it is likely that the replica has been removed from 
> the DataNode; therefore it needs to be confirmed by re-probing and requires 
> higher-priority processing.
>  # DeadNodeDetector will periodically probe the nodes in 
> DeadNodeDetector#deadnode. If the access is successful, the node will be 
> removed from DeadNodeDetector#deadnode. Continuous detection of dead nodes is 
> necessary, since a DataNode may rejoin the cluster after a service 
> restart/machine repair; without such a probe mechanism the DataNode could be 
> permanently excluded.
>  # DeadNodeDetector#dfsInputStreamNodes records which DataNodes each 
> DFSInputStream is using. When a DFSInputStream is closed, it is removed from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time the global deadnode set is fetched, DeadNodeDetector#deadnode is 
> updated. The new DeadNodeDetector#deadnode equals the intersection of the old 
> DeadNodeDetector#deadnode and the DataNodes referenced by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is off, 
> each DFSInputStream still uses its own local deadnode set.
>  # This feature has been used in the XIAOMI production environment for a long 
> time. It reduced HBase read stalls caused by node hangs.
>  # Just turn on the DeadNodeDetector switch and you can use it directly; there 
> are no other restrictions. If you don't want to use DeadNodeDetector, just 
> turn it off.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-08-25 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915252#comment-16915252
 ] 

Yiqun Lin commented on HDFS-14771:
--

Thanks for backporting this to branch-2, [~hexiaoqiao]. It's a great 
improvement. I took a deep look at the logic in trunk but have two comments 
before reviewing the branch-2 patch in detail:
 * For the branch-2 patch, are there any other major changes compared with the 
trunk patch?
 * I prefer to convert this JIRA to an independent JIRA and add a link to 
HDFS-14617, since HDFS-14617 has been done and closed.

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14771.branch-2.001.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: fsimage load time by 
> writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13220) Change lastCheckpointTime to use fsimage mostRecentCheckpointTime

2019-08-25 Thread hemanthboyina (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-13220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hemanthboyina updated HDFS-13220:
-
Attachment: HDFS-13220.patch
Status: Patch Available  (was: Open)

> Change lastCheckpointTime to use fsimage mostRecentCheckpointTime
> -
>
> Key: HDFS-13220
> URL: https://issues.apache.org/jira/browse/HDFS-13220
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Nie Gus
>Assignee: hemanthboyina
>Priority: Minor
> Attachments: HDFS-13220.patch
>
>
> We found that our standby NN did not do the checkpoint and the checkpoint 
> alert kept firing; we use the JMX last checkpoint time and 
> dfs.namenode.checkpoint.period for the monitoring check.
>  
> We then checked the code and logs and found that the standby NN uses 
> monotonicNow, not the fsimage checkpoint time, so when the standby NN restarts 
> or switches to Active, the lastCheckpointTime in doWork is reset. So there is a 
> risk that a standby NN restart or a standby-to-active switch will delay the 
> checkpoint.
>  StandbyCheckpointer.java
> {code:java}
> private void doWork() {
>   final long checkPeriod = 1000 * checkpointConf.getCheckPeriod();
>   // Reset checkpoint time so that we don't always checkpoint
>   // on startup.
>   lastCheckpointTime = monotonicNow();
>   while (shouldRun) {
>     boolean needRollbackCheckpoint = namesystem.isNeedRollbackFsImage();
>     if (!needRollbackCheckpoint) {
>       try {
>         Thread.sleep(checkPeriod);
>       } catch (InterruptedException ie) {
>       }
>       if (!shouldRun) {
>         break;
>       }
>     }
>     try {
>       // We may have lost our ticket since last checkpoint, log in again, just in case
>       if (UserGroupInformation.isSecurityEnabled()) {
>         UserGroupInformation.getCurrentUser().checkTGTAndReloginFromKeytab();
>       }
>       final long now = monotonicNow();
>       final long uncheckpointed = countUncheckpointedTxns();
>       final long secsSinceLast = (now - lastCheckpointTime) / 1000;
>       boolean needCheckpoint = needRollbackCheckpoint;
>       if (needCheckpoint) {
>         LOG.info("Triggering a rollback fsimage for rolling upgrade.");
>       } else if (uncheckpointed >= checkpointConf.getTxnCount()) {
>         LOG.info("Triggering checkpoint because there have been " +
>             uncheckpointed + " txns since the last checkpoint, which " +
>             "exceeds the configured threshold " +
>             checkpointConf.getTxnCount());
>         needCheckpoint = true;
>       } else if (secsSinceLast >= checkpointConf.getPeriod()) {
>         LOG.info("Triggering checkpoint because it has been " +
>             secsSinceLast + " seconds since the last checkpoint, which " +
>             "exceeds the configured interval " + checkpointConf.getPeriod());
>         needCheckpoint = true;
>       }
>       synchronized (cancelLock) {
>         if (now < preventCheckpointsUntil) {
>           LOG.info("But skipping this checkpoint since we are about to failover!");
>           canceledCount++;
>           continue;
>         }
>         assert canceler == null;
>         canceler = new Canceler();
>       }
>       if (needCheckpoint) {
>         doCheckpoint();
>         // reset needRollbackCheckpoint to false only when we finish a ckpt
>         // for rollback image
>         if (needRollbackCheckpoint
>             && namesystem.getFSImage().hasRollbackFSImage()) {
>           namesystem.setCreatedRollbackImages(true);
>           namesystem.setNeedRollbackFsImage(false);
>         }
>         lastCheckpointTime = now;
>       }
>     } catch (SaveNamespaceCancelledException ce) {
>       LOG.info("Checkpoint was cancelled: " + ce.getMessage());
>       canceledCount++;
>     } catch (InterruptedException ie) {
>       LOG.info("Interrupted during checkpointing", ie);
>       // Probably requested shutdown.
>       continue;
>     } catch (Throwable t) {
>       LOG.error("Exception in doCheckpoint", t);
>     } finally {
>       synchronized (cancelLock) {
>         canceler = null;
>       }
>     }
>   }
> }
> {code}
>  
> Can we use the fsimage's mostRecentCheckpointTime to do the check?
>  
> thanks,
> Gus
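
A minimal, self-contained sketch of the idea above (illustrative only, not the actual StandbyCheckpointer code): seed lastCheckpointTime from the checkpoint time persisted with the fsimage instead of monotonicNow(), so that a restart or failover does not silently push the next checkpoint out by a full period. The mostRecentCheckpointTimeMs parameter and the clock supplier are assumptions made for this example.

{code:java}
import java.util.function.LongSupplier;

/** Toy model of the checkpoint scheduling decision discussed above. */
public class CheckpointScheduleSketch {

  private final long checkpointPeriodMs;
  private final LongSupplier clockMs;          // e.g. System::currentTimeMillis
  private long lastCheckpointTimeMs;

  /**
   * @param mostRecentCheckpointTimeMs checkpoint time persisted with the
   *        fsimage (hypothetical accessor); used instead of "now" so a
   *        restart or failover keeps the original schedule.
   */
  public CheckpointScheduleSketch(long checkpointPeriodMs,
                                  long mostRecentCheckpointTimeMs,
                                  LongSupplier clockMs) {
    this.checkpointPeriodMs = checkpointPeriodMs;
    this.clockMs = clockMs;
    this.lastCheckpointTimeMs = mostRecentCheckpointTimeMs;
  }

  /** True if the configured period has elapsed since the last real checkpoint. */
  public boolean needPeriodicCheckpoint() {
    return clockMs.getAsLong() - lastCheckpointTimeMs >= checkpointPeriodMs;
  }

  public void markCheckpointDone() {
    lastCheckpointTimeMs = clockMs.getAsLong();
  }

  public static void main(String[] args) {
    // Image was checkpointed 2 hours ago, period is 1 hour: after a restart the
    // standby should checkpoint immediately instead of waiting another hour.
    long now = System.currentTimeMillis();
    CheckpointScheduleSketch s = new CheckpointScheduleSketch(
        3_600_000L, now - 2 * 3_600_000L, System::currentTimeMillis);
    System.out.println("need checkpoint right after restart? " + s.needPeriodicCheckpoint());
  }
}
{code}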



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Reopened] (HDFS-14497) Write lock held by metasave impact following RPC processing

2019-08-25 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao reopened HDFS-14497:


> Write lock held by metasave impact following RPC processing
> ---
>
> Key: HDFS-14497
> URL: https://issues.apache.org/jira/browse/HDFS-14497
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14497-addendum.001.patch, HDFS-14497.001.patch
>
>
> NameNode metasave currently holds the global write lock, so subsequent RPC 
> read/write requests or internal NameNode threads can be paused if they try to 
> acquire the global read/write lock and have to wait until metasave releases it.
> I propose changing the write lock to a read lock so that some read requests can 
> be processed normally. I think allowing read requests would not change the 
> information that metasave tries to collect.
> Actually, we need to ensure that only one thread executes metaSave at a time; 
> otherwise, the output streams could hit exceptions, especially when both streams 
> hold the same file handle or share the same output stream.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14497) Write lock held by metasave impact following RPC processing

2019-08-25 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915230#comment-16915230
 ] 

He Xiaoqiao edited comment on HDFS-14497 at 8/25/19 1:45 PM:
-

Thanks [~jojochuang],
{quote}suggests metaSaveLock should be a final object{quote}
It makes sense to me. I reopened this JIRA and submitted 
[^HDFS-14497-addendum.001.patch] to change metaSaveLock to a final object. 
Please help review it.


was (Author: hexiaoqiao):
Thanks [~jojochuang],
{quote}suggests metaSaveLock should be a final object{quote}
it makes sense to me.  [^HDFS-14497-addendum.001.patch] try to change 
metaSaveLock to be a final object. Please help to take reviews.

> Write lock held by metasave impact following RPC processing
> ---
>
> Key: HDFS-14497
> URL: https://issues.apache.org/jira/browse/HDFS-14497
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14497-addendum.001.patch, HDFS-14497.001.patch
>
>
> NameNode metasave currently holds the global write lock, so subsequent RPC 
> read/write requests or internal NameNode threads can be paused if they try to 
> acquire the global read/write lock and have to wait until metasave releases it.
> I propose changing the write lock to a read lock so that some read requests can 
> be processed normally. I think allowing read requests would not change the 
> information that metasave tries to collect.
> Actually, we need to ensure that only one thread executes metaSave at a time; 
> otherwise, the output streams could hit exceptions, especially when both streams 
> hold the same file handle or share the same output stream.
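
A self-contained sketch of the locking scheme discussed in this issue and the addendum (illustrative, not the real FSNamesystem code): metasave runs under the namespace read lock so normal reads keep flowing, and a separate final metaSaveLock object ensures only one metaSave executes at a time. Class, field, and output names are assumptions for the example.

{code:java}
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Self-contained sketch of the locking scheme described above. */
public class MetaSaveLockSketch {

  private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);
  // Final lock object (per the review comment) so only one metaSave runs at a
  // time and concurrent callers cannot interleave writes to the same stream.
  private final Object metaSaveLock = new Object();

  public String metaSave() {
    synchronized (metaSaveLock) {            // serialize metaSave callers
      fsLock.readLock().lock();              // read lock instead of write lock,
      try {                                  // so normal reads keep flowing
        StringWriter sw = new StringWriter();
        PrintWriter out = new PrintWriter(sw);
        out.println("Live datanodes: ...");  // dump of read-only state
        out.flush();
        return sw.toString();
      } finally {
        fsLock.readLock().unlock();
      }
    }
  }

  public static void main(String[] args) {
    System.out.print(new MetaSaveLockSketch().metaSave());
  }
}
{code}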



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-08-25 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14771:
---
Attachment: HDFS-14771.branch-2.001.patch
Status: Patch Available  (was: Open)

Submitted a demo patch following HDFS-14617; pending what Jenkins says.

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14771.branch-2.001.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: improve fsimage load time by 
> writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14758) Decrease lease hard limit

2019-08-25 Thread hemanthboyina (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915244#comment-16915244
 ] 

hemanthboyina commented on HDFS-14758:
--

Yes [~zhangchen], even decreasing the hard limit doesn't solve it in all 
scenarios, as [~jojochuang] mentioned: _there can be network partitions or the 
client may simply crash (an NN crash doesn't lose state). HDFS-14694 doesn't 
address all failure scenarios._

So I feel it is better to have both, which covers all the failure scenarios.

> Decrease lease hard limit
> -
>
> Key: HDFS-14758
> URL: https://issues.apache.org/jira/browse/HDFS-14758
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: hemanthboyina
>Priority: Minor
>
> The hard limit is currently hard-coded to be 1 hour. This also determines the 
> NN automatic lease recovery interval. Something like 20 min will make more 
> sense.
> After the 5 min soft limit, other clients can recover the lease. If no one 
> else takes the lease away, the original client can still renew the lease 
> within the hard limit. So even after an NN full GC of 8 minutes, leases can 
> still be valid.
> However, there is one risk in reducing the hard limit. E.g. Reduced to 20 
> min. If the NN crashes and the manual failover takes more than 20 minutes, 
> clients will abort.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14762) "Path(Path/String parent, String child)" will fail when "child" contains ":"

2019-08-25 Thread hemanthboyina (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915242#comment-16915242
 ] 

hemanthboyina commented on HDFS-14762:
--

[~zsxwing], should we then handle the exception with a proper message?

What are you expecting for this issue?

> "Path(Path/String parent, String child)" will fail when "child" contains ":"
> 
>
> Key: HDFS-14762
> URL: https://issues.apache.org/jira/browse/HDFS-14762
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shixiong Zhu
>Assignee: hemanthboyina
>Priority: Major
>
> When the "child" parameter contains ":", "Path(Path/String parent, String 
> child)" will throw the following exception:
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: ...
> {code}
> Not sure if this is a legit bug. But the following places will hit this error 
> when seeing a Path with a file name containing ":":
> https://github.com/apache/hadoop/blob/f9029c4070e8eb046b403f5cb6d0a132c5d58448/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java#L101
> https://github.com/apache/hadoop/blob/f9029c4070e8eb046b403f5cb6d0a132c5d58448/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Globber.java#L270
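
For reference, a minimal reproduction of the behaviour described above; it only needs hadoop-common on the classpath, and the parent and child values are made-up examples.

{code:java}
import org.apache.hadoop.fs.Path;

public class ColonInChildRepro {
  public static void main(String[] args) {
    try {
      // Child name containing ':' (example values only).
      Path p = new Path(new Path("/tmp/parent"), "data:2019-08-25.csv");
      System.out.println("Constructed: " + p);
    } catch (IllegalArgumentException e) {
      // Reported behaviour: IllegalArgumentException wrapping
      // java.net.URISyntaxException: Relative path in absolute URI
      System.out.println("Failed as described in the issue: " + e.getMessage());
    }
  }
}
{code}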



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (HDFS-14762) "Path(Path/String parent, String child)" will fail when "child" contains ":"

2019-08-25 Thread hemanthboyina (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hemanthboyina updated HDFS-14762:
-
Comment: was deleted

(was: [~zsxwing] , then we need to handle the exception with a proper message ? 

whats your expecting point on this ?)

> "Path(Path/String parent, String child)" will fail when "child" contains ":"
> 
>
> Key: HDFS-14762
> URL: https://issues.apache.org/jira/browse/HDFS-14762
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shixiong Zhu
>Assignee: hemanthboyina
>Priority: Major
>
> When the "child" parameter contains ":", "Path(Path/String parent, String 
> child)" will throw the following exception:
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: ...
> {code}
> Not sure if this is a legit bug. But the following places will hit this error 
> when seeing a Path with a file name containing ":":
> https://github.com/apache/hadoop/blob/f9029c4070e8eb046b403f5cb6d0a132c5d58448/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java#L101
> https://github.com/apache/hadoop/blob/f9029c4070e8eb046b403f5cb6d0a132c5d58448/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Globber.java#L270



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14762) "Path(Path/String parent, String child)" will fail when "child" contains ":"

2019-08-25 Thread hemanthboyina (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915241#comment-16915241
 ] 

hemanthboyina commented on HDFS-14762:
--

[~zsxwing], should we then handle the exception with a proper message?

What are you expecting here?

> "Path(Path/String parent, String child)" will fail when "child" contains ":"
> 
>
> Key: HDFS-14762
> URL: https://issues.apache.org/jira/browse/HDFS-14762
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shixiong Zhu
>Assignee: hemanthboyina
>Priority: Major
>
> When the "child" parameter contains ":", "Path(Path/String parent, String 
> child)" will throw the following exception:
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: ...
> {code}
> Not sure if this is a legit bug. But the following places will hit this error 
> when seeing a Path with a file name containing ":":
> https://github.com/apache/hadoop/blob/f9029c4070e8eb046b403f5cb6d0a132c5d58448/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java#L101
> https://github.com/apache/hadoop/blob/f9029c4070e8eb046b403f5cb6d0a132c5d58448/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Globber.java#L270



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2030) Generate simplifed reports by the dev-support/checks/*.sh scripts

2019-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2030?focusedWorklogId=300837=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-300837
 ]

ASF GitHub Bot logged work on HDDS-2030:


Author: ASF GitHub Bot
Created on: 25/Aug/19 12:29
Start Date: 25/Aug/19 12:29
Worklog Time Spent: 10m 
  Work Description: adoroszlai commented on pull request #1348: HDDS-2030. 
Generate simplifed reports by the dev-support/checks/*.sh scripts
URL: https://github.com/apache/hadoop/pull/1348#discussion_r317395504
 
 

 ##
 File path: hadoop-ozone/dev-support/checks/acceptance.sh
 ##
 @@ -16,7 +16,19 @@
 DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
 cd "$DIR/../../.." || exit 1
 
+REPORT_DIR=${OUTPUT_DIR:-"$DIR/../../../target/acceptance"}
+mkdir -p "$REPORT_DIR"
+
 OZONE_VERSION=$(grep "" "$DIR/../../pom.xml" | sed 
's/<[^>]*>//g'|  sed 's/^[ \t]*//')
-cd "$DIR/../../dist/target/ozone-$OZONE_VERSION/compose" || exit 1
+DIST_DIR="$DIR/../../dist/target/ozone-$OZONE_VERSION"
+
+if [ ! -d "$DIST_DIR" ]; then
+echo "Distribution dir is missing. Doing a full build"
+"$DIR/build.sh"
+fi
+
+cd "$DIST_DIR/compose" || exit 1
 ./test-all.sh
+cp results/* "$REPORT_DIR/"
 
 Review comment:
   ```suggestion
   cp result/* "$REPORT_DIR/"
   ```
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 300837)
Time Spent: 1h  (was: 50m)

> Generate simplifed reports by the dev-support/checks/*.sh scripts
> -
>
> Key: HDDS-2030
> URL: https://issues.apache.org/jira/browse/HDDS-2030
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: build
>Reporter: Elek, Marton
>Assignee: Elek, Marton
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> hadoop-ozone/dev-support/checks directory contains shell scripts to execute 
> different type of code checks (findbugs, checkstyle, etc.)
> Currently the contract is very simple. Every shell script executes one (and 
> only one) check and the shell response code is set according to the result 
> (non-zero code if failed).
> To have better reporting in the github pr build, it would be great to improve 
> the scripts to generate simple summary files and save the relevant files for 
> archiving.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2030) Generate simplifed reports by the dev-support/checks/*.sh scripts

2019-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2030?focusedWorklogId=300838=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-300838
 ]

ASF GitHub Bot logged work on HDDS-2030:


Author: ASF GitHub Bot
Created on: 25/Aug/19 12:29
Start Date: 25/Aug/19 12:29
Worklog Time Spent: 10m 
  Work Description: adoroszlai commented on pull request #1348: HDDS-2030. 
Generate simplifed reports by the dev-support/checks/*.sh scripts
URL: https://github.com/apache/hadoop/pull/1348#discussion_r317395338
 
 

 ##
 File path: hadoop-ozone/dev-support/checks/unit.sh
 ##
 @@ -13,12 +13,20 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
+cd "$DIR/../../.." || exit 1
+set -x
+
 export MAVEN_OPTS="-Xmx4096m"
-mvn -fn test -f pom.ozone.xml -pl 
\!:hadoop-ozone-integration-test,\!:hadoop-ozone-filesystem,\!:hadoop-ozone-tools
-module_failed_tests=$(find "." -name 'TEST*.xml' -print0 \
-| xargs -n1 -0 "grep" -l -E "
> Generate simplifed reports by the dev-support/checks/*.sh scripts
> -
>
> Key: HDDS-2030
> URL: https://issues.apache.org/jira/browse/HDDS-2030
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: build
>Reporter: Elek, Marton
>Assignee: Elek, Marton
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> hadoop-ozone/dev-support/checks directory contains shell scripts to execute 
> different type of code checks (findbugs, checkstyle, etc.)
> Currently the contract is very simple. Every shell script executes one (and 
> only one) check and the shell response code is set according to the result 
> (non-zero code if failed).
> To have better reporting in the github pr build, it would be great to improve 
> the scripts to generate simple summary files and save the relevant files for 
> archiving.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2030) Generate simplifed reports by the dev-support/checks/*.sh scripts

2019-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2030?focusedWorklogId=300835=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-300835
 ]

ASF GitHub Bot logged work on HDDS-2030:


Author: ASF GitHub Bot
Created on: 25/Aug/19 12:29
Start Date: 25/Aug/19 12:29
Worklog Time Spent: 10m 
  Work Description: adoroszlai commented on pull request #1348: HDDS-2030. 
Generate simplifed reports by the dev-support/checks/*.sh scripts
URL: https://github.com/apache/hadoop/pull/1348#discussion_r317395567
 
 

 ##
 File path: hadoop-ozone/dev-support/checks/README.md
 ##
 @@ -0,0 +1,27 @@
+
+
+# Ozone checks
+
+This directory contains a collection of easy-to-user helper scripts to execute 
various type of tests on the ozone/hdds codebase.
 
 Review comment:
   ```suggestion
   This directory contains a collection of easy-to-use helper scripts to 
execute various type of tests on the ozone/hdds codebase.
   ```
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 300835)
Time Spent: 50m  (was: 40m)

> Generate simplifed reports by the dev-support/checks/*.sh scripts
> -
>
> Key: HDDS-2030
> URL: https://issues.apache.org/jira/browse/HDDS-2030
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: build
>Reporter: Elek, Marton
>Assignee: Elek, Marton
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> hadoop-ozone/dev-support/checks directory contains shell scripts to execute 
> different type of code checks (findbugs, checkstyle, etc.)
> Currently the contract is very simple. Every shell script executes one (and 
> only one) check and the shell response code is set according to the result 
> (non-zero code if failed).
> To have better reporting in the github pr build, it would be great to improve 
> the scripts to generate simple summary files and save the relevant files for 
> archiving.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2030) Generate simplifed reports by the dev-support/checks/*.sh scripts

2019-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2030?focusedWorklogId=300839=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-300839
 ]

ASF GitHub Bot logged work on HDDS-2030:


Author: ASF GitHub Bot
Created on: 25/Aug/19 12:29
Start Date: 25/Aug/19 12:29
Worklog Time Spent: 10m 
  Work Description: adoroszlai commented on pull request #1348: HDDS-2030. 
Generate simplifed reports by the dev-support/checks/*.sh scripts
URL: https://github.com/apache/hadoop/pull/1348#discussion_r317395273
 
 

 ##
 File path: hadoop-ozone/dev-support/checks/_mvn_unit_report.sh
 ##
 @@ -0,0 +1,53 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+## generate summary txt file
+find "." -name 'TEST*.xml' -print0 \
+| xargs -n1 -0 "grep" -l -E " "$SUMMARY_FILE"
+for TEST_RESULT_FILE in $(find "$REPORT_DIR" -name "*.txt" | grep -v output); 
do
+
+FAILURES=$(grep FAILURE "$TEST_RESULT_FILE" | grep "Tests run" | awk 
'{print $18}' | sort | uniq)
+
+for FAILURE in $FAILURES; do
+TEST_RESULT_LOCATION="$(realpath --relative-to="$REPORT_DIR" 
"$TEST_RESULT_FILE")"
+TEST_OUTPUT_LOCATION="${TEST_RESULT_LOCATION//.txt/-output.txt/}"
+printf " * [%s](%s) ([output](%s))" "$FAILURE" "$TEST_RESULT_LOCATION" 
"$TEST_OUTPUT_LOCATION" >> "$SUMMARY_FILE"
 
 Review comment:
   ```suggestion
   printf " * [%s](%s) ([output](%s))" "$FAILURE" 
"$TEST_RESULT_LOCATION" "$TEST_OUTPUT_LOCATION\n" >> "$SUMMARY_FILE"
   ```
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 300839)
Time Spent: 1h  (was: 50m)

> Generate simplifed reports by the dev-support/checks/*.sh scripts
> -
>
> Key: HDDS-2030
> URL: https://issues.apache.org/jira/browse/HDDS-2030
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: build
>Reporter: Elek, Marton
>Assignee: Elek, Marton
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> hadoop-ozone/dev-support/checks directory contains shell scripts to execute 
> different type of code checks (findbugs, checkstyle, etc.)
> Currently the contract is very simple. Every shell script executes one (and 
> only one) check and the shell response code is set according to the result 
> (non-zero code if failed).
> To have better reporting in the github pr build, it would be great to improve 
> the scripts to generate simple summary files and save the relevant files for 
> archiving.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2030) Generate simplifed reports by the dev-support/checks/*.sh scripts

2019-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2030?focusedWorklogId=300836=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-300836
 ]

ASF GitHub Bot logged work on HDDS-2030:


Author: ASF GitHub Bot
Created on: 25/Aug/19 12:29
Start Date: 25/Aug/19 12:29
Worklog Time Spent: 10m 
  Work Description: adoroszlai commented on pull request #1348: HDDS-2030. 
Generate simplifed reports by the dev-support/checks/*.sh scripts
URL: https://github.com/apache/hadoop/pull/1348#discussion_r317395730
 
 

 ##
 File path: hadoop-ozone/dev-support/checks/acceptance.sh
 ##
 @@ -16,7 +16,19 @@
 DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
 cd "$DIR/../../.." || exit 1
 
+REPORT_DIR=${OUTPUT_DIR:-"$DIR/../../../target/acceptance"}
+mkdir -p "$REPORT_DIR"
+
 OZONE_VERSION=$(grep "" "$DIR/../../pom.xml" | sed 
's/<[^>]*>//g'|  sed 's/^[ \t]*//')
-cd "$DIR/../../dist/target/ozone-$OZONE_VERSION/compose" || exit 1
+DIST_DIR="$DIR/../../dist/target/ozone-$OZONE_VERSION"
+
+if [ ! -d "$DIST_DIR" ]; then
+echo "Distribution dir is missing. Doing a full build"
+"$DIR/build.sh"
+fi
+
+cd "$DIST_DIR/compose" || exit 1
 ./test-all.sh
+cp results/* "$REPORT_DIR/"
+cp "$REPORT_DIR/log.html" "$REPORT_DIR/summary.html"
 exit $?
 
 Review comment:
   Now this `$?` reflects the result of `cp`, so result of `test-all.sh` should 
be saved and used for `exit`.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 300836)
Time Spent: 1h  (was: 50m)

> Generate simplifed reports by the dev-support/checks/*.sh scripts
> -
>
> Key: HDDS-2030
> URL: https://issues.apache.org/jira/browse/HDDS-2030
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: build
>Reporter: Elek, Marton
>Assignee: Elek, Marton
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> hadoop-ozone/dev-support/checks directory contains shell scripts to execute 
> different type of code checks (findbugs, checkstyle, etc.)
> Currently the contract is very simple. Every shell script executes one (and 
> only one) check and the shell response code is set according to the result 
> (non-zero code if failed).
> To have better reporting in the github pr build, it would be great to improve 
> the scripts to generate simple summary files and save the relevant files for 
> archiving.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14497) Write lock held by metasave impact following RPC processing

2019-08-25 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915230#comment-16915230
 ] 

He Xiaoqiao commented on HDFS-14497:


Thanks [~jojochuang],
{quote}suggests metaSaveLock should be a final object{quote}
It makes sense to me. [^HDFS-14497-addendum.001.patch] tries to change 
metaSaveLock to a final object. Please help review it.

> Write lock held by metasave impact following RPC processing
> ---
>
> Key: HDFS-14497
> URL: https://issues.apache.org/jira/browse/HDFS-14497
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14497-addendum.001.patch, HDFS-14497.001.patch
>
>
> NameNode metasave currently holds the global write lock, so subsequent RPC 
> read/write requests or internal NameNode threads can be paused if they try to 
> acquire the global read/write lock and have to wait until metasave releases it.
> I propose changing the write lock to a read lock so that some read requests can 
> be processed normally. I think allowing read requests would not change the 
> information that metasave tries to collect.
> Actually, we need to ensure that only one thread executes metaSave at a time; 
> otherwise, the output streams could hit exceptions, especially when both streams 
> hold the same file handle or share the same output stream.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14497) Write lock held by metasave impact following RPC processing

2019-08-25 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14497:
---
Attachment: HDFS-14497-addendum.001.patch

> Write lock held by metasave impact following RPC processing
> ---
>
> Key: HDFS-14497
> URL: https://issues.apache.org/jira/browse/HDFS-14497
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14497-addendum.001.patch, HDFS-14497.001.patch
>
>
> NameNode metasave currently holds the global write lock, so subsequent RPC 
> read/write requests or internal NameNode threads can be paused if they try to 
> acquire the global read/write lock and have to wait until metasave releases it.
> I propose changing the write lock to a read lock so that some read requests can 
> be processed normally. I think allowing read requests would not change the 
> information that metasave tries to collect.
> Actually, we need to ensure that only one thread executes metaSave at a time; 
> otherwise, the output streams could hit exceptions, especially when both streams 
> hold the same file handle or share the same output stream.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14711) RBF: RBFMetrics throws NullPointerException if stateStore disabled

2019-08-25 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915227#comment-16915227
 ] 

Chen Zhang commented on HDFS-14711:
---

Hi [~ayushtkn], since HDFS-14656 has not been updated for over a month, I 
uploaded a patch here to add a NULL check. Hope you don't mind.
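
The NULL check could look roughly like the sketch below. This is only an illustration with assumed names (membershipStore, getFilesTotal()); it is not the actual RBFMetrics code.

{code:java}
/** Sketch of a metrics getter guarded against a missing state store. */
public class RbfMetricsNullCheckSketch {

  // Would normally be initialized from the state store service; stays null
  // when the state store is disabled or failed to initialize.
  private final Object membershipStore;

  public RbfMetricsNullCheckSketch(Object membershipStore) {
    this.membershipStore = membershipStore;
  }

  public long getFilesTotal() {
    if (membershipStore == null) {
      // Previously this dereferenced the store and the NullPointerException
      // surfaced through the JMX servlet; returning a default keeps JMX usable.
      return 0L;
    }
    return 42L; // placeholder for the real aggregation over the store
  }

  public static void main(String[] args) {
    System.out.println(new RbfMetricsNullCheckSketch(null).getFilesTotal());
  }
}
{code}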

> RBF: RBFMetrics throws NullPointerException if stateStore disabled
> --
>
> Key: HDFS-14711
> URL: https://issues.apache.org/jira/browse/HDFS-14711
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14711.001.patch, HDFS-14711.002.patch, 
> HDFS-14711.003.patch
>
>
> In the current implementation, if {{stateStore}} initialization fails, only an 
> error message is logged. In that case RBFMetrics actually can't work normally.
> {code:java}
> 2019-08-08 22:43:58,024 [qtp812446698-28] ERROR jmx.JMXJsonServlet 
> (JMXJsonServlet.java:writeAttribute(345)) - getting attribute FilesTotal of 
> Hadoop:service=NameNode,name=FSNamesystem-2 threw an exception
> javax.management.RuntimeMBeanException: java.lang.NullPointerException
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrow(DefaultMBeanServerInterceptor.java:839)
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrowMaybeMBeanException(DefaultMBeanServerInterceptor.java:852)
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:651)
> at 
> com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:678)
> at 
> org.apache.hadoop.jmx.JMXJsonServlet.writeAttribute(JMXJsonServlet.java:338)
> at org.apache.hadoop.jmx.JMXJsonServlet.listBeans(JMXJsonServlet.java:316)
> at org.apache.hadoop.jmx.JMXJsonServlet.doGet(JMXJsonServlet.java:210)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
> at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
> at 
> org.apache.hadoop.security.authentication.server.ProxyUserAuthenticationFilter.doFilter(ProxyUserAuthenticationFilter.java:104)
> at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
> at org.apache.hadoop.hdfs.web.AuthFilter.doFilter(AuthFilter.java:51)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:110)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:539)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
> at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
> at 
> 

[jira] [Commented] (HDFS-14617) Improve fsimage load time by writing sub-sections to the fsimage index

2019-08-25 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915221#comment-16915221
 ] 

He Xiaoqiao commented on HDFS-14617:


[~csun], [~sodonnell], [~jojochuang], HDFS-14771 tracks backporting this patch 
to branch-2. I will try to test and cover most FsImage cases once the branch-2 
patch is ready.

> Improve fsimage load time by writing sub-sections to the fsimage index
> --
>
> Key: HDFS-14617
> URL: https://issues.apache.org/jira/browse/HDFS-14617
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14617.001.patch, ParallelLoading.svg, 
> SerialLoading.svg, dirs-single.svg, flamegraph.parallel.svg, 
> flamegraph.serial.svg, inodes.svg
>
>
> Loading an fsimage is basically a single threaded process. The current 
> fsimage is written out in sections, eg iNode, iNode_Directory, Snapshots, 
> Snapshot_Diff etc. Then at the end of the file, an index is written that 
> contains the offset and length of each section. The image loader code uses 
> this index to initialize an input stream to read and process each section. It 
> is important that one section is fully loaded before another is started, as 
> the next section depends on the results of the previous one.
> What I would like to propose is the following:
> 1. When writing the image, we can optionally output sub_sections to the 
> index. That way, a given section would effectively be split into several 
> sections, eg:
> {code:java}
>inode_section offset 10 length 1000
>  inode_sub_section offset 10 length 500
>  inode_sub_section offset 510 length 500
>  
>inode_dir_section offset 1010 length 1000
>  inode_dir_sub_section offset 1010 length 500
>  inode_dir_sub_section offset 1010 length 500
> {code}
> Here you can see we still have the original section index, but then we also 
> have sub-section entries that cover the entire section. Then a processor can 
> either read the full section in serial, or read each sub-section in parallel.
> 2. In the Image Writer code, we should set a target number of sub-sections, 
> and then based on the total inodes in memory, it will create that many 
> sub-sections per major image section. I think the only sections worth doing 
> this for are inode, inode_reference, inode_dir and snapshot_diff. All others 
> tend to be fairly small in practice.
> 3. If there are under some threshold of inodes (eg 10M) then don't bother 
> with the sub-sections as a serial load only takes a few seconds at that scale.
> 4. The image loading code can then have a switch to enable 'parallel loading' 
> and a 'number of threads' where it uses the sub-sections, or if not enabled 
> falls back to the existing logic to read the entire section in serial.
> Working with a large image of 316M inodes and 35GB on disk, I have a proof of 
> concept of this change working, allowing just inode and inode_dir to be 
> loaded in parallel, but I believe inode_reference and snapshot_diff can be 
> made parallel with the same technique.
> Some benchmarks I have are as follows:
> {code:java}
> Threads      1     2     3     4
> 
> inodes     448   290   226   189
> inode_dir  326   211   170   161
> Total      927   651   535   488   (MD5 calculation about 100 seconds)
> {code}
> The above table shows the time in seconds to load the inode section and the 
> inode_directory section, and then the total load time of the image.
> With 4 threads using the above technique, we are able to more than halve the 
> load time of the two sections. With the patch in HDFS-13694 it would take a 
> further 100 seconds off the run time, going from 927 seconds to 388, which is 
> a significant improvement. Adding more threads beyond 4 has diminishing 
> returns as there are some synchronized points in the loading code to protect 
> the in memory structures.
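
A self-contained sketch of the 'parallel loading' switch described above: given the (offset, length) sub-sections from the index, load them with a small thread pool, or fall back to the existing serial pass when parallel loading is disabled. All names are illustrative assumptions; this is not the FSImage loader code.

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SubSectionLoadSketch {

  /** One sub-section entry from the image index: offset plus length. */
  public static final class SubSection {
    final long offset;
    final long length;
    SubSection(long offset, long length) { this.offset = offset; this.length = length; }
    @Override public String toString() { return "sub-section@" + offset + "+" + length; }
  }

  /** Stand-in for the code that opens a stream at the given offset and loads it. */
  public interface SectionReader {
    void load(SubSection s) throws Exception;
  }

  public static void loadSection(List<SubSection> subSections, SectionReader reader,
                                 boolean parallel, int threads) throws Exception {
    if (!parallel || threads <= 1 || subSections.size() <= 1) {
      for (SubSection s : subSections) {   // existing serial path
        reader.load(s);
      }
      return;
    }
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (SubSection s : subSections) {
        futures.add(pool.submit(() -> { reader.load(s); return null; }));
      }
      for (Future<?> f : futures) {
        f.get();                           // propagate any load failure
      }
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    List<SubSection> subs =
        Arrays.asList(new SubSection(10, 500), new SubSection(510, 500));
    loadSection(subs, s -> System.out.println("loading " + s), true, 2);
  }
}
{code}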



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11246) FSNameSystem#logAuditEvent should be called outside the read or write locks

2019-08-25 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915218#comment-16915218
 ] 

He Xiaoqiao commented on HDFS-11246:


Thanks [~jojochuang]. [^HDFS-11246.011.patch] tries to correct the locking 
around #addCachePool and fix the checkstyle issues. Pending what Jenkins says.

> FSNameSystem#logAuditEvent should be called outside the read or write locks
> ---
>
> Key: HDFS-11246
> URL: https://issues.apache.org/jira/browse/HDFS-11246
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Kuhu Shukla
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-11246.001.patch, HDFS-11246.002.patch, 
> HDFS-11246.003.patch, HDFS-11246.004.patch, HDFS-11246.005.patch, 
> HDFS-11246.006.patch, HDFS-11246.007.patch, HDFS-11246.008.patch, 
> HDFS-11246.009.patch, HDFS-11246.010.patch, HDFS-11246.011.patch
>
>
> {code}
> readLock();
> boolean success = true;
> ContentSummary cs;
> try {
>   checkOperation(OperationCategory.READ);
>   cs = FSDirStatAndListingOp.getContentSummary(dir, src);
> } catch (AccessControlException ace) {
>   success = false;
>   logAuditEvent(success, operationName, src);
>   throw ace;
> } finally {
>   readUnlock(operationName);
> }
> {code}
> It would be nice to have audit logging outside the lock esp. in scenarios 
> where applications hammer a given operation several times. 
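
A self-contained sketch of the proposed pattern (illustrative, not the real FSNamesystem code): record the outcome while holding the lock, release the lock, and only then emit the audit event, so a slow or heavily hammered audit path no longer extends lock hold time.

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class AuditOutsideLockSketch {

  private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);

  private void logAuditEvent(boolean success, String op, String src) {
    System.out.println("audit: op=" + op + " src=" + src + " success=" + success);
  }

  public long getContentSummary(String src) {
    final String operationName = "contentSummary";
    boolean success = true;
    long summary = 0;
    fsLock.readLock().lock();
    try {
      summary = src.length();        // stand-in for FSDirStatAndListingOp work
    } catch (RuntimeException e) {
      success = false;
      throw e;
    } finally {
      fsLock.readLock().unlock();
      // Audit logging happens only after the read lock is released.
      logAuditEvent(success, operationName, src);
    }
    return summary;
  }

  public static void main(String[] args) {
    System.out.println(new AuditOutsideLockSketch().getContentSummary("/user/test"));
  }
}
{code}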



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11246) FSNameSystem#logAuditEvent should be called outside the read or write locks

2019-08-25 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-11246:
---
Attachment: HDFS-11246.011.patch

> FSNameSystem#logAuditEvent should be called outside the read or write locks
> ---
>
> Key: HDFS-11246
> URL: https://issues.apache.org/jira/browse/HDFS-11246
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Kuhu Shukla
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-11246.001.patch, HDFS-11246.002.patch, 
> HDFS-11246.003.patch, HDFS-11246.004.patch, HDFS-11246.005.patch, 
> HDFS-11246.006.patch, HDFS-11246.007.patch, HDFS-11246.008.patch, 
> HDFS-11246.009.patch, HDFS-11246.010.patch, HDFS-11246.011.patch
>
>
> {code}
> readLock();
> boolean success = true;
> ContentSummary cs;
> try {
>   checkOperation(OperationCategory.READ);
>   cs = FSDirStatAndListingOp.getContentSummary(dir, src);
> } catch (AccessControlException ace) {
>   success = false;
>   logAuditEvent(success, operationName, src);
>   throw ace;
> } finally {
>   readUnlock(operationName);
> }
> {code}
> It would be nice to have audit logging outside the lock esp. in scenarios 
> where applications hammer a given operation several times. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12904) Add DataTransferThrottler to the Datanode transfers

2019-08-25 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915199#comment-16915199
 ] 

Lisheng Sun commented on HDFS-12904:


Uploaded the v003 patch.

> Add DataTransferThrottler to the Datanode transfers
> ---
>
> Key: HDFS-12904
> URL: https://issues.apache.org/jira/browse/HDFS-12904
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode
>Reporter: Íñigo Goiri
>Assignee: Lisheng Sun
>Priority: Minor
> Attachments: HDFS-12904.000.patch, HDFS-12904.001.patch, 
> HDFS-12904.002.patch, HDFS-12904.003.patch
>
>
> The {{DataXceiverServer}} already uses throttling for the balancing. The 
> Datanode should also allow throttling the regular data transfers.
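
A self-contained sketch of what such throttling typically looks like: a simple rate limiter that callers invoke after each transferred chunk. It is illustrative only and not Hadoop's DataTransferThrottler implementation; the 1 MB/s figure in main is just an example.

{code:java}
/** Minimal bandwidth throttler: callers report bytes sent and may be slept. */
public class TransferThrottleSketch {

  private final long bytesPerSecond;
  private long windowStartMs = System.currentTimeMillis();
  private long bytesInWindow = 0;

  public TransferThrottleSketch(long bytesPerSecond) {
    this.bytesPerSecond = bytesPerSecond;
  }

  /** Call after sending numBytes; sleeps if the rate budget is exhausted. */
  public synchronized void throttle(long numBytes) throws InterruptedException {
    bytesInWindow += numBytes;
    long elapsed = System.currentTimeMillis() - windowStartMs;
    long expectedMs = bytesInWindow * 1000 / bytesPerSecond;
    if (expectedMs > elapsed) {
      Thread.sleep(expectedMs - elapsed);   // pace the sender to the target rate
    }
    if (elapsed >= 1000) {                  // roll the accounting window
      windowStartMs = System.currentTimeMillis();
      bytesInWindow = 0;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    TransferThrottleSketch t = new TransferThrottleSketch(1_000_000); // ~1 MB/s
    long start = System.currentTimeMillis();
    for (int i = 0; i < 20; i++) {
      t.throttle(100_000);                  // pretend we sent 100 KB chunks
    }
    System.out.println("2 MB took ~" + (System.currentTimeMillis() - start) + " ms");
  }
}
{code}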



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12904) Add DataTransferThrottler to the Datanode transfers

2019-08-25 Thread Lisheng Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-12904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lisheng Sun updated HDFS-12904:
---
Attachment: HDFS-12904.003.patch

> Add DataTransferThrottler to the Datanode transfers
> ---
>
> Key: HDFS-12904
> URL: https://issues.apache.org/jira/browse/HDFS-12904
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode
>Reporter: Íñigo Goiri
>Assignee: Lisheng Sun
>Priority: Minor
> Attachments: HDFS-12904.000.patch, HDFS-12904.001.patch, 
> HDFS-12904.002.patch, HDFS-12904.003.patch
>
>
> The {{DataXceiverServer}} already uses throttling for the balancing. The 
> Datanode should also allow throttling the regular data transfers.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2030) Generate simplifed reports by the dev-support/checks/*.sh scripts

2019-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2030?focusedWorklogId=300826=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-300826
 ]

ASF GitHub Bot logged work on HDDS-2030:


Author: ASF GitHub Bot
Created on: 25/Aug/19 07:25
Start Date: 25/Aug/19 07:25
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on issue #1348: HDDS-2030. 
Generate simplifed reports by the dev-support/checks/*.sh scripts
URL: https://github.com/apache/hadoop/pull/1348#issuecomment-524607886
 
 
   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |::|--:|:|:|
   | 0 | reexec | 145 | Docker mode activated. |
   ||| _ Prechecks _ |
   | +1 | dupname | 0 | No case conflicting files found. |
   | 0 | shelldocs | 1 | Shelldocs was not available. |
   | 0 | @author | 0 | Skipping @author checks as author.sh has been patched. |
   | -1 | test4tests | 0 | The patch doesn't appear to include any new or 
modified tests.  Please justify why no new tests are needed for this patch. 
Also please list what manual steps were performed to verify this patch. |
   ||| _ trunk Compile Tests _ |
   | +1 | mvninstall | 1223 | trunk passed |
   | +1 | mvnsite | 0 | trunk passed |
   | +1 | shadedclient | 1542 | branch has no errors when building and testing 
our client artifacts. |
   ||| _ Patch Compile Tests _ |
   | +1 | mvninstall | 1218 | the patch passed |
   | +1 | mvnsite | 0 | the patch passed |
   | -1 | shellcheck | 2 | The patch generated 2 new + 0 unchanged - 0 fixed = 
2 total (was 0) |
   | +1 | whitespace | 0 | The patch has no whitespace issues. |
   | +1 | shadedclient | 775 | patch has no errors when building and testing 
our client artifacts. |
   ||| _ Other Tests _ |
   | +1 | unit | 119 | hadoop-hdds in the patch passed. |
   | +1 | unit | 322 | hadoop-ozone in the patch passed. |
   | +1 | asflicense | 50 | The patch does not generate ASF License warnings. |
   | | | 5905 | |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | Client=19.03.1 Server=19.03.1 base: 
https://builds.apache.org/job/hadoop-multibranch/job/PR-1348/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/1348 |
   | Optional Tests | dupname asflicense mvnsite unit shellcheck shelldocs |
   | uname | Linux 2a2de9de6b88 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 
10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | personality/hadoop.sh |
   | git revision | trunk / d2225c8 |
   | shellcheck | 
https://builds.apache.org/job/hadoop-multibranch/job/PR-1348/1/artifact/out/diff-patch-shellcheck.txt
 |
   |  Test Results | 
https://builds.apache.org/job/hadoop-multibranch/job/PR-1348/1/testReport/ |
   | Max. process+thread count | 329 (vs. ulimit of 5500) |
   | modules | C: hadoop-ozone U: hadoop-ozone |
   | Console output | 
https://builds.apache.org/job/hadoop-multibranch/job/PR-1348/1/console |
   | versions | git=2.7.4 maven=3.3.9 shellcheck=0.4.6 |
   | Powered by | Apache Yetus 0.10.0 http://yetus.apache.org |
   
   
   This message was automatically generated.
   
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 300826)
Time Spent: 40m  (was: 0.5h)

> Generate simplifed reports by the dev-support/checks/*.sh scripts
> -
>
> Key: HDDS-2030
> URL: https://issues.apache.org/jira/browse/HDDS-2030
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: build
>Reporter: Elek, Marton
>Assignee: Elek, Marton
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> hadoop-ozone/dev-support/checks directory contains shell scripts to execute 
> different type of code checks (findbugs, checkstyle, etc.)
> Currently the contract is very simple. Every shell script executes one (and 
> only one) check and the shell response code is set according to the result 
> (non-zero code if failed).
> To have better reporting in the github pr build, it would be great to improve 
> the scripts to generate simple summary files and save the relevant files for 
> archiving.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Work logged] (HDDS-2030) Generate simplifed reports by the dev-support/checks/*.sh scripts

2019-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2030?focusedWorklogId=300825=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-300825
 ]

ASF GitHub Bot logged work on HDDS-2030:


Author: ASF GitHub Bot
Created on: 25/Aug/19 07:25
Start Date: 25/Aug/19 07:25
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #1348: 
HDDS-2030. Generate simplifed reports by the dev-support/checks/*.sh scripts
URL: https://github.com/apache/hadoop/pull/1348#discussion_r317385576
 
 

 ##
 File path: hadoop-ozone/dev-support/checks/unit.sh
 ##
 @@ -13,12 +13,20 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
+cd "$DIR/../../.." || exit 1
+set -x
+
 export MAVEN_OPTS="-Xmx4096m"
-mvn -fn test -f pom.ozone.xml -pl 
\!:hadoop-ozone-integration-test,\!:hadoop-ozone-filesystem,\!:hadoop-ozone-tools
-module_failed_tests=$(find "." -name 'TEST*.xml' -print0 \
-| xargs -n1 -0 "grep" -l -E "
> Generate simplifed reports by the dev-support/checks/*.sh scripts
> -
>
> Key: HDDS-2030
> URL: https://issues.apache.org/jira/browse/HDDS-2030
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: build
>Reporter: Elek, Marton
>Assignee: Elek, Marton
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> hadoop-ozone/dev-support/checks directory contains shell scripts to execute 
> different type of code checks (findbugs, checkstyle, etc.)
> Currently the contract is very simple. Every shell script executes one (and 
> only one) check and the shell response code is set according to the result 
> (non-zero code if failed).
> To have better reporting in the github pr build, it would be great to improve 
> the scripts to generate simple summary files and save the relevant files for 
> archiving.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2030) Generate simplifed reports by the dev-support/checks/*.sh scripts

2019-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2030?focusedWorklogId=300824=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-300824
 ]

ASF GitHub Bot logged work on HDDS-2030:


Author: ASF GitHub Bot
Created on: 25/Aug/19 07:25
Start Date: 25/Aug/19 07:25
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #1348: 
HDDS-2030. Generate simplifed reports by the dev-support/checks/*.sh scripts
URL: https://github.com/apache/hadoop/pull/1348#discussion_r317385575
 
 

 ##
 File path: hadoop-ozone/dev-support/checks/integration.sh
 ##
 @@ -20,10 +20,14 @@ export MAVEN_OPTS="-Xmx4096m"
 mvn -B install -f pom.ozone.xml -DskipTests
 mvn -B -fn test -f pom.ozone.xml -pl 
:hadoop-ozone-integration-test,:hadoop-ozone-filesystem,:hadoop-ozone-tools \
   -Dtest=\!TestMiniChaosOzoneCluster
-module_failed_tests=$(find "." -name 'TEST*.xml' -print0 \
-| xargs -0 -n1 "grep" -l -E "
> Generate simplifed reports by the dev-support/checks/*.sh scripts
> -
>
> Key: HDDS-2030
> URL: https://issues.apache.org/jira/browse/HDDS-2030
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: build
>Reporter: Elek, Marton
>Assignee: Elek, Marton
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> hadoop-ozone/dev-support/checks directory contains shell scripts to execute 
> different type of code checks (findbugs, checkstyle, etc.)
> Currently the contract is very simple. Every shell script executes one (and 
> only one) check and the shell response code is set according to the result 
> (non-zero code if failed).
> To have better reporting in the github pr build, it would be great to improve 
> the scripts to generate simple summary files and save the relevant files for 
> archiving.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDDS-2006) Autogenerated docker config fails with space in the file name issue.

2019-08-25 Thread Aravindan Vijayan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aravindan Vijayan resolved HDDS-2006.
-
Resolution: Cannot Reproduce

> Autogenerated docker config fails with space in the file name issue.
> 
>
> Key: HDDS-2006
> URL: https://issues.apache.org/jira/browse/HDDS-2006
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Anu Engineer
>Assignee: Aravindan Vijayan
>Priority: Major
>
> If you follow the instructions in the "Local multi-container cluster" and 
> generate docker-config, and later try to use it. The docker-compose up -d 
> command will fail with
>  
> *ERROR: In file ~/testOzoneInstructions/docker-config: environment variable 
> name 'Setting up environment!' may not contains whitespace.*



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org