[jira] [Comment Edited] (HDFS-12914) Block report leases cause missing blocks until next report

2019-07-02 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877439#comment-16877439
 ] 

star edited comment on HDFS-12914 at 7/3/19 2:43 AM:
-

Yes, it is broken in branch-3.0, where 'processReport' still carries the 'private' 
modifier. The other branches do not have this issue, because HDFS-11673 removed the 
'private' modifier there.
{quote} Collection<Block> processReport(
       final DatanodeStorageInfo storageInfo,
       final BlockListAsLongs report,
       BlockReportContext context) throws IOException {
{quote}
I am not sure whether branch-3.0 should be covered by this issue. 

[~jojochuang], what do you think? 


was (Author: starphin):
Yes, it is broken in branch-3.0, where 'processReport' still carries the 'private' 
modifier. The other branches do not have this issue, because HDFS-11673 removed the 
'private' modifier there.
{quote} Collection<Block> processReport(
      final DatanodeStorageInfo storageInfo,
      final BlockListAsLongs report,
      BlockReportContext context) throws IOException {
{quote}
I am not sure whether branch-3.0 should be covered by this issue. 

[~jojochuang], what do you think? 

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0, 2.9.2
>Reporter: Daryn Sharp
>Assignee: Santosh Marella
>Priority: Critical
> Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-12914-branch-2.001.patch, 
> HDFS-12914-trunk.00.patch, HDFS-12914-trunk.01.patch, HDFS-12914.005.patch, 
> HDFS-12914.006.patch, HDFS-12914.007.patch, HDFS-12914.008.patch, 
> HDFS-12914.009.patch, HDFS-12914.branch-2.patch, 
> HDFS-12914.branch-3.1.001.patch, HDFS-12914.branch-3.1.002.patch, 
> HDFS-12914.branch-3.2.patch, HDFS-12914.utfix.patch
>
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false which bubbles up to  {{NameNodeRpcServer#blockReport}} and 
> interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected from an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DNs 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12914) Block report leases cause missing blocks until next report

2019-06-23 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870714#comment-16870714
 ] 

Wei-Chiu Chuang edited comment on HDFS-12914 at 6/24/19 12:36 AM:
--

Looks like HDFS-12487 breaks the TestDiskBalancer test. Filed HDFS-14599 for 
that.


was (Author: jojochuang):
Looks like HDFS-12487 breaks the getBlockToCopy test. Filed HDFS-14599 for that.

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0, 2.9.2
>Reporter: Daryn Sharp
>Assignee: Santosh Marella
>Priority: Critical
> Fix For: 3.3.0, 3.2.1
>
> Attachments: HDFS-12914-branch-2.001.patch, 
> HDFS-12914-trunk.00.patch, HDFS-12914-trunk.01.patch, HDFS-12914.005.patch, 
> HDFS-12914.006.patch, HDFS-12914.007.patch, HDFS-12914.008.patch, 
> HDFS-12914.009.patch, HDFS-12914.branch-3.1.001.patch, 
> HDFS-12914.branch-3.1.002.patch, HDFS-12914.branch-3.2.patch, 
> HDFS-12914.utfix.patch
>
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false which bubbles up to  {{NameNodeRpcServer#blockReport}} and 
> interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected from an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DNs 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12914) Block report leases cause missing blocks until next report

2019-06-23 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870714#comment-16870714
 ] 

Wei-Chiu Chuang edited comment on HDFS-12914 at 6/24/19 12:36 AM:
--

Looks like HDFS-12487 breaks the getBlockToCopy test. Filed HDFS-14599 for that.


was (Author: jojochuang):
Looks like HDFS-12487 breaks the getBlockToCopy test.

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0, 2.9.2
>Reporter: Daryn Sharp
>Assignee: Santosh Marella
>Priority: Critical
> Fix For: 3.3.0, 3.2.1
>
> Attachments: HDFS-12914-branch-2.001.patch, 
> HDFS-12914-trunk.00.patch, HDFS-12914-trunk.01.patch, HDFS-12914.005.patch, 
> HDFS-12914.006.patch, HDFS-12914.007.patch, HDFS-12914.008.patch, 
> HDFS-12914.009.patch, HDFS-12914.branch-3.1.001.patch, 
> HDFS-12914.branch-3.1.002.patch, HDFS-12914.branch-3.2.patch, 
> HDFS-12914.utfix.patch
>
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false which bubbles up to  {{NameNodeRpcServer#blockReport}} and 
> interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected from an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DNs 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12914) Block report leases cause missing blocks until next report

2019-06-08 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859250#comment-16859250
 ] 

star edited comment on HDFS-12914 at 6/8/19 4:10 PM:
-

A few comments about your unit tests. 
 # The following code bypasses the lease expiration check by removing a valid 
lease id. It would be better to keep the lease as it is at run time, so the test 
exercises the real expiration path.

{code:java}
// Remove full block report lease about dn
  spyBlockManager.getBlockReportLeaseManager()
  .removeLease(datanodeDescriptor);{code}
      2. Do we really need to respond to the client with a RegisterCommand.REGISTER 
command? It is a somewhat heavy command. Should we instead just let the client know 
that its block report failed (such as with an IOException response, as in my code) 
and have it retry a few more times? 
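
A minimal sketch of the retry idea in point 2, for illustration only: {{MAX_FBR_RETRIES}} and {{requestNewLease()}} are hypothetical names, and the real DataNode reporting loop (BPServiceActor) is structured differently.
{code:java}
// Hypothetical DN-side handling: if the NN rejects the FBR (e.g. with an
// IOException for an invalid lease), retry a few times instead of forcing a
// full re-registration.
int attempts = 0;
while (true) {
  try {
    // the same RPC my unit test calls directly (DatanodeProtocol#blockReport)
    nnRpc.blockReport(dnReg, bpid, reports,
        new BlockReportContext(1, 0, System.nanoTime(), leaseId, true));
    break;                                 // report accepted
  } catch (IOException e) {
    if (++attempts >= MAX_FBR_RETRIES) {   // MAX_FBR_RETRIES is illustrative
      throw e;                             // give up; the next scheduled FBR retries
    }
    leaseId = requestNewLease();           // hypothetical: obtain a fresh lease first
  }
}
{code}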


was (Author: starphin):
A few comments about your unit tests. 
 # The following code bypasses the lease expiration check by removing a valid 
lease id. It would be better to keep the lease as it is at run time, so the test 
exercises the real expiration path.

{code:java}
// Remove full block report lease about dn
  spyBlockManager.getBlockReportLeaseManager()
  .removeLease(datanodeDescriptor);{code}
      2. Do we really need to respond to the client with a RegisterCommand.REGISTER 
command? It is a somewhat heavy command. Should we instead just let the client know 
that its block report failed and have it retry a few more times? 

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0, 2.9.2
>Reporter: Daryn Sharp
>Assignee: Santosh Marella
>Priority: Critical
> Attachments: HDFS-12914-branch-2.001.patch, 
> HDFS-12914-trunk.00.patch, HDFS-12914-trunk.01.patch, HDFS-12914.005.patch, 
> HDFS-12914.006.patch
>
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false which bubbles up to  {{NameNodeRpcServer#blockReport}} and 
> interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected from an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DNs 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12914) Block report leases cause missing blocks until next report

2019-06-08 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859250#comment-16859250
 ] 

star edited comment on HDFS-12914 at 6/8/19 4:09 PM:
-

A few comments about your unit tests. 
 # The following code bypasses the lease expiration check by removing a valid 
lease id. It would be better to keep the lease as it is at run time, so the test 
exercises the real expiration path.

{code:java}
// Remove full block report lease about dn
  spyBlockManager.getBlockReportLeaseManager()
  .removeLease(datanodeDescriptor);{code}
      2. Do we really need to respond to the client with a RegisterCommand.REGISTER 
command? It is a somewhat heavy command. Should we instead just let the client know 
that its block report failed and have it retry a few more times? 


was (Author: starphin):
A few comments about your unit tests. 

The following code bypasses the lease expiration check by removing a valid lease 
id. It would be better to keep the lease as it is at run time, so the test 
exercises the real expiration path.
{code:java}
// Remove full block report lease about dn
  spyBlockManager.getBlockReportLeaseManager()
  .removeLease(datanodeDescriptor);
{code}

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0, 2.9.2
>Reporter: Daryn Sharp
>Assignee: Santosh Marella
>Priority: Critical
> Attachments: HDFS-12914-branch-2.001.patch, 
> HDFS-12914-trunk.00.patch, HDFS-12914-trunk.01.patch, HDFS-12914.005.patch, 
> HDFS-12914.006.patch
>
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false which bubbles up to  {{NameNodeRpcServer#blockReport}} and 
> interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected from an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DNs 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12914) Block report leases cause missing blocks until next report

2019-06-08 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859246#comment-16859246
 ] 

star edited comment on HDFS-12914 at 6/8/19 3:56 PM:
-

[~hexiaoqiao], I also wrote a unit test for this issue, mostly similar to 
yours. Pasted here just for reference.

Besides the test code, one piece of production code is changed: 
BlockManager#processReport now throws an IOException to indicate an invalid lease 
id, so the client receives the exception.
{code:java}
if (context != null) {
  if (!blockReportLeaseManager.checkLease(node, startTime,
      context.getLeaseId())) {
    throw new IOException("Invalid block report lease id '"
        + context.getLeaseId() + "'");
  }
}{code}
{code:java}
// Before the test starts
conf.setLong(DFSConfigKeys.DFS_NAMENODE_FULL_BLOCK_REPORT_LEASE_LENGTH_MS, 500L);


@Test
public void testDelayedBlockReport() throws IOException {
  FSNamesystem namesystem = cluster.getNameNode(0).getNamesystem();

  BlockManager testBlockManager = Mockito.spy(namesystem.getBlockManager());

  Mockito.doAnswer(new Answer<Boolean>() {
    @Override
    public Boolean answer(InvocationOnMock invocationOnMock) throws Throwable {
      // sleep 1000 ms to delay processing of the current report
      Thread.sleep(1000);
      return (Boolean) invocationOnMock.callRealMethod();
    }
  }).when(testBlockManager).processReport(
      Mockito.any(DatanodeID.class), Mockito.any(DatanodeStorage.class),
      Mockito.any(BlockListAsLongs.class),
      Mockito.any(BlockReportContext.class));
  namesystem.setBlockManagerForTesting(testBlockManager);

  String bpid = namesystem.getBlockPoolId();
  DataNode dn = cluster.getDataNodes().get(0);
  DatanodeRegistration dnReg = dn.getDNRegistrationForBP(bpid);

  namesystem.readLock();
  long leaseId = testBlockManager.requestBlockReportLeaseId(dnReg);
  namesystem.readUnlock();

  Map<DatanodeStorage, BlockListAsLongs> report = cluster.getBlockReport(bpid, 0);

  List<StorageBlockReport> reportList = new ArrayList<>();
  for (Map.Entry<DatanodeStorage, BlockListAsLongs> en : report.entrySet()) {
    reportList.add(new StorageBlockReport(en.getKey(), en.getValue()));
  }

  // it will throw an IOException if the lease id is invalid
  cluster.getNameNode().getRpcServer().blockReport(
      dnReg, bpid, reportList.toArray(new StorageBlockReport[]{}),
      new BlockReportContext(1, 0, System.nanoTime(), leaseId, true));
}
{code}


was (Author: starphin):
[~hexiaoqiao], I also wrote a unit test for this issue, mostly similar to 
yours. Pasted here just for reference.

Besides the test code, one piece of production code is changed: 
BlockManager#processReport now throws an IOException to indicate an invalid lease 
id, so the client receives the exception.
{code:java}
if (context != null) {
  if (!blockReportLeaseManager.checkLease(node, startTime,
      context.getLeaseId())) {
    throw new IOException("Invalid block report lease id '"
        + context.getLeaseId() + "'");
  }
}{code}
{code:java}
@Test
public void testDelayedBlockReport() throws IOException {
  FSNamesystem namesystem = cluster.getNameNode(0).getNamesystem();

  BlockManager testBlockManager = Mockito.spy(namesystem.getBlockManager());

  Mockito.doAnswer(new Answer<Boolean>() {
    @Override
    public Boolean answer(InvocationOnMock invocationOnMock) throws Throwable {
      // sleep 1000 ms to delay processing of the current report
      Thread.sleep(1000);
      return (Boolean) invocationOnMock.callRealMethod();
    }
  }).when(testBlockManager).processReport(
      Mockito.any(DatanodeID.class), Mockito.any(DatanodeStorage.class),
      Mockito.any(BlockListAsLongs.class),
      Mockito.any(BlockReportContext.class));
  namesystem.setBlockManagerForTesting(testBlockManager);

  String bpid = namesystem.getBlockPoolId();
  DataNode dn = cluster.getDataNodes().get(0);
  DatanodeRegistration dnReg = dn.getDNRegistrationForBP(bpid);

  namesystem.readLock();
  long leaseId = testBlockManager.requestBlockReportLeaseId(dnReg);
  namesystem.readUnlock();

  Map<DatanodeStorage, BlockListAsLongs> report = cluster.getBlockReport(bpid, 0);

  List<StorageBlockReport> reportList = new ArrayList<>();
  for (Map.Entry<DatanodeStorage, BlockListAsLongs> en : report.entrySet()) {
    reportList.add(new StorageBlockReport(en.getKey(), en.getValue()));
  }

  // it will throw an IOException if the lease id is invalid
  cluster.getNameNode().getRpcServer().blockReport(
      dnReg, bpid, reportList.toArray(new StorageBlockReport[]{}),
      new BlockReportContext(1, 0, System.nanoTime(), leaseId, true));
}
{code}

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0, 2.9.2
>Reporter: Daryn Sharp
>Assignee: Santosh Marella
>Priority: Critical
> Attachments: HDFS-12914-branch-2.001.patch, 
> HDFS-12914-trunk.00.patch, 

[jira] [Comment Edited] (HDFS-12914) Block report leases cause missing blocks until next report

2019-05-22 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845764#comment-16845764
 ] 

He Xiaoqiao edited comment on HDFS-12914 at 5/22/19 10:48 AM:
--

[~smarella], some minor comments about  [^HDFS-12914-trunk.01.patch] ,
a. we need to check whether #context is null when checking the lease;
b. maybe we should also catch #UnregisteredNodeException and return 
{{RegisterCommand.REGISTER}};
c. {{datanodeManager.getDatanode(nodeId)}} may return null, so we should check 
for {{null}} before passing it as a parameter to 
BlockReportLeaseManager#checkLease;
d. it would be better to add a unit test, as [~jojochuang] and [~starphin] 
mentioned above.
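
A minimal sketch of points a–c above, for illustration only; the names follow this discussion ({{datanodeManager}}, {{blockReportLeaseManager}}, {{RegisterCommand.REGISTER}}), but it is not an excerpt of the patch under review.
{code:java}
// Hypothetical lease check in the FBR handling path, combining points a-c:
if (context != null) {                             // (a) only check when a context was sent
  DatanodeDescriptor node;
  try {
    node = datanodeManager.getDatanode(nodeReg);   // may throw UnregisteredNodeException
  } catch (UnregisteredNodeException e) {
    return RegisterCommand.REGISTER;               // (b) ask the DataNode to re-register
  }
  if (node == null                                 // (c) getDatanode() can return null
      || !blockReportLeaseManager.checkLease(node, startTime, context.getLeaseId())) {
    return RegisterCommand.REGISTER;
  }
}
{code}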


was (Author: hexiaoqiao):
[~smarella], some minor comments about  [^HDFS-12914-trunk.01.patch] ,
a. we need to check if #context is null when check lease;
b. maybe we should catch #UnregisteredNodeException and return 
{{RegisterCommand.REGISTER}} also;
c. {{datanodeManager.getDatanode(nodeId)}} is possible to return null, so we 
should check {{null}} before pass as one parameter of 
BlockReportLeaseManager#checkLease;
d. it is better to add some unit test as [~starphin] mentioned above.

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0, 2.9.2
>Reporter: Daryn Sharp
>Assignee: Santosh Marella
>Priority: Critical
> Attachments: HDFS-12914-branch-2.001.patch, 
> HDFS-12914-trunk.00.patch, HDFS-12914-trunk.01.patch
>
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false which bubbles up to  {{NameNodeRpcServer#blockReport}} and 
> interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected from an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DNs 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12914) Block report leases cause missing blocks until next report

2019-05-21 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845414#comment-16845414
 ] 

star edited comment on HDFS-12914 at 5/22/19 2:58 AM:
--

Thanks [~smarella]. A few questions, for reference only.

1. I guess the protobuf version change here is a mistake.

 
{code:java}
2.5.0.t02
{code}
2. Do we need a test case?

3. Should we log more about the full block report lease at INFO level, so that we 
can investigate such issues more easily? [~smarella], [~hexiaoqiao], [~jojochuang].


was (Author: starphin):
Thanks [~smarella]. A few questions, for reference only.

1. I guess the protobuf version change here is a mistake.

 
{code:java}
2.5.0.t02
{code}
2. Do we need a test case? 

3. Should we log more about the full block report lease, so that we can investigate 
such issues more easily? [~smarella], [~hexiaoqiao], [~jojochuang].

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0, 2.9.2
>Reporter: Daryn Sharp
>Assignee: Santosh Marella
>Priority: Critical
> Attachments: HDFS-12914-branch-2.001.patch, 
> HDFS-12914-trunk.00.patch, HDFS-12914-trunk.01.patch
>
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false which bubbles up to  {{NameNodeRpcServer#blockReport}} and 
> interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected from an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DNs 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12914) Block report leases cause missing blocks until next report

2019-05-21 Thread Santosh Marella (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845374#comment-16845374
 ] 

Santosh Marella edited comment on HDFS-12914 at 5/22/19 12:31 AM:
--

Attaching the patches for {{branch-2}} and {{trunk}}. [~jojochuang], 
[~hexiaoqiao] - can you please review and let me know your feedback? Thanks.


was (Author: smarella):
Attaching the patches for {{branch-2}} and {{trunk}}. 

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0, 2.9.2
>Reporter: Daryn Sharp
>Assignee: Santosh Marella
>Priority: Critical
> Attachments: HDFS-12914-branch-2.001.patch, HDFS-12914-trunk.00.patch
>
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false which bubbles up to  {{NameNodeRpcServer#blockReport}} and 
> interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected from an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DNs 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12914) Block report leases cause missing blocks until next report

2019-05-20 Thread Santosh Marella (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844178#comment-16844178
 ] 

Santosh Marella edited comment on HDFS-12914 at 5/20/19 6:14 PM:
-

{quote} Santosh Marella how many DNs do you have?  According to the limited 
logs, I think it is caused by following case. A high cpu load of SNN delayed 
the processing of full block report.{quote}

[~starphin] - DNs are in the order of hundreds. You are right that a high cpu 
load on SNN has delayed processing a FBR from a DN that was issued a lease. The 
SNN started processing the reports, but the lease expired after it processed 3 
out of 12 reports.


was (Author: smarella):
{quote} Santosh Marella how many DNs do you have?  According to the limited 
logs, I think it is caused by following case. A high cpu load of SNN delayed 
the processing of full block report.{quote}

DNs are in the order of hundreds. You are right that a high cpu load on SNN has 
delayed processing a FBR from a DN that was issued a lease. The SNN started 
processing the reports, but the lease expired after it processed 3 out of 12 
reports.

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false which bubbles up to  {{NameNodeRpcServer#blockReport}} and 
> interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected from an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DNs 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12914) Block report leases cause missing blocks until next report

2019-05-20 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844056#comment-16844056
 ] 

star edited comment on HDFS-12914 at 5/20/19 3:47 PM:
--

[~hexiaoqiao] proposed a good solution to this issue.

Besides the RPC call queue, there is also a queue for processing block reports. 
Maybe we should take that into account and check the lease id before putting the 
report into the block report queue; I think this queue is the main cause of the 
delay in block report processing.
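
A minimal sketch of the idea above, for illustration only; {{blockReportQueue}} and {{reportOp}} are hypothetical names, not the actual BlockManager internals.
{code:java}
// Hypothetical: validate the lease *before* the report is queued for
// asynchronous processing, so a stale lease is rejected up front instead of
// being discovered only after the report has waited in the queue.
if (context != null &&
    !blockReportLeaseManager.checkLease(node, startTime, context.getLeaseId())) {
  throw new IOException("Invalid block report lease id " + context.getLeaseId());
}
blockReportQueue.add(reportOp);   // hypothetical enqueue for the processing thread
{code}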


was (Author: starphin):
[~hexiaoqiao] proposed a good solution to this issue.

Besides the RPC call queue, there is also a queue for processing block reports. 
Maybe we should take that into account and check the lease id before putting the 
report into the block report queue.

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false which bubbles up to  {{NameNodeRpcServer#blockReport}} and 
> interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected from an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DNs 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12914) Block report leases cause missing blocks until next report

2019-05-20 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843936#comment-16843936
 ] 

He Xiaoqiao edited comment on HDFS-12914 at 5/20/19 12:58 PM:
--

[~smarella] Thanks for your report; I think you have provided complete information 
and the reason is clear. 
As we all know, block report processing is very heavy, and its processing time can 
be much longer than that of other RPCs, especially when a very large number of 
blocks is located on one DataNode and it is not the first block report during 
NameNode startup.
For the issue you report:
a. t1: the DataNode requests a full block report lease from the NameNode via heartbeat.
b. t2: the lease is returned to the DataNode.
c. t3: the DataNode sends the FBR.
d. t4: the FBR enters the NameNode call queue.
e. t5: the NameNode begins to process the FBR one #StorageBlockReport at a time, and 
finishes the first 3 #StorageBlockReports successfully.
f. t6: the NameNode processes the fourth #StorageBlockReport, finds the lease has 
expired, logs `the lease has expired`, and removes the lease;
g. t7: the NameNode finishes processing the remaining 8 #StorageBlockReports, finds 
no lease, and logs `the DN is not in the pending set`;
where t5 - t1 < 5 min and t6 - t1 > 5 min. 

I think during that time the load on the NameNode was very high, and the CallQueue 
of the service RPC port (or the RPC port, if no service port is configured) stayed 
full for a long time (maybe longer than 5 minutes).
As mentioned above, the root cause is that we check the lease for every 
#StorageBlockReport of a DataNode. So I think the solution is also clear: check the 
lease once per DataNode rather than for every #StorageBlockReport.
I would like to follow this issue and submit a patch later.
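
A minimal sketch of that proposal, for illustration only (this is not the committed patch); {{checkBlockReportLease}} is a hypothetical once-per-DataNode helper, and the surrounding method is assumed to be the NameNode-side FBR handler.
{code:java}
// Check the lease once for the whole FBR of this DataNode...
if (context != null && !bm.checkBlockReportLease(context, nodeReg)) {
  return RegisterCommand.REGISTER;   // reject the whole report up front
}
// ...then process every storage report without further lease checks, so a
// lease expiring mid-report no longer drops the remaining storages.
for (StorageBlockReport r : reports) {
  bm.processReport(nodeReg, r.getStorage(), r.getBlocks(), context);
}
{code}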


was (Author: hexiaoqiao):
[~smarella] Thanks for your report; I think you have provided complete information 
and the reason is clear. 
As we all know, block report processing is a very heavy request, and its processing 
time can be much longer than that of other RPCs, especially when a very large number 
of blocks is located on one DataNode and it is not the first block report just after 
NameNode startup.
For the issue you report:
a. t1: the DataNode requests a full block report lease from the NameNode via heartbeat.
b. t2: the lease is returned to the DataNode.
c. t3: the DataNode sends the FBR.
d. t4: the FBR enters the NameNode call queue.
e. t5: the NameNode begins to process the FBR one #StorageBlockReport at a time, and 
finishes the first 3 #StorageBlockReports successfully.
f. t6: the NameNode processes the fourth #StorageBlockReport, finds the lease has 
expired, logs `the lease has expired`, and removes the lease;
g. t7: the NameNode finishes processing the remaining 8 #StorageBlockReports, finds 
the lease has also expired, and logs `the DN is not in the pending set`;
where t5 - t1 < 5 min and t6 - t1 > 5 min. 

I think during that time the load on the NameNode was very high, and the CallQueue 
of the service RPC port (or the RPC port, if no service port is configured) stayed 
full for a long time (maybe longer than 5 minutes).
As mentioned above, the root cause is that we check the lease for every 
#StorageBlockReport of a DataNode. So I think the solution is also clear: check the 
lease once per DataNode rather than for every #StorageBlockReport.
I would like to follow this issue and submit a patch later.

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false which bubbles up to  {{NameNodeRpcServer#blockReport}} and 
> interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected from an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DNs 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12914) Block report leases cause missing blocks until next report

2019-05-20 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843829#comment-16843829
 ] 

star edited comment on HDFS-12914 at 5/20/19 10:08 AM:
---

[~smarella] how many DNs do you have?  According to the limited logs, I think 
it is caused by the following case: a high CPU load on the SNN delayed the 
processing of the full block report.

 
||DN1...||DN2||
|register|register|
|request Lease| |
|process Report| |
|...|request Lease|
|process Report|{color:#707070}_more than 5 minutes_{color}|
|...|{color:#d04437}process Report (failed){color}|

There are no logs between 2019-05-16 15:15:35 and 2019-05-16 15:31:11. Logs 
unrelated to 10.54.63.120:50010 are filtered out, right [~smarella]?

During that time, I think the SNN was processing block reports from other DNs. Not 
until 2019-05-16 15:31:11 did the SNN begin to process the block reports from that 
DN, which is 6 minutes after the full block report lease id was requested and beyond 
the default expiry of 5 minutes (DFS_NAMENODE_FULL_BLOCK_REPORT_LEASE_LENGTH_MS_DEFAULT). 

I don't know when the full block report lease id was obtained from the server, since 
there is no INFO log about it. I guess it was about 5 minutes before the first failed 
report, say 15:26:29. 

 


was (Author: starphin):
[~smarella] how many DNs do you have?  According to the limited logs, I think 
it is caused by the following case: a high CPU load on the SNN delayed the 
processing of the full block report.

 
||DN1...||DN2||
|register|register|
|request Lease| |
|process Report| |
|...|request Lease|
|process Report|{color:#707070}_more than 5 minutes_{color}|
|...|process Report|

 

There are no logs between 2019-05-16 15:15:35 and 2019-05-16 15:31:11. Logs 
unrelated to 10.54.63.120:50010 are filtered out, right [~smarella]?

During that time, I think the SNN was processing block reports from other DNs. Not 
until 2019-05-16 15:31:11 did the SNN begin to process the block reports from that 
DN, which is 6 minutes after the full block report lease id was requested and beyond 
the default expiry of 5 minutes (DFS_NAMENODE_FULL_BLOCK_REPORT_LEASE_LENGTH_MS_DEFAULT). 

I don't know when the full block report lease id was obtained from the server, since 
there is no INFO log about it. I guess it was about 5 minutes before the first failed 
report, say 15:26:29. 

 

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false which bubbles up to  {{NameNodeRpcServer#blockReport}} and 
> interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected from an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DNs 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12914) Block report leases cause missing blocks until next report

2019-05-20 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843829#comment-16843829
 ] 

star edited comment on HDFS-12914 at 5/20/19 10:07 AM:
---

[~smarella] how many DNs do you have?  According to the limited logs, I think 
it is caused by the following case: a high CPU load on the SNN delayed the 
processing of the full block report.

 
||DN1...||DN2||
|register|register|
|request Lease| |
|process Request| |
|...|request Lease|
|process Request|{color:#707070}_more than 5 minutes_{color}|
|...|process Request|

 

There are no logs between 2019-05-16 15:15:35 and 2019-05-16 15:31:11. Logs 
unrelated to 10.54.63.120:50010 are filtered out, right [~smarella]?

During that time, I think the SNN was processing block reports from other DNs. Not 
until 2019-05-16 15:31:11 did the SNN begin to process the block reports from that 
DN, which is 6 minutes after the full block report lease id was requested and beyond 
the default expiry of 5 minutes (DFS_NAMENODE_FULL_BLOCK_REPORT_LEASE_LENGTH_MS_DEFAULT). 

I don't know when the full block report lease id was obtained from the server, since 
there is no INFO log about it. I guess it was about 5 minutes before the first failed 
report, say 15:26:29. 

 


was (Author: starphin):
[~smarella] how many DNs do you have?  According to the limited logs, I think 
it is caused by the following case: a high load delayed the processing of the full 
block report.

 
||DN1...||DN2||
|register|register|
|request Lease| |
|process Request| |
|...|request Lease|
|process Request|{color:#707070}_more than 5 minutes_{color}|
|...|process Request|

 

There are no logs between 2019-05-16 15:15:35 and 2019-05-16 15:31:11. Logs 
unrelated to 10.54.63.120:50010 are filtered out, right [~smarella]?

During that time, I think the SNN was processing block reports from other DNs. Not 
until 2019-05-16 15:31:11 did the SNN begin to process the block reports from that 
DN, which is 6 minutes after the full block report lease id was requested and beyond 
the default expiry of 5 minutes (DFS_NAMENODE_FULL_BLOCK_REPORT_LEASE_LENGTH_MS_DEFAULT). 

I don't know when the full block report lease id was obtained from the server, since 
there is no INFO log about it. I guess it was about 5 minutes before the first failed 
report, say 15:26:29. 

 

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false which bubbles up to  {{NameNodeRpcServer#blockReport}} and 
> interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected from an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DNs 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12914) Block report leases cause missing blocks until next report

2019-05-20 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843829#comment-16843829
 ] 

star edited comment on HDFS-12914 at 5/20/19 10:07 AM:
---

[~smarella] how many DNs do you have?  According to the limited logs, I think 
it is caused by the following case: a high CPU load on the SNN delayed the 
processing of the full block report.

 
||DN1...||DN2||
|register|register|
|request Lease| |
|process Report| |
|...|request Lease|
|process Report|{color:#707070}_more than 5 minutes_{color}|
|...|process Report|

 

There are no logs between 2019-05-16 15:15:35 and 2019-05-16 15:31:11. Logs 
unrelated to 10.54.63.120:50010 are filtered out, right [~smarella]?

During that time, I think the SNN was processing block reports from other DNs. Not 
until 2019-05-16 15:31:11 did the SNN begin to process the block reports from that 
DN, which is 6 minutes after the full block report lease id was requested and beyond 
the default expiry of 5 minutes (DFS_NAMENODE_FULL_BLOCK_REPORT_LEASE_LENGTH_MS_DEFAULT). 

I don't know when the full block report lease id was obtained from the server, since 
there is no INFO log about it. I guess it was about 5 minutes before the first failed 
report, say 15:26:29. 

 


was (Author: starphin):
[~smarella] how many DNs do you have?  According to the limited logs, I think 
it is caused by the following case: a high CPU load on the SNN delayed the 
processing of the full block report.

 
||DN1...||DN2||
|register|register|
|request Lease| |
|process Request| |
|...|request Lease|
|process Request|{color:#707070}_more than 5 minutes_{color}|
|...|process Request|

 

There are no logs between 2019-05-16 15:15:35 and 2019-05-16 15:31:11. Logs 
unrelated to 10.54.63.120:50010 are filtered out, right [~smarella]?

During that time, I think the SNN was processing block reports from other DNs. Not 
until 2019-05-16 15:31:11 did the SNN begin to process the block reports from that 
DN, which is 6 minutes after the full block report lease id was requested and beyond 
the default expiry of 5 minutes (DFS_NAMENODE_FULL_BLOCK_REPORT_LEASE_LENGTH_MS_DEFAULT). 

I don't know when the full block report lease id was obtained from the server, since 
there is no INFO log about it. I guess it was about 5 minutes before the first failed 
report, say 15:26:29. 

 

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false which bubbles up to  {{NameNodeRpcServer#blockReport}} and 
> interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected from an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DNs 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12914) Block report leases cause missing blocks until next report

2019-05-20 Thread Santosh Marella (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843647#comment-16843647
 ] 

Santosh Marella edited comment on HDFS-12914 at 5/20/19 9:48 AM:
-

[~jojochuang] - We might have hit this issue recently and some initial 
investigation seems to lead to this issue described by [~daryn] in the 
description. I'm new to this area, but from what I have learnt so far, it seems 
that HDFS-14314 fixes a related, but slightly different scenario. Please see 
below.

 

We have a DN that has 12 disks. When we restarted a standby NN, the DN registered 
itself, got a lease for an FBR and sent an FBR containing 12 reports, one for each 
disk. However, only 3 of them were processed; the remaining 9 were not, as the lease 
had expired before the 4th report was processed. This essentially meant the FBR was 
only partially processed, and, in our case, this *might* be one of the reasons it is 
taking so long for the NN to come out of safe mode (the safe block count is taking 
too long to reach the threshold due to partial FBR processing).

 

Some raw notes that I've collected while investigating this issue. Sorry for being 
verbose, but I hope it helps everyone. We are on Hadoop version 2.9.2.

I dug through the logs and observed this for a DN for which only 3 out of 12 
reports were processed. The DN registered itself with the NN, then sent an FBR 
that contained 12 reports (one for each disk). The NN processed 3 of them 
(indicated by the *processing first storage report* and *processing time* entries 
in the log statements). However, for the 9 remaining reports, it printed *lease 
xxx is not valid for DN* messages.  

 
{code:java}
2019-05-16 15:15:35,028 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
registerDatanode: from DatanodeRegistration(xx.xx.xx.xx:50010, 
datanodeUuid=7493442a-c552-43f4-b6bd-728be292f66d, infoPort=50075, 
infoSecurePort=0, ipcPort=50020, 
storageInfo=lv=-57;cid=CID-f4a0a2ae-9e3d-41d5-b98a-e0e77ed0249b;nsid=682930173;c=1406912757005)
 storage 7493442a-c552-43f4-b6bd-728be292f66d
2019-05-16 15:15:35,028 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockReportLeaseManager: 
Registered DN 7493442a-c552-43f4-b6bd-728be292f66d (xx.xx.xx.xx:50010).
2019-05-16 15:31:11,578 INFO BlockStateChange: BLOCK* processReport 
0xb4fb52822c9e3f03: Processing first storage report for 
DS-3e8d8352-ecc9-45cb-a39b-86f10d8aa386 from datanode 
7493442a-c552-43f4-b6bd-728be292f66d
2019-05-16 15:31:11,941 INFO BlockStateChange: BLOCK* processReport 
0xb4fb52822c9e3f03: from storage DS-3e8d8352-ecc9-45cb-a39b-86f10d8aa386 node 
DatanodeRegistration(xx.xx.xx.xx:50010, 
datanodeUuid=7493442a-c552-43f4-b6bd-728be292f66d, infoPort=50075, 
infoSecurePort=0, ipcPort=50020, 
storageInfo=lv=-57;cid=CID-f4a0a2ae-9e3d-41d5-b98a-e0e77ed0249b;nsid=682930173;c=1406912757005),
 blocks: 12690, hasStaleStorage: true, processing time: 363 msecs, 
invalidatedBlocks: 0
2019-05-16 15:31:17,496 INFO BlockStateChange: BLOCK* processReport 
0xb4fb52822c9e3f03: Processing first storage report for 
DS-600c8ca1-3f99-41fc-a784-f663b928fe21 from datanode 
7493442a-c552-43f4-b6bd-728be292f66d
2019-05-16 15:31:17,851 INFO BlockStateChange: BLOCK* processReport 
0xb4fb52822c9e3f03: from storage DS-600c8ca1-3f99-41fc-a784-f663b928fe21 node 
DatanodeRegistration(xx.xx.xx.xx:50010, 
datanodeUuid=7493442a-c552-43f4-b6bd-728be292f66d, infoPort=50075, 
infoSecurePort=0, ipcPort=50020, 
storageInfo=lv=-57;cid=CID-f4a0a2ae-9e3d-41d5-b98a-e0e77ed0249b;nsid=682930173;c=1406912757005),
 blocks: 12670, hasStaleStorage: true, processing time: 355 msecs, 
invalidatedBlocks: 0
2019-05-16 15:31:23,465 INFO BlockStateChange: BLOCK* processReport 
0xb4fb52822c9e3f03: Processing first storage report for 
DS-3e7dc4c5-ab4e-40d1-8f32-64fe28081f94 from datanode 
7493442a-c552-43f4-b6bd-728be292f66d
2019-05-16 15:31:23,821 INFO BlockStateChange: BLOCK* processReport 
0xb4fb52822c9e3f03: from storage DS-3e7dc4c5-ab4e-40d1-8f32-64fe28081f94 node 
DatanodeRegistration(xx.xx.xx.xx:50010, 
datanodeUuid=7493442a-c552-43f4-b6bd-728be292f66d, infoPort=50075, 
infoSecurePort=0, ipcPort=50020, 
storageInfo=lv=-57;cid=CID-f4a0a2ae-9e3d-41d5-b98a-e0e77ed0249b;nsid=682930173;c=1406912757005),
 blocks: 12698, hasStaleStorage: true, processing time: 356 msecs, 
invalidatedBlocks: 0
2019-05-16 15:31:29,419 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockReportLeaseManager: Removing 
expired block report lease 0xfd013f0084d0ed2d for DN 
7493442a-c552-43f4-b6bd-728be292f66d.
2019-05-16 15:31:29,419 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.BlockReportLeaseManager: BR lease 
0xfd013f0084d0ed2d is not valid for DN 7493442a-c552-43f4-b6bd-728be292f66d, 
because the lease has expired.
2019-05-16 15:31:35,891 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.BlockReportLeaseManager: BR lease 
0xfd013f0084d0ed2d is not valid for DN 

[jira] [Comment Edited] (HDFS-12914) Block report leases cause missing blocks until next report

2019-05-20 Thread Santosh Marella (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843647#comment-16843647
 ] 

Santosh Marella edited comment on HDFS-12914 at 5/20/19 9:47 AM:
-

[~jojochuang] - We might have hit this issue recently and some initial 
investigation seems to lead to this issue described by [~daryn] in the 
description. I'm new to this area, but from what I have learnt so far, it seems 
that HDFS-14314 fixes a related, but slightly different scenario. Please see 
below.

 

We have a DN that has 12 disks. When we restarted a standby NN, the DN registered 
itself, got a lease for an FBR and sent an FBR containing 12 reports, one for each 
disk. However, only 3 of them were processed; the remaining 9 were not, as the lease 
had expired before the 4th report was processed. This essentially meant the FBR was 
only partially processed, and, in our case, this *might* be one of the reasons it is 
taking so long for the NN to come out of safe mode (the safe block count is taking 
too long to reach the threshold due to partial FBR processing).

 

Some raw notes that I've collected while investigating this issue. Sorry for being 
verbose, but I hope it helps everyone. We are on Hadoop version 2.9.2.

I dug through the logs and observed this for a DN for which only 3 out of 12 
reports were processed. The DN registered itself with the NN, then sent an FBR 
that contained 12 reports (one for each disk). The NN processed 3 of them 
(indicated by the *processing first storage report* and *processing time* entries 
in the log statements). However, for the 9 remaining reports, it printed *lease 
xxx is not valid for DN* messages.  

 
{code:java}
2019-05-16 15:15:35,028 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
registerDatanode: from DatanodeRegistration(10.54.63.120:50010, 
datanodeUuid=7493442a-c552-43f4-b6bd-728be292f66d, infoPort=50075, 
infoSecurePort=0, ipcPort=50020, 
storageInfo=lv=-57;cid=CID-f4a0a2ae-9e3d-41d5-b98a-e0e77ed0249b;nsid=682930173;c=1406912757005)
 storage 7493442a-c552-43f4-b6bd-728be292f66d
2019-05-16 15:15:35,028 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockReportLeaseManager: 
Registered DN 7493442a-c552-43f4-b6bd-728be292f66d (10.54.63.120:50010).
2019-05-16 15:31:11,578 INFO BlockStateChange: BLOCK* processReport 
0xb4fb52822c9e3f03: Processing first storage report for 
DS-3e8d8352-ecc9-45cb-a39b-86f10d8aa386 from datanode 
7493442a-c552-43f4-b6bd-728be292f66d
2019-05-16 15:31:11,941 INFO BlockStateChange: BLOCK* processReport 
0xb4fb52822c9e3f03: from storage DS-3e8d8352-ecc9-45cb-a39b-86f10d8aa386 node 
DatanodeRegistration(10.54.63.120:50010, 
datanodeUuid=7493442a-c552-43f4-b6bd-728be292f66d, infoPort=50075, 
infoSecurePort=0, ipcPort=50020, 
storageInfo=lv=-57;cid=CID-f4a0a2ae-9e3d-41d5-b98a-e0e77ed0249b;nsid=682930173;c=1406912757005),
 blocks: 12690, hasStaleStorage: true, processing time: 363 msecs, 
invalidatedBlocks: 0
2019-05-16 15:31:17,496 INFO BlockStateChange: BLOCK* processReport 
0xb4fb52822c9e3f03: Processing first storage report for 
DS-600c8ca1-3f99-41fc-a784-f663b928fe21 from datanode 
7493442a-c552-43f4-b6bd-728be292f66d
2019-05-16 15:31:17,851 INFO BlockStateChange: BLOCK* processReport 
0xb4fb52822c9e3f03: from storage DS-600c8ca1-3f99-41fc-a784-f663b928fe21 node 
DatanodeRegistration(10.54.63.120:50010, 
datanodeUuid=7493442a-c552-43f4-b6bd-728be292f66d, infoPort=50075, 
infoSecurePort=0, ipcPort=50020, 
storageInfo=lv=-57;cid=CID-f4a0a2ae-9e3d-41d5-b98a-e0e77ed0249b;nsid=682930173;c=1406912757005),
 blocks: 12670, hasStaleStorage: true, processing time: 355 msecs, 
invalidatedBlocks: 0
2019-05-16 15:31:23,465 INFO BlockStateChange: BLOCK* processReport 
0xb4fb52822c9e3f03: Processing first storage report for 
DS-3e7dc4c5-ab4e-40d1-8f32-64fe28081f94 from datanode 
7493442a-c552-43f4-b6bd-728be292f66d
2019-05-16 15:31:23,821 INFO BlockStateChange: BLOCK* processReport 
0xb4fb52822c9e3f03: from storage DS-3e7dc4c5-ab4e-40d1-8f32-64fe28081f94 node 
DatanodeRegistration(10.54.63.120:50010, 
datanodeUuid=7493442a-c552-43f4-b6bd-728be292f66d, infoPort=50075, 
infoSecurePort=0, ipcPort=50020, 
storageInfo=lv=-57;cid=CID-f4a0a2ae-9e3d-41d5-b98a-e0e77ed0249b;nsid=682930173;c=1406912757005),
 blocks: 12698, hasStaleStorage: true, processing time: 356 msecs, 
invalidatedBlocks: 0
2019-05-16 15:31:29,419 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockReportLeaseManager: Removing 
expired block report lease 0xfd013f0084d0ed2d for DN 
7493442a-c552-43f4-b6bd-728be292f66d.
2019-05-16 15:31:29,419 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.BlockReportLeaseManager: BR lease 
0xfd013f0084d0ed2d is not valid for DN 7493442a-c552-43f4-b6bd-728be292f66d, 
because the lease has expired.
2019-05-16 15:31:35,891 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.BlockReportLeaseManager: BR lease 
0xfd013f0084d0ed2d is not valid for DN