[jira] [Commented] (HDFS-17366) NameNode Fine-Grained Locking via Namespace Tree

2024-02-02 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813808#comment-17813808
 ] 

Xing Lin commented on HDFS-17366:
-

Love to see this feature.

> NameNode Fine-Grained Locking via Namespace Tree
> 
>
> Key: HDFS-17366
> URL: https://issues.apache.org/jira/browse/HDFS-17366
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>
> As we all know, the write performance of the NameNode is limited by its global 
> lock. We aim to enable fine-grained locking based on the namespace tree to 
> improve the performance of NameNode write operations.
> There are multiple motivations for creating this ticket:
>  * We have implemented this fine-grained locking and gained nearly 7x 
> performance improvement in our production environment.
>  * Other companies have made similar improvements on their internal branches. 
> Those branches differ significantly from the community code, so there has been 
> little feedback or discussion in the community.
>  * The topic of fine-grained locking has been discussed for a very long time, 
> but has not yet produced any concrete result.
>  
> We implemented this fine-grained locking based on the namespace tree to 
> maximize concurrency for disjoint or independent operations.
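For readers who want a concrete picture of the idea, below is a minimal, hypothetical Java sketch of path-based locking over a namespace tree. All class and method names are invented for illustration; this is not the HDFS-17366 patch, and a real implementation also has to handle operations that span subtrees (such as rename), snapshots, and lock ordering.

{code:java}
// Hypothetical sketch of namespace-tree locking: read-lock every ancestor
// directory and write-lock only the final path component, so writes under
// disjoint subtrees (e.g. /a/x vs /b/y) can proceed concurrently.
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class PathLockSketch {
  // One lock per path ("/a", "/a/x", ...), created lazily.
  private final ConcurrentHashMap<String, ReentrantReadWriteLock> locks =
      new ConcurrentHashMap<>();

  private ReentrantReadWriteLock lockFor(String path) {
    return locks.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
  }

  /** Read-lock the ancestors of the path and write-lock its last component. */
  public void lockForWrite(List<String> components) {
    StringBuilder path = new StringBuilder();
    for (int i = 0; i < components.size(); i++) {
      path.append('/').append(components.get(i));
      if (i < components.size() - 1) {
        lockFor(path.toString()).readLock().lock();
      } else {
        lockFor(path.toString()).writeLock().lock();
      }
    }
  }

  /** Release the locks taken by lockForWrite(), in reverse order. */
  public void unlockForWrite(List<String> components) {
    String[] paths = new String[components.size()];
    StringBuilder path = new StringBuilder();
    for (int i = 0; i < components.size(); i++) {
      path.append('/').append(components.get(i));
      paths[i] = path.toString();
    }
    if (paths.length == 0) {
      return;
    }
    lockFor(paths[paths.length - 1]).writeLock().unlock();
    for (int i = paths.length - 2; i >= 0; i--) {
      lockFor(paths[i]).readLock().unlock();
    }
  }
}
{code}

With such a scheme, a write under /a/x and a write under /b/y take locks on entirely different paths, so they no longer serialize on a single global write lock.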



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException

2024-01-12 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806170#comment-17806170
 ] 

Xing Lin commented on HDFS-17332:
-

In this jira, we only addressed the above issue for DFSInputStream. We did not 
work on DFSStripedInputStream for the erasure coding read path, since erasure 
coding is not used at LinkedIn. If you need the same fix for 
DFSStripedInputStream, please enhance DFSStripedInputStream#fetchBlockByteRange().
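For anyone picking that up, a minimal sketch of the logging pattern this jira introduces for DFSInputStream may help. It is not the actual patch; the helper names are invented, and plain IOException stands in for BlockMissingException:

{code:java}
// Hypothetical sketch: per-DN failures log a one-line WARN without a
// stacktrace, and the collected exceptions are logged with stacktraces only
// when the read is finally failed.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class ReadRetrySketch {
  private static final Logger LOG = LoggerFactory.getLogger(ReadRetrySketch.class);

  byte[] readWithFailover(List<String> datanodes, String block) throws IOException {
    List<IOException> suppressed = new ArrayList<>();
    for (String dn : datanodes) {
      try {
        return readFrom(dn, block);               // hypothetical helper
      } catch (IOException e) {
        // The read will most likely succeed from the next DN, so keep it concise.
        LOG.warn("Failed to read {} from {}: {}", block, dn, e.toString());
        suppressed.add(e);
      }
    }
    // Only now emit the full stacktraces, because the request really failed.
    for (IOException e : suppressed) {
      LOG.warn("Failure while reading " + block, e);
    }
    throw new IOException("Could not obtain block " + block);
  }

  private byte[] readFrom(String dn, String block) throws IOException {
    throw new IOException("connection to " + dn + " timed out"); // placeholder
  }
}
{code}

The same shape should carry over to DFSStripedInputStream#fetchBlockByteRange(): log tersely while alternative sources remain, and attach stacktraces only on the final failure.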

> DFSInputStream: avoid logging stacktrace until when we really need to fail a 
> read request with a MissingBlockException
> --
>
> Key: HDFS-17332
> URL: https://issues.apache.org/jira/browse/HDFS-17332
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>
> In DFSInputStream#actualGetFromOneDataNode(), the exception stacktrace is sent 
> to dfsClient.LOG whenever we fail on a DN. However, in most cases, the read 
> request will be served successfully by reading from the next available DN. The 
> presence of the exception stacktrace in the log has caused multiple Hadoop 
> users at LinkedIn to treat this WARN message as the root cause/fatal error for 
> their jobs. We would like to improve the log message and avoid sending the 
> stacktrace to dfsClient.LOG when a read succeeds. The stacktrace from reading 
> each DN is sent to the log only when we really need to fail a read request 
> (when chooseDataNode()/refetchLocations() throws a BlockMissingException).
>  
> Example stack trace
> {code:java}
> [12]:23/11/30 23:01:33 WARN hdfs.DFSClient: Connection failure: 
> Failed to connect to 10.150.91.13/10.150.91.13:71 for file 
> //part--95b9909c-zzz-c000.avro for block 
> BP-364971551-DatanodeIP-1448516588954:blk__129864739321:java.net.SocketTimeoutException:
>  6 millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/ip:40492 
> remote=datanodeIP:71] [12]:java.net.SocketTimeoutException: 6 
> millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/localIp:40492 
> remote=datanodeIP:71] [12]: at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) 
> [12]: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) 
> [12]: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) 
> [12]: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) 
> [12]: at java.io.FilterInputStream.read(FilterInputStream.java:83) 
> [12]: at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:458)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderRemote2.newBlockReader(BlockReaderRemote2.java:412)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:864)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:387)
>  [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:736) 
> [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1268)
>  [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1216)
>  [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1608) 
> [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1568) 
> [12]: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
> [12]: at 
> hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.lambda$read$0(InstrumentedFSDataInputStream.java:108)
>  [12]: at 
> com.linkedin.hadoop.metrics.fs.PerformanceTrackingFSDataInputStream.process(PerformanceTrackingFSDataInputStream.java:39)
>  [12]: at 
> hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.read(InstrumentedFSDataInputStream.java:108)
>  [12]: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
> [12]: at 
> org.apache.hadoop.fs.RetryingInputStream.lambda$read$2(RetryingInputStream.java:153)
>  [12]: at 
> org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36) 
> [12]: at 
> org.apache.hadoop.fs.RetryingInputStream.read(RetryingInputStream.java:149) 
> [12]: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException

2024-01-09 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17332:

Description: 
In DFSInputStream#actualGetFromOneDataNode(), the exception stacktrace is sent 
to dfsClient.LOG whenever we fail on a DN. However, in most cases, the read 
request will be served successfully by reading from the next available DN. The 
presence of the exception stacktrace in the log has caused multiple Hadoop users 
at LinkedIn to treat this WARN message as the root cause/fatal error for their 
jobs. We would like to improve the log message and avoid sending the stacktrace 
to dfsClient.LOG when a read succeeds. The stacktrace from reading each DN is 
sent to the log only when we really need to fail a read request (when 
chooseDataNode()/refetchLocations() throws a BlockMissingException).

 

Example stack trace
{code:java}
[12]:23/11/30 23:01:33 WARN hdfs.DFSClient: Connection failure: Failed 
to connect to 10.150.91.13/10.150.91.13:71 for file 
//part--95b9909c-zzz-c000.avro for block 
BP-364971551-DatanodeIP-1448516588954:blk__129864739321:java.net.SocketTimeoutException:
 6 millis timeout while waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/ip:40492 remote=datanodeIP:71] 
[12]:java.net.SocketTimeoutException: 6 millis timeout while 
waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/localIp:40492 
remote=datanodeIP:71] [12]: at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) 
[12]: at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) 
[12]: at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) 
[12]: at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) 
[12]: at java.io.FilterInputStream.read(FilterInputStream.java:83) 
[12]: at 
org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:458)
 [12]: at 
org.apache.hadoop.hdfs.client.impl.BlockReaderRemote2.newBlockReader(BlockReaderRemote2.java:412)
 [12]: at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:864)
 [12]: at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
 [12]: at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:387)
 [12]: at 
org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:736) 
[12]: at 
org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1268)
 [12]: at 
org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1216)
 [12]: at 
org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1608) 
[12]: at 
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1568) 
[12]: at 
org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
[12]: at 
hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.lambda$read$0(InstrumentedFSDataInputStream.java:108)
 [12]: at 
com.linkedin.hadoop.metrics.fs.PerformanceTrackingFSDataInputStream.process(PerformanceTrackingFSDataInputStream.java:39)
 [12]: at 
hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.read(InstrumentedFSDataInputStream.java:108)
 [12]: at 
org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
[12]: at 
org.apache.hadoop.fs.RetryingInputStream.lambda$read$2(RetryingInputStream.java:153)
 [12]: at 
org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36) [12]: 
at org.apache.hadoop.fs.RetryingInputStream.read(RetryingInputStream.java:149) 
[12]: at 
org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93){code}

  was:
In DFSInputStream#actualGetFromOneDataNode(), the exception stacktrace is sent 
to dfsClient.LOG whenever we fail on a DN. However, in most cases, the read 
request will be served successfully by reading from the next available DN. The 
presence of the exception stacktrace in the log has caused multiple Hadoop users 
at LinkedIn to treat this WARN message as the root cause/fatal error for their 
jobs. We would like to improve the log message and avoid sending the stacktrace 
to dfsClient.LOG when a read succeeds. The stacktrace from reading each DN is 
sent to the log only when we really need to fail a read request (when 
chooseDataNode()/refetchLocations() throws a BlockMissingException).

 

Example stack trace
{code:java}
[12]:23/11/30 23:01:33 WARN hdfs.DFSClient: Connection failure: Failed 
to connect to 10.150.91.13/10.150.91.13:71 for file 
/jobs/kgemb/holistic/dev/ywang11/pcv2/runs/2850541/artifacts/jobAction-train-importer/featurized_dataset/part-109247-95b9909c-b6ab-41aa-bb87-7e76f4aad35f-c000.avro
 for block 

[jira] [Updated] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException

2024-01-09 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17332:

Description: 
In DFSInputStream#actualGetFromOneDataNode(), the exception stacktrace is sent 
to dfsClient.LOG whenever we fail on a DN. However, in most cases, the read 
request will be served successfully by reading from the next available DN. The 
presence of the exception stacktrace in the log has caused multiple Hadoop users 
at LinkedIn to treat this WARN message as the root cause/fatal error for their 
jobs. We would like to improve the log message and avoid sending the stacktrace 
to dfsClient.LOG when a read succeeds. The stacktrace from reading each DN is 
sent to the log only when we really need to fail a read request (when 
chooseDataNode()/refetchLocations() throws a BlockMissingException).

 

Example stack trace
{code:java}
[12]:23/11/30 23:01:33 WARN hdfs.DFSClient: Connection failure: Failed 
to connect to 10.150.91.13/10.150.91.13:71 for file 
/jobs/kgemb/holistic/dev/ywang11/pcv2/runs/2850541/artifacts/jobAction-train-importer/featurized_dataset/part-109247-95b9909c-b6ab-41aa-bb87-7e76f4aad35f-c000.avro
 for block 
BP-364971551-10.150.4.19-1448516588954:blk_130854761734_129864739321:java.net.SocketTimeoutException:
 6 millis timeout while waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/100.101.37.108:40492 
remote=10.150.91.13/10.150.91.13:71] 
[12]:java.net.SocketTimeoutException: 6 millis timeout while 
waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/100.101.37.108:40492 
remote=10.150.91.13/10.150.91.13:71] [12]: at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) 
[12]: at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) 
[12]: at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) 
[12]: at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) 
[12]: at java.io.FilterInputStream.read(FilterInputStream.java:83) 
[12]: at 
org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:458)
 [12]: at 
org.apache.hadoop.hdfs.client.impl.BlockReaderRemote2.newBlockReader(BlockReaderRemote2.java:412)
 [12]: at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:864)
 [12]: at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
 [12]: at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:387)
 [12]: at 
org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:736) 
[12]: at 
org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1268)
 [12]: at 
org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1216)
 [12]: at 
org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1608) 
[12]: at 
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1568) 
[12]: at 
org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
[12]: at 
hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.lambda$read$0(InstrumentedFSDataInputStream.java:108)
 [12]: at 
com.linkedin.hadoop.metrics.fs.PerformanceTrackingFSDataInputStream.process(PerformanceTrackingFSDataInputStream.java:39)
 [12]: at 
hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.read(InstrumentedFSDataInputStream.java:108)
 [12]: at 
org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
[12]: at 
org.apache.hadoop.fs.RetryingInputStream.lambda$read$2(RetryingInputStream.java:153)
 [12]: at 
org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36) [12]: 
at org.apache.hadoop.fs.RetryingInputStream.read(RetryingInputStream.java:149) 
[12]: at 
org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93){code}

  was:In DFSInputStream#actualGetFromOneDataNode(), the exception stacktrace is 
sent to dfsClient.LOG whenever we fail on a DN. However, in most cases, the read 
request will be served successfully by reading from the next available DN. The 
presence of the exception stacktrace in the log has caused multiple Hadoop users 
at LinkedIn to treat this WARN message as the root cause/fatal error for their 
jobs. We would like to improve the log message and avoid sending the stacktrace 
to dfsClient.LOG when a read succeeds. The stacktrace from reading each DN is 
sent to the log only when we really need to fail a read request (when 
chooseDataNode()/refetchLocations() throws a BlockMissingException).


> DFSInputStream: avoid logging stacktrace until when we really need to fail a 
> read request with a MissingBlockException
> 

[jira] [Assigned] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException

2024-01-09 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin reassigned HDFS-17332:
---

Assignee: Xing Lin

> DFSInputStream: avoid logging stacktrace until when we really need to fail a 
> read request with a MissingBlockException
> --
>
> Key: HDFS-17332
> URL: https://issues.apache.org/jira/browse/HDFS-17332
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>
> In DFSInputStream#actualGetFromOneDataNode(), the exception stacktrace is sent 
> to dfsClient.LOG whenever we fail on a DN. However, in most cases, the read 
> request will be served successfully by reading from the next available DN. The 
> presence of the exception stacktrace in the log has caused multiple Hadoop 
> users at LinkedIn to treat this WARN message as the root cause/fatal error for 
> their jobs. We would like to improve the log message and avoid sending the 
> stacktrace to dfsClient.LOG when a read succeeds. The stacktrace from reading 
> each DN is sent to the log only when we really need to fail a read request 
> (when chooseDataNode()/refetchLocations() throws a BlockMissingException).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException

2024-01-09 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17332:

Environment: (was: In DFSInputStream#actualGetFromOneDataNode(), the exception 
stacktrace is sent to dfsClient.LOG whenever we fail on a DN. However, in most 
cases, the read request will be served successfully by reading from the next 
available DN. The presence of the exception stacktrace in the log has caused 
multiple Hadoop users at LinkedIn to treat this WARN message as the root 
cause/fatal error for their jobs. We would like to improve the log message and 
avoid sending the stacktrace to dfsClient.LOG when a read succeeds. The 
stacktrace from reading each DN is sent to the log only when we really need to 
fail a read request (when chooseDataNode()/refetchLocations() throws a 
BlockMissingException).)

> DFSInputStream: avoid logging stacktrace until when we really need to fail a 
> read request with a MissingBlockException
> --
>
> Key: HDFS-17332
> URL: https://issues.apache.org/jira/browse/HDFS-17332
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Priority: Minor
>
> In DFSInputStream#actualGetFromOneDataNode(), the exception stacktrace is sent 
> to dfsClient.LOG whenever we fail on a DN. However, in most cases, the read 
> request will be served successfully by reading from the next available DN. The 
> presence of the exception stacktrace in the log has caused multiple Hadoop 
> users at LinkedIn to treat this WARN message as the root cause/fatal error for 
> their jobs. We would like to improve the log message and avoid sending the 
> stacktrace to dfsClient.LOG when a read succeeds. The stacktrace from reading 
> each DN is sent to the log only when we really need to fail a read request 
> (when chooseDataNode()/refetchLocations() throws a BlockMissingException).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException

2024-01-09 Thread Xing Lin (Jira)
Xing Lin created HDFS-17332:
---

 Summary: DFSInputStream: avoid logging stacktrace until when we 
really need to fail a read request with a MissingBlockException
 Key: HDFS-17332
 URL: https://issues.apache.org/jira/browse/HDFS-17332
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
 Environment: In DFSInputStream#actualGetFromOneDataNode(), the exception 
stacktrace is sent to dfsClient.LOG whenever we fail on a DN. However, in most 
cases, the read request will be served successfully by reading from the next 
available DN. The presence of the exception stacktrace in the log has caused 
multiple Hadoop users at LinkedIn to treat this WARN message as the root 
cause/fatal error for their jobs. We would like to improve the log message and 
avoid sending the stacktrace to dfsClient.LOG when a read succeeds. The 
stacktrace from reading each DN is sent to the log only when we really need to 
fail a read request (when chooseDataNode()/refetchLocations() throws a 
BlockMissingException).
Reporter: Xing Lin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException

2024-01-09 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17332:

Description: In DFSInputStream#actualGetFromOneDataNode(), the exception 
stacktrace is sent to dfsClient.LOG whenever we fail on a DN. However, in most 
cases, the read request will be served successfully by reading from the next 
available DN. The presence of the exception stacktrace in the log has caused 
multiple Hadoop users at LinkedIn to treat this WARN message as the root 
cause/fatal error for their jobs. We would like to improve the log message and 
avoid sending the stacktrace to dfsClient.LOG when a read succeeds. The 
stacktrace from reading each DN is sent to the log only when we really need to 
fail a read request (when chooseDataNode()/refetchLocations() throws a 
BlockMissingException).

> DFSInputStream: avoid logging stacktrace until when we really need to fail a 
> read request with a MissingBlockException
> --
>
> Key: HDFS-17332
> URL: https://issues.apache.org/jira/browse/HDFS-17332
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
> Environment: In DFSInputStream#actualGetFromOneDataNode(), the exception 
> stacktrace is sent to dfsClient.LOG whenever we fail on a DN. However, in most 
> cases, the read request will be served successfully by reading from the next 
> available DN. The presence of the exception stacktrace in the log has caused 
> multiple Hadoop users at LinkedIn to treat this WARN message as the root 
> cause/fatal error for their jobs. We would like to improve the log message and 
> avoid sending the stacktrace to dfsClient.LOG when a read succeeds. The 
> stacktrace from reading each DN is sent to the log only when we really need to 
> fail a read request (when chooseDataNode()/refetchLocations() throws a 
> BlockMissingException).
>Reporter: Xing Lin
>Priority: Minor
>
> In DFSInputStream#actualGetFromOneDataNode(), the exception stacktrace is sent 
> to dfsClient.LOG whenever we fail on a DN. However, in most cases, the read 
> request will be served successfully by reading from the next available DN. The 
> presence of the exception stacktrace in the log has caused multiple Hadoop 
> users at LinkedIn to treat this WARN message as the root cause/fatal error for 
> their jobs. We would like to improve the log message and avoid sending the 
> stacktrace to dfsClient.LOG when a read succeeds. The stacktrace from reading 
> each DN is sent to the log only when we really need to fail a read request 
> (when chooseDataNode()/refetchLocations() throws a BlockMissingException).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS

2023-12-18 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17286:

Attachment: active.png
observer.png

> Add UDP as a transfer protocol for HDFS
> ---
>
> Key: HDFS-17286
> URL: https://issues.apache.org/jira/browse/HDFS-17286
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Priority: Major
> Attachments: active.png, observer.png
>
>
> Right now, every connection in HDFS is based on RPC/IPC, which is built on 
> TCP. Connections are re-used based on ConnectionID, which includes RpcTimeout 
> as part of the key identifying a connection. The consequence is that if we 
> want to use a different RPC timeout between two hosts, this creates separate 
> TCP connections. 
> A use case which motivated us to consider UDP is getHAServiceState() in 
> ObserverReadProxyProvider. We'd like getHAServiceState() to time out with a 
> much smaller timeout threshold and move on to probe the next NameNode. To 
> support this, we used an ExecutorService and set a timeout for the task in 
> HDFS-17030. This implementation can be improved by using UDP to query the 
> HAServiceState. getHAServiceState() does not have to be very reliable, as we 
> can always fall back to the active.
> Another motivation is that roughly 5~10% of the RPC calls hitting our 
> active/observer NameNodes are GetHAServiceState(). If we can move them off to 
> a UDP server, that can hopefully improve RPC latency.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS

2023-12-18 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17286:

Attachment: (was: active.png)

> Add UDP as a transfer protocol for HDFS
> ---
>
> Key: HDFS-17286
> URL: https://issues.apache.org/jira/browse/HDFS-17286
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Priority: Major
>
> Right now, every connection in HDFS is based on RPC/IPC, which is built on 
> TCP. Connections are re-used based on ConnectionID, which includes RpcTimeout 
> as part of the key identifying a connection. The consequence is that if we 
> want to use a different RPC timeout between two hosts, this creates separate 
> TCP connections. 
> A use case which motivated us to consider UDP is getHAServiceState() in 
> ObserverReadProxyProvider. We'd like getHAServiceState() to time out with a 
> much smaller timeout threshold and move on to probe the next NameNode. To 
> support this, we used an ExecutorService and set a timeout for the task in 
> HDFS-17030. This implementation can be improved by using UDP to query the 
> HAServiceState. getHAServiceState() does not have to be very reliable, as we 
> can always fall back to the active.
> Another motivation is that roughly 5~10% of the RPC calls hitting our 
> active/observer NameNodes are GetHAServiceState(). If we can move them off to 
> a UDP server, that can hopefully improve RPC latency.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS

2023-12-18 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17286:

Attachment: (was: Observer.png)

> Add UDP as a transfer protocol for HDFS
> ---
>
> Key: HDFS-17286
> URL: https://issues.apache.org/jira/browse/HDFS-17286
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Priority: Major
>
> Right now, every connection in HDFS is based on RPC/IPC, which is built on 
> TCP. Connections are re-used based on ConnectionID, which includes RpcTimeout 
> as part of the key identifying a connection. The consequence is that if we 
> want to use a different RPC timeout between two hosts, this creates separate 
> TCP connections. 
> A use case which motivated us to consider UDP is getHAServiceState() in 
> ObserverReadProxyProvider. We'd like getHAServiceState() to time out with a 
> much smaller timeout threshold and move on to probe the next NameNode. To 
> support this, we used an ExecutorService and set a timeout for the task in 
> HDFS-17030. This implementation can be improved by using UDP to query the 
> HAServiceState. getHAServiceState() does not have to be very reliable, as we 
> can always fall back to the active.
> Another motivation is that roughly 5~10% of the RPC calls hitting our 
> active/observer NameNodes are GetHAServiceState(). If we can move them off to 
> a UDP server, that can hopefully improve RPC latency.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS

2023-12-12 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17286:

Attachment: Screenshot 2023-12-12 at 9.31.58 AM.png

> Add UDP as a transfer protocol for HDFS
> ---
>
> Key: HDFS-17286
> URL: https://issues.apache.org/jira/browse/HDFS-17286
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Priority: Major
>
> Right now, every connection in HDFS is based on RPC/IPC, which is built on 
> TCP. Connections are re-used based on ConnectionID, which includes RpcTimeout 
> as part of the key identifying a connection. The consequence is that if we 
> want to use a different RPC timeout between two hosts, this creates separate 
> TCP connections. 
> A use case which motivated us to consider UDP is getHAServiceState() in 
> ObserverReadProxyProvider. We'd like getHAServiceState() to time out with a 
> much smaller timeout threshold and move on to probe the next NameNode. To 
> support this, we used an ExecutorService and set a timeout for the task in 
> HDFS-17030. This implementation can be improved by using UDP to query the 
> HAServiceState. getHAServiceState() does not have to be very reliable, as we 
> can always fall back to the active.
> Another motivation is that roughly 5~10% of the RPC calls hitting our 
> active/observer NameNodes are GetHAServiceState(). If we can move them off to 
> a UDP server, that can hopefully improve RPC latency.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS

2023-12-12 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17286:

Attachment: (was: Screenshot 2023-12-12 at 9.32.15 AM.png)

> Add UDP as a transfer protocol for HDFS
> ---
>
> Key: HDFS-17286
> URL: https://issues.apache.org/jira/browse/HDFS-17286
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Priority: Major
>
> Right now, every connection in HDFS is based on RPC/IPC, which is built on 
> TCP. Connections are re-used based on ConnectionID, which includes RpcTimeout 
> as part of the key identifying a connection. The consequence is that if we 
> want to use a different RPC timeout between two hosts, this creates separate 
> TCP connections. 
> A use case which motivated us to consider UDP is getHAServiceState() in 
> ObserverReadProxyProvider. We'd like getHAServiceState() to time out with a 
> much smaller timeout threshold and move on to probe the next NameNode. To 
> support this, we used an ExecutorService and set a timeout for the task in 
> HDFS-17030. This implementation can be improved by using UDP to query the 
> HAServiceState. getHAServiceState() does not have to be very reliable, as we 
> can always fall back to the active.
> Another motivation is that roughly 5~10% of the RPC calls hitting our 
> active/observer NameNodes are GetHAServiceState(). If we can move them off to 
> a UDP server, that can hopefully improve RPC latency.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS

2023-12-12 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17286:

Attachment: Observer.png

> Add UDP as a transfer protocol for HDFS
> ---
>
> Key: HDFS-17286
> URL: https://issues.apache.org/jira/browse/HDFS-17286
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Priority: Major
> Attachments: Observer.png, active.png
>
>
> Right now, every connection in HDFS is based on RPC/IPC, which is built on 
> TCP. Connections are re-used based on ConnectionID, which includes RpcTimeout 
> as part of the key identifying a connection. The consequence is that if we 
> want to use a different RPC timeout between two hosts, this creates separate 
> TCP connections. 
> A use case which motivated us to consider UDP is getHAServiceState() in 
> ObserverReadProxyProvider. We'd like getHAServiceState() to time out with a 
> much smaller timeout threshold and move on to probe the next NameNode. To 
> support this, we used an ExecutorService and set a timeout for the task in 
> HDFS-17030. This implementation can be improved by using UDP to query the 
> HAServiceState. getHAServiceState() does not have to be very reliable, as we 
> can always fall back to the active.
> Another motivation is that roughly 5~10% of the RPC calls hitting our 
> active/observer NameNodes are GetHAServiceState(). If we can move them off to 
> a UDP server, that can hopefully improve RPC latency.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS

2023-12-12 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17286:

Attachment: active.png

> Add UDP as a transfer protocol for HDFS
> ---
>
> Key: HDFS-17286
> URL: https://issues.apache.org/jira/browse/HDFS-17286
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Priority: Major
> Attachments: Observer.png, active.png
>
>
> Right now, every connection in HDFS is based on RPC/IPC, which is built on 
> TCP. Connections are re-used based on ConnectionID, which includes RpcTimeout 
> as part of the key identifying a connection. The consequence is that if we 
> want to use a different RPC timeout between two hosts, this creates separate 
> TCP connections. 
> A use case which motivated us to consider UDP is getHAServiceState() in 
> ObserverReadProxyProvider. We'd like getHAServiceState() to time out with a 
> much smaller timeout threshold and move on to probe the next NameNode. To 
> support this, we used an ExecutorService and set a timeout for the task in 
> HDFS-17030. This implementation can be improved by using UDP to query the 
> HAServiceState. getHAServiceState() does not have to be very reliable, as we 
> can always fall back to the active.
> Another motivation is that roughly 5~10% of the RPC calls hitting our 
> active/observer NameNodes are GetHAServiceState(). If we can move them off to 
> a UDP server, that can hopefully improve RPC latency.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17286) Add UDP as a transfer protocol for HDFS

2023-12-12 Thread Xing Lin (Jira)
Xing Lin created HDFS-17286:
---

 Summary: Add UDP as a transfer protocol for HDFS
 Key: HDFS-17286
 URL: https://issues.apache.org/jira/browse/HDFS-17286
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Reporter: Xing Lin


Right now, every connection in HDFS is based on RPC/IPC, which is built on TCP. 
Connections are re-used based on ConnectionID, which includes RpcTimeout as part 
of the key identifying a connection. The consequence is that if we want to use a 
different RPC timeout between two hosts, this creates separate TCP connections. 

A use case which motivated us to consider UDP is getHAServiceState() in 
ObserverReadProxyProvider. We'd like getHAServiceState() to time out with a much 
smaller timeout threshold and move on to probe the next NameNode. To support 
this, we used an ExecutorService and set a timeout for the task in HDFS-17030. 
This implementation can be improved by using UDP to query the HAServiceState. 
getHAServiceState() does not have to be very reliable, as we can always fall 
back to the active.

Another motivation is that roughly 5~10% of the RPC calls hitting our 
active/observer NameNodes are GetHAServiceState(). If we can move them off to a 
UDP server, that can hopefully improve RPC latency.
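To make the getHAServiceState() idea above concrete, here is an illustrative Java sketch of a UDP probe with a small timeout that simply gives up (and lets the caller fall back to the active) when no reply arrives. The UDP port and the request/response strings are invented; HDFS-17286 does not define a wire format here:

{code:java}
// Illustration only: probe a NameNode's HA state over UDP with a short timeout.
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

final class HaStateUdpProbe {
  /** Returns e.g. "active"/"observer"/"standby", or null if the probe timed out. */
  static String probe(InetSocketAddress nn, int timeoutMs) throws IOException {
    try (DatagramSocket socket = new DatagramSocket()) {
      socket.setSoTimeout(timeoutMs);          // much smaller than the RPC timeout
      byte[] req = "HASTATE?".getBytes(StandardCharsets.UTF_8);
      socket.send(new DatagramPacket(req, req.length, nn));
      byte[] buf = new byte[64];
      DatagramPacket resp = new DatagramPacket(buf, buf.length);
      socket.receive(resp);
      return new String(resp.getData(), 0, resp.getLength(), StandardCharsets.UTF_8);
    } catch (SocketTimeoutException e) {
      return null;  // lossy by design: the caller falls back to the active NN
    }
  }
}
{code}

Because a lost or late reply is treated the same as a slow NameNode, the probe can use an aggressive timeout without affecting correctness, which is why a best-effort transport fits this call.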

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS

2023-12-12 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17286:

Attachment: Screenshot 2023-12-12 at 9.32.15 AM.png

> Add UDP as a transfer protocol for HDFS
> ---
>
> Key: HDFS-17286
> URL: https://issues.apache.org/jira/browse/HDFS-17286
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Priority: Major
>
> Right now, every connection in HDFS is based on RPC/IPC, which is built on 
> TCP. Connections are re-used based on ConnectionID, which includes RpcTimeout 
> as part of the key identifying a connection. The consequence is that if we 
> want to use a different RPC timeout between two hosts, this creates separate 
> TCP connections. 
> A use case which motivated us to consider UDP is getHAServiceState() in 
> ObserverReadProxyProvider. We'd like getHAServiceState() to time out with a 
> much smaller timeout threshold and move on to probe the next NameNode. To 
> support this, we used an ExecutorService and set a timeout for the task in 
> HDFS-17030. This implementation can be improved by using UDP to query the 
> HAServiceState. getHAServiceState() does not have to be very reliable, as we 
> can always fall back to the active.
> Another motivation is that roughly 5~10% of the RPC calls hitting our 
> active/observer NameNodes are GetHAServiceState(). If we can move them off to 
> a UDP server, that can hopefully improve RPC latency.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS

2023-12-12 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17286:

Attachment: (was: Screenshot 2023-12-12 at 9.31.58 AM.png)

> Add UDP as a transfer protocol for HDFS
> ---
>
> Key: HDFS-17286
> URL: https://issues.apache.org/jira/browse/HDFS-17286
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Priority: Major
>
> Right now, every connection in HDFS is based on RPC/IPC, which is built on 
> TCP. Connections are re-used based on ConnectionID, which includes RpcTimeout 
> as part of the key identifying a connection. The consequence is that if we 
> want to use a different RPC timeout between two hosts, this creates separate 
> TCP connections. 
> A use case which motivated us to consider UDP is getHAServiceState() in 
> ObserverReadProxyProvider. We'd like getHAServiceState() to time out with a 
> much smaller timeout threshold and move on to probe the next NameNode. To 
> support this, we used an ExecutorService and set a timeout for the task in 
> HDFS-17030. This implementation can be improved by using UDP to query the 
> HAServiceState. getHAServiceState() does not have to be very reliable, as we 
> can always fall back to the active.
> Another motivation is that roughly 5~10% of the RPC calls hitting our 
> active/observer NameNodes are GetHAServiceState(). If we can move them off to 
> a UDP server, that can hopefully improve RPC latency.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17281) Added support of reporting RPC round-trip time at NN.

2023-12-08 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17281:

Description: 
We have come across a few cases where the hdfs clients are reporting very bad 
latencies, while we don't see similar trends at NN-side. Instead, from NN-side, 
the latency metrics seem normal as usual. I attached a screenshot which we took 
during an internal investigation at LinkedIn. What was happening is a token 
management service was reporting an average latency of 1 sec in fetching 
delegation tokens from our NN but at the NN-side, we did not see anything 
abnormal. The recent OverallRpcProcessingTime metric we added in HDFS-17042 did 
not seem to be sufficient to identify/signal such cases. 

We propose to extend the IPC header in hadoop, to communicate call create time 
at client-side to IPC servers, so that for each rpc call, the server can get 
its round-trip time.

 

*Why is OverallRpcProcessingTime not sufficient?*

OverallRpcProcessingTime captures the time starting from when the reader thread 
reads in the call from the socket to when the response is sent back to the 
client. As a result, it does not capture the time it takes to transmit the call 
from client to the server. Besides, we only have a couple of reader threads to 
monitor a large number of open connections. It is possible that many 
connections become ready to read at the same time. Then, the reader thread 
would need to read each call sequentially, leading to a wait time for many Rpc 
Calls. We have also hit the case where the callQueue becomes full (with a total 
of 25600 requests) and thus reader threads are blocked to add new Calls into 
the callQueue. This would lead to a longer latency for all connections/calls 
which are ready and wait to be read by reader threads. 

Ideally, we want to measure the time between when a socket/call is ready to 
read and when it is actually being read by the reader thread. This would give 
us the wait time that a call is taking to be read. However, after some Google 
search, we failed to find a way to get this. 
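As a sketch of what the extended header enables (not the actual Hadoop IPC change; the field name is hypothetical, and client/server clock skew has to be handled in practice, e.g. by relying on NTP-synced hosts), the server-side computation is essentially:

{code:java}
// Hypothetical sketch: the client stamps each call with its creation time and
// the (extended) IPC request header carries it to the server.
final class RpcRoundTripTimeSketch {
  /** Invoked by the server right before the response for a call is sent. */
  static long roundTripMillis(long clientCallCreateTimeMs) {
    long nowMs = System.currentTimeMillis();
    // Unlike OverallRpcProcessingTime, this span also covers the client->server
    // transfer and the time the call waited for a reader thread / in the callQueue.
    return Math.max(0, nowMs - clientCallCreateTimeMs);
  }
}
{code}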

  was:
We have come across a few cases where HDFS clients report very bad latencies, 
while we don't see similar trends on the NN side; from the NN side, the latency 
metrics look as normal as usual. I attached a screenshot which we took during an 
internal investigation at LinkedIn: a token management service was reporting an 
average latency of 1 sec when fetching delegation tokens from our NN, but on the 
NN side we did not see anything abnormal.

In HDFS-17042, we added OverallRpcProcessingTime for each RPC method. This 
metric measures the time from when the reader thread reads a call off the socket 
connection to when the response for the call is sent back to the client. It is 
supposed to be a reliable signal for the RPC latency HDFS clients experience 
(overallProcessingTime + network transfer latency == client latency).


> Added support of reporting RPC round-trip time at NN.
> -
>
> Key: HDFS-17281
> URL: https://issues.apache.org/jira/browse/HDFS-17281
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
> Attachments: Screenshot 2023-10-28 at 10.26.41 PM.png
>
>
> We have come across a few cases where HDFS clients report very bad latencies, 
> while we don't see similar trends on the NN side; from the NN side, the 
> latency metrics look as normal as usual. I attached a screenshot which we took 
> during an internal investigation at LinkedIn: a token management service was 
> reporting an average latency of 1 sec when fetching delegation tokens from our 
> NN, but on the NN side we did not see anything abnormal. The 
> OverallRpcProcessingTime metric we recently added in HDFS-17042 did not seem 
> to be sufficient to identify/signal such cases. 
> We propose to extend the IPC header in Hadoop to communicate the call creation 
> time at the client side to IPC servers, so that for each RPC call the server 
> can compute its round-trip time.
>  
> *Why is OverallRpcProcessingTime not sufficient?*
> OverallRpcProcessingTime captures the time from when the reader thread reads 
> the call off the socket to when the response is sent back to the client. As a 
> result, it does not capture the time it takes to transmit the call from the 
> client to the server. Besides, we only have a couple of reader threads 
> monitoring a large number of open connections. It is possible that many 
> connections become ready to read at the same time; the reader thread then has 
> to read each call sequentially, adding a wait time for many RPC calls. We have 
> also hit the case 

[jira] [Updated] (HDFS-17281) Added support of reporting RPC round-trip time at NN.

2023-12-08 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17281:

Description: 
We have come across a few cases where HDFS clients report very bad latencies, 
while we don't see similar trends on the NN side; from the NN side, the latency 
metrics look as normal as usual. I attached a screenshot which we took during an 
internal investigation at LinkedIn: a token management service was reporting an 
average latency of 1 sec when fetching delegation tokens from our NN, but on the 
NN side we did not see anything abnormal.

In HDFS-17042, we added OverallRpcProcessingTime for each RPC method. This 
metric measures the time from when the reader thread reads a call off the socket 
connection to when the response for the call is sent back to the client. It is 
supposed to be a reliable signal for the RPC latency HDFS clients experience 
(overallProcessingTime + network transfer latency == client latency).

  was:
We have come across a few cases where HDFS clients report very bad latencies, 
while we don't see similar trends on the NN side; from the NN side, the latency 
metrics look as normal as usual.

In HDFS-17042, we added OverallRpcProcessingTime for each RPC method. This 
metric measures the time from when the reader thread reads a call off the socket 
connection to when the response for the call is sent back to the client. It is 
supposed to be a reliable signal for the RPC latency HDFS clients experience 
(overallProcessingTime + network transfer latency == client latency).


> Added support of reporting RPC round-trip time at NN.
> -
>
> Key: HDFS-17281
> URL: https://issues.apache.org/jira/browse/HDFS-17281
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
> Attachments: Screenshot 2023-10-28 at 10.26.41 PM.png
>
>
> We have come across a few cases where HDFS clients report very bad latencies, 
> while we don't see similar trends on the NN side; from the NN side, the 
> latency metrics look as normal as usual. I attached a screenshot which we took 
> during an internal investigation at LinkedIn: a token management service was 
> reporting an average latency of 1 sec when fetching delegation tokens from our 
> NN, but on the NN side we did not see anything abnormal.
>  
> In HDFS-17042, we added OverallRpcProcessingTime for each RPC method. This 
> metric measures the time from when the reader thread reads a call off the 
> socket connection to when the response for the call is sent back to the 
> client. It is supposed to be a reliable signal for the RPC latency HDFS 
> clients experience (overallProcessingTime + network transfer latency == client 
> latency).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17281) Added support of reporting RPC round-trip time at NN.

2023-12-08 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17281:

Attachment: Screenshot 2023-10-28 at 10.26.41 PM.png

> Added support of reporting RPC round-trip time at NN.
> -
>
> Key: HDFS-17281
> URL: https://issues.apache.org/jira/browse/HDFS-17281
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
> Attachments: Screenshot 2023-10-28 at 10.26.41 PM.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17281) Added support of reporting RPC round-trip time at NN.

2023-12-08 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17281:

Description: 
We have come across a few cases where HDFS clients report very bad latencies, 
while we don't see similar trends on the NN side; from the NN side, the latency 
metrics look as normal as usual.

In HDFS-17042, we added OverallRpcProcessingTime for each RPC method. This 
metric measures the time from when the reader thread reads a call off the socket 
connection to when the response for the call is sent back to the client. It is 
supposed to be a reliable signal for the RPC latency HDFS clients experience 
(overallProcessingTime + network transfer latency == client latency).

> Added support of reporting RPC round-trip time at NN.
> -
>
> Key: HDFS-17281
> URL: https://issues.apache.org/jira/browse/HDFS-17281
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
> Attachments: Screenshot 2023-10-28 at 10.26.41 PM.png
>
>
> We have come across a few cases where HDFS clients report very bad latencies, 
> while we don't see similar trends on the NN side; from the NN side, the 
> latency metrics look as normal as usual.
>  
> In HDFS-17042, we added OverallRpcProcessingTime for each RPC method. This 
> metric measures the time from when the reader thread reads a call off the 
> socket connection to when the response for the call is sent back to the 
> client. It is supposed to be a reliable signal for the RPC latency HDFS 
> clients experience (overallProcessingTime + network transfer latency == client 
> latency).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17281) Added support of reporting RPC round-trip time at NN.

2023-12-08 Thread Xing Lin (Jira)
Xing Lin created HDFS-17281:
---

 Summary: Added support of reporting RPC round-trip time at NN.
 Key: HDFS-17281
 URL: https://issues.apache.org/jira/browse/HDFS-17281
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Reporter: Xing Lin
Assignee: Xing Lin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17262) Transfer rate metric warning log is too verbose

2023-11-21 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin reassigned HDFS-17262:
---

Assignee: Xing Lin

> Transfer rate metric warning log is too verbose
> ---
>
> Key: HDFS-17262
> URL: https://issues.apache.org/jira/browse/HDFS-17262
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Bryan Beaudreault
>Assignee: Xing Lin
>Priority: Major
>  Labels: pull-request-available
>
> HDFS-16917 added a LOG.warn when the passed duration is 0. The unit for 
> duration is millis, and it's very possible for a read to take less than a 
> millisecond over a local TCP connection. We are seeing this spam multiple 
> times per millisecond. There's another report on the PR for HDFS-16917.
> Please downgrade the log to debug or remove it.
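A minimal sketch of the suggested mitigation, with invented names (the real metric presumably lives in the DataNode read path; this only shows the guard-and-downgrade shape):

{code:java}
// Sub-millisecond local reads legitimately produce duration == 0, so guard the
// computation and log at DEBUG instead of WARN to avoid the spam.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class TransferRateSketch {
  private static final Logger LOG = LoggerFactory.getLogger(TransferRateSketch.class);

  static void maybeRecordTransferRate(long bytes, long durationMs) {
    if (durationMs <= 0) {
      LOG.debug("Not recording transfer rate: {} bytes in {} ms", bytes, durationMs);
      return;
    }
    LOG.debug("Transfer rate: {} bytes/s", bytes * 1000 / durationMs);
  }
}
{code}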



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17262) Transfer rate metric warning log is too verbose

2023-11-21 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17788577#comment-17788577
 ] 

Xing Lin commented on HDFS-17262:
-

[~rdingankar] is out until early Dec. I pushed out a PR on his behalf. 

> Transfer rate metric warning log is too verbose
> ---
>
> Key: HDFS-17262
> URL: https://issues.apache.org/jira/browse/HDFS-17262
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Bryan Beaudreault
>Priority: Major
>  Labels: pull-request-available
>
> HDFS-16917 added a LOG.warn when the passed duration is 0. The unit for 
> duration is millis, and it's very possible for a read to take less than a 
> millisecond over a local TCP connection. We are seeing this spam multiple 
> times per millisecond. There's another report on the PR for HDFS-16917.
> Please downgrade the log to debug or remove it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17231) HA: Safemode should exit when resources are from low to available

2023-10-21 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778222#comment-17778222
 ] 

Xing Lin commented on HDFS-17231:
-

Hi [~kuper], 

Please assign this Jira to yourself since you have worked on it.
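For context, a rough sketch of the behaviour being requested here (hypothetical types; the real logic lives in NameNodeResourceMonitor/FSNamesystem): safe mode entered because of low resources should be left automatically once the resources recover, as described in the quoted report below.

{code:java}
// Hypothetical sketch only, not the actual patch.
public class ResourceSafeModeSketch {
  interface Namesystem {
    boolean nameDirsHaveSpace();
    boolean isInSafeMode();
    void enterSafeMode();
    void leaveSafeMode();
  }

  private boolean enteredForResources = false;

  /** One iteration of a resource-monitor style check. */
  void checkOnce(Namesystem ns) {
    if (!ns.nameDirsHaveSpace()) {
      if (!ns.isInSafeMode()) {
        ns.enterSafeMode();          // existing behaviour when disk space is low
        enteredForResources = true;
      }
    } else if (enteredForResources && ns.isInSafeMode()) {
      ns.leaveSafeMode();            // requested behaviour: exit once space recovers
      enteredForResources = false;
    }
  }
}
{code}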

> HA: Safemode should exit when resources are from low to available
> -
>
> Key: HDFS-17231
> URL: https://issues.apache.org/jira/browse/HDFS-17231
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha
>Affects Versions: 3.3.4, 3.3.6
>Reporter: kuper
>Priority: Major
>  Labels: pull-request-available
> Attachments: 企业微信截图_75d15d37-26b7-4d88-ac0c-8d77e358761b.png
>
>
> The NameNodeResourceMonitor automatically puts the NameNode into safe mode when 
> it detects that resources are insufficient. When ZKFC detects insufficient 
> resources, it triggers a failover. Consider the following scenario:
>  * Initially, nn01 is active and nn02 is standby. Due to insufficient 
> resources in dfs.namenode.name.dir, the NameNodeResourceMonitor detects the 
> resource issue and puts nn01 into safemode. Subsequently, zkfc triggers 
> failover.
>  * At this point, nn01 is in safemode (ON) and standby, while nn02 is in 
> safemode (OFF) and active.
>  * After a period of time, the resources in nn01's dfs.namenode.name.dir 
> recover, causing a slight instability and triggering failover again.
>  * Now, nn01 is in safe mode (ON) and active, while nn02 is in safe mode 
> (OFF) and standby.
>  * However, since nn01 is active but in safemode (ON), hdfs cannot be read 
> from or written to.
> !企业微信截图_75d15d37-26b7-4d88-ac0c-8d77e358761b.png!
> *reproduction*
>  # Increase dfs.namenode.resource.du.reserved.
>  # Increase ha.health-monitor.check-interval.ms so that nn01 does not switch to 
> standby (and stop the NameNodeResourceMonitor thread) right away; the 
> NameNodeResourceMonitor must have time to enter safe mode before the switch to 
> standby happens.
>  # On the active node nn01, use the dd command to create a file that exceeds the 
> threshold, triggering a low-available-disk-space condition.
>  # If the nn01 namenode process does not die, nn01 ends up in safe mode (ON) 
> while in the standby state.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17118) Fix minor checkstyle warnings in TestObserverReadProxyProvider

2023-07-24 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17118:

Description: 
We noticed a few checkstyle warnings when backporting HDFS-17030 from trunk to 
branch-3.3. The yetus build was not stable at that time and we did not notice 
the newly added checkstyle warnings.

PR for HDFS-17030 which has been merged into trunk: 
[https://github.com/apache/hadoop/pull/5700]

  was:
We noticed a few checkstyle warnings when backporting HDFS-17030 from trunk to 
branch-3.3. The yetus build was not stable at that time and we did not notice 
the newly added checkstyle warnings.

 

PR for HDFS-17030 which has been merged into trunk: 
[https://github.com/apache/hadoop/pull/5700]


> Fix minor checkstyle warnings in TestObserverReadProxyProvider
> --
>
> Key: HDFS-17118
> URL: https://issues.apache.org/jira/browse/HDFS-17118
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Trivial
>  Labels: pull-request-available
>
> We noticed a few checkstyle warnings when backporting HDFS-17030 from trunk 
> to branch-3.3. The yetus build was not stable at that time and we did not 
> notice the newly added checkstyle warnings.
> PR for HDFS-17030 which has been merged into trunk: 
> [https://github.com/apache/hadoop/pull/5700]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17118) Fix minor checkstyle warnings in TestObserverReadProxyProvider

2023-07-24 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17118:

Description: 
We noticed a few checkstyle warnings when backporting HDFS-17030 from trunk to 
branch-3.3. The yetus build was not stable at that time and we did not notice 
the newly added checkstyle warnings.

 

PR for HDFS-17030 which has been merged into trunk: 
[https://github.com/apache/hadoop/pull/5700]

  was:
We noticed a few checkstyle warnings when backporting HDFS-17030 from trunk to 
branch-3.3. The yetus build was not stable at that time and we did not notice 
the newly added checkstyle warnings.

 

PR merged into trunk: [https://github.com/apache/hadoop/pull/5700]


> Fix minor checkstyle warnings in TestObserverReadProxyProvider
> --
>
> Key: HDFS-17118
> URL: https://issues.apache.org/jira/browse/HDFS-17118
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Trivial
>  Labels: pull-request-available
>
> We noticed a few checkstyle warnings when backporting HDFS-17030 from trunk 
> to branch-3.3. The yetus build was not stable at that time and we did not 
> notice the newly added checkstyle warnings.
>  
> PR for HDFS-17030 which has been merged into trunk: 
> [https://github.com/apache/hadoop/pull/5700]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17118) Fix minor checkstyle warnings in TestObserverReadProxyProvider

2023-07-23 Thread Xing Lin (Jira)
Xing Lin created HDFS-17118:
---

 Summary: Fix minor checkstyle warnings in 
TestObserverReadProxyProvider
 Key: HDFS-17118
 URL: https://issues.apache.org/jira/browse/HDFS-17118
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 3.4.0
Reporter: Xing Lin


We noticed a few checkstyle warnings when backporting HDFS-17030 from trunk to 
branch-3.3. The yetus build was not stable at that time and we did not notice 
the newly added checkstyle warnings.

 

PR: https://github.com/apache/hadoop/pull/5700



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17118) Fix minor checkstyle warnings in TestObserverReadProxyProvider

2023-07-23 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17118:

Description: 
We noticed a few checkstyle warnings when backporting HDFS-17030 from trunk to 
branch-3.3. The yetus build was not stable at that time and we did not notice 
the newly added checkstyle warnings.

 

PR merged into trunk: [https://github.com/apache/hadoop/pull/5700]

  was:
We noticed a few checkstyle warnings when backporting HDFS-17030 from trunk to 
branch-3.3. The yetus build was not stable at that time and we did not notice 
the newly added checkstyle warnings.

 

PR: https://github.com/apache/hadoop/pull/5700


> Fix minor checkstyle warnings in TestObserverReadProxyProvider
> --
>
> Key: HDFS-17118
> URL: https://issues.apache.org/jira/browse/HDFS-17118
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Priority: Trivial
>
> We noticed a few checkstyle warnings when backporting HDFS-17030 from trunk 
> to branch-3.3. The yetus build was not stable at that time and we did not 
> notice the newly added checkstyle warnings.
>  
> PR merged into trunk: [https://github.com/apache/hadoop/pull/5700]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17118) Fix minor checkstyle warnings in TestObserverReadProxyProvider

2023-07-23 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin reassigned HDFS-17118:
---

Assignee: Xing Lin

> Fix minor checkstyle warnings in TestObserverReadProxyProvider
> --
>
> Key: HDFS-17118
> URL: https://issues.apache.org/jira/browse/HDFS-17118
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Trivial
>
> We noticed a few checkstyle warnings when backporting HDFS-17030 from trunk 
> to branch-3.3. The yetus build was not stable at that time and we did not 
> notice the newly added checkstyle warnings.
>  
> PR merged into trunk: [https://github.com/apache/hadoop/pull/5700]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744751#comment-17744751
 ] 

Xing Lin commented on HDFS-17093:
-

FYI, we set dfs.namenode.max.full.block.report.leases = 6, even though we are 
running clusters with about 10k DNs each.

> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> safe mode for a long time after restarting because of incomplete block 
> reporting.
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that: retrying the send when it fails is expected. 
> But on the namenode side, the logic is:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage is identified as having already reported (i.e., 
> storageInfo.getBlockReportCount() > 0), the lease is removed from the datanode, 
> so the second report attempt fails because it no longer holds a lease.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744749#comment-17744749
 ] 

Xing Lin commented on HDFS-17093:
-

{quote}[~xinglin] ,I think you modify some more reasonable, datanode separate 
disk operation should be processed in the final set to perform

blockReportLeaseManager. RemoveLease (node);
return ! node.hasStaleStorages();

This is all at the datanode level
{quote}
 

Not sure I understand what you mean here.

> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> safe mode for a long time after restarting because of incomplete block 
> reporting.
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that: retrying the send when it fails is expected. 
> But on the namenode side, the logic is:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage is identified as having already reported (i.e., 
> storageInfo.getBlockReportCount() > 0), the lease is removed from the datanode, 
> so the second report attempt fails because it no longer holds a lease.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-18 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744437#comment-17744437
 ] 

Xing Lin commented on HDFS-17093:
-

Hi [~yuyanlei],

Thanks for sharing! I don't fully understand how your PR is going to help.

Without your PR, when the NN receives the second FBR attempt from the same DN, the 
NN won't process these FBRs and will remove the lease from that DN. So, that 
DN won't be able to send more FBRs.

With your PR, though the NN won't remove the lease from that DN until it receives 
all 12 reports, the NN still will NOT process these FBRs, right?
 # DN wants to send 12 reports but only sent 1 report.
 # NN processes 1 report (then _storageInfo.getBlockReportCount() > 0_ will be 
true)
 # DN continues to send 12 reports to NN.
 # NN will simply discard these reports, because 
_storageInfo.getBlockReportCount() > 0_

If the change is something like the following, then the change would make more 
sense to me.
{code:java}
if (namesystem.isInStartupSafeMode()
&& !StorageType.PROVIDED.equals(storageInfo.getStorageType())
&& storageInfo.getBlockReportCount() > 0
+   && totalReportNum == currentReportNum) {
  blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
  + "discarded non-initial block report from {}"
  + " because namenode still in startup phase",
  strBlockReportId, fullBrLeaseId, nodeID);
  blockReportLeaseManager.removeLease(node);
  return !node.hasStaleStorages();
}
{code}

> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
> Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> safe mode for a long time after restarting because of incomplete block 
> reporting.
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that: retrying the send when it fails is expected. 
> But on the namenode side, the logic is:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage is identified as having already reported (i.e., 
> storageInfo.getBlockReportCount() > 0), the lease is removed from the datanode, 
> so the second report attempt fails because it no longer holds a lease.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17067) Use BlockingThreadPoolExecutorService for nnProbingThreadPool in ObserverReadProxy

2023-07-03 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17067:

Description: 
In HDFS-17030, we introduced an ExecutorService, to submit getHAServiceState() 
requests. We constructed the ExecutorService directly from a basic 
ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. Then, the 
core thread will be kept up and running even when the main thread exits. To fix 
it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we 
decide to directly use an existing executorService implementation 
(_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of 
setting _allowCoreThreadTimeOut_ and also allows setting the prefix for thread 
names.

{code:java}
  private final ExecutorService nnProbingThreadPool =
  new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
  new ArrayBlockingQueue(1024));
{code}

A second minor issue is that we did not shut down the executorService in close(). 
It is a minor issue as close() will only be called when the garbage collector 
starts to reclaim an ObserverReadProxyProvider object, not when there is no longer 
any reference to the ObserverReadProxyProvider object. The time between when an 
ObserverReadProxyProvider becomes dereferenced and when the garbage collector 
actually starts to reclaim that object is out of our control/undefined (unless the 
program is shut down with an explicit System.exit(1)).



  was:
In HDFS-17030, we introduced an ExecutorService, to submit getHAServiceState() 
requests. We constructed the ExecutorService directly from a basic 
ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. Then, the 
core thread will be kept up and running even when the main thread exits. To fix 
it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we 
decide to directly use an existing executorService implementation 
(_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of 
setting _allowCoreThreadTimeOut_ and also allows setting the thread prefix.

{code:java}
  private final ExecutorService nnProbingThreadPool =
  new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
  new ArrayBlockingQueue(1024));
{code}

A second minor issue is that we did not shut down the executorService in close(). 
It is a minor issue as close() will only be called when the garbage collector 
starts to reclaim an ObserverReadProxyProvider object, not when there is no longer 
any reference to the ObserverReadProxyProvider object. The time between when an 
ObserverReadProxyProvider becomes dereferenced and when the garbage collector 
actually starts to reclaim that object is out of our control/undefined (unless the 
program is shut down with an explicit System.exit(1)).




> Use BlockingThreadPoolExecutorService for nnProbingThreadPool in 
> ObserverReadProxy
> --
>
> Key: HDFS-17067
> URL: https://issues.apache.org/jira/browse/HDFS-17067
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
>
> In HDFS-17030, we introduced an ExecutorService, to submit 
> getHAServiceState() requests. We constructed the ExecutorService directly 
> from a basic ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to 
> true. Then, the core thread will be kept up and running even when the main 
> thread exits. To fix it, one could set _allowCoreThreadTimeOut_ to true. 
> However, in this PR, we decide to directly use an existing executorService 
> implementation (_BlockingThreadPoolExecutorService_) in hadoop instead. It 
> takes care of setting _allowCoreThreadTimeOut_ and also allows setting the 
> prefix for thread names.
> {code:java}
>   private final ExecutorService nnProbingThreadPool =
>   new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
>   new ArrayBlockingQueue(1024));
> {code}
> A second minor issue is that we did not shut down the executorService in 
> close(). It is a minor issue as close() will only be called when the garbage 
> collector starts to reclaim an ObserverReadProxyProvider object, not when there 
> is no longer any reference to the ObserverReadProxyProvider object. The time 
> between when an ObserverReadProxyProvider becomes dereferenced and when the 
> garbage collector actually starts to reclaim that object is out of our 
> control/undefined (unless the program is shut down with an explicit 
> System.exit(1)).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17067) Use BlockingThreadPoolExecutorService for nnProbingThreadPool in ObserverReadProxy

2023-07-03 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17067:

Description: 
In HDFS-17030, we introduced an ExecutorService, to submit getHAServiceState() 
requests. We constructed the ExecutorService directly from a basic 
ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. Then, the 
core thread will be kept up and running even when the main thread exits. To fix 
it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we 
decide to directly use an existing executorService implementation 
(_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of 
setting _allowCoreThreadTimeOut_ and also allows setting the thread prefix.

A second minor issue is that we did not shut down the executorService in close(). 
It is a minor issue as close() will only be called when the garbage collector 
starts to reclaim an ObserverReadProxyProvider object, not when there is no longer 
any reference to the ObserverReadProxyProvider object. The time between when an 
ObserverReadProxyProvider becomes dereferenced and when the garbage collector 
actually starts to reclaim that object is out of our control/undefined (unless the 
program is shut down with an explicit System.exit(1)).


{code:java}
  private final ExecutorService nnProbingThreadPool =
  new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
  new ArrayBlockingQueue(1024));
{code}


  was:
In HDFS-17030, we introduced an ExecutorService, to submit getHAServiceState() 
requests. We constructed the ExecutorService directly from a basic 
ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. Then, the 
core thread will be kept up and running even when the main thread exits. To fix 
it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we 
decide to directly use an existing executorService implementation 
(_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of 
setting _allowCoreThreadTimeOut_ and allowing setting the thread prefix.

A second minor issue is that we did not shut down the executorService in close(). 
It is a minor issue as close() will only be called when the garbage collector 
starts to reclaim an ObserverReadProxyProvider object, not when there is no longer 
any reference to the ObserverReadProxyProvider object. The time between when an 
ObserverReadProxyProvider becomes dereferenced and when the garbage collector 
actually starts to reclaim that object is out of our control/undefined (unless the 
program is shut down with an explicit System.exit(1)).


{code:java}
  private final ExecutorService nnProbingThreadPool =
  new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
  new ArrayBlockingQueue(1024));
{code}



> Use BlockingThreadPoolExecutorService for nnProbingThreadPool in 
> ObserverReadProxy
> --
>
> Key: HDFS-17067
> URL: https://issues.apache.org/jira/browse/HDFS-17067
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
>
> In HDFS-17030, we introduced an ExecutorService, to submit 
> getHAServiceState() requests. We constructed the ExecutorService directly 
> from a basic ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to 
> true. Then, the core thread will be kept up and running even when the main 
> thread exits. To fix it, one could set _allowCoreThreadTimeOut_ to true. 
> However, in this PR, we decide to directly use an existing executorService 
> implementation (_BlockingThreadPoolExecutorService_) in hadoop instead. It 
> takes care of setting _allowCoreThreadTimeOut_ and also allows setting the 
> thread prefix.
> A second minor issue is that we did not shut down the executorService in 
> close(). It is a minor issue as close() will only be called when the garbage 
> collector starts to reclaim an ObserverReadProxyProvider object, not when there 
> is no longer any reference to the ObserverReadProxyProvider object. The time 
> between when an ObserverReadProxyProvider becomes dereferenced and when the 
> garbage collector actually starts to reclaim that object is out of our 
> control/undefined (unless the program is shut down with an explicit 
> System.exit(1)).
> {code:java}
>   private final ExecutorService nnProbingThreadPool =
>   new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
>   new ArrayBlockingQueue(1024));
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17067) Use BlockingThreadPoolExecutorService for nnProbingThreadPool in ObserverReadProxy

2023-07-03 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17067:

Description: 
In HDFS-17030, we introduced an ExecutorService, to submit getHAServiceState() 
requests. We constructed the ExecutorService directly from a basic 
ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. Then, the 
core thread will be kept up and running even when the main thread exits. To fix 
it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we 
decide to directly use an existing executorService implementation 
(_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of 
setting _allowCoreThreadTimeOut_ and also allows setting the thread prefix.

{code:java}
  private final ExecutorService nnProbingThreadPool =
  new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
  new ArrayBlockingQueue(1024));
{code}

A second minor issue is that we did not shut down the executorService in close(). 
It is a minor issue as close() will only be called when the garbage collector 
starts to reclaim an ObserverReadProxyProvider object, not when there is no longer 
any reference to the ObserverReadProxyProvider object. The time between when an 
ObserverReadProxyProvider becomes dereferenced and when the garbage collector 
actually starts to reclaim that object is out of our control/undefined (unless the 
program is shut down with an explicit System.exit(1)).



  was:
In HDFS-17030, we introduced an ExecutorService, to submit getHAServiceState() 
requests. We constructed the ExecutorService directly from a basic 
ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. Then, the 
core thread will be kept up and running even when the main thread exits. To fix 
it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we 
decide to directly use an existing executorService implementation 
(_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of 
setting _allowCoreThreadTimeOut_ and also allows setting the thread prefix.

A second minor issue is that we did not shut down the executorService in close(). 
It is a minor issue as close() will only be called when the garbage collector 
starts to reclaim an ObserverReadProxyProvider object, not when there is no longer 
any reference to the ObserverReadProxyProvider object. The time between when an 
ObserverReadProxyProvider becomes dereferenced and when the garbage collector 
actually starts to reclaim that object is out of our control/undefined (unless the 
program is shut down with an explicit System.exit(1)).


{code:java}
  private final ExecutorService nnProbingThreadPool =
  new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
  new ArrayBlockingQueue(1024));
{code}



> Use BlockingThreadPoolExecutorService for nnProbingThreadPool in 
> ObserverReadProxy
> --
>
> Key: HDFS-17067
> URL: https://issues.apache.org/jira/browse/HDFS-17067
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
>
> In HDFS-17030, we introduced an ExecutorService, to submit 
> getHAServiceState() requests. We constructed the ExecutorService directly 
> from a basic ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to 
> true. Then, the core thread will be kept up and running even when the main 
> thread exits. To fix it, one could set _allowCoreThreadTimeOut_ to true. 
> However, in this PR, we decide to directly use an existing executorService 
> implementation (_BlockingThreadPoolExecutorService_) in hadoop instead. It 
> takes care of setting _allowCoreThreadTimeOut_ and also allows setting the 
> thread prefix.
> {code:java}
>   private final ExecutorService nnProbingThreadPool =
>   new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
>   new ArrayBlockingQueue(1024));
> {code}
> A second minor issue is that we did not shut down the executorService in 
> close(). It is a minor issue as close() will only be called when the garbage 
> collector starts to reclaim an ObserverReadProxyProvider object, not when there 
> is no longer any reference to the ObserverReadProxyProvider object. The time 
> between when an ObserverReadProxyProvider becomes dereferenced and when the 
> garbage collector actually starts to reclaim that object is out of our 
> control/undefined (unless the program is shut down with an explicit 
> System.exit(1)).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17067) allowCoreThreadTimeOut should be set to true for nnProbingThreadPool in ObserverReadProxy

2023-07-03 Thread Xing Lin (Jira)
Xing Lin created HDFS-17067:
---

 Summary: allowCoreThreadTimeOut should be set to true for 
nnProbingThreadPool in ObserverReadProxy
 Key: HDFS-17067
 URL: https://issues.apache.org/jira/browse/HDFS-17067
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Affects Versions: 3.4.0
Reporter: Xing Lin
Assignee: Xing Lin


In HDFS-17030, we introduced an ExecutorService, to submit getHAServiceState() 
requests. We constructed the ExecutorService directly from a basic 
ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. Then, the 
core thread will be kept up and running even when the main thread exits. To fix 
it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we 
decide to directly use an existing executorService implementation 
(_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of 
setting _allowCoreThreadTimeOut_ and allowing setting the thread prefix.

A second minor issue is that we did not shut down the executorService in close(). 
It is a minor issue as close() will only be called when the garbage collector 
starts to reclaim an ObserverReadProxyProvider object, not when there is no longer 
any reference to the ObserverReadProxyProvider object. The time between when an 
ObserverReadProxyProvider becomes dereferenced and when the garbage collector 
actually starts to reclaim that object is out of our control/undefined (unless the 
program is shut down with an explicit System.exit(1)).


{code:java}
  private final ExecutorService nnProbingThreadPool =
  new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
  new ArrayBlockingQueue(1024));
{code}
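For reference, a minimal sketch of the allowCoreThreadTimeOut alternative mentioned above (illustrative only; the fix ultimately switches to BlockingThreadPoolExecutorService):

{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ProbingPoolSketch {
  public static void main(String[] args) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
        new ArrayBlockingQueue<Runnable>(1024));
    // Let the core thread expire after the keep-alive period so an idle probing
    // thread does not keep running after the client is done.
    pool.allowCoreThreadTimeOut(true);

    ExecutorService nnProbingThreadPool = pool;
    nnProbingThreadPool.submit(() -> System.out.println("probe HA state here"));

    // And shut the pool down explicitly when the proxy provider is closed.
    nnProbingThreadPool.shutdown();
  }
}
{code}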




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17067) Use BlockingThreadPoolExecutorService for nnProbingThreadPool in ObserverReadProxy

2023-07-03 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17067:

Summary: Use BlockingThreadPoolExecutorService for nnProbingThreadPool in 
ObserverReadProxy  (was: allowCoreThreadTimeOut should be set to true for 
nnProbingThreadPool in ObserverReadProxy)

> Use BlockingThreadPoolExecutorService for nnProbingThreadPool in 
> ObserverReadProxy
> --
>
> Key: HDFS-17067
> URL: https://issues.apache.org/jira/browse/HDFS-17067
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
>
> In HDFS-17030, we introduced an ExecutorService, to submit 
> getHAServiceState() requests. We constructed the ExecutorService directly 
> from a basic ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to 
> true. Then, the core thread will be kept up and running even when the main 
> thread exits. To fix it, one could set _allowCoreThreadTimeOut_ to true. 
> However, in this PR, we decide to directly use an existing executorService 
> implementation (_BlockingThreadPoolExecutorService_) in hadoop instead. It 
> takes care of setting _allowCoreThreadTimeOut_ and allowing setting the 
> thread prefix.
> A second minor issue is that we did not shut down the executorService in 
> close(). It is a minor issue as close() will only be called when the garbage 
> collector starts to reclaim an ObserverReadProxyProvider object, not when there 
> is no longer any reference to the ObserverReadProxyProvider object. The time 
> between when an ObserverReadProxyProvider becomes dereferenced and when the 
> garbage collector actually starts to reclaim that object is out of our 
> control/undefined (unless the program is shut down with an explicit 
> System.exit(1)).
> {code:java}
>   private final ExecutorService nnProbingThreadPool =
>   new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
>   new ArrayBlockingQueue(1024));
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17055) Export HAState as a metric from Namenode for monitoring

2023-06-21 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17055:

Description: 
We'd like to measure the uptime for Namenodes: the percentage of time when we have the
active/standby/observer node available (up and running). We could monitor the 
namenode from an external service, such as ZKFC. But that would require the 
external service to be available 100% itself. And when this third-party 
external monitoring service is down, we won't have info on whether our 
Namenodes are still up.

We propose to take a different approach: we will emit Namenode state directly 
from namenode itself. Whenever we miss a data point for this metric, we 
consider the corresponding namenode to be down/not available. In other words, 
we assume the metric collection/monitoring infrastructure to be 100% reliable.

One implementation detail: in hadoop, we have the _NameNodeMetrics_ class, 
which is currently used to emit all metrics for {_}NameNode.java{_}. However, 
we don't think that is a good place to emit NameNode HAState. HAState is stored 
in NameNode.java and we should directly emit it from NameNode.java. Otherwise, 
we basically duplicate this info in two classes and we would have to keep them 
in sync. Besides, _NameNodeMetrics_ class does not have a reference to the 
_NameNode_ object which it belongs to. An _NameNodeMetrics_ is created by a 
_static_ function _initMetrics()_ in {_}NameNode.java{_}.

We shouldn't emit HA state from FSNameSystem.java either, as it is initialized 
from NameNode.java and all state transitions are implemented in NameNode.java.

 

  was:
We'd like to measure the uptime for Namenodes: the percentage of time when we have the
active/standby/observer node available (up and running). We could monitor the 
namenode from an external service, such as ZKFC. But that would require the 
external service to be available 100% itself. And when this third-party 
external monitoring service is down, we won't have info on whether our 
Namenodes are still up.

We propose to take a different approach: we will emit Namenode state directly 
from namenode itself. Whenever we miss a data point for this metric, we 
consider the corresponding namenode to be down/not available. In other words, 
we assume the metric collection/monitoring infrastructure to be 100% reliable.

One implementation detail: in hadoop, we have the _NameNodeMetrics_ class, 
which is currently used to emit all metrics for {_}NameNode.java{_}. However, 
we don't think that is a good place to emit NameNode HAState. HAState is stored 
in NameNode.java and we should directly emit it from NameNode.java. Otherwise, 
we basically duplicate this info in two classes and we would have to keep them 
in sync. Besides, _NameNodeMetrics_ class does not have a reference to the 
_NameNode_ object which it belongs to. An _NameNodeMetrics_ is created by a 
_static_ function _initMetrics()_ in {_}NameNode.java{_}. We shouldn't emit HA 
state from FSNameSystem.java either, as it is initialized from NameNode.java 
and all state transitions are implemented in NameNode.java.

 


> Export HAState as a metric from Namenode for monitoring
> ---
>
> Key: HDFS-17055
> URL: https://issues.apache.org/jira/browse/HDFS-17055
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0, 3.3.9
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>
> We'd like to measure the uptime for Namenodes: the percentage of time when we have 
> the active/standby/observer node available (up and running). We could monitor 
> the namenode from an external service, such as ZKFC. But that would require 
> the external service to be available 100% itself. And when this third-party 
> external monitoring service is down, we won't have info on whether our 
> Namenodes are still up.
> We propose to take a different approach: we will emit Namenode state directly 
> from namenode itself. Whenever we miss a data point for this metric, we 
> consider the corresponding namenode to be down/not available. In other words, 
> we assume the metric collection/monitoring infrastructure to be 100% reliable.
> One implementation detail: in hadoop, we have the _NameNodeMetrics_ class, 
> which is currently used to emit all metrics for {_}NameNode.java{_}. However, 
> we don't think that is a good place to emit NameNode HAState. HAState is 
> stored in NameNode.java and we should directly emit it from NameNode.java. 
> Otherwise, we basically duplicate this info in two classes and we would have 
> to keep them in sync. Besides, _NameNodeMetrics_ class does not have a 
> reference to the _NameNode_ object which it belongs to. An _NameNodeMetrics_ 
> is created by a _static_ function _initMetrics()_ in 

[jira] [Updated] (HDFS-17055) Export HAState as a metric from Namenode for monitoring

2023-06-21 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17055:

Description: 
We'd like to measure the uptime for Namenodes: the percentage of time when we have the
active/standby/observer node available (up and running). We could monitor the 
namenode from an external service, such as ZKFC. But that would require the 
external service to be available 100% itself. And when this third-party 
external monitoring service is down, we won't have info on whether our 
Namenodes are still up.

We propose to take a different approach: we will emit Namenode state directly 
from namenode itself. Whenever we miss a data point for this metric, we 
consider the corresponding namenode to be down/not available. In other words, 
we assume the metric collection/monitoring infrastructure to be 100% reliable.

One implementation detail: in hadoop, we have the _NameNodeMetrics_ class, 
which is currently used to emit all metrics for {_}NameNode.java{_}. However, 
we don't think that is a good place to emit NameNode HAState. HAState is stored 
in NameNode.java and we should directly emit it from NameNode.java. Otherwise, 
we basically duplicate this info in two classes and we would have to keep them 
in sync. Besides, _NameNodeMetrics_ class does not have a reference to the 
_NameNode_ object which it belongs to. An _NameNodeMetrics_ is created by a 
_static_ function _initMetrics()_ in {_}NameNode.java{_}. We shouldn't emit HA 
state from FSNameSystem.java either, as it is initialized from NameNode.java 
and all state transitions are implemented in NameNode.java.

 

  was:
We'd like to measure the uptime for Namenodes: the percentage of time when we have the
active/standby/observer node available (up and running). We could monitor the 
namenode from an external service, such as ZKFC. But that would require the 
external service to be available 100% itself. And when this third-party 
external monitoring service is down, we won't have info on whether our 
Namenodes are still up.

We propose to take a different approach: we will emit Namenode state directly 
from namenode itself. Whenever we miss a data point for this metric, we 
consider the corresponding namenode to be down/not available. In other words, 
we assume the metric collection/monitoring infrastructure to be 100% reliable.

One implementation detail: in hadoop, we have the _NameNodeMetrics_ class, 
which is used to emit all metrics for {_}NameNode.java{_}. However, we don't 
think that is a good place to emit NameNode HAState. HAState is stored in 
NameNode.java and we should directly emit it from NameNode.java. Otherwise, we 
basically duplicate this info in two classes and we would have to keep them in 
sync. Besides, _NameNodeMetrics_ class does not have a reference to the 
_NameNode_ object which it belongs to. An _NameNodeMetrics_ is created by a 
_static_ function _initMetrics()_ in {_}NameNode.java{_}. We shouldn't emit HA 
state from FSNameSystem.java either, as it is initialized from NameNode.java 
and all state transitions are implemented in NameNode.java.

 


> Export HAState as a metric from Namenode for monitoring
> ---
>
> Key: HDFS-17055
> URL: https://issues.apache.org/jira/browse/HDFS-17055
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0, 3.3.9
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>
> We'd like to measure the uptime for Namenodes: the percentage of time when we have 
> the active/standby/observer node available (up and running). We could monitor 
> the namenode from an external service, such as ZKFC. But that would require 
> the external service to be available 100% itself. And when this third-party 
> external monitoring service is down, we won't have info on whether our 
> Namenodes are still up.
> We propose to take a different approach: we will emit Namenode state directly 
> from namenode itself. Whenever we miss a data point for this metric, we 
> consider the corresponding namenode to be down/not available. In other words, 
> we assume the metric collection/monitoring infrastructure to be 100% reliable.
> One implementation detail: in hadoop, we have the _NameNodeMetrics_ class, 
> which is currently used to emit all metrics for {_}NameNode.java{_}. However, 
> we don't think that is a good place to emit NameNode HAState. HAState is 
> stored in NameNode.java and we should directly emit it from NameNode.java. 
> Otherwise, we basically duplicate this info in two classes and we would have 
> to keep them in sync. Besides, _NameNodeMetrics_ class does not have a 
> reference to the _NameNode_ object which it belongs to. An _NameNodeMetrics_ 
> is created by a _static_ function _initMetrics()_ in {_}NameNode.java{_}. We 
> 

[jira] [Assigned] (HDFS-17055) Export HAState as a metric from Namenode for monitoring

2023-06-21 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin reassigned HDFS-17055:
---

Assignee: Xing Lin

> Export HAState as a metric from Namenode for monitoring
> ---
>
> Key: HDFS-17055
> URL: https://issues.apache.org/jira/browse/HDFS-17055
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0, 3.3.9
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>
> We'd like to measure the uptime for Namenodes: the percentage of time when we have 
> the active/standby/observer node available (up and running). We could monitor 
> the namenode from an external service, such as ZKFC. But that would require 
> the external service to be available 100% itself. And when this third-party 
> external monitoring service is down, we won't have info on whether our 
> Namenodes are still up.
> We propose to take a different approach: we will emit Namenode state directly 
> from namenode itself. Whenever we miss a data point for this metric, we 
> consider the corresponding namenode to be down/not available. In other words, 
> we assume the metric collection/monitoring infrastructure to be 100% reliable.
> One implementation detail: in hadoop, we have the _NameNodeMetrics_ class, 
> which is used to emit all metrics for {_}NameNode.java{_}. However, we don't 
> think that is a good place to emit NameNode HAState. HAState is stored in 
> NameNode.java and we should directly emit it from NameNode.java. Otherwise, 
> we basically duplicate this info in two classes and we would have to keep 
> them in sync. Besides, _NameNodeMetrics_ class does not have a reference to 
> the _NameNode_ object which it belongs to. An _NameNodeMetrics_ is created by 
> a _static_ function _initMetrics()_ in {_}NameNode.java{_}. We shouldn't emit 
> HA state from FSNameSystem.java either, as it is initialized from 
> NameNode.java and all state transitions are implemented in NameNode.java.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17055) Export HAState as a metric from Namenode for monitoring

2023-06-21 Thread Xing Lin (Jira)
Xing Lin created HDFS-17055:
---

 Summary: Export HAState as a metric from Namenode for monitoring
 Key: HDFS-17055
 URL: https://issues.apache.org/jira/browse/HDFS-17055
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 3.4.0, 3.3.9
Reporter: Xing Lin


We'd like to measure the uptime for Namenodes: the percentage of time when we have the
active/standby/observer node available (up and running). We could monitor the 
namenode from an external service, such as ZKFC. But that would require the 
external service to be available 100% itself. And when this third-party 
external monitoring service is down, we won't have info on whether our 
Namenodes are still up.

We propose to take a different approach: we will emit Namenode state directly 
from namenode itself. Whenever we miss a data point for this metric, we 
consider the corresponding namenode to be down/not available. In other words, 
we assume the metric collection/monitoring infrastructure to be 100% reliable.

One implementation detail: in hadoop, we have the _NameNodeMetrics_ class, 
which is used to emit all metrics for {_}NameNode.java{_}. However, we don't 
think that is a good place to emit NameNode HAState. HAState is stored in 
NameNode.java and we should directly emit it from NameNode.java. Otherwise, we 
basically duplicate this info in two classes and we would have to keep them in 
sync. Besides, _NameNodeMetrics_ class does not have a reference to the 
_NameNode_ object which it belongs to. An _NameNodeMetrics_ is created by a 
_static_ function _initMetrics()_ in {_}NameNode.java{_}. We shouldn't emit HA 
state from FSNameSystem.java either, as it is initialized from NameNode.java 
and all state transitions are implemented in NameNode.java.
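One way this could be wired up, as a hedged sketch only (hypothetical class and state encoding, not necessarily what the patch does): register a small metrics source from NameNode.java that publishes the HA state as a numeric gauge, so that a missing datapoint can be read as "this NameNode is down".

{code:java}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;

// Hypothetical sketch; the class name and the 0/1/2 encoding are assumptions.
@Metrics(context = "dfs")
public class NameNodeHAStateSource {
  // 0 = active, 1 = standby, 2 = observer, -1 = initializing/other.
  private volatile int state = -1;

  public void setState(int newState) {   // called from NameNode on every HA transition
    this.state = newState;
  }

  @Metric("Current HA state of this NameNode")
  public int getNameNodeHAState() {
    return state;
  }

  public static NameNodeHAStateSource register() {
    return DefaultMetricsSystem.instance().register(
        "NameNodeHAState", "HA state of the NameNode", new NameNodeHAStateSource());
  }
}
{code}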

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17042) Add rpcCallSuccesses and OverallRpcProcessingTime to RpcMetrics for Namenode

2023-06-09 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17042:

Description: 
We'd like to add two new types of metrics to the existing NN 
RpcMetrics/RpcDetailedMetrics. These two metrics can then be used as part of 
SLA/SLO for the HDFS service.
 * _RpcCallSuccesses_: it measures the number of RPC requests that are successfully 
processed by a NN (i.e., answered with an RpcStatus of _RpcStatusProto.SUCCESS_). 
Then, together with _RpcQueueNumOps_ (which refers to the total number of RPC 
requests), we can derive the RpcErrorRate for our NN as 
(RpcQueueNumOps - RpcCallSuccesses) / RpcQueueNumOps. 
 * OverallRpcProcessingTime for each RPC method: this metric measures the 
overall RPC processing time for each RPC method at the NN. It covers the time 
from when a request arrives at the NN to when a response is sent back. We are 
already emitting processingTime for each RPC method today in 
RpcDetailedMetrics. We want to extend it to emit overallRpcProcessingTime for 
each RPC method, which includes enqueueTime, queueTime, processingTime, 
responseTime, and handlerTime.

 

  was:
We'd like to add two new types of metrics to the existing NN 
RpcMetrics/RpcDetailedMetrics. 
 * _RpcCallSuccesses_: it measures the number of RPC requests that are successfully 
processed by a NN (i.e., answered with an RpcStatus of _RpcStatusProto.SUCCESS_). 
Then, together with _RpcQueueNumOps_ (which refers to the total number of RPC 
requests), we can derive the RpcErrorRate for our NN as 
(RpcQueueNumOps - RpcCallSuccesses) / RpcQueueNumOps. 
 * OverallRpcProcessingTime for each RPC method: this metric measures the 
overall RPC processing time for each RPC method at the NN. It covers the time 
from when a request arrives at the NN to when a response is sent back. We are 
already emitting processingTime for each RPC method today in 
RpcDetailedMetrics. We want to extend it to emit overallRpcProcessingTime for 
each RPC method, which includes enqueueTime, queueTime, processingTime, 
responseTime, and handlerTime.

 


> Add rpcCallSuccesses and OverallRpcProcessingTime to RpcMetrics for Namenode
> 
>
> Key: HDFS-17042
> URL: https://issues.apache.org/jira/browse/HDFS-17042
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0, 3.3.9
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
>
> We'd like to add two new types of metrics to the existing NN 
> RpcMetrics/RpcDetailedMetrics. These two metrics can then be used as part of 
> SLA/SLO for the HDFS service.
>  * _RpcCallSuccesses_: it measures the number of RPC requests that are 
> successfully processed by a NN (i.e., answered with an RpcStatus of 
> _RpcStatusProto.SUCCESS_). Then, together with _RpcQueueNumOps_ (which refers to 
> the total number of RPC requests), we can derive the RpcErrorRate for our NN as 
> (RpcQueueNumOps - RpcCallSuccesses) / RpcQueueNumOps. 
>  * OverallRpcProcessingTime for each RPC method: this metric measures the 
> overall RPC processing time for each RPC method at the NN. It covers the time 
> from when a request arrives at the NN to when a response is sent back. We are 
> already emitting processingTime for each RPC method today in 
> RpcDetailedMetrics. We want to extend it to emit overallRpcProcessingTime for 
> each RPC method, which includes enqueueTime, queueTime, processingTime, 
> responseTime, and handlerTime.
>  
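A quick numeric illustration of the derived error rate (made-up numbers, just to show the arithmetic):

{code:java}
public class RpcErrorRateExample {
  public static void main(String[] args) {
    long rpcQueueNumOps = 1_000_000L;   // total RPC requests received
    long rpcCallSuccesses = 999_250L;   // requests answered with RpcStatusProto.SUCCESS
    double errorRate = (double) (rpcQueueNumOps - rpcCallSuccesses) / rpcQueueNumOps;
    System.out.printf("RpcErrorRate = %.4f%%%n", errorRate * 100);  // prints 0.0750%
  }
}
{code}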



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17042) Add rpcCallSuccesses and OverallRpcProcessingTime to RpcMetrics for Namenode

2023-06-09 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17042:

Description: 
We'd like to add two new types of metrics to the existing NN
RpcMetrics/RpcDetailedMetrics.
 * {_}RpcCallSuccesses{_}: measures the number of RPC requests that are
successfully processed by a NN (i.e., with a response whose RpcStatus is
{_}RpcStatusProto.SUCCESS{_}). Together with {_}RpcQueueNumOps{_} (which refers
to the total number of RPC requests), we can then derive the RpcErrorRate for
our NN as (RpcQueueNumOps - RpcCallSuccesses) / RpcQueueNumOps.
 * OverallRpcProcessingTime for each RPC method: measures the overall RPC
processing time for each RPC method at the NN, covering the time from when a
request arrives at the NN to when the response is sent back. We already emit
processingTime for each RPC method today in RpcDetailedMetrics; we want to
extend it to also emit overallRpcProcessingTime for each RPC method, which
includes enqueueTime, queueTime, processingTime, responseTime, and handlerTime.

 

  was:
We'd like to add two new types of metrics to the existing
RpcMetrics/RpcDetailedMetrics.
 * {_}RpcCallSuccesses{_}: measures the number of RPC requests that are
successfully processed by a NN (i.e., with a response whose RpcStatus is
{_}RpcStatusProto.SUCCESS{_}). Together with {_}RpcQueueNumOps{_} (which refers
to the total number of RPC requests), we can then derive the RpcErrorRate for
our NN as (RpcQueueNumOps - RpcCallSuccesses) / RpcQueueNumOps.
 * OverallRpcProcessingTime for each RPC method: measures the overall RPC
processing time for each RPC method at the NN, covering the time from when a
request arrives at the NN to when the response is sent back. We already emit
processingTime for each RPC method today in RpcDetailedMetrics; we want to
extend it to also emit overallRpcProcessingTime for each RPC method, which
includes enqueueTime, queueTime, processingTime, responseTime, and handlerTime.

 


> Add rpcCallSuccesses and OverallRpcProcessingTime to RpcMetrics for Namenode
> 
>
> Key: HDFS-17042
> URL: https://issues.apache.org/jira/browse/HDFS-17042
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0, 3.3.9
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
>
> We'd like to add two new types of metrics to the existing NN
> RpcMetrics/RpcDetailedMetrics.
>  * {_}RpcCallSuccesses{_}: measures the number of RPC requests that are
> successfully processed by a NN (i.e., with a response whose RpcStatus is
> {_}RpcStatusProto.SUCCESS{_}). Together with {_}RpcQueueNumOps{_} (which
> refers to the total number of RPC requests), we can then derive the
> RpcErrorRate for our NN as (RpcQueueNumOps - RpcCallSuccesses) / RpcQueueNumOps.
>  * OverallRpcProcessingTime for each RPC method: measures the overall RPC
> processing time for each RPC method at the NN, covering the time from when a
> request arrives at the NN to when the response is sent back. We already emit
> processingTime for each RPC method today in RpcDetailedMetrics; we want to
> extend it to also emit overallRpcProcessingTime for each RPC method, which
> includes enqueueTime, queueTime, processingTime, responseTime, and handlerTime.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17042) Add rpcCallSuccesses and OverallRpcProcessingTime to RpcMetrics for Namenode

2023-06-09 Thread Xing Lin (Jira)
Xing Lin created HDFS-17042:
---

 Summary: Add rpcCallSuccesses and OverallRpcProcessingTime to 
RpcMetrics for Namenode
 Key: HDFS-17042
 URL: https://issues.apache.org/jira/browse/HDFS-17042
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 3.4.0, 3.3.9
Reporter: Xing Lin
Assignee: Xing Lin


We'd like to add two new types of metrics to the existing
RpcMetrics/RpcDetailedMetrics.
 * {_}RpcCallSuccesses{_}: measures the number of RPC requests that are
successfully processed by a NN (i.e., with a response whose RpcStatus is
{_}RpcStatusProto.SUCCESS{_}). Together with {_}RpcQueueNumOps{_} (which refers
to the total number of RPC requests), we can then derive the RpcErrorRate for
our NN as (RpcQueueNumOps - RpcCallSuccesses) / RpcQueueNumOps.
 * OverallRpcProcessingTime for each RPC method: measures the overall RPC
processing time for each RPC method at the NN, covering the time from when a
request arrives at the NN to when the response is sent back. We already emit
processingTime for each RPC method today in RpcDetailedMetrics; we want to
extend it to also emit overallRpcProcessingTime for each RPC method, which
includes enqueueTime, queueTime, processingTime, responseTime, and handlerTime.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy

2023-05-30 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17030:

Description: 
When namenode HA is enabled and a standby NN is not responsive, we have
observed that it can take a long time to serve a request, even though we have a
healthy observer or active NN.

Basically, when a standby is down, the RPC client would (re)try to create a
socket connection to that standby for _ipc.client.connect.timeout_ *
_ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
heap dump at a standby, the NN still accepts the socket connection but it won't
send responses to these RPC requests and we would time out after
_ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request
takes more than 2 mins to complete when we take a heap dump at a standby. This
has been causing user job failures.

We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
getHAServiceState requests in ObserverReaderProxy (for user rpc requests, we
still use the original value from the config). However, that would double the
number of socket connections between clients and the NN (which is a
deal-breaker).

The proposal is to add a timeout on getHAServiceState() calls in
ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
with its HA state. Once we pass that timeout, we will move on to the next NN.
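
For illustration, here is a minimal, self-contained sketch of the proposed
behavior (this is not the actual ObserverReaderProxy code; the Callable below
merely stands in for the getHAServiceState() RPC):

{code:java}
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HaStateProbeSketch {

  /**
   * Wait at most timeoutMs for the probe; on timeout return null so the
   * caller can move on and probe the next NN.
   */
  static <T> T callWithTimeout(Callable<T> probe, long timeoutMs, ExecutorService pool) {
    Future<T> future = pool.submit(probe);
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      future.cancel(true);
      return null;
    } catch (ExecutionException | TimeoutException e) {
      future.cancel(true);
      return null;
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    // Stand-in for an unresponsive standby: the "RPC" takes 5s but we only wait 1s.
    Callable<String> unresponsiveStandby = () -> {
      Thread.sleep(5_000);
      return "STANDBY";
    };
    String state = callWithTimeout(unresponsiveStandby, 1_000, pool);
    System.out.println(state == null ? "timed out, move on to the next NN" : state);
    pool.shutdownNow();
  }
}
{code}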

 

  was:
When namenode HA is enabled and a standby NN is not responsive, we have
observed that it can take a long time to serve a request, even though we have a
healthy observer or active NN.

Basically, when a standby is down, the RPC client would (re)try to create a
socket connection to that standby for _ipc.client.connect.timeout_ *
_ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
heap dump at a standby, the NN still accepts the socket connection but it won't
send responses to these RPC requests and we would time out after
_ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request
takes more than 2 mins to complete when we take a heap dump at a standby. This
has been causing user job failures.

We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
getHAServiceState requests in ObserverReaderProxy (for user rpc requests, we
still use the original value from the config). However, that would double the
number of socket connections between clients and the NN.

The proposal is to add a timeout on getHAServiceState() calls in
ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
with its HA state. Once we pass that timeout, we will move on to the next NN.

 


> Limit wait time for getHAServiceState in ObserverReaderProxy
> 
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
>
> When namenode HA is enabled and a standby NN is not responsive, we have
> observed that it can take a long time to serve a request, even though we have
> a healthy observer or active NN.
> Basically, when a standby is down, the RPC client would (re)try to create a
> socket connection to that standby for _ipc.client.connect.timeout_ *
> _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
> heap dump at a standby, the NN still accepts the socket connection but it
> won't send responses to these RPC requests and we would time out after
> _ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
> Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a
> request takes more than 2 mins to complete when we take a heap dump at a
> standby. This has been causing user job failures.
> We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
> getHAServiceState requests in ObserverReaderProxy (for user rpc requests, we
> still use the original value from the config). However, that would double the
> number of socket connections between clients and the NN (which is a
> deal-breaker).
> The proposal is to add a timeout on getHAServiceState() calls in
> ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
> with its HA state. Once we pass that timeout, we will move on to the next NN.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: 

[jira] [Updated] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy

2023-05-30 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17030:

Description: 
When namenode HA is enabled and a standby NN is not responsive, we have
observed that it can take a long time to serve a request, even though we have a
healthy observer or active NN.

Basically, when a standby is down, the RPC client would (re)try to create a
socket connection to that standby for _ipc.client.connect.timeout_ *
_ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
heap dump at a standby, the NN still accepts the socket connection but it won't
send responses to these RPC requests and we would time out after
_ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request
takes more than 2 mins to complete when we take a heap dump at a standby. This
has been causing user job failures.

We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
getHAServiceState requests in ObserverReaderProxy (for user rpc requests, we
still use the original value from the config). However, that would double the
number of socket connections between clients and the NN (which is a
deal-breaker).

The proposal is to add a timeout on getHAServiceState() calls in
ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
with its HA state. Once we pass that timeout, we will move on to probe the next NN.

 

  was:
When namenode HA is enabled and a standby NN is not responsive, we have
observed that it can take a long time to serve a request, even though we have a
healthy observer or active NN.

Basically, when a standby is down, the RPC client would (re)try to create a
socket connection to that standby for _ipc.client.connect.timeout_ *
_ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
heap dump at a standby, the NN still accepts the socket connection but it won't
send responses to these RPC requests and we would time out after
_ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request
takes more than 2 mins to complete when we take a heap dump at a standby. This
has been causing user job failures.

We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
getHAServiceState requests in ObserverReaderProxy (for user rpc requests, we
still use the original value from the config). However, that would double the
number of socket connections between clients and the NN (which is a
deal-breaker).

The proposal is to add a timeout on getHAServiceState() calls in
ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
with its HA state. Once we pass that timeout, we will move on to the next NN.

 


> Limit wait time for getHAServiceState in ObserverReaderProxy
> 
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
>
> When namenode HA is enabled and a standby NN is not responsive, we have
> observed that it can take a long time to serve a request, even though we have
> a healthy observer or active NN.
> Basically, when a standby is down, the RPC client would (re)try to create a
> socket connection to that standby for _ipc.client.connect.timeout_ *
> _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
> heap dump at a standby, the NN still accepts the socket connection but it
> won't send responses to these RPC requests and we would time out after
> _ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
> Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a
> request takes more than 2 mins to complete when we take a heap dump at a
> standby. This has been causing user job failures.
> We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
> getHAServiceState requests in ObserverReaderProxy (for user rpc requests, we
> still use the original value from the config). However, that would double the
> number of socket connections between clients and the NN (which is a
> deal-breaker).
> The proposal is to add a timeout on getHAServiceState() calls in
> ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
> with its HA state. Once we pass that timeout, we will move on to probe the next NN.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To 

[jira] [Updated] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy

2023-05-30 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17030:

Description: 
When namenode HA is enabled and a standby NN is not responsive, we have
observed that it can take a long time to serve a request, even though we have a
healthy observer or active NN.

Basically, when a standby is down, the RPC client would (re)try to create a
socket connection to that standby for _ipc.client.connect.timeout_ *
_ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
heap dump at a standby, the NN still accepts the socket connection but it won't
send responses to these RPC requests and we would time out after
_ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request
takes more than 2 mins to complete when we take a heap dump at a standby. This
has been causing user job failures.

We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
getHAServiceState requests in ObserverReaderProxy (for user rpc requests, we
still use the original value from the config). However, that would double the
number of socket connections between clients and the NN.

The proposal is to add a timeout on getHAServiceState() calls in
ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
with its HA state. Once we pass that timeout, we will move on to the next NN.

 

  was:
When namenode HA is enabled and a standby NN is not responsive, we have
observed that it can take a long time to serve a request, even though we have a
healthy observer or active NN.

Basically, when a standby is down, the RPC client would (re)try to create a
socket connection to that standby for _ipc.client.connect.timeout_ *
_ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
heap dump at a standby, the NN still accepts the socket connection but it won't
send responses to these RPC requests and we would time out after
_ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request
would take more than 2 mins to complete when we take a heap dump at a standby.
This has been causing user job failures.

We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
getHAServiceState requests in ObserverReaderProxy (for user rpc requests, we
still use the original value from the config). However, that would double the
number of socket connections between clients and the NN.

The proposal is to add a timeout on getHAServiceState() calls in
ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
with its HA state. Once we pass that timeout, we will move on to the next NN.

 


> Limit wait time for getHAServiceState in ObserverReaderProxy
> 
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
>
> When namenode HA is enabled and a standby NN is not responsive, we have
> observed that it can take a long time to serve a request, even though we have
> a healthy observer or active NN.
> Basically, when a standby is down, the RPC client would (re)try to create a
> socket connection to that standby for _ipc.client.connect.timeout_ *
> _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
> heap dump at a standby, the NN still accepts the socket connection but it
> won't send responses to these RPC requests and we would time out after
> _ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
> Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a
> request takes more than 2 mins to complete when we take a heap dump at a
> standby. This has been causing user job failures.
> We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
> getHAServiceState requests in ObserverReaderProxy (for user rpc requests, we
> still use the original value from the config). However, that would double the
> number of socket connections between clients and the NN.
> The proposal is to add a timeout on getHAServiceState() calls in
> ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
> with its HA state. Once we pass that timeout, we will move on to the next NN.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional 

[jira] [Updated] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy

2023-05-30 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17030:

Description: 
When namenode HA is enabled and a standby NN is not responsive, we have
observed that it can take a long time to serve a request, even though we have a
healthy observer or active NN.

Basically, when a standby is down, the RPC client would (re)try to create a
socket connection to that standby for _ipc.client.connect.timeout_ *
_ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
heap dump at a standby, the NN still accepts the socket connection but it won't
send responses to these RPC requests and we would time out after
_ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request
would take more than 2 mins to complete when we take a heap dump at a standby.
This has been causing user job failures.

We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
getHAServiceState requests in ObserverReaderProxy (for user rpc requests, we
still use the original value from the config). However, that would double the
number of socket connections between clients and the NN.

The proposal is to add a timeout on getHAServiceState() calls in
ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
with its HA state. Once we pass that timeout, we will move on to the next NN.

 

  was:
When namenode HA is enabled and a standby NN is not responsive, we have
observed that it can take a long time to serve a request, even though we have a
healthy observer or active NN.

Basically, when a standby is down, the RPC client would (re)try to connect to
that standby for _ipc.client.connect.timeout_ *
_ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
heap dump at a standby, the NN still accepts the socket connection but it won't
send responses to these RPC requests and we would time out after
_ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request
would take more than 2 mins to complete when we take a heap dump at a standby.
This has been causing user job failures.

We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
getHAServiceState requests in ObserverReaderProxy (for user rpc requests, we
still use the original value from the config). However, that would double the
number of socket connections between clients and the NN.

The proposal is to add a timeout on getHAServiceState() calls in
ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
with its HA state. Once we pass that timeout, we will move on to the next NN.

 


> Limit wait time for getHAServiceState in ObserverReaderProxy
> 
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
>
> When namenode HA is enabled and a standby NN is not responsive, we have
> observed that it can take a long time to serve a request, even though we have
> a healthy observer or active NN.
> Basically, when a standby is down, the RPC client would (re)try to create a
> socket connection to that standby for _ipc.client.connect.timeout_ *
> _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
> heap dump at a standby, the NN still accepts the socket connection but it
> won't send responses to these RPC requests and we would time out after
> _ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
> Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a
> request would take more than 2 mins to complete when we take a heap dump at a
> standby. This has been causing user job failures.
> We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
> getHAServiceState requests in ObserverReaderProxy (for user rpc requests, we
> still use the original value from the config). However, that would double the
> number of socket connections between clients and the NN.
> The proposal is to add a timeout on getHAServiceState() calls in
> ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
> with its HA state. Once we pass that timeout, we will move on to the next NN.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional 

[jira] [Updated] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy

2023-05-29 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17030:

Description: 
When namenode HA is enabled and a standby NN is not responsive, we have
observed that it can take a long time to serve a request, even though we have a
healthy observer or active NN.

Basically, when a standby is down, the RPC client would (re)try to connect to
that standby for _ipc.client.connect.timeout_ *
_ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
heap dump at a standby, the NN still accepts the socket connection but it won't
send responses to these RPC requests and we would time out after
_ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request
would take more than 2 mins to complete when we take a heap dump at a standby.
This has been causing user job failures.

We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
getHAServiceState requests in ObserverReaderProxy (for user rpc requests, we
still use the original value from the config). However, that would double the
number of socket connections between clients and the NN.

The proposal is to add a timeout on getHAServiceState() calls in
ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
with its HA state. Once we pass that timeout, we will move on to the next NN.

 

  was:
When namenode HA is enabled and a standby NN is not responsive, we have
observed that it can take a long time to serve a request, even though we have a
healthy observer or active NN.

Basically, when a standby is down, the RPC client would (re)try to connect to
that standby for _ipc.client.connect.timeout_ *
_ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
heap dump at a standby, the NN still accepts the socket connection but it won't
send responses to these RPC requests and we would time out after
_ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request
would take more than 2 mins to complete when we take a heap dump at a standby.
This has been causing user job failures.

We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
getHAServiceState requests in ObserverReaderProxy. However, that would double
the number of socket connections between clients and the NN.

The proposal is to add a timeout on getHAServiceState() calls in
ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
with its HA state. Once we pass that timeout, we will move on to the next NN.

 


> Limit wait time for getHAServiceState in ObserverReaderProxy
> 
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>
> When namenode HA is enabled and a standby NN is not responsive, we have
> observed that it can take a long time to serve a request, even though we have
> a healthy observer or active NN.
> Basically, when a standby is down, the RPC client would (re)try to connect to
> that standby for _ipc.client.connect.timeout_ *
> _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
> heap dump at a standby, the NN still accepts the socket connection but it
> won't send responses to these RPC requests and we would time out after
> _ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
> Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a
> request would take more than 2 mins to complete when we take a heap dump at a
> standby. This has been causing user job failures.
> We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
> getHAServiceState requests in ObserverReaderProxy (for user rpc requests, we
> still use the original value from the config). However, that would double the
> number of socket connections between clients and the NN.
> The proposal is to add a timeout on getHAServiceState() calls in
> ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
> with its HA state. Once we pass that timeout, we will move on to the next NN.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy

2023-05-29 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17030:

Description: 
When namenode HA is enabled and a standby NN is not responsive, we have
observed that it can take a long time to serve a request, even though we have a
healthy observer or active NN.

Basically, when a standby is down, the RPC client would (re)try to connect to
that standby for _ipc.client.connect.timeout_ *
_ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
heap dump at a standby, the NN still accepts the socket connection but it won't
send responses to these RPC requests and we would time out after
_ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request
would take more than 2 mins to complete when we take a heap dump at a standby.
This has been causing user job failures.

We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
getHAServiceState requests in ObserverReaderProxy. However, that would double
the number of socket connections between clients and the NN.

The proposal is to add a timeout on getHAServiceState() calls in
ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
with its HA state. Once we pass that timeout, we will move on to the next NN.

 

  was:
When namenode HA is enabled and a standby NN is not responsive, we have
observed that it can take a long time to serve a request, even though we have a
healthy observer or active NN.

Basically, when a standby is down, the RPC client would (re)try to connect to
that standby for _ipc.client.connect.timeout_ *
_ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
heap dump at a standby, the NN still accepts the socket connection but it won't
send responses to these RPC requests and we would time out after
_ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request
would take more than 2 mins to complete when we take a heap dump at a standby.
This has been causing user job failures.

The proposal is to add a timeout on getHAServiceState() calls in
ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
with its HA state. Once we pass that timeout, we will move on to the next NN.

 


> Limit wait time for getHAServiceState in ObserverReaderProxy
> 
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>
> When namenode HA is enabled and a standby NN is not responsive, we have
> observed that it can take a long time to serve a request, even though we have
> a healthy observer or active NN.
> Basically, when a standby is down, the RPC client would (re)try to connect to
> that standby for _ipc.client.connect.timeout_ *
> _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
> heap dump at a standby, the NN still accepts the socket connection but it
> won't send responses to these RPC requests and we would time out after
> _ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
> Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a
> request would take more than 2 mins to complete when we take a heap dump at a
> standby. This has been causing user job failures.
> We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
> getHAServiceState requests in ObserverReaderProxy. However, that would double
> the number of socket connections between clients and the NN.
> The proposal is to add a timeout on getHAServiceState() calls in
> ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
> with its HA state. Once we pass that timeout, we will move on to the next NN.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy

2023-05-29 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin reassigned HDFS-17030:
---

Assignee: Xing Lin

> Limit wait time for getHAServiceState in ObserverReaderProxy
> 
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>
> When namenode HA is enabled and a standby NN is not responsive, we have
> observed that it can take a long time to serve a request, even though we have
> a healthy observer or active NN.
> Basically, when a standby is down, the RPC client would (re)try to connect to
> that standby for _ipc.client.connect.timeout_ *
> _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
> heap dump at a standby, the NN still accepts the socket connection but it
> won't send responses to these RPC requests and we would time out after
> _ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
> Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a
> request would take more than 2 mins to complete when we take a heap dump at a
> standby. This has been causing user job failures.
> The proposal is to add a timeout on getHAServiceState() calls in
> ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
> with its HA state. Once we pass that timeout, we will move on to the next NN.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy

2023-05-29 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17030:

Description: 
When namenode HA is enabled and a standby NN is not responsive, we have
observed that it can take a long time to serve a request, even though we have a
healthy observer or active NN.

Basically, when a standby is down, the RPC client would (re)try to connect to
that standby for _ipc.client.connect.timeout_ *
_ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
heap dump at a standby, the NN still accepts the socket connection but it won't
send responses to these RPC requests and we would time out after
_ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request
would take more than 2 mins to complete when we take a heap dump at a standby.
This has been causing user job failures.

The proposal is to add a timeout on getHAServiceState() calls in
ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
with its HA state. Once we pass that timeout, we will move on to the next NN.

 

  was:
When HA is enabled and a standby NN is not responsive (either when it is down
or when a heap dump is being taken), we would wait for either
_socket_connection_timeout * socket_max_retries_on_connection_timeout_ or
_rpcTimeOut_ before moving on to the next NN. This adds significant latency.
For clusters at Linkedin, we set rpcTimeOut to 120 seconds, and a request would
take more than 2 mins to complete when we take a heap dump at a standby. This
has been causing user job failures.

The proposal is to add a timeout on getHAServiceState() calls in
ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
with its HA state. Once we pass that timeout, we will move on to the next NN.

 


> Limit wait time for getHAServiceState in ObserverReaderProxy
> 
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Priority: Minor
>
> When namenode HA is enabled and a standby NN is not responsive, we have
> observed that it can take a long time to serve a request, even though we have
> a healthy observer or active NN.
> Basically, when a standby is down, the RPC client would (re)try to connect to
> that standby for _ipc.client.connect.timeout_ *
> _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
> heap dump at a standby, the NN still accepts the socket connection but it
> won't send responses to these RPC requests and we would time out after
> _ipc.client.rpc-timeout.ms._ This adds significant latency. For clusters at
> Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a
> request would take more than 2 mins to complete when we take a heap dump at a
> standby. This has been causing user job failures.
> The proposal is to add a timeout on getHAServiceState() calls in
> ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
> with its HA state. Once we pass that timeout, we will move on to the next NN.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy

2023-05-29 Thread Xing Lin (Jira)
Xing Lin created HDFS-17030:
---

 Summary: Limit wait time for getHAServiceState in 
ObserverReaderProxy
 Key: HDFS-17030
 URL: https://issues.apache.org/jira/browse/HDFS-17030
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 3.4.0
Reporter: Xing Lin


When HA is enabled and a standby NN is not responsive (either when it is down
or when a heap dump is being taken), we would wait for either
_socket_connection_timeout * socket_max_retries_on_connection_timeout_ or
_rpcTimeOut_ before moving on to the next NN. This adds significant latency.
For clusters at Linkedin, we set rpcTimeOut to 120 seconds, and a request would
take more than 2 mins to complete when we take a heap dump at a standby. This
has been causing user job failures.

The proposal is to add a timeout on getHAServiceState() calls in
ObserverReaderProxy: we will only wait up to that timeout for an NN to respond
with its HA state. Once we pass that timeout, we will move on to the next NN.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16816) RBF: auto-create user home dir for trash paths by router

2023-02-24 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin reassigned HDFS-16816:
---

Assignee: (was: Xing Lin)

> RBF: auto-create user home dir for trash paths by router
> 
>
> Key: HDFS-16816
> URL: https://issues.apache.org/jira/browse/HDFS-16816
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
>
> In RBF, trash files are moved to the trash root under the user's home dir at
> the corresponding namespace/namenode where the files reside. This was added in
> HDFS-16024. When the user's home dir has not been created beforehand at a
> namenode, we run into permission-denied exceptions when trying to create the
> parent dir for the trash file before moving the file into it. We propose to
> enhance the Router to auto-create a user's home dir at the namenode for trash
> paths, using the router's identity (which is assumed to be a super-user). A
> minimal sketch of this idea is shown below.
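> For illustration, a minimal sketch of that idea (the helper names are made up;
> only the FileSystem, Path, and UserGroupInformation APIs are existing Hadoop
> classes, and a real Router-side change would likely create the FileSystem
> inside the privileged action):
> {code:java}
> import java.io.IOException;
> import java.security.PrivilegedExceptionAction;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.security.UserGroupInformation;
>
> public class TrashHomeDirSketch {
>   /** Returns /user/<name> for a path like /user/<name>/.Trash/..., else null. */
>   static Path homeDirForTrashPath(Path trashPath) {
>     Path p = trashPath;
>     while (p != null && !".Trash".equals(p.getName())) {
>       p = p.getParent();
>     }
>     return p == null ? null : p.getParent(); // the parent of .Trash is the home dir
>   }
>
>   static void ensureHomeDir(FileSystem fs, Path trashPath, UserGroupInformation routerUser)
>       throws IOException, InterruptedException {
>     Path home = homeDirForTrashPath(trashPath);
>     if (home != null && !fs.exists(home)) {
>       // Create the home dir as the router's identity (assumed to be a super-user).
>       routerUser.doAs((PrivilegedExceptionAction<Boolean>) () -> fs.mkdirs(home));
>     }
>   }
> }
> {code}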



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer

2023-01-17 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-16689:

Description: 
Standby NameNode crashes when transitioning to Active with an in-progress
tailer, with an error message like the one below:
{code:java}
Caused by: java.lang.IllegalStateException: Cannot start writing at txid X when 
there is a stream available for read: ByteStringEditLog[X, Y], 
ByteStringEditLog[X, 0]
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:344)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.openForWrite(FSEditLogAsync.java:113)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1423)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:2132)
... 36 more
{code}
After tracing, we found there is a critical bug in
*EditlogTailer#catchupDuringFailover()* when *DFS_HA_TAILEDITS_INPROGRESS_KEY*
is true: *catchupDuringFailover()* tries to replay all missed edits from the
JournalNodes with {*}onlyDurableTxns=true{*}, so it may not be able to replay
any edits when there are some abnormal JournalNodes.

To reproduce, suppose:
 - There are 2 namenodes, namely NN0 and NN1, whose states are Active and
Standby respectively. And there are 3 JournalNodes, namely JN0, JN1 and JN2.
 - NN0 tries to sync 3 edits to the JNs with starting txid 3, but only
successfully synced them to JN1 and JN2 {-}JN3{-}. JN0 is abnormal, e.g. due to
GC, a bad network or a restart.
 - NN1's lastAppliedTxId is 2, and at that moment we are trying to fail over
the active from NN0 to NN1.
 - NN1 only got two responses, from JN0 and JN1, when it tried to select
inputStreams with *fromTxnId=3* and {*}onlyDurableTxns=true{*}, and the txid
counts of the responses are 0 and 3 respectively. JN2 is abnormal, e.g. due to
GC, a bad network or a restart.
 - NN1 therefore cannot replay any edits with *fromTxnId=3* from the
JournalNodes because *maxAllowedTxns* is 0 (see the sketch below).

So I think the Standby NameNode should run *catchupDuringFailover()* with
*onlyDurableTxns=false*, so that it can replay all missed edits from the
JournalNodes.
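
For illustration, here is a small, self-contained sketch of the quorum
arithmetic described above (it paraphrases the durable-txn selection logic; it
is not the verbatim Hadoop code):

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DurableTxnsSketch {
  public static void main(String[] args) {
    // Txn counts returned by the JNs that responded: JN0 -> 0, JN1 -> 3.
    // JN2 did not respond at all.
    List<Integer> responseCounts = new ArrayList<>(List.of(0, 3));
    int majoritySize = 2;            // 3 JournalNodes -> a majority is 2
    boolean onlyDurableTxns = true;

    Collections.sort(responseCounts);  // [0, 3]
    int highestTxnCount = responseCounts.get(responseCounts.size() - 1);
    // With onlyDurableTxns, only txns known to be durable on a majority are
    // allowed, i.e. the smallest count among the top `majoritySize` responses.
    int maxAllowedTxns = onlyDurableTxns
        ? responseCounts.get(responseCounts.size() - majoritySize)  // index 0 -> 0
        : highestTxnCount;                                          // would be 3
    System.out.println("maxAllowedTxns = " + maxAllowedTxns);  // 0: NN1 cannot catch up
  }
}
{code}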

  was:
Standby NameNode crashes when transitioning to Active with an in-progress
tailer, with an error message like the one below:


{code:java}
Caused by: java.lang.IllegalStateException: Cannot start writing at txid X when 
there is a stream available for read: ByteStringEditLog[X, Y], 
ByteStringEditLog[X, 0]
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:344)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.openForWrite(FSEditLogAsync.java:113)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1423)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:2132)
... 36 more
{code}

After tracing, we found there is a critical bug in
*EditlogTailer#catchupDuringFailover()* when *DFS_HA_TAILEDITS_INPROGRESS_KEY*
is true: *catchupDuringFailover()* tries to replay all missed edits from the
JournalNodes with *onlyDurableTxns=true*, so it may not be able to replay any
edits when there are some abnormal JournalNodes.

To reproduce, suppose:
- There are 2 namenodes, namely NN0 and NN1, whose states are Active and
Standby respectively. And there are 3 JournalNodes, namely JN0, JN1 and JN2.
- NN0 tries to sync 3 edits to the JNs with starting txid 3, but only
successfully synced them to JN1 and JN3. And JN0 is abnormal, e.g. due to GC, a
bad network or a restart.
- NN1's lastAppliedTxId is 2, and at that moment we are trying to fail over the
active from NN0 to NN1.
- NN1 only got two responses, from JN0 and JN1, when it tried to select
inputStreams with *fromTxnId=3* and *onlyDurableTxns=true*, and the txid counts
of the responses are 0 and 3 respectively. JN2 is abnormal, e.g. due to GC, a
bad network or a restart.
- NN1 therefore cannot replay any edits with *fromTxnId=3* from the
JournalNodes because *maxAllowedTxns* is 0.


So I think the Standby NameNode should run *catchupDuringFailover()* with
*onlyDurableTxns=false*, so that it can replay all missed edits from the
JournalNodes.


> Standby NameNode crashes when transitioning to Active with in-progress tailer
> -
>
> Key: HDFS-16689
> URL: https://issues.apache.org/jira/browse/HDFS-16689
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Standby NameNode crashes when transitioning to Active with an in-progress
> tailer, with an error message like the one below:
> 

[jira] [Comment Edited] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode

2023-01-07 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655752#comment-17655752
 ] 

Xing Lin edited comment on HDFS-15901 at 1/8/23 1:42 AM:
-

Do we have any followup on this issue? 

We are seeing a similar issue happening at Linkedin as well. The standby NN can
be stuck in safe mode when restarted for some of the large clusters. When the
NN is stuck in safe mode, the number of missing blocks is different each time,
and the numbers are small, from ~800 to 10K. It does not seem that we are
missing an FBR. We are not sure what is causing the issue, but could the
following hypothesis be the case?

In safe mode, the standby NN receives the first FBR from DN1/DN2/DN3. At a
later time, blockA is deleted; it is removed from DN1/DN2/DN3 and they send a
new incremental block report (IBR). However, the NN does not process these IBRs
(for example, it is paused due to GC). The NN will not process any non-initial
FBR from DN1/DN2/DN3 either, so it will never learn that blockA has already
been removed from the cluster, and blockA becomes a missing block that the NN
will wait for forever.

 


was (Author: xinglin):
Do we have any followup on this issue? 

We are seeing a similar issue happening at Linkedin as well. The standby NN can
be stuck in safe mode when restarted for some of the large clusters. When the
NN is stuck in safe mode, the number of missing blocks is different each time.
We are not sure what is causing the issue, but could the following hypothesis
be the case?

In safe mode, the standby NN receives the first FBR from DN1/DN2/DN3. At a
later time, blockA is deleted; it is removed from DN1/DN2/DN3 and they send a
new incremental block report (IBR). However, the NN does not process these IBRs
(for example, it is paused due to GC). The NN will not process any non-initial
FBR from DN1/DN2/DN3 either, so it will never learn that blockA has already
been removed from the cluster, and blockA becomes a missing block that the NN
will wait for forever.

 

> Solve the problem of DN repeated block reports occupying too many RPCs during 
> Safemode
> --
>
> Key: HDFS-15901
> URL: https://issues.apache.org/jira/browse/HDFS-15901
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When the cluster exceeds thousands of nodes and we want to restart the
> NameNode service, all DataNodes send a full block report to the NameNode.
> During SafeMode, some DataNodes may send block reports to the NameNode
> multiple times, which takes up too many RPCs. In fact, this is unnecessary.
> In this case, some block report leases will fail or time out, and in extreme 
> cases, the NameNode will always stay in Safe Mode.
> 2021-03-14 08:16:25,873 [78438700] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded 
> non-initial block report from DatanodeRegistration(:port, 
> datanodeUuid=, infoPort=, infoSecurePort=, 
> ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode 
> still in startup phase
> 2021-03-14 08:16:31,521 [78444348] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded 
> non-initial block report from DatanodeRegistration(, 
> datanodeUuid=, infoPort=, infoSecurePort=, 
> ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode 
> still in startup phase
> 2021-03-13 18:35:38,200 [29191027] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for 
> DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for 
> DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for 
> DN , because the lease has expired.
> 2021-03-13 18:36:08,145 [29220972] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for 
> DN , because the lease has expired.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode

2023-01-07 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655752#comment-17655752
 ] 

Xing Lin commented on HDFS-15901:
-

Do we have any followup on this issue? 

We are seeing a similar issue happening at Linkedin as well. The standby NN can
be stuck in safe mode when restarted for some of the large clusters. When the
NN is stuck in safe mode, the number of missing blocks is different each time.
We are not sure what is causing the issue, but could the following hypothesis
be the case?

In safe mode, the standby NN receives the first FBR from DN1/DN2/DN3. At a
later time, blockA is deleted; it is removed from DN1/DN2/DN3 and they send a
new incremental block report (IBR). However, the NN does not process these IBRs
(for example, it is paused due to GC). The NN will not process any non-initial
FBR from DN1/DN2/DN3 either, so it will never learn that blockA has already
been removed from the cluster, and blockA becomes a missing block that the NN
will wait for forever.

 

> Solve the problem of DN repeated block reports occupying too many RPCs during 
> Safemode
> --
>
> Key: HDFS-15901
> URL: https://issues.apache.org/jira/browse/HDFS-15901
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When the cluster exceeds thousands of nodes and we want to restart the
> NameNode service, all DataNodes send a full block report to the NameNode.
> During SafeMode, some DataNodes may send block reports to the NameNode
> multiple times, which takes up too many RPCs. In fact, this is unnecessary.
> In this case, some block report leases will fail or time out, and in extreme 
> cases, the NameNode will always stay in Safe Mode.
> 2021-03-14 08:16:25,873 [78438700] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded 
> non-initial block report from DatanodeRegistration(:port, 
> datanodeUuid=, infoPort=, infoSecurePort=, 
> ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode 
> still in startup phase
> 2021-03-14 08:16:31,521 [78444348] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded 
> non-initial block report from DatanodeRegistration(, 
> datanodeUuid=, infoPort=, infoSecurePort=, 
> ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode 
> still in startup phase
> 2021-03-13 18:35:38,200 [29191027] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for 
> DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for 
> DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for 
> DN , because the lease has expired.
> 2021-03-13 18:36:08,145 [29220972] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for 
> DN , because the lease has expired.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16852) HDFS-16852 Register the shutdown hook only when not in shutdown for KeyProviderCache constructor

2022-12-02 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-16852:

Summary: HDFS-16852 Register the shutdown hook only when not in shutdown 
for KeyProviderCache constructor  (was: Swallow IllegalStateException in 
KeyProviderCache)

> HDFS-16852 Register the shutdown hook only when not in shutdown for 
> KeyProviderCache constructor
> 
>
> Key: HDFS-16852
> URL: https://issues.apache.org/jira/browse/HDFS-16852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
>
> When an HDFS client is created, it registers a shutdown hook with the
> ShutdownHookManager. ShutdownHookManager doesn't allow adding a new shutdown
> hook when the process is already in shutdown, and throws an
> IllegalStateException.
> This behavior is not ideal when a Spark program fails during pre-launch. In
> that case, during shutdown, Spark calls cleanupStagingDir() to clean the
> staging dir. In cleanupStagingDir(), it creates a FileSystem object to talk
> to HDFS. However, since this is the first time a FileSystem object is used in
> that process, it needs to create an HDFS client and register the shutdown
> hook, and we then hit the IllegalStateException. This IllegalStateException
> masks the actual exception that caused the Spark program to fail during
> pre-launch.
> We propose to swallow the IllegalStateException in KeyProviderCache and log a
> warning. The TCP connection between the client and the NameNode will be
> closed by the OS when the process shuts down.
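> For illustration, a minimal sketch of the two options discussed above (the
> class, method, and priority names are made up for this sketch; only
> ShutdownHookManager.get(), isShutdownInProgress(), and addShutdownHook() are
> existing Hadoop APIs):
> {code:java}
> import org.apache.hadoop.util.ShutdownHookManager;
> import org.slf4j.Logger;
> import org.slf4j.LoggerFactory;
>
> public class ShutdownHookGuardSketch {
>   private static final Logger LOG =
>       LoggerFactory.getLogger(ShutdownHookGuardSketch.class);
>   private static final int SHUTDOWN_HOOK_PRIORITY = 1; // illustrative value
>
>   static void registerCleanupHook(Runnable cleanup) {
>     // Option 1: only register when the JVM is not already shutting down.
>     if (ShutdownHookManager.get().isShutdownInProgress()) {
>       LOG.warn("JVM shutdown in progress; skipping shutdown hook registration.");
>       return;
>     }
>     try {
>       ShutdownHookManager.get().addShutdownHook(cleanup, SHUTDOWN_HOOK_PRIORITY);
>     } catch (IllegalStateException e) {
>       // Option 2: shutdown can still race with the check above; swallow and log.
>       LOG.warn("Could not register shutdown hook", e);
>     }
>   }
> }
> {code}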
> Example stacktrace
> {code:java}
> 13-09-2022 14:39:42 PDT INFO - 22/09/13 21:39:41 ERROR util.Utils: Uncaught 
> exception in thread shutdown-hook-0   
> 13-09-2022 14:39:42 PDT INFO - java.lang.IllegalStateException: Shutdown in 
> progress, cannot add a shutdownHook    
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:299)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.KeyProviderCache.(KeyProviderCache.java:71)      
>     
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.ClientContext.(ClientContext.java:130)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.ClientContext.get(ClientContext.java:167)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:383)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:287)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:159)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3261)        
>   
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3310)       
>    
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3278)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem.get(FileSystem.java:475)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.deploy.yarn.ApplicationMaster.cleanupStagingDir(ApplicationMaster.scala:675)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.deploy.yarn.ApplicationMaster.$anonfun$run$2(ApplicationMaster.scala:259)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)    
>       
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)        
>   
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2023)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)        
>   
> 13-09-2022 14:39:42 PDT INFO - at scala.util.Try$.apply(Try.scala:213)        
>   
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>      

[jira] [Updated] (HDFS-16852) Register the shutdown hook only when not in shutdown for KeyProviderCache constructor

2022-12-02 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-16852:

Summary: Register the shutdown hook only when not in shutdown for 
KeyProviderCache constructor  (was: HDFS-16852 Register the shutdown hook only 
when not in shutdown for KeyProviderCache constructor)

> Register the shutdown hook only when not in shutdown for KeyProviderCache 
> constructor
> -
>
> Key: HDFS-16852
> URL: https://issues.apache.org/jira/browse/HDFS-16852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
>
> When an HDFS client is created, it registers a shutdown hook with the 
> ShutdownHookManager. ShutdownHookManager doesn't allow adding a new 
> shutdownHook when the process is already in shutdown and throws an 
> IllegalStateException.
> This behavior is not ideal when a Spark program fails during pre-launch. In 
> that case, during shutdown, Spark calls cleanupStagingDir() to clean the 
> staging dir. In cleanupStagingDir(), it creates a FileSystem object to talk 
> to HDFS. However, since this is the first use of a FileSystem object in that 
> process, it needs to create an HDFS client and register the shutdownHook. We 
> then hit the IllegalStateException, which masks the actual exception that 
> caused the Spark program to fail during pre-launch.
> We propose to swallow the IllegalStateException in KeyProviderCache and log a 
> warning. The TCP connection between the client and the NameNode should be 
> closed by the OS when the process shuts down. 
> Example stacktrace
> {code:java}
> 13-09-2022 14:39:42 PDT INFO - 22/09/13 21:39:41 ERROR util.Utils: Uncaught 
> exception in thread shutdown-hook-0   
> 13-09-2022 14:39:42 PDT INFO - java.lang.IllegalStateException: Shutdown in 
> progress, cannot add a shutdownHook    
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:299)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.KeyProviderCache.(KeyProviderCache.java:71)      
>     
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.ClientContext.(ClientContext.java:130)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.ClientContext.get(ClientContext.java:167)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:383)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:287)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:159)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3261)        
>   
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3310)       
>    
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3278)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem.get(FileSystem.java:475)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.deploy.yarn.ApplicationMaster.cleanupStagingDir(ApplicationMaster.scala:675)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.deploy.yarn.ApplicationMaster.$anonfun$run$2(ApplicationMaster.scala:259)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)    
>       
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)        
>   
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2023)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)        
>   
> 13-09-2022 14:39:42 PDT INFO - at scala.util.Try$.apply(Try.scala:213)        
>   
> 13-09-2022 14:39:42 PDT INFO - at 
> 

[jira] [Updated] (HDFS-16852) Swallow IllegalStateException in KeyProviderCache

2022-11-22 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-16852:

Summary: Swallow IllegalStateException in KeyProviderCache  (was: swallow 
IllegalStateException in KeyProviderCache)

> Swallow IllegalStateException in KeyProviderCache
> -
>
> Key: HDFS-16852
> URL: https://issues.apache.org/jira/browse/HDFS-16852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>
> When an HDFS client is created, it registers a shutdown hook with the 
> ShutdownHookManager. ShutdownHookManager doesn't allow adding a new 
> shutdownHook when the process is already in shutdown and throws an 
> IllegalStateException.
> This behavior is not ideal when a Spark program fails during pre-launch. In 
> that case, during shutdown, Spark calls cleanupStagingDir() to clean the 
> staging dir. In cleanupStagingDir(), it creates a FileSystem object to talk 
> to HDFS. However, since this is the first use of a FileSystem object in that 
> process, it needs to create an HDFS client and register the shutdownHook. We 
> then hit the IllegalStateException, which masks the actual exception that 
> caused the Spark program to fail during pre-launch.
> We propose to swallow the IllegalStateException in KeyProviderCache and log a 
> warning. The TCP connection between the client and the NameNode should be 
> closed by the OS when the process shuts down. 
> Example stacktrace
> {code:java}
> 13-09-2022 14:39:42 PDT INFO - 22/09/13 21:39:41 ERROR util.Utils: Uncaught 
> exception in thread shutdown-hook-0   
> 13-09-2022 14:39:42 PDT INFO - java.lang.IllegalStateException: Shutdown in 
> progress, cannot add a shutdownHook    
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:299)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.KeyProviderCache.(KeyProviderCache.java:71)      
>     
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.ClientContext.(ClientContext.java:130)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.ClientContext.get(ClientContext.java:167)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:383)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:287)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:159)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3261)        
>   
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3310)       
>    
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3278)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem.get(FileSystem.java:475)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.deploy.yarn.ApplicationMaster.cleanupStagingDir(ApplicationMaster.scala:675)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.deploy.yarn.ApplicationMaster.$anonfun$run$2(ApplicationMaster.scala:259)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)    
>       
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)        
>   
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2023)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)        
>   
> 13-09-2022 14:39:42 PDT INFO - at scala.util.Try$.apply(Try.scala:213)        
>   
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> 

[jira] [Assigned] (HDFS-16852) swallow IllegalStateException in KeyProviderCache

2022-11-22 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin reassigned HDFS-16852:
---

Assignee: Xing Lin

> swallow IllegalStateException in KeyProviderCache
> -
>
> Key: HDFS-16852
> URL: https://issues.apache.org/jira/browse/HDFS-16852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>
> When an HDFS client is created, it registers a shutdown hook with the 
> ShutdownHookManager. ShutdownHookManager doesn't allow adding a new 
> shutdownHook when the process is already in shutdown and throws an 
> IllegalStateException.
> This behavior is not ideal when a Spark program fails during pre-launch. In 
> that case, during shutdown, Spark calls cleanupStagingDir() to clean the 
> staging dir. In cleanupStagingDir(), it creates a FileSystem object to talk 
> to HDFS. However, since this is the first use of a FileSystem object in that 
> process, it needs to create an HDFS client and register the shutdownHook. We 
> then hit the IllegalStateException, which masks the actual exception that 
> caused the Spark program to fail during pre-launch.
> We propose to swallow the IllegalStateException in KeyProviderCache and log a 
> warning. The TCP connection between the client and the NameNode should be 
> closed by the OS when the process shuts down. 
> Example stacktrace
> {code:java}
> 13-09-2022 14:39:42 PDT INFO - 22/09/13 21:39:41 ERROR util.Utils: Uncaught 
> exception in thread shutdown-hook-0   
> 13-09-2022 14:39:42 PDT INFO - java.lang.IllegalStateException: Shutdown in 
> progress, cannot add a shutdownHook    
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:299)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.KeyProviderCache.(KeyProviderCache.java:71)      
>     
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.ClientContext.(ClientContext.java:130)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.ClientContext.get(ClientContext.java:167)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:383)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:287)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:159)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3261)        
>   
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3310)       
>    
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3278)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem.get(FileSystem.java:475)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.deploy.yarn.ApplicationMaster.cleanupStagingDir(ApplicationMaster.scala:675)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.deploy.yarn.ApplicationMaster.$anonfun$run$2(ApplicationMaster.scala:259)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)    
>       
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)        
>   
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2023)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)        
>   
> 13-09-2022 14:39:42 PDT INFO - at scala.util.Try$.apply(Try.scala:213)        
>   
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)       
>    
> 

[jira] [Created] (HDFS-16852) swallow IllegalStateException in KeyProviderCache

2022-11-22 Thread Xing Lin (Jira)
Xing Lin created HDFS-16852:
---

 Summary: swallow IllegalStateException in KeyProviderCache
 Key: HDFS-16852
 URL: https://issues.apache.org/jira/browse/HDFS-16852
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Reporter: Xing Lin


When an HDFS client is created, it registers a shutdown hook with the 
ShutdownHookManager. ShutdownHookManager doesn't allow adding a new shutdownHook 
when the process is already in shutdown and throws an IllegalStateException.

This behavior is not ideal when a Spark program fails during pre-launch. In that 
case, during shutdown, Spark calls cleanupStagingDir() to clean the staging dir. 
In cleanupStagingDir(), it creates a FileSystem object to talk to HDFS. However, 
since this is the first use of a FileSystem object in that process, it needs to 
create an HDFS client and register the shutdownHook. We then hit the 
IllegalStateException, which masks the actual exception that caused the Spark 
program to fail during pre-launch.

We propose to swallow the IllegalStateException in KeyProviderCache and log a 
warning. The TCP connection between the client and the NameNode should be closed 
by the OS when the process shuts down. 
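
A minimal sketch of one way to register the shutdown hook only when not already 
in shutdown (cf. the summary update above), assuming the Hadoop 
ShutdownHookManager API; the cache field and priority constant are hypothetical 
names for illustration, not the actual KeyProviderCache code.
{code:java}
// Sketch only: register the hook only if the JVM is not already shutting down,
// instead of letting addShutdownHook() throw an IllegalStateException.
ShutdownHookManager shm = ShutdownHookManager.get();
if (!shm.isShutdownInProgress()) {
  shm.addShutdownHook(
      () -> cache.invalidateAll(),   // 'cache' is a hypothetical field name
      SHUTDOWN_HOOK_PRIORITY);       // hypothetical priority constant
} else {
  LOG.warn("Not registering KeyProviderCache shutdown hook: shutdown already in progress.");
}
{code}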

Example stacktrace
{code:java}
13-09-2022 14:39:42 PDT INFO - 22/09/13 21:39:41 ERROR util.Utils: Uncaught 
exception in thread shutdown-hook-0   
13-09-2022 14:39:42 PDT INFO - java.lang.IllegalStateException: Shutdown in 
progress, cannot add a shutdownHook    
13-09-2022 14:39:42 PDT INFO - at 
org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:299)
          
13-09-2022 14:39:42 PDT INFO - at 
org.apache.hadoop.hdfs.KeyProviderCache.(KeyProviderCache.java:71)        
  
13-09-2022 14:39:42 PDT INFO - at 
org.apache.hadoop.hdfs.ClientContext.(ClientContext.java:130)          
13-09-2022 14:39:42 PDT INFO - at 
org.apache.hadoop.hdfs.ClientContext.get(ClientContext.java:167)          
13-09-2022 14:39:42 PDT INFO - at 
org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:383)          
13-09-2022 14:39:42 PDT INFO - at 
org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:287)          
13-09-2022 14:39:42 PDT INFO - at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:159)
          
13-09-2022 14:39:42 PDT INFO - at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3261)          
13-09-2022 14:39:42 PDT INFO - at 
org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)          
13-09-2022 14:39:42 PDT INFO - at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3310)         
 
13-09-2022 14:39:42 PDT INFO - at 
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3278)          
13-09-2022 14:39:42 PDT INFO - at 
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:475)          
13-09-2022 14:39:42 PDT INFO - at 
org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)          
13-09-2022 14:39:42 PDT INFO - at 
org.apache.spark.deploy.yarn.ApplicationMaster.cleanupStagingDir(ApplicationMaster.scala:675)
          
13-09-2022 14:39:42 PDT INFO - at 
org.apache.spark.deploy.yarn.ApplicationMaster.$anonfun$run$2(ApplicationMaster.scala:259)
          
13-09-2022 14:39:42 PDT INFO - at 
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)      
    
13-09-2022 14:39:42 PDT INFO - at 
org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
          
13-09-2022 14:39:42 PDT INFO - at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)          
13-09-2022 14:39:42 PDT INFO - at 
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2023)          
13-09-2022 14:39:42 PDT INFO - at 
org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
          
13-09-2022 14:39:42 PDT INFO - at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)          
13-09-2022 14:39:42 PDT INFO - at scala.util.Try$.apply(Try.scala:213)          
13-09-2022 14:39:42 PDT INFO - at 
org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
          
13-09-2022 14:39:42 PDT INFO - at 
org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
          
13-09-2022 14:39:42 PDT INFO - at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)         
 
13-09-2022 14:39:42 PDT INFO - at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)          
13-09-2022 14:39:42 PDT INFO - at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
         
13-09-2022 14:39:42 PDT INFO - at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
         
13-09-2022 14:39:42 PDT INFO - at java.lang.Thread.run(Thread.java:748)         
 

[jira] [Commented] (HDFS-15505) Fix NullPointerException when call getAdditionalDatanode method with null extendedBlock parameter

2022-11-12 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17632720#comment-17632720
 ] 

Xing Lin commented on HDFS-15505:
-

There is no update from [~hangc] on the PR. I am not sure whether he still 
plans to finish his PR. 

[~jianghuazhu], do you have bandwidth to pick this up?

> Fix NullPointerException when call getAdditionalDatanode method with null 
> extendedBlock parameter
> -
>
> Key: HDFS-15505
> URL: https://issues.apache.org/jira/browse/HDFS-15505
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfsclient
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2, 3.2.0, 3.1.1, 3.0.3, 3.1.2, 
> 3.3.0, 3.2.1, 3.1.3
>Reporter: hang chen
>Priority: Major
>
> When a client calls the getAdditionalDatanode method, it initializes a 
> GetAdditionalDatanodeRequestProto and sends an RPC request to the 
> Router/NameNode. However, if getAdditionalDatanode is called with a null 
> extendedBlock parameter, it sets the proto's blk field from null, which 
> causes a NullPointerException. The code is shown below.
> {code:java}
> // code placeholder
> GetAdditionalDatanodeRequestProto req = GetAdditionalDatanodeRequestProto
>  .newBuilder()
>  .setSrc(src)
>  .setFileId(fileId)
>  .setBlk(PBHelperClient.convert(blk))
>  .addAllExistings(PBHelperClient.convert(existings))
>  .addAllExistingStorageUuids(Arrays.asList(existingStorageIDs))
>  .addAllExcludes(PBHelperClient.convert(excludes))
>  .setNumAdditionalNodes(numAdditionalNodes)
>  .setClientName(clientName)
>  .build();{code}
>  
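
One possible way to make this failure clearer, sketched under the assumption 
that blk is effectively required by the request proto (so the client should fail 
fast rather than build the request); this is an illustration, not the committed 
fix:
{code:java}
// Hypothetical guard before building GetAdditionalDatanodeRequestProto: turn the
// deep NullPointerException from PBHelperClient.convert(null) into an explicit,
// descriptive error at the call site.
if (blk == null) {
  throw new IllegalArgumentException(
      "getAdditionalDatanode: extendedBlock must not be null for file " + src);
}
{code}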



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16838) Fix NPE in testAddRplicaProcessorForAddingReplicaInMap

2022-11-12 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-16838:

Description: 
There is an NPE in 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList#testAddRplicaProcessorForAddingReplicaInMap
 if we run this UT individually. The related code is shown below:

 
{code:java}
public void testAddRplicaProcessorForAddingReplicaInMap() throws Exception {
  // BUG here
  BlockPoolSlice.reInitializeAddReplicaThreadPool();
  Configuration cnf = new Configuration();
  int poolSize = 5; 
  ...
}{code}
 

_addReplicaThreadPool_ may not have been initialized (and is therefore null) if 
we run the testAddRplicaProcessorForAddingReplicaInMap unit test individually.
{code:java}
@VisibleForTesting
public static void reInitializeAddReplicaThreadPool() {
  addReplicaThreadPool.shutdown();
  addReplicaThreadPool = null;
}{code}

  was:
There is a NPE in 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList#testAddRplicaProcessorForAddingReplicaInMap
 if we run this UT individually. And the related bug as bellow:

 
{code:java}
public void testAddRplicaProcessorForAddingReplicaInMap() throws Exception {
  // BUG here
  BlockPoolSlice.reInitializeAddReplicaThreadPool();
  Configuration cnf = new Configuration();
  int poolSize = 5; 
  ...
}{code}
 


> Fix NPE in testAddRplicaProcessorForAddingReplicaInMap
> --
>
> Key: HDFS-16838
> URL: https://issues.apache.org/jira/browse/HDFS-16838
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>
> There is an NPE in 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList#testAddRplicaProcessorForAddingReplicaInMap
>  if we run this UT individually. The related code is shown below:
>  
> {code:java}
> public void testAddRplicaProcessorForAddingReplicaInMap() throws Exception {
>   // BUG here
>   BlockPoolSlice.reInitializeAddReplicaThreadPool();
>   Configuration cnf = new Configuration();
>   int poolSize = 5; 
>   ...
> }{code}
>  
> _addReplicaThreadPool_ may not have been initialized (and is therefore null) if 
> we run the testAddRplicaProcessorForAddingReplicaInMap unit test individually.
> {code:java}
> @VisibleForTesting
> public static void reInitializeAddReplicaThreadPool() {
>   addReplicaThreadPool.shutdown();
>   addReplicaThreadPool = null;
> }{code}
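
A minimal sketch of one possible guard, assuming the fix is simply to tolerate an 
uninitialized pool when the test is run in isolation (illustration only, not the 
committed change):
{code:java}
@VisibleForTesting
public static void reInitializeAddReplicaThreadPool() {
  // Skip the shutdown when the pool was never created (e.g. the test ran alone);
  // calling shutdown() on a null pool is exactly what triggers the NPE.
  if (addReplicaThreadPool != null) {
    addReplicaThreadPool.shutdown();
    addReplicaThreadPool = null;
  }
}
{code}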



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16816) RBF: auto-create user home dir for trash paths by router

2022-11-02 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627993#comment-17627993
 ] 

Xing Lin commented on HDFS-16816:
-

The /user dir may not be writable by regular users. In that case, assume userA's 
home dir has not been created under /user: when userA calls 
moveToTrash(/dir/file), it hits a permission-denied error while trying to create 
the dir /user/userA/.Trash/Current/dir, because userA does not have write 
permission on /user. 

> RBF: auto-create user home dir for trash paths by router
> 
>
> Key: HDFS-16816
> URL: https://issues.apache.org/jira/browse/HDFS-16816
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
>
> In RBF, trash files are moved to the trash root under the user's home dir at 
> the corresponding namespace/namenode where the files reside. This was added in 
> HDFS-16024. When the user's home dir has not been created beforehand at a 
> namenode, we run into permission-denied exceptions when trying to create the 
> parent dir for the trash file before moving the file into it. We propose to 
> enhance the Router to auto-create a user's home dir at the namenode for trash 
> paths, using the router's identity (which is assumed to be a super-user).
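
A rough sketch of the idea, not the actual Router code; the helper name, the 
permission bits, and the group choice are assumptions:
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

// Hypothetical router-side helper: before forwarding an operation on a trash
// path /user/<user>/.Trash/..., make sure the user's home dir exists at the
// target namenode, creating it with the router's (super-user) identity.
class TrashHomeHelper {
  static void ensureHomeDirExists(FileSystem nnFs, String username)
      throws IOException {
    Path home = new Path("/user/" + username);
    if (!nnFs.exists(home)) {
      nnFs.mkdirs(home, new FsPermission((short) 0700));
      nnFs.setOwner(home, username, username); // group == user is an assumption
    }
  }
}
{code}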



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16818) RBF TestRouterRPCMultipleDestinationMountTableResolver non-deterministic unit tests failures

2022-10-30 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17626178#comment-17626178
 ] 

Xing Lin commented on HDFS-16818:
-

Even after we unset the storage policy, we still get the HOT policy back. 
{code:java}
routerFs.unsetStoragePolicy(mountFile);
routerFs.removeXAttr(mountFile, name);
assertEquals(0, nnFs.getXAttrs(nameSpaceFile).size());

assertEquals("HOT", nnFs.getStoragePolicy(nameSpaceFile).getName());{code}

> RBF TestRouterRPCMultipleDestinationMountTableResolver non-deterministic unit 
> tests failures
> 
>
> Key: HDFS-16818
> URL: https://issues.apache.org/jira/browse/HDFS-16818
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Priority: Major
>
> TestRouterRPCMultipleDestinationMountTableResolver fails a couple of times 
> nondeterministically when run multiple times. 
> I repeated the following commands for 10+ times against 
> 454157a3844cdd6c92ef650af6c3b323cbec88af in trunk and observed two types of 
> failed runs.
> {code:java}
> mvn test -Dtest="TestRouterRPCMultipleDestinationMountTableResolver"{code}
>  
> Failed run 1 output:
> {code:java}
> [ERROR] Failures:
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashAllOrder:177->testInvocation:221->testDirec
> toryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 
> expected:<[COLD]> but was:<[HOT]>
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashOrder:193->testInvocation:221->testDirector
> yAndFileLevelInvocation:298->verifyDirectoryLevelInvocations:395 
> expected:<[COLD]> but was:<[HOT]>
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationLocalOrder:201->testInvocation:221->testDirecto
> ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 
> expected:<[COLD]> but was:<[HOT]>
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationRandomOrder:185->testInvocation:221->testDirect
> oryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 
> expected:<[COLD]> but was:<[HOT]>
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationSpaceOrder:169->testInvocation:221->testDirecto
> ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 
> expected:<[COLD]> but was:<[HOT]>
> [INFO]
> [ERROR] Tests run: 18, Failures: 5, Errors: 0, Skipped: 0{code}
>  
> Failed run 2 output:
> {code:java}
> [ERROR] Failures:
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testECMultipleDestinations:430
> [ERROR] Errors:
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashAllOrder:177->testInvocation:221->testDirec
> toryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 
> NullPointer
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashOrder:193->testInvocation:221->testDirector
> yAndFileLevelInvocation:298->verifyDirectoryLevelInvocations:397 NullPointer
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationLocalOrder:201->testInvocation:221->testDirecto
> ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationRandomOrder:185->testInvocation:221->testDirect
> oryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationSpaceOrder:169->testInvocation:221->testDirecto
> ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer
> [INFO]
> [ERROR] Tests run: 18, Failures: 1, Errors: 5, Skipped: 0{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16818) RBF TestRouterRPCMultipleDestinationMountTableResolver non-deterministic unit tests failures

2022-10-25 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17624081#comment-17624081
 ] 

Xing Lin commented on HDFS-16818:
-

This unit test failure is non-deterministic. It may be related to HDFS-16740.

> RBF TestRouterRPCMultipleDestinationMountTableResolver non-deterministic unit 
> tests failures
> 
>
> Key: HDFS-16818
> URL: https://issues.apache.org/jira/browse/HDFS-16818
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Priority: Major
>
> TestRouterRPCMultipleDestinationMountTableResolver fails a couple of times 
> nondeterministically when run multiple times. 
> I repeated the following commands for 10+ times against 
> 454157a3844cdd6c92ef650af6c3b323cbec88af in trunk and observed two types of 
> failed runs.
> {code:java}
> mvn test -Dtest="TestRouterRPCMultipleDestinationMountTableResolver"{code}
>  
> Failed run 1 output:
> {code:java}
> [ERROR] Failures:
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashAllOrder:177->testInvocation:221->testDirec
> toryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 
> expected:<[COLD]> but was:<[HOT]>
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashOrder:193->testInvocation:221->testDirector
> yAndFileLevelInvocation:298->verifyDirectoryLevelInvocations:395 
> expected:<[COLD]> but was:<[HOT]>
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationLocalOrder:201->testInvocation:221->testDirecto
> ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 
> expected:<[COLD]> but was:<[HOT]>
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationRandomOrder:185->testInvocation:221->testDirect
> oryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 
> expected:<[COLD]> but was:<[HOT]>
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationSpaceOrder:169->testInvocation:221->testDirecto
> ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 
> expected:<[COLD]> but was:<[HOT]>
> [INFO]
> [ERROR] Tests run: 18, Failures: 5, Errors: 0, Skipped: 0{code}
>  
> Failed run 2 output:
> {code:java}
> [ERROR] Failures:
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testECMultipleDestinations:430
> [ERROR] Errors:
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashAllOrder:177->testInvocation:221->testDirec
> toryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 
> NullPointer
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashOrder:193->testInvocation:221->testDirector
> yAndFileLevelInvocation:298->verifyDirectoryLevelInvocations:397 NullPointer
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationLocalOrder:201->testInvocation:221->testDirecto
> ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationRandomOrder:185->testInvocation:221->testDirect
> oryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer
> [ERROR]   
> TestRouterRPCMultipleDestinationMountTableResolver.testInvocationSpaceOrder:169->testInvocation:221->testDirecto
> ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer
> [INFO]
> [ERROR] Tests run: 18, Failures: 1, Errors: 5, Skipped: 0{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16818) RBF TestRouterRPCMultipleDestinationMountTableResolver non-deterministic unit tests failures

2022-10-24 Thread Xing Lin (Jira)
Xing Lin created HDFS-16818:
---

 Summary: RBF TestRouterRPCMultipleDestinationMountTableResolver 
non-deterministic unit tests failures
 Key: HDFS-16818
 URL: https://issues.apache.org/jira/browse/HDFS-16818
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: rbf
Affects Versions: 3.4.0
Reporter: Xing Lin


TestRouterRPCMultipleDestinationMountTableResolver fails a couple of times 
nondeterministically when run multiple times. 

I repeated the following commands for 10+ times against 
454157a3844cdd6c92ef650af6c3b323cbec88af in trunk and observed two types of 
failed runs.
{code:java}
mvn test -Dtest="TestRouterRPCMultipleDestinationMountTableResolver"{code}
 

Failed run 1 output:
{code:java}
[ERROR] Failures:
[ERROR]   
TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashAllOrder:177->testInvocation:221->testDirec
toryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 
expected:<[COLD]> but was:<[HOT]>
[ERROR]   
TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashOrder:193->testInvocation:221->testDirector
yAndFileLevelInvocation:298->verifyDirectoryLevelInvocations:395 
expected:<[COLD]> but was:<[HOT]>
[ERROR]   
TestRouterRPCMultipleDestinationMountTableResolver.testInvocationLocalOrder:201->testInvocation:221->testDirecto
ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 
expected:<[COLD]> but was:<[HOT]>
[ERROR]   
TestRouterRPCMultipleDestinationMountTableResolver.testInvocationRandomOrder:185->testInvocation:221->testDirect
oryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 
expected:<[COLD]> but was:<[HOT]>
[ERROR]   
TestRouterRPCMultipleDestinationMountTableResolver.testInvocationSpaceOrder:169->testInvocation:221->testDirecto
ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 
expected:<[COLD]> but was:<[HOT]>
[INFO]
[ERROR] Tests run: 18, Failures: 5, Errors: 0, Skipped: 0{code}
 

Failed run 2 output:
{code:java}
[ERROR] Failures:
[ERROR]   
TestRouterRPCMultipleDestinationMountTableResolver.testECMultipleDestinations:430
[ERROR] Errors:
[ERROR]   
TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashAllOrder:177->testInvocation:221->testDirec
toryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer
[ERROR]   
TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashOrder:193->testInvocation:221->testDirector
yAndFileLevelInvocation:298->verifyDirectoryLevelInvocations:397 NullPointer
[ERROR]   
TestRouterRPCMultipleDestinationMountTableResolver.testInvocationLocalOrder:201->testInvocation:221->testDirecto
ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer
[ERROR]   
TestRouterRPCMultipleDestinationMountTableResolver.testInvocationRandomOrder:185->testInvocation:221->testDirect
oryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer
[ERROR]   
TestRouterRPCMultipleDestinationMountTableResolver.testInvocationSpaceOrder:169->testInvocation:221->testDirecto
ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer
[INFO]
[ERROR] Tests run: 18, Failures: 1, Errors: 5, Skipped: 0{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16816) RBF: auto-create user home dir for trash paths by router

2022-10-24 Thread Xing Lin (Jira)
Xing Lin created HDFS-16816:
---

 Summary: RBF: auto-create user home dir for trash paths by router
 Key: HDFS-16816
 URL: https://issues.apache.org/jira/browse/HDFS-16816
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: rbf
Reporter: Xing Lin


In RBF, trash files are moved to the trash root under the user's home dir at the 
corresponding namespace/namenode where the files reside. This was added in 
HDFS-16024. When the user's home dir has not been created beforehand at a 
namenode, we run into permission-denied exceptions when trying to create the 
parent dir for the trash file before moving the file into it. We propose to 
enhance the Router to auto-create a user's home dir at the namenode for trash 
paths, using the router's identity (which is assumed to be a super-user).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16790) rbf wrong path when destination dir is not created

2022-10-02 Thread Xing Lin (Jira)
Xing Lin created HDFS-16790:
---

 Summary: rbf wrong path when destination dir is not created
 Key: HDFS-16790
 URL: https://issues.apache.org/jira/browse/HDFS-16790
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: rbf
Affects Versions: 3.4.0
Reporter: Xing Lin


mount table at router
{code:java}
$HADOOP_HOME/bin/hdfs dfsrouteradmin -ls
/data1                    ns1->/data
/data2                    ns2->/data
/data3                    ns3->/data
{code}
At a client node, when /data is not created in ns2, the error message shows a 
wrong path.
{code:java}
utos@c01:/usr/local/bin/hadoop-3.4.0-SNAPSHOT$ bin/hadoop dfs -ls 
hdfs://ns-fed/data2
ls: File hdfs://ns-fed/data2/data2 does not exist.

utos@c01:/usr/local/bin/hadoop-3.4.0-SNAPSHOT$ bin/hadoop dfs -ls 
hdfs://ns-fed/data3
-rw-r--r--   3 utos supergroup  0 2022-10-02 17:35 
hdfs://ns-fed/data3/file3
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16191) [FGL] Fix FSImage loading issues on dynamic partitions

2021-09-13 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414719#comment-17414719
 ] 

Xing Lin commented on HDFS-16191:
-

Yeah, that does not sound right: when there are 256 partitions, we insert range 
keys [0, 16385], [1, 16385], [2, 16385], ..., [255, 16385]. If more partitions 
need to be created, the next ones should be created with range keys 
[256, 16385], [257, 16385], [258, 16385], ... When the partition count is 
changed, we also need to update the indexOf() method.

We need a holistic approach to support dynamic partition sizes. 
 # Do we support arbitrary partition counts or only powers of 2? The latter is 
probably simpler.   
 # Whenever the partition count is changed, we need to re-shuffle the keys in 
the PartitionedGSet. Essentially, it is a rehashing operation: if we double the 
partition count from 256 to 512, instead of computing indexKey % 256 we need to 
compute indexKey % 512 (see the sketch below).
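
A sketch of what a power-of-two-only dynamic index function could look like; the 
extra partitionCount parameter (or making indexOf() non-static) is precisely the 
open question above, so this is an illustration rather than a patch:
{code:java}
// Hypothetical variant of indexOf() that takes the current partition count,
// restricted to powers of two so the bit-mask trick still works.
public static long indexOf(long[] key, int partitionCount) {
  assert Integer.bitCount(partitionCount) == 1 : "partition count must be a power of two";
  if (key[key.length - 1] == INodeId.ROOT_INODE_ID) {
    return key[0];
  }
  long idx = LARGE_PRIME * key[0];
  // The mask maps idx into [0, partitionCount), which only works because
  // partitionCount is a power of two.
  return (idx ^ (idx >> 32)) & (partitionCount - 1);
}
{code}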

> [FGL] Fix FSImage loading issues on dynamic partitions
> --
>
> Key: HDFS-16191
> URL: https://issues.apache.org/jira/browse/HDFS-16191
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When new partitions get added into PartitionedGSet, the iterator does not 
> consider the new partitions; it always iterates over the static partition 
> count. This leads to a flood of warn messages as below.
> 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139780 when saving the leases.
> 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139781 when saving the leases.
> 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139784 when saving the leases.
> 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139785 when saving the leases.
> 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139786 when saving the leases.
> 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139788 when saving the leases.
> 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139789 when saving the leases.
> 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139790 when saving the leases.
> 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139791 when saving the leases.
> 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139793 when saving the leases.
> 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139795 when saving the leases.
> 2021-08-28 03:23:19,422 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139796 when saving the leases.
> 2021-08-28 03:23:19,422 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139797 when saving the leases.
> 2021-08-28 03:23:19,422 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139800 when saving the leases.
> 2021-08-28 03:23:19,422 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139801 when saving the leases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16191) [FGL] Fix FSImage loading issues on dynamic partitions

2021-09-08 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412107#comment-17412107
 ] 

Xing Lin commented on HDFS-16191:
-

Hi [~prasad-acit], 

Thanks for working on this!

As you asked in the GitHub pull request, we don't support more partitions than 
NUM_RANGES_STATIC right now.
The key of an inode is calculated and then taken modulo NUM_RANGES_STATIC in 
indexOf(). As a result, any partition that has an id larger than 
NUM_RANGES_STATIC will receive no insertions.

If we want to support a dynamic number of partitions, we need to modify the 
indexOf() implementation as well: replace `& (INodeMap.NUM_RANGES_STATIC -1)` 
with something like `% partition_num`. Also note that indexOf() is a static 
method, which means we cannot access instance variables from it. I don't 
know how to handle that yet. 
{code:java}
public static long indexOf(long[] key) {
  if (key[key.length - 1] == INodeId.ROOT_INODE_ID) {
    return key[0];
  }
  long idx = LARGE_PRIME * key[0];
  idx = (idx ^ (idx >> 32)) & (INodeMap.NUM_RANGES_STATIC - 1);
  return idx;
}
{code}

> [FGL] Fix FSImage loading issues on dynamic partitions
> --
>
> Key: HDFS-16191
> URL: https://issues.apache.org/jira/browse/HDFS-16191
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When new partitions get added into PartitionedGSet, the iterator does not 
> consider the new partitions; it always iterates over the static partition 
> count. This leads to a flood of warn messages as below.
> 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139780 when saving the leases.
> 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139781 when saving the leases.
> 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139784 when saving the leases.
> 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139785 when saving the leases.
> 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139786 when saving the leases.
> 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139788 when saving the leases.
> 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139789 when saving the leases.
> 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139790 when saving the leases.
> 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139791 when saving the leases.
> 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139793 when saving the leases.
> 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139795 when saving the leases.
> 2021-08-28 03:23:19,422 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139796 when saving the leases.
> 2021-08-28 03:23:19,422 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139797 when saving the leases.
> 2021-08-28 03:23:19,422 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139800 when saving the leases.
> 2021-08-28 03:23:19,422 WARN namenode.FSImageFormatPBINode: Fail to find 
> inode 139801 when saving the leases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-16128) [FGL] Add support for saving/loading an FS Image for PartitionedGSet

2021-09-02 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408949#comment-17408949
 ] 

Xing Lin edited comment on HDFS-16128 at 9/2/21, 4:14 PM:
--

Hi [~prasad-acit], 

The issue is that, given an inode as a long value, the function will first 
construct an INode object. But we don't know what the parent INode is for this 
INode, so we cannot determine which partition to search. That is why we fall 
back to iterating over all partitions to search for that inode.

We construct a new INode object as follows in this function, before we do the 
search.
{code:java}
INode inode = new INodeDirectory(id, null,
 new PermissionStatus("", "", new FsPermission((short) 0)), 0);{code}
 

You should also take a look at this function: public INode get(INode inode).

Inside this function, we first check whether there are KEY_DEPTH - 1 levels of 
parent Inodes. If there are sufficient parent Inodes to construct the partition 
key, then we go directly with map.get(inode). Otherwise, we fall back to 
get(long inode), which basically scans all partitions and searches for the inode. 

Hope this answers your question. 

 


was (Author: xinglin):
Hi [~prasad-acit], 

The issue is given a inode as a long value, the function will first construct a 
INode object. But we don't know what the parent Inode is for this INode, thus 
we can not determine which partition to search for.  That is why we fall back 
to iterate over all partitions to search for that inode.


{code:java}
INode inode = new INodeDirectory(id, null,
 new PermissionStatus("", "", new FsPermission((short) 0)), 0);{code}
You should also take a look at this function: public INode get(INode inode).

Inside this function, we first check whether there are KEY_DEPTH - 1 levels of 
parent Inodes. If there are sufficient parent Inodes, then we go directly with 
map.get(inode). Otherwise, we fall back to get(long inode), which basically 
scan all partitions and search for the inode. 

Hope this answers your question. 

 

> [FGL] Add support for saving/loading an FS Image for PartitionedGSet
> 
>
> Key: HDFS-16128
> URL: https://issues.apache.org/jira/browse/HDFS-16128
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs, namenode
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
>  Labels: pull-request-available
> Fix For: Fine-Grained Locking
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Add support to save Inodes stored in PartitionedGSet when saving an FS image 
> and load Inodes into PartitionedGSet from a saved FS image.
> h1. Saving FSImage
> *Original HDFS design*: iterate every inode in inodeMap and save them into 
> the FSImage file. 
> *FGL*: no change is needed here, since PartitionedGSet also provides an 
> iterator interface, to iterate over inodes stored in partitions. 
> h1. Loading an HDFS 
> *Original HDFS design*: it first loads the FSImage files and then loads edit 
> logs for recent changes. FSImage files contain different sections, including 
> INodeSections and INodeDirectorySections. An InodeSection contains serialized 
> Inodes objects and the INodeDirectorySection contains the parent inode for an 
> Inode. When loading an FSImage, the system first loads the INodeSections and 
> then loads the INodeDirectorySections, to set the parent inode for each inode. 
> After FSImage files are loaded, edit logs are then loaded. Edit logs contain 
> recent changes to the filesystem, including inode creation/deletion. For a 
> newly created INode, the parent inode is set before it is added to the 
> inodeMap.
> *FGL*: when adding an Inode into the partitionedGSet, we need the parent 
> inode of an inode, in order to determine which partition to store that inode, 
> when NAMESPACE_KEY_DEPTH = 2. Thus, in FGL, when loading FSImage files, we 
> used a temporary LightweightGSet (inodeMapTemp), to store inodes. When 
> LoadFSImage is done, the parent inode for all existing inodes in FSImage 
> files is set. We can now move the inodes into a partitionedGSet. Load edit 
> logs can work as usual, as the parent inode for an inode is set before it is 
> added to the inodeMap. 
> In theory, PartitionedGSet can store inodes without their parent inodes being 
> set; all such inodes would be stored in the 0th partition. However, we decided 
> to use a temporary LightweightGSet (inodeMapTemp) to store these inodes, to 
> make this case more transparent.
>  
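
A condensed sketch of the two-phase load outlined above; apart from 
LightweightGSet/PartitionedGSet, the helper names below are assumptions for 
illustration, not the actual FSImage loader code:
{code:java}
// Phase 1 (hypothetical helpers): load inodes into a temporary flat map, since
// parents are not known yet and the partition key cannot be computed.
GSet<INode, INodeWithAdditionalFields> inodeMapTemp =
    new LightweightGSet<>(capacity);
loadINodeSections(inodeMapTemp);           // assumed helper
loadINodeDirectorySections(inodeMapTemp);  // assumed helper: sets parent links

// Phase 2: every inode now has its parent set, so it can be hashed to a
// partition; move everything into the partitioned map used by FGL.
PartitionedGSet<INode, INodeWithAdditionalFields> inodeMap =
    newPartitionedInodeMap();              // assumed factory
for (INodeWithAdditionalFields inode : inodeMapTemp) {
  inodeMap.put(inode);
}
{code}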



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16128) [FGL] Add support for saving/loading an FS Image for PartitionedGSet

2021-09-02 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408949#comment-17408949
 ] 

Xing Lin commented on HDFS-16128:
-

Hi [~prasad-acit], 

The issue is that, given an inode as a long value, the function will first 
construct an INode object. But we don't know what the parent INode is for this 
INode, so we cannot determine which partition to search. That is why we fall 
back to iterating over all partitions to search for that inode.


{code:java}
INode inode = new INodeDirectory(id, null,
 new PermissionStatus("", "", new FsPermission((short) 0)), 0);{code}
You should also take a look at this function: public INode get(INode inode).

Inside this function, we first check whether there are KEY_DEPTH - 1 levels of 
parent Inodes. If there are sufficient parent Inodes, then we go directly with 
map.get(inode). Otherwise, we fall back to get(long inode), which basically 
scans all partitions and searches for the inode. 

Hope this answers your question. 
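
A compressed sketch of the two lookup paths described above (not the exact 
INodeMap code); the helper names are assumptions:
{code:java}
// Fast path: if enough ancestors are known, the partition key can be computed
// and the partitioned map is queried directly. Slow path: scan every partition.
INode get(INode inode) {
  if (hasAncestors(inode, KEY_DEPTH - 1)) {           // assumed helper
    return map.get(inode);                            // partition key is computable
  }
  return getByScanningAllPartitions(inode.getId());   // assumed helper: slow path
}
{code}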

 

> [FGL] Add support for saving/loading an FS Image for PartitionedGSet
> 
>
> Key: HDFS-16128
> URL: https://issues.apache.org/jira/browse/HDFS-16128
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs, namenode
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
>  Labels: pull-request-available
> Fix For: Fine-Grained Locking
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Add support to save Inodes stored in PartitionedGSet when saving an FS image 
> and load Inodes into PartitionedGSet from a saved FS image.
> h1. Saving FSImage
> *Original HDFS design*: iterate every inode in inodeMap and save them into 
> the FSImage file. 
> *FGL*: no change is needed here, since PartitionedGSet also provides an 
> iterator interface, to iterate over inodes stored in partitions. 
> h1. Loading an HDFS 
> *Original HDFS design*: it first loads the FSImage files and then loads edit 
> logs for recent changes. FSImage files contain different sections, including 
> INodeSections and INodeDirectorySections. An InodeSection contains serialized 
> Inodes objects and the INodeDirectorySection contains the parent inode for an 
> Inode. When loading an FSImage, the system first loads the INodeSections and 
> then loads the INodeDirectorySections to set the parent inode for each inode. 
> After the FSImage files are loaded, the edit logs are loaded. The edit log 
> contains recent changes to the filesystem, including inode creation and 
> deletion. For a newly created INode, the parent inode is set before it is 
> added to the inodeMap.
> *FGL*: when adding an Inode to the PartitionedGSet, we need the parent inode 
> of that Inode in order to determine which partition to store it in, when 
> NAMESPACE_KEY_DEPTH = 2. Thus, in FGL, when loading FSImage files, we used a 
> temporary LightweightGSet (inodeMapTemp) to store inodes. Once LoadFSImage is 
> done, the parent inode of every inode from the FSImage files is set, and we 
> can move the inodes into the PartitionedGSet. Loading edit logs works as 
> usual, since the parent inode of an inode is set before it is added to the 
> inodeMap.
> In theory, PartitionedGSet could store inodes without their parent inodes 
> set; all such inodes would be stored in the 0th partition. However, we 
> decided to use a temporary LightweightGSet (inodeMapTemp) to store these 
> inodes, to make this case more transparent.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-07-27 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387797#comment-17387797
 ] 

Xing Lin edited comment on HDFS-14703 at 7/27/21, 6:07 AM:
---

[~daryn] Thanks for your comments. I will address your last question and leave 
other questions to [~shv]. :)

 

Regarding the results, we used the standard NNThroughputBenchmark, with 
commands like the following. 
  
{code:java}
./bin/hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs 
file:/// -op mkdirs -threads 200 -dirs 1000 -dirsPerDir 512{code}
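
For anyone rerunning this, my reading of the flags: -op mkdirs picks the 
operation, -threads the number of client threads, -dirs the total number of 
directories to create, and -dirsPerDir the fan-out per directory; -fs file:/// 
runs against an in-process NameNode and bypasses the RPC layer.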
Here is a result from [~prasad-acit], since his QPS numbers are higher than 
what I got. 
{code:java}
BASE:
 common/hadoop-hdfs-3
 2021-05-17 11:17:36,973 INFO 
namenode.NNThroughputBenchmark: — mkdirs inputs —
 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirs = 100
 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrThreads = 200
 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: — mkdirs stats —
 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: # operations: 
100
 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
17718
 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Ops per sec: 
56439.77875606727
 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Average Time: 3
 2021-05-17 11:17:36,973 INFO namenode.FSEditLog: Ending log segment 1, 1031254
PATCH:
 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: — mkdirs inputs —
 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirs = 100
 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrThreads = 200
 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: — mkdirs stats —
 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: # operations: 
100
 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
15010
 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Ops per sec: 
66622.25183211193
 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Average Time: 2
 2021-05-17 11:11:09,331 INFO namenode.FSEditLog: Ending log segment 1, 1031254
{code}
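
As a sanity check on the numbers: Ops per sec is simply # operations divided by 
Elapsed Time (reported in milliseconds), so the same workload finishing in 
15,010 ms instead of 17,718 ms is exactly what shows up as the ~18% gain in ops 
per sec (66,622 vs. 56,440).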
 


was (Author: xinglin):
[~daryn] Thanks for your comments. I will address your last question and leave 
other questions to [~shv]. :)

 

Regarding the results, we used the standard NNThroughputBenchmark, with 
commands like the following. 
 
./bin/hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark *-fs* 
[*file:///*|file:///*] -op mkdirs -threads 200 -dirs 1000 -dirsPerDir 512

Here are a result from [~prasad-acit], since his QPS numbers are higher than 
what I got. 
BASE:
common/hadoop-hdfs-32021-05-17 11:17:36,973 INFO 
namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirs = 100
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs stats  
---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: # operations: 
100
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Elapsed Time: 17718
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
56439.77875606727
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Average Time: 3
2021-05-17 11:17:36,973 INFO namenode.FSEditLog: Ending log segment 1, 1031254

PATCH:
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs 
---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirs = 100
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs stats  
---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: # operations: 
100
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Elapsed Time: 15010
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
66622.25183211193
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Average Time: 2
2021-05-17 11:11:09,331 INFO namenode.FSEditLog: Ending log segment 1, 1031254

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>

[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-07-27 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387797#comment-17387797
 ] 

Xing Lin commented on HDFS-14703:
-

[~daryn] Thanks for your comments. I will address your last question and leave 
other questions to [~shv]. :)

 

Regarding the results, we used the standard NNThroughputBenchmark, with 
commands like the following. 
 
./bin/hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs 
file:/// -op mkdirs -threads 200 -dirs 1000 -dirsPerDir 512

Here is a result from [~prasad-acit], since his QPS numbers are higher than 
what I got. 
BASE:
common/hadoop-hdfs-3
2021-05-17 11:17:36,973 INFO 
namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirs = 100
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs stats  
---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: # operations: 
100
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Elapsed Time: 17718
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
56439.77875606727
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Average Time: 3
2021-05-17 11:17:36,973 INFO namenode.FSEditLog: Ending log segment 1, 1031254

PATCH:
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs 
---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirs = 100
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs stats  
---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: # operations: 
100
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Elapsed Time: 15010
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
66622.25183211193
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Average Time: 2
2021-05-17 11:11:09,331 INFO namenode.FSEditLog: Ending log segment 1, 1031254

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, 
> 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz, 
> NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16128) [FGL] Add support for saving/loading an FS Image for PartitionedGSet

2021-07-21 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384675#comment-17384675
 ] 

Xing Lin commented on HDFS-16128:
-

Instead of loading inodes directly into the final inodeMap, we now split 
loading into two steps: first load the inodes into a LightweightGSet and then 
move them into the PartitionedGSet. This is all done in memory, so hopefully it 
won't cause much performance degradation. 
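
A rough sketch of what the two steps look like (assumed names and a 
deliberately simplified partition function, not the actual FSImage loader 
code):
{code:java}
// Rough sketch of the two-step load, NOT the actual FSImage loader: inodes sit
// in a flat temporary map while their parents are still unknown, and are moved
// into parent-keyed partitions once the INodeDirectorySections have been read.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoStepLoadSketch {
  static class Inode {
    final long id;
    long parentId = -1;        // unknown until INodeDirectorySections are read
    Inode(long id) { this.id = id; }
  }

  // Step 1: load serialized inodes into a temporary flat map (inodeMapTemp).
  static Map<Long, Inode> loadIntoTempMap(List<Inode> fromImage) {
    Map<Long, Inode> temp = new HashMap<>();
    for (Inode inode : fromImage) {
      temp.put(inode.id, inode);
    }
    return temp;
  }

  // Step 2: every inode now has its parent set, so each one can be placed in
  // the partition derived from its parent id (the PartitionedGSet analogue).
  static Map<Integer, List<Inode>> moveIntoPartitions(Map<Long, Inode> temp,
                                                      int numPartitions) {
    Map<Integer, List<Inode>> partitions = new HashMap<>();
    for (Inode inode : temp.values()) {
      int p = (int) (inode.parentId % numPartitions);
      partitions.computeIfAbsent(p, k -> new ArrayList<>()).add(inode);
    }
    return partitions;
  }
}
{code}
The second step is a single in-memory pass over the inodes, which is why the 
expected overhead is small.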

Thanks for the +1 from you! 

> [FGL] Add support for saving/loading an FS Image for PartitionedGSet
> 
>
> Key: HDFS-16128
> URL: https://issues.apache.org/jira/browse/HDFS-16128
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs, namenode
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
>  Labels: pull-request-available
>
> Add support to save Inodes stored in PartitionedGSet when saving an FS image 
> and load Inodes into PartitionedGSet from a saved FS image.
> h1. Saving FSImage
> *Original HDFS design*: iterate every inode in inodeMap and save them into 
> the FSImage file. 
> *FGL*: no change is needed here, since PartitionedGSet also provides an 
> iterator interface, to iterate over inodes stored in partitions. 
> h1. Loading an HDFS 
> *Original HDFS design*: it first loads the FSImage files and then loads edit 
> logs for recent changes. FSImage files contain different sections, including 
> INodeSections and INodeDirectorySections. An InodeSection contains serialized 
> Inodes objects and the INodeDirectorySection contains the parent inode for an 
> Inode. When loading an FSImage, the system first loads the INodeSections and 
> then loads the INodeDirectorySections to set the parent inode for each inode. 
> After the FSImage files are loaded, the edit logs are loaded. The edit log 
> contains recent changes to the filesystem, including inode creation and 
> deletion. For a newly created INode, the parent inode is set before it is 
> added to the inodeMap.
> *FGL*: when adding an Inode to the PartitionedGSet, we need the parent inode 
> of that Inode in order to determine which partition to store it in, when 
> NAMESPACE_KEY_DEPTH = 2. Thus, in FGL, when loading FSImage files, we used a 
> temporary LightweightGSet (inodeMapTemp) to store inodes. Once LoadFSImage is 
> done, the parent inode of every inode from the FSImage files is set, and we 
> can move the inodes into the PartitionedGSet. Loading edit logs works as 
> usual, since the parent inode of an inode is set before it is added to the 
> inodeMap.
> In theory, PartitionedGSet could store inodes without their parent inodes 
> set; all such inodes would be stored in the 0th partition. However, we 
> decided to use a temporary LightweightGSet (inodeMapTemp) to store these 
> inodes, to make this case more transparent.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16128) [FGL] Add support for saving/loading an FS Image for PartitionedGSet

2021-07-18 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382941#comment-17382941
 ] 

Xing Lin commented on HDFS-16128:
-

[~prasad-acit] Thanks for your comments. One of them is a bug in my code. See 
my comments in the pull request. Updated my pull request.

> [FGL] Add support for saving/loading an FS Image for PartitionedGSet
> 
>
> Key: HDFS-16128
> URL: https://issues.apache.org/jira/browse/HDFS-16128
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs, namenode
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
>  Labels: pull-request-available
>
> Add support to save Inodes stored in PartitionedGSet when saving an FS image 
> and load Inodes into PartitionedGSet from a saved FS image.
> h1. Saving FSImage
> *Original HDFS design*: iterate every inode in inodeMap and save them into 
> the FSImage file. 
> *FGL*: no change is needed here, since PartitionedGSet also provides an 
> iterator interface, to iterate over inodes stored in partitions. 
> h1. Loading an HDFS 
> *Original HDFS design*: it first loads the FSImage files and then loads edit 
> logs for recent changes. FSImage files contain different sections, including 
> INodeSections and INodeDirectorySections. An InodeSection contains serialized 
> Inodes objects and the INodeDirectorySection contains the parent inode for an 
> Inode. When loading an FSImage, the system first loads the INodeSections and 
> then loads the INodeDirectorySections to set the parent inode for each inode. 
> After the FSImage files are loaded, the edit logs are loaded. The edit log 
> contains recent changes to the filesystem, including inode creation and 
> deletion. For a newly created INode, the parent inode is set before it is 
> added to the inodeMap.
> *FGL*: when adding an Inode to the PartitionedGSet, we need the parent inode 
> of that Inode in order to determine which partition to store it in, when 
> NAMESPACE_KEY_DEPTH = 2. Thus, in FGL, when loading FSImage files, we used a 
> temporary LightweightGSet (inodeMapTemp) to store inodes. Once LoadFSImage is 
> done, the parent inode of every inode from the FSImage files is set, and we 
> can move the inodes into the PartitionedGSet. Loading edit logs works as 
> usual, since the parent inode of an inode is set before it is added to the 
> inodeMap.
> In theory, PartitionedGSet could store inodes without their parent inodes 
> set; all such inodes would be stored in the 0th partition. However, we 
> decided to use a temporary LightweightGSet (inodeMapTemp) to store these 
> inodes, to make this case more transparent.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work started] (HDFS-16128) [FGL] Add support for saving/loading an FS Image for PartitionedGSet

2021-07-18 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-16128 started by Xing Lin.
---
> [FGL] Add support for saving/loading an FS Image for PartitionedGSet
> 
>
> Key: HDFS-16128
> URL: https://issues.apache.org/jira/browse/HDFS-16128
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs, namenode
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
>  Labels: pull-request-available
>
> Add support to save Inodes stored in PartitionedGSet when saving an FS image 
> and load Inodes into PartitionedGSet from a saved FS image.
> h1. Saving FSImage
> *Original HDFS design*: iterate every inode in inodeMap and save them into 
> the FSImage file. 
> *FGL*: no change is needed here, since PartitionedGSet also provides an 
> iterator interface, to iterate over inodes stored in partitions. 
> h1. Loading an HDFS 
> *Original HDFS design*: it first loads the FSImage files and then loads edit 
> logs for recent changes. FSImage files contain different sections, including 
> INodeSections and INodeDirectorySections. An InodeSection contains serialized 
> Inodes objects and the INodeDirectorySection contains the parent inode for an 
> Inode. When loading an FSImage, the system first loads the INodeSections and 
> then loads the INodeDirectorySections to set the parent inode for each inode. 
> After the FSImage files are loaded, the edit logs are loaded. The edit log 
> contains recent changes to the filesystem, including inode creation and 
> deletion. For a newly created INode, the parent inode is set before it is 
> added to the inodeMap.
> *FGL*: when adding an Inode to the PartitionedGSet, we need the parent inode 
> of that Inode in order to determine which partition to store it in, when 
> NAMESPACE_KEY_DEPTH = 2. Thus, in FGL, when loading FSImage files, we used a 
> temporary LightweightGSet (inodeMapTemp) to store inodes. Once LoadFSImage is 
> done, the parent inode of every inode from the FSImage files is set, and we 
> can move the inodes into the PartitionedGSet. Loading edit logs works as 
> usual, since the parent inode of an inode is set before it is added to the 
> inodeMap.
> In theory, PartitionedGSet could store inodes without their parent inodes 
> set; all such inodes would be stored in the 0th partition. However, we 
> decided to use a temporary LightweightGSet (inodeMapTemp) to store these 
> inodes, to make this case more transparent.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-07-18 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-14703:

Fix Version/s: (was: Fine-Grained Locking)

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, 
> 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz, 
> NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-07-18 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-14703:

Fix Version/s: Fine-Grained Locking

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Fix For: Fine-Grained Locking
>
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, 
> 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz, 
> NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16128) [FGL] Add support for saving/loading an FS Image for PartitionedGSet

2021-07-17 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-16128:

Description: 
Add support to save Inodes stored in PartitionedGSet when saving an FS image 
and load Inodes into PartitionedGSet from a saved FS image.
h1. Saving FSImage

*Original HDFS design*: iterate every inode in inodeMap and save them into the 
FSImage file. 

*FGL*: no change is needed here, since PartitionedGSet also provides an 
iterator interface, to iterate over inodes stored in partitions. 
h1. Loading an HDFS 

*Original HDFS design*: it first loads the FSImage files and then loads edit 
logs for recent changes. FSImage files contain different sections, including 
INodeSections and INodeDirectorySections. An InodeSection contains serialized 
Inodes objects and the INodeDirectorySection contains the parent inode for an 
Inode. When loading an FSImage, the system first loads the INodeSections and 
then loads the INodeDirectorySections to set the parent inode for each inode. 

After the FSImage files are loaded, the edit logs are loaded. The edit log 
contains recent changes to the filesystem, including inode creation and 
deletion. For a newly created INode, the parent inode is set before it is added 
to the inodeMap.

*FGL*: when adding an Inode to the PartitionedGSet, we need the parent inode of 
that Inode in order to determine which partition to store it in, when 
NAMESPACE_KEY_DEPTH = 2. Thus, in FGL, when loading FSImage files, we used a 
temporary LightweightGSet (inodeMapTemp) to store inodes. Once LoadFSImage is 
done, the parent inode of every inode from the FSImage files is set, and we can 
move the inodes into the PartitionedGSet. Loading edit logs works as usual, 
since the parent inode of an inode is set before it is added to the inodeMap.

In theory, PartitionedGSet could store inodes without their parent inodes set; 
all such inodes would be stored in the 0th partition. However, we decided to 
use a temporary LightweightGSet (inodeMapTemp) to store these inodes, to make 
this case more transparent.

 

  was:Add support to save Inodes stored in PartitionedGSet when saving an FS 
image and load Inodes into PartitionedGSet from a saved FS image.


> [FGL] Add support for saving/loading an FS Image for PartitionedGSet
> 
>
> Key: HDFS-16128
> URL: https://issues.apache.org/jira/browse/HDFS-16128
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs, namenode
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
>  Labels: pull-request-available
>
> Add support to save Inodes stored in PartitionedGSet when saving an FS image 
> and load Inodes into PartitionedGSet from a saved FS image.
> h1. Saving FSImage
> *Original HDFS design*: iterate every inode in inodeMap and save them into 
> the FSImage file. 
> *FGL*: no change is needed here, since PartitionedGSet also provides an 
> iterator interface, to iterate over inodes stored in partitions. 
> h1. Loading an HDFS 
> *Original HDFS design*: it first loads the FSImage files and then loads edit 
> logs for recent changes. FSImage files contain different sections, including 
> INodeSections and INodeDirectorySections. An InodeSection contains serialized 
> Inodes objects and the INodeDirectorySection contains the parent inode for an 
> Inode. When loading an FSImage, the system first loads the INodeSections and 
> then loads the INodeDirectorySections to set the parent inode for each inode. 
> After the FSImage files are loaded, the edit logs are loaded. The edit log 
> contains recent changes to the filesystem, including inode creation and 
> deletion. For a newly created INode, the parent inode is set before it is 
> added to the inodeMap.
> *FGL*: when adding an Inode to the PartitionedGSet, we need the parent inode 
> of that Inode in order to determine which partition to store it in, when 
> NAMESPACE_KEY_DEPTH = 2. Thus, in FGL, when loading FSImage files, we used a 
> temporary LightweightGSet (inodeMapTemp) to store inodes. Once LoadFSImage is 
> done, the parent inode of every inode from the FSImage files is set, and we 
> can move the inodes into the PartitionedGSet. Loading edit logs works as 
> usual, since the parent inode of an inode is set before it is added to the 
> inodeMap.
> In theory, PartitionedGSet could store inodes without their parent inodes 
> set; all such inodes would be stored in the 0th partition. However, we 
> decided to use a temporary LightweightGSet (inodeMapTemp) to store these 
> inodes, to make this case more transparent.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: 

[jira] [Updated] (HDFS-16128) [FGL] Add support for saving/loading an FS Image for PartitionedGSet

2021-07-16 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-16128:

Labels: pull-request-available  (was: )

> [FGL] Add support for saving/loading an FS Image for PartitionedGSet
> 
>
> Key: HDFS-16128
> URL: https://issues.apache.org/jira/browse/HDFS-16128
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs, namenode
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
>  Labels: pull-request-available
>
> Add support to save Inodes stored in PartitionedGSet when saving an FS image 
> and load Inodes into PartitionedGSet from a saved FS image.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16128) [FGL] Add support for saving/loading an FS Image for PartitionedGSet

2021-07-14 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-16128:

Summary: [FGL] Add support for saving/loading an FS Image for 
PartitionedGSet  (was: Add support for saving/loading an FS Image for 
PartitionedGSet)

> [FGL] Add support for saving/loading an FS Image for PartitionedGSet
> 
>
> Key: HDFS-16128
> URL: https://issues.apache.org/jira/browse/HDFS-16128
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs, namenode
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
>
> Add support to save Inodes stored in PartitionedGSet when saving an FS image 
> and load Inodes into PartitionedGSet from a saved FS image.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16128) Add support for saving/loading an FS Image for PartitionedGSet

2021-07-14 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-16128:

Parent: HDFS-14703
Issue Type: Sub-task  (was: Improvement)

> Add support for saving/loading an FS Image for PartitionedGSet
> --
>
> Key: HDFS-16128
> URL: https://issues.apache.org/jira/browse/HDFS-16128
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs, namenode
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
>
> Add support to save Inodes stored in PartitionedGSet when saving an FS image 
> and load Inodes into PartitionedGSet from a saved FS image.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16128) Add support for saving/loading an FS Image

2021-07-13 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-16128:

Description: Add support to save Inodes stored in PartitionedGSet when 
saving an FS image and load Inodes into PartitionedGSet from a saved FS image.  
(was: We target to enable fine-grained locking by splitting the in-memory 
namespace into multiple partitions each having a separate lock. Intended to 
improve performance of NameNode write operations.)

> Add support for saving/loading an FS Image
> --
>
> Key: HDFS-16128
> URL: https://issues.apache.org/jira/browse/HDFS-16128
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Xing Lin
>Priority: Major
>
> Add support to save Inodes stored in PartitionedGSet when saving an FS image 
> and load Inodes into PartitionedGSet from a saved FS image.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16128) Add support for saving/loading an FS Image for PartitionedGSet

2021-07-13 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-16128:

Summary: Add support for saving/loading an FS Image for PartitionedGSet  
(was: Add support for saving/loading an FS Image)

> Add support for saving/loading an FS Image for PartitionedGSet
> --
>
> Key: HDFS-16128
> URL: https://issues.apache.org/jira/browse/HDFS-16128
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Xing Lin
>Priority: Major
>
> Add support to save Inodes stored in PartitionedGSet when saving an FS image 
> and load Inodes into PartitionedGSet from a saved FS image.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16128) Add support for saving/loading an FS Image for PartitionedGSet

2021-07-13 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin reassigned HDFS-16128:
---

Assignee: Xing Lin

> Add support for saving/loading an FS Image for PartitionedGSet
> --
>
> Key: HDFS-16128
> URL: https://issues.apache.org/jira/browse/HDFS-16128
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
>
> Add support to save Inodes stored in PartitionedGSet when saving an FS image 
> and load Inodes into PartitionedGSet from a saved FS image.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16128) Add support for saving/loading an FS Image

2021-07-13 Thread Xing Lin (Jira)
Xing Lin created HDFS-16128:
---

 Summary: Add support for saving/loading an FS Image
 Key: HDFS-16128
 URL: https://issues.apache.org/jira/browse/HDFS-16128
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs, namenode
Reporter: Xing Lin


We target to enable fine-grained locking by splitting the in-memory namespace 
into multiple partitions each having a separate lock. Intended to improve 
performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16125) fix the iterator for PartitionedGSet

2021-07-12 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-16125:

Summary: fix the iterator for PartitionedGSet   (was: iterator for 
PartitionedGSet would visit the first partition twice)

> fix the iterator for PartitionedGSet 
> -
>
> Key: HDFS-16125
> URL: https://issues.apache.org/jira/browse/HDFS-16125
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs, namenode
>Reporter: Xing Lin
>Priority: Minor
>
> Iterator in PartitionedGSet would visit the first partition twice, since we 
> did not set the keyIterator to move to the first key during initialization.  
>  
> This is related to fgl: https://issues.apache.org/jira/browse/HDFS-14703



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16125) iterator for PartitionedGSet would visit the first partition twice

2021-07-12 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-16125:

Description: 
Iterator in PartitionedGSet would visit the first partition twice, since we did 
not set the keyIterator to move to the first key during initialization.  

 

This is related to fgl: https://issues.apache.org/jira/browse/HDFS-14703

  was:Iterator in PartitionedGSet would visit the first partition twice, since 
we did not set the keyIterator to move to the first key during initialization.  


> iterator for PartitionedGSet would visit the first partition twice
> --
>
> Key: HDFS-16125
> URL: https://issues.apache.org/jira/browse/HDFS-16125
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs, namenode
>Reporter: Xing Lin
>Priority: Minor
>
> Iterator in PartitionedGSet would visit the first partition twice, since we 
> did not set the keyIterator to move to the first key during initialization.  
>  
> This is related to fgl: https://issues.apache.org/jira/browse/HDFS-14703



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16125) iterator for PartitionedGSet would visit the first partition twice

2021-07-12 Thread Xing Lin (Jira)
Xing Lin created HDFS-16125:
---

 Summary: iterator for PartitionedGSet would visit the first 
partition twice
 Key: HDFS-16125
 URL: https://issues.apache.org/jira/browse/HDFS-16125
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs, namenode
Reporter: Xing Lin


Iterator in PartitionedGSet would visit the first partition twice, since we did 
not set the keyIterator to move to the first key during initialization.  
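
The bug pattern looks roughly like this (a simplified stand-in for the real 
iterator; the class, fields, and partition representation are made up for 
illustration). The fix is to advance the key iterator past the first partition 
while taking its element iterator in the constructor, so that partition is not 
handed out a second time once it is exhausted:
{code:java}
// Simplified stand-in for an iterator over a partitioned set, NOT the real
// PartitionedGSet code. Each partition is just a List here.
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class PartitionIteratorSketch<E> implements Iterator<E> {
  private final Iterator<List<E>> keyIterator;  // iterates over partitions
  private Iterator<E> current;                  // iterates within one partition

  PartitionIteratorSketch(List<List<E>> partitions) {
    this.keyIterator = partitions.iterator();
    // Take the first partition AND advance keyIterator past it. If keyIterator
    // were left at the beginning, the first refill of 'current' below would
    // return partition 0 again, visiting it twice.
    this.current = keyIterator.hasNext()
        ? keyIterator.next().iterator()
        : Collections.<E>emptyIterator();
  }

  @Override
  public boolean hasNext() {
    while (!current.hasNext() && keyIterator.hasNext()) {
      current = keyIterator.next().iterator();
    }
    return current.hasNext();
  }

  @Override
  public E next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.next();
  }
}
{code}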



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-06-08 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359651#comment-17359651
 ] 

Xing Lin commented on HDFS-14703:
-

Hi [~prasad-acit], that is awesome! Konstantin is on vacation this week and 
next week. I am sure he will be very happy to review your pull request for the 
Create API. 

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, 
> 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz, 
> NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-05-16 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345870#comment-17345870
 ] 

Xing Lin edited comment on HDFS-14703 at 5/17/21, 4:42 AM:
---

I did some performance benchmarks using a physical server (a d430 server in 
[Utah Emulab testbed|http://www.emulab.net]). I used either RAMDISK or SSD, as 
the storage for HDFS. By using RAMDISK, we can remove the time used by the SSD 
to make each write persistent. For the RAM case, we observed an improvement of 
45% from fine-grained locking. For the SSD case, fine-grained locking gives us 
20% improvement.  We used an Intel SSD (model: SSDSC2BX200G4R).  

We noticed that for trunk, the mkdirs OPS is lower for RAMDISK than for SSD. We 
don't know the reason for this yet. We repeated the RAMDISK experiment for 
trunk twice to confirm the performance number.
h1. tmpfs, hadoop-tmp-dir = /run/hadoop-utos
h1. 45% improvements fgl vs. trunk
h2. trunk 

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
663510

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
15071.362

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Average Time: 13

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: — mkdirs stats  —

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
710248

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
14079.5

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Average Time: 14

2021-05-16 22:15:13,515 INFO namenode.FSEditLog: Ending log segment 8345565, 
10019540

fgl

2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: — mkdirs stats  —

2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
445980

2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
22422.530

2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Average Time: 8
h1. SSD, hadoop.tmp.dir=/dev/sda4
h1. 23% improvement fgl vs. trunk

trunk:

2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: — mkdirs stats  —

2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
593839

2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
16839.581

2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Average Time: 11

 

fgl

2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: — mkdirs stats  —

2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
481269

2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
20778.400

2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Average Time: 9

 

/dev/sda:

ATA device, with non-removable media

Model Number:       INTEL SSDSC2BX200G4R

Serial Number:      BTHC523202RD200TGN

Firmware Revision:  G201DL2D


was (Author: xinglin):
I did some performance benchmarks using a physical server (a d430 server in 
[Utah Emulab testbed|www.emulab.net]). I used either RAMDISK or SSD, as the 
storage for HDFS. By using RAMDISK, we can remove the time used by the SSD to 
make each write persistent. For the RAM case, we observed an improvement of 45% 
from fine-grained locking. For the SSD case, fine-grained locking gives us 20% 
improvement.  We used an Intel SSD (model: SSDSC2BX200G4R).  

We noticed for trunk, the mkdir OPS is lower for the RAMDISK than SSD. We don't 
know the reason for this yet. We repeated the experiment for RAMDISK for trunk 
twice to confirm the performance number.
h1. tmpfs, hadoop-tmp-dir = /run/hadoop-utos
h1. 45% improvements fgl vs. trunk
h2. trunk 

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
663510

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
15071.362

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Average Time: 13

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: — mkdirs stats  —

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
710248

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
14079.5

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Average Time: 14

2021-05-16 22:15:13,515 INFO namenode.FSEditLog: Ending log segment 8345565, 
10019540

fgl


[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-05-16 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345870#comment-17345870
 ] 

Xing Lin edited comment on HDFS-14703 at 5/17/21, 4:41 AM:
---

I did some performance benchmarks using a physical server (a d430 server in 
[Utah Emulab testbed|www.emulab.net]). I used either RAMDISK or SSD, as the 
storage for HDFS. By using RAMDISK, we can remove the time used by the SSD to 
make each write persistent. For the RAM case, we observed an improvement of 45% 
from fine-grained locking. For the SSD case, fine-grained locking gives us 20% 
improvement.  We used an Intel SSD (model: SSDSC2BX200G4R).  

We noticed that for trunk, the mkdirs OPS is lower for RAMDISK than for SSD. We 
don't know the reason for this yet. We repeated the RAMDISK experiment for 
trunk twice to confirm the performance number.
h1. tmpfs, hadoop-tmp-dir = /run/hadoop-utos
h1. 45% improvements fgl vs. trunk
h2. trunk 

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
663510

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
15071.362

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Average Time: 13

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: — mkdirs stats  —

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
710248

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
14079.5

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Average Time: 14

2021-05-16 22:15:13,515 INFO namenode.FSEditLog: Ending log segment 8345565, 
10019540

fgl

2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: — mkdirs stats  —

2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
445980

2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
22422.530

2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Average Time: 8
h1. SSD, hadoop.tmp.dir=/dev/sda4
h1. 23% improvement fgl vs. trunk

trunk:

2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: — mkdirs stats  —

2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
593839

2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
16839.581

2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Average Time: 11

 

fgl

2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: — mkdirs stats  —

2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
481269

2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
20778.400

2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Average Time: 9

 

/dev/sda:

ATA device, with non-removable media

Model Number:       INTEL SSDSC2BX200G4R

Serial Number:      BTHC523202RD200TGN

Firmware Revision:  G201DL2D


was (Author: xinglin):
I did some performance benchmarks using a physical server (a d430 server in 
[utah Emulab testbed|[www.emulab.net].) |http://www.emulab.net].%29/]I used 
either RAMDISK or SSD, as the storage for HDFS. By using RAMDISK, we can remove 
the time used by the SSD to make each write persistent. For the RAM case, we 
observed an improvement of 45% from fine-grained locking. For the SSD case, 
fine-grained locking gives us 20% improvement.  We used an Intel SSD (model: 
SSDSC2BX200G4R).  

We noticed for trunk, the mkdir OPS is lower for the RAMDISK than SSD. We don't 
know the reason for this yet. We repeated the experiment for RAMDISK for trunk 
twice to confirm the performance number.
h1. tmpfs, hadoop-tmp-dir = /run/hadoop-utos
h1. 45% improvements fgl vs. trunk
h2. trunk 

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
663510

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
15071.362

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Average Time: 13

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: --- mkdirs stats  
---

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
710248

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
14079.5

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Average Time: 14

2021-05-16 22:15:13,515 INFO namenode.FSEditLog: Ending log segment 

[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-05-16 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345870#comment-17345870
 ] 

Xing Lin commented on HDFS-14703:
-

I did some performance benchmarks using a physical server (a d430 server in 
[Utah Emulab testbed|http://www.emulab.net]). I used 
either RAMDISK or SSD, as the storage for HDFS. By using RAMDISK, we can remove 
the time used by the SSD to make each write persistent. For the RAM case, we 
observed an improvement of 45% from fine-grained locking. For the SSD case, 
fine-grained locking gives us 20% improvement.  We used an Intel SSD (model: 
SSDSC2BX200G4R).  

We noticed that for trunk, the mkdirs OPS is lower for RAMDISK than for SSD. We 
don't know the reason for this yet. We repeated the RAMDISK experiment for 
trunk twice to confirm the performance number.
h1. tmpfs, hadoop-tmp-dir = /run/hadoop-utos
h1. 45% improvements fgl vs. trunk
h2. trunk 

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
663510

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
15071.362

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Average Time: 13

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: --- mkdirs stats  
---

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
710248

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
14079.5

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Average Time: 14

2021-05-16 22:15:13,515 INFO namenode.FSEditLog: Ending log segment 8345565, 
10019540





fgl

2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: --- mkdirs stats  
---

2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
445980

2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
22422.530

2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Average Time: 8




h1. SSD, hadoop.tmp.dir=/dev/sda4
h1. 23% improvement fgl vs. trunk

trunk:

2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: --- mkdirs stats  
---

2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
593839

2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
16839.581

2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Average Time: 11

 

fgl

2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: --- mkdirs stats  
---

2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: # operations: 
1000

2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Elapsed Time: 
481269

2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark:  Ops per sec: 
20778.400

2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Average Time: 9

 

/dev/sda:

ATA device, with non-removable media

Model Number:       INTEL SSDSC2BX200G4R

Serial Number:      BTHC523202RD200TGN

Firmware Revision:  G201DL2D

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, 
> 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz, 
> NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-05-15 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345069#comment-17345069
 ] 

Xing Lin edited comment on HDFS-14703 at 5/15/21, 3:55 PM:
---

[~prasad-acit] try this command: use -fs file:///, instead of 
hdfs://server:port. "-fs file:///" will bypass the RPC layer and should give 
you higher numbers at your VM. I use the default partition size of 256. 

dir: /home/xinglin/projs/hadoop/hadoop-dist/target/hadoop-3.4.0-SNAPSHOT

$ ./bin/hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark 
-fs file:/// -op mkdirs -threads 200 -dirs 1000 -dirsPerDir 512


was (Author: xinglin):
[~prasad-acit] try this command: use -fs [file:///], instead of 
hdfs://server:port. "-fs [file:///]" will bypass the RPC layer and should give 
you higher numbers at your VM. I use the default partition size of 256. 

dir: /home/xinglin/projs/hadoop/hadoop-dist/target/hadoop-3.4.0-SNAPSHOT

$ ./bin/hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark 
*-fs [file:///*] -op mkdirs -threads 200 -dirs 1000 -dirsPerDir 512

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, 
> 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz, 
> NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-05-15 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345069#comment-17345069
 ] 

Xing Lin edited comment on HDFS-14703 at 5/15/21, 3:53 PM:
---

[~prasad-acit] try this command: use -fs file:///, instead of 
hdfs://server:port. "-fs file:///" will bypass the RPC layer and should give 
you higher numbers at your VM. I use the default partition size of 256. 

dir: /home/xinglin/projs/hadoop/hadoop-dist/target/hadoop-3.4.0-SNAPSHOT

$ ./bin/hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark 
-fs file:/// -op mkdirs -threads 200 -dirs 1000 -dirsPerDir 512


was (Author: xinglin):
[~prasad-acit] try this command: use -fs file:///, instead of 
hdfs://server:port. "-fs file:///" will bypass the RPC layer and should give 
you higher numbers at your VM. 

dir: /home/xinglin/projs/hadoop/hadoop-dist/target/hadoop-3.4.0-SNAPSHOT

$ ./bin/hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark 
*-fs file:///* -op mkdirs -threads 200 -dirs 1000 -dirsPerDir 512

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, 
> 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz, 
> NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


