[jira] [Commented] (HDFS-17366) NameNode Fine-Grained Locking via Namespace Tree
[ https://issues.apache.org/jira/browse/HDFS-17366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813808#comment-17813808 ] Xing Lin commented on HDFS-17366: - Love to see this feature. > NameNode Fine-Grained Locking via Namespace Tree > > > Key: HDFS-17366 > URL: https://issues.apache.org/jira/browse/HDFS-17366 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, namenode >Reporter: ZanderXu >Assignee: ZanderXu >Priority: Major > > As we all know, the write performance of the NameNode is limited by its global > lock. We aim to enable fine-grained locking based on the namespace tree to > improve the performance of NameNode write operations. > There are multiple motivations for creating this ticket: > * We have implemented this fine-grained locking and achieved a nearly 7x > performance improvement in our production environment. > * Other companies have made similar improvements on their internal branches. > Those branches diverge significantly from the community codebase, so there has > been little feedback or discussion in the community. > * The topic of fine-grained locking has been discussed for a very long time, > but has not yet produced any results. > > We implemented this fine-grained locking based on the namespace tree to > maximize concurrency for disjoint or independent operations. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
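The namespace-tree locking idea above can be sketched as follows. This is a minimal illustration, not the HDFS-17366 patch: the class and method names are hypothetical. The key point is that a write takes read locks on every ancestor directory and a write lock only on the target, in a fixed root-to-leaf order (which avoids deadlock), so writes under disjoint subtrees such as /a/x and /b/y never contend on the same write lock.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of fine-grained namespace-tree locking (not the
// HDFS-17366 implementation): read-lock every ancestor directory,
// write-lock only the final path component.
public class NamespaceTreeLock {
    private final ConcurrentHashMap<String, ReentrantReadWriteLock> locks =
            new ConcurrentHashMap<>();

    private ReentrantReadWriteLock lockFor(String path) {
        return locks.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
    }

    // Acquire locks root-to-leaf; returns the unlock actions, leaf first.
    public Deque<Runnable> lockForWrite(String path) {
        Deque<Runnable> undo = new ArrayDeque<>();
        String[] parts = path.split("/");
        StringBuilder prefix = new StringBuilder();
        for (int i = 1; i < parts.length; i++) {
            prefix.append('/').append(parts[i]);
            ReentrantReadWriteLock l = lockFor(prefix.toString());
            if (i < parts.length - 1) {
                l.readLock().lock();                 // ancestors: shared
                undo.addFirst(l.readLock()::unlock);
            } else {
                l.writeLock().lock();                // target: exclusive
                undo.addFirst(l.writeLock()::unlock);
            }
        }
        return undo;
    }

    // Release leaf-to-root, the reverse of acquisition order.
    public static void unlock(Deque<Runnable> undo) {
        undo.forEach(Runnable::run);
    }

    public boolean isWriteLocked(String path) {
        ReentrantReadWriteLock l = locks.get(path);
        return l != null && l.isWriteLocked();
    }
}
```

With this shape, a thread writing under /a/x and another writing under /b/y hold write locks on different nodes and only shared read locks on the root, so they proceed concurrently; only operations on overlapping paths serialize.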
[jira] [Commented] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException
[ https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806170#comment-17806170 ] Xing Lin commented on HDFS-17332: - In this jira, we only addressed the above issue for DFSInputStream. We did not work on DFSStripedInputStream for the erasure-coding read path, since erasure coding is not used at LinkedIn. If you need the same fix for DFSStripedInputStream, please enhance DFSStripedInputStream#fetchBlockByteRange(). > DFSInputStream: avoid logging stacktrace until when we really need to fail a > read request with a MissingBlockException > -- > > Key: HDFS-17332 > URL: https://issues.apache.org/jira/browse/HDFS-17332 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Reporter: Xing Lin >Assignee: Xing Lin >Priority: Minor > > In DFSInputStream#actualGetFromOneDataNode(), it would send the exception > stacktrace to the dfsClient.LOG whenever we fail on a DN. However, in most > cases, the read request will be served successfully by reading from the next > available DN. The presence of the exception stacktrace in the log has caused > multiple Hadoop users at LinkedIn to consider this WARN message as the > root cause/fatal error for their jobs. We would like to improve the log message > and avoid sending the stacktrace to dfsClient.LOG when a read succeeds. The > stacktrace from reading each DN is sent to the log only when we really need > to fail a read request (when chooseDataNode()/refetchLocations() throws a > BlockMissingException). > > Example stack trace > {code:java} > [12]:23/11/30 23:01:33 WARN hdfs.DFSClient: Connection failure: > Failed to connect to 10.150.91.13/10.150.91.13:71 for file > //part--95b9909c-zzz-c000.avro for block > BP-364971551-DatanodeIP-1448516588954:blk__129864739321:java.net.SocketTimeoutException: > 6 millis timeout while waiting for channel to be ready for read. 
ch : > java.nio.channels.SocketChannel[connected local=/ip:40492 > remote=datanodeIP:71] [12]:java.net.SocketTimeoutException: 6 > millis timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/localIp:40492 > remote=datanodeIP:71] [12]: at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > [12]: at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > [12]: at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > [12]: at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) > [12]: at java.io.FilterInputStream.read(FilterInputStream.java:83) > [12]: at > org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:458) > [12]: at > org.apache.hadoop.hdfs.client.impl.BlockReaderRemote2.newBlockReader(BlockReaderRemote2.java:412) > [12]: at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:864) > [12]: at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753) > [12]: at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:387) > [12]: at > org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:736) > [12]: at > org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1268) > [12]: at > org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1216) > [12]: at > org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1608) > [12]: at > org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1568) > [12]: at > org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) > [12]: at > hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.lambda$read$0(InstrumentedFSDataInputStream.java:108) > [12]: at > 
com.linkedin.hadoop.metrics.fs.PerformanceTrackingFSDataInputStream.process(PerformanceTrackingFSDataInputStream.java:39) > [12]: at > hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.read(InstrumentedFSDataInputStream.java:108) > [12]: at > org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) > [12]: at > org.apache.hadoop.fs.RetryingInputStream.lambda$read$2(RetryingInputStream.java:153) > [12]: at > org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36) > [12]: at > org.apache.hadoop.fs.RetryingInputStream.read(RetryingInputStream.java:149) > [12]: at > org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93){code}
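The logging policy described in HDFS-17332 can be sketched as follows. This is a hypothetical illustration of the pattern, not the actual DFSInputStream code: per-DataNode failures are remembered quietly (a one-line WARN without a stack trace would suffice at that point), and the collected stack traces surface only when every replica has failed and the read must really be failed.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of "log the stack trace only on final failure":
// each Supplier stands in for a read attempt against one DataNode.
public class QuietFailoverRead {
    public static byte[] read(List<Supplier<byte[]>> replicas) throws IOException {
        List<RuntimeException> quiet = new ArrayList<>();
        for (Supplier<byte[]> replica : replicas) {
            try {
                return replica.get();   // success: no stack trace is ever logged
            } catch (RuntimeException e) {
                quiet.add(e);           // remember the failure, but stay quiet
            }
        }
        // Only now, when the read really fails, attach the full stack traces.
        IOException fatal =
                new IOException("read failed on all " + replicas.size() + " replicas");
        quiet.forEach(fatal::addSuppressed);
        throw fatal;
    }
}
```

When the first replica times out but the second succeeds, the caller gets its data and the log stays clean; when all replicas fail, the thrown exception carries every per-replica failure as a suppressed exception, mirroring the "stacktrace only with the BlockMissingException" behavior described above.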
[jira] [Updated] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException
[ https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17332: Description: In DFSInputStream#actualGetFromOneDataNode(), it would send the exception stacktrace to the dfsClient.LOG whenever we fail on a DN. However, in most cases, the read request will be served successfully by reading from the next available DN. The presence of the exception stacktrace in the log has caused multiple Hadoop users at LinkedIn to consider this WARN message as the root cause/fatal error for their jobs. We would like to improve the log message and avoid sending the stacktrace to dfsClient.LOG when a read succeeds. The stacktrace from reading each DN is sent to the log only when we really need to fail a read request (when chooseDataNode()/refetchLocations() throws a BlockMissingException). Example stack trace {code:java} [12]:23/11/30 23:01:33 WARN hdfs.DFSClient: Connection failure: Failed to connect to 10.150.91.13/10.150.91.13:71 for file //part--95b9909c-zzz-c000.avro for block BP-364971551-DatanodeIP-1448516588954:blk__129864739321:java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/ip:40492 remote=datanodeIP:71] [12]:java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. 
ch : java.nio.channels.SocketChannel[connected local=/localIp:40492 remote=datanodeIP:71] [12]: at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) [12]: at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) [12]: at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) [12]: at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) [12]: at java.io.FilterInputStream.read(FilterInputStream.java:83) [12]: at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:458) [12]: at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote2.newBlockReader(BlockReaderRemote2.java:412) [12]: at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:864) [12]: at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753) [12]: at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:387) [12]: at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:736) [12]: at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1268) [12]: at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1216) [12]: at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1608) [12]: at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1568) [12]: at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) [12]: at hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.lambda$read$0(InstrumentedFSDataInputStream.java:108) [12]: at com.linkedin.hadoop.metrics.fs.PerformanceTrackingFSDataInputStream.process(PerformanceTrackingFSDataInputStream.java:39) [12]: at hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.read(InstrumentedFSDataInputStream.java:108) [12]: at 
org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) [12]: at org.apache.hadoop.fs.RetryingInputStream.lambda$read$2(RetryingInputStream.java:153) [12]: at org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36) [12]: at org.apache.hadoop.fs.RetryingInputStream.read(RetryingInputStream.java:149) [12]: at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93){code} was: In DFSInputStream#actualGetFromOneDataNode(), it would send the exception stacktrace to the dfsClient.LOG whenever we fail on a DN. However, in most cases, the read request will be served successfully by reading from the next available DN. The existence of exception stacktrace in the log has caused multiple hadoop users at Linkedin to consider this WARN message as the RC/fatal error for their jobs. We would like to improve the log message and avoid sending the stacktrace to dfsClient.LOG when a read succeeds. The stackTrace when reading reach DN is sent to the log only when we really need to fail a read request (when chooseDataNode()/refetchLocations() throws a BlockMissingException). Example stack trace {code:java} [12]:23/11/30 23:01:33 WARN hdfs.DFSClient: Connection failure: Failed to connect to 10.150.91.13/10.150.91.13:71 for file /jobs/kgemb/holistic/dev/ywang11/pcv2/runs/2850541/artifacts/jobAction-train-importer/featurized_dataset/part-109247-95b9909c-b6ab-41aa-bb87-7e76f4aad35f-c000.avro for block
[jira] [Updated] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException
[ https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17332: Description: In DFSInputStream#actualGetFromOneDataNode(), it would send the exception stacktrace to the dfsClient.LOG whenever we fail on a DN. However, in most cases, the read request will be served successfully by reading from the next available DN. The presence of the exception stacktrace in the log has caused multiple Hadoop users at LinkedIn to consider this WARN message as the root cause/fatal error for their jobs. We would like to improve the log message and avoid sending the stacktrace to dfsClient.LOG when a read succeeds. The stacktrace from reading each DN is sent to the log only when we really need to fail a read request (when chooseDataNode()/refetchLocations() throws a BlockMissingException). Example stack trace {code:java} [12]:23/11/30 23:01:33 WARN hdfs.DFSClient: Connection failure: Failed to connect to 10.150.91.13/10.150.91.13:71 for file /jobs/kgemb/holistic/dev/ywang11/pcv2/runs/2850541/artifacts/jobAction-train-importer/featurized_dataset/part-109247-95b9909c-b6ab-41aa-bb87-7e76f4aad35f-c000.avro for block BP-364971551-10.150.4.19-1448516588954:blk_130854761734_129864739321:java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/100.101.37.108:40492 remote=10.150.91.13/10.150.91.13:71] [12]:java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. 
ch : java.nio.channels.SocketChannel[connected local=/100.101.37.108:40492 remote=10.150.91.13/10.150.91.13:71] [12]: at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) [12]: at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) [12]: at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) [12]: at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) [12]: at java.io.FilterInputStream.read(FilterInputStream.java:83) [12]: at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:458) [12]: at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote2.newBlockReader(BlockReaderRemote2.java:412) [12]: at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:864) [12]: at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753) [12]: at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:387) [12]: at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:736) [12]: at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1268) [12]: at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1216) [12]: at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1608) [12]: at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1568) [12]: at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) [12]: at hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.lambda$read$0(InstrumentedFSDataInputStream.java:108) [12]: at com.linkedin.hadoop.metrics.fs.PerformanceTrackingFSDataInputStream.process(PerformanceTrackingFSDataInputStream.java:39) [12]: at 
hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.read(InstrumentedFSDataInputStream.java:108) [12]: at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) [12]: at org.apache.hadoop.fs.RetryingInputStream.lambda$read$2(RetryingInputStream.java:153) [12]: at org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36) [12]: at org.apache.hadoop.fs.RetryingInputStream.read(RetryingInputStream.java:149) [12]: at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93){code} was:In DFSInputStream#actualGetFromOneDataNode(), it would send the exception stacktrace to the dfsClient.LOG whenever we fail on a DN. However, in most cases, the read request will be served successfully by reading from the next available DN. The existence of exception stacktrace in the log has caused multiple hadoop users at Linkedin to consider this WARN message as the RC/fatal error for their jobs. We would like to improve the log message and avoid sending the stacktrace to dfsClient.LOG when a read succeeds. The stackTrace when reading reach DN is sent to the log only when we really need to fail a read request (when chooseDataNode()/refetchLocations() throws a BlockMissingException). > DFSInputStream: avoid logging stacktrace until when we really need to fail a > read request with a MissingBlockException >
[jira] [Assigned] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException
[ https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin reassigned HDFS-17332: --- Assignee: Xing Lin > DFSInputStream: avoid logging stacktrace until when we really need to fail a > read request with a MissingBlockException > -- > > Key: HDFS-17332 > URL: https://issues.apache.org/jira/browse/HDFS-17332 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Reporter: Xing Lin >Assignee: Xing Lin >Priority: Minor > > In DFSInputStream#actualGetFromOneDataNode(), it would send the exception > stacktrace to the dfsClient.LOG whenever we fail on a DN. However, in most > cases, the read request will be served successfully by reading from the next > available DN. The presence of the exception stacktrace in the log has caused > multiple Hadoop users at LinkedIn to consider this WARN message as the > root cause/fatal error for their jobs. We would like to improve the log message > and avoid sending the stacktrace to dfsClient.LOG when a read succeeds. The > stacktrace from reading each DN is sent to the log only when we really need > to fail a read request (when chooseDataNode()/refetchLocations() throws a > BlockMissingException).
[jira] [Updated] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException
[ https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17332: Environment: (was: In DFSInputStream#actualGetFromOneDataNode(), it would send the exception stacktrace to the dfsClient.LOG whenever we fail on a DN. However, in most cases, the read request will be served successfully by reading from the next available DN. The existence of exception stacktrace in the log has caused multiple hadoop users at Linkedin to consider this WARN message as the RC/fatal error for their jobs. We would like to improve the log message and avoid sending the stacktrace to dfsClient.LOG when a read succeeds. The stackTrace when reading reach DN is sent to the log only when we really need to fail a read request (when chooseDataNode()/refetchLocations() throws a BlockMissingException). ) > DFSInputStream: avoid logging stacktrace until when we really need to fail a > read request with a MissingBlockException > -- > > Key: HDFS-17332 > URL: https://issues.apache.org/jira/browse/HDFS-17332 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Reporter: Xing Lin >Priority: Minor > > In DFSInputStream#actualGetFromOneDataNode(), it would send the exception > stacktrace to the dfsClient.LOG whenever we fail on a DN. However, in most > cases, the read request will be served successfully by reading from the next > available DN. The existence of exception stacktrace in the log has caused > multiple hadoop users at Linkedin to consider this WARN message as the > RC/fatal error for their jobs. We would like to improve the log message and > avoid sending the stacktrace to dfsClient.LOG when a read succeeds. The > stackTrace when reading reach DN is sent to the log only when we really need > to fail a read request (when chooseDataNode()/refetchLocations() throws a > BlockMissingException). 
[jira] [Created] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException
Xing Lin created HDFS-17332: --- Summary: DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException Key: HDFS-17332 URL: https://issues.apache.org/jira/browse/HDFS-17332 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Environment: In DFSInputStream#actualGetFromOneDataNode(), it would send the exception stacktrace to the dfsClient.LOG whenever we fail on a DN. However, in most cases, the read request will be served successfully by reading from the next available DN. The presence of the exception stacktrace in the log has caused multiple Hadoop users at LinkedIn to consider this WARN message as the root cause/fatal error for their jobs. We would like to improve the log message and avoid sending the stacktrace to dfsClient.LOG when a read succeeds. The stacktrace from reading each DN is sent to the log only when we really need to fail a read request (when chooseDataNode()/refetchLocations() throws a BlockMissingException). Reporter: Xing Lin
[jira] [Updated] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException
[ https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17332: Description: In DFSInputStream#actualGetFromOneDataNode(), it would send the exception stacktrace to the dfsClient.LOG whenever we fail on a DN. However, in most cases, the read request will be served successfully by reading from the next available DN. The presence of the exception stacktrace in the log has caused multiple Hadoop users at LinkedIn to consider this WARN message as the root cause/fatal error for their jobs. We would like to improve the log message and avoid sending the stacktrace to dfsClient.LOG when a read succeeds. The stacktrace from reading each DN is sent to the log only when we really need to fail a read request (when chooseDataNode()/refetchLocations() throws a BlockMissingException). > DFSInputStream: avoid logging stacktrace until when we really need to fail a > read request with a MissingBlockException > -- > > Key: HDFS-17332 > URL: https://issues.apache.org/jira/browse/HDFS-17332 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs > Environment: In DFSInputStream#actualGetFromOneDataNode(), it would > send the exception stacktrace to the dfsClient.LOG whenever we fail on a DN. > However, in most cases, the read request will be served successfully by > reading from the next available DN. The presence of the exception stacktrace in > the log has caused multiple Hadoop users at LinkedIn to consider this WARN > message as the root cause/fatal error for their jobs. We would like to improve > the log message and avoid sending the stacktrace to dfsClient.LOG when a read > succeeds. The stacktrace from reading each DN is sent to the log only when > we really need to fail a read request (when > chooseDataNode()/refetchLocations() throws a BlockMissingException). 
>Reporter: Xing Lin >Priority: Minor > > In DFSInputStream#actualGetFromOneDataNode(), it would send the exception > stacktrace to the dfsClient.LOG whenever we fail on a DN. However, in most > cases, the read request will be served successfully by reading from the next > available DN. The presence of the exception stacktrace in the log has caused > multiple Hadoop users at LinkedIn to consider this WARN message as the > root cause/fatal error for their jobs. We would like to improve the log message > and avoid sending the stacktrace to dfsClient.LOG when a read succeeds. The > stacktrace from reading each DN is sent to the log only when we really need > to fail a read request (when chooseDataNode()/refetchLocations() throws a > BlockMissingException).
[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS
[ https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17286: Attachment: active.png observer.png > Add UDP as a transfer protocol for HDFS > --- > > Key: HDFS-17286 > URL: https://issues.apache.org/jira/browse/HDFS-17286 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Reporter: Xing Lin >Priority: Major > Attachments: active.png, observer.png > > > Right now, every connection in HDFS is based on RPC/IPC, which is in turn > based on TCP. Connections are re-used based on the ConnectionID, which includes > the RpcTimeout as part of the key identifying a connection. The consequence is > that using a different RPC timeout between two hosts creates a separate TCP > connection. > A use case which motivated us to consider UDP is getHAServiceState() in > ObserverReadProxyProvider. We'd like getHAServiceState() to time out with a > much smaller timeout threshold and move on to probe the next NameNode. To > support this, we used an ExecutorService and set a timeout for the task in > HDFS-17030. This implementation can be improved by using UDP to query the > HAServiceState. getHAServiceState() does not have to be very reliable, as we > can always fall back to the active. > Another motivation is that 5~10% of the RPC calls hitting our > active/observers are GetHAServiceState(). If we can move them off to a UDP > server, that can hopefully improve RPC latency.
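The reasoning above can be sketched with a toy UDP probe. This is hypothetical: HDFS has no such UDP endpoint today, and the wire message here ("GET_HA_STATE") is invented for illustration. The point is that a fresh datagram socket carries its own short timeout, so no TCP connection is created or cached per timeout value, and a lost probe simply falls back instead of failing the caller.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.SocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical best-effort UDP probe of a NameNode's HA state:
// short per-probe timeout, no connection state, fallback on any failure.
public class HAStateProbe {
    public static String probe(SocketAddress namenode, int timeoutMs, String fallback) {
        try (DatagramSocket sock = new DatagramSocket()) {
            sock.setSoTimeout(timeoutMs);            // small, per-probe timeout
            byte[] query = "GET_HA_STATE".getBytes(StandardCharsets.UTF_8);
            sock.send(new DatagramPacket(query, query.length, namenode));
            DatagramPacket resp = new DatagramPacket(new byte[64], 64);
            sock.receive(resp);                      // blocks at most timeoutMs
            return new String(resp.getData(), 0, resp.getLength(), StandardCharsets.UTF_8);
        } catch (IOException timeoutOrError) {
            return fallback;                         // probe is best-effort
        }
    }
}
```

Because the probe is stateless and lossy-by-design, a caller like ObserverReadProxyProvider could try each NameNode with a small timeoutMs and fall back to the active when a probe gets no answer, without ever creating an extra TCP connection keyed by the unusual timeout.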
[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS
[ https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17286: Attachment: (was: active.png)
[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS
[ https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17286: Attachment: (was: Observer.png)
[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS
[ https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17286: Attachment: Screenshot 2023-12-12 at 9.31.58 AM.png
[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS
[ https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17286: Attachment: (was: Screenshot 2023-12-12 at 9.32.15 AM.png)
[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS
[ https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17286: Attachment: Observer.png
[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS
[ https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17286: Attachment: active.png
[jira] [Created] (HDFS-17286) Add UDP as a transfer protocol for HDFS
Xing Lin created HDFS-17286: --- Summary: Add UDP as a transfer protocol for HDFS Key: HDFS-17286 URL: https://issues.apache.org/jira/browse/HDFS-17286 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Reporter: Xing Lin
[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS
[ https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17286: Attachment: Screenshot 2023-12-12 at 9.32.15 AM.png
[jira] [Updated] (HDFS-17286) Add UDP as a transfer protocol for HDFS
[ https://issues.apache.org/jira/browse/HDFS-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17286: Attachment: (was: Screenshot 2023-12-12 at 9.31.58 AM.png)
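The fast-but-unreliable HA-state probe described in HDFS-17286 can be sketched as follows. This is an illustrative sketch only: the single-datagram `STATE?` protocol, the class name, and the fallback behavior are invented for this example; Hadoop's actual IPC does not speak UDP.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: probe a NameNode's HA state over UDP with a short
 *  socket timeout, falling back to a default answer when no reply arrives.
 *  Because the probe is stateless and connectionless, the short timeout
 *  does not fragment any pooled TCP connections. */
public class UdpHaStateProbe {
  public static String probeState(InetSocketAddress nn, int timeoutMs, String fallback) {
    try (DatagramSocket socket = new DatagramSocket()) {
      socket.setSoTimeout(timeoutMs);  // fail fast; no pooled connection involved
      byte[] query = "STATE?".getBytes(StandardCharsets.UTF_8);
      socket.send(new DatagramPacket(query, query.length, nn));
      byte[] buf = new byte[64];
      DatagramPacket reply = new DatagramPacket(buf, buf.length);
      socket.receive(reply);           // throws SocketTimeoutException on timeout
      return new String(reply.getData(), 0, reply.getLength(), StandardCharsets.UTF_8);
    } catch (Exception e) {
      // getHAServiceState() need not be reliable: on loss or timeout,
      // fall back (e.g. assume active) and let the caller retry later.
      return fallback;
    }
  }
}
```

The design point is that the probe tolerates loss: a dropped datagram costs one short timeout, whereas today a short-timeout probe over the IPC layer either holds up an executor task or forces a dedicated TCP connection.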
[jira] [Updated] (HDFS-17281) Added support of reporting RPC round-trip time at NN.
[ https://issues.apache.org/jira/browse/HDFS-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17281: Description:
We have come across a few cases where HDFS clients report very bad latencies while we don't see similar trends at the NN side; the NN-side latency metrics look normal. I attached a screenshot taken during an internal investigation at LinkedIn: a token-management service was reporting an average latency of 1 second when fetching delegation tokens from our NN, yet at the NN side we did not see anything abnormal. The OverallRpcProcessingTime metric we added in HDFS-17042 was not sufficient to identify or signal such cases.
We propose to extend the IPC header in Hadoop to carry the call-creation time from the client to IPC servers, so that the server can compute the round-trip time of each RPC call.
*Why is OverallRpcProcessingTime not sufficient?*
OverallRpcProcessingTime captures the time from when the reader thread reads the call off the socket to when the response is sent back to the client. As a result, it does not capture the time it takes to transmit the call from client to server. Besides, only a couple of reader threads monitor a large number of open connections. When many connections become ready to read at the same time, a reader thread must read each call sequentially, adding wait time for many RPC calls. We have also hit the case where the callQueue becomes full (a total of 25600 requests) and reader threads block while adding new calls to the callQueue, increasing latency for every connection/call waiting to be read. Ideally, we would measure the time between when a call is ready to read and when the reader thread actually reads it, which would give us the call's wait time; however, we have not found a way to obtain this.
was: We have come across a few cases where the hdfs clients are reporting very bad latencies, we don't see similar trends at NN-side. Instead, from NN-side, the latency metrics seem normal as usual. I attached a screenshot which we took during an internal investigation at LinkedIn. What was happening is a token management service was reporting an average latency of 1 sec in fetching delegation tokens from our NN but at the NN-side, we did not see anything abnormal. In HDFS-17042, we added OverallRpcProcessingTime for each RPC method. This metric measured the time from when the reader thread reads in a call from the socket connection to when the response for the call is sent back to the client. This metric is supposed to be a reliable signal for what RPC latency hdfs clients are getting (overallProcessingTime + network transfer latency == client latency).
> Added support of reporting RPC round-trip time at NN.
> Key: HDFS-17281
> URL: https://issues.apache.org/jira/browse/HDFS-17281
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Major
> Attachments: Screenshot 2023-10-28 at 10.26.41 PM.png
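The proposal above can be illustrated with a minimal sketch. The helper names are hypothetical; the real change would live in Hadoop's IPC header and server-side metrics, and it assumes reasonably synchronized clocks between client and server.

```java
/** Sketch of the proposed idea (not Hadoop's actual IPC header): the client
 *  stamps each call with its wall-clock creation time, and the server
 *  derives the delay that server-side processing metrics cannot see:
 *  network transfer plus time spent waiting for a reader thread. */
public class RpcRoundTrip {
  /** Client side: record the creation time in the call header. */
  public static long stampCreateTime() {
    return System.currentTimeMillis();
  }

  /** Server side: elapsed time since the client created the call.
   *  Clamped at zero to absorb small clock skew between the hosts. */
  public static long elapsedSinceCreate(long clientCreateTimeMs, long serverNowMs) {
    return Math.max(0, serverNowMs - clientCreateTimeMs);
  }
}
```

The key design choice is that the round-trip measurement starts at call *creation*, so queueing in the client, the network, and the reader-thread backlog all show up in the metric, unlike OverallRpcProcessingTime.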
[jira] [Updated] (HDFS-17281) Added support of reporting RPC round-trip time at NN.
[ https://issues.apache.org/jira/browse/HDFS-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17281: Description: We have come across a few cases where the hdfs clients are reporting very bad latencies, we don't see similar trends at NN-side. Instead, from NN-side, the latency metrics seem normal as usual. I attached a screenshot which we took during an internal investigation at LinkedIn. What was happening is a token management service was reporting an average latency of 1 sec in fetching delegation tokens from our NN but at the NN-side, we did not see anything abnormal. In HDFS-17042, we added OverallRpcProcessingTime for each RPC method. This metric measured the time from when the reader thread reads in a call from the socket connection to when the response for the call is sent back to the client. This metric is supposed to be a reliable signal for what RPC latency hdfs clients are getting (overallProcessingTime + network transfer latency == client latency). was: We have come across a few cases where the hdfs clients are reporting very bad latencies, we don't see similar trends at NN-side. Instead, from NN-side, the latency metrics seem normal as usual. In HDFS-17042, we added OverallRpcProcessingTime for each RPC method. This metric measured the time from when the reader thread reads in a call from the socket connection to when the response for the call is sent back to the client. This metric is supposed to be a reliable signal for what RPC latency hdfs clients are getting (overallProcessingTime + network transfer latency == client latency).
> Added support of reporting RPC round-trip time at NN.
[jira] [Updated] (HDFS-17281) Added support of reporting RPC round-trip time at NN.
[ https://issues.apache.org/jira/browse/HDFS-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17281: Attachment: Screenshot 2023-10-28 at 10.26.41 PM.png
[jira] [Updated] (HDFS-17281) Added support of reporting RPC round-trip time at NN.
[ https://issues.apache.org/jira/browse/HDFS-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17281: Description: We have come across a few cases where the hdfs clients are reporting very bad latencies, we don't see similar trends at NN-side. Instead, from NN-side, the latency metrics seem normal as usual. In HDFS-17042, we added OverallRpcProcessingTime for each RPC method. This metric measured the time from when the reader thread reads in a call from the socket connection to when the response for the call is sent back to the client. This metric is supposed to be a reliable signal for what RPC latency hdfs clients are getting (overallProcessingTime + network transfer latency == client latency).
> Added support of reporting RPC round-trip time at NN.
[jira] [Created] (HDFS-17281) Added support of reporting RPC round-trip time at NN.
Xing Lin created HDFS-17281: --- Summary: Added support of reporting RPC round-trip time at NN. Key: HDFS-17281 URL: https://issues.apache.org/jira/browse/HDFS-17281 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Reporter: Xing Lin Assignee: Xing Lin
[jira] [Assigned] (HDFS-17262) Transfer rate metric warning log is too verbose
[ https://issues.apache.org/jira/browse/HDFS-17262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin reassigned HDFS-17262: --- Assignee: Xing Lin
> Transfer rate metric warning log is too verbose
> ---
> Key: HDFS-17262
> URL: https://issues.apache.org/jira/browse/HDFS-17262
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Bryan Beaudreault
> Assignee: Xing Lin
> Priority: Major
> Labels: pull-request-available
>
> HDFS-16917 added a LOG.warn when the passed duration is 0. The unit for duration is millis, and it's very possible for a read to take less than a millisecond over a local TCP connection. We are seeing this spam multiple times per millisecond. There's another report on the PR for HDFS-16917.
> Please downgrade to debug or remove the log.
[jira] [Commented] (HDFS-17262) Transfer rate metric warning log is too verbose
[ https://issues.apache.org/jira/browse/HDFS-17262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17788577#comment-17788577 ] Xing Lin commented on HDFS-17262: - [~rdingankar] is out until early Dec. I pushed out a PR on his behalf.
> Transfer rate metric warning log is too verbose
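A minimal sketch of the direction suggested by this report, assuming the goal is to avoid both a divide-by-zero and a per-read WARN for sub-millisecond durations. The helper name and the clamp-to-1-ms choice are illustrative, not the actual DataNode code:

```java
/** Hypothetical helper: compute a transfer rate while treating a
 *  sub-millisecond duration as 1 ms. Fast local reads legitimately
 *  complete in under a millisecond, so a 0 ms duration is expected
 *  and should not trigger a warning on every such read. */
public class TransferRate {
  /** Bytes per second; durations under 1 ms are clamped to 1 ms. */
  public static long bytesPerSecond(long bytes, long durationMs) {
    long d = Math.max(durationMs, 1);  // avoid divide-by-zero without logging
    return bytes * 1000 / d;
  }
}
```

Clamping slightly underestimates the rate of sub-millisecond reads, but it keeps the metric finite and removes the log spam; the alternative of logging at DEBUG keeps the diagnostic without the noise.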
[jira] [Commented] (HDFS-17231) HA: Safemode should exit when resources are from low to available
[ https://issues.apache.org/jira/browse/HDFS-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778222#comment-17778222 ] Xing Lin commented on HDFS-17231: - Hi [~kuper], Please assign this Jira to yourself since you have worked on it.
> HA: Safemode should exit when resources are from low to available
> -
> Key: HDFS-17231
> URL: https://issues.apache.org/jira/browse/HDFS-17231
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Affects Versions: 3.3.4, 3.3.6
> Reporter: kuper
> Priority: Major
> Labels: pull-request-available
> Attachments: 企业微信截图_75d15d37-26b7-4d88-ac0c-8d77e358761b.png
>
> The NameNodeResourceMonitor automatically enters safe mode when it detects that resources are insufficient. When zkfc detects insufficient resources, it triggers a failover. Consider the following scenario:
> * Initially, nn01 is active and nn02 is standby. Due to insufficient resources in dfs.namenode.name.dir, the NameNodeResourceMonitor detects the resource issue and puts nn01 into safemode. Subsequently, zkfc triggers a failover.
> * At this point, nn01 is in safemode (ON) and standby, while nn02 is in safemode (OFF) and active.
> * After a period of time, the resources in nn01's dfs.namenode.name.dir recover, causing a slight instability and triggering failover again.
> * Now, nn01 is in safemode (ON) and active, while nn02 is in safemode (OFF) and standby.
> * However, since nn01 is active but in safemode (ON), HDFS can be neither read from nor written to.
> !企业微信截图_75d15d37-26b7-4d88-ac0c-8d77e358761b.png!
> *reproduction*
> # Increase dfs.namenode.resource.du.reserved.
> # Increase ha.health-monitor.check-interval.ms to avoid switching directly to standby and stopping the NameNodeResourceMonitor thread; instead, wait for the NameNodeResourceMonitor to enter safe mode before the switch to standby.
> # On the active node nn01, use the dd command to create a file that exceeds the threshold, triggering a low-available-disk-space condition.
> # If the nn01 namenode process does not die, nn01 ends up in safemode (ON) and standby.
[jira] [Updated] (HDFS-17118) Fix minor checkstyle warnings in TestObserverReadProxyProvider
[ https://issues.apache.org/jira/browse/HDFS-17118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17118: Description: We noticed a few checkstyle warnings when backporting HDFS-17030 from trunk to branch-3.3. The yetus build was not stable at that time and we did not notice the newly added checkstyle warnings. PR for HDFS-17030 which has been merged into trunk: [https://github.com/apache/hadoop/pull/5700]
> Fix minor checkstyle warnings in TestObserverReadProxyProvider
[jira] [Updated] (HDFS-17118) Fix minor checkstyle warnings in TestObserverReadProxyProvider
[ https://issues.apache.org/jira/browse/HDFS-17118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17118: Description: We noticed a few checkstyle warnings when backporting HDFS-17030 from trunk to branch-3.3. The yetus build was not stable at that time and we did not notice the newly added checkstyle warnings. PR for HDFS-17030 which has been merged into trunk: [https://github.com/apache/hadoop/pull/5700] was: (same text, with the PR link previously described as "PR merged into trunk")
> Fix minor checkstyle warnings in TestObserverReadProxyProvider
[jira] [Created] (HDFS-17118) Fix minor checkstyle warnings in TestObserverReadProxyProvider
Xing Lin created HDFS-17118: --- Summary: Fix minor checkstyle warnings in TestObserverReadProxyProvider Key: HDFS-17118 URL: https://issues.apache.org/jira/browse/HDFS-17118 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Affects Versions: 3.4.0 Reporter: Xing Lin
[jira] [Updated] (HDFS-17118) Fix minor checkstyle warnings in TestObserverReadProxyProvider
[ https://issues.apache.org/jira/browse/HDFS-17118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17118: Description: We noticed a few checkstyle warnings when backporting HDFS-17030 from trunk to branch-3.3. The yetus build was not stable at that time and we did not notice the newly added checkstyle warnings. PR merged into trunk: [https://github.com/apache/hadoop/pull/5700] was: (same text, with the PR link previously given as "PR: https://github.com/apache/hadoop/pull/5700")
> Fix minor checkstyle warnings in TestObserverReadProxyProvider
[jira] [Assigned] (HDFS-17118) Fix minor checkstyle warnings in TestObserverReadProxyProvider
[ https://issues.apache.org/jira/browse/HDFS-17118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin reassigned HDFS-17118: --- Assignee: Xing Lin
> Fix minor checkstyle warnings in TestObserverReadProxyProvider
[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744751#comment-17744751 ] Xing Lin commented on HDFS-17093: - FYI, we set dfs.namenode.max.full.block.report.leases = 6, even though we are running clusters at 10k DNs per cluster.
> In the case of all datanodes sending FBR when the namenode restarts (large
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.3.4
> Reporter: Yanlei Yu
> Priority: Minor
> Labels: pull-request-available
> Attachments: HDFS-17093.patch
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that some datanodes did not report enough blocks, causing the namenode to stay in safe mode for a long time after restarting because of incomplete block reporting.
> I found in the logs of the datanode with incomplete block reporting that the first FBR attempt failed, possibly due to namenode stress, and then a second FBR attempt was made as follows:
> {code:java}
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x6237a52c1e817e, containing 12 storage report(s), of which we sent 1. The reports had 1099057 total blocks and used 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0x62382416f3f055, containing 12 storage report(s), of which we sent 12. The reports had 1099048 total blocks and used 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN processing. Got back no commands. {code}
> There's nothing wrong with that: retrying the send when it fails is expected. But consider the logic on the namenode side:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage is identified as having already reported, namely storageInfo.getBlockReportCount() > 0, the namenode removes the lease from the datanode, so the second report attempt fails because the datanode no longer holds a lease.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
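The lease limit mentioned in the comment above is a standard hdfs-site.xml property; a minimal sketch of setting it (6 is, to the best of my knowledge, also the shipped default, so the value shown only pins it explicitly — tune it for your cluster size):

```xml
<!-- hdfs-site.xml: maximum number of datanodes that may hold a full
     block report (FBR) lease, i.e. send an FBR, at the same time. -->
<property>
  <name>dfs.namenode.max.full.block.report.leases</name>
  <value>6</value>
</property>
```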
[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744749#comment-17744749 ] Xing Lin commented on HDFS-17093: - {quote}[~xinglin] ,I think you modify some more reasonable, datanode separate disk operation should be processed in the final set to perform blockReportLeaseManager. RemoveLease (node); return ! node.hasStaleStorages(); This is all at the datanode level {quote} Not sure I understand what you said here.
> In the case of all datanodes sending FBR when the namenode restarts (large
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.3.4
> Reporter: Yanlei Yu
> Priority: Minor
> Labels: pull-request-available
> Attachments: HDFS-17093.patch
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that some datanodes did not report enough blocks, causing the namenode to stay in safe mode for a long time after restarting because of incomplete block reporting.
> I found in the logs of the datanode with incomplete block reporting that the first FBR attempt failed, possibly due to namenode stress, and then a second FBR attempt was made as follows:
> {code:java}
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x6237a52c1e817e, containing 12 storage report(s), of which we sent 1. The reports had 1099057 total blocks and used 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0x62382416f3f055, containing 12 storage report(s), of which we sent 12. The reports had 1099048 total blocks and used 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN processing. Got back no commands. {code}
> There's nothing wrong with that: retrying the send when it fails is expected. But consider the logic on the namenode side:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage is identified as having already reported, namely storageInfo.getBlockReportCount() > 0, the namenode removes the lease from the datanode, so the second report attempt fails because the datanode no longer holds a lease.
[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744437#comment-17744437 ] Xing Lin commented on HDFS-17093: - Hi [~yuyanlei], Thanks for sharing! I don't fully understand how your PR is going to help. Without your PR, when NN receives the second FBR attempt from the same DN, the NN won't process this FBR and will remove the lease from that DN. So, that DN won't be able to send more FBRs. With your PR, though NN won't remove the lease from that DN until it receives all 12 reports, NN will NOT process these FBRs, right?
# DN wants to send 12 reports but only sent 1 report.
# NN processes 1 report (then _storageInfo.getBlockReportCount() > 0_ will be true)
# DN continues to send 12 reports to NN.
# NN will simply discard these reports, because _storageInfo.getBlockReportCount() > 0_
If the change is something like the following, then the change would make more sense to me.
{code:java}
if (namesystem.isInStartupSafeMode()
    && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
    && storageInfo.getBlockReportCount() > 0
+   && totalReportNum == currentReportNum) {
  blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
      + "discarded non-initial block report from {}"
      + " because namenode still in startup phase",
      strBlockReportId, fullBrLeaseId, nodeID);
  blockReportLeaseManager.removeLease(node);
  return !node.hasStaleStorages();
} {code}
> In the case of all datanodes sending FBR when the namenode restarts (large
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.3.4
> Reporter: Yanlei Yu
> Priority: Minor
> Attachments: HDFS-17093.patch
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that some datanodes did not report enough blocks, causing the namenode to stay in safe mode for a long time after restarting because of incomplete block reporting.
> I found in the logs of the datanode with incomplete block reporting that the first FBR attempt failed, possibly due to namenode stress, and then a second FBR attempt was made as follows:
> {code:java}
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x6237a52c1e817e, containing 12 storage report(s), of which we sent 1. The reports had 1099057 total blocks and used 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0x62382416f3f055, containing 12 storage report(s), of which we sent 12. The reports had 1099048 total blocks and used 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN processing. Got back no commands. {code}
> There's nothing wrong with that: retrying the send when it fails is expected. But consider the logic on the namenode side:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage is identified as having already reported, namely storageInfo.getBlockReportCount() > 0, the namenode removes the lease from the datanode, so the second report attempt fails because the datanode no longer holds a lease.
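To make the proposed guard in the comment above concrete, here is a minimal, self-contained sketch of the discard decision. The method and parameter names are illustrative, not the actual BlockManager code; `totalReportNum`/`currentReportNum` follow the snippet in the comment, and the `StorageType.PROVIDED` check is omitted for brevity:

```java
public class BlockReportLeaseGuard {
    // Sketch of the proposed check: during startup safe mode, discard a
    // non-initial report and remove the DN's lease only once the *last*
    // storage report of the batch has arrived, so a retried multi-RPC FBR
    // is not rejected halfway through for lack of a lease.
    static boolean shouldDiscardAndRemoveLease(boolean inStartupSafeMode,
                                               int blockReportCount,
                                               int currentReportNum,
                                               int totalReportNum) {
        return inStartupSafeMode
            && blockReportCount > 0          // storage has reported before
            && currentReportNum == totalReportNum; // last report of the batch
    }
}
```

With 12 storage reports per FBR, reports 1 through 11 of a retried FBR would no longer trigger lease removal; only the 12th would, which is the behavior the comment argues for.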
[jira] [Updated] (HDFS-17067) Use BlockingThreadPoolExecutorService for nnProbingThreadPool in ObserverReadProxy
[ https://issues.apache.org/jira/browse/HDFS-17067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17067:
Description: In HDFS-17030, we introduced an ExecutorService to submit getHAServiceState() requests. We constructed the ExecutorService directly from a basic ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. As a result, the core thread is kept up and running even when the main thread exits. To fix it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we decided to directly use an existing executorService implementation (_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of setting _allowCoreThreadTimeOut_ and also allows setting the prefix for thread names.
{code:java}
private final ExecutorService nnProbingThreadPool =
    new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
        new ArrayBlockingQueue(1024));
{code}
A second minor issue is we did not shut down the executorService in close(). It is a minor issue as close() will only be called when the garbage collector starts to reclaim an ObserverReadProxyProvider object, not when there is no reference to the ObserverReadProxyProvider object. The time between when an ObserverReadProxyProvider becomes dereferenced and when the garbage collector actually starts to reclaim that object is out of our control/undefined (unless the program is shut down with an explicit System.exit(1)).
was: In HDFS-17030, we introduced an ExecutorService to submit getHAServiceState() requests. We constructed the ExecutorService directly from a basic ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. As a result, the core thread is kept up and running even when the main thread exits. To fix it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we decided to directly use an existing executorService implementation (_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of setting _allowCoreThreadTimeOut_ and also allows setting the thread prefix.
{code:java}
private final ExecutorService nnProbingThreadPool =
    new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
        new ArrayBlockingQueue(1024));
{code}
A second minor issue is we did not shut down the executorService in close(). It is a minor issue as close() will only be called when the garbage collector starts to reclaim an ObserverReadProxyProvider object, not when there is no reference to the ObserverReadProxyProvider object. The time between when an ObserverReadProxyProvider becomes dereferenced and when the garbage collector actually starts to reclaim that object is out of our control/undefined (unless the program is shut down with an explicit System.exit(1)).
> Use BlockingThreadPoolExecutorService for nnProbingThreadPool in ObserverReadProxy
> --
>
> Key: HDFS-17067
> URL: https://issues.apache.org/jira/browse/HDFS-17067
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 3.4.0
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Major
>
> In HDFS-17030, we introduced an ExecutorService to submit getHAServiceState() requests. We constructed the ExecutorService directly from a basic ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. As a result, the core thread is kept up and running even when the main thread exits. To fix it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we decided to directly use an existing executorService implementation (_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of setting _allowCoreThreadTimeOut_ and also allows setting the prefix for thread names.
> {code:java}
> private final ExecutorService nnProbingThreadPool =
>     new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
>         new ArrayBlockingQueue(1024));
> {code}
> A second minor issue is we did not shut down the executorService in close(). It is a minor issue as close() will only be called when the garbage collector starts to reclaim an ObserverReadProxyProvider object, not when there is no reference to the ObserverReadProxyProvider object. The time between when an ObserverReadProxyProvider becomes dereferenced and when the garbage collector actually starts to reclaim that object is out of our control/undefined (unless the program is shut down with an explicit System.exit(1)).
[jira] [Updated] (HDFS-17067) Use BlockingThreadPoolExecutorService for nnProbingThreadPool in ObserverReadProxy
[ https://issues.apache.org/jira/browse/HDFS-17067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17067:
Description: In HDFS-17030, we introduced an ExecutorService to submit getHAServiceState() requests. We constructed the ExecutorService directly from a basic ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. As a result, the core thread is kept up and running even when the main thread exits. To fix it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we decided to directly use an existing executorService implementation (_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of setting _allowCoreThreadTimeOut_ and also allows setting the thread prefix. Second minor issue is we did not shut down the executorService in close(). It is a minor issue as close() will only be called when the garbage collector starts to reclaim an ObserverReadProxyProvider object, not when there is no reference to the ObserverReadProxyProvider object. The time between when an ObserverReadProxyProvider becomes dereferenced and when the garbage collector actually starts to reclaim that object is out of our control/undefined (unless the program is shut down with an explicit System.exit(1)).
{code:java}
private final ExecutorService nnProbingThreadPool =
    new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
        new ArrayBlockingQueue(1024));
{code}
was: In HDFS-17030, we introduced an ExecutorService to submit getHAServiceState() requests. We constructed the ExecutorService directly from a basic ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. As a result, the core thread is kept up and running even when the main thread exits. To fix it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we decided to directly use an existing executorService implementation (_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of setting _allowCoreThreadTimeOut_ and allowing setting the thread prefix. Second minor issue is we did not shut down the executorService in close(). It is a minor issue as close() will only be called when the garbage collector starts to reclaim an ObserverReadProxyProvider object, not when there is no reference to the ObserverReadProxyProvider object. The time between when an ObserverReadProxyProvider becomes dereferenced and when the garbage collector actually starts to reclaim that object is out of our control/undefined (unless the program is shut down with an explicit System.exit(1)).
{code:java}
private final ExecutorService nnProbingThreadPool =
    new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
        new ArrayBlockingQueue(1024));
{code}
> Use BlockingThreadPoolExecutorService for nnProbingThreadPool in ObserverReadProxy
> --
>
> Key: HDFS-17067
> URL: https://issues.apache.org/jira/browse/HDFS-17067
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 3.4.0
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Major
>
> In HDFS-17030, we introduced an ExecutorService to submit getHAServiceState() requests. We constructed the ExecutorService directly from a basic ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. As a result, the core thread is kept up and running even when the main thread exits. To fix it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we decided to directly use an existing executorService implementation (_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of setting _allowCoreThreadTimeOut_ and also allows setting the thread prefix. Second minor issue is we did not shut down the executorService in close(). It is a minor issue as close() will only be called when the garbage collector starts to reclaim an ObserverReadProxyProvider object, not when there is no reference to the ObserverReadProxyProvider object. The time between when an ObserverReadProxyProvider becomes dereferenced and when the garbage collector actually starts to reclaim that object is out of our control/undefined (unless the program is shut down with an explicit System.exit(1)).
> {code:java}
> private final ExecutorService nnProbingThreadPool =
>     new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
>         new ArrayBlockingQueue(1024));
> {code}
[jira] [Updated] (HDFS-17067) Use BlockingThreadPoolExecutorService for nnProbingThreadPool in ObserverReadProxy
[ https://issues.apache.org/jira/browse/HDFS-17067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17067:
Description: In HDFS-17030, we introduced an ExecutorService to submit getHAServiceState() requests. We constructed the ExecutorService directly from a basic ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. As a result, the core thread is kept up and running even when the main thread exits. To fix it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we decided to directly use an existing executorService implementation (_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of setting _allowCoreThreadTimeOut_ and also allows setting the thread prefix.
{code:java}
private final ExecutorService nnProbingThreadPool =
    new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
        new ArrayBlockingQueue(1024));
{code}
A second minor issue is we did not shut down the executorService in close(). It is a minor issue as close() will only be called when the garbage collector starts to reclaim an ObserverReadProxyProvider object, not when there is no reference to the ObserverReadProxyProvider object. The time between when an ObserverReadProxyProvider becomes dereferenced and when the garbage collector actually starts to reclaim that object is out of our control/undefined (unless the program is shut down with an explicit System.exit(1)).
was: In HDFS-17030, we introduced an ExecutorService to submit getHAServiceState() requests. We constructed the ExecutorService directly from a basic ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. As a result, the core thread is kept up and running even when the main thread exits. To fix it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we decided to directly use an existing executorService implementation (_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of setting _allowCoreThreadTimeOut_ and also allows setting the thread prefix. Second minor issue is we did not shut down the executorService in close(). It is a minor issue as close() will only be called when the garbage collector starts to reclaim an ObserverReadProxyProvider object, not when there is no reference to the ObserverReadProxyProvider object. The time between when an ObserverReadProxyProvider becomes dereferenced and when the garbage collector actually starts to reclaim that object is out of our control/undefined (unless the program is shut down with an explicit System.exit(1)).
{code:java}
private final ExecutorService nnProbingThreadPool =
    new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
        new ArrayBlockingQueue(1024));
{code}
> Use BlockingThreadPoolExecutorService for nnProbingThreadPool in ObserverReadProxy
> --
>
> Key: HDFS-17067
> URL: https://issues.apache.org/jira/browse/HDFS-17067
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 3.4.0
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Major
>
> In HDFS-17030, we introduced an ExecutorService to submit getHAServiceState() requests. We constructed the ExecutorService directly from a basic ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. As a result, the core thread is kept up and running even when the main thread exits. To fix it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we decided to directly use an existing executorService implementation (_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of setting _allowCoreThreadTimeOut_ and also allows setting the thread prefix.
> {code:java}
> private final ExecutorService nnProbingThreadPool =
>     new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
>         new ArrayBlockingQueue(1024));
> {code}
> A second minor issue is we did not shut down the executorService in close(). It is a minor issue as close() will only be called when the garbage collector starts to reclaim an ObserverReadProxyProvider object, not when there is no reference to the ObserverReadProxyProvider object. The time between when an ObserverReadProxyProvider becomes dereferenced and when the garbage collector actually starts to reclaim that object is out of our control/undefined (unless the program is shut down with an explicit System.exit(1)).
[jira] [Created] (HDFS-17067) allowCoreThreadTimeOut should be set to true for nnProbingThreadPool in ObserverReadProxy
Xing Lin created HDFS-17067: --- Summary: allowCoreThreadTimeOut should be set to true for nnProbingThreadPool in ObserverReadProxy Key: HDFS-17067 URL: https://issues.apache.org/jira/browse/HDFS-17067 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Affects Versions: 3.4.0 Reporter: Xing Lin Assignee: Xing Lin
In HDFS-17030, we introduced an ExecutorService to submit getHAServiceState() requests. We constructed the ExecutorService directly from a basic ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. As a result, the core thread is kept up and running even when the main thread exits. To fix it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we decided to directly use an existing executorService implementation (_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of setting _allowCoreThreadTimeOut_ and allowing setting the thread prefix. Second minor issue is we did not shut down the executorService in close(). It is a minor issue as close() will only be called when the garbage collector starts to reclaim an ObserverReadProxyProvider object, not when there is no reference to the ObserverReadProxyProvider object. The time between when an ObserverReadProxyProvider becomes dereferenced and when the garbage collector actually starts to reclaim that object is out of our control/undefined (unless the program is shut down with an explicit System.exit(1)).
{code:java}
private final ExecutorService nnProbingThreadPool =
    new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
        new ArrayBlockingQueue(1024));
{code}
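The one-line fix described above can be illustrated with plain JDK classes. This is a minimal sketch, not Hadoop's actual BlockingThreadPoolExecutorService; the thread-name prefix "nn-probe" is made up for illustration:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ProbingPoolSketch {
    // Same shape as the pool in the description, plus the two things
    // BlockingThreadPoolExecutorService provides on top of it: core threads
    // that time out when idle (so a lingering core thread cannot keep the
    // JVM alive after the main thread exits) and a thread-name prefix.
    static ThreadPoolExecutor newProbingPool() {
        AtomicInteger counter = new AtomicInteger();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, 4, 1L, TimeUnit.MINUTES,
            new ArrayBlockingQueue<>(1024),
            r -> new Thread(r, "nn-probe-" + counter.incrementAndGet()));
        pool.allowCoreThreadTimeOut(true); // the call missing in HDFS-17030
        return pool;
    }
}
```

Callers should also invoke `shutdown()` on the pool in `close()`, which addresses the second minor issue in the description: finalization-driven cleanup is not guaranteed to run in any bounded time.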
[jira] [Updated] (HDFS-17067) Use BlockingThreadPoolExecutorService for nnProbingThreadPool in ObserverReadProxy
[ https://issues.apache.org/jira/browse/HDFS-17067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17067:
Summary: Use BlockingThreadPoolExecutorService for nnProbingThreadPool in ObserverReadProxy (was: allowCoreThreadTimeOut should be set to true for nnProbingThreadPool in ObserverReadProxy)
> Use BlockingThreadPoolExecutorService for nnProbingThreadPool in ObserverReadProxy
> --
>
> Key: HDFS-17067
> URL: https://issues.apache.org/jira/browse/HDFS-17067
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 3.4.0
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Major
>
> In HDFS-17030, we introduced an ExecutorService to submit getHAServiceState() requests. We constructed the ExecutorService directly from a basic ThreadPoolExecutor, without setting _allowCoreThreadTimeOut_ to true. As a result, the core thread is kept up and running even when the main thread exits. To fix it, one could set _allowCoreThreadTimeOut_ to true. However, in this PR, we decided to directly use an existing executorService implementation (_BlockingThreadPoolExecutorService_) in hadoop instead. It takes care of setting _allowCoreThreadTimeOut_ and allowing setting the thread prefix. Second minor issue is we did not shut down the executorService in close(). It is a minor issue as close() will only be called when the garbage collector starts to reclaim an ObserverReadProxyProvider object, not when there is no reference to the ObserverReadProxyProvider object. The time between when an ObserverReadProxyProvider becomes dereferenced and when the garbage collector actually starts to reclaim that object is out of our control/undefined (unless the program is shut down with an explicit System.exit(1)).
> {code:java}
> private final ExecutorService nnProbingThreadPool =
>     new ThreadPoolExecutor(1, 4, 1L, TimeUnit.MINUTES,
>         new ArrayBlockingQueue(1024));
> {code}
[jira] [Updated] (HDFS-17055) Export HAState as a metric from Namenode for monitoring
[ https://issues.apache.org/jira/browse/HDFS-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17055:
Description: We'd like to measure the uptime for Namenodes: the percentage of time when we have the active/standby/observer node available (up and running). We could monitor the namenode from an external service, such as ZKFC. But that would require the external service itself to be 100% available. And when this third-party external monitoring service is down, we won't have info on whether our Namenodes are still up. We propose to take a different approach: we will emit Namenode state directly from the namenode itself. Whenever we miss a data point for this metric, we consider the corresponding namenode to be down/not available. In other words, we assume the metric collection/monitoring infrastructure to be 100% reliable. One implementation detail: in hadoop, we have the _NameNodeMetrics_ class, which is currently used to emit all metrics for {_}NameNode.java{_}. However, we don't think that is a good place to emit NameNode HAState. HAState is stored in NameNode.java and we should directly emit it from NameNode.java. Otherwise, we basically duplicate this info in two classes and we would have to keep them in sync. Besides, the _NameNodeMetrics_ class does not have a reference to the _NameNode_ object which it belongs to. A _NameNodeMetrics_ is created by a _static_ function _initMetrics()_ in {_}NameNode.java{_}. We shouldn't emit HA state from FSNameSystem.java either, as it is initialized from NameNode.java and all state transitions are implemented in NameNode.java.
was: We'd like to measure the uptime for Namenodes: the percentage of time when we have the active/standby/observer node available (up and running). We could monitor the namenode from an external service, such as ZKFC. But that would require the external service itself to be 100% available. And when this third-party external monitoring service is down, we won't have info on whether our Namenodes are still up. We propose to take a different approach: we will emit Namenode state directly from the namenode itself. Whenever we miss a data point for this metric, we consider the corresponding namenode to be down/not available. In other words, we assume the metric collection/monitoring infrastructure to be 100% reliable. One implementation detail: in hadoop, we have the _NameNodeMetrics_ class, which is currently used to emit all metrics for {_}NameNode.java{_}. However, we don't think that is a good place to emit NameNode HAState. HAState is stored in NameNode.java and we should directly emit it from NameNode.java. Otherwise, we basically duplicate this info in two classes and we would have to keep them in sync. Besides, the _NameNodeMetrics_ class does not have a reference to the _NameNode_ object which it belongs to. A _NameNodeMetrics_ is created by a _static_ function _initMetrics()_ in {_}NameNode.java{_}. We shouldn't emit HA state from FSNameSystem.java either, as it is initialized from NameNode.java and all state transitions are implemented in NameNode.java.
> Export HAState as a metric from Namenode for monitoring
> ---
>
> Key: HDFS-17055
> URL: https://issues.apache.org/jira/browse/HDFS-17055
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.4.0, 3.3.9
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Minor
>
> We'd like to measure the uptime for Namenodes: the percentage of time when we have the active/standby/observer node available (up and running). We could monitor the namenode from an external service, such as ZKFC. But that would require the external service itself to be 100% available. And when this third-party external monitoring service is down, we won't have info on whether our Namenodes are still up.
> We propose to take a different approach: we will emit Namenode state directly from the namenode itself. Whenever we miss a data point for this metric, we consider the corresponding namenode to be down/not available. In other words, we assume the metric collection/monitoring infrastructure to be 100% reliable.
> One implementation detail: in hadoop, we have the _NameNodeMetrics_ class, which is currently used to emit all metrics for {_}NameNode.java{_}. However, we don't think that is a good place to emit NameNode HAState. HAState is stored in NameNode.java and we should directly emit it from NameNode.java. Otherwise, we basically duplicate this info in two classes and we would have to keep them in sync. Besides, the _NameNodeMetrics_ class does not have a reference to the _NameNode_ object which it belongs to. A _NameNodeMetrics_ is created by a _static_ function _initMetrics()_ in
[jira] [Updated] (HDFS-17055) Export HAState as a metric from Namenode for monitoring
[ https://issues.apache.org/jira/browse/HDFS-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17055:
Description:
We'd like to measure the uptime for NameNodes: the percentage of time when we have the active/standby/observer node available (up and running). We could monitor the NameNode from an external service, such as ZKFC. But that would require the external service itself to be 100% available. And when this third-party external monitoring service is down, we won't have info on whether our NameNodes are still up.

We propose to take a different approach: we will emit the NameNode state directly from the NameNode itself. Whenever we miss a data point for this metric, we consider the corresponding NameNode to be down/not available. In other words, we assume the metric collection/monitoring infrastructure is 100% reliable.

One implementation detail: in Hadoop, we have the _NameNodeMetrics_ class, which is currently used to emit all metrics for _NameNode.java_. However, we don't think that is a good place to emit the NameNode HAState. HAState is stored in NameNode.java, and we should emit it directly from NameNode.java. Otherwise, we basically duplicate this info in two classes and would have to keep them in sync. Besides, the _NameNodeMetrics_ class does not have a reference to the _NameNode_ object to which it belongs. A _NameNodeMetrics_ is created by a _static_ function _initMetrics()_ in _NameNode.java_. We shouldn't emit the HA state from FSNameSystem.java either, as it is initialized from NameNode.java and all state transitions are implemented in NameNode.java.

was:
We'd like to measure the uptime for NameNodes: the percentage of time when we have the active/standby/observer node available (up and running). We could monitor the NameNode from an external service, such as ZKFC. But that would require the external service itself to be 100% available. And when this third-party external monitoring service is down, we won't have info on whether our NameNodes are still up. We propose to take a different approach: we will emit the NameNode state directly from the NameNode itself. Whenever we miss a data point for this metric, we consider the corresponding NameNode to be down/not available. In other words, we assume the metric collection/monitoring infrastructure is 100% reliable. One implementation detail: in Hadoop, we have the _NameNodeMetrics_ class, which is used to emit all metrics for _NameNode.java_. However, we don't think that is a good place to emit the NameNode HAState. HAState is stored in NameNode.java, and we should emit it directly from NameNode.java. Otherwise, we basically duplicate this info in two classes and would have to keep them in sync. Besides, the _NameNodeMetrics_ class does not have a reference to the _NameNode_ object to which it belongs. A _NameNodeMetrics_ is created by a _static_ function _initMetrics()_ in _NameNode.java_. We shouldn't emit the HA state from FSNameSystem.java either, as it is initialized from NameNode.java and all state transitions are implemented in NameNode.java.

> Export HAState as a metric from Namenode for monitoring
> ---
>
> Key: HDFS-17055
> URL: https://issues.apache.org/jira/browse/HDFS-17055
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.4.0, 3.3.9
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Minor
[jira] [Assigned] (HDFS-17055) Export HAState as a metric from Namenode for monitoring
[ https://issues.apache.org/jira/browse/HDFS-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin reassigned HDFS-17055:
---
Assignee: Xing Lin

> Export HAState as a metric from Namenode for monitoring
> ---
>
> Key: HDFS-17055
> URL: https://issues.apache.org/jira/browse/HDFS-17055
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.4.0, 3.3.9
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Minor
>
> We'd like to measure the uptime for NameNodes: the percentage of time when we have the active/standby/observer node available (up and running). We could monitor the NameNode from an external service, such as ZKFC. But that would require the external service itself to be 100% available. And when this third-party external monitoring service is down, we won't have info on whether our NameNodes are still up.
> We propose to take a different approach: we will emit the NameNode state directly from the NameNode itself. Whenever we miss a data point for this metric, we consider the corresponding NameNode to be down/not available. In other words, we assume the metric collection/monitoring infrastructure is 100% reliable.
> One implementation detail: in Hadoop, we have the _NameNodeMetrics_ class, which is used to emit all metrics for _NameNode.java_. However, we don't think that is a good place to emit the NameNode HAState. HAState is stored in NameNode.java, and we should emit it directly from NameNode.java. Otherwise, we basically duplicate this info in two classes and would have to keep them in sync. Besides, the _NameNodeMetrics_ class does not have a reference to the _NameNode_ object to which it belongs. A _NameNodeMetrics_ is created by a _static_ function _initMetrics()_ in _NameNode.java_. We shouldn't emit the HA state from FSNameSystem.java either, as it is initialized from NameNode.java and all state transitions are implemented in NameNode.java.
> -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-17055) Export HAState as a metric from Namenode for monitoring
Xing Lin created HDFS-17055:
---
Summary: Export HAState as a metric from Namenode for monitoring
Key: HDFS-17055
URL: https://issues.apache.org/jira/browse/HDFS-17055
Project: Hadoop HDFS
Issue Type: Improvement
Components: hdfs
Affects Versions: 3.4.0, 3.3.9
Reporter: Xing Lin

We'd like to measure the uptime for NameNodes: the percentage of time when we have the active/standby/observer node available (up and running). We could monitor the NameNode from an external service, such as ZKFC. But that would require the external service itself to be 100% available. And when this third-party external monitoring service is down, we won't have info on whether our NameNodes are still up.

We propose to take a different approach: we will emit the NameNode state directly from the NameNode itself. Whenever we miss a data point for this metric, we consider the corresponding NameNode to be down/not available. In other words, we assume the metric collection/monitoring infrastructure is 100% reliable.

One implementation detail: in Hadoop, we have the _NameNodeMetrics_ class, which is used to emit all metrics for _NameNode.java_. However, we don't think that is a good place to emit the NameNode HAState. HAState is stored in NameNode.java, and we should emit it directly from NameNode.java. Otherwise, we basically duplicate this info in two classes and would have to keep them in sync. Besides, the _NameNodeMetrics_ class does not have a reference to the _NameNode_ object to which it belongs. A _NameNodeMetrics_ is created by a _static_ function _initMetrics()_ in _NameNode.java_. We shouldn't emit the HA state from FSNameSystem.java either, as it is initialized from NameNode.java and all state transitions are implemented in NameNode.java.
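The idea above boils down to exposing the NameNode's HA state enum as a numeric gauge that is read lazily from the NameNode object on every metrics poll, so there is no second copy of the state to keep in sync. A minimal sketch in plain Java (not the actual Hadoop metrics2 wiring; the numeric encoding and class/method names here are assumptions for illustration, though the enum constants match HAServiceProtocol.HAServiceState):

```java
import java.util.function.IntSupplier;
import java.util.function.Supplier;

// Illustrative sketch only: map an HA state enum to a gauge value that a
// metrics system can poll directly from the owning object.
public class HAStateGauge {
    // Constants mirror HAServiceProtocol.HAServiceState; the integer
    // encoding below is an assumption of this sketch, not a Hadoop contract.
    public enum HAState { INITIALIZING, ACTIVE, STANDBY, OBSERVER, STOPPING }

    public static int encode(HAState state) {
        switch (state) {
            case ACTIVE:   return 0;
            case STANDBY:  return 1;
            case OBSERVER: return 2;
            default:       return -1; // INITIALIZING/STOPPING: not serving
        }
    }

    // The gauge re-reads the state from its owner on every poll, avoiding a
    // duplicated copy of the state in a separate metrics class.
    public static IntSupplier gaugeFor(Supplier<HAState> owner) {
        return () -> encode(owner.get());
    }

    public static void main(String[] args) {
        IntSupplier gauge = gaugeFor(() -> HAState.OBSERVER);
        System.out.println(gauge.getAsInt()); // prints 2 under this encoding
    }
}
```

A missed scrape of this gauge is then interpreted exactly as the description proposes: the corresponding NameNode is assumed down.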
[jira] [Updated] (HDFS-17042) Add rpcCallSuccesses and OverallRpcProcessingTime to RpcMetrics for Namenode
[ https://issues.apache.org/jira/browse/HDFS-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17042:
Description:
We'd like to add two new types of metrics to the existing NN RpcMetrics/RpcDetailedMetrics. These two metrics can then be used as part of the SLA/SLO for the HDFS service.
* _RpcCallSuccesses_: measures the number of RPC requests that are successfully processed by a NN (e.g., answered with an RpcStatus of _RpcStatusProto.SUCCESS_). Then, together with _RpcQueueNumOps_ (which refers to the total number of RPC requests), we can derive the RpcErrorRate for our NN as (RpcQueueNumOps - RpcCallSuccesses) / RpcQueueNumOps.
* OverallRpcProcessingTime for each RPC method: this metric measures the overall RPC processing time for each RPC method at the NN. It covers the time from when a request arrives at the NN to when a response is sent back. We are already emitting processingTime for each RPC method today in RpcDetailedMetrics. We want to extend it to emit overallRpcProcessingTime for each RPC method, which includes enqueueTime, queueTime, processingTime, responseTime, and handlerTime.

was:
We'd like to add two new types of metrics to the existing NN RpcMetrics/RpcDetailedMetrics.
* _RpcCallSuccesses_: measures the number of RPC requests that are successfully processed by a NN (e.g., answered with an RpcStatus of _RpcStatusProto.SUCCESS_). Then, together with _RpcQueueNumOps_ (which refers to the total number of RPC requests), we can derive the RpcErrorRate for our NN as (RpcQueueNumOps - RpcCallSuccesses) / RpcQueueNumOps.
* OverallRpcProcessingTime for each RPC method: this metric measures the overall RPC processing time for each RPC method at the NN. It covers the time from when a request arrives at the NN to when a response is sent back. We are already emitting processingTime for each RPC method today in RpcDetailedMetrics. We want to extend it to emit overallRpcProcessingTime for each RPC method, which includes enqueueTime, queueTime, processingTime, responseTime, and handlerTime.

> Add rpcCallSuccesses and OverallRpcProcessingTime to RpcMetrics for Namenode
> ---
>
> Key: HDFS-17042
> URL: https://issues.apache.org/jira/browse/HDFS-17042
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.4.0, 3.3.9
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Major
[jira] [Updated] (HDFS-17042) Add rpcCallSuccesses and OverallRpcProcessingTime to RpcMetrics for Namenode
[ https://issues.apache.org/jira/browse/HDFS-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17042:
Description:
We'd like to add two new types of metrics to the existing NN RpcMetrics/RpcDetailedMetrics.
* _RpcCallSuccesses_: measures the number of RPC requests that are successfully processed by a NN (e.g., answered with an RpcStatus of _RpcStatusProto.SUCCESS_). Then, together with _RpcQueueNumOps_ (which refers to the total number of RPC requests), we can derive the RpcErrorRate for our NN as (RpcQueueNumOps - RpcCallSuccesses) / RpcQueueNumOps.
* OverallRpcProcessingTime for each RPC method: this metric measures the overall RPC processing time for each RPC method at the NN. It covers the time from when a request arrives at the NN to when a response is sent back. We are already emitting processingTime for each RPC method today in RpcDetailedMetrics. We want to extend it to emit overallRpcProcessingTime for each RPC method, which includes enqueueTime, queueTime, processingTime, responseTime, and handlerTime.

was:
We'd like to add two new types of metrics to the existing RpcMetrics/RpcDetailedMetrics.
* _RpcCallSuccesses_: measures the number of RPC requests that are successfully processed by a NN (e.g., answered with an RpcStatus of _RpcStatusProto.SUCCESS_). Then, together with _RpcQueueNumOps_ (which refers to the total number of RPC requests), we can derive the RpcErrorRate for our NN as (RpcQueueNumOps - RpcCallSuccesses) / RpcQueueNumOps.
* OverallRpcProcessingTime for each RPC method: this metric measures the overall RPC processing time for each RPC method at the NN. It covers the time from when a request arrives at the NN to when a response is sent back. We are already emitting processingTime for each RPC method today in RpcDetailedMetrics. We want to extend it to emit overallRpcProcessingTime for each RPC method, which includes enqueueTime, queueTime, processingTime, responseTime, and handlerTime.

> Add rpcCallSuccesses and OverallRpcProcessingTime to RpcMetrics for Namenode
> ---
>
> Key: HDFS-17042
> URL: https://issues.apache.org/jira/browse/HDFS-17042
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.4.0, 3.3.9
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Major
[jira] [Created] (HDFS-17042) Add rpcCallSuccesses and OverallRpcProcessingTime to RpcMetrics for Namenode
Xing Lin created HDFS-17042:
---
Summary: Add rpcCallSuccesses and OverallRpcProcessingTime to RpcMetrics for Namenode
Key: HDFS-17042
URL: https://issues.apache.org/jira/browse/HDFS-17042
Project: Hadoop HDFS
Issue Type: Improvement
Components: hdfs
Affects Versions: 3.4.0, 3.3.9
Reporter: Xing Lin
Assignee: Xing Lin

We'd like to add two new types of metrics to the existing RpcMetrics/RpcDetailedMetrics.
* _RpcCallSuccesses_: measures the number of RPC requests that are successfully processed by a NN (e.g., answered with an RpcStatus of _RpcStatusProto.SUCCESS_). Then, together with _RpcQueueNumOps_ (which refers to the total number of RPC requests), we can derive the RpcErrorRate for our NN as (RpcQueueNumOps - RpcCallSuccesses) / RpcQueueNumOps.
* OverallRpcProcessingTime for each RPC method: this metric measures the overall RPC processing time for each RPC method at the NN. It covers the time from when a request arrives at the NN to when a response is sent back. We are already emitting processingTime for each RPC method today in RpcDetailedMetrics. We want to extend it to emit overallRpcProcessingTime for each RPC method, which includes enqueueTime, queueTime, processingTime, responseTime, and handlerTime.
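The RpcErrorRate derivation above is plain counter arithmetic. A small sketch of how a monitoring job might compute it (the metric names come from the description; the helper class and its methods are hypothetical, and since both metrics are monotonically increasing counters, the delta form between two scrapes is what a dashboard would typically use):

```java
// Sketch: derive an RPC error rate from the two counters described above,
// RpcQueueNumOps (total RPC requests) and RpcCallSuccesses (successful ones).
public class RpcErrorRate {
    public static double errorRate(long queueNumOps, long callSuccesses) {
        if (queueNumOps == 0) return 0.0; // no traffic, so no errors
        return (double) (queueNumOps - callSuccesses) / queueNumOps;
    }

    // Delta form for two successive scrapes of the same monotonic counters,
    // restricting the rate to the scrape interval.
    public static double errorRateDelta(long prevOps, long curOps,
                                        long prevSuccesses, long curSuccesses) {
        return errorRate(curOps - prevOps, curSuccesses - prevSuccesses);
    }

    public static void main(String[] args) {
        // e.g., 1000 RPCs observed, 990 succeeded -> 1% error rate
        System.out.println(errorRate(1000, 990)); // prints 0.01
    }
}
```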
[jira] [Updated] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy
[ https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17030:
Description:
When NameNode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN.

Basically, when a standby is down, the RPC client would (re)try to create a socket connection to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at LinkedIn, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request takes more than 2 mins to complete when we take a heap dump at a standby. This has been causing user job failures.

We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending getHAServiceState requests in ObserverReaderProxy (for user RPC requests, we still use the original value from the config). However, that would double the number of socket connections between clients and the NN (which is a deal-breaker).

The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we will wait only up to that timeout for an NN to respond with its HA state. Once we pass that timeout, we will move on to the next NN.

was:
When NameNode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN. Basically, when a standby is down, the RPC client would (re)try to create a socket connection to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at LinkedIn, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request takes more than 2 mins to complete when we take a heap dump at a standby. This has been causing user job failures. We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending getHAServiceState requests in ObserverReaderProxy (for user RPC requests, we still use the original value from the config). However, that would double the number of socket connections between clients and the NN. The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we will wait only up to that timeout for an NN to respond with its HA state. Once we pass that timeout, we will move on to the next NN.

> Limit wait time for getHAServiceState in ObserverReaderProxy
> ---
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.4.0
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Minor
> Labels: pull-request-available
[jira] [Updated] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy
[ https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17030:
Description:
When NameNode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN.

Basically, when a standby is down, the RPC client would (re)try to create a socket connection to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at LinkedIn, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request takes more than 2 mins to complete when we take a heap dump at a standby. This has been causing user job failures.

We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending getHAServiceState requests in ObserverReaderProxy (for user RPC requests, we still use the original value from the config). However, that would double the number of socket connections between clients and the NN (which is a deal-breaker).

The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we will wait only up to that timeout for an NN to respond with its HA state. Once we pass that timeout, we will move on to probe the next NN.

was:
When NameNode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN. Basically, when a standby is down, the RPC client would (re)try to create a socket connection to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at LinkedIn, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request takes more than 2 mins to complete when we take a heap dump at a standby. This has been causing user job failures. We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending getHAServiceState requests in ObserverReaderProxy (for user RPC requests, we still use the original value from the config). However, that would double the number of socket connections between clients and the NN (which is a deal-breaker). The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we will wait only up to that timeout for an NN to respond with its HA state. Once we pass that timeout, we will move on to the next NN.

> Limit wait time for getHAServiceState in ObserverReaderProxy
> ---
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.4.0
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Minor
> Labels: pull-request-available
[jira] [Updated] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy
[ https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17030:
Description:
When NameNode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN.

Basically, when a standby is down, the RPC client would (re)try to create a socket connection to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at LinkedIn, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request takes more than 2 mins to complete when we take a heap dump at a standby. This has been causing user job failures.

We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending getHAServiceState requests in ObserverReaderProxy (for user RPC requests, we still use the original value from the config). However, that would double the number of socket connections between clients and the NN.

The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we will wait only up to that timeout for an NN to respond with its HA state. Once we pass that timeout, we will move on to the next NN.

was:
When NameNode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN. Basically, when a standby is down, the RPC client would (re)try to create a socket connection to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at LinkedIn, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request would need to take more than 2 mins to complete when we take a heap dump at a standby. This has been causing user job failures. We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending getHAServiceState requests in ObserverReaderProxy (for user RPC requests, we still use the original value from the config). However, that would double the number of socket connections between clients and the NN. The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we will wait only up to that timeout for an NN to respond with its HA state. Once we pass that timeout, we will move on to the next NN.

> Limit wait time for getHAServiceState in ObserverReaderProxy
> ---
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.4.0
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Minor
> Labels: pull-request-available
[jira] [Updated] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy
[ https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17030:

Description:
When namenode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN.
Basically, when a standby is down, the RPC client would (re)try to create a socket connection to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request takes more than 2 minutes to complete when we take a heap dump at a standby. This has been causing user job failures.
We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending getHAServiceState requests in ObserverReaderProxy (for user RPC requests, we would still use the original value from the config). However, that would double the number of socket connections between clients and the NN.
The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we wait only up to the timeout for an NN to report its HA state. Once we pass that timeout, we move on to the next NN.

was:
When namenode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN.
Basically, when a standby is down, the RPC client would (re)try to connect to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request takes more than 2 minutes to complete when we take a heap dump at a standby. This has been causing user job failures.
We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending getHAServiceState requests in ObserverReaderProxy (for user RPC requests, we would still use the original value from the config). However, that would double the number of socket connections between clients and the NN.
The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we wait only up to the timeout for an NN to report its HA state. Once we pass that timeout, we move on to the next NN.

> Limit wait time for getHAServiceState in ObserverReaderProxy
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.4.0
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Minor
> Labels: pull-request-available
>
> When namenode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN.
> Basically, when a standby is down, the RPC client would (re)try to create a socket connection to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request takes more than 2 minutes to complete when we take a heap dump at a standby. This has been causing user job failures.
> We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending getHAServiceState requests in ObserverReaderProxy (for user RPC requests, we would still use the original value from the config). However, that would double the number of socket connections between clients and the NN.
> The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we wait only up to the timeout for an NN to report its HA state. Once we pass that timeout, we move on to the next NN.
[jira] [Updated] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy
[ https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17030:

Description:
When namenode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN.
Basically, when a standby is down, the RPC client would (re)try to connect to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request takes more than 2 minutes to complete when we take a heap dump at a standby. This has been causing user job failures.
We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending getHAServiceState requests in ObserverReaderProxy (for user RPC requests, we would still use the original value from the config). However, that would double the number of socket connections between clients and the NN.
The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we wait only up to the timeout for an NN to report its HA state. Once we pass that timeout, we move on to the next NN.

was:
When namenode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN.
Basically, when a standby is down, the RPC client would (re)try to connect to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request takes more than 2 minutes to complete when we take a heap dump at a standby. This has been causing user job failures.
We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending getHAServiceState requests in ObserverReaderProxy. However, that would double the number of socket connections between clients and the NN.
The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we wait only up to the timeout for an NN to report its HA state. Once we pass that timeout, we move on to the next NN.

> Limit wait time for getHAServiceState in ObserverReaderProxy
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.4.0
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Minor
>
> When namenode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN.
> Basically, when a standby is down, the RPC client would (re)try to connect to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request takes more than 2 minutes to complete when we take a heap dump at a standby. This has been causing user job failures.
> We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending getHAServiceState requests in ObserverReaderProxy (for user RPC requests, we would still use the original value from the config). However, that would double the number of socket connections between clients and the NN.
> The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we wait only up to the timeout for an NN to report its HA state. Once we pass that timeout, we move on to the next NN.
[jira] [Updated] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy
[ https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17030:

Description:
When namenode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN.
Basically, when a standby is down, the RPC client would (re)try to connect to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request takes more than 2 minutes to complete when we take a heap dump at a standby. This has been causing user job failures.
We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending getHAServiceState requests in ObserverReaderProxy. However, that would double the number of socket connections between clients and the NN.
The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we wait only up to the timeout for an NN to report its HA state. Once we pass that timeout, we move on to the next NN.

was:
When namenode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN.
Basically, when a standby is down, the RPC client would (re)try to connect to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request takes more than 2 minutes to complete when we take a heap dump at a standby. This has been causing user job failures.
The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we wait only up to the timeout for an NN to report its HA state. Once we pass that timeout, we move on to the next NN.

> Limit wait time for getHAServiceState in ObserverReaderProxy
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.4.0
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Minor
>
> When namenode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN.
> Basically, when a standby is down, the RPC client would (re)try to connect to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request takes more than 2 minutes to complete when we take a heap dump at a standby. This has been causing user job failures.
> We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending getHAServiceState requests in ObserverReaderProxy. However, that would double the number of socket connections between clients and the NN.
> The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we wait only up to the timeout for an NN to report its HA state. Once we pass that timeout, we move on to the next NN.
[jira] [Assigned] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy
[ https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin reassigned HDFS-17030:
---
Assignee: Xing Lin

> Limit wait time for getHAServiceState in ObserverReaderProxy
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.4.0
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Minor
>
> When namenode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN.
> Basically, when a standby is down, the RPC client would (re)try to connect to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request takes more than 2 minutes to complete when we take a heap dump at a standby. This has been causing user job failures.
> The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we wait only up to the timeout for an NN to report its HA state. Once we pass that timeout, we move on to the next NN.
[jira] [Updated] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy
[ https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-17030:

Description:
When namenode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN.
Basically, when a standby is down, the RPC client would (re)try to connect to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request takes more than 2 minutes to complete when we take a heap dump at a standby. This has been causing user job failures.
The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we wait only up to the timeout for an NN to report its HA state. Once we pass that timeout, we move on to the next NN.

was:
When HA is enabled and a standby NN is not responsive (either when it is down or a heap dump is being taken), we would wait for either _socket_connection_timeout * socket_max_retries_on_connection_timeout_ or _rpcTimeOut_ before moving on to the next NN. This adds significant latency. For clusters at Linkedin, we set rpcTimeOut to 120 seconds, and a request takes more than 2 minutes to complete when we take a heap dump at a standby. This has been causing user job failures.
The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we wait only up to the timeout for an NN to report its HA state. Once we pass that timeout, we move on to the next NN.

> Limit wait time for getHAServiceState in ObserverReaderProxy
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.4.0
> Reporter: Xing Lin
> Priority: Minor
>
> When namenode HA is enabled and a standby NN is not responsive, we have observed that it can take a long time to serve a request, even though we have a healthy observer or active NN.
> Basically, when a standby is down, the RPC client would (re)try to connect to that standby for _ipc.client.connect.timeout_ * _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a heap dump at a standby, the NN still accepts the socket connection but won't send responses to these RPC requests, and we would time out after _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at Linkedin, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a request takes more than 2 minutes to complete when we take a heap dump at a standby. This has been causing user job failures.
> The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we wait only up to the timeout for an NN to report its HA state. Once we pass that timeout, we move on to the next NN.
[jira] [Created] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy
Xing Lin created HDFS-17030:
---
Summary: Limit wait time for getHAServiceState in ObserverReaderProxy
Key: HDFS-17030
URL: https://issues.apache.org/jira/browse/HDFS-17030
Project: Hadoop HDFS
Issue Type: Improvement
Components: hdfs
Affects Versions: 3.4.0
Reporter: Xing Lin

When HA is enabled and a standby NN is not responsive (either when it is down or a heap dump is being taken), we would wait for either _socket_connection_timeout * socket_max_retries_on_connection_timeout_ or _rpcTimeOut_ before moving on to the next NN. This adds significant latency. For clusters at Linkedin, we set rpcTimeOut to 120 seconds, and a request takes more than 2 minutes to complete when we take a heap dump at a standby. This has been causing user job failures.
The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy: we wait only up to the timeout for an NN to report its HA state. Once we pass that timeout, we move on to the next NN.
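The proposal above (wait only a bounded time for an NN to report its HA state, then move on) can be sketched with a Future polled under a deadline. This is an illustrative sketch, not the actual ObserverReaderProxy code: the class, method, and enum names below are hypothetical, and the real implementation would issue the getHAServiceState() RPC inside the Callable.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HAStateProbe {
    // Hypothetical stand-in for the HAServiceState returned by getHAServiceState().
    public enum HAServiceState { ACTIVE, STANDBY, OBSERVER, UNKNOWN }

    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    /**
     * Ask one NN for its HA state, but wait at most timeoutMs. If the NN does
     * not answer in time (e.g. it is frozen while a heap dump is taken),
     * return UNKNOWN so the caller can probe the next NN instead of blocking
     * for the full ipc.client.rpc-timeout.ms.
     */
    public HAServiceState getStateWithTimeout(Callable<HAServiceState> rpcCall, long timeoutMs) {
        Future<HAServiceState> future = executor.submit(rpcCall);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);           // give up on this NN
            return HAServiceState.UNKNOWN; // caller moves on to the next NN
        } catch (InterruptedException | ExecutionException e) {
            return HAServiceState.UNKNOWN;
        }
    }

    public void shutdown() {
        executor.shutdownNow();
    }
}
```

With a probe timeout of a few seconds, a frozen standby costs at most that long during proxy selection, while user RPCs keep the original _ipc.client.rpc-timeout.ms_ unchanged.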
[jira] [Assigned] (HDFS-16816) RBF: auto-create user home dir for trash paths by router
[ https://issues.apache.org/jira/browse/HDFS-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin reassigned HDFS-16816:
---
Assignee: (was: Xing Lin)

> RBF: auto-create user home dir for trash paths by router
>
> Key: HDFS-16816
> URL: https://issues.apache.org/jira/browse/HDFS-16816
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Reporter: Xing Lin
> Priority: Minor
> Labels: pull-request-available
>
> In RBF, trash files are moved to the trash root under the user's home dir in the corresponding namespace/namenode where the files reside. This was added in HDFS-16024. When the user's home dir has not been created beforehand at a namenode, we run into permission-denied exceptions when trying to create the parent dir for the trash file before moving the file into it. We propose to enhance the Router to auto-create a user's home dir at the namenode for trash paths, using the router's identity (which is assumed to be a super-user).
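As a rough illustration of the proposal (all names here are hypothetical, not the actual RBF Router code), the router would recognize a trash path, derive the user's home dir from it, and pre-create that dir under its own super-user identity before forwarding the rename into trash:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrashHomeDirSketch {
    // Matches HDFS trash paths of the form /user/<name>/.Trash/...
    private static final Pattern TRASH_PATH = Pattern.compile("^(/user/[^/]+)/\\.Trash(/.*)?$");

    // Stand-in for the downstream namespace; the real router would issue a
    // mkdir RPC to the namenode using the router's (super-user) identity.
    private final Set<String> existingDirs = new HashSet<>();

    /** Returns the home-dir component of a trash path, or null if it is not a trash path. */
    public static String homeDirOf(String path) {
        Matcher m = TRASH_PATH.matcher(path);
        return m.matches() ? m.group(1) : null;
    }

    /**
     * Before forwarding a rename-to-trash, auto-create the user's home dir if
     * it does not exist yet, so creating the trash parent dir afterwards does
     * not fail with permission denied for the end user.
     */
    public boolean ensureHomeDir(String trashPath) {
        String home = homeDirOf(trashPath);
        if (home == null) {
            return false; // not a trash path; nothing to do
        }
        existingDirs.add(home); // idempotent "mkdir -p" as super-user
        return true;
    }
}
```

The key design point is that only the home-dir creation runs as the router's identity; the rename itself still executes with the user's own permissions.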
[jira] [Updated] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-16689:

Description:
Standby NameNode crashes when transitioning to Active with an in-progress tailer, with an error message like the following:
{code:java}
Caused by: java.lang.IllegalStateException: Cannot start writing at txid X when there is a stream available for read: ByteStringEditLog[X, Y], ByteStringEditLog[X, 0]
 at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:344)
 at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.openForWrite(FSEditLogAsync.java:113)
 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1423)
 at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:2132)
 ... 36 more
{code}
After tracing, we found a critical bug in *EditlogTailer#catchupDuringFailover()* when *DFS_HA_TAILEDITS_INPROGRESS_KEY* is true: *catchupDuringFailover()* tries to replay all missed edits from the JournalNodes with {*}onlyDurableTxns=true{*}, and it may not be able to replay any edits when some JournalNodes are abnormal.
To reproduce, suppose:
- There are 2 namenodes, NN0 and NN1, in the Active and Standby states respectively, and there are 3 JournalNodes, JN0, JN1 and JN2.
- NN0 tries to sync 3 edits to the JNs starting at txid 3, but only successfully syncs them to JN1 and JN2 {-}JN3{-}. JN0 is abnormal, e.g. due to GC, a bad network, or a restart.
- NN1's lastAppliedTxId is 2, and at this moment we try to fail over from NN0 to NN1.
- NN1 gets only two responses, from JN0 and JN1, when selecting input streams with *fromTxnId=3* and {*}onlyDurableTxns=true{*}; the committed txid counts in the responses are 0 and 3 respectively. JN2 is abnormal, e.g. due to GC, a bad network, or a restart.
- NN1 cannot replay any edits with *fromTxnId=3* from the JournalNodes because *maxAllowedTxns* is 0.
So I think the Standby NameNode should call *catchupDuringFailover()* with *onlyDurableTxns=false*, so that it can replay all missed edits from the JournalNodes.

was:
Standby NameNode crashes when transitioning to Active with an in-progress tailer, with an error message like the following:
{code:java}
Caused by: java.lang.IllegalStateException: Cannot start writing at txid X when there is a stream available for read: ByteStringEditLog[X, Y], ByteStringEditLog[X, 0]
 at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:344)
 at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.openForWrite(FSEditLogAsync.java:113)
 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1423)
 at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:2132)
 ... 36 more
{code}
After tracing, we found a critical bug in *EditlogTailer#catchupDuringFailover()* when *DFS_HA_TAILEDITS_INPROGRESS_KEY* is true: *catchupDuringFailover()* tries to replay all missed edits from the JournalNodes with *onlyDurableTxns=true*, and it may not be able to replay any edits when some JournalNodes are abnormal.
To reproduce, suppose:
- There are 2 namenodes, NN0 and NN1, in the Active and Standby states respectively, and there are 3 JournalNodes, JN0, JN1 and JN2.
- NN0 tries to sync 3 edits to the JNs starting at txid 3, but only successfully syncs them to JN1 and JN3. JN0 is abnormal, e.g. due to GC, a bad network, or a restart.
- NN1's lastAppliedTxId is 2, and at this moment we try to fail over from NN0 to NN1.
- NN1 gets only two responses, from JN0 and JN1, when selecting input streams with *fromTxnId=3* and *onlyDurableTxns=true*; the committed txid counts in the responses are 0 and 3 respectively. JN2 is abnormal, e.g. due to GC, a bad network, or a restart.
- NN1 cannot replay any edits with *fromTxnId=3* from the JournalNodes because *maxAllowedTxns* is 0.
So I think the Standby NameNode should call *catchupDuringFailover()* with *onlyDurableTxns=false*, so that it can replay all missed edits from the JournalNodes.

> Standby NameNode crashes when transitioning to Active with in-progress tailer
>
> Key: HDFS-16689
> URL: https://issues.apache.org/jira/browse/HDFS-16689
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> Standby NameNode crashes when transitioning to Active with an in-progress tailer, with an error message like the following:
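The reproduction above hinges on how many transactions count as "durable" when only a subset of JournalNodes respond. A loose sketch of that quorum arithmetic (illustrative only; the real logic lives in the QuorumJournalManager input-stream selection, and the names here are made up): a txn is durable only if a majority of JNs report having it, i.e. take the responses sorted descending and read the value at the majority position.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DurableTxnCheck {
    /**
     * Illustrative quorum arithmetic for onlyDurableTxns=true: given the
     * highest committed txn counts reported by the responding JNs, a txn is
     * durable only if a majority of all JNs have it, so the durable count is
     * the majority-th highest response (0 if a quorum did not respond).
     */
    public static long maxDurableTxns(List<Long> txnCountsFromJns, int totalJns) {
        int majority = totalJns / 2 + 1;
        if (txnCountsFromJns.size() < majority) {
            return 0; // no quorum of responses at all
        }
        List<Long> sorted = new ArrayList<>(txnCountsFromJns);
        sorted.sort(Collections.reverseOrder());
        return sorted.get(majority - 1); // value at the majority position
    }
}
```

For the scenario in the description, responses of 0 and 3 committed txns from two of three JNs give a durable count of 0, which is why NN1 cannot replay anything with onlyDurableTxns=true even though JN1 durably holds the edits.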
[jira] [Comment Edited] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode
[ https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655752#comment-17655752 ] Xing Lin edited comment on HDFS-15901 at 1/8/23 1:42 AM:
---
Do we have any followup on this issue? We are seeing a similar issue at Linkedin as well. The standby NN can get stuck in safe mode when restarted, for some of our large clusters. When the NN is stuck in safe mode, the number of missing blocks is different each time, and the numbers are small, from ~800 to 10K, so it does not seem that we are missing an FBR. We are not sure what is causing the issue, but could the following hypothesis be the case? In safe mode, the standby NN receives the first FBR from DN1/DN2/DN3. At a later time, blockA is deleted: it is removed from DN1/DN2/DN3 and they send a new incremental block report (IBR). However, the NN does not process these IBRs (for example, it is paused due to GC). The NN will not process any non-initial FBR from DN1/DN2/DN3, so it will never learn that blockA has already been removed from the cluster, and blockA becomes a missing block that the NN will wait for forever.

was (Author: xinglin):
Do we have any followup on this issue? We are seeing a similar issue at Linkedin as well. The standby NN can get stuck in safe mode when restarted, for some of our large clusters. When the NN is stuck in safe mode, the number of missing blocks is different each time. We are not sure what is causing the issue, but could the following hypothesis be the case? In safe mode, the standby NN receives the first FBR from DN1/DN2/DN3. At a later time, blockA is deleted: it is removed from DN1/DN2/DN3 and they send a new incremental block report (IBR). However, the NN does not process these IBRs (for example, it is paused due to GC). The NN will not process any non-initial FBR from DN1/DN2/DN3, so it will never learn that blockA has already been removed from the cluster, and blockA becomes a missing block that the NN will wait for forever.
> Solve the problem of DN repeated block reports occupying too many RPCs during Safemode
>
> Key: HDFS-15901
> URL: https://issues.apache.org/jira/browse/HDFS-15901
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: JiangHua Zhu
> Assignee: JiangHua Zhu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h
> Remaining Estimate: 0h
>
> When the cluster exceeds thousands of nodes and we restart the NameNode service, all DataNodes send a full block report to the NameNode. During SafeMode, some DataNodes may send block reports to the NameNode multiple times, which takes up too many RPCs. In fact, this is unnecessary.
> In this case, some block report leases will fail or time out, and in extreme cases, the NameNode will stay in Safe Mode forever.
> 2021-03-14 08:16:25,873 [78438700] - INFO [Block report processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded non-initial block report from DatanodeRegistration(:port, datanodeUuid=, infoPort=, infoSecurePort=, ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode still in startup phase
> 2021-03-14 08:16:31,521 [78444348] - INFO [Block report processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded non-initial block report from DatanodeRegistration(, datanodeUuid=, infoPort=, infoSecurePort=, ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode still in startup phase
> 2021-03-13 18:35:38,200 [29191027] - WARN [Block report processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN [Block report processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN [Block report processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for DN , because the lease has expired.
> 2021-03-13 18:36:08,145 [29220972] - WARN [Block report processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for DN , because the lease has expired.
[jira] [Commented] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode
[ https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655752#comment-17655752 ] Xing Lin commented on HDFS-15901:
---
Do we have any followup on this issue? We are seeing a similar issue at Linkedin as well. The standby NN can get stuck in safe mode when restarted, for some of our large clusters. When the NN is stuck in safe mode, the number of missing blocks is different each time. We are not sure what is causing the issue, but could the following hypothesis be the case? In safe mode, the standby NN receives the first FBR from DN1/DN2/DN3. At a later time, blockA is deleted: it is removed from DN1/DN2/DN3 and they send a new incremental block report (IBR). However, the NN does not process these IBRs (for example, it is paused due to GC). The NN will not process any non-initial FBR from DN1/DN2/DN3, so it will never learn that blockA has already been removed from the cluster, and blockA becomes a missing block that the NN will wait for forever.

> Solve the problem of DN repeated block reports occupying too many RPCs during Safemode
>
> Key: HDFS-15901
> URL: https://issues.apache.org/jira/browse/HDFS-15901
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: JiangHua Zhu
> Assignee: JiangHua Zhu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h
> Remaining Estimate: 0h
>
> When the cluster exceeds thousands of nodes and we restart the NameNode service, all DataNodes send a full block report to the NameNode. During SafeMode, some DataNodes may send block reports to the NameNode multiple times, which takes up too many RPCs. In fact, this is unnecessary.
> In this case, some block report leases will fail or time out, and in extreme cases, the NameNode will stay in Safe Mode forever.
> 2021-03-14 08:16:25,873 [78438700] - INFO [Block report processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded non-initial block report from DatanodeRegistration(:port, datanodeUuid=, infoPort=, infoSecurePort=, ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode still in startup phase
> 2021-03-14 08:16:31,521 [78444348] - INFO [Block report processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded non-initial block report from DatanodeRegistration(, datanodeUuid=, infoPort=, infoSecurePort=, ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode still in startup phase
> 2021-03-13 18:35:38,200 [29191027] - WARN [Block report processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN [Block report processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN [Block report processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for DN , because the lease has expired.
> 2021-03-13 18:36:08,145 [29220972] - WARN [Block report processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for DN , because the lease has expired.
[jira] [Updated] (HDFS-16852) HDFS-16852 Register the shutdown hook only when not in shutdown for KeyProviderCache constructor
[ https://issues.apache.org/jira/browse/HDFS-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-16852: Summary: HDFS-16852 Register the shutdown hook only when not in shutdown for KeyProviderCache constructor (was: Swallow IllegalStateException in KeyProviderCache) > HDFS-16852 Register the shutdown hook only when not in shutdown for > KeyProviderCache constructor > > > Key: HDFS-16852 > URL: https://issues.apache.org/jira/browse/HDFS-16852 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Xing Lin >Assignee: Xing Lin >Priority: Minor > Labels: pull-request-available > > When an HDFS client is created, it registers a shutdown hook with the > ShutdownHookManager. ShutdownHookManager doesn't allow adding a new > shutdown hook when the process is already in shutdown, and throws an > IllegalStateException instead. > This behavior is not ideal when a Spark program fails during pre-launch. In > that case, during shutdown, Spark calls cleanStagingDir() to clean the > staging dir. In cleanStagingDir(), it creates a FileSystem object to talk > to HDFS. However, since this is the first use of a FileSystem object in > that process, it needs to create an HDFS client and register the shutdown > hook, and we then hit the IllegalStateException. This IllegalStateException > masks the actual exception which caused the Spark program to fail during > pre-launch. > We propose to swallow the IllegalStateException in KeyProviderCache and log a > warning. The TCP connection between the client and the NameNode should be > closed by the OS when the process shuts down. 
> Example stacktrace > {code:java} > 13-09-2022 14:39:42 PDT INFO - 22/09/13 21:39:41 ERROR util.Utils: Uncaught > exception in thread shutdown-hook-0 > 13-09-2022 14:39:42 PDT INFO - java.lang.IllegalStateException: Shutdown in > progress, cannot add a shutdownHook > 13-09-2022 14:39:42 PDT INFO - at > org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:299) > > 13-09-2022 14:39:42 PDT INFO - at > org.apache.hadoop.hdfs.KeyProviderCache.(KeyProviderCache.java:71) > > 13-09-2022 14:39:42 PDT INFO - at > org.apache.hadoop.hdfs.ClientContext.(ClientContext.java:130) > 13-09-2022 14:39:42 PDT INFO - at > org.apache.hadoop.hdfs.ClientContext.get(ClientContext.java:167) > 13-09-2022 14:39:42 PDT INFO - at > org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:383) > 13-09-2022 14:39:42 PDT INFO - at > org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:287) > 13-09-2022 14:39:42 PDT INFO - at > org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:159) > > 13-09-2022 14:39:42 PDT INFO - at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3261) > > 13-09-2022 14:39:42 PDT INFO - at > org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121) > 13-09-2022 14:39:42 PDT INFO - at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3310) > > 13-09-2022 14:39:42 PDT INFO - at > org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3278) > 13-09-2022 14:39:42 PDT INFO - at > org.apache.hadoop.fs.FileSystem.get(FileSystem.java:475) > 13-09-2022 14:39:42 PDT INFO - at > org.apache.hadoop.fs.Path.getFileSystem(Path.java:356) > 13-09-2022 14:39:42 PDT INFO - at > org.apache.spark.deploy.yarn.ApplicationMaster.cleanupStagingDir(ApplicationMaster.scala:675) > > 13-09-2022 14:39:42 PDT INFO - at > org.apache.spark.deploy.yarn.ApplicationMaster.$anonfun$run$2(ApplicationMaster.scala:259) > > 13-09-2022 14:39:42 PDT INFO - at > 
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) > > 13-09-2022 14:39:42 PDT INFO - at > org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188) > > 13-09-2022 14:39:42 PDT INFO - at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > > 13-09-2022 14:39:42 PDT INFO - at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2023) > 13-09-2022 14:39:42 PDT INFO - at > org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188) > > 13-09-2022 14:39:42 PDT INFO - at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > > 13-09-2022 14:39:42 PDT INFO - at scala.util.Try$.apply(Try.scala:213) > > 13-09-2022 14:39:42 PDT INFO - at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) >
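The direction the updated summary points at ("register the shutdown hook only when not in shutdown") can be sketched with stand-in classes; `HookManager`, `registerIfNotShuttingDown`, and `beginShutdown` below are hypothetical, not the real org.apache.hadoop.util.ShutdownHookManager API. The idea is to check for an in-progress shutdown and skip registration (with a warning) rather than letting addShutdownHook throw and mask the real failure.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the Hadoop hook manager, used only to illustrate the guard.
public class GuardedHookRegistration {
    static class HookManager {
        private final List<Runnable> hooks = new ArrayList<>();
        private volatile boolean shutdownInProgress;

        void beginShutdown() { shutdownInProgress = true; }

        boolean isShutdownInProgress() { return shutdownInProgress; }

        void addShutdownHook(Runnable r) {
            if (shutdownInProgress) {
                // Mirrors the behavior reported in the stack trace above.
                throw new IllegalStateException(
                    "Shutdown in progress, cannot add a shutdownHook");
            }
            hooks.add(r);
        }
    }

    // Returns true when the hook was registered, false when it was skipped.
    static boolean registerIfNotShuttingDown(HookManager mgr, Runnable hook) {
        if (mgr.isShutdownInProgress()) {
            // Skip and log a warning instead; the OS closes the NameNode
            // connection when the process exits anyway.
            return false;
        }
        mgr.addShutdownHook(hook);
        return true;
    }

    public static void main(String[] args) {
        HookManager mgr = new HookManager();
        System.out.println(registerIfNotShuttingDown(mgr, () -> {})); // true
        mgr.beginShutdown();
        System.out.println(registerIfNotShuttingDown(mgr, () -> {})); // false, no exception
    }
}
```

With this guard, a client created from inside another shutdown hook degrades gracefully instead of throwing, so the original pre-launch failure stays visible in the logs.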
[jira] [Updated] (HDFS-16852) Register the shutdown hook only when not in shutdown for KeyProviderCache constructor
[ https://issues.apache.org/jira/browse/HDFS-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-16852: Summary: Register the shutdown hook only when not in shutdown for KeyProviderCache constructor (was: HDFS-16852 Register the shutdown hook only when not in shutdown for KeyProviderCache constructor)
[jira] [Updated] (HDFS-16852) Swallow IllegalStateException in KeyProviderCache
[ https://issues.apache.org/jira/browse/HDFS-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-16852: Summary: Swallow IllegalStateException in KeyProviderCache (was: swallow IllegalStateException in KeyProviderCache)
[jira] [Assigned] (HDFS-16852) swallow IllegalStateException in KeyProviderCache
[ https://issues.apache.org/jira/browse/HDFS-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin reassigned HDFS-16852: --- Assignee: Xing Lin
[jira] [Created] (HDFS-16852) swallow IllegalStateException in KeyProviderCache
Xing Lin created HDFS-16852: --- Summary: swallow IllegalStateException in KeyProviderCache Key: HDFS-16852 URL: https://issues.apache.org/jira/browse/HDFS-16852 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Reporter: Xing Lin
[jira] [Commented] (HDFS-15505) Fix NullPointerException when call getAdditionalDatanode method with null extendedBlock parameter
[ https://issues.apache.org/jira/browse/HDFS-15505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17632720#comment-17632720 ] Xing Lin commented on HDFS-15505: - There is no update from [~hangc] on the PR. I am not sure whether he still plans to fix/finish his PR. [~jianghuazhu], do you have bandwidth to pick this up? > Fix NullPointerException when call getAdditionalDatanode method with null > extendedBlock parameter > - > > Key: HDFS-15505 > URL: https://issues.apache.org/jira/browse/HDFS-15505 > Project: Hadoop HDFS > Issue Type: Bug > Components: dfsclient >Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2, 3.2.0, 3.1.1, 3.0.3, 3.1.2, > 3.3.0, 3.2.1, 3.1.3 >Reporter: hang chen >Priority: Major > > When a client calls the getAdditionalDatanode method, it initializes a > GetAdditionalDatanodeRequestProto and sends an RPC request to the > Router/NameNode. However, if we call the getAdditionalDatanode method with a > null extendedBlock parameter, it sets GetAdditionalDatanodeRequestProto's blk > field to null, which causes a NullPointerException. The code is shown as > follows. > {code:java} > // code placeholder > GetAdditionalDatanodeRequestProto req = GetAdditionalDatanodeRequestProto > .newBuilder() > .setSrc(src) > .setFileId(fileId) > .setBlk(PBHelperClient.convert(blk)) > .addAllExistings(PBHelperClient.convert(existings)) > .addAllExistingStorageUuids(Arrays.asList(existingStorageIDs)) > .addAllExcludes(PBHelperClient.convert(excludes)) > .setNumAdditionalNodes(numAdditionalNodes) > .setClientName(clientName) > .build();{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
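The fix the ticket asks for can be sketched with a null guard around the blk field; `RequestBuilder` and `buildRequest` below are hypothetical stand-ins for the real protobuf builder, used only to show the pattern of skipping setBlk when blk is null rather than passing null into the builder.

```java
// Illustrative stand-in for GetAdditionalDatanodeRequestProto's builder;
// protobuf builders reject null field values, hence the guard.
public class NullBlockGuard {
    static class RequestBuilder {
        private String blk;

        RequestBuilder setBlk(String blk) {
            if (blk == null) {
                // Mirrors how a protobuf builder fails on a null field.
                throw new NullPointerException("blk");
            }
            this.blk = blk;
            return this;
        }

        String build() { return blk == null ? "<no blk>" : blk; }
    }

    // Guarded construction: the blk field is simply left unset when null.
    static String buildRequest(String blk) {
        RequestBuilder b = new RequestBuilder();
        if (blk != null) { // avoids the NPE reported in HDFS-15505
            b.setBlk(blk);
        }
        return b.build();
    }

    public static void main(String[] args) {
        System.out.println(buildRequest("blk_123")); // blk_123
        System.out.println(buildRequest(null));      // <no blk>, no NPE
    }
}
```

Whether leaving the field unset is acceptable server-side is a separate question; the sketch only shows how the client-side NPE is avoided.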
[jira] [Updated] (HDFS-16838) Fix NPE in testAddRplicaProcessorForAddingReplicaInMap
[ https://issues.apache.org/jira/browse/HDFS-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-16838: Description: There is an NPE in org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList#testAddRplicaProcessorForAddingReplicaInMap if we run this UT individually. The related bug is shown below: {code:java} public void testAddRplicaProcessorForAddingReplicaInMap() throws Exception { // BUG here BlockPoolSlice.reInitializeAddReplicaThreadPool(); Configuration cnf = new Configuration(); int poolSize = 5; ... }{code} _addReplicaThreadPool_ may not have been initialized and is null if we run the testAddRplicaProcessorForAddingReplicaInMap unit test individually. {code:java} @VisibleForTesting public static void reInitializeAddReplicaThreadPool() { addReplicaThreadPool.shutdown(); addReplicaThreadPool = null; }{code} was: There is a NPE in org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList#testAddRplicaProcessorForAddingReplicaInMap if we run this UT individually. And the related bug as bellow: {code:java} public void testAddRplicaProcessorForAddingReplicaInMap() throws Exception { // BUG here BlockPoolSlice.reInitializeAddReplicaThreadPool(); Configuration cnf = new Configuration(); int poolSize = 5; ... }{code} > Fix NPE in testAddRplicaProcessorForAddingReplicaInMap > -- > > Key: HDFS-16838 > URL: https://issues.apache.org/jira/browse/HDFS-16838 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: ZanderXu >Assignee: ZanderXu >Priority: Major > Labels: pull-request-available > > There is an NPE in > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList#testAddRplicaProcessorForAddingReplicaInMap > if we run this UT individually. The related bug is shown below: > > {code:java} > public void testAddRplicaProcessorForAddingReplicaInMap() throws Exception { > // BUG here > BlockPoolSlice.reInitializeAddReplicaThreadPool(); > Configuration cnf = new Configuration(); > int poolSize = 5; > ... > }{code} > > _addReplicaThreadPool_ may not have been initialized and is null if we run > the testAddRplicaProcessorForAddingReplicaInMap unit test individually. > {code:java} > @VisibleForTesting > public static void reInitializeAddReplicaThreadPool() { > addReplicaThreadPool.shutdown(); > addReplicaThreadPool = null; > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
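One possible shape of the fix is a null check in the re-initialization helper. The sketch below is self-contained (it is not the actual BlockPoolSlice code; the field and method names only mirror the snippet in the ticket) and tolerates an uninitialized pool when the test runs in isolation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Self-contained sketch of the null-guarded re-initialization.
public class PoolReinit {
    static ExecutorService addReplicaThreadPool; // may be null in isolated runs

    static void reInitializeAddReplicaThreadPool() {
        if (addReplicaThreadPool != null) { // guard against the reported NPE
            addReplicaThreadPool.shutdown();
        }
        addReplicaThreadPool = null;
    }

    public static void main(String[] args) {
        reInitializeAddReplicaThreadPool(); // pool never initialized: no NPE
        addReplicaThreadPool = Executors.newFixedThreadPool(1);
        reInitializeAddReplicaThreadPool(); // initialized pool is shut down
        System.out.println(addReplicaThreadPool); // null
    }
}
```

An alternative fix is to lazily initialize the pool before shutting it down, but the guard above is the minimal change that makes the isolated test pass.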
[jira] [Commented] (HDFS-16816) RBF: auto-create user home dir for trash paths by router
[ https://issues.apache.org/jira/browse/HDFS-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627993#comment-17627993 ] Xing Lin commented on HDFS-16816: - The /user dir may not be writable by any regular user. In that case, assume userA's dir has not been created under /user: when userA calls moveToTrash(/dir/file), it will hit a permission-denied error when trying to create the dir /user/userA/.Trash/Current/dir, because userA does not have write permission for /user. > RBF: auto-create user home dir for trash paths by router > > > Key: HDFS-16816 > URL: https://issues.apache.org/jira/browse/HDFS-16816 > Project: Hadoop HDFS > Issue Type: Improvement > Components: rbf >Reporter: Xing Lin >Assignee: Xing Lin >Priority: Minor > Labels: pull-request-available > > In RBF, trash files are moved to the trash root under the user's home dir at > the corresponding namespace/namenode where the files reside. This was added in > HDFS-16024. When the user's home dir has not been created beforehand at a > namenode, we run into permission-denied exceptions when trying to create the > parent dir for the trash file before moving the file into it. We propose to > enhance the Router to auto-create a user's home dir at the namenode for trash > paths, using the router's identity (which is assumed to be a super-user). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
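The router-side idea can be sketched as follows; `trashHomeDir` is a hypothetical helper, not the actual RBF code. It detects when a client path sits under a user's trash root (the `/user/<name>/.Trash/...` layout assumed here follows HDFS convention) and derives the home directory the router would pre-create with its super-user identity before forwarding the rename.

```java
// Hypothetical helper sketching the HDFS-16816 proposal.
public class TrashHomeDir {
    // e.g. /user/alice/.Trash/Current/dir/file -> /user/alice
    // Returns null when the path is not under a user's trash root.
    static String trashHomeDir(String path) {
        if (path.startsWith("/user/")) {
            int userEnd = path.indexOf('/', "/user/".length());
            if (userEnd > 0 && path.startsWith("/.Trash/", userEnd)) {
                return path.substring(0, userEnd);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(trashHomeDir("/user/alice/.Trash/Current/dir/file")); // /user/alice
        System.out.println(trashHomeDir("/data/file"));                          // null
    }
}
```

When the helper returns a non-null dir, the router would mkdir it as super-user at the target namenode, which removes the permission-denied failure described in the comment above.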
[jira] [Commented] (HDFS-16818) RBF TestRouterRPCMultipleDestinationMountTableResolver non-deterministic unit tests failures
[ https://issues.apache.org/jira/browse/HDFS-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17626178#comment-17626178 ] Xing Lin commented on HDFS-16818: - Even when we unset the storage policy, we still get the HOT policy. {code:java} routerFs.unsetStoragePolicy(mountFile); routerFs.removeXAttr(mountFile, name); assertEquals(0, nnFs.getXAttrs(nameSpaceFile).size()); assertEquals("HOT", nnFs.getStoragePolicy(nameSpaceFile).getName());{code} > RBF TestRouterRPCMultipleDestinationMountTableResolver non-deterministic unit > tests failures > > > Key: HDFS-16818 > URL: https://issues.apache.org/jira/browse/HDFS-16818 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.4.0 >Reporter: Xing Lin >Priority: Major > > TestRouterRPCMultipleDestinationMountTableResolver fails a couple of times > non-deterministically when run multiple times. > I repeated the following command 10+ times against > 454157a3844cdd6c92ef650af6c3b323cbec88af in trunk and observed two types of > failed runs. 
> {code:java} > mvn test -Dtest="TestRouterRPCMultipleDestinationMountTableResolver"{code} > > Failed run 1 output: > {code:java} > [ERROR] Failures: > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashAllOrder:177->testInvocation:221->testDirec > toryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 > expected:<[COLD]> but was:<[HOT]> > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashOrder:193->testInvocation:221->testDirector > yAndFileLevelInvocation:298->verifyDirectoryLevelInvocations:395 > expected:<[COLD]> but was:<[HOT]> > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationLocalOrder:201->testInvocation:221->testDirecto > ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 > expected:<[COLD]> but was:<[HOT]> > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationRandomOrder:185->testInvocation:221->testDirect > oryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 > expected:<[COLD]> but was:<[HOT]> > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationSpaceOrder:169->testInvocation:221->testDirecto > ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 > expected:<[COLD]> but was:<[HOT]> > [INFO] > [ERROR] Tests run: 18, Failures: 5, Errors: 0, Skipped: 0{code} > > Failed run 2 output: > {code:java} > [ERROR] Failures: > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testECMultipleDestinations:430 > [ERROR] Errors: > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashAllOrder:177->testInvocation:221->testDirec > toryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 > NullPointer > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashOrder:193->testInvocation:221->testDirector > yAndFileLevelInvocation:298->verifyDirectoryLevelInvocations:397 NullPointer > [ERROR] > 
TestRouterRPCMultipleDestinationMountTableResolver.testInvocationLocalOrder:201->testInvocation:221->testDirecto > ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationRandomOrder:185->testInvocation:221->testDirect > oryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationSpaceOrder:169->testInvocation:221->testDirecto > ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer > [INFO] > [ERROR] Tests run: 18, Failures: 1, Errors: 5, Skipped: 0{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16818) RBF TestRouterRPCMultipleDestinationMountTableResolver non-deterministic unit tests failures
[ https://issues.apache.org/jira/browse/HDFS-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17624081#comment-17624081 ] Xing Lin commented on HDFS-16818: - This unit test failure is non-deterministic. May be related with HDFS-16740. > RBF TestRouterRPCMultipleDestinationMountTableResolver non-deterministic unit > tests failures > > > Key: HDFS-16818 > URL: https://issues.apache.org/jira/browse/HDFS-16818 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.4.0 >Reporter: Xing Lin >Priority: Major > > TestRouterRPCMultipleDestinationMountTableResolver fails a couple of times > nondeterministically when run multiple times. > I repeated the following commands for 10+ times against > 454157a3844cdd6c92ef650af6c3b323cbec88af in trunk and observed two types of > failed runs. > {code:java} > mvn test -Dtest="TestRouterRPCMultipleDestinationMountTableResolver"{code} > > Failed run 1 output: > {code:java} > [ERROR] Failures: > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashAllOrder:177->testInvocation:221->testDirec > toryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 > expected:<[COLD]> but was:<[HOT]> > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashOrder:193->testInvocation:221->testDirector > yAndFileLevelInvocation:298->verifyDirectoryLevelInvocations:395 > expected:<[COLD]> but was:<[HOT]> > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationLocalOrder:201->testInvocation:221->testDirecto > ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 > expected:<[COLD]> but was:<[HOT]> > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationRandomOrder:185->testInvocation:221->testDirect > oryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 > expected:<[COLD]> but was:<[HOT]> > [ERROR] > 
TestRouterRPCMultipleDestinationMountTableResolver.testInvocationSpaceOrder:169->testInvocation:221->testDirecto > ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 > expected:<[COLD]> but was:<[HOT]> > [INFO] > [ERROR] Tests run: 18, Failures: 5, Errors: 0, Skipped: 0{code} > > Failed run 2 output: > {code:java} > [ERROR] Failures: > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testECMultipleDestinations:430 > [ERROR] Errors: > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashAllOrder:177->testInvocation:221->testDirec > toryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 > NullPointer > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashOrder:193->testInvocation:221->testDirector > yAndFileLevelInvocation:298->verifyDirectoryLevelInvocations:397 NullPointer > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationLocalOrder:201->testInvocation:221->testDirecto > ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationRandomOrder:185->testInvocation:221->testDirect > oryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer > [ERROR] > TestRouterRPCMultipleDestinationMountTableResolver.testInvocationSpaceOrder:169->testInvocation:221->testDirecto > ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer > [INFO] > [ERROR] Tests run: 18, Failures: 1, Errors: 5, Skipped: 0{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16818) RBF TestRouterRPCMultipleDestinationMountTableResolver non-deterministic unit tests failures
Xing Lin created HDFS-16818: --- Summary: RBF TestRouterRPCMultipleDestinationMountTableResolver non-deterministic unit tests failures Key: HDFS-16818 URL: https://issues.apache.org/jira/browse/HDFS-16818 Project: Hadoop HDFS Issue Type: Bug Components: rbf Affects Versions: 3.4.0 Reporter: Xing Lin TestRouterRPCMultipleDestinationMountTableResolver fails a couple of times nondeterministically when run multiple times. I repeated the following commands for 10+ times against 454157a3844cdd6c92ef650af6c3b323cbec88af in trunk and observed two types of failed runs. {code:java} mvn test -Dtest="TestRouterRPCMultipleDestinationMountTableResolver"{code} Failed run 1 output: {code:java} [ERROR] Failures: [ERROR] TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashAllOrder:177->testInvocation:221->testDirec toryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 expected:<[COLD]> but was:<[HOT]> [ERROR] TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashOrder:193->testInvocation:221->testDirector yAndFileLevelInvocation:298->verifyDirectoryLevelInvocations:395 expected:<[COLD]> but was:<[HOT]> [ERROR] TestRouterRPCMultipleDestinationMountTableResolver.testInvocationLocalOrder:201->testInvocation:221->testDirecto ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 expected:<[COLD]> but was:<[HOT]> [ERROR] TestRouterRPCMultipleDestinationMountTableResolver.testInvocationRandomOrder:185->testInvocation:221->testDirect oryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 expected:<[COLD]> but was:<[HOT]> [ERROR] TestRouterRPCMultipleDestinationMountTableResolver.testInvocationSpaceOrder:169->testInvocation:221->testDirecto ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:395 expected:<[COLD]> but was:<[HOT]> [INFO] [ERROR] Tests run: 18, Failures: 5, Errors: 0, Skipped: 0{code} Failed run 2 output: {code:java} [ERROR] Failures: [ERROR] 
TestRouterRPCMultipleDestinationMountTableResolver.testECMultipleDestinations:430 [ERROR] Errors: [ERROR] TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashAllOrder:177->testInvocation:221->testDirec toryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer [ERROR] TestRouterRPCMultipleDestinationMountTableResolver.testInvocationHashOrder:193->testInvocation:221->testDirector yAndFileLevelInvocation:298->verifyDirectoryLevelInvocations:397 NullPointer [ERROR] TestRouterRPCMultipleDestinationMountTableResolver.testInvocationLocalOrder:201->testInvocation:221->testDirecto ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer [ERROR] TestRouterRPCMultipleDestinationMountTableResolver.testInvocationRandomOrder:185->testInvocation:221->testDirect oryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer [ERROR] TestRouterRPCMultipleDestinationMountTableResolver.testInvocationSpaceOrder:169->testInvocation:221->testDirecto ryAndFileLevelInvocation:296->verifyDirectoryLevelInvocations:397 NullPointer [INFO] [ERROR] Tests run: 18, Failures: 1, Errors: 5, Skipped: 0{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16816) RBF: auto-create user home dir for trash paths by router
Xing Lin created HDFS-16816: --- Summary: RBF: auto-create user home dir for trash paths by router Key: HDFS-16816 URL: https://issues.apache.org/jira/browse/HDFS-16816 Project: Hadoop HDFS Issue Type: Improvement Components: rbf Reporter: Xing Lin In RBF, trash files are moved to the trash root under the user's home dir at the corresponding namespace/namenode where the files reside. This was added in HDFS-16024. When the user's home dir is not created beforehand at a namenode, we run into permission-denied exceptions when trying to create the parent dir for the trash file before moving the file into it. We propose to enhance the Router to auto-create the user's home dir at the namenode for trash paths, using the router's identity (which is assumed to be a super-user). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
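A minimal sketch of the proposed behavior (hypothetical names throughout; `java.nio.file` stands in for the HDFS client API, and in the real proposal the create would run with the router's super-user identity against the target NameNode):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TrashHomeDir {
    // Ensure /user/<name> exists under the given namespace root, creating it
    // if missing; idempotent, so repeated trash moves are unaffected.
    static Path ensureHomeDir(Path nsRoot, String user) {
        Path home = nsRoot.resolve("user").resolve(user);
        try {
            if (!Files.isDirectory(home)) {
                Files.createDirectories(home);
            }
            return home;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Small self-check helper: build a scratch namespace root and verify the
    // home dir gets created.
    static boolean demo(String user) {
        try {
            Path ns = Files.createTempDirectory("ns-demo");
            return Files.isDirectory(ensureHomeDir(ns, user));
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path ns = Files.createTempDirectory("ns1");
        Path home = ensureHomeDir(ns, "alice");
        // Second call is a no-op once the dir exists.
        System.out.println(Files.isDirectory(home) && home.equals(ensureHomeDir(ns, "alice"))); // prints true
    }
}
```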
[jira] [Created] (HDFS-16790) rbf wrong path when destination dir is not created
Xing Lin created HDFS-16790: --- Summary: rbf wrong path when destination dir is not created Key: HDFS-16790 URL: https://issues.apache.org/jira/browse/HDFS-16790 Project: Hadoop HDFS Issue Type: Bug Components: rbf Affects Versions: 3.4.0 Reporter: Xing Lin mount table at router {code:java} $HADOOP_HOME/bin/hdfs dfsrouteradmin -ls /data1ns1->/data /data2 ns2->/data /data3ns3->/data {code} At a client node, when /data is not created in ns2, the error message shows a wrong path. {code:java} utos@c01:/usr/local/bin/hadoop-3.4.0-SNAPSHOT$ bin/hadoop dfs -ls hdfs://ns-fed/data2 ls: File hdfs://ns-fed/data2/data2 does not exist. utos@c01:/usr/local/bin/hadoop-3.4.0-SNAPSHOT$ bin/hadoop dfs -ls hdfs://ns-fed/data3 -rw-r--r-- 3 utos supergroup 0 2022-10-02 17:35 hdfs://ns-fed/data3/file3 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16191) [FGL] Fix FSImage loading issues on dynamic partitions
[ https://issues.apache.org/jira/browse/HDFS-16191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414719#comment-17414719 ] Xing Lin commented on HDFS-16191: - Yeah, that does not sound right: when there are 256 partitions, we insert range keys [0, 16385], [1, 16385], [2, 16385], ..., [255, 16385]. If more partitions need to be created, the next ones should be created with range keys: [256, 16385], [257, 16385], [258, 16385], ... When the partition size is changed, we also need to update the indexOf() method. We need a holistic approach to support dynamic partition sizes. # Do we support arbitrary partition sizes or only power-of-2 partition sizes? The latter is probably simpler. # Whenever the partition size is changed, we need to re-shuffle keys in the partitionedGSet. Essentially, it is a rehashing operation. If we double the partition size from 256 to 512, instead of doing indexKey%256, we need to do indexKey%512. > [FGL] Fix FSImage loading issues on dynamic partitions > -- > > Key: HDFS-16191 > URL: https://issues.apache.org/jira/browse/HDFS-16191 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Renukaprasad C >Assignee: Renukaprasad C >Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > When new partitions gets added into PartitionGSet, iterator do not consider > the new partitions. Which always iterate on Static Partition count. This lead > to full of warn messages as below. > 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139780 when saving the leases. > 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139781 when saving the leases. > 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139784 when saving the leases. > 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139785 when saving the leases. 
> 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139786 when saving the leases. > 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139788 when saving the leases. > 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139789 when saving the leases. > 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139790 when saving the leases. > 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139791 when saving the leases. > 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139793 when saving the leases. > 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139795 when saving the leases. > 2021-08-28 03:23:19,422 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139796 when saving the leases. > 2021-08-28 03:23:19,422 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139797 when saving the leases. > 2021-08-28 03:23:19,422 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139800 when saving the leases. > 2021-08-28 03:23:19,422 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139801 when saving the leases. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
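One way to make indexOf() partition-count aware is sketched below (hypothetical names; the mixing constant is arbitrary rather than HDFS's actual LARGE_PRIME, the long[] key and root special case of the real method are dropped, and only power-of-two counts are assumed so the modulo can stay a mask):

```java
public class PartitionIndex {
    // Assumption: an arbitrary large prime for mixing; the real
    // PartitionedGSet defines its own constant.
    static final long LARGE_PRIME = 512927357L;

    // Index of the partition owning a key, for a power-of-two partition count
    // passed in explicitly instead of the hard-coded NUM_RANGES_STATIC.
    static long indexOf(long parentKey, int numPartitions) {
        if (Integer.bitCount(numPartitions) != 1) {
            throw new IllegalArgumentException("partition count must be a power of 2");
        }
        long idx = LARGE_PRIME * parentKey;
        // Equivalent to % numPartitions because numPartitions is a power of 2.
        return (idx ^ (idx >> 32)) & (numPartitions - 1);
    }

    public static void main(String[] args) {
        // Doubling the count from 256 to 512 widens the mask, so existing keys
        // must be re-shuffled (rehashed) into the new partitions.
        System.out.println(indexOf(16385L, 256) + " -> " + indexOf(16385L, 512));
    }
}
```

Passing the count as a parameter also sidesteps the static-method problem: no instance state is read inside indexOf().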
[jira] [Commented] (HDFS-16191) [FGL] Fix FSImage loading issues on dynamic partitions
[ https://issues.apache.org/jira/browse/HDFS-16191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412107#comment-17412107 ] Xing Lin commented on HDFS-16191: - Hi [~prasad-acit], Thanks for working on this! As you asked in the github pull request, we don't support partitions larger than NUM_RANGES_STATIC right now. The key of an inode is calculated and then taken modulo NUM_RANGES_STATIC in indexOf(). As a result, any partition that has an id larger than NUM_RANGES_STATIC will receive no insertion. If we want to support dynamic partition numbers, we need to modify the indexOf() implementation as well. We need to replace `& (INodeMap.NUM_RANGES_STATIC -1)` with something like `% partition_num`. Also note that indexOf() is a static method, which means we cannot access instance variables from it. I don't know how to handle it now. {code:java} public static long indexOf(long[] key) { if(key[key.length-1] == INodeId.ROOT_INODE_ID) { return key[0]; } long idx = LARGE_PRIME * key[0]; idx = (idx ^ (idx >> 32)) & (INodeMap.NUM_RANGES_STATIC -1); return idx; } {code} > [FGL] Fix FSImage loading issues on dynamic partitions > -- > > Key: HDFS-16191 > URL: https://issues.apache.org/jira/browse/HDFS-16191 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Renukaprasad C >Assignee: Renukaprasad C >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > When new partitions gets added into PartitionGSet, iterator do not consider > the new partitions. Which always iterate on Static Partition count. This lead > to full of warn messages as below. > 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139780 when saving the leases. > 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139781 when saving the leases. > 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139784 when saving the leases. 
> 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139785 when saving the leases. > 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139786 when saving the leases. > 2021-08-28 03:23:19,420 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139788 when saving the leases. > 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139789 when saving the leases. > 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139790 when saving the leases. > 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139791 when saving the leases. > 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139793 when saving the leases. > 2021-08-28 03:23:19,421 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139795 when saving the leases. > 2021-08-28 03:23:19,422 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139796 when saving the leases. > 2021-08-28 03:23:19,422 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139797 when saving the leases. > 2021-08-28 03:23:19,422 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139800 when saving the leases. > 2021-08-28 03:23:19,422 WARN namenode.FSImageFormatPBINode: Fail to find > inode 139801 when saving the leases. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-16128) [FGL] Add support for saving/loading an FS Image for PartitionedGSet
[ https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408949#comment-17408949 ] Xing Lin edited comment on HDFS-16128 at 9/2/21, 4:14 PM: -- Hi [~prasad-acit], The issue is given an inode as a long value, the function will first construct a INode object. But we don't know what the parent Inode is for this INode, thus we can not determine which partition to search for. That is why we fall back to iterate over all partitions to search for that inode. We construct a new INode object as following in this function before we do the search. {code:java} INode inode = new INodeDirectory(id, null, new PermissionStatus("", "", new FsPermission((short) 0)), 0);{code} You should also take a look at this function: public INode get(INode inode). Inside this function, we first check whether there are KEY_DEPTH - 1 levels of parent Inodes. If there are sufficient parent Inodes to construct the partition key, then we go directly with map.get(inode). Otherwise, we fall back to get(long inode), which basically scan all partitions and search for the inode. Hope this answers your question. was (Author: xinglin): Hi [~prasad-acit], The issue is given a inode as a long value, the function will first construct a INode object. But we don't know what the parent Inode is for this INode, thus we can not determine which partition to search for. That is why we fall back to iterate over all partitions to search for that inode. {code:java} INode inode = new INodeDirectory(id, null, new PermissionStatus("", "", new FsPermission((short) 0)), 0);{code} You should also take a look at this function: public INode get(INode inode). Inside this function, we first check whether there are KEY_DEPTH - 1 levels of parent Inodes. If there are sufficient parent Inodes, then we go directly with map.get(inode). Otherwise, we fall back to get(long inode), which basically scan all partitions and search for the inode. 
Hope this answers your question. > [FGL] Add support for saving/loading an FS Image for PartitionedGSet > > > Key: HDFS-16128 > URL: https://issues.apache.org/jira/browse/HDFS-16128 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs, namenode >Reporter: Xing Lin >Assignee: Xing Lin >Priority: Major > Labels: pull-request-available > Fix For: Fine-Grained Locking > > Time Spent: 50m > Remaining Estimate: 0h > > Add support to save Inodes stored in PartitionedGSet when saving an FS image > and load Inodes into PartitionedGSet from a saved FS image. > h1. Saving FSImage > *Original HDFS design*: iterate every inode in inodeMap and save them into > the FSImage file. > *FGL*: no change is needed here, since PartitionedGSet also provides an > iterator interface, to iterate over inodes stored in partitions. > h1. Loading an HDFS > *Original HDFS design*: it first loads the FSImage files and then loads edit > logs for recent changes. FSImage files contain different sections, including > INodeSections and INodeDirectorySections. An InodeSection contains serialized > Inodes objects and the INodeDirectorySection contains the parent inode for an > Inode. When loading an FSImage, the system first loads INodeSections and then > load the INodeDirectorySections, to set the parent inode for each inode. > After FSImage files are loaded, edit logs are then loaded. Edit log contains > recent changes to the filesystem, including Inodes creation/deletion. For a > newly created INode, the parent inode is set before it is added to the > inodeMap. > *FGL*: when adding an Inode into the partitionedGSet, we need the parent > inode of an inode, in order to determine which partition to store that inode, > when NAMESPACE_KEY_DEPTH = 2. Thus, in FGL, when loading FSImage files, we > used a temporary LightweightGSet (inodeMapTemp), to store inodes. When > LoadFSImage is done, the parent inode for all existing inodes in FSImage > files is set. 
We can now move the inodes into a partitionedGSet. Load edit > logs can work as usual, as the parent inode for an inode is set before it is > added to the inodeMap. > In theory, PartitionedGSet can support to store inodes without setting its > parent inodes. All these inodes will be stored in the 0th partition. However, > we decide to use a temporary LightweightGSet (inodeMapTemp) to store these > inodes, to make this case more transparent. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
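The two-phase load described above can be sketched as follows (hypothetical, heavily simplified types: a plain HashMap stands in for the LightweightGSet, lists stand in for partitions, and FSImage sections and edit logs are elided):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoPhaseLoad {
    static class Inode {
        final long id;
        long parentId = -1; // unknown until the directory section is loaded
        Inode(long id) { this.id = id; }
    }

    // Partition choice depends on the parent id, which is exactly why the
    // flat temporary map is needed during phase 1.
    static int partitionOf(long parentId, int numPartitions) {
        return (int) (parentId % numPartitions);
    }

    // Phase 2: move every inode from the temporary map into its partition.
    static Map<Integer, List<Inode>> moveToPartitions(Map<Long, Inode> inodeMapTemp,
                                                      int numPartitions) {
        Map<Integer, List<Inode>> partitions = new HashMap<>();
        for (Inode inode : inodeMapTemp.values()) {
            int p = partitionOf(inode.parentId, numPartitions);
            partitions.computeIfAbsent(p, k -> new ArrayList<>()).add(inode);
        }
        return partitions;
    }

    public static void main(String[] args) {
        Map<Long, Inode> inodeMapTemp = new HashMap<>(); // phase 1: flat map
        for (long id = 2; id <= 5; id++) {
            inodeMapTemp.put(id, new Inode(id));
        }
        inodeMapTemp.get(2L).parentId = 1L; // set by the directory section
        inodeMapTemp.get(3L).parentId = 1L;
        inodeMapTemp.get(4L).parentId = 2L;
        inodeMapTemp.get(5L).parentId = 3L;
        // Phase 2: only now is it safe to pick partitions.
        System.out.println(moveToPartitions(inodeMapTemp, 4).size()); // prints 3
    }
}
```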
[jira] [Commented] (HDFS-16128) [FGL] Add support for saving/loading an FS Image for PartitionedGSet
[ https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408949#comment-17408949 ] Xing Lin commented on HDFS-16128: - Hi [~prasad-acit], The issue is given a inode as a long value, the function will first construct a INode object. But we don't know what the parent Inode is for this INode, thus we can not determine which partition to search for. That is why we fall back to iterate over all partitions to search for that inode. {code:java} INode inode = new INodeDirectory(id, null, new PermissionStatus("", "", new FsPermission((short) 0)), 0);{code} You should also take a look at this function: public INode get(INode inode). Inside this function, we first check whether there are KEY_DEPTH - 1 levels of parent Inodes. If there are sufficient parent Inodes, then we go directly with map.get(inode). Otherwise, we fall back to get(long inode), which basically scan all partitions and search for the inode. Hope this answers your question. > [FGL] Add support for saving/loading an FS Image for PartitionedGSet > > > Key: HDFS-16128 > URL: https://issues.apache.org/jira/browse/HDFS-16128 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs, namenode >Reporter: Xing Lin >Assignee: Xing Lin >Priority: Major > Labels: pull-request-available > Fix For: Fine-Grained Locking > > Time Spent: 50m > Remaining Estimate: 0h > > Add support to save Inodes stored in PartitionedGSet when saving an FS image > and load Inodes into PartitionedGSet from a saved FS image. > h1. Saving FSImage > *Original HDFS design*: iterate every inode in inodeMap and save them into > the FSImage file. > *FGL*: no change is needed here, since PartitionedGSet also provides an > iterator interface, to iterate over inodes stored in partitions. > h1. Loading an HDFS > *Original HDFS design*: it first loads the FSImage files and then loads edit > logs for recent changes. 
FSImage files contain different sections, including > INodeSections and INodeDirectorySections. An InodeSection contains serialized > Inodes objects and the INodeDirectorySection contains the parent inode for an > Inode. When loading an FSImage, the system first loads INodeSections and then > load the INodeDirectorySections, to set the parent inode for each inode. > After FSImage files are loaded, edit logs are then loaded. Edit log contains > recent changes to the filesystem, including Inodes creation/deletion. For a > newly created INode, the parent inode is set before it is added to the > inodeMap. > *FGL*: when adding an Inode into the partitionedGSet, we need the parent > inode of an inode, in order to determine which partition to store that inode, > when NAMESPACE_KEY_DEPTH = 2. Thus, in FGL, when loading FSImage files, we > used a temporary LightweightGSet (inodeMapTemp), to store inodes. When > LoadFSImage is done, the parent inode for all existing inodes in FSImage > files is set. We can now move the inodes into a partitionedGSet. Load edit > logs can work as usual, as the parent inode for an inode is set before it is > added to the inodeMap. > In theory, PartitionedGSet can support to store inodes without setting its > parent inodes. All these inodes will be stored in the 0th partition. However, > we decide to use a temporary LightweightGSet (inodeMapTemp) to store these > inodes, to make this case more transparent. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
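The fast-path/slow-path lookup described in this thread can be sketched as follows (hypothetical names; plain HashMaps stand in for partitions, and a String stands in for the inode payload):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionedLookup {
    final List<Map<Long, String>> partitions = new ArrayList<>();

    PartitionedLookup(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new HashMap<>());
        }
    }

    void put(long parentId, long id, String inode) {
        partitions.get((int) (parentId % partitions.size())).put(id, inode);
    }

    // Fast path: the parent is known, so only the owning partition is probed
    // (mirrors get(INode) when enough parent levels exist).
    String get(long parentId, long id) {
        return partitions.get((int) (parentId % partitions.size())).get(id);
    }

    // Slow path: only the inode id is known, so the partition cannot be
    // computed and every partition must be scanned (mirrors get(long inode)).
    String get(long id) {
        for (Map<Long, String> p : partitions) {
            String found = p.get(id);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Self-check helper: both paths must find the same entry.
    static String demo(boolean withParent) {
        PartitionedLookup map = new PartitionedLookup(4);
        map.put(7L, 42L, "file.txt");
        return withParent ? map.get(7L, 42L) : map.get(42L);
    }

    public static void main(String[] args) {
        System.out.println(demo(true));  // prints file.txt
        System.out.println(demo(false)); // prints file.txt (via full scan)
    }
}
```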
[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387797#comment-17387797 ] Xing Lin edited comment on HDFS-14703 at 7/27/21, 6:07 AM: --- [~daryn] Thanks for your comments. I will address your last question and leave other questions to [~shv]. :) Regarding the results, we used the standard NNThroughputBenchmark, with commands like the following. {code:java} ./bin/hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs file:/// -op mkdirs -threads 200 -dirs 1000 -dirsPerDir 512{code} Here are a result from [~prasad-acit], since his QPS numbers are higher than what I got. {code:java} BASE: common/hadoop-hdfs-32021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: — mkdirs inputs — 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirs = 100 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrThreads = 200 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: — mkdirs stats — 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: # operations: 100 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Elapsed Time: 17718 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Ops per sec: 56439.77875606727 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Average Time: 3 2021-05-17 11:17:36,973 INFO namenode.FSEditLog: Ending log segment 1, 1031254 PATCH: 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: — mkdirs inputs — 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirs = 100 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrThreads = 200 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: — mkdirs stats — 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: # operations: 100 2021-05-17 11:11:09,321 
INFO namenode.NNThroughputBenchmark: Elapsed Time: 15010 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Ops per sec: 66622.25183211193 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Average Time: 2 2021-05-17 11:11:09,331 INFO namenode.FSEditLog: Ending log segment 1, 1031254 {code} was (Author: xinglin): [~daryn] Thanks for your comments. I will address your last question and leave other questions to [~shv]. :) Regarding the results, we used the standard NNThroughputBenchmark, with commands like the following. ./bin/hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark *-fs* [*file:///*|file:///*] -op mkdirs -threads 200 -dirs 1000 -dirsPerDir 512 Here are a result from [~prasad-acit], since his QPS numbers are higher than what I got. BASE: common/hadoop-hdfs-32021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs --- 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirs = 100 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrThreads = 200 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs stats --- 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: # operations: 100 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Elapsed Time: 17718 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Ops per sec: 56439.77875606727 2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Average Time: 3 2021-05-17 11:17:36,973 INFO namenode.FSEditLog: Ending log segment 1, 1031254 PATCH: 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs --- 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirs = 100 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrThreads = 200 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: 
--- mkdirs stats --- 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: # operations: 100 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Elapsed Time: 15010 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Ops per sec: 66622.25183211193 2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Average Time: 2 2021-05-17 11:11:09,331 INFO namenode.FSEditLog: Ending log segment 1, 1031254 > NameNode Fine-Grained Locking via Metadata Partitioning > --- > > Key: HDFS-14703 > URL: https://issues.apache.org/jira/browse/HDFS-14703 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, namenode >Reporter: Konstantin Shvachko >
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387797#comment-17387797 ] Xing Lin commented on HDFS-14703: - [~daryn] Thanks for your comments. I will address your last question and leave the other questions to [~shv]. :) Regarding the results, we used the standard NNThroughputBenchmark, with commands like the following:

./bin/hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs file:/// -op mkdirs -threads 200 -dirs 1000 -dirsPerDir 512

Here is a result from [~prasad-acit], since his QPS numbers are higher than what I got.

BASE: common/hadoop-hdfs-3
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirs = 100
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: # operations: 100
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Elapsed Time: 17718
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Ops per sec: 56439.77875606727
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Average Time: 3
2021-05-17 11:17:36,973 INFO namenode.FSEditLog: Ending log segment 1, 1031254

PATCH:
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirs = 100
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: # operations: 100
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Elapsed Time: 15010
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Ops per sec: 66622.25183211193
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Average Time: 2
2021-05-17 11:11:09,331 INFO namenode.FSEditLog: Ending log segment 1, 1031254

> NameNode Fine-Grained Locking via Metadata Partitioning > --- > > Key: HDFS-14703 > URL: https://issues.apache.org/jira/browse/HDFS-14703 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, namenode >Reporter: Konstantin Shvachko >Priority: Major > Attachments: 001-partitioned-inodeMap-POC.tar.gz, > 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz, > NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf > > > We target to enable fine-grained locking by splitting the in-memory namespace > into multiple partitions each having a separate lock. Intended to improve > performance of NameNode write operations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
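The "Ops per sec" figures above can be sanity-checked. Assuming the benchmark reports throughput as operations × 1000 / elapsed-milliseconds (the helper below is an illustrative sketch, not NNThroughputBenchmark's actual code), both logged values correspond to a run of about 1,000,000 mkdirs operations:

```java
// Sanity check of the logged throughput, assuming
// opsPerSec = operations * 1000 / elapsedMillis.
public class ThroughputCheck {
    static double opsPerSec(long operations, long elapsedMillis) {
        return operations * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        // 1,000,000 operations over the reported elapsed times reproduces
        // the logged "Ops per sec" values almost exactly.
        System.out.println(opsPerSec(1_000_000, 17718)); // BASE:  ~56439.78
        System.out.println(opsPerSec(1_000_000, 15010)); // PATCH: ~66622.25
        // Relative improvement of the patched run over the base run.
        double gain = opsPerSec(1_000_000, 15010) / opsPerSec(1_000_000, 17718) - 1.0;
        System.out.println("improvement: " + Math.round(gain * 100) + "%"); // ~18%
    }
}
```

This also suggests the "# operations: 100" entries in the pasted log lost digits in transit, since 100 operations over 17718 ms would be well under one operation per second.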
[jira] [Commented] (HDFS-16128) [FGL] Add support for saving/loading an FS Image for PartitionedGSet
[ https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384675#comment-17384675 ] Xing Lin commented on HDFS-16128: - Instead of loading inodes directly into the final inodeMap, we now split the load into two steps: first load them into a LightweightGSet, then move them into the PartitionedGSet. This is all done in memory, so hopefully it won't bring much performance degradation. Thanks for the +1! > [FGL] Add support for saving/loading an FS Image for PartitionedGSet > > > Key: HDFS-16128 > URL: https://issues.apache.org/jira/browse/HDFS-16128 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs, namenode >Reporter: Xing Lin >Assignee: Xing Lin >Priority: Major > Labels: pull-request-available > > Add support to save Inodes stored in PartitionedGSet when saving an FS image > and load Inodes into PartitionedGSet from a saved FS image. > h1. Saving FSImage > *Original HDFS design*: iterate over every inode in inodeMap and save them into > the FSImage file. > *FGL*: no change is needed here, since PartitionedGSet also provides an > iterator interface to iterate over the inodes stored in partitions. > h1. Loading an FS Image > *Original HDFS design*: it first loads the FSImage files and then loads edit > logs for recent changes. FSImage files contain different sections, including > INodeSections and INodeDirectorySections. An INodeSection contains serialized > inode objects and the INodeDirectorySection contains the parent inode for each > inode. When loading an FSImage, the system first loads the INodeSections and > then loads the INodeDirectorySections, to set the parent inode for each inode. > After the FSImage files are loaded, the edit logs are loaded. Edit logs contain > recent changes to the filesystem, including inode creation/deletion. For a > newly created inode, the parent inode is set before it is added to the > inodeMap. > *FGL*: when adding an inode into the PartitionedGSet with NAMESPACE_KEY_DEPTH = 2, > we need its parent inode in order to determine which partition should store it. > Thus, in FGL, when loading FSImage files, we use a temporary LightweightGSet > (inodeMapTemp) to store inodes. When loadFSImage is done, the parent inode for > every inode in the FSImage files has been set, and we can then move the inodes > into the PartitionedGSet. Loading edit logs works as usual, since the parent > inode of an inode is set before it is added to the inodeMap. > In theory, PartitionedGSet could store inodes whose parent inodes are not yet > set; all such inodes would go into the 0th partition. However, we decided to > use a temporary LightweightGSet (inodeMapTemp) instead, to keep this case more > transparent. >
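The two-phase load described above can be sketched as follows. This is an illustrative toy, not the Hadoop implementation; the names Inode, partitionOf, and NUM_PARTITIONS are made up here, standing in for the real LightweightGSet/PartitionedGSet types:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the two-phase image load: inodes are first collected in a
// flat temporary map (standing in for inodeMapTemp), parents are wired
// up while the directory section is read, and only then can each inode
// be placed in a partition, because the partition key depends on the
// parent (NAMESPACE_KEY_DEPTH = 2).
public class TwoPhaseLoad {
    static final int NUM_PARTITIONS = 4;

    static class Inode {
        final long id;
        Long parentId; // unknown until the directory section is loaded
        Inode(long id) { this.id = id; }
    }

    // Partition chosen from the parent inode; an inode cannot be
    // placed until its parent is known.
    static int partitionOf(Inode inode) {
        return (int) Math.floorMod(inode.parentId, (long) NUM_PARTITIONS);
    }

    public static void main(String[] args) {
        // Phase 1: load the INodeSection into a temporary flat map.
        Map<Long, Inode> inodeMapTemp = new HashMap<>();
        for (long id = 1; id <= 8; id++) {
            inodeMapTemp.put(id, new Inode(id));
        }

        // Phase 2: the INodeDirectorySection sets each inode's parent
        // (a toy tree here: inode i hangs under inode i / 2).
        for (Inode inode : inodeMapTemp.values()) {
            inode.parentId = Math.max(1L, inode.id / 2);
        }

        // Phase 3: parents are known, so move everything into the
        // partitioned structure (standing in for PartitionedGSet).
        List<List<Inode>> partitions = new ArrayList<>();
        for (int i = 0; i < NUM_PARTITIONS; i++) partitions.add(new ArrayList<>());
        for (Inode inode : inodeMapTemp.values()) {
            partitions.get(partitionOf(inode)).add(inode);
        }

        int total = 0;
        for (List<Inode> p : partitions) total += p.size();
        System.out.println("moved " + total + " of " + inodeMapTemp.size() + " inodes");
    }
}
```

Replaying edit logs after this point needs no special handling, since an inode's parent is always set before the inode is inserted.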
[jira] [Commented] (HDFS-16128) [FGL] Add support for saving/loading an FS Image for PartitionedGSet
[ https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382941#comment-17382941 ] Xing Lin commented on HDFS-16128: - [~prasad-acit] Thanks for your comments. One of them is a bug in my code. See my comments in the pull request. Updated my pull request. > [FGL] Add support for saving/loading an FS Image for PartitionedGSet > > > Key: HDFS-16128 > URL: https://issues.apache.org/jira/browse/HDFS-16128 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs, namenode >Reporter: Xing Lin >Assignee: Xing Lin >Priority: Major > Labels: pull-request-available > > Add support to save Inodes stored in PartitionedGSet when saving an FS image > and load Inodes into PartitionedGSet from a saved FS image.
[jira] [Work started] (HDFS-16128) [FGL] Add support for saving/loading an FS Image for PartitionedGSet
[ https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16128 started by Xing Lin. --- > [FGL] Add support for saving/loading an FS Image for PartitionedGSet > > > Key: HDFS-16128 > URL: https://issues.apache.org/jira/browse/HDFS-16128 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs, namenode >Reporter: Xing Lin >Assignee: Xing Lin >Priority: Major > Labels: pull-request-available > > Add support to save Inodes stored in PartitionedGSet when saving an FS image > and load Inodes into PartitionedGSet from a saved FS image.
[jira] [Updated] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-14703: Fix Version/s: (was: Fine-Grained Locking) > NameNode Fine-Grained Locking via Metadata Partitioning > --- > > Key: HDFS-14703 > URL: https://issues.apache.org/jira/browse/HDFS-14703 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, namenode >Reporter: Konstantin Shvachko >Priority: Major > Attachments: 001-partitioned-inodeMap-POC.tar.gz, > 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz, > NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf > > > We target to enable fine-grained locking by splitting the in-memory namespace > into multiple partitions each having a separate lock. Intended to improve > performance of NameNode write operations.
[jira] [Updated] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-14703: Fix Version/s: Fine-Grained Locking > NameNode Fine-Grained Locking via Metadata Partitioning > --- > > Key: HDFS-14703 > URL: https://issues.apache.org/jira/browse/HDFS-14703 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, namenode >Reporter: Konstantin Shvachko >Priority: Major > Fix For: Fine-Grained Locking > > Attachments: 001-partitioned-inodeMap-POC.tar.gz, > 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz, > NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf > > > We target to enable fine-grained locking by splitting the in-memory namespace > into multiple partitions each having a separate lock. Intended to improve > performance of NameNode write operations.
[jira] [Updated] (HDFS-16128) [FGL] Add support for saving/loading an FS Image for PartitionedGSet
[ https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-16128: Description: Add support to save Inodes stored in PartitionedGSet when saving an FS image and load Inodes into PartitionedGSet from a saved FS image. h1. Saving FSImage *Original HDFS design*: iterate over every inode in inodeMap and save them into the FSImage file. *FGL*: no change is needed here, since PartitionedGSet also provides an iterator interface to iterate over the inodes stored in partitions. h1. Loading an FS Image *Original HDFS design*: it first loads the FSImage files and then loads edit logs for recent changes. FSImage files contain different sections, including INodeSections and INodeDirectorySections. An INodeSection contains serialized inode objects and the INodeDirectorySection contains the parent inode for each inode. When loading an FSImage, the system first loads the INodeSections and then loads the INodeDirectorySections, to set the parent inode for each inode. After the FSImage files are loaded, the edit logs are loaded. Edit logs contain recent changes to the filesystem, including inode creation/deletion. For a newly created inode, the parent inode is set before it is added to the inodeMap. *FGL*: when adding an inode into the PartitionedGSet with NAMESPACE_KEY_DEPTH = 2, we need its parent inode in order to determine which partition should store it. Thus, in FGL, when loading FSImage files, we use a temporary LightweightGSet (inodeMapTemp) to store inodes. When loadFSImage is done, the parent inode for every inode in the FSImage files has been set, and we can then move the inodes into the PartitionedGSet. Loading edit logs works as usual, since the parent inode of an inode is set before it is added to the inodeMap. In theory, PartitionedGSet could store inodes whose parent inodes are not yet set; all such inodes would go into the 0th partition. However, we decided to use a temporary LightweightGSet (inodeMapTemp) instead, to keep this case more transparent. was:Add support to save Inodes stored in PartitionedGSet when saving an FS image and load Inodes into PartitionedGSet from a saved FS image. > [FGL] Add support for saving/loading an FS Image for PartitionedGSet > > > Key: HDFS-16128 > URL: https://issues.apache.org/jira/browse/HDFS-16128 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs, namenode >Reporter: Xing Lin >Assignee: Xing Lin >Priority: Major > Labels: pull-request-available > > Add support to save Inodes stored in PartitionedGSet when saving an FS image > and load Inodes into PartitionedGSet from a saved FS image.
[jira] [Updated] (HDFS-16128) [FGL] Add support for saving/loading an FS Image for PartitionedGSet
[ https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-16128: Labels: pull-request-available (was: ) > [FGL] Add support for saving/loading an FS Image for PartitionedGSet > > > Key: HDFS-16128 > URL: https://issues.apache.org/jira/browse/HDFS-16128 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs, namenode >Reporter: Xing Lin >Assignee: Xing Lin >Priority: Major > Labels: pull-request-available > > Add support to save Inodes stored in PartitionedGSet when saving an FS image > and load Inodes into PartitionedGSet from a saved FS image.
[jira] [Updated] (HDFS-16128) [FGL] Add support for saving/loading an FS Image for PartitionedGSet
[ https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-16128: Summary: [FGL] Add support for saving/loading an FS Image for PartitionedGSet (was: Add support for saving/loading an FS Image for PartitionedGSet) > [FGL] Add support for saving/loading an FS Image for PartitionedGSet > > > Key: HDFS-16128 > URL: https://issues.apache.org/jira/browse/HDFS-16128 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs, namenode >Reporter: Xing Lin >Assignee: Xing Lin >Priority: Major > > Add support to save Inodes stored in PartitionedGSet when saving an FS image > and load Inodes into PartitionedGSet from a saved FS image.
[jira] [Updated] (HDFS-16128) Add support for saving/loading an FS Image for PartitionedGSet
[ https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-16128: Parent: HDFS-14703 Issue Type: Sub-task (was: Improvement) > Add support for saving/loading an FS Image for PartitionedGSet > -- > > Key: HDFS-16128 > URL: https://issues.apache.org/jira/browse/HDFS-16128 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs, namenode >Reporter: Xing Lin >Assignee: Xing Lin >Priority: Major > > Add support to save Inodes stored in PartitionedGSet when saving an FS image > and load Inodes into PartitionedGSet from a saved FS image.
[jira] [Updated] (HDFS-16128) Add support for saving/loading an FS Image
[ https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-16128: Description: Add support to save Inodes stored in PartitionedGSet when saving an FS image and load Inodes into PartitionedGSet from a saved FS image. (was: We target to enable fine-grained locking by splitting the in-memory namespace into multiple partitions each having a separate lock. Intended to improve performance of NameNode write operations.) > Add support for saving/loading an FS Image > -- > > Key: HDFS-16128 > URL: https://issues.apache.org/jira/browse/HDFS-16128 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, namenode >Reporter: Xing Lin >Priority: Major > > Add support to save Inodes stored in PartitionedGSet when saving an FS image > and load Inodes into PartitionedGSet from a saved FS image.
[jira] [Updated] (HDFS-16128) Add support for saving/loading an FS Image for PartitionedGSet
[ https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-16128: Summary: Add support for saving/loading an FS Image for PartitionedGSet (was: Add support for saving/loading an FS Image) > Add support for saving/loading an FS Image for PartitionedGSet > -- > > Key: HDFS-16128 > URL: https://issues.apache.org/jira/browse/HDFS-16128 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, namenode >Reporter: Xing Lin >Priority: Major > > Add support to save Inodes stored in PartitionedGSet when saving an FS image > and load Inodes into PartitionedGSet from a saved FS image.
[jira] [Assigned] (HDFS-16128) Add support for saving/loading an FS Image for PartitionedGSet
[ https://issues.apache.org/jira/browse/HDFS-16128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin reassigned HDFS-16128: --- Assignee: Xing Lin > Add support for saving/loading an FS Image for PartitionedGSet > -- > > Key: HDFS-16128 > URL: https://issues.apache.org/jira/browse/HDFS-16128 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, namenode >Reporter: Xing Lin >Assignee: Xing Lin >Priority: Major > > Add support to save Inodes stored in PartitionedGSet when saving an FS image > and load Inodes into PartitionedGSet from a saved FS image.
[jira] [Created] (HDFS-16128) Add support for saving/loading an FS Image
Xing Lin created HDFS-16128: --- Summary: Add support for saving/loading an FS Image Key: HDFS-16128 URL: https://issues.apache.org/jira/browse/HDFS-16128 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs, namenode Reporter: Xing Lin We target to enable fine-grained locking by splitting the in-memory namespace into multiple partitions each having a separate lock. Intended to improve performance of NameNode write operations.
[jira] [Updated] (HDFS-16125) fix the iterator for PartitionedGSet
[ https://issues.apache.org/jira/browse/HDFS-16125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-16125: Summary: fix the iterator for PartitionedGSet (was: iterator for PartitionedGSet would visit the first partition twice) > fix the iterator for PartitionedGSet > - > > Key: HDFS-16125 > URL: https://issues.apache.org/jira/browse/HDFS-16125 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs, namenode >Reporter: Xing Lin >Priority: Minor > > Iterator in PartitionedGSet would visit the first partition twice, since we > did not set the keyIterator to move to the first key during initialization. > > This is related to fgl: https://issues.apache.org/jira/browse/HDFS-14703
[jira] [Updated] (HDFS-16125) iterator for PartitionedGSet would visit the first partition twice
[ https://issues.apache.org/jira/browse/HDFS-16125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Lin updated HDFS-16125: Description: Iterator in PartitionedGSet would visit the first partition twice, since we did not set the keyIterator to move to the first key during initialization. This is related to fgl: https://issues.apache.org/jira/browse/HDFS-14703 was:Iterator in PartitionedGSet would visit the first partition twice, since we did not set the keyIterator to move to the first key during initialization. > iterator for PartitionedGSet would visit the first partition twice > -- > > Key: HDFS-16125 > URL: https://issues.apache.org/jira/browse/HDFS-16125 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs, namenode >Reporter: Xing Lin >Priority: Minor > > Iterator in PartitionedGSet would visit the first partition twice, since we > did not set the keyIterator to move to the first key during initialization. > > This is related to fgl: https://issues.apache.org/jira/browse/HDFS-14703
[jira] [Created] (HDFS-16125) iterator for PartitionedGSet would visit the first partition twice
Xing Lin created HDFS-16125: --- Summary: iterator for PartitionedGSet would visit the first partition twice Key: HDFS-16125 URL: https://issues.apache.org/jira/browse/HDFS-16125 Project: Hadoop HDFS Issue Type: Bug Components: hdfs, namenode Reporter: Xing Lin Iterator in PartitionedGSet would visit the first partition twice, since we did not set the keyIterator to move to the first key during initialization.
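The bug class described in HDFS-16125 is easy to reproduce in a toy partitioned set. The sketch below is illustrative and not PartitionedGSet itself; the key point is in the iterator's initialization: the partition-key iterator must be advanced past the first partition when the first entry iterator is created, otherwise hasNext() re-fetches the first partition and its entries are visited twice.

```java
import java.util.*;

// Toy partitioned set: entries live in per-partition lists keyed by a
// sorted partition key, and iteration walks partitions in key order.
class PartitionedSet {
    private final TreeMap<Integer, List<String>> partitions = new TreeMap<>();

    void put(int partitionKey, String value) {
        partitions.computeIfAbsent(partitionKey, k -> new ArrayList<>()).add(value);
    }

    Iterator<String> iterator() {
        return new Iterator<String>() {
            private final Iterator<Integer> keyIterator = partitions.keySet().iterator();
            // Correct initialization: consume the first key here, so the
            // key iterator already points past the first partition. The
            // buggy version left keyIterator at the beginning and later
            // re-fetched partition 0, returning its entries twice.
            private Iterator<String> entryIterator =
                keyIterator.hasNext()
                    ? partitions.get(keyIterator.next()).iterator()
                    : Collections.<String>emptyIterator();

            @Override
            public boolean hasNext() {
                while (!entryIterator.hasNext() && keyIterator.hasNext()) {
                    entryIterator = partitions.get(keyIterator.next()).iterator();
                }
                return entryIterator.hasNext();
            }

            @Override
            public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                return entryIterator.next();
            }
        };
    }
}

public class IteratorSketch {
    public static void main(String[] args) {
        PartitionedSet set = new PartitionedSet();
        set.put(0, "a");
        set.put(0, "b");
        set.put(1, "c");
        List<String> seen = new ArrayList<>();
        for (Iterator<String> it = set.iterator(); it.hasNext(); ) {
            seen.add(it.next());
        }
        System.out.println(seen); // [a, b, c] -- each entry exactly once
    }
}
```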
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359651#comment-17359651 ] Xing Lin commented on HDFS-14703: - Hi [~prasad-acit], that is awesome! Konstantin is on vacation this week and next week. I am sure he will be very happy to review your pull request for the Create API. > NameNode Fine-Grained Locking via Metadata Partitioning > --- > > Key: HDFS-14703 > URL: https://issues.apache.org/jira/browse/HDFS-14703 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, namenode >Reporter: Konstantin Shvachko >Priority: Major > Attachments: 001-partitioned-inodeMap-POC.tar.gz, > 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz, > NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf > > > We target to enable fine-grained locking by splitting the in-memory namespace > into multiple partitions each having a separate lock. Intended to improve > performance of NameNode write operations.
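The quoted description, splitting the in-memory namespace into multiple partitions each with its own lock, can be sketched with a minimal striped map. This is a toy illustration rather than the POC code; StripedMap and partitionOf are invented names:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Each partition is guarded by its own read-write lock, so writes to
// different partitions do not serialize on one global lock.
public class StripedMap<K, V> {
    private final int numPartitions;
    private final List<HashMap<K, V>> partitions = new ArrayList<>();
    private final List<ReentrantReadWriteLock> locks = new ArrayList<>();

    public StripedMap(int numPartitions) {
        this.numPartitions = numPartitions;
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new HashMap<>());
            locks.add(new ReentrantReadWriteLock());
        }
    }

    private int partitionOf(K key) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public void put(K key, V value) {
        int p = partitionOf(key);
        locks.get(p).writeLock().lock(); // only this partition is blocked
        try {
            partitions.get(p).put(key, value);
        } finally {
            locks.get(p).writeLock().unlock();
        }
    }

    public V get(K key) {
        int p = partitionOf(key);
        locks.get(p).readLock().lock();
        try {
            return partitions.get(p).get(key);
        } finally {
            locks.get(p).readLock().unlock();
        }
    }
}
```

The hard part in the NameNode, which this sketch deliberately ignores, is that namespace operations such as rename touch multiple inodes, so choosing the partition key and ordering lock acquisition is where the real design work in this JIRA lies.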
[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345870#comment-17345870 ] Xing Lin edited comment on HDFS-14703 at 5/17/21, 4:42 AM: --- I did some performance benchmarks using a physical server (a d430 server in the [Utah Emulab testbed|http://www.emulab.net]). I used either a RAMDISK or an SSD as the storage for HDFS. Using a RAMDISK removes the time the SSD takes to make each write persistent. For the RAMDISK case, we observed an improvement of 45% from fine-grained locking. For the SSD case, fine-grained locking gives us a 23% improvement. We used an Intel SSD (model: SSDSC2BX200G4R). We noticed that for trunk, mkdirs OPS is lower on the RAMDISK than on the SSD. We don't know the reason for this yet. We repeated the RAMDISK experiment for trunk twice to confirm the performance number.

h1. tmpfs, hadoop-tmp-dir = /run/hadoop-utos
h1. 45% improvement, fgl vs. trunk
h2. trunk

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Elapsed Time: 663510
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Ops per sec: 15071.362
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Average Time: 13
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Elapsed Time: 710248
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Ops per sec: 14079.5
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Average Time: 14
2021-05-16 22:15:13,515 INFO namenode.FSEditLog: Ending log segment 8345565, 10019540

fgl

2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Elapsed Time: 445980
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Ops per sec: 22422.530
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Average Time: 8

h1. SSD, hadoop.tmp.dir=/dev/sda4
h1. 23% improvement, fgl vs. trunk

trunk:

2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Elapsed Time: 593839
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Ops per sec: 16839.581
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Average Time: 11

fgl

2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Elapsed Time: 481269
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Ops per sec: 20778.400
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Average Time: 9

/dev/sda: ATA device, with non-removable media
Model Number: INTEL SSDSC2BX200G4R
Serial Number: BTHC523202RD200TGN
Firmware Revision: G201DL2D
[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345870#comment-17345870 ]

Xing Lin edited comment on HDFS-14703 at 5/17/21, 4:41 AM:
-----------------------------------------------------------

I did some performance benchmarks using a physical server (a d430 server in the [Utah Emulab testbed|www.emulab.net]), using either RAMDISK or SSD as the storage for HDFS. RAMDISK removes the time the SSD spends making each write persistent. For the RAMDISK case, we observed a 45% improvement from fine-grained locking; for the SSD case, fine-grained locking gives a 23% improvement. We used an Intel SSD (model: SSDSC2BX200G4R). We noticed that for trunk, mkdirs ops/sec is lower on RAMDISK than on SSD. We don't know the reason for this yet; we repeated the RAMDISK experiment for trunk twice to confirm the performance number.

h1. tmpfs, hadoop.tmp.dir=/run/hadoop-utos
h1. 45% improvement, fgl vs. trunk
h2. trunk
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Elapsed Time: 663510
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Ops per sec: 15071.362
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Average Time: 13
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Elapsed Time: 710248
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Ops per sec: 14079.5
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Average Time: 14
2021-05-16 22:15:13,515 INFO namenode.FSEditLog: Ending log segment 8345565, 10019540
h2. fgl
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Elapsed Time: 445980
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Ops per sec: 22422.530
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Average Time: 8

h1. SSD, hadoop.tmp.dir=/dev/sda4
h1. 23% improvement, fgl vs. trunk
h2. trunk
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Elapsed Time: 593839
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Ops per sec: 16839.581
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Average Time: 11
h2. fgl
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Elapsed Time: 481269
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Ops per sec: 20778.400
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Average Time: 9

/dev/sda: ATA device, with non-removable media
Model Number: INTEL SSDSC2BX200G4R
Serial Number: BTHC523202RD200TGN
Firmware Revision: G201DL2D
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345870#comment-17345870 ]

Xing Lin commented on HDFS-14703:
---------------------------------

I did some performance benchmarks using a physical server (a d430 server in the [Utah Emulab testbed|www.emulab.net]), using either RAMDISK or SSD as the storage for HDFS. RAMDISK removes the time the SSD spends making each write persistent. For the RAMDISK case, we observed a 45% improvement from fine-grained locking; for the SSD case, fine-grained locking gives a 23% improvement. We used an Intel SSD (model: SSDSC2BX200G4R). We noticed that for trunk, mkdirs ops/sec is lower on RAMDISK than on SSD. We don't know the reason for this yet; we repeated the RAMDISK experiment for trunk twice to confirm the performance number.

h1. tmpfs, hadoop.tmp.dir=/run/hadoop-utos
h1. 45% improvement, fgl vs. trunk
h2. trunk
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Elapsed Time: 663510
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Ops per sec: 15071.362
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Average Time: 13
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Elapsed Time: 710248
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Ops per sec: 14079.5
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Average Time: 14
2021-05-16 22:15:13,515 INFO namenode.FSEditLog: Ending log segment 8345565, 10019540
h2. fgl
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Elapsed Time: 445980
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Ops per sec: 22422.530
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Average Time: 8

h1. SSD, hadoop.tmp.dir=/dev/sda4
h1. 23% improvement, fgl vs. trunk
h2. trunk
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Elapsed Time: 593839
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Ops per sec: 16839.581
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Average Time: 11
h2. fgl
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Elapsed Time: 481269
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Ops per sec: 20778.400
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Average Time: 9

/dev/sda: ATA device, with non-removable media
Model Number: INTEL SSDSC2BX200G4R
Serial Number: BTHC523202RD200TGN
Firmware Revision: G201DL2D

> NameNode Fine-Grained Locking via Metadata Partitioning
> -------------------------------------------------------
>
>                 Key: HDFS-14703
>                 URL: https://issues.apache.org/jira/browse/HDFS-14703
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs, namenode
>            Reporter: Konstantin Shvachko
>            Priority: Major
>         Attachments: 001-partitioned-inodeMap-POC.tar.gz, 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz, NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
> We target to enable fine-grained locking by splitting the in-memory namespace into multiple partitions, each having a separate lock. Intended to improve performance of NameNode write operations.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
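As a sanity check on the figures quoted above, the relative improvement can be recomputed directly from the reported "Ops per sec" numbers. The snippet below is an illustrative calculation only (the `improvement_pct` helper is invented for this note, not part of NNThroughputBenchmark); the SSD result matches the reported 23%, while the RAMDISK result comes out somewhat higher than the quoted 45%.

```python
# Recompute fgl-vs-trunk throughput gains from the NNThroughputBenchmark
# "Ops per sec" figures reported in the comment above.

def improvement_pct(fgl_ops: float, trunk_ops: float) -> float:
    """Percent throughput gain of fine-grained locking (fgl) over trunk."""
    return (fgl_ops / trunk_ops - 1.0) * 100.0

# SSD run: trunk 16839.581 ops/s vs. fgl 20778.400 ops/s
ssd_gain = improvement_pct(20778.400, 16839.581)

# RAMDISK (tmpfs) run: trunk 15071.362 ops/s vs. fgl 22422.530 ops/s
ram_gain = improvement_pct(22422.530, 15071.362)

print(f"SSD:     {ssd_gain:.1f}%")  # ~23%, matching the reported figure
print(f"RAMDISK: {ram_gain:.1f}%")  # ~49%, near the reported 45%
```

(The second RAMDISK trunk run at 14079.5 ops/s would push the RAMDISK gain even higher, which may explain why a rounded "45%" was quoted.)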
[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345069#comment-17345069 ]

Xing Lin edited comment on HDFS-14703 at 5/15/21, 3:55 PM:
-----------------------------------------------------------

[~prasad-acit] Try this command: use -fs file:/// instead of hdfs://server:port. "-fs file:///" bypasses the RPC layer and should give you higher numbers on your VM. I use the default partition size of 256.

dir: /home/xinglin/projs/hadoop/hadoop-dist/target/hadoop-3.4.0-SNAPSHOT
$ ./bin/hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs file:/// -op mkdirs -threads 200 -dirs 1000 -dirsPerDir 512

> NameNode Fine-Grained Locking via Metadata Partitioning
> -------------------------------------------------------
>
>                 Key: HDFS-14703
>                 URL: https://issues.apache.org/jira/browse/HDFS-14703
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs, namenode
>            Reporter: Konstantin Shvachko
>            Priority: Major
>         Attachments: 001-partitioned-inodeMap-POC.tar.gz, 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz, NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
> We target to enable fine-grained locking by splitting the in-memory namespace into multiple partitions, each having a separate lock. Intended to improve performance of NameNode write operations.
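The HDFS-14703 description quoted above proposes splitting the in-memory namespace into multiple partitions, each guarded by its own lock, so that disjoint operations proceed concurrently. The sketch below is a minimal, hypothetical illustration of that striped-locking idea in Python; the class and method names are invented for this note, and the real design (see the attached POC tarballs and design PDF) partitions INodes rather than hashing full path strings.

```python
import threading

class PartitionedNamespace:
    """Toy sketch of per-partition locking: entries are hashed into a fixed
    number of partitions, each guarded by its own lock, so operations that
    land in different partitions never contend. Hypothetical illustration
    only -- not the actual HDFS-14703 implementation."""

    def __init__(self, num_partitions: int = 256):
        # 256 mirrors the "default partition size" mentioned in the thread.
        self._parts = [dict() for _ in range(num_partitions)]
        self._locks = [threading.Lock() for _ in range(num_partitions)]

    def _index(self, path: str) -> int:
        return hash(path) % len(self._parts)

    def put(self, path: str, inode: object) -> None:
        i = self._index(path)
        with self._locks[i]:          # lock only this partition, not a global lock
            self._parts[i][path] = inode

    def get(self, path: str):
        i = self._index(path)
        with self._locks[i]:
            return self._parts[i].get(path)

ns = PartitionedNamespace()
ns.put("/user/a", {"type": "dir"})
print(ns.get("/user/a"))  # {'type': 'dir'}
```

Under this scheme, two threads running mkdirs under unrelated subtrees acquire different partition locks and run fully in parallel, which is the source of the throughput gains benchmarked in this thread.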