[jira] [Created] (HDFS-14534) NetworkTopology's fair lock has poor performance
Daryn Sharp created HDFS-14534:
--

Summary: NetworkTopology's fair lock has poor performance
Key: HDFS-14534
URL: https://issues.apache.org/jira/browse/HDFS-14534
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Reporter: Daryn Sharp

The {{NetworkTopology#netlock}} is a heavily accessed lock. HADOOP-15486 made the lock fair to avoid starvation of a re-registering datanode that holds the fsn write lock while waiting for the topology write lock. If nodes start flapping and re-registering, the contention with all other handlers becomes extreme.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-14533) Datanode short circuit cache can become blocked
Daryn Sharp created HDFS-14533:
--

Summary: Datanode short circuit cache can become blocked
Key: HDFS-14533
URL: https://issues.apache.org/jira/browse/HDFS-14533
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Reporter: Daryn Sharp

Errors in the short circuit cache can leave clients indefinitely blocked in {{ShortCircuitCache#fetch}} on a waitable's condition that will never be signaled. The condition wait should be bounded with a timeout.
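The bounded wait the report asks for can be sketched as below. This is an illustrative pattern only, not the actual {{ShortCircuitCache}} code; the class, field names, and timeout constant are all hypothetical.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: bound a condition wait so a missed signal
// (e.g. after an error in the cache) cannot block a client forever.
class BoundedWait {
    static final long FETCH_TIMEOUT_MS = 5000; // illustrative value

    private final Lock lock = new ReentrantLock();
    private final Condition ready = lock.newCondition();
    private Object value;

    Object fetch() {
        lock.lock();
        try {
            long nanos = TimeUnit.MILLISECONDS.toNanos(FETCH_TIMEOUT_MS);
            while (value == null) {
                if (nanos <= 0L) {
                    // Give up instead of waiting on a condition that
                    // may never be signaled.
                    throw new IllegalStateException("timed out waiting for fetch");
                }
                try {
                    nanos = ready.awaitNanos(nanos); // remaining time, <= 0 on timeout
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted", e);
                }
            }
            return value;
        } finally {
            lock.unlock();
        }
    }

    void publish(Object v) {
        lock.lock();
        try {
            value = v;
            ready.signalAll();
        } finally {
            lock.unlock();
        }
    }
}
```

The key change versus an unbounded `await()` is that `awaitNanos` returns the remaining budget, so the loop can fail deterministically once the timeout is exhausted.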
[jira] [Created] (HDFS-14532) Datanode's BlockSender checksum buffer is too big
Daryn Sharp created HDFS-14532:
--

Summary: Datanode's BlockSender checksum buffer is too big
Key: HDFS-14532
URL: https://issues.apache.org/jira/browse/HDFS-14532
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Reporter: Daryn Sharp
Attachments: Screen Shot 2019-05-31 at 12.32.06 PM.png

The BlockSender uses an excessively large 128K buffered input stream, which accounts for ~99% of the instance's memory footprint.
[jira] [Created] (HDFS-14531) Datanode's ScanInfo requires excessive memory
Daryn Sharp created HDFS-14531:
--

Summary: Datanode's ScanInfo requires excessive memory
Key: HDFS-14531
URL: https://issues.apache.org/jira/browse/HDFS-14531
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs
Affects Versions: 2.0.0-alpha
Reporter: Daryn Sharp
Attachments: Screen Shot 2019-05-31 at 12.25.54 PM.png

The DirectoryScanner's ScanInfo map consumes ~4.5X as much memory as the replica map. For 1.1M replicas, the replica map is ~91M while the scan info map is ~405M.
[jira] [Created] (HDFS-13929) DFSClient leaks data streamer sockets
Daryn Sharp created HDFS-13929:
--

Summary: DFSClient leaks data streamer sockets
Key: HDFS-13929
URL: https://issues.apache.org/jira/browse/HDFS-13929
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 2.8.0
Reporter: Daryn Sharp

Sockets to DNs may linger in the CLOSE_WAIT state, which means the remote peer (DN) closed the socket but the local client has not. The socket does not transition to TIME_WAIT until the java process exits, which is indicative of a leaked file descriptor. Interestingly there is always 1 byte remaining to be read.

{noformat}
$ netstat -tnn|fgrep :1004|fgrep -v EST
tcp        1      0 THISHOST:57158      RANDOMHOST:1004      CLOSE_WAIT
tcp        1      0 THISHOST:40346      RANDOMHOST:1004      CLOSE_WAIT
tcp        1      0 THISHOST:45504      RANDOMHOST:1004      CLOSE_WAIT
tcp        1      0 THISHOST:58958      RANDOMHOST:1004      CLOSE_WAIT
tcp        1      0 THISHOST:45570      RANDOMHOST:1004      CLOSE_WAIT
tcp        1      0 THISHOST:46496      RANDOMHOST:1004      CLOSE_WAIT
tcp        1      0 THISHOST:58944      RANDOMHOST:1004      CLOSE_WAIT
tcp        1      0 THISHOST:55540      RANDOMHOST:1004      CLOSE_WAIT
{noformat}
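A CLOSE_WAIT socket is one the peer has shut down but the local process never closed (the Recv-Q of 1 above is the unread byte). A minimal defensive sketch, with a hypothetical helper rather than the actual DataStreamer code, is to tie the socket's lifetime to a try-with-resources block so it is closed on every path:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.Socket;

// Hypothetical helper: guarantee close() on every path, including error
// paths, so the descriptor cannot linger in CLOSE_WAIT after the DN
// side hangs up.
class SocketCleanup {
    static void transfer(Socket s) {
        try (s) { // Java 9+: auto-close the effectively-final resource
            // ... read/write block data to the DataNode here ...
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```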
[jira] [Created] (HDFS-13465) Overlapping lease recoveries cause NPE in NN
Daryn Sharp created HDFS-13465:
--

Summary: Overlapping lease recoveries cause NPE in NN
Key: HDFS-13465
URL: https://issues.apache.org/jira/browse/HDFS-13465
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.8.0
Reporter: Daryn Sharp

Overlapping lease recoveries for the same file will NPE in the DatanodeManager while creating LeaseRecoveryCommands, possibly losing other recovery commands.

* client1 calls recoverLease; the file is added to DN1's recovery queue
* client2 calls recoverLease; the file is added to DN2's recovery queue
* one DN heartbeats, gets the block recovery command, and completes the synchronization before the other DN heartbeats; i.e. the file is closed.
* the other DN heartbeats, takes the block from its recovery queue, assumes it's still under construction, and gets an NPE calling getExpectedStorageLocations

{code:java}
//check lease recovery
BlockInfo[] blocks = nodeinfo.getLeaseRecoveryCommand(Integer.MAX_VALUE);
if (blocks != null) {
  BlockRecoveryCommand brCommand = new BlockRecoveryCommand(
      blocks.length);
  for (BlockInfo b : blocks) {
    BlockUnderConstructionFeature uc = b.getUnderConstructionFeature();
    assert uc != null;
    final DatanodeStorageInfo[] storages = uc.getExpectedStorageLocations();
{code}

This is "ok" for the NN state if only 1 block was queued. All recoveries are lost if multiple blocks were queued. Recovery will not occur until the client explicitly retries or the lease monitor recovers the lease.
[jira] [Created] (HDFS-13112) Token expiration edits may cause log corruption or deadlock
Daryn Sharp created HDFS-13112:
--

Summary: Token expiration edits may cause log corruption or deadlock
Key: HDFS-13112
URL: https://issues.apache.org/jira/browse/HDFS-13112
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 0.23.8, 2.1.0-beta
Reporter: Daryn Sharp
Assignee: Daryn Sharp

HDFS-4477 specifically did not acquire the fsn lock during token cancellation based on the belief that edit logs are thread-safe. However, log rolling is not thread-safe. Failure to externally synchronize on the fsn lock during a roll will cause problems. For sync edit logging, it may cause corruption by interspersing edits with the end/start segment edits. Async edit logging may encounter a deadlock if the log queue overflows. Luckily, losing the race is extremely rare. In ~5 years, we've never encountered it. However, HDFS-13051 lost the race with async edits.
[jira] [Created] (HDFS-13111) Close recovery may incorrectly mark blocks corrupt
Daryn Sharp created HDFS-13111:
--

Summary: Close recovery may incorrectly mark blocks corrupt
Key: HDFS-13111
URL: https://issues.apache.org/jira/browse/HDFS-13111
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Affects Versions: 2.8.0
Reporter: Daryn Sharp

Close recovery can leave a block marked corrupt until the next FBR arrives from one of the DNs. The reason is unclear, but this has happened multiple times when a DN has I/O-saturated disks.
[jira] [Resolved] (HDFS-13038) User with no permission on file is able to run getfacl for that file
[ https://issues.apache.org/jira/browse/HDFS-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daryn Sharp resolved HDFS-13038.
--
Resolution: Not A Problem

> User with no permission on file is able to run getfacl for that file
>
> Key: HDFS-13038
> URL: https://issues.apache.org/jira/browse/HDFS-13038
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Namit Maheshwari
> Assignee: Lokesh Jain
> Priority: Major
> Attachments: HDFS-13038.001.patch
>
> Currently any user with EXECUTE permission can run getfacl on a file or directory. This Jira adds a check for READ access of user on the inode path.
> {code:java}
> [root@host ~]$ hdfs dfs -copyFromLocal /etc/a.txt /tmp
> [root@host ~]$ hdfs dfs -setfacl -m user:abc:--- /tmp/a.txt
> {code}
> Since user abc does not have read permission on the file, the 'cat' command throws a Permission Denied error but getfacl executes normally.
> {code:java}
> [abc@host ~]$ hdfs dfs -cat /tmp/a.txt
> cat: Permission denied: user=abc, access=READ, inode="/tmp/a.txt":abc:hdfs:-rw-r--r--
> [abc@host ~]$ hdfs dfs -getfacl /tmp/a.txt
> # file: /tmp/a.txt
> # owner: root
> # group: hdfs
> user::rw-
> user:abc:---
> group::r--
> mask::r--
> other::r--
> {code}
[jira] [Created] (HDFS-12914) Block report leases cause missing blocks until next report
Daryn Sharp created HDFS-12914:
--

Summary: Block report leases cause missing blocks until next report
Key: HDFS-12914
URL: https://issues.apache.org/jira/browse/HDFS-12914
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.8.0
Reporter: Daryn Sharp
Priority: Critical

{{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for conditions such as "unknown datanode", "not in pending set", "lease has expired", wrong lease id, etc. Lease rejection does not throw an exception. It returns false, which bubbles up to {{NameNodeRpcServer#blockReport}} and is interpreted as {{noStaleStorages}}.

A re-registering node whose FBR is rejected due to an invalid lease becomes active with _no blocks_. A replication storm ensues, possibly causing DNs to temporarily go dead (HDFS-12645), leading to more FBR lease rejections on re-registration. The cluster will have many "missing blocks" until the DN's next FBR is sent and/or forced.
[jira] [Created] (HDFS-12907) Allow read-only access to reserved raw for non-superusers
Daryn Sharp created HDFS-12907:
--

Summary: Allow read-only access to reserved raw for non-superusers
Key: HDFS-12907
URL: https://issues.apache.org/jira/browse/HDFS-12907
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.6.0
Reporter: Daryn Sharp

HDFS-6509 added a special /.reserved/raw path prefix to access the raw file contents of EZ files. In the simplest sense, it doesn't return the FE info in the {{LocatedBlocks}} so the dfs client doesn't try to decrypt the data. This facilitates allowing tools like distcp to copy raw bytes.

Access to the raw hierarchy is restricted to superusers. This seems like an overly broad restriction designed to prevent non-admins from munging the EZ related xattrs. I believe we should relax the restriction to allow non-admins to perform read-only operations. Allowing non-superusers to easily read the raw bytes will be extremely useful for regular users, especially for enabling webhdfs client-side encryption.
[jira] [Created] (HDFS-12747) Lease monitor may infinitely loop on the same lease
Daryn Sharp created HDFS-12747:
--

Summary: Lease monitor may infinitely loop on the same lease
Key: HDFS-12747
URL: https://issues.apache.org/jira/browse/HDFS-12747
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.8.0
Reporter: Daryn Sharp
Priority: Critical

Lease recovery incorrectly handles UC files if the last block is complete but the penultimate block is committed. "Incorrectly handles" is a euphemism for: it infinitely loops for days and leaves all abandoned streams open until customers complain.

The problem may manifest when:
# Block1 is committed but seemingly never completed
# Block2 is allocated
# Lease recovery is initiated for block2
# Commit block synchronization invokes {{FSNamesystem#closeFileCommitBlocks}}, causing:
#* {{commitOrCompleteLastBlock}} to mark block2 as complete
#* {{finalizeINodeFileUnderConstruction}}/{{INodeFile.assertAllBlocksComplete}} to throw {{IllegalStateException}} because the penultimate block1 is "COMMITTED but not COMPLETE"
# The next lease recovery results in an infinite loop.

The {{LeaseManager}} expects that {{FSNamesystem#internalReleaseLease}} will either init recovery and renew the lease, or remove the lease. In the described state it does neither. The switch case will break out if the last block is complete. (The case statement ironically contains an assert.) Since nothing changed, the lease is still the "next" lease to be processed. The lease monitor loops for 25ms on the same lease, sleeps for 2s, then loops on it again.
[jira] [Resolved] (HDFS-4493) webhdfs fails after SPNEGO token expires
[ https://issues.apache.org/jira/browse/HDFS-4493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daryn Sharp resolved HDFS-4493.
--
Resolution: Works for Me
Release Note: This issue was fixed years ago when authenticated url was removed from webhdfs.

> webhdfs fails after SPNEGO token expires
>
> Key: HDFS-4493
> URL: https://issues.apache.org/jira/browse/HDFS-4493
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: webhdfs
> Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0-alpha1
> Reporter: Daryn Sharp
>
> Webhdfs assumes that SPNEGO's {{AuthenticationException}} is always fatal. The exception is thrown if authentication truly fails due to bad credentials, _but_ it's also thrown when the auth token expires after 10h. The retry policies are short-circuited and the fs becomes unusable, which is unsuitable for long running processes/daemons.
[jira] [Created] (HDFS-12705) WebHdfsFileSystem exceptions should retain the caused by exception
Daryn Sharp created HDFS-12705:
--

Summary: WebHdfsFileSystem exceptions should retain the caused by exception
Key: HDFS-12705
URL: https://issues.apache.org/jira/browse/HDFS-12705
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs
Affects Versions: 2.8.0
Reporter: Daryn Sharp

{{WebHdfsFileSystem#runWithRetry}} uses reflection to prepend the remote host to the exception. While it preserves the original stacktrace, it omits the original cause, which complicates debugging.
[jira] [Created] (HDFS-12704) FBR may corrupt block state
Daryn Sharp created HDFS-12704:
--

Summary: FBR may corrupt block state
Key: HDFS-12704
URL: https://issues.apache.org/jira/browse/HDFS-12704
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.8.0
Reporter: Daryn Sharp
Priority: Critical

If FBR processing generates a runtime exception, it is believed to foul the block state and lead to unpredictable behavior.
[jira] [Created] (HDFS-12703) Exceptions are fatal to decommissioning monitor
Daryn Sharp created HDFS-12703:
--

Summary: Exceptions are fatal to decommissioning monitor
Key: HDFS-12703
URL: https://issues.apache.org/jira/browse/HDFS-12703
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.7.0
Reporter: Daryn Sharp
Priority: Critical

The {{DecommissionManager.Monitor}} runs as an executor scheduled task. If an exception occurs, all decommissioning ceases until the NN is restarted. Per the javadoc for {{executor#scheduleAtFixedRate}}: *If any execution of the task encounters an exception, subsequent executions are suppressed*. The monitor thread is alive but blocked waiting for an executor task that will never come. The code currently disposes of the future, so the actual exception that aborted the task is gone.

Failover is insufficient since the task is also likely dead on the standby. Replication queue init after the transition to active will fix the under-replication of blocks on currently decommissioning nodes, but future nodes never decommission. The standby must be bounced prior to failover – and hopefully the error condition does not recur.
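The `scheduleAtFixedRate` behavior quoted above can be guarded against by wrapping the periodic task so no exception ever escapes to the executor. A sketch follows; the wrapper name is hypothetical, and the actual fix in the DecommissionManager may differ:

```java
// Hypothetical wrapper: without the catch, one uncaught RuntimeException
// makes scheduleAtFixedRate suppress every subsequent execution of the
// periodic task.
class SafeMonitor {
    static Runnable swallowing(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                // log the failure and keep the periodic task alive
            }
        };
    }
}
```

Usage would be `executor.scheduleAtFixedRate(SafeMonitor.swallowing(monitor), delay, period, unit)`; the throwing run is logged, and the next scheduled execution still happens.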
[jira] [Created] (HDFS-12658) Lease renewal causes connection flapping
Daryn Sharp created HDFS-12658:
--

Summary: Lease renewal causes connection flapping
Key: HDFS-12658
URL: https://issues.apache.org/jira/browse/HDFS-12658
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs-client
Affects Versions: 2.0.0-alpha
Reporter: Daryn Sharp

Adding a dfsclient to the lease renewer uses the minimum of 1/2 the soft timeout vs. 1/2 the client's timeout (when the client closes an idle connection). Both default to 1m, so clients with open files that are otherwise not making calls to the NN will experience connection flapping. Re-authentication is unnecessarily taxing on the ipc layer.
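The interaction reduces to taking the minimum of two half-intervals. A toy sketch of the computation described, with an illustrative method name and defaults rather than the actual LeaseRenewer code:

```java
// Illustrative sketch: with both timeouts at their 1m defaults, the
// renewer wakes every 30s, which is just inside the idle-close window,
// so an otherwise-idle client keeps tearing down and re-establishing
// (and re-authenticating) its NN connection.
class RenewalInterval {
    static long renewalPeriodMs(long softTimeoutMs, long clientIdleTimeoutMs) {
        return Math.min(softTimeoutMs / 2, clientIdleTimeoutMs / 2);
    }
}
```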
[jira] [Created] (HDFS-12657) Operations based on inode id must not fallback to the path
Daryn Sharp created HDFS-12657:
--

Summary: Operations based on inode id must not fallback to the path
Key: HDFS-12657
URL: https://issues.apache.org/jira/browse/HDFS-12657
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.5.0
Reporter: Daryn Sharp

HDFS-6294 added the ability for some path-based operations to specify an optional inode id to mimic file descriptors. If an inode id is provided and it exists, it replaces the provided path. If it doesn't exist, it has the broken behavior of falling back to the supplied path. A supplied inode id must be authoritative. An FNF should be thrown if the inode does not exist. (HDFS-10745 changed from string paths to IIPs but preserved the same broken semantics.)

This is broken since an operation specifying an inode for a deleted and recreated path will operate on the newer inode. If another client recreates the path, the operation is likely to fail for other reasons such as lease checks. However, a multi-threaded client has a single lease id. If thread1 creates a file, the file is somehow deleted, and thread2 recreates the path, then further operations in thread1 may conflict with thread2 and corrupt the state of the file.
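The authoritative semantics argued for above can be sketched as follows. This is an illustrative resolver, not the actual FSDirectory code; the types and the unchecked exception stand in for the real inode map and FileNotFoundException:

```java
import java.util.Map;
import java.util.NoSuchElementException;

// Illustrative sketch: a supplied inode id is authoritative. A stale id
// fails fast instead of silently falling back to the path, which may
// now name a different (recreated) file.
class InodeResolver {
    static String resolve(Map<Long, String> inodeToPath, long inodeId, String path) {
        if (inodeId > 0) {
            String resolved = inodeToPath.get(inodeId);
            if (resolved == null) {
                // The real fix would throw FileNotFoundException here.
                throw new NoSuchElementException("inode " + inodeId + " does not exist");
            }
            return resolved;
        }
        return path; // no id supplied: ordinary path-based resolution
    }
}
```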
[jira] [Created] (HDFS-12648) DN should provide feedback to NN for throttling commands
Daryn Sharp created HDFS-12648:
--

Summary: DN should provide feedback to NN for throttling commands
Key: HDFS-12648
URL: https://issues.apache.org/jira/browse/HDFS-12648
Project: Hadoop HDFS
Issue Type: Sub-task
Components: datanode
Affects Versions: 2.8.0
Reporter: Daryn Sharp

The NN should avoid sending commands to a DN with a high number of outstanding commands. The heartbeat could provide this feedback, perhaps via a simple count of outstanding commands or the rate of processing.
[jira] [Created] (HDFS-12647) DN commands processing should be async
Daryn Sharp created HDFS-12647:
--

Summary: DN commands processing should be async
Key: HDFS-12647
URL: https://issues.apache.org/jira/browse/HDFS-12647
Project: Hadoop HDFS
Issue Type: Sub-task
Components: datanode
Affects Versions: 2.8.0
Reporter: Daryn Sharp

Due to dataset lock contention, service actors may encounter significant latency while processing DN commands. Even the queuing of async deletions requires multiple lock acquisitions. A slow disk will cause a backlog of xceivers instantiating block senders/receivers, which starves the actor and leads to the NN falsely declaring the node dead.

Async processing of all commands will free the actor to perform its primary purpose of heartbeating and block reporting. Note that FBRs will be dependent on queued block invalidations not being included in the report.
[jira] [Created] (HDFS-12646) Avoid IO while holding the FsDataset lock
Daryn Sharp created HDFS-12646:
--

Summary: Avoid IO while holding the FsDataset lock
Key: HDFS-12646
URL: https://issues.apache.org/jira/browse/HDFS-12646
Project: Hadoop HDFS
Issue Type: Sub-task
Components: datanode
Affects Versions: 2.8.0
Reporter: Daryn Sharp

IO operations should not be performed while holding the dataset lock. Notable offenders include, but are not limited to, the instantiation of a block sender/receiver, constructing the path to a block, and unfinalizing a block.
[jira] [Created] (HDFS-12645) FSDatasetImpl lock will stall BP service actors and may cause missing blocks
Daryn Sharp created HDFS-12645:
--

Summary: FSDatasetImpl lock will stall BP service actors and may cause missing blocks
Key: HDFS-12645
URL: https://issues.apache.org/jira/browse/HDFS-12645
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Affects Versions: 2.8.0
Reporter: Daryn Sharp

The DN is extremely susceptible to a slow volume due to bad locking practices. DN operations require the fs dataset lock. IO within the dataset lock should not be permissible, as it leads to severe performance degradation and possibly (temporarily) missing blocks.

A slow disk will cause pipelines to experience significant latency and timeouts, increasing lock/io contention while cleaning up, leading to more timeouts, etc. Meanwhile, the actor service thread is interleaving multiple lock acquire/releases with xceivers. If many commands are issued, the node may be incorrectly declared as dead. HDFS-12639 documents that both actors synchronize on the offer service lock while processing commands. A backlogged active actor will block the standby actor and cause it to go dead too.
[jira] [Created] (HDFS-12639) BPOfferService lock may stall all service actors
Daryn Sharp created HDFS-12639:
--

Summary: BPOfferService lock may stall all service actors
Key: HDFS-12639
URL: https://issues.apache.org/jira/browse/HDFS-12639
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Affects Versions: 2.8.0
Reporter: Daryn Sharp

{{BPOfferService}} manages {{BPServiceActor}} instances for the active and standby. It uses a RW lock primarily to protect registration information while determining the active/standby from heartbeats. Unfortunately the write lock is held during command processing. If an actor is experiencing high latency processing commands, the other actor will neither be able to register (blocked in createRegistration, setNamespaceInfo, verifyAndSetNamespaceInfo) nor process heartbeats (blocked in updateActorStatesFromHeartbeat).

The worst case scenario for processing commands while holding the lock is re-registration. The actor will loop, catching and logging exceptions, leaving the other actor blocked for a non-deterministic (possibly infinite) amount of time. The lock must not be held during command processing.
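The fix the report calls for (do not hold the lock during command processing) usually takes the shape of snapshotting the shared state under the lock and doing the slow work after releasing it. A sketch with illustrative names, not the actual BPOfferService fields:

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch, not the actual BPOfferService: snapshot shared
// state under the lock, then run the (possibly slow) commands unlocked.
class CommandProcessing {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private String registration = "bp-registration"; // guarded by lock

    int processCommands(List<Runnable> cmds) {
        final String reg;
        lock.readLock().lock();
        try {
            reg = registration; // copy what the commands need
        } finally {
            lock.readLock().unlock();
        }
        // No lock held here: the other actor can still register and
        // process heartbeats while this slow work runs.
        int processed = 0;
        for (Runnable c : cmds) {
            c.run(); // the real code would pass `reg` along as needed
            processed++;
        }
        return processed;
    }
}
```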
[jira] [Created] (HDFS-12184) Avoid redundant ancestor metadata traversals for listStatus child entries.
Daryn Sharp created HDFS-12184:
--

Summary: Avoid redundant ancestor metadata traversals for listStatus child entries.
Key: HDFS-12184
URL: https://issues.apache.org/jira/browse/HDFS-12184
Project: Hadoop HDFS
Issue Type: Improvement
Components: namenode
Affects Versions: 2.7.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp

Creating a file status requires metadata for EC, EZ, storage policies, etc. that is computed by traversing up the ancestor inodes. List status will incur the same penalties for all child entries when one traversal is enough. The penalty for large directories is not trivial. Storage policies short-circuit the full traversal; the other lookups should too.
[jira] [Created] (HDFS-12173) MiniDFSCluster cannot reliably use NameNode#stop
Daryn Sharp created HDFS-12173:
--

Summary: MiniDFSCluster cannot reliably use NameNode#stop
Key: HDFS-12173
URL: https://issues.apache.org/jira/browse/HDFS-12173
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.7.0
Reporter: Daryn Sharp

Sporadic test failures occur because {{NameNode#stop}} used by the mini cluster does not properly manage the HA context's state. It directly calls {{HAState#exitState(context)}} instead of {{HAState#setState(context,state)}}. The latter will properly lock the namesystem and update the ha state while locked, while the former does not. The result is that while the cluster is stopping, the lock is released and any queued rpc calls think the NN is still active and are processed while the NN is in an unstable half-stopped state.
[jira] [Created] (HDFS-12172) Reduce EZ lookup overhead
Daryn Sharp created HDFS-12172:
--

Summary: Reduce EZ lookup overhead
Key: HDFS-12172
URL: https://issues.apache.org/jira/browse/HDFS-12172
Project: Hadoop HDFS
Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp

A number of inefficiencies exist in EZ lookups. These are amplified by frequent operations like list status. Once one encryption zone exists, all operations take the performance penalty. For example: operations should not perform redundant lookups; EZ path reconstruction should be lazy since it's not required in the common case; renames do not need to reallocate new IIPs to check parent dirs for EZ.
[jira] [Created] (HDFS-12171) Reduce IIP object allocations for inode lookup
Daryn Sharp created HDFS-12171:
--

Summary: Reduce IIP object allocations for inode lookup
Key: HDFS-12171
URL: https://issues.apache.org/jira/browse/HDFS-12171
Project: Hadoop HDFS
Issue Type: Improvement
Components: namenode
Affects Versions: 2.7.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp

{{IIP#getReadOnlyINodes}} is invoked frequently for EZ and EC lookups. It allocates unnecessary objects to make the primitive array an immutable array list. IIP already has a method for indexed inode retrieval that can be tweaked to further improve performance.
[jira] [Created] (HDFS-12143) Improve performance of getting and removing inode features
Daryn Sharp created HDFS-12143:
--

Summary: Improve performance of getting and removing inode features
Key: HDFS-12143
URL: https://issues.apache.org/jira/browse/HDFS-12143
Project: Hadoop HDFS
Issue Type: Improvement
Components: namenode
Affects Versions: 2.8.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp

Getting a feature uses an iterator, which is less performant than an indexed for loop. Feature lookups are becoming more prolific, so cycles count.

Removing a feature requires building a string for up to 3 precondition checks. The worst case of 3 is the penalty for a successful removal.
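Both costs described above are avoidable. A sketch of the two patterns, with illustrative methods rather than the actual INode code: an indexed loop that allocates no iterator, and a precondition check that only builds its message on the failure path:

```java
// Illustrative sketch, not the actual INode feature code.
class FeatureLookup {
    // Indexed loop: no per-lookup Iterator allocation, unlike for-each
    // over a collection.
    static Object getFeature(Object[] features, Class<?> clazz) {
        for (int i = 0; i < features.length; i++) {
            if (clazz.isInstance(features[i])) {
                return features[i];
            }
        }
        return null;
    }

    // Defer message construction: only pay for String.format when the
    // check actually fails, not on every successful removal.
    static void checkState(boolean ok, String fmt, Object arg) {
        if (!ok) {
            throw new IllegalStateException(String.format(fmt, arg));
        }
    }
}
```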
[jira] [Created] (HDFS-12142) Files may be closed before streamer is done
Daryn Sharp created HDFS-12142:
--

Summary: Files may be closed before streamer is done
Key: HDFS-12142
URL: https://issues.apache.org/jira/browse/HDFS-12142
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs
Affects Versions: 2.8.0
Reporter: Daryn Sharp

We're encountering multiple cases of clients calling updateBlockForPipeline on completed blocks. Initial analysis is that the client closes a file, completeFile succeeds, and then it immediately attempts recovery. The exception is swallowed on the client and only logged on the NN by checkUCBlock.

The problem "appears" to be benign (no data loss), but it's unproven whether the issue always occurs for successfully closed files. There appears to be very poor coordination between the dfs output stream's threads, which leads to races that confuse the streamer thread – which probably should have been joined before returning from close.
[jira] [Created] (HDFS-12140) Remove BPOfferService lock contention to get block pool id
Daryn Sharp created HDFS-12140:
--

Summary: Remove BPOfferService lock contention to get block pool id
Key: HDFS-12140
URL: https://issues.apache.org/jira/browse/HDFS-12140
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Affects Versions: 2.8.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical

The block pool id is protected by a lock in {{BPOfferService}}. This creates excessive contention, especially for xceiver threads attempting to queue IBRs and for heartbeat processing. When the latter is delayed due to excessive FSDataset lock contention, it causes pipelines to collapse. Accessing the block pool id should be lockless after registration.
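Since the id is written once at registration and never changes afterward, a volatile field is enough to make the read path lock-free. An illustrative sketch, not the actual BPOfferService code:

```java
// Illustrative sketch: publish the block pool id once at registration,
// then read it without locking. Safe because the field is write-once
// and volatile guarantees visibility of the registered value.
class BlockPoolId {
    private volatile String bpId;

    void register(String id) {
        bpId = id; // set once during registration
    }

    String getBlockPoolId() {
        return bpId; // lock-free read on the hot path
    }
}
```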
[jira] [Created] (HDFS-12137) DN dataset lock should be fair
Daryn Sharp created HDFS-12137:
--

Summary: DN dataset lock should be fair
Key: HDFS-12137
URL: https://issues.apache.org/jira/browse/HDFS-12137
Project: Hadoop HDFS
Issue Type: Improvement
Components: datanode
Affects Versions: 2.8.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical

The dataset lock is very highly contended. Its unfair nature can be especially harmful to heartbeat handling. Under high loads, partially exposed by HDFS-12136 introducing disk i/o within the lock, the heartbeat handling thread may process commands so slowly due to the contention that the node becomes stale or is falsely declared dead. The unfair lock is not helping and appears to be causing frequent starvation under load.
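The proposed change amounts to the fairness flag on the lock's constructor. A fair ReentrantLock grants the lock to the longest-waiting thread instead of letting newly arriving threads barge, so the heartbeat thread cannot be starved by a stream of xceivers; the tradeoff is lower raw throughput, which HDFS-14534 (above) later revisits for a different lock. A minimal sketch with an illustrative helper:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the one-line change: construct the dataset lock in fair
// (FIFO) mode so waiting threads, including heartbeat handling, are
// served in arrival order rather than starved by barging threads.
class DatasetLockChoice {
    static ReentrantLock datasetLock(boolean fair) {
        return new ReentrantLock(fair);
    }
}
```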
[jira] [Created] (HDFS-12136) BlockSender performance regression due to volume scanner edge case
Daryn Sharp created HDFS-12136: -- Summary: BlockSender performance regression due to volume scanner edge case Key: HDFS-12136 URL: https://issues.apache.org/jira/browse/HDFS-12136 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.8.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Critical HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan by reading the last checksum of finalized blocks within the {{BlockSender}} ctor. Unfortunately it holds the exclusive dataset lock while opening and reading the metafile multiple times, so block sender instantiation becomes serialized. Performance completely collapses under heavy disk i/o utilization or high xceiver activity, ex. lost node replication, balancing, or decommissioning. The xceiver threads congest creating block senders and impair the heartbeat processing that is contending for the same lock. Combined with other lock contention issues, pipelines break and nodes sporadically go dead. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12070) Failed block recovery leaves files open indefinitely and at risk for data loss
Daryn Sharp created HDFS-12070: -- Summary: Failed block recovery leaves files open indefinitely and at risk for data loss Key: HDFS-12070 URL: https://issues.apache.org/jira/browse/HDFS-12070 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.0.0-alpha Reporter: Daryn Sharp Files will remain open indefinitely if block recovery fails, which creates a high risk of data loss. The replication monitor will not replicate these blocks. The NN provides the primary node a list of candidate nodes for recovery, which involves a 2-stage process. The primary node removes any candidates that cannot init replica recovery (essentially alive and knows about the block) to create a sync list. Stage 2 issues updates to the sync list – _but fails if any node fails_, unlike the first stage. The NN should be informed of the nodes that did succeed. Manual recovery will also fail until the problematic node is temporarily stopped, so that a connection refused causes the bad node to be pruned from the candidates. Recovery then succeeds, the lease is released, under-replication is fixed, and the block is invalidated on the bad node. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12049) Recommissioning live nodes stalls the NN
Daryn Sharp created HDFS-12049: -- Summary: Recommissioning live nodes stalls the NN Key: HDFS-12049 URL: https://issues.apache.org/jira/browse/HDFS-12049 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.6.0 Reporter: Daryn Sharp Priority: Critical A node refresh will recommission included nodes that are alive and in decommissioning or decommissioned state. The recommission will scan all blocks on the node, find over-replicated blocks, choose an excess replica, and queue an invalidate. The process is expensive and worsened by the overhead of storage types (even when not in use). It can be especially devastating because the write lock is held for the entire node refresh. _Recommissioning 67 nodes with ~500k blocks/node stalled rpc services for over 4 mins._ -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-11648) Lazy construct the IIP pathname
Daryn Sharp created HDFS-11648: -- Summary: Lazy construct the IIP pathname Key: HDFS-11648 URL: https://issues.apache.org/jira/browse/HDFS-11648 Project: Hadoop HDFS Issue Type: Improvement Reporter: Daryn Sharp Assignee: Daryn Sharp The IIP pathname is a string constructed from the byte[][] components. If the pathname will never be accessed, ex. processing listStatus children, building the path is unnecessarily expensive. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-11379) DFSInputStream may infinite loop requesting block locations
Daryn Sharp created HDFS-11379: -- Summary: DFSInputStream may infinite loop requesting block locations Key: HDFS-11379 URL: https://issues.apache.org/jira/browse/HDFS-11379 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 2.7.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Critical DFSInputStream creation caches the file size and initial range of locations. If the file is truncated (or replaced) and the client attempts to read outside the initial range, the client goes into a tight infinite loop requesting locations for the nonexistent range. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-11310) Reduce the performance impact of the balancer (trunk port)
Daryn Sharp created HDFS-11310: -- Summary: Reduce the performance impact of the balancer (trunk port) Key: HDFS-11310 URL: https://issues.apache.org/jira/browse/HDFS-11310 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs, namenode Affects Versions: 3.0.0-alpha1 Reporter: Daryn Sharp Priority: Critical HDFS-7967 introduced a highly performant balancer getBlocks() query that scales to large/dense clusters. The simple design implementation depends on the triplets data structure. HDFS-9260 removed the triplets which fundamentally changes the implementation. Either that patch must be reverted or the getBlocks() patch needs reimplementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10997) Reduce number of path resolving methods
Daryn Sharp created HDFS-10997: -- Summary: Reduce number of path resolving methods Key: HDFS-10997 URL: https://issues.apache.org/jira/browse/HDFS-10997 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode Reporter: Daryn Sharp Assignee: Daryn Sharp FSDirectory contains many methods for resolving paths to an IIP and/or inode. These should be unified into a couple methods that will consistently do the basics of resolving reserved paths, blocking write ops from snapshot paths, verifying ancestors as directories, and throwing if symlinks are encountered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10980) Optimize check for existence of parent directory
Daryn Sharp created HDFS-10980: -- Summary: Optimize check for existence of parent directory Key: HDFS-10980 URL: https://issues.apache.org/jira/browse/HDFS-10980 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Daryn Sharp Assignee: Daryn Sharp {{FSDirectory.verifyParentDir()}} uses a {{Path}} object to parse and return the parent path. This is very expensive compared to using the path within the IIP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10979) Pass IIP for FSDirDeleteOp methods
Daryn Sharp created HDFS-10979: -- Summary: Pass IIP for FSDirDeleteOp methods Key: HDFS-10979 URL: https://issues.apache.org/jira/browse/HDFS-10979 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Daryn Sharp Assignee: Daryn Sharp Remove path strings from method signatures and/or replace with IIP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10956) Remove rename/delete performance penalty when not using snapshots
Daryn Sharp created HDFS-10956: -- Summary: Remove rename/delete performance penalty when not using snapshots Key: HDFS-10956 URL: https://issues.apache.org/jira/browse/HDFS-10956 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Daryn Sharp Assignee: Daryn Sharp When deleting or renaming directories, the entire subtree(s) are scanned for snapshottable directories. The scan may become very expensive for dense trees. The snapshot manager knows if snapshots are in use, so clusters not using snapshots should not take the performance penalty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10955) Pass IIP for FSDirAttr methods
Daryn Sharp created HDFS-10955: -- Summary: Pass IIP for FSDirAttr methods Key: HDFS-10955 URL: https://issues.apache.org/jira/browse/HDFS-10955 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Daryn Sharp Assignee: Daryn Sharp Methods should always use the resolved IIP instead of re-solving the path. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10940) Reduce performance penalty of block caching when not used
Daryn Sharp created HDFS-10940: -- Summary: Reduce performance penalty of block caching when not used Key: HDFS-10940 URL: https://issues.apache.org/jira/browse/HDFS-10940 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.7 Reporter: Daryn Sharp Assignee: Daryn Sharp For every block location generated, the CacheManager will create a junk object for a hash lookup of cached locations. If there are no cached blocks, none of this is required. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10939) Reduce performance penalty of encryption zones
Daryn Sharp created HDFS-10939: -- Summary: Reduce performance penalty of encryption zones Key: HDFS-10939 URL: https://issues.apache.org/jira/browse/HDFS-10939 Project: Hadoop HDFS Issue Type: Sub-task Affects Versions: 2.7 Reporter: Daryn Sharp Assignee: Daryn Sharp The encryption zone APIs should be optimized to extensively use IIPs to eliminate path resolutions. The performance penalties incurred by common operations like creation of file statuses may be reduced by more extensive short-circuiting of EZ lookups when no EZs exist. All file creates should not be subjected to the multi-stage locking performance penalty required only for EDEK generation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10851) FSDirStatAndListingOp: stop passing path as string
Daryn Sharp created HDFS-10851: -- Summary: FSDirStatAndListingOp: stop passing path as string Key: HDFS-10851 URL: https://issues.apache.org/jira/browse/HDFS-10851 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs Reporter: Daryn Sharp Assignee: Daryn Sharp Path strings should be resolved once into INodesInPath. The IIP should be used extensively from that point forward. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10850) getEZForPath should NOT throw FNF
Daryn Sharp created HDFS-10850: -- Summary: getEZForPath should NOT throw FNF Key: HDFS-10850 URL: https://issues.apache.org/jira/browse/HDFS-10850 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Affects Versions: 2.8.0 Reporter: Daryn Sharp Priority: Blocker HDFS-9433 made an incompatible change to the semantics of getEZForPath. It used to return the EZ of the closest ancestor path. It never threw FNF. A common use of getEZForPath is determining whether a file can be renamed, or must be copied due to mismatched EZs. Notably, this has broken hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10789) Route webhdfs through the RPC call queue
Daryn Sharp created HDFS-10789: -- Summary: Route webhdfs through the RPC call queue Key: HDFS-10789 URL: https://issues.apache.org/jira/browse/HDFS-10789 Project: Hadoop HDFS Issue Type: Improvement Components: ipc, webhdfs Reporter: Daryn Sharp Assignee: Daryn Sharp Webhdfs is extremely expensive under load and is not subject to the QoS benefits of the RPC call queue. HADOOP-13537 provides the basis for routing webhdfs through the call queue to provide unified QoS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10779) Rename does not need to re-solve destination
Daryn Sharp created HDFS-10779: -- Summary: Rename does not need to re-solve destination Key: HDFS-10779 URL: https://issues.apache.org/jira/browse/HDFS-10779 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs Reporter: Daryn Sharp Assignee: Daryn Sharp Rename uses {{FSDirectory.isDir(String)}} to determine if the destination is a directory. This dissects the path, creates an IIP, and checks if the last inode is a directory. The rename operations already have the IIP and can check it directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10772) Reduce byte/string conversions for get listing
Daryn Sharp created HDFS-10772: -- Summary: Reduce byte/string conversions for get listing Key: HDFS-10772 URL: https://issues.apache.org/jira/browse/HDFS-10772 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Daryn Sharp Assignee: Daryn Sharp {{FSDirectory.getListingInt}} does a byte/string conversion for the byte[] startAfter just to determine if it should be resolved as an inode path. This is not the common case but rather for NFS support so it should be avoided. When the resolution is necessary the conversions may be reduced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10768) Optimize mkdir ops
Daryn Sharp created HDFS-10768: -- Summary: Optimize mkdir ops Key: HDFS-10768 URL: https://issues.apache.org/jira/browse/HDFS-10768 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs Reporter: Daryn Sharp Assignee: Daryn Sharp Directory creation causes excessive object allocation: ex. an immutable list builder, containing the string of components converted from the IIP's byte[]s, sublist views of the string list, iterable, followed by string to byte[] conversion. This can all be eliminated by accessing the component's byte[] in the IIP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10762) Pass IIP for file status related methods
Daryn Sharp created HDFS-10762: -- Summary: Pass IIP for file status related methods Key: HDFS-10762 URL: https://issues.apache.org/jira/browse/HDFS-10762 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs Reporter: Daryn Sharp Assignee: Daryn Sharp The frequently called file status methods will not require path re-resolves if the IIP is passed down the call stack. The code can be simplified further if the IIP tracks if the original path was a reserved raw path. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10745) Directly resolve paths into INodesInPath
Daryn Sharp created HDFS-10745: -- Summary: Directly resolve paths into INodesInPath Key: HDFS-10745 URL: https://issues.apache.org/jira/browse/HDFS-10745 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Daryn Sharp Assignee: Daryn Sharp The intermediate resolution to a string, only to be decomposed by {{INodesInPath}} back into a byte[][] can be eliminated by resolving directly to an IIP. The IIP will contain the resolved path if required. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10744) Internally optimize path component resolution
Daryn Sharp created HDFS-10744: -- Summary: Internally optimize path component resolution Key: HDFS-10744 URL: https://issues.apache.org/jira/browse/HDFS-10744 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs Reporter: Daryn Sharp Assignee: Daryn Sharp {{FSDirectory}}'s path resolution currently uses a mixture of string & byte[][] conversions, back to string, back to byte[][] for {{INodesInPath}}. Internally all path component resolution should be byte[][]-based as the precursor to instantiating an {{INodesInPath}} w/o the last 2 unnecessary conversions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10743) MiniDFSCluster test runtimes can be drastically reduced
Daryn Sharp created HDFS-10743: -- Summary: MiniDFSCluster test runtimes can be drastically reduced Key: HDFS-10743 URL: https://issues.apache.org/jira/browse/HDFS-10743 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Affects Versions: 2.0.0-alpha Reporter: Daryn Sharp {{MiniDFSCluster}} tests have excessive runtimes. The main problem appears to be the heartbeat interval. The NN may have to wait up to 3s (default value) for all DNs to heartbeat, triggering registration, so the NN can go active. Tests that repeatedly restart the NN are severely affected. Example for varying heartbeat intervals for {{TestFSImageWithAcl}}: * 3s = ~70s -- (disgusting, why I investigated) * 1s = ~27s * 500ms = ~17s -- (had to hack DNConf for millisecond precision) That's a 4x improvement in runtime. 17s is still excessively long for what the test does. Further areas to explore when running tests: * Reduce the numerous sleep intervals in DN's {{BPServiceActor}}. * Ensure heartbeats and initial BR are sent immediately upon (re)registration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10722) Fix race condition in TestEditLog#testBatchedSyncWithClosedLogs
Daryn Sharp created HDFS-10722: -- Summary: Fix race condition in TestEditLog#testBatchedSyncWithClosedLogs Key: HDFS-10722 URL: https://issues.apache.org/jira/browse/HDFS-10722 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Reporter: Daryn Sharp Assignee: Daryn Sharp The test may fail the following assertion if async edit logs are enabled: {{logging edit without syncing should do not affect txid expected:<1> but was:<2>}}. The async thread is doing batched syncs in the background. logSync just ensures the edit is durable, so the txid may increase prior to sync. It's a race. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Reopened] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order
[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp reopened HDFS-10301: > BlockReport retransmissions may lead to storages falsely being declared > zombie if storage report processing happens out of order > > > Key: HDFS-10301 > URL: https://issues.apache.org/jira/browse/HDFS-10301 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.6.1 >Reporter: Konstantin Shvachko >Assignee: Vinitha Reddy Gankidi >Priority: Critical > Fix For: 2.7.4 > > Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, > HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, > HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, > HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, > HDFS-10301.012.patch, HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, > HDFS-10301.sample.patch, zombieStorageLogs.rtf > > > When NameNode is busy a DataNode can timeout sending a block report. Then it > sends the block report again. Then NameNode while process these two reports > at the same time can interleave processing storages from different reports. > This screws up the blockReportId field, which makes NameNode think that some > storages are zombie. Replicas from zombie storages are immediately removed, > causing missing blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10711) Optimize FSPermissionChecker group membership check
Daryn Sharp created HDFS-10711: -- Summary: Optimize FSPermissionChecker group membership check Key: HDFS-10711 URL: https://issues.apache.org/jira/browse/HDFS-10711 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs Reporter: Daryn Sharp Assignee: Daryn Sharp HADOOP-13442 obviates the need for multiple group related object allocations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10674) Optimize creating a full path from an inode
Daryn Sharp created HDFS-10674: -- Summary: Optimize creating a full path from an inode Key: HDFS-10674 URL: https://issues.apache.org/jira/browse/HDFS-10674 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs Reporter: Daryn Sharp Assignee: Daryn Sharp {{INode#getFullPathName}} walks up the inode tree, creates a INode[], converts each component byte[] name to a String while building the path. This involves many allocations, copies, and char conversions. The path should be built with a single byte[] allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
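The single-allocation approach suggested above can be sketched as follows (a simplified stand-in, not the actual {{INode#getFullPathName}} code): sum the component lengths first, fill one byte[], and convert to a String exactly once.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch: build "/a/b/c" from byte[] components with a single
// byte[] allocation and a single final char conversion.
public class FullPathBuilder {
    static String fullPath(byte[][] components) {
        int len = 0;
        for (byte[] c : components) {
            len += 1 + c.length; // one '/' separator plus the component bytes
        }
        byte[] path = new byte[len];
        int pos = 0;
        for (byte[] c : components) {
            path[pos++] = '/';
            System.arraycopy(c, 0, path, pos, c.length);
            pos += c.length;
        }
        return new String(path, StandardCharsets.UTF_8); // single conversion
    }

    public static void main(String[] args) {
        byte[][] parts = { "user".getBytes(StandardCharsets.UTF_8),
                           "daryn".getBytes(StandardCharsets.UTF_8) };
        System.out.println(fullPath(parts)); // prints "/user/daryn"
    }
}
```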
[jira] [Created] (HDFS-10673) Optimize FSPermissionChecker's internal path usage
Daryn Sharp created HDFS-10673: -- Summary: Optimize FSPermissionChecker's internal path usage Key: HDFS-10673 URL: https://issues.apache.org/jira/browse/HDFS-10673 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs Reporter: Daryn Sharp Assignee: Daryn Sharp The INodeAttributeProvider and AccessControlEnforcer features degrade performance and generate excessive garbage even when neither is used. Main issues: # A byte[][] of components is unnecessarily created. Each path component lookup converts a subrange of the byte[][] to a new String[] – which is then not used by the default attribute provider. # Subaccess checks are insanely expensive. The full path of every subdir is created by walking up the inode tree, creating an INode[], building a string by converting each inode's byte[] name to a string, etc. – all of which is only used if there's an exception. The expense of #1 should only be incurred when using the provider/enforcer feature. For #2, paths should be created on-demand for exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10662) Optimize UTF8 string/byte conversions
Daryn Sharp created HDFS-10662: -- Summary: Optimize UTF8 string/byte conversions Key: HDFS-10662 URL: https://issues.apache.org/jira/browse/HDFS-10662 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs Reporter: Daryn Sharp Assignee: Daryn Sharp String/byte conversions may take either a Charset instance or its canonical name. One might think a Charset instance would be faster due to avoiding a lookup and instantiation of a Charset, but it's not. The canonical string name variants will cache the string encoder/decoder (obtained from a Charset) resulting in better performance. LOG4J2-935 describes a real-world performance boost. I micro-benched a marginal runtime improvement on jdk 7/8. However for a 16 byte path, using the canonical name generated 50% less garbage. For a 64 byte path, 25% of the garbage. Given the sheer number of times that paths are (re)parsed, the cost adds up quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
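The two conversion variants compared above look like this in plain Java. Both produce identical strings; the difference described in the report is purely allocation behavior on jdk 7/8, where the canonical-name overload caches its decoder internally while the Charset overload allocates a fresh one per call.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Side-by-side of the Charset-instance and canonical-name conversion paths.
public class Utf8ConversionDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] raw = "/user/daryn/f1".getBytes(StandardCharsets.UTF_8);

        // Charset-instance variant: a fresh decoder is created per call.
        String viaCharset = new String(raw, StandardCharsets.UTF_8);

        // Canonical-name variant: the decoder is cached by name, producing
        // less garbage per conversion (the effect measured in the report).
        String viaName = new String(raw, "UTF-8");

        System.out.println(viaCharset.equals(viaName)); // prints "true"
    }
}
```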
[jira] [Created] (HDFS-10656) Optimize conversion of byte arrays back to path string
Daryn Sharp created HDFS-10656: -- Summary: Optimize conversion of byte arrays back to path string Key: HDFS-10656 URL: https://issues.apache.org/jira/browse/HDFS-10656 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs Reporter: Daryn Sharp Assignee: Daryn Sharp {{DFSUtil.byteArray2PathString}} generates excessive object allocation. # each byte array is encoded to a string (copy) # string appended to a builder which extracts the chars from the intermediate string (copy) and adds to its own char array # builder's char array is re-alloced if over 16 chars (copy) # builder's toString creates another string (copy) Instead of allocating all these objects and performing multiple byte/char encoding/decoding conversions, the byte array can be built in-place with a single final conversion to a string. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10653) Optimize conversion from path string to components
Daryn Sharp created HDFS-10653: -- Summary: Optimize conversion from path string to components Key: HDFS-10653 URL: https://issues.apache.org/jira/browse/HDFS-10653 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs Affects Versions: 2.0.0-alpha Reporter: Daryn Sharp Assignee: Daryn Sharp Converting a path String to a byte[][] currently requires an unnecessary intermediate conversion from String to String[]. Removing this will reduce excessive object allocation and byte copying. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
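The direct conversion described above can be sketched as a single scan of the path string (an illustrative simplification, not the actual DFSUtil code): each component is encoded straight to byte[], skipping the intermediate String[] produced by a split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch: one pass over the path, encoding each component directly,
// with no String[] array materialized along the way.
public class PathComponents {
    static byte[][] components(String path) {
        List<byte[]> parts = new ArrayList<>();
        int start = 1; // skip the leading '/'
        for (int i = 1; i <= path.length(); i++) {
            if (i == path.length() || path.charAt(i) == '/') {
                parts.add(path.substring(start, i)
                              .getBytes(StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        return parts.toArray(new byte[0][]);
    }

    public static void main(String[] args) {
        byte[][] parts = components("/user/daryn");
        System.out.println(parts.length); // prints "2"
    }
}
```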
[jira] [Created] (HDFS-10619) Cache path in InodesInPath
Daryn Sharp created HDFS-10619: -- Summary: Cache path in InodesInPath Key: HDFS-10619 URL: https://issues.apache.org/jira/browse/HDFS-10619 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs Reporter: Daryn Sharp Assignee: Daryn Sharp INodesInPath#getPath, a frequently called method, dynamically builds the path. IIP should cache the path upon construction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10616) Improve performance of path handling
Daryn Sharp created HDFS-10616: -- Summary: Improve performance of path handling Key: HDFS-10616 URL: https://issues.apache.org/jira/browse/HDFS-10616 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Affects Versions: 2.0.0-alpha Reporter: Daryn Sharp Assignee: Daryn Sharp Path handling in the namesystem and directory is very inefficient. The path is repeatedly resolved, decomposed into path components, recombined into a full path, and parsed again, throughout the system. This is directly inefficient for general performance, and indirectly via unnecessary pressure on young gen GC. The namesystem should only operate on paths, parsing each once into inodes, and the directory should only operate on inodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10343) BlockManager#createLocatedBlocks may return blocks on failed storages
Daryn Sharp created HDFS-10343: -- Summary: BlockManager#createLocatedBlocks may return blocks on failed storages Key: HDFS-10343 URL: https://issues.apache.org/jira/browse/HDFS-10343 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Affects Versions: 2.6.0 Reporter: Daryn Sharp Storage state is ignored when building the machines list. Failed storage removal is not immediate so clients may be directed to bad locations. The client recovers but it's less than ideal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-10342) BlockManager#createLocatedBlocks should not check corrupt replicas if none are corrupt
Daryn Sharp created HDFS-10342: -- Summary: BlockManager#createLocatedBlocks should not check corrupt replicas if none are corrupt Key: HDFS-10342 URL: https://issues.apache.org/jira/browse/HDFS-10342 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Affects Versions: 2.7.0 Reporter: Daryn Sharp {{corruptReplicas#isReplicaCorrupt(block, node)}} is called for every node while populating the machines array. There's no need to invoke the method if {{corruptReplicas#numCorruptReplicas(block)}} returned 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
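The proposed short-circuit above amounts to one cheap count check before the loop (hypothetical shapes below, not the actual BlockManager code): the per-node corrupt lookup is only consulted when the block has any corrupt replicas at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch: skip the per-node isReplicaCorrupt-style lookup entirely in the
// common case of zero corrupt replicas.
public class MachinesBuilder {
    static List<String> machines(List<String> nodes, int numCorrupt,
                                 Set<String> corruptNodes) {
        boolean anyCorrupt = numCorrupt > 0; // one cheap check up front
        List<String> out = new ArrayList<>();
        for (String node : nodes) {
            // per-node lookup happens only when some replica is corrupt
            if (anyCorrupt && corruptNodes.contains(node)) {
                continue;
            }
            out.add(node);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("dn1", "dn2", "dn3");
        System.out.println(machines(nodes, 0, Set.of()).size()); // prints "3"
    }
}
```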
[jira] [Created] (HDFS-10326) Disable setting tcp socket send/receive buffers for write pipelines
Daryn Sharp created HDFS-10326: -- Summary: Disable setting tcp socket send/receive buffers for write pipelines Key: HDFS-10326 URL: https://issues.apache.org/jira/browse/HDFS-10326 Project: Hadoop HDFS Issue Type: Bug Components: hdfs, datanode Affects Versions: 2.6.0 Reporter: Daryn Sharp Assignee: Daryn Sharp The DataStreamer and the Datanode use a hardcoded DEFAULT_DATA_SOCKET_SIZE=128K for the send and receive buffers of a write pipeline. Explicitly setting tcp buffer sizes disables tcp stack auto-tuning. The hardcoded value will saturate a 1Gb link with 1ms RTT, only 105Mbps at 10ms, and a paltry 11Mbps over a 100ms long haul. 10Gb networks are underutilized. There should either be a configuration to completely disable setting the buffers, or the setReceiveBuffer and setSendBuffer calls should be removed entirely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
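The auto-tuning point above can be seen with a plain java.net.Socket (a generic illustration, not the pipeline code): if the application never calls setSendBufferSize/setReceiveBufferSize, the OS default applies and the kernel remains free to grow the tcp window well past a hardcoded 128K.

```java
import java.net.Socket;

// Sketch: inspect the OS-default send buffer of an unconnected socket.
// Commenting out the explicit setter is exactly what leaves auto-tuning on.
public class SocketBufferDemo {
    public static void main(String[] args) throws Exception {
        Socket s = new Socket(); // unconnected; just inspecting defaults
        // s.setSendBufferSize(128 * 1024); // the hardcoded call to avoid
        int osDefault = s.getSendBufferSize(); // kernel-chosen, auto-tunable
        System.out.println(osDefault > 0); // prints "true"
        s.close();
    }
}
```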
[jira] [Created] (HDFS-9566) Remove expensive getStorages method
Daryn Sharp created HDFS-9566: - Summary: Remove expensive getStorages method Key: HDFS-9566 URL: https://issues.apache.org/jira/browse/HDFS-9566 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 3.0.0, 2.8.0 Reporter: Daryn Sharp Assignee: Daryn Sharp HDFS-5318 added a {{BlocksMap#getStorages(Block, State)}} which is based on iterables and predicates. The method is very expensive compared to a simple comparison/continue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
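The "simple comparison/continue" alternative above looks like this (the enum and map are stand-ins for DatanodeStorageInfo and its State, not the actual BlocksMap types): a plain loop allocates nothing per call, unlike an Iterable-plus-Predicate pipeline.

```java
import java.util.Map;

// Sketch: count storages in a given state with a bare loop instead of
// building iterables and predicate objects on every invocation.
public class StorageScan {
    enum State { NORMAL, FAILED }

    static int countInState(Map<String, State> storages, State wanted) {
        int n = 0;
        for (State s : storages.values()) {
            if (s != wanted) {
                continue; // simple comparison/continue, no garbage
            }
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Map<String, State> storages =
            Map.of("s1", State.NORMAL, "s2", State.FAILED, "s3", State.NORMAL);
        System.out.println(countInState(storages, State.NORMAL)); // prints "2"
    }
}
```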
[jira] [Created] (HDFS-9557) Reduce object allocation in PB conversion
Daryn Sharp created HDFS-9557: - Summary: Reduce object allocation in PB conversion Key: HDFS-9557 URL: https://issues.apache.org/jira/browse/HDFS-9557 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs-client Affects Versions: 2.8.0 Reporter: Daryn Sharp Assignee: Daryn Sharp PB conversions use {{ByteString.copyFrom}} to populate the builder. Unfortunately this creates unique instances for empty arrays instead of returning the singleton {{ByteString.EMPTY}}.
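The fix can be sketched with a stdlib stand-in for protobuf's ByteString (getByteString is a hypothetical helper, not the actual patch): zero-length inputs return a shared EMPTY singleton instead of a fresh copy per conversion.

```java
import java.nio.ByteBuffer;

// ByteBuffer stands in for com.google.protobuf.ByteString.
class ByteStringSketch {
    // stand-in for the ByteString.EMPTY singleton
    static final ByteBuffer EMPTY = ByteBuffer.allocate(0);

    // models ByteString.copyFrom: always allocates a new instance
    static ByteBuffer copyFrom(byte[] bytes) {
        return ByteBuffer.wrap(bytes.clone());
    }

    // the fix: skip the allocation for the common empty case
    static ByteBuffer getByteString(byte[] bytes) {
        return bytes.length == 0 ? EMPTY : copyFrom(bytes);
    }
}
```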
[jira] [Created] (HDFS-9287) Block placement completely fails if too many nodes are decommissioning
Daryn Sharp created HDFS-9287: - Summary: Block placement completely fails if too many nodes are decommissioning Key: HDFS-9287 URL: https://issues.apache.org/jira/browse/HDFS-9287 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.6.0 Reporter: Daryn Sharp Priority: Critical The DatanodeManager coordinates with the HeartbeatManager to update HeartbeatManager.Stats to track capacity and load. This is crucial for block placement to consider space and load. It's completely broken for decomm nodes. The heartbeat manager subtracts the prior values before it adds new values. During registration of a decomm node, it subtracts before seeding the initial values. This decrements nodesInService and flips the state to decomm, so the subsequent add does not increment nodesInService (which is correct). There are other math bugs (double adding) that accidentally work due to 0 values. The result is that every decomm node decrements the node count used for block placement. When enough nodes are decomm, the replication monitor will silently stop working. No logging. It searches all nodes and just gives up. Eventually, all block allocation will also completely fail. No files can be created. No jobs can be submitted.
[jira] [Created] (HDFS-9258) NN should indicate which nodes are stale
Daryn Sharp created HDFS-9258: - Summary: NN should indicate which nodes are stale Key: HDFS-9258 URL: https://issues.apache.org/jira/browse/HDFS-9258 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.0.0-alpha Reporter: Daryn Sharp Determining why the NN is not coming out of safemode is difficult - is it a bug or pending block reports? If the number of nodes appears sufficient, but there are missing blocks, it would be nice to know which nodes haven't block reported (stale). Instead of forcing the NN to leave safemode prematurely, the SE can first force block reports from stale nodes. The datanode report and the web ui's node list should contain this information.
[jira] [Created] (HDFS-9198) Coalesce IBR processing in the NN
Daryn Sharp created HDFS-9198: - Summary: Coalesce IBR processing in the NN Key: HDFS-9198 URL: https://issues.apache.org/jira/browse/HDFS-9198 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.0.0-alpha Reporter: Daryn Sharp Assignee: Daryn Sharp IBRs from thousands of DNs under load will degrade NN performance due to excessive write-lock contention from multiple IPC handler threads. The IBR processing is quick, so the lock contention may be reduced by coalescing multiple IBRs into a single write-lock transaction. The handlers will also be freed up faster for other operations.
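The coalescing idea can be sketched as follows (the queue, names, and batching policy are illustrative, not the HDFS-9198 patch): handler threads enqueue IBRs, and one drain pass applies the whole batch under a single write-lock acquisition instead of one acquisition per report.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of coalesced IBR processing.
class IbrCoalescingSketch {
    final ReentrantReadWriteLock fsnLock = new ReentrantReadWriteLock();
    final Queue<String> pendingIbrs = new ArrayDeque<>();
    int lockAcquisitions = 0;
    int processed = 0;

    // handler threads park the report instead of taking the write lock
    synchronized void enqueue(String ibr) { pendingIbrs.add(ibr); }

    // one write-lock acquisition covers the entire batch
    synchronized void processBatch() {
        fsnLock.writeLock().lock();
        lockAcquisitions++;
        try {
            String ibr;
            while ((ibr = pendingIbrs.poll()) != null) {
                processed++;  // apply the report to the block map here
            }
        } finally {
            fsnLock.writeLock().unlock();
        }
    }
}
```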
[jira] [Created] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC
Daryn Sharp created HDFS-9107: - Summary: Prevent NN's unrecoverable death spiral after full GC Key: HDFS-9107 URL: https://issues.apache.org/jira/browse/HDFS-9107 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.0.0-alpha Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Critical A full GC pause in the NN that exceeds the dead node interval can lead to an infinite cycle of full GCs. The most common situation that precipitates an unrecoverable state is a network issue that temporarily cuts off multiple racks. The NN wakes up and falsely starts marking nodes dead. This bloats the replication queues which increases memory pressure. The replications create a flurry of incremental block reports and a glut of over-replicated blocks. The "dead" nodes heartbeat within seconds. The NN forces a re-registration which requires a full block report - more memory pressure. The NN now has to invalidate all the over-replicated blocks. The extra blocks are added to invalidation queues, tracked in an excess blocks map, etc - much more memory pressure. All the memory pressure can push the NN into another full GC which repeats the entire cycle.
[jira] [Created] (HDFS-9017) UI shows wrong last contact for dead nodes
Daryn Sharp created HDFS-9017: - Summary: UI shows wrong last contact for dead nodes Key: HDFS-9017 URL: https://issues.apache.org/jira/browse/HDFS-9017 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.6.0 Reporter: Daryn Sharp Priority: Minor It's showing the last contact as the restart of the NN host (not process, host). Presumably it's using monotonic time 0. Ideally last contact for nodes that never connected would be "never" instead of the epoch or boot time.
[jira] [Created] (HDFS-8776) Decom manager should not be active on standby
Daryn Sharp created HDFS-8776: - Summary: Decom manager should not be active on standby Key: HDFS-8776 URL: https://issues.apache.org/jira/browse/HDFS-8776 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.6.0 Reporter: Daryn Sharp Assignee: Daryn Sharp The decommission manager should not be actively processing on the standby. The decomm manager goes through the costly computation of determining that every block on the node requires replication, yet doesn't queue them for replication - because it's in standby. While the decomm manager holds the namesystem write lock, DNs time out on heartbeats or IBRs; the NN purges the call queue of timed-out clients and processes some heartbeats/IBRs before the decomm manager locks up the namesystem again. Nodes attempting to register will be sending full BRs which are more costly to send and discard than a heartbeat. If a failover is required, the standby will likely struggle very hard not to GC while catching up on its queued IBRs while DNs continue to fill the call queue and time out.
[jira] [Created] (HDFS-8674) Improve performance of postponed block scans
Daryn Sharp created HDFS-8674: - Summary: Improve performance of postponed block scans Key: HDFS-8674 URL: https://issues.apache.org/jira/browse/HDFS-8674 Project: Hadoop HDFS Issue Type: Improvement Components: HDFS Affects Versions: 2.6.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Critical When a standby goes active, it marks all nodes as stale which will cause block invalidations for over-replicated blocks to be queued until full block reports are received from the nodes with the block. The replication monitor scans the queue with O(N) runtime. It picks a random offset and iterates through the set to randomize blocks scanned. The result is devastating when a cluster loses multiple nodes during a rolling upgrade. Re-replication occurs, the nodes come back, the excess block invalidations are postponed. Rescanning just 2k blocks out of millions of postponed blocks may take multiple seconds. During the scan, the write lock is held which stalls all other processing.
[jira] [Created] (HDFS-8675) IBRs from dead DNs go into infinite loop
Daryn Sharp created HDFS-8675: - Summary: IBRs from dead DNs go into infinite loop Key: HDFS-8675 URL: https://issues.apache.org/jira/browse/HDFS-8675 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Daryn Sharp If the DN sends an IBR after the NN declares it dead, the NN returns an IOE of unregistered or dead. The DN catches the IOE, ignores it, and infinitely loops spamming the NN with retries.
[jira] [Created] (HDFS-8616) Cherry pick HDFS-6495 for excess block leak
Daryn Sharp created HDFS-8616: - Summary: Cherry pick HDFS-6495 for excess block leak Key: HDFS-8616 URL: https://issues.apache.org/jira/browse/HDFS-8616 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.0.0-alpha Reporter: Daryn Sharp Busy clusters quickly leak tens or hundreds of thousands of excess blocks which slow BR processing. HDFS-6495 should be cherry picked into 2.7.x.
[jira] [Created] (HDFS-8498) Blocks can be committed with wrong size
Daryn Sharp created HDFS-8498: - Summary: Blocks can be committed with wrong size Key: HDFS-8498 URL: https://issues.apache.org/jira/browse/HDFS-8498 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.5.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Critical When an IBR for a UC block arrives, the NN updates the expected location's block and replica state _only_ if it's on an unexpected storage for an expected DN. If it's for an expected storage, only the genstamp is updated. When the block is committed, and the expected locations are verified, only the genstamp is checked. The size is not checked but it wasn't updated in the expected locations anyway. A faulty client may misreport the size when committing the block. The block is effectively corrupted. If the NN issues replications, the received IBR is considered corrupt, the NN invalidates the block, immediately issues another replication. The NN eventually realizes all the original replicas are corrupt after full BRs are received from the original DNs.
[jira] [Created] (HDFS-8491) DN shutdown race conditions with open xceivers
Daryn Sharp created HDFS-8491: - Summary: DN shutdown race conditions with open xceivers Key: HDFS-8491 URL: https://issues.apache.org/jira/browse/HDFS-8491 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Daryn Sharp DN shutdowns, at least for restarts, have many race conditions. Shutdown is very noisy with exceptions. The DN notifies writers of the restart, waits 1s, and then interrupts the xceiver threads but does not join. The ipc server is stopped and then the bpos services are stopped. Xceivers then encounter NPEs in closeBlock because the block no longer exists in the volume map when transient storage is checked. Just before that, the DN notifies the NN that the block was received. This does not appear to always be true, but rather that the thread was interrupted. The xceivers race with bpos shutdown to send the block-received notification, and luckily appear to lose.
[jira] [Created] (HDFS-8492) DN should notify NN when client requests a missing block
Daryn Sharp created HDFS-8492: - Summary: DN should notify NN when client requests a missing block Key: HDFS-8492 URL: https://issues.apache.org/jira/browse/HDFS-8492 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.0.0-alpha Reporter: Daryn Sharp If the DN has a block in its volume map but not on-disk, it tells clients it's an invalid block id. The NN is not informed of the missing block until either the bp slice scanner or the directory scanner detects the missing block. The DN should remove the replica from the volume map and inform the NN.
[jira] [Created] (HDFS-8402) Fsck exit codes are not reliable
Daryn Sharp created HDFS-8402: - Summary: Fsck exit codes are not reliable Key: HDFS-8402 URL: https://issues.apache.org/jira/browse/HDFS-8402 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.7.0 Reporter: Daryn Sharp Assignee: Daryn Sharp HDFS-6663 added the ability to check specific blocks. The exit code is non-deterministic: it is based on the state (corrupt, healthy, etc) of the last displayed block's last storage location - instead of whether any of the checked blocks' storages are corrupt. Blocks with decommissioning or decommissioned nodes should not be flagged as an error.
[jira] [Created] (HDFS-8133) Improve readability of deleted block check
Daryn Sharp created HDFS-8133: - Summary: Improve readability of deleted block check Key: HDFS-8133 URL: https://issues.apache.org/jira/browse/HDFS-8133 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.0.0-alpha Reporter: Daryn Sharp Assignee: Daryn Sharp The current means of checking if a block is deleted is checking if its block collection is null. A more readable approach is an isDeleted method.
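The refactor is tiny; a sketch with a stand-in for BlockInfo: name the null check so callers don't test the block collection directly.

```java
// Hypothetical stand-in for BlockInfo; the real class holds a
// BlockCollection reference that is nulled when the file is removed.
class BlockInfoSketch {
    private Object blockCollection;

    BlockInfoSketch(Object bc) { this.blockCollection = bc; }

    void delete() { blockCollection = null; }

    // the readability fix: a named predicate instead of a raw null check
    boolean isDeleted() { return blockCollection == null; }
}
```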
[jira] [Created] (HDFS-7990) IBR delete ack should not be delayed
Daryn Sharp created HDFS-7990: - Summary: IBR delete ack should not be delayed Key: HDFS-7990 URL: https://issues.apache.org/jira/browse/HDFS-7990 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.0.0-alpha Reporter: Daryn Sharp Assignee: Daryn Sharp HDFS-395 added the incremental BR feature. A concern was avoiding a race condition with ack-ing block invalidates followed by the directory scanner re-adding the block before the async delete service removes the block, possibly resulting in a full BR that includes the previously delete-acked block. The solution was to batch delay block deletion acks via a hardcoded interval of 100 * heartbeat interval (default: 5min). The delay isn't required: # The {{FSDatasetSpi}} tracks blocks pending deletion precisely, so the scanner won't re-add the block to the volume map # Block receiving, received, and deleted go into the same pending report. A block received event will trigger an immediate IBR which includes the deletion acks. I.e. the delay is meaningless for all but a quiescent cluster # Failing to promptly report deleted blocks on a quiescent cluster prevents the NN from updating the block maps to remove the locations
[jira] [Created] (HDFS-7967) Reduce the performance impact of the balancer
Daryn Sharp created HDFS-7967: - Summary: Reduce the performance impact of the balancer Key: HDFS-7967 URL: https://issues.apache.org/jira/browse/HDFS-7967 Project: Hadoop HDFS Issue Type: Sub-task Affects Versions: 2.0.0-alpha Reporter: Daryn Sharp Assignee: Daryn Sharp The balancer needs to query for blocks to move from overly full DNs. The block lookup is extremely inefficient. An iterator of the node's blocks is created from the iterators of its storages' blocks. A random number is chosen corresponding to how many blocks will be skipped via the iterator. Each skip requires costly scanning of triplets. The current design also only considers node imbalances while ignoring imbalances within the node's storages. A more efficient and intelligent design may eliminate the costly skipping of blocks via round-robin selection of blocks from the storages based on remaining capacity.
[jira] [Created] (HDFS-7964) Add support for async edit logging
Daryn Sharp created HDFS-7964: - Summary: Add support for async edit logging Key: HDFS-7964 URL: https://issues.apache.org/jira/browse/HDFS-7964 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode Affects Versions: 2.0.2-alpha Reporter: Daryn Sharp Assignee: Daryn Sharp Edit logging is a major source of contention within the NN. logEdit is called within the namespace write lock, while logSync is called outside of the lock to allow greater concurrency. The handler thread remains busy until logSync returns to provide the client with a durability guarantee for the response. Write-heavy RPC load and/or slow IO causes handlers to stall in logSync. Although the write lock is not held, readers are limited/starved and the call queue fills. Combining an edit log thread with postponed RPC responses from HADOOP-9953 will provide the same durability guarantee but immediately free up the handlers.
[jira] [Resolved] (HDFS-7586) HFTP does not work when namenode bind on wildcard
[ https://issues.apache.org/jira/browse/HDFS-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp resolved HDFS-7586. --- Resolution: Not a Problem HFTP does not work when namenode bind on wildcard - Key: HDFS-7586 URL: https://issues.apache.org/jira/browse/HDFS-7586 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.2.0, 2.3.0, 2.4.0 Reporter: Benoit Perroud Priority: Minor Attachments: HDFS-7586-v0.1.txt When wildcard binding for NameNode RPC is turned on (i.e. dfs.namenode.rpc-address=0.0.0.0:8020), HFTP download is failing. Call to http://namenode:50070/data/.. returns the header Location with parameter nnaddr=0.0.0.0:8020, which is unlikely to ever succeed :) The idea would be, if wildcard binding is enabled, to read the IP address the request actually connected to from the HttpServletRequest and return that one. WDYT? How to reproduce: 1. Turn on wildcard binding {code}dfs.namenode.rpc-address=0.0.0.0:8020{code} 2. Upload a file {code}$ echo 123 | hdfs dfs -put - /tmp/randomFile.txt{code} 3. Validate it's failing {code} $ hdfs dfs -cat hftp://namenode1/tmp/randomFile.txt {code} 4. Get more details via curl {code} $ curl -vv http://namenode1:50070/data/tmp/randomFile.txt?ugi=hdfs | grep Location: Location: http://datanode003:50075/streamFile/tmp/randomFile.txt?ugi=hdfs&nnaddr=0.0.0.0:8020 {code} We can clearly see the 0.0.0.0 returned as the NN ip.
[jira] [Created] (HDFS-7607) Use random rack-local node for webhdfs opens to avoid OOM on DNs
Daryn Sharp created HDFS-7607: - Summary: Use random rack-local node for webhdfs opens to avoid OOM on DNs Key: HDFS-7607 URL: https://issues.apache.org/jira/browse/HDFS-7607 Project: Hadoop HDFS Issue Type: Improvement Components: namenode, webhdfs Affects Versions: 2.0.0-alpha Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Critical Webhdfs currently redirects a client to the DN that physically has one of the replicas. Unlike the hdfs data streamer protocol which can easily handle hundreds or thousands of connections, jetty has poor performance under heavy load. Webhdfs clients can easily overwhelm the DNs and likely cause OOMs or excessive GC. The NN should redirect the client to a rack-local location to distribute the webhdfs load across multiple hosts. The rack-local node can then use the lightweight streamer protocol to fetch the data.
[jira] [Created] (HDFS-7597) Clients seeking over webhdfs may crash the NN
Daryn Sharp created HDFS-7597: - Summary: Clients seeking over webhdfs may crash the NN Key: HDFS-7597 URL: https://issues.apache.org/jira/browse/HDFS-7597 Project: Hadoop HDFS Issue Type: Improvement Components: webhdfs Affects Versions: 2.0.0-alpha Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Critical Webhdfs seeks involve closing the current connection, and reissuing a new open request with the new offset. The RPC layer caches connections so the DN keeps a lingering connection open to the NN. Connection caching is in part based on UGI. Although the client used the same token for the new offset request, the UGI is different, which forces the DN to open another unnecessary connection to the NN. A job that performs many seeks will easily crash the NN due to fd exhaustion.
[jira] [Created] (HDFS-7457) DatanodeID generates excessive garbage
Daryn Sharp created HDFS-7457: - Summary: DatanodeID generates excessive garbage Key: HDFS-7457 URL: https://issues.apache.org/jira/browse/HDFS-7457 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.0.0-alpha, 3.0.0 Reporter: Daryn Sharp Assignee: Daryn Sharp {{DatanodeID#getXferAddr}} is a dynamically generated string. This string is repeatedly generated for the hash code, equality, comparisons, and stringification. Every DN-NN RPC method calls {{DatanodeManager#getDatanode}} to validate if the node is registered, which involves a call to {{getXferAddr}}. The dynamic computation generates garbage that puts unnecessary pressure on the GC.
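One way to eliminate the per-call garbage, sketched with field names modeled on DatanodeID (this is illustrative, not the actual patch): build the transfer address once at construction and reuse the cached string for hashCode/equals.

```java
// Hypothetical sketch of a DatanodeID that caches its xfer address.
class DatanodeIdSketch {
    private final String xferAddr;  // built once; no per-call concatenation

    DatanodeIdSketch(String ip, int xferPort) {
        this.xferAddr = ip + ":" + xferPort;
    }

    String getXferAddr() { return xferAddr; }

    @Override public int hashCode() { return xferAddr.hashCode(); }

    @Override public boolean equals(Object o) {
        return o instanceof DatanodeIdSketch
            && ((DatanodeIdSketch) o).xferAddr.equals(xferAddr);
    }
}
```

Every hash, equality check, and registration lookup now reuses the same String instance instead of concatenating ip + ":" + port.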
[jira] [Created] (HDFS-7433) DatanodeMap is inefficient
Daryn Sharp created HDFS-7433: - Summary: DatanodeMap is inefficient Key: HDFS-7433 URL: https://issues.apache.org/jira/browse/HDFS-7433 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.0.0-alpha, 3.0.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Critical The datanode map is currently a {{TreeMap}}. For many thousands of datanodes, tree lookups are ~10X more expensive than a {{HashMap}}. Insertions and removals are up to 100X more expensive.
[jira] [Created] (HDFS-7434) DatanodeID hashCode should not be mutable
Daryn Sharp created HDFS-7434: - Summary: DatanodeID hashCode should not be mutable Key: HDFS-7434 URL: https://issues.apache.org/jira/browse/HDFS-7434 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.0.0-alpha, 3.0.0 Reporter: Daryn Sharp Mutable hash codes may lead to orphaned instances in a collection. Instances must always be removed prior to modification of hash code values, and re-inserted. Although current code appears to do this, the mutable hash code is a landmine.
[jira] [Created] (HDFS-7435) PB encoding of block reports is very inefficient
Daryn Sharp created HDFS-7435: - Summary: PB encoding of block reports is very inefficient Key: HDFS-7435 URL: https://issues.apache.org/jira/browse/HDFS-7435 Project: Hadoop HDFS Issue Type: Improvement Components: datanode, namenode Affects Versions: 2.0.0-alpha, 3.0.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Critical Block reports are encoded as a PB repeating long. Repeating fields use an {{ArrayList}} with default capacity of 10. A block report containing tens or hundreds of thousands of longs (3 for each replica) is extremely expensive since the {{ArrayList}} must realloc many times. Also, decoding repeating fields will box the primitive longs which must then be unboxed.
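The cost difference can be sketched in plain Java (the method names are illustrative, not protobuf's generated API): the boxed path grows an ArrayList from its default capacity of 10 and autoboxes every long, while a primitive array sized up front needs one allocation and no boxing.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch contrasting the two encodings of a block report's longs
// (3 longs per replica in the real report).
class BlockReportEncodingSketch {
    // models PB repeated-field decoding: default capacity 10, many
    // reallocations, one boxed Long per value
    static List<Long> boxedReport(long[] replicas) {
        List<Long> out = new ArrayList<>();
        for (long v : replicas) out.add(v);
        return out;
    }

    // the cheaper shape: one allocation, no boxing
    static long[] primitiveReport(long[] replicas) {
        long[] out = new long[replicas.length];
        System.arraycopy(replicas, 0, out, 0, replicas.length);
        return out;
    }
}
```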
[jira] [Created] (HDFS-7213) processIncrementalBlockReport performance degradation
Daryn Sharp created HDFS-7213: - Summary: processIncrementalBlockReport performance degradation Key: HDFS-7213 URL: https://issues.apache.org/jira/browse/HDFS-7213 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.0 Reporter: Daryn Sharp Assignee: Eric Payne Priority: Critical {{BlockManager#processIncrementalBlockReport}} has a debug line that is missing a {{isDebugEnabled}} check. The write lock is being held. Coupled with the increase in incremental block reports from receiving blocks, under heavy load this log line noticeably degrades performance.
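The fix pattern is the standard guarded log, sketched here with java.util.logging standing in for the logger the NN actually uses: the message string is only built when debug-level logging is enabled, so no concatenation happens while the write lock is held.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch of the guarded debug line (not the actual BlockManager code).
class IbrLoggingSketch {
    static final Logger LOG = Logger.getLogger("BlockManagerSketch");
    static int stringsBuilt = 0;

    // models the cost of building the log message
    static String describe(String nodeID) {
        stringsBuilt++;
        return "Processing incremental block report from " + nodeID;
    }

    static void processReport(String nodeID) {
        // the fix: check the level before paying for the message
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(describe(nodeID));
        }
        // ...IBR processing under the write lock...
    }
}
```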
[jira] [Created] (HDFS-7046) HA NN can NPE upon transition to active
Daryn Sharp created HDFS-7046: - Summary: HA NN can NPE upon transition to active Key: HDFS-7046 URL: https://issues.apache.org/jira/browse/HDFS-7046 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.5.0, 3.0.0 Reporter: Daryn Sharp Priority: Critical While processing edits, the NN may decide after adjusting block totals to leave safe mode - in the middle of the edit. Going active starts the secret manager which generates a new secret key, which in turn generates an edit, which NPEs because the edit log is not open. # Transitions should _not_ occur in the middle of an edit. # The edit log appears to claim it's open for write when the stream isn't even open
[jira] [Created] (HDFS-7005) DFS input streams do not timeout
Daryn Sharp created HDFS-7005: - Summary: DFS input streams do not timeout Key: HDFS-7005 URL: https://issues.apache.org/jira/browse/HDFS-7005 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 2.5.0, 3.0.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Critical Input streams lost their timeout. The problem appears to be {{DFSClient#newConnectedPeer}} does not set the read timeout. During a temporary network interruption the server will close the socket, unbeknownst to the client host, which blocks on a read forever. The results are dire. Services such as the RM, JHS, NMs, oozie servers, etc all need to be restarted to recover - unless you want to wait many hours for the tcp stack keepalive to detect the broken socket.
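The missing step can be sketched at the socket level (the helper name and timeout plumbing are illustrative, not the DFSClient patch): bound both the connect and every subsequent read so a silently-dropped connection cannot block a reader forever.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch of what DFSClient#newConnectedPeer is missing.
class ConnectedPeerSketch {
    static Socket newConnectedSocket(InetSocketAddress addr, int timeoutMs)
            throws IOException {
        Socket s = new Socket();
        s.connect(addr, timeoutMs);  // bounded connect
        s.setSoTimeout(timeoutMs);   // the fix: bounded reads too
        return s;
    }
}
```

With SO_TIMEOUT set, a stalled read throws SocketTimeoutException instead of blocking until the tcp keepalive fires hours later.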
[jira] [Created] (HDFS-6964) NN fails to fix under replication leading to data loss
Daryn Sharp created HDFS-6964: - Summary: NN fails to fix under replication leading to data loss Key: HDFS-6964 URL: https://issues.apache.org/jira/browse/HDFS-6964 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.0.0-alpha, 3.0.0 Reporter: Daryn Sharp Priority: Blocker We've encountered lost blocks due to node failure even when there is ample time to fix the under-replication. 2 nodes were lost. The 3rd node with the last remaining replicas averaged 1 block copy per heartbeat (3s) until ~7h later when that node was lost, resulting in over 50 lost blocks. When the node was restarted and sent its BR the NN immediately began fixing the replication. In another data loss event, over 150 blocks were lost due to node failure but the timing of the node loss is not known so there may have been inadequate time to fix the under-replication unlike the first case.
[jira] [Created] (HDFS-6967) DNs may OOM under high webhdfs load
Daryn Sharp created HDFS-6967: - Summary: DNs may OOM under high webhdfs load Key: HDFS-6967 URL: https://issues.apache.org/jira/browse/HDFS-6967 Project: Hadoop HDFS Issue Type: Bug Components: datanode, webhdfs Affects Versions: 2.0.0-alpha, 3.0.0 Reporter: Daryn Sharp Webhdfs uses jetty. The size of the request thread pool is limited, but jetty will accept and queue infinite connections. Every queued connection is heavy with buffers, etc. Unlike data streamer connections, thousands of webhdfs connections will quickly OOM a DN. The accepted requests must be bounded and excess clients rejected so they retry on a new DN.
[jira] [Created] (HDFS-6948) DN rejects blocks if it has older UC block
Daryn Sharp created HDFS-6948: - Summary: DN rejects blocks if it has older UC block Key: HDFS-6948 URL: https://issues.apache.org/jira/browse/HDFS-6948 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.0.0-alpha, 3.0.0 Reporter: Daryn Sharp DNs appear to always reject blocks, even with newer genstamps, if they already have a UC copy in their tmp dir. {noformat}ReplicaAlreadyExistsException: Block XXX already exists in state TEMPORARY and thus cannot be created{noformat}
[jira] [Created] (HDFS-6773) MiniDFSCluster can run dramatically faster
Daryn Sharp created HDFS-6773: - Summary: MiniDFSCluster can run dramatically faster Key: HDFS-6773 URL: https://issues.apache.org/jira/browse/HDFS-6773 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.0.0-alpha, 3.0.0 Reporter: Daryn Sharp The mini cluster is unnecessarily running with durable edit logs. The following change cut runtime of a single test from ~30s to ~10s. {code}EditLogFileOutputStream.setShouldSkipFsyncForTesting(true);{code} The mini cluster should default to this behavior after identifying the few edit log tests that probably depend on durable logs.