[jira] [Created] (HDFS-13123) RBF: Add a balancer tool to move data across subclusters
Wei Yan created HDFS-13123:
--
Summary: RBF: Add a balancer tool to move data across subclusters
Key: HDFS-13123
URL: https://issues.apache.org/jira/browse/HDFS-13123
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Wei Yan
Assignee: Wei Yan

Following the discussion in HDFS-12615, this Jira tracks the effort of building a rebalancer tool, used by Router-based Federation to move data among subclusters.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-13122) FSImage should not update quota counts on ObserverNode
Erik Krogen created HDFS-13122:
--
Summary: FSImage should not update quota counts on ObserverNode
Key: HDFS-13122
URL: https://issues.apache.org/jira/browse/HDFS-13122
Project: Hadoop HDFS
Issue Type: Sub-task
Components: hdfs, namenode
Reporter: Erik Krogen
Assignee: Erik Krogen

Currently in {{FSImage#loadEdits()}}, after applying a set of edits, we call
{code}
updateCountForQuota(target.getBlockManager().getStoragePolicySuite(), target.dir.rootDir);
{code}
to update the quota counts for the entire namespace, which can be very expensive. This makes sense if we are about to become the ANN, since we need valid quotas, but not on an ObserverNode, which does not need to enforce quotas. This is related to increasing the frequency with which the SbNN can tail edits from the ANN, to decrease the lag time before transactions appear on the Observer.
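One possible shape of the change described above, as a minimal self-contained sketch: gate the expensive quota recomputation on whether this node can ever become Active. The class, field, and flag names below are illustrative stand-ins, not Hadoop's actual code.

```java
// Sketch: skip the full-namespace quota recomputation on an Observer,
// since an Observer never enforces quotas. Names are hypothetical.
public class LoadEditsSketch {
    static int quotaUpdates = 0; // counts how often the expensive path runs

    // Stand-in for the expensive walk over the entire namespace.
    static void updateCountForQuota() {
        quotaUpdates++;
    }

    // After applying a batch of edits, only recompute quota counts when
    // this node may transition to Active; an Observer skips the walk.
    static void loadEdits(boolean isObserver) {
        // ... apply edit log segments here ...
        if (!isObserver) {
            updateCountForQuota();
        }
    }

    public static void main(String[] args) {
        loadEdits(true);  // Observer: quota update skipped
        loadEdits(false); // Standby about to become Active: quotas updated
    }
}
```

With frequent edit tailing, skipping this walk on every `loadEdits()` call is what keeps the Observer's transaction lag low.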
[jira] [Created] (HDFS-13121) NPE when requesting file descriptors during short-circuit read
Gang Xie created HDFS-13121:
---
Summary: NPE when requesting file descriptors during short-circuit read
Key: HDFS-13121
URL: https://issues.apache.org/jira/browse/HDFS-13121
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs-client
Affects Versions: 3.0.0
Reporter: Gang Xie

Recently we hit an issue where the DFSClient throws an NPE. The scenario is that the application process exceeds its max-open-files limit. In that case, libhadoop never throws an exception but returns null for the requested fds; {{requestFileDescriptors}} then uses the returned fds directly without any check, which leads to the NPE. We need to add a null-pointer sanity check here.

{code}
private ShortCircuitReplicaInfo requestFileDescriptors(DomainPeer peer,
    Slot slot) throws IOException {
  ShortCircuitCache cache = clientContext.getShortCircuitCache();
  final DataOutputStream out =
      new DataOutputStream(new BufferedOutputStream(peer.getOutputStream()));
  SlotId slotId = slot == null ? null : slot.getSlotId();
  new Sender(out).requestShortCircuitFds(block, token, slotId, 1,
      failureInjector.getSupportsReceiptVerification());
  DataInputStream in = new DataInputStream(peer.getInputStream());
  BlockOpResponseProto resp = BlockOpResponseProto.parseFrom(
      PBHelperClient.vintPrefixed(in));
  DomainSocket sock = peer.getDomainSocket();
  failureInjector.injectRequestFileDescriptorsFailure();
  switch (resp.getStatus()) {
  case SUCCESS:
    byte buf[] = new byte[1];
    FileInputStream[] fis = new FileInputStream[2];
    sock.recvFileInputStreams(fis, buf, 0, buf.length);  // fis entries may stay null
    ShortCircuitReplica replica = null;
    try {
      ExtendedBlockId key = new ExtendedBlockId(block.getBlockId(),
          block.getBlockPoolId());
      if (buf[0] == USE_RECEIPT_VERIFICATION.getNumber()) {
        LOG.trace("Sending receipt verification byte for slot {}", slot);
        sock.getOutputStream().write(0);
      }
      replica = new ShortCircuitReplica(key, fis[0], fis[1], cache,  // NPE here
          Time.monotonicNow(), slot);
{code}
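A minimal sketch of the sanity check the reporter proposes. The helper class and method below are illustrative, not the actual patch; the real check would live inline in {{requestFileDescriptors}} right after {{recvFileInputStreams}}.

```java
import java.io.FileInputStream;
import java.io.IOException;

// Sketch: fail with a descriptive IOException instead of an NPE when the
// domain socket hands back no file descriptors (e.g. because the process
// hit its max-open-files limit). Hypothetical helper, not Hadoop code.
public class FdSanityCheck {
    static void checkFds(FileInputStream[] fis) throws IOException {
        if (fis == null || fis[0] == null || fis[1] == null) {
            throw new IOException(
                "Failed to receive file descriptors for short-circuit read; "
                + "the process may have exceeded its open-file limit");
        }
    }

    public static void main(String[] args) {
        try {
            checkFds(new FileInputStream[2]); // both entries null -> rejected
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

An IOException here surfaces the real cause (fd exhaustion) to the caller, whereas the current NPE hides it.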
Re: [DISCUSS] Meetup for HDFS tests and build infra
Created a poll [1] to inform scheduling. -C

[1]: https://doodle.com/poll/r22znitzae9apfbf

On Tue, Feb 6, 2018 at 3:09 PM, Chris Douglas wrote:
> The HDFS build is not healthy. Many of the unit tests aren't actually
> run in Jenkins due to resource exhaustion, haven't been updated since
> build/test/data was the test temp dir, or are chronically unstable
> (I'm looking at you, TestDFSStripedOutputStreamWithFailure). The
> situation has deteriorated slowly, but we can't confidently merge
> patches, let alone significant features, when our CI infra is in this
> state.
>
> How would folks feel about a half- to full-day meetup to work through
> patches improving this, specifically? We can improve tests,
> troubleshoot the build, and rev/commit existing patches. It would
> require some preparation, so that the simultaneous attention is
> productive and not a coordination bottleneck. I started a wiki page
> for this [1], please add to it.
>
> If enough people can make time for this, say in 2-3 weeks, the project
> would certainly benefit. -C
>
> [1]: https://s.apache.org/ng3C
[jira] [Created] (HDFS-13120) Snapshot diff could be corrupted after concat
Xiaoyu Yao created HDFS-13120:
-
Summary: Snapshot diff could be corrupted after concat
Key: HDFS-13120
URL: https://issues.apache.org/jira/browse/HDFS-13120
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode, snapshots
Reporter: Xiaoyu Yao
Assignee: Xiaoyu Yao

The snapshot diff can be corrupted after concatenating files, which can lead to an AssertionError upon later DeleteSnapshot and getSnapshotDiff operations. For example, we have seen customers hit a stack trace similar to the one below, but while loading a DeleteSnapshotOp edit entry. After investigation, we found this is a regression caused by HDFS-3689, where the snapshot diff is not fully cleaned up after concat. I will post a unit test to repro this and a fix for it shortly.

{code}
org.apache.hadoop.ipc.RemoteException(java.lang.AssertionError): Element already exists: element=0.txt, CREATED=[0.txt, 1.txt, 2.txt]
	at org.apache.hadoop.hdfs.util.Diff.insert(Diff.java:196)
	at org.apache.hadoop.hdfs.util.Diff.create(Diff.java:216)
	at org.apache.hadoop.hdfs.util.Diff.combinePosterior(Diff.java:463)
	at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff.combinePosteriorAndCollectBlocks(DirectoryWithSnapshotFeature.java:205)
	at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff.combinePosteriorAndCollectBlocks(DirectoryWithSnapshotFeature.java:162)
	at org.apache.hadoop.hdfs.server.namenode.snapshot.AbstractINodeDiffList.deleteSnapshotDiff(AbstractINodeDiffList.java:100)
	at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:728)
	at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:830)
	at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.removeSnapshot(DirectorySnapshottableFeature.java:237)
	at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.removeSnapshot(INodeDirectory.java:292)
	at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.deleteSnapshot(SnapshotManager.java:321)
	at org.apache.hadoop.hdfs.server.namenode.FSDirSnapshotOp.deleteSnapshot(FSDirSnapshotOp.java:249)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteSnapshot(FSNamesystem.java:6566)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.deleteSnapshot(NameNodeRpcServer.java:1823)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.deleteSnapshot(ClientNamenodeProtocolServerSideTranslatorPB.java:1200)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1007)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:873)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:819)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1965)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2679)
{code}
[jira] [Created] (HDFS-13119) RBF: manage unavailable clusters
Íñigo Goiri created HDFS-13119:
--
Summary: RBF: manage unavailable clusters
Key: HDFS-13119
URL: https://issues.apache.org/jira/browse/HDFS-13119
Project: Hadoop HDFS
Issue Type: Sub-task
Reporter: Íñigo Goiri

When a federated cluster has one of its subclusters down, operations that run in every subcluster ({{RouterRpcClient#invokeAll()}}) may take all the RPC connections.
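One common mitigation for this failure mode, sketched below, is to cap the number of in-flight RPCs any single subcluster may hold, so an unresponsive subcluster cannot absorb the whole connection pool. This is only an illustration of the pattern; the class and method names are hypothetical, not from {{RouterRpcClient}}.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Sketch: per-subcluster permits bounding concurrent RPCs, so a dead
// subcluster fails fast instead of exhausting the Router's connections.
public class SubclusterPermits {
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();
    private final int maxPerSubcluster;

    public SubclusterPermits(int maxPerSubcluster) {
        this.maxPerSubcluster = maxPerSubcluster;
    }

    /** Try to reserve a connection slot; false means fail fast. */
    public boolean acquire(String nameservice) {
        return permits
            .computeIfAbsent(nameservice, ns -> new Semaphore(maxPerSubcluster))
            .tryAcquire();
    }

    /** Return the slot once the RPC completes or times out. */
    public void release(String nameservice) {
        Semaphore s = permits.get(nameservice);
        if (s != null) {
            s.release();
        }
    }
}
```

A caller fanning out in {{invokeAll()}} would acquire before dispatching to each subcluster and release in a finally block, skipping (or quickly failing) subclusters whose permits are exhausted.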
Apache Hadoop qbt Report: branch2+JDK7 on Linux/x86
For more details, see https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/129/

[Feb 6, 2018 8:04:52 PM] (billie) YARN-7890. NPE during container relaunch. Contributed by Jason Lowe
[Feb 6, 2018 9:36:32 PM] (kihwal) HADOOP-15212. Add independent secret manager method for logging expired

-1 overall

The following subsystems voted -1:
    unit xml

The following subsystems voted -1 but were configured to be filtered/ignored:
    cc checkstyle javac javadoc pylint shellcheck shelldocs whitespace

The following subsystems are considered long running (runtime bigger than 1h 0m 0s):
    unit

Specific tests:

    Unreaped Processes:
        hadoop-common:1 hadoop-hdfs:22 bkjournal:7
        hadoop-mapreduce-client-jobclient:13 hadoop-archives:1
        hadoop-distcp:6 hadoop-extras:1 hadoop-gridmix:1
        hadoop-yarn-applications-distributedshell:1 hadoop-yarn-client:6
        hadoop-yarn-server-timelineservice:1

    Failed junit tests:
        hadoop.fs.http.server.TestHttpFSServerNoACLs
        hadoop.ipc.TestMRCJCSocketFactory
        hadoop.mapred.TestClusterMRNotification
        hadoop.tools.TestIntegration
        hadoop.tools.util.TestProducerConsumer
        hadoop.tools.TestDistCpViewFs
        hadoop.resourceestimator.solver.impl.TestLpSolver
        hadoop.resourceestimator.service.TestResourceEstimatorService
        hadoop.yarn.sls.appmaster.TestAMSimulator
        hadoop.yarn.server.nodemanager.containermanager.linux.runtime.TestDockerContainerRuntime
        hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication

    Timed out junit tests:
        org.apache.hadoop.log.TestLogLevel
        org.apache.hadoop.hdfs.TestWriteRead
        org.apache.hadoop.hdfs.TestDatanodeRegistration
        org.apache.hadoop.hdfs.TestReservedRawPaths
        org.apache.hadoop.hdfs.TestAclsEndToEnd
        org.apache.hadoop.hdfs.TestFileCreation
        org.apache.hadoop.hdfs.TestDatanodeDeath
        org.apache.hadoop.hdfs.TestSafeMode
        org.apache.hadoop.hdfs.TestBlockMissingException
        org.apache.hadoop.hdfs.TestDFSClientRetries
        org.apache.hadoop.hdfs.TestFileAppend2
        org.apache.hadoop.hdfs.TestFileCorruption
        org.apache.hadoop.hdfs.TestFileCreationDelete
        org.apache.hadoop.hdfs.TestDFSAddressConfig
        org.apache.hadoop.hdfs.TestSeekBug
        org.apache.hadoop.hdfs.TestDFSInputStream
        org.apache.hadoop.hdfs.TestRestartDFS
        org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitCache
        org.apache.hadoop.hdfs.TestDFSClientSocketSize
        org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead
        org.apache.hadoop.hdfs.TestDFSRollback
        org.apache.hadoop.hdfs.TestDFSClientExcludedNodes
        org.apache.hadoop.hdfs.TestAbandonBlock
        org.apache.hadoop.contrib.bkjournal.TestBootstrapStandbyWithBKJM
        org.apache.hadoop.contrib.bkjournal.TestBookKeeperJournalManager
        org.apache.hadoop.contrib.bkjournal.TestBookKeeperHACheckpoints
        org.apache.hadoop.contrib.bkjournal.TestBookKeeperAsHASharedDir
        org.apache.hadoop.contrib.bkjournal.TestBookKeeperEditLogStreams
        org.apache.hadoop.contrib.bkjournal.TestBookKeeperSpeculativeRead
        org.apache.hadoop.contrib.bkjournal.TestCurrentInprogress
        org.apache.hadoop.mapred.lib.TestDelegatingInputFormat
        org.apache.hadoop.mapred.TestMRCJCFileInputFormat
        org.apache.hadoop.mapred.TestClusterMapReduceTestCase
        org.apache.hadoop.mapred.TestMRIntermediateDataEncryption
        org.apache.hadoop.mapred.TestJobSysDirWithDFS
        org.apache.hadoop.mapred.TestMRTimelineEventHandling
        org.apache.hadoop.mapred.TestSpecialCharactersInOutputPath
        org.apache.hadoop.mapred.TestNetworkedJob
        org.apache.hadoop.mapred.TestMiniMRClientCluster
        org.apache.hadoop.mapred.TestReduceFetchFromPartialMem
        org.apache.hadoop.mapred.TestReduceFetch
        org.apache.hadoop.mapred.TestMROpportunisticMaps
        org.apache.hadoop.tools.TestHadoopArchives
        org.apache.hadoop.tools.TestDistCpWithAcls
        org.apache.hadoop.tools.TestDistCpSync
        org.apache.hadoop.tools.TestDistCpWithXAttrs
        org.apache.hadoop.tools.TestDistCpSyncReverseFromTarget
        org.apache.hadoop.tools.TestDistCpSystem
        org.apache.hadoop.tools.TestDistCpSyncReverseFromSource
        org.apache.hadoop.tools.TestCopyFiles
        org.apache.hadoop.mapred.gridmix.TestSleepJob
        org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell
        org.apache.hadoop.yarn.client.api.impl.TestAMRMProxy
        org.apache.hadoop.yarn.client.TestRMFailover
        org.apache.hadoop.yarn.client.cli.TestYarnCLI
        org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA
        org.apache.hadoop.yarn.client.api.impl.TestYarnClientWithReservation
[jira] [Resolved] (HDFS-13105) Make hadoop proxy user changes reconfigurable in Datanode
[ https://issues.apache.org/jira/browse/HDFS-13105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mukul Kumar Singh resolved HDFS-13105.
--
Resolution: Not A Problem

As pointed out by [~kihwal] and [~rajive], -refreshSuperUserGroupsConfiguration already provides a way to update proxy user information on the NN.

> Make hadoop proxy user changes reconfigurable in Datanode
> -
>
> Key: HDFS-13105
> URL: https://issues.apache.org/jira/browse/HDFS-13105
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Mukul Kumar Singh
> Assignee: Mukul Kumar Singh
> Priority: Major
>
> Currently any change to add or delete a proxy user requires a DN restart,
> causing downtime. This jira proposes to make the proxy-user configuration
> reconfigurable via the ReconfigurationProtocol, so that changes can take
> effect without a DN restart. For details please refer to
> https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/Superusers.html.
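For reference, the refresh path mentioned in the resolution is exposed through dfsadmin; after editing the proxy-user properties in core-site.xml, the new values can be pushed to the NameNode without a restart (this assumes a running cluster and appropriate admin privileges):

```shell
# Reload the superuser/proxy-user mappings on the active NameNode
# after changing hadoop.proxyuser.* settings in core-site.xml.
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
```

Note this refreshes the NameNode only; making the equivalent change reconfigurable on the DataNode was the subject of this (now-resolved) jira.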
[jira] [Reopened] (HDFS-11187) Optimize disk access for last partial chunk checksum of Finalized replica
[ https://issues.apache.org/jira/browse/HDFS-11187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Bota reopened HDFS-11187:
---
Assignee: Gabor Bota (was: Wei-Chiu Chuang)

Reopening this to add the change to branch-2.

> Optimize disk access for last partial chunk checksum of Finalized replica
> -
>
> Key: HDFS-11187
> URL: https://issues.apache.org/jira/browse/HDFS-11187
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Reporter: Wei-Chiu Chuang
> Assignee: Gabor Bota
> Priority: Major
> Fix For: 3.1.0, 3.0.2
>
> Attachments: HDFS-11187.001.patch, HDFS-11187.002.patch,
> HDFS-11187.003.patch, HDFS-11187.004.patch, HDFS-11187.005.patch
>
> The patch at HDFS-11160 ensures BlockSender reads the correct version of
> the metafile when there are concurrent writers. However, the implementation
> is not optimal, because it must always read the last partial chunk checksum
> from disk, while holding the FsDatasetImpl lock, for every reader. It is
> possible to optimize this by keeping an up-to-date version of the last
> partial checksum in memory, reducing disk access.
> I am separating the optimization into a new jira, because maintaining the
> state of the in-memory checksum requires a lot more work.
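The caching optimization described above can be sketched in a few lines: keep the last partial chunk checksum in memory and hit the meta file only when the cache does not match the current block length. This is a simplified, self-contained illustration; the class, field, and method names are hypothetical, not from FsDatasetImpl.

```java
import java.util.Arrays;

// Sketch: cache the last partial chunk checksum in memory so readers do
// not re-read it from the on-disk meta file on every request.
public class CachedPartialChunkChecksum {
    private byte[] lastChecksum;           // cached copy, null until first load
    private long validForBlockLength = -1; // block length the cache matches
    public int diskReads = 0;              // instrumentation for this sketch

    /** Simulates the expensive path: reading the checksum from disk. */
    private byte[] readFromMetaFile(long blockLength) {
        diskReads++;
        // In the real code this would seek into the replica's meta file.
        return new byte[] {(byte) (blockLength % 251)};
    }

    /** Returns the last partial chunk checksum, from cache when possible. */
    public synchronized byte[] getLastChecksum(long currentBlockLength) {
        if (lastChecksum == null || validForBlockLength != currentBlockLength) {
            lastChecksum = readFromMetaFile(currentBlockLength); // cache miss
            validForBlockLength = currentBlockLength;
        }
        return Arrays.copyOf(lastChecksum, lastChecksum.length);
    }
}
```

The hard part the jira alludes to is invalidation: every append that changes the partial chunk must update the cached value, which is why maintaining this in-memory state "requires a lot more work" than the read path itself.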
[jira] [Created] (HDFS-13118) SnapshotDiffReport should provide the INode type
Ewan Higgs created HDFS-13118:
-
Summary: SnapshotDiffReport should provide the INode type
Key: HDFS-13118
URL: https://issues.apache.org/jira/browse/HDFS-13118
Project: Hadoop HDFS
Issue Type: Bug
Components: snapshots
Reporter: Ewan Higgs

Currently the snapshot diff report lists which inodes were added, removed, renamed, etc. But to see what an INode actually is, we need to access the underlying snapshot, which is cumbersome to do programmatically when the snapshot diff already has that information.
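One possible shape for the improvement, as a hypothetical sketch rather than the committed API: carry an inode-type tag alongside each diff entry so callers never need to re-resolve paths against the snapshot.

```java
// Sketch: a snapshot-diff entry that carries the inode type directly.
// All names here are illustrative, not the actual SnapshotDiffReport API.
public class DiffEntrySketch {
    enum INodeType { FILE, DIRECTORY, SYMLINK }
    enum DiffType { CREATE, DELETE, MODIFY, RENAME }

    final DiffType diffType;
    final INodeType inodeType;   // the proposed additional information
    final String path;

    DiffEntrySketch(DiffType diffType, INodeType inodeType, String path) {
        this.diffType = diffType;
        this.inodeType = inodeType;
        this.path = path;
    }

    @Override
    public String toString() {
        // Renders in the spirit of `hdfs snapshotDiff` output, e.g. "+ FILE /dir/0.txt"
        char op = diffType == DiffType.CREATE ? '+'
                : diffType == DiffType.DELETE ? '-'
                : diffType == DiffType.RENAME ? 'R' : 'M';
        return op + " " + inodeType + " " + path;
    }
}
```

With the type embedded, a consumer replicating snapshots (or auditing changes) can distinguish a created file from a created directory without a second round trip.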