[jira] [Commented] (HDFS-7037) Using distcp to copy data from insecure to secure cluster via hftp doesn't work (branch-2 only)

2015-01-21 Thread Haohui Mai (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286085#comment-14286085
 ] 

Haohui Mai commented on HDFS-7037:
--

Note that distcp over webhdfs has the same issue, as discussed extensively in 
HDFS-6776. This should be fixed in distcp, not in 
{{HftpFileSystem}}.
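
As a rough illustration of the distcp-side approach (a minimal sketch only; the
helper below is hypothetical and is not the HDFS-6776 or HDFS-7037 patch), the
job setup could simply tolerate a failed delegation-token fetch when the source
cluster turns out to be insecure:

{code}
// Hypothetical sketch, not the actual patch: skip tokens for insecure sources.
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.Credentials;

class InsecureSourceTokenHelper {
  static void tryAddSourceTokens(FileSystem srcFs, Path src, Credentials creds) {
    try {
      // A secure cluster hands out a delegation token here; an insecure one
      // fails the call because it has nothing to issue.
      srcFs.addDelegationTokens("distcp", creds);
    } catch (IOException e) {
      // Continue without a token rather than aborting the whole copy listing.
      System.err.println("No delegation token from " + src
          + ", assuming an insecure source cluster: " + e);
    }
  }
}
{code}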

 Using distcp to copy data from insecure to secure cluster via hftp doesn't 
 work  (branch-2 only)
 

 Key: HDFS-7037
 URL: https://issues.apache.org/jira/browse/HDFS-7037
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: security, tools
Affects Versions: 2.6.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
 Attachments: HDFS-7037.001.patch


 This is a branch-2 only issue since hftp is only supported there. 
 Issuing distcp hftp://insecureCluster hdfs://secureCluster gave the 
 following failure exception:
 {code}
 14/09/13 22:07:40 INFO tools.DelegationTokenFetcher: Error when dealing 
 remote token:
 java.io.IOException: Error when dealing remote token: Internal Server Error
   at 
 org.apache.hadoop.hdfs.tools.DelegationTokenFetcher.run(DelegationTokenFetcher.java:375)
   at 
 org.apache.hadoop.hdfs.tools.DelegationTokenFetcher.getDTfromRemote(DelegationTokenFetcher.java:238)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$2.run(HftpFileSystem.java:252)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$2.run(HftpFileSystem.java:247)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.getDelegationToken(HftpFileSystem.java:247)
   at 
 org.apache.hadoop.hdfs.web.TokenAspect.ensureTokenInitialized(TokenAspect.java:140)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.addDelegationTokenParam(HftpFileSystem.java:337)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.openConnection(HftpFileSystem.java:324)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:457)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.getFileStatus(HftpFileSystem.java:472)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.getFileStatus(HftpFileSystem.java:501)
   at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
   at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1623)
   at 
 org.apache.hadoop.tools.GlobbedCopyListing.doBuildListing(GlobbedCopyListing.java:77)
   at org.apache.hadoop.tools.CopyListing.buildListing(CopyListing.java:81)
   at 
 org.apache.hadoop.tools.DistCp.createInputFileListing(DistCp.java:342)
   at org.apache.hadoop.tools.DistCp.execute(DistCp.java:154)
   at org.apache.hadoop.tools.DistCp.run(DistCp.java:121)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
   at org.apache.hadoop.tools.DistCp.main(DistCp.java:390)
 14/09/13 22:07:40 WARN security.UserGroupInformation: 
 PriviledgedActionException as:hadoopu...@xyz.com (auth:KERBEROS) 
 cause:java.io.IOException: Unable to obtain remote token
 14/09/13 22:07:40 ERROR tools.DistCp: Exception encountered 
 java.io.IOException: Unable to obtain remote token
   at 
 org.apache.hadoop.hdfs.tools.DelegationTokenFetcher.getDTfromRemote(DelegationTokenFetcher.java:249)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$2.run(HftpFileSystem.java:252)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$2.run(HftpFileSystem.java:247)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.getDelegationToken(HftpFileSystem.java:247)
   at 
 org.apache.hadoop.hdfs.web.TokenAspect.ensureTokenInitialized(TokenAspect.java:140)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.addDelegationTokenParam(HftpFileSystem.java:337)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.openConnection(HftpFileSystem.java:324)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:457)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.getFileStatus(HftpFileSystem.java:472)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.getFileStatus(HftpFileSystem.java:501)
   at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
   at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
   at 

[jira] [Commented] (HDFS-7645) Rolling upgrade is restoring blocks from trash multiple times

2015-01-21 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286128#comment-14286128
 ] 

Arpit Agarwal commented on HDFS-7645:
-

The first restore was by design when the rolling upgrade feature was added 
(HDFS-6005). It simplified the rollback procedure by not requiring the 
{{-rollback}} flag to the DataNode, so regular startup/rollback could be 
treated similarly by restoring from trash.

HDFS-6800 added back the requirement to pass the {{-rollback}} flag during RU 
rollback, to support layout changes. The second restore was a side effect of 
the same fix. We can probably eliminate both restores now.

DN layout changes will be rare for minor/point releases. I am wary of 
eliminating trash without some numbers showing hard link performance with 
millions of blocks is on par with trash. Even a few seconds per DN adds up to 
many hours/days when upgrading thousands of DNs sequentially. Once we fix this 
issue raised by Nathan, the overhead of trash as compared to regular startup is 
nil. 
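
For illustration only, a minimal sketch of the direction above: restore trash
solely on an explicit rolling-upgrade rollback. The guard is hypothetical (it
reuses names from the snippet quoted below, but it is not the committed fix):

{code}
// Hypothetical guard, not the committed fix: restore trash only when the DN is
// started with -rollback; regular startup and downgrade leave trash alone.
if (startOpt == StartupOption.ROLLBACK && getTrashRootDir(sd).exists()) {
  int restored = restoreBlockFilesFromTrash(getTrashRootDir(sd));
  LOG.info("Restored " + restored + " block files from trash for rollback.");
}
{code}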



 Rolling upgrade is restoring blocks from trash multiple times
 -

 Key: HDFS-7645
 URL: https://issues.apache.org/jira/browse/HDFS-7645
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Nathan Roberts

 When performing an HDFS rolling upgrade, the trash directory is getting 
 restored twice when under normal circumstances it shouldn't need to be 
 restored at all. IIUC, the only time these blocks should be restored is if we 
 need to roll back a rolling upgrade. 
 On a busy cluster, this can cause significant and unnecessary block churn, 
 both on the datanodes and, more importantly, in the namenode.
 The two times this happens are:
 1) restart of the DN onto new software
 {code}
 private void doTransition(DataNode datanode, StorageDirectory sd,
     NamespaceInfo nsInfo, StartupOption startOpt) throws IOException {
   if (startOpt == StartupOption.ROLLBACK && sd.getPreviousDir().exists()) {
     Preconditions.checkState(!getTrashRootDir(sd).exists(),
         sd.getPreviousDir() + " and " + getTrashRootDir(sd) + " should not " +
         "both be present.");
     doRollback(sd, nsInfo); // rollback if applicable
   } else {
     // Restore all the files in the trash. The restored files are retained
     // during rolling upgrade rollback. They are deleted during rolling
     // upgrade downgrade.
     int restored = restoreBlockFilesFromTrash(getTrashRootDir(sd));
     LOG.info("Restored " + restored + " block files from trash.");
   }
 {code}
 2) When the heartbeat response no longer indicates a rolling upgrade is in progress
 {code}
 /**
  * Signal the current rolling upgrade status as indicated by the NN.
  * @param inProgress true if a rolling upgrade is in progress
  */
 void signalRollingUpgrade(boolean inProgress) throws IOException {
   String bpid = getBlockPoolId();
   if (inProgress) {
     dn.getFSDataset().enableTrash(bpid);
     dn.getFSDataset().setRollingUpgradeMarker(bpid);
   } else {
     dn.getFSDataset().restoreTrash(bpid);
     dn.getFSDataset().clearRollingUpgradeMarker(bpid);
   }
 }
 {code}
 HDFS-6800 and HDFS-6981 modified this behavior, making it not completely 
 clear whether this is somehow intentional. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7496) Fix FsVolume removal race conditions on the DataNode by reference-counting the volume instances

2015-01-21 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-7496:
---
Fix Version/s: (was: 3.0.0)
   2.7.0

committed to 2.7.  Thanks.

 Fix FsVolume removal race conditions on the DataNode by reference-counting 
 the volume instances
 ---

 Key: HDFS-7496
 URL: https://issues.apache.org/jira/browse/HDFS-7496
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
Assignee: Lei (Eddy) Xu
 Fix For: 2.7.0

 Attachments: HDFS-7496-branch-2.000.patch, HDFS-7496.000.patch, 
 HDFS-7496.001.patch, HDFS-7496.002.patch, HDFS-7496.003.patch, 
 HDFS-7496.003.patch, HDFS-7496.004.patch, HDFS-7496.005.patch, 
 HDFS-7496.006.patch, HDFS-7496.007.patch


 We discussed a few FsVolume removal race conditions on the DataNode in 
 HDFS-7489.  We should figure out a way to make removing an FsVolume safe.
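
 A generic sketch of the reference-counting idea (illustrative only, with
 made-up names; it is not the API introduced by the patch):

 {code}
 // Count in-flight users of a volume; defer the actual teardown until the
 // last reference is released. Names are invented for this sketch.
 final class RefCountedVolume implements AutoCloseable {
   private final java.util.concurrent.atomic.AtomicInteger refs =
       new java.util.concurrent.atomic.AtomicInteger(1); // 1 = held by the dataset

   RefCountedVolume reference() {
     while (true) {
       int n = refs.get();
       if (n == 0) {
         throw new IllegalStateException("volume already released");
       }
       if (refs.compareAndSet(n, n + 1)) {
         return this; // caller releases via try-with-resources
       }
     }
   }

   @Override
   public void close() {            // release one reference
     if (refs.decrementAndGet() == 0) {
       tearDown();                  // no reader/writer still holds the volume
     }
   }

   void remove() {                  // hot-swap removal path
     close();                       // drop the dataset's own reference
   }

   private void tearDown() { /* actually shut the volume down */ }
 }
 {code}

 Readers and writers would then acquire a reference with {{reference()}} in a
 try-with-resources block, so removal can never pull a volume out from under an
 in-flight operation.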



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-3443) Unable to catch up edits during standby to active switch due to NPE

2015-01-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286148#comment-14286148
 ] 

Hudson commented on HDFS-3443:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6904 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6904/])
HDFS-3443. Fix NPE when namenode transition to active during startup by adding 
checkNNStartup() in NameNodeRpcServer.  Contributed by Vinayakumar B (szetszwo: 
rev db334bb8625da97c7e518cbcf477530c7ba7001e)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
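
The shape of the guard is roughly the following (a sketch of the idea, not the
exact committed code):

{code}
// Every RPC entry point in NameNodeRpcServer checks that the NameNode has
// finished starting up before touching state such as editLogTailer, so an
// early ZKFC request gets a clean IOException instead of an NPE.
private void checkNNStartup() throws IOException {
  if (!nn.isStarted()) {
    throw new IOException(nn.getRole() + " still not started");
  }
}
{code}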


 Unable to catch up edits during standby to active switch due to NPE
 ---

 Key: HDFS-3443
 URL: https://issues.apache.org/jira/browse/HDFS-3443
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: auto-failover, ha
Reporter: suja s
Assignee: Vinayakumar B
 Attachments: HDFS-3443-003.patch, HDFS-3443-004.patch, 
 HDFS-3443-005.patch, HDFS-3443-006.patch, HDFS-3443-007.patch, 
 HDFS-3443_1.patch, HDFS-3443_1.patch


 Start NN
 Let NN standby services be started.
 Before the editLogTailer is initialised start ZKFC and allow the 
 activeservices start to proceed further.
 Here editLogTailer.catchupDuringFailover() will throw NPE.
 {code}
 void startActiveServices() throws IOException {
   LOG.info("Starting services required for active state");
   writeLock();
   try {
     FSEditLog editLog = dir.fsImage.getEditLog();

     if (!editLog.isOpenForWrite()) {
       // During startup, we're already open for write during initialization.
       editLog.initJournalsForWrite();
       // May need to recover
       editLog.recoverUnclosedStreams();

       LOG.info("Catching up to latest edits from old active before " +
           "taking over writer role in edits logs.");
       editLogTailer.catchupDuringFailover();
 {code}
 {noformat}
 2012-05-18 16:51:27,585 WARN org.apache.hadoop.ipc.Server: IPC Server 
 Responder, call org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from 
 XX.XX.XX.55:58003: output error
 2012-05-18 16:51:27,586 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
 8 on 8020, call org.apache.hadoop.ha.HAServiceProtocol.transitionToActive 
 from XX.XX.XX.55:58004: error: java.lang.NullPointerException
 java.lang.NullPointerException
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:602)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
   at 
 org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
   at 
 org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
 2012-05-18 16:51:27,586 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
 9 on 8020 caught an exception
 java.nio.channels.ClosedChannelException
   at 
 sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
   at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2092)
   at org.apache.hadoop.ipc.Server.access$2000(Server.java:107)
   at 
 org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:930)
   at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:994)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1738)
 

[jira] [Commented] (HDFS-7406) SimpleHttpProxyHandler puts incorrect Connection: Close header

2015-01-21 Thread Haohui Mai (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286074#comment-14286074
 ] 

Haohui Mai commented on HDFS-7406:
--

This is not an incompatible change since it has not been released.

 SimpleHttpProxyHandler puts incorrect Connection: Close header
 

 Key: HDFS-7406
 URL: https://issues.apache.org/jira/browse/HDFS-7406
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Haohui Mai
Assignee: Haohui Mai
 Fix For: 2.7.0

 Attachments: HDFS-7406.000.patch


 tcpdump reveals that SimpleHttpProxyHandler puts an incorrect value in the 
 {{Connection}} header:
 {noformat}
 Connection: io.netty.channel.ChannelFutureListener$1@36866933
 {noformat}
 which should be
 {noformat}
 Connection: close
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-3443) Unable to catch up edits during standby to active switch due to NPE

2015-01-21 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286058#comment-14286058
 ] 

Tsz Wo Nicholas Sze commented on HDFS-3443:
---

 I have intentionally not added checks to these. ... is that fine with you?

Sure, let's leave them unchecked for the moment.  We may add checkNNStartup() 
later on if necessary.

 Unable to catch up edits during standby to active switch due to NPE
 ---

 Key: HDFS-3443
 URL: https://issues.apache.org/jira/browse/HDFS-3443
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: auto-failover, ha
Reporter: suja s
Assignee: Vinayakumar B
 Attachments: HDFS-3443-003.patch, HDFS-3443-004.patch, 
 HDFS-3443-005.patch, HDFS-3443-006.patch, HDFS-3443-007.patch, 
 HDFS-3443_1.patch, HDFS-3443_1.patch


 Start NN
 Let NN standby services be started.
 Before the editLogTailer is initialised start ZKFC and allow the 
 activeservices start to proceed further.
 Here editLogTailer.catchupDuringFailover() will throw NPE.
 {code}
 void startActiveServices() throws IOException {
   LOG.info("Starting services required for active state");
   writeLock();
   try {
     FSEditLog editLog = dir.fsImage.getEditLog();

     if (!editLog.isOpenForWrite()) {
       // During startup, we're already open for write during initialization.
       editLog.initJournalsForWrite();
       // May need to recover
       editLog.recoverUnclosedStreams();

       LOG.info("Catching up to latest edits from old active before " +
           "taking over writer role in edits logs.");
       editLogTailer.catchupDuringFailover();
 {code}
 {noformat}
 2012-05-18 16:51:27,585 WARN org.apache.hadoop.ipc.Server: IPC Server 
 Responder, call org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from 
 XX.XX.XX.55:58003: output error
 2012-05-18 16:51:27,586 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
 8 on 8020, call org.apache.hadoop.ha.HAServiceProtocol.transitionToActive 
 from XX.XX.XX.55:58004: error: java.lang.NullPointerException
 java.lang.NullPointerException
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:602)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
   at 
 org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
   at 
 org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
 2012-05-18 16:51:27,586 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
 9 on 8020 caught an exception
 java.nio.channels.ClosedChannelException
   at 
 sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
   at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2092)
   at org.apache.hadoop.ipc.Server.access$2000(Server.java:107)
   at 
 org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:930)
   at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:994)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1738)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-3443) Unable to catch up edits during standby to active switch due to NPE

2015-01-21 Thread Tsz Wo Nicholas Sze (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated HDFS-3443:
--
Hadoop Flags: Reviewed

+1 patch looks good

 Unable to catch up edits during standby to active switch due to NPE
 ---

 Key: HDFS-3443
 URL: https://issues.apache.org/jira/browse/HDFS-3443
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: auto-failover, ha
Reporter: suja s
Assignee: Vinayakumar B
 Attachments: HDFS-3443-003.patch, HDFS-3443-004.patch, 
 HDFS-3443-005.patch, HDFS-3443-006.patch, HDFS-3443-007.patch, 
 HDFS-3443_1.patch, HDFS-3443_1.patch


 Start NN
 Let NN standby services be started.
 Before the editLogTailer is initialised start ZKFC and allow the 
 activeservices start to proceed further.
 Here editLogTailer.catchupDuringFailover() will throw NPE.
 {code}
 void startActiveServices() throws IOException {
   LOG.info("Starting services required for active state");
   writeLock();
   try {
     FSEditLog editLog = dir.fsImage.getEditLog();

     if (!editLog.isOpenForWrite()) {
       // During startup, we're already open for write during initialization.
       editLog.initJournalsForWrite();
       // May need to recover
       editLog.recoverUnclosedStreams();

       LOG.info("Catching up to latest edits from old active before " +
           "taking over writer role in edits logs.");
       editLogTailer.catchupDuringFailover();
 {code}
 {noformat}
 2012-05-18 16:51:27,585 WARN org.apache.hadoop.ipc.Server: IPC Server 
 Responder, call org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from 
 XX.XX.XX.55:58003: output error
 2012-05-18 16:51:27,586 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
 8 on 8020, call org.apache.hadoop.ha.HAServiceProtocol.transitionToActive 
 from XX.XX.XX.55:58004: error: java.lang.NullPointerException
 java.lang.NullPointerException
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:602)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
   at 
 org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
   at 
 org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
 2012-05-18 16:51:27,586 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
 9 on 8020 caught an exception
 java.nio.channels.ClosedChannelException
   at 
 sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
   at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2092)
   at org.apache.hadoop.ipc.Server.access$2000(Server.java:107)
   at 
 org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:930)
   at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:994)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1738)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7496) Fix FsVolume removal race conditions on the DataNode by reference-counting the volume instances

2015-01-21 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu updated HDFS-7496:

Attachment: HDFS-7496-branch-2.000.patch

Uploaded for branch-2

 Fix FsVolume removal race conditions on the DataNode by reference-counting 
 the volume instances
 ---

 Key: HDFS-7496
 URL: https://issues.apache.org/jira/browse/HDFS-7496
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
Assignee: Lei (Eddy) Xu
 Fix For: 3.0.0

 Attachments: HDFS-7496-branch-2.000.patch, HDFS-7496.000.patch, 
 HDFS-7496.001.patch, HDFS-7496.002.patch, HDFS-7496.003.patch, 
 HDFS-7496.003.patch, HDFS-7496.004.patch, HDFS-7496.005.patch, 
 HDFS-7496.006.patch, HDFS-7496.007.patch


 We discussed a few FsVolume removal race conditions on the DataNode in 
 HDFS-7489.  We should figure out a way to make removing an FsVolume safe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7548) Corrupt block reporting delayed until datablock scanner thread detects it

2015-01-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286164#comment-14286164
 ] 

Hadoop QA commented on HDFS-7548:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12693618/HDFS-7548-v5.patch
  against trunk revision 6b17eb9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.qjournal.TestSecureNNWithQJM

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9293//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9293//console

This message is automatically generated.

 Corrupt block reporting delayed until datablock scanner thread detects it
 -

 Key: HDFS-7548
 URL: https://issues.apache.org/jira/browse/HDFS-7548
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Rushabh S Shah
Assignee: Rushabh S Shah
 Attachments: HDFS-7548-v2.patch, HDFS-7548-v3.patch, 
 HDFS-7548-v4.patch, HDFS-7548-v5.patch, HDFS-7548.patch


 When there is only one datanode holding a block and that block happens to be
 corrupt, the namenode keeps trying to replicate the block repeatedly, but the 
 block is reported as corrupt only when the data block scanner thread of the 
 datanode picks up this bad block.
 Requesting an improvement in namenode reporting so that the corrupt replica is 
 reported when there is only 1 replica and the replication of that replica 
 keeps failing with a checksum error.
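
 Purely as an illustration of the requested behavior (a hypothetical hook, not
 the attached patch; {{blockSender}}, {{block}} and {{datanode}} are assumed to
 come from the surrounding transfer code):

 {code}
 // Report the replica as corrupt as soon as a checksum failure is hit while
 // serving a replication request, instead of waiting for the block scanner.
 try {
   blockSender.sendBlock(out, baseStream, null);   // feeds the replication target
 } catch (org.apache.hadoop.fs.ChecksumException ce) {
   LOG.warn("Checksum error while replicating " + block + ", reporting it", ce);
   datanode.reportBadBlocks(block);                // tell the NN immediately
   throw ce;
 }
 {code}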



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832

2015-01-21 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286169#comment-14286169
 ] 

Colin Patrick McCabe commented on HDFS-7575:


This patch does change the layout format.  It changes it from one where storage 
ID may or may not be unique to one where it definitely is.

Can you respond to the practical points I made above?  I made a few points 
that nobody has responded to yet.
* Changing the storage ID during startup basically changes the storage ID from 
being a permanent identifier to a temporary one, which makes persisting it later 
impossible.  It commits us to an architecture where block locations can't be 
persisted.
* With approach #1, we have to carry the burden of the dedupe code forever.
* Approach #1 degrades error handling.  If you somehow end up with two volumes 
that map to the same directory, the code silently does the wrong thing.

I would appreciate a response to these.  Thanks.

 NameNode not handling heartbeats properly after HDFS-2832
 -

 Key: HDFS-7575
 URL: https://issues.apache.org/jira/browse/HDFS-7575
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0, 2.5.0, 2.6.0
Reporter: Lars Francke
Assignee: Arpit Agarwal
Priority: Critical
 Attachments: HDFS-7575.01.patch, HDFS-7575.02.patch, 
 HDFS-7575.03.binary.patch, HDFS-7575.03.patch, HDFS-7575.04.binary.patch, 
 HDFS-7575.04.patch, HDFS-7575.05.binary.patch, HDFS-7575.05.patch, 
 testUpgrade22via24GeneratesStorageIDs.tgz, 
 testUpgradeFrom22GeneratesStorageIDs.tgz, 
 testUpgradeFrom24PreservesStorageId.tgz


 Before HDFS-2832 each DataNode would have a unique storageId which included 
 its IP address. Since HDFS-2832 the DataNodes have a unique storageId per 
 storage directory which is just a random UUID.
 They send reports per storage directory in their heartbeats. This heartbeat 
 is processed on the NameNode in the 
 {{DatanodeDescriptor#updateHeartbeatState}} method. Pre HDFS-2832 this would 
 just store the information per Datanode. After the patch though each DataNode 
 can have multiple different storages so it's stored in a map keyed by the 
 storage Id.
 This works fine for all clusters that have been installed post HDFS-2832 as 
 they get a UUID for their storage Id. So a DN with 8 drives has a map with 8 
 different keys. On each Heartbeat the Map is searched and updated 
 ({{DatanodeStorageInfo storage = storageMap.get(s.getStorageID());}}):
 {code:title=DatanodeStorageInfo}
   void updateState(StorageReport r) {
     capacity = r.getCapacity();
     dfsUsed = r.getDfsUsed();
     remaining = r.getRemaining();
     blockPoolUsed = r.getBlockPoolUsed();
   }
 {code}
 On clusters that were upgraded from a pre-HDFS-2832 version, though, the 
 storage Id has not been rewritten (at least not on the four clusters I 
 checked), so each directory will have the exact same storageId. That means 
 there'll be only a single entry in the {{storageMap}} and it'll be 
 overwritten by a random {{StorageReport}} from the DataNode. This can be seen 
 in the {{updateState}} method above: it just assigns the capacity from the 
 received report, when it should probably sum the values per received heartbeat.
 The Balancer seems to be one of the only things that actually uses this 
 information, so it now considers the utilization of a random drive per 
 DataNode for balancing purposes.
 Things get even worse when a drive has been added or replaced, as this will 
 now get a new storage Id, so there'll be two entries in the storageMap. As new 
 drives are usually empty, it skews the balancer's decision in a way that this 
 node will never be considered over-utilized.
 Another problem is that old StorageReports are never removed from the 
 storageMap. So if I replace a drive and it gets a new storage Id the old one 
 will still be in place and used for all calculations by the Balancer until a 
 restart of the NameNode.
 I can try providing a patch that does the following:
 * Instead of using a Map I could just store the array we receive or instead 
 of storing an array sum up the values for reports with the same Id
 * On each heartbeat clear the map (so we know we have up to date information)
 Does that sound sensible?
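
 As a small illustration of the "sum up the values for reports with the same Id"
 option from the list above (a sketch with simplified types, not a patch):

 {code}
 // Aggregate per storage ID instead of overwriting a single map entry. On a
 // pre-HDFS-2832 upgrade, all directories report under one ID, so the summed
 // value reflects the whole node rather than one random drive.
 Map<String, Long> capacityById = new HashMap<String, Long>();
 for (StorageReport r : reports) {
   String id = r.getStorage().getStorageID();   // getter names approximate
   Long prev = capacityById.get(id);
   capacityById.put(id, (prev == null ? 0L : prev) + r.getCapacity());
 }
 {code}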



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832

2015-01-21 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286051#comment-14286051
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7575:
---

 ... we have bumped the layout version in the past even when the old software 
 could handle the new layout. ...

For HDFS-6482, it does change layout format.  So bumping layout version makes 
sense.  However, the patch here does not change layout format.  Disagree?

 NameNode not handling heartbeats properly after HDFS-2832
 -

 Key: HDFS-7575
 URL: https://issues.apache.org/jira/browse/HDFS-7575
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0, 2.5.0, 2.6.0
Reporter: Lars Francke
Assignee: Arpit Agarwal
Priority: Critical
 Attachments: HDFS-7575.01.patch, HDFS-7575.02.patch, 
 HDFS-7575.03.binary.patch, HDFS-7575.03.patch, HDFS-7575.04.binary.patch, 
 HDFS-7575.04.patch, HDFS-7575.05.binary.patch, HDFS-7575.05.patch, 
 testUpgrade22via24GeneratesStorageIDs.tgz, 
 testUpgradeFrom22GeneratesStorageIDs.tgz, 
 testUpgradeFrom24PreservesStorageId.tgz


 Before HDFS-2832 each DataNode would have a unique storageId which included 
 its IP address. Since HDFS-2832 the DataNodes have a unique storageId per 
 storage directory which is just a random UUID.
 They send reports per storage directory in their heartbeats. This heartbeat 
 is processed on the NameNode in the 
 {{DatanodeDescriptor#updateHeartbeatState}} method. Pre HDFS-2832 this would 
 just store the information per Datanode. After the patch though each DataNode 
 can have multiple different storages so it's stored in a map keyed by the 
 storage Id.
 This works fine for all clusters that have been installed post HDFS-2832 as 
 they get a UUID for their storage Id. So a DN with 8 drives has a map with 8 
 different keys. On each Heartbeat the Map is searched and updated 
 ({{DatanodeStorageInfo storage = storageMap.get(s.getStorageID());}}):
 {code:title=DatanodeStorageInfo}
   void updateState(StorageReport r) {
     capacity = r.getCapacity();
     dfsUsed = r.getDfsUsed();
     remaining = r.getRemaining();
     blockPoolUsed = r.getBlockPoolUsed();
   }
 {code}
 On clusters that were upgraded from a pre-HDFS-2832 version, though, the 
 storage Id has not been rewritten (at least not on the four clusters I 
 checked), so each directory will have the exact same storageId. That means 
 there'll be only a single entry in the {{storageMap}} and it'll be 
 overwritten by a random {{StorageReport}} from the DataNode. This can be seen 
 in the {{updateState}} method above: it just assigns the capacity from the 
 received report, when it should probably sum the values per received heartbeat.
 The Balancer seems to be one of the only things that actually uses this 
 information, so it now considers the utilization of a random drive per 
 DataNode for balancing purposes.
 Things get even worse when a drive has been added or replaced, as this will 
 now get a new storage Id, so there'll be two entries in the storageMap. As new 
 drives are usually empty, it skews the balancer's decision in a way that this 
 node will never be considered over-utilized.
 Another problem is that old StorageReports are never removed from the 
 storageMap. So if I replace a drive and it gets a new storage Id the old one 
 will still be in place and used for all calculations by the Balancer until a 
 restart of the NameNode.
 I can try providing a patch that does the following:
 * Instead of using a Map I could just store the array we receive or instead 
 of storing an array sum up the values for reports with the same Id
 * On each heartbeat clear the map (so we know we have up to date information)
 Does that sound sensible?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-4929) [NNBench mark] Lease mismatch error when running with multiple mappers

2015-01-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285996#comment-14285996
 ] 

Hadoop QA commented on HDFS-4929:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12693611/HDFS4929.patch
  against trunk revision 6b17eb9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9292//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9292//console

This message is automatically generated.

 [NNBench mark] Lease mismatch error when running with multiple mappers
 --

 Key: HDFS-4929
 URL: https://issues.apache.org/jira/browse/HDFS-4929
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: benchmarks
Reporter: Brahma Reddy Battula
Assignee: Brahma Reddy Battula
Priority: Critical
 Attachments: HDFS4929.patch


 Command :
 ./yarn jar 
 ../share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.1-tests.jar 
 nnbench -operation create_write -numberOfFiles 1000 -blockSize 268435456 
 -bytesToWrite 102400 -baseDir /benchmarks/NNBench`hostname -s` 
 -replicationFactorPerFile 3 -maps 100 -reduces 10
 Trace :
 2013-06-21 10:44:53,763 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
 7 on 9005, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 
 192.168.105.214:36320: error: 
 org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: Lease mismatch 
 on /benchmarks/NNBenchlinux-185/data/file_linux-214__0 owned by 
 DFSClient_attempt_1371782327901_0001_m_48_0_1383437860_1 but is accessed 
 by DFSClient_attempt_1371782327901_0001_m_84_0_1880545303_1
 org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: Lease mismatch 
 on /benchmarks/NNBenchlinux-185/data/file_linux-214__0 owned by 
 DFSClient_attempt_1371782327901_0001_m_48_0_1383437860_1 but is accessed 
 by DFSClient_attempt_1371782327901_0001_m_84_0_1880545303_1
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2351)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2098)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2019)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:501)
   at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:213)
   at 
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:52012)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:435)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:925)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1710)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1706)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-3443) Unable to catch up edits during standby to active switch due to NPE

2015-01-21 Thread Tsz Wo Nicholas Sze (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated HDFS-3443:
--
   Resolution: Fixed
Fix Version/s: 2.6.1
   Status: Resolved  (was: Patch Available)

I have committed this.  Thanks, Vinay!

 Unable to catch up edits during standby to active switch due to NPE
 ---

 Key: HDFS-3443
 URL: https://issues.apache.org/jira/browse/HDFS-3443
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: auto-failover, ha
Reporter: suja s
Assignee: Vinayakumar B
 Fix For: 2.6.1

 Attachments: HDFS-3443-003.patch, HDFS-3443-004.patch, 
 HDFS-3443-005.patch, HDFS-3443-006.patch, HDFS-3443-007.patch, 
 HDFS-3443_1.patch, HDFS-3443_1.patch


 Start NN
 Let NN standby services be started.
 Before the editLogTailer is initialised start ZKFC and allow the 
 activeservices start to proceed further.
 Here editLogTailer.catchupDuringFailover() will throw NPE.
 {code}
 void startActiveServices() throws IOException {
   LOG.info("Starting services required for active state");
   writeLock();
   try {
     FSEditLog editLog = dir.fsImage.getEditLog();

     if (!editLog.isOpenForWrite()) {
       // During startup, we're already open for write during initialization.
       editLog.initJournalsForWrite();
       // May need to recover
       editLog.recoverUnclosedStreams();

       LOG.info("Catching up to latest edits from old active before " +
           "taking over writer role in edits logs.");
       editLogTailer.catchupDuringFailover();
 {code}
 {noformat}
 2012-05-18 16:51:27,585 WARN org.apache.hadoop.ipc.Server: IPC Server 
 Responder, call org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from 
 XX.XX.XX.55:58003: output error
 2012-05-18 16:51:27,586 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
 8 on 8020, call org.apache.hadoop.ha.HAServiceProtocol.transitionToActive 
 from XX.XX.XX.55:58004: error: java.lang.NullPointerException
 java.lang.NullPointerException
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:602)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
   at 
 org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
   at 
 org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
 2012-05-18 16:51:27,586 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
 9 on 8020 caught an exception
 java.nio.channels.ClosedChannelException
   at 
 sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
   at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2092)
   at org.apache.hadoop.ipc.Server.access$2000(Server.java:107)
   at 
 org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:930)
   at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:994)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1738)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-7645) Rolling upgrade is restoring blocks from trash multiple times

2015-01-21 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286128#comment-14286128
 ] 

Arpit Agarwal edited comment on HDFS-7645 at 1/21/15 7:34 PM:
--

The first restore was by design when the rolling upgrade feature was added 
(HDFS-6005). It simplified the rollback procedure by not requiring the 
{{-rollback}} flag to the DataNode, so regular startup/rollback could be 
treated similarly by restoring from trash.

HDFS-6800 added back the requirement to pass the {{-rollback}} flag during RU 
rollback, to support layout changes. The second restore was a side effect of 
the same fix. We can probably eliminate both restores now.

bq. I think we should get rid of trash and just always create a previous/ 
directory when doing rolling upgrade, the same as we do with regular upgrade. 
The speed is clearly acceptable since we've done these upgrades in the field 
when switching to the blockid-based layout with no problems. And it will be a 
lot more maintainable and less confusing.
DN layout changes will be rare for minor/point releases. I am wary of 
eliminating trash without some numbers showing hard link performance with 
millions of blocks is on par with trash. Even a few seconds per DN adds up to 
many hours/days when upgrading thousands of DNs sequentially. Once we fix this 
issue raised by Nathan, the overhead of trash as compared to regular startup is 
nil. 




was (Author: arpitagarwal):
The first restore was by design when the rolling upgrade feature was added 
(HDFS-6005). It simplified the rollback procedure by not requiring the 
{{-rollback}} flag to the DataNode, so regular startup/rollback could be 
treated similarly by restoring from trash.

HDFS-6800 added back the requirement to pass the {{-rollback}} flag during RU 
rollback, to support layout changes. The second restore was a side effect of 
the same fix. We can probably eliminate both restores now.

DN layout changes will be rare for minor/point releases. I am wary of 
eliminating trash without some numbers showing hard link performance with 
millions of blocks is on par with trash. Even a few seconds per DN adds up to 
many hours/days when upgrading thousands of DNs sequentially. Once we fix this 
issue raised by Nathan the overhead of trash as compared to regular startup is 
nil. 



 Rolling upgrade is restoring blocks from trash multiple times
 -

 Key: HDFS-7645
 URL: https://issues.apache.org/jira/browse/HDFS-7645
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Nathan Roberts

 When performing an HDFS rolling upgrade, the trash directory is getting 
 restored twice when under normal circumstances it shouldn't need to be 
 restored at all. IIUC, the only time these blocks should be restored is if we 
 need to roll back a rolling upgrade. 
 On a busy cluster, this can cause significant and unnecessary block churn, 
 both on the datanodes and, more importantly, in the namenode.
 The two times this happens are:
 1) restart of the DN onto new software
 {code}
 private void doTransition(DataNode datanode, StorageDirectory sd,
     NamespaceInfo nsInfo, StartupOption startOpt) throws IOException {
   if (startOpt == StartupOption.ROLLBACK && sd.getPreviousDir().exists()) {
     Preconditions.checkState(!getTrashRootDir(sd).exists(),
         sd.getPreviousDir() + " and " + getTrashRootDir(sd) + " should not " +
         "both be present.");
     doRollback(sd, nsInfo); // rollback if applicable
   } else {
     // Restore all the files in the trash. The restored files are retained
     // during rolling upgrade rollback. They are deleted during rolling
     // upgrade downgrade.
     int restored = restoreBlockFilesFromTrash(getTrashRootDir(sd));
     LOG.info("Restored " + restored + " block files from trash.");
   }
 {code}
 2) When the heartbeat response no longer indicates a rolling upgrade is in progress
 {code}
 /**
  * Signal the current rolling upgrade status as indicated by the NN.
  * @param inProgress true if a rolling upgrade is in progress
  */
 void signalRollingUpgrade(boolean inProgress) throws IOException {
   String bpid = getBlockPoolId();
   if (inProgress) {
     dn.getFSDataset().enableTrash(bpid);
     dn.getFSDataset().setRollingUpgradeMarker(bpid);
   } else {
     dn.getFSDataset().restoreTrash(bpid);
     dn.getFSDataset().clearRollingUpgradeMarker(bpid);
   }
 }
 {code}
 HDFS-6800 and HDFS-6981 modified this behavior, making it not completely 
 clear whether this is somehow intentional. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-5423) Verify initializations of LocatedBlock/RecoveringBlock

2015-01-21 Thread Arpit Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Agarwal resolved HDFS-5423.
-
Resolution: Invalid

Resolving as Invalid. I don't believe any work is remaining here.

 Verify initializations of LocatedBlock/RecoveringBlock
 --

 Key: HDFS-5423
 URL: https://issues.apache.org/jira/browse/HDFS-5423
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, namenode
Affects Versions: Heterogeneous Storage (HDFS-2832)
Reporter: Arpit Agarwal
Assignee: Arpit Agarwal

 Tracking Jira to make sure we verify initialization of LocatedBlock and 
 RecoveringBlock, possibly reorg the constructors to make missing 
 initialization of StorageIDs less likely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7647) DatanodeManager.sortLocatedBlocks() sorts DatanodeIDs but not StorageIDs

2015-01-21 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286012#comment-14286012
 ] 

Konstantin Shvachko commented on HDFS-7647:
---

Yes, I would go with the simple approach: sort the DatanodeIDs, then use the 
mapping to reorganize the storages. The mapping could of course be an index 
array, as Arpit suggested.
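
A minimal sketch of that index-array approach (illustrative only; {{locs}},
{{storageIDs}}, {{storageTypes}} and {{comparator}} stand for the existing
fields of the located block and the existing DatanodeID comparator):

{code}
// Sort an index array with the DatanodeID comparator, then apply the same
// permutation to the parallel storage arrays so they stay aligned.
final Integer[] idx = new Integer[locs.length];
for (int i = 0; i < idx.length; i++) {
  idx[i] = i;
}
Arrays.sort(idx, new Comparator<Integer>() {
  @Override
  public int compare(Integer a, Integer b) {
    return comparator.compare(locs[a], locs[b]);
  }
});
DatanodeInfo[] sortedLocs = new DatanodeInfo[locs.length];
String[] sortedStorageIDs = new String[locs.length];
StorageType[] sortedStorageTypes = new StorageType[locs.length];
for (int i = 0; i < idx.length; i++) {
  sortedLocs[i] = locs[idx[i]];
  sortedStorageIDs[i] = storageIDs[idx[i]];
  sortedStorageTypes[i] = storageTypes[idx[i]];
}
{code}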

 DatanodeManager.sortLocatedBlocks() sorts DatanodeIDs but not StorageIDs
 

 Key: HDFS-7647
 URL: https://issues.apache.org/jira/browse/HDFS-7647
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Milan Desai
Assignee: Milan Desai

 DatanodeManager.sortLocatedBlocks() sorts the array of DatanodeIDs inside 
 each LocatedBlock, but does not touch the array of StorageIDs and 
 StorageTypes. As a result, the DatanodeIDs and StorageIDs/StorageTypes are 
 mismatched. The method is called by FSNamesystem.getBlockLocations(), so the 
 client will not know which StorageID/Type corresponds to which DatanodeID.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7406) SimpleHttpProxyHandler puts incorrect Connection: Close header

2015-01-21 Thread Haohui Mai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haohui Mai updated HDFS-7406:
-
Hadoop Flags: Reviewed  (was: Incompatible change,Reviewed)

 SimpleHttpProxyHandler puts incorrect Connection: Close header
 

 Key: HDFS-7406
 URL: https://issues.apache.org/jira/browse/HDFS-7406
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Haohui Mai
Assignee: Haohui Mai
 Fix For: 2.7.0

 Attachments: HDFS-7406.000.patch


 tcpdump reveals that SimpleHttpProxyHandler puts an incorrect value in the 
 {{Connection}} header:
 {noformat}
 Connection: io.netty.channel.ChannelFutureListener$1@36866933
 {noformat}
 which should be
 {noformat}
 Connection: close
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7037) Using distcp to copy data from insecure to secure cluster via hftp doesn't work (branch-2 only)

2015-01-21 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285856#comment-14285856
 ] 

Yongjun Zhang commented on HDFS-7037:
-

Thanks [~qwertymaniac], indeed what you described is a very good use case here.

Hi [~atm] and [~daryn], I haven't pushed you on this because I thought we had 
webhdfs as an alternative solution :-) For the scenario Harsh described (distcp 
from an old pre-security release that doesn't have webhdfs support yet), we don't 
have an alternative, so would you please help review the patch? Thanks much!




 Using distcp to copy data from insecure to secure cluster via hftp doesn't 
 work  (branch-2 only)
 

 Key: HDFS-7037
 URL: https://issues.apache.org/jira/browse/HDFS-7037
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: security, tools
Affects Versions: 2.6.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
 Attachments: HDFS-7037.001.patch


 This is a branch-2 only issue since hftp is only supported there. 
 Issuing distcp hftp://insecureCluster hdfs://secureCluster gave the 
 following failure exception:
 {code}
 14/09/13 22:07:40 INFO tools.DelegationTokenFetcher: Error when dealing 
 remote token:
 java.io.IOException: Error when dealing remote token: Internal Server Error
   at 
 org.apache.hadoop.hdfs.tools.DelegationTokenFetcher.run(DelegationTokenFetcher.java:375)
   at 
 org.apache.hadoop.hdfs.tools.DelegationTokenFetcher.getDTfromRemote(DelegationTokenFetcher.java:238)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$2.run(HftpFileSystem.java:252)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$2.run(HftpFileSystem.java:247)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.getDelegationToken(HftpFileSystem.java:247)
   at 
 org.apache.hadoop.hdfs.web.TokenAspect.ensureTokenInitialized(TokenAspect.java:140)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.addDelegationTokenParam(HftpFileSystem.java:337)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.openConnection(HftpFileSystem.java:324)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:457)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.getFileStatus(HftpFileSystem.java:472)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.getFileStatus(HftpFileSystem.java:501)
   at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
   at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1623)
   at 
 org.apache.hadoop.tools.GlobbedCopyListing.doBuildListing(GlobbedCopyListing.java:77)
   at org.apache.hadoop.tools.CopyListing.buildListing(CopyListing.java:81)
   at 
 org.apache.hadoop.tools.DistCp.createInputFileListing(DistCp.java:342)
   at org.apache.hadoop.tools.DistCp.execute(DistCp.java:154)
   at org.apache.hadoop.tools.DistCp.run(DistCp.java:121)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
   at org.apache.hadoop.tools.DistCp.main(DistCp.java:390)
 14/09/13 22:07:40 WARN security.UserGroupInformation: 
 PriviledgedActionException as:hadoopu...@xyz.com (auth:KERBEROS) 
 cause:java.io.IOException: Unable to obtain remote token
 14/09/13 22:07:40 ERROR tools.DistCp: Exception encountered 
 java.io.IOException: Unable to obtain remote token
   at 
 org.apache.hadoop.hdfs.tools.DelegationTokenFetcher.getDTfromRemote(DelegationTokenFetcher.java:249)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$2.run(HftpFileSystem.java:252)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$2.run(HftpFileSystem.java:247)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.getDelegationToken(HftpFileSystem.java:247)
   at 
 org.apache.hadoop.hdfs.web.TokenAspect.ensureTokenInitialized(TokenAspect.java:140)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.addDelegationTokenParam(HftpFileSystem.java:337)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.openConnection(HftpFileSystem.java:324)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:457)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.getFileStatus(HftpFileSystem.java:472)
   at 
 

[jira] [Commented] (HDFS-7610) Fix removal of dynamically added DN volumes

2015-01-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285783#comment-14285783
 ] 

Hudson commented on HDFS-7610:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2031 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2031/])
HDFS-7610. Fix removal of dynamically added DN volumes (Lei (Eddy) Xu via Colin 
P. McCabe) (cmccabe: rev a17584936cc5141e3f5612ac3ecf35e27968e439)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeList.java
HDFS-7610. Add CHANGES.txt (cmccabe: rev 
a1222784fbc4bb51be96586ec2ae7098264b)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


 Fix removal of dynamically added DN volumes
 ---

 Key: HDFS-7610
 URL: https://issues.apache.org/jira/browse/HDFS-7610
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
 Fix For: 2.7.0

 Attachments: HDFS-7610.000.patch, HDFS-7610.001.patch


 In the hot swap feature, {{FsDatasetImpl#addVolume}} uses the base volume dir 
 (e.g. {{/foo/data0}}) instead of the volume's current dir 
 ({{/foo/data0/current}}) to construct {{FsVolumeImpl}}. As a result, the DataNode 
 cannot remove this newly added volume, because its 
 {{FsVolumeImpl#getBasePath}} returns {{/foo}}.
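
 The following is a minimal, hypothetical sketch (not the real {{FsVolumeImpl}} 
 code) of why constructing the volume from the base directory instead of its 
 {{current}} subdirectory shifts {{getBasePath}} up one level; the class and 
 directory names are illustrative only.
 {code}
 import java.io.File;

 // Simplified stand-in for FsVolumeImpl: the base path is derived by taking
 // the parent of the directory the volume object was constructed with.
 class VolumePathSketch {
   private final File currentDir;                 // expected to be <base>/current

   VolumePathSketch(File currentDir) { this.currentDir = currentDir; }

   String getBasePath() {
     return currentDir.getParent();               // parent of the "current" dir
   }

   public static void main(String[] args) {
     // Correct construction: pass the volume's current dir.
     System.out.println(new VolumePathSketch(new File("/foo/data0/current")).getBasePath());
     // prints /foo/data0 -> matches the configured volume, so removal works

     // Buggy construction: pass the base dir itself.
     System.out.println(new VolumePathSketch(new File("/foo/data0")).getBasePath());
     // prints /foo -> no configured volume matches, so removal fails
   }
 }
 {code}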



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7641) Update archival storage user doc for list/set/get block storage policies

2015-01-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285781#comment-14285781
 ] 

Hudson commented on HDFS-7641:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2031 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2031/])
HDFS-7641. Update archival storage user doc for list/set/get block storage 
policies. (yliu) (yliu: rev 889ab074d54bda643899cdfc2ded58625524465e)
* hadoop-hdfs-project/hadoop-hdfs/src/site/apt/ArchivalStorage.apt.vm
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


 Update archival storage user doc for list/set/get block storage policies
 

 Key: HDFS-7641
 URL: https://issues.apache.org/jira/browse/HDFS-7641
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.7.0
Reporter: Yi Liu
Assignee: Yi Liu
Priority: Minor
 Fix For: 2.7.0

 Attachments: HDFS-7641.001.patch, HDFS-7641.002.patch


 After HDFS-7323, the list/set/get block storage policies commands are 
 different, so we should update the corresponding user doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7406) SimpleHttpProxyHandler puts incorrect Connection: Close header

2015-01-21 Thread Rushabh S Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh S Shah updated HDFS-7406:
-
Hadoop Flags: Incompatible change, Reviewed  (was: Reviewed)

 SimpleHttpProxyHandler puts incorrect Connection: Close header
 

 Key: HDFS-7406
 URL: https://issues.apache.org/jira/browse/HDFS-7406
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Haohui Mai
Assignee: Haohui Mai
 Fix For: 2.7.0

 Attachments: HDFS-7406.000.patch


 tcpdump reveals that SimpleHttpProxyHandler puts incorrect values of the 
 {{Connection}} header:
 {noformat}
 Connection: io.netty.channel.ChannelFutureListener$1@36866933
 {noformat}
 which should be
 {noformat}
 Connection: close
 {noformat}
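
 A hedged illustration of the header bug described above, assuming Netty 4's 
 {{HttpHeaders.Names}}/{{Values}} constants; this is not the actual 
 {{SimpleHttpProxyHandler}} code. Passing {{ChannelFutureListener.CLOSE}} where a 
 header value is expected makes its {{toString()}} end up on the wire, whereas the 
 listener belongs on the write future and the header should carry the literal 
 {{close}}.
 {code}
 import io.netty.channel.ChannelFutureListener;
 import io.netty.channel.ChannelHandlerContext;
 import io.netty.handler.codec.http.DefaultHttpResponse;
 import io.netty.handler.codec.http.HttpResponse;
 import static io.netty.handler.codec.http.HttpHeaders.Names.CONNECTION;
 import static io.netty.handler.codec.http.HttpHeaders.Values.CLOSE;
 import static io.netty.handler.codec.http.HttpResponseStatus.OK;
 import static io.netty.handler.codec.http.HttpVersion.HTTP_1_1;

 class ConnectionHeaderSketch {
   // Buggy pattern: the listener object itself becomes the header value,
   // producing "Connection: io.netty.channel.ChannelFutureListener$1@...".
   static void buggy(ChannelHandlerContext ctx) {
     HttpResponse resp = new DefaultHttpResponse(HTTP_1_1, OK);
     resp.headers().set(CONNECTION, ChannelFutureListener.CLOSE);
     ctx.writeAndFlush(resp);
   }

   // Intended pattern: the header carries the literal "close" and the listener
   // is attached to the write future so the channel is closed afterwards.
   static void fixed(ChannelHandlerContext ctx) {
     HttpResponse resp = new DefaultHttpResponse(HTTP_1_1, OK);
     resp.headers().set(CONNECTION, CLOSE);
     ctx.writeAndFlush(resp).addListener(ChannelFutureListener.CLOSE);
   }
 }
 {code}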



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7548) Corrupt block reporting delayed until datablock scanner thread detects it

2015-01-21 Thread Rushabh S Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh S Shah updated HDFS-7548:
-
Status: Open  (was: Patch Available)

Cancelling the patch to address the findbugs warning.

 Corrupt block reporting delayed until datablock scanner thread detects it
 -

 Key: HDFS-7548
 URL: https://issues.apache.org/jira/browse/HDFS-7548
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Rushabh S Shah
Assignee: Rushabh S Shah
 Attachments: HDFS-7548-v2.patch, HDFS-7548-v3.patch, 
 HDFS-7548-v4.patch, HDFS-7548.patch


 When there is one datanode holding the block and that block happened to be
 corrupt, namenode would keep on trying to replicate the block repeatedly but 
 it would report the block as corrupt only when the data block scanner 
 thread of the datanode picks up this bad block.
 Requesting improvement in namenode reporting so that corrupt replica would be 
 reported when there is only 1 replica and the replication of that replica 
 keeps on failing with the checksum error.
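
 A rough sketch of the reporting change being asked for here; all names are 
 illustrative and this is not the actual {{BlockSender}}/{{DataNode}} API. The 
 idea is that a checksum failure while serving the only replica is reported to 
 the NameNode immediately rather than waiting for the periodic block scanner.
 {code}
 import java.io.IOException;

 class CorruptReplicaReportSketch {
   interface NameNodeClient { void reportBadBlock(String blockId) throws IOException; }
   interface BlockReader { void streamTo(Object peer) throws ChecksumMismatch; }
   static class ChecksumMismatch extends Exception {}

   void sendBlock(String blockId, BlockReader reader, NameNodeClient nn) throws IOException {
     try {
       reader.streamTo(null /* peer datanode */);
     } catch (ChecksumMismatch e) {
       // Report right away so the NameNode stops scheduling doomed replications.
       nn.reportBadBlock(blockId);
       throw new IOException("Replica " + blockId + " is corrupt", e);
     }
   }
 }
 {code}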



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7577) Add additional headers that are needed by Windows

2015-01-21 Thread Thanh Do (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thanh Do updated HDFS-7577:
---
Attachment: HDFS-7577-branch-HDFS-6994-1.patch

Good catch! {{os/windows/cpuid.h}} is x86-specific. Attached is another patch 
which addresses this.

 Add additional headers that are needed by Windows
 

 Key: HDFS-7577
 URL: https://issues.apache.org/jira/browse/HDFS-7577
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: hdfs-client
Reporter: Thanh Do
Assignee: Thanh Do
 Attachments: HDFS-7577-branch-HDFS-6994-0.patch, 
 HDFS-7577-branch-HDFS-6994-1.patch


 This jira involves adding a list of (mostly dummy) headers that are available in 
 POSIX systems but not in Windows. It is one step towards making libhdfs3 build on 
 Windows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7037) Using distcp to copy data from insecure to secure cluster via hftp doesn't work (branch-2 only)

2015-01-21 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285840#comment-14285840
 ] 

Harsh J commented on HDFS-7037:
---

FWIW, this patch is still required in order to get basically any post-security 
releases to copy from pre-security releases running today.
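
A hedged sketch of the behaviour such a patch aims for (illustrative only, not 
the actual {{HftpFileSystem}}/{{DelegationTokenFetcher}} code): when the remote 
side runs without security, the delegation token request fails, and the client 
should treat that as "no token needed" rather than aborting the copy.
{code}
import java.io.IOException;
import org.apache.hadoop.security.token.Token;

class InsecureRemoteTokenSketch {
  interface RemoteTokenFetcher {
    Token<?> fetch() throws IOException;   // talks to the remote (hftp) cluster
  }

  static Token<?> getTokenOrNull(RemoteTokenFetcher fetcher) {
    try {
      return fetcher.fetch();
    } catch (IOException e) {
      // The remote cluster is likely running without Kerberos; proceed tokenless.
      return null;
    }
  }
}
{code}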

 Using distcp to copy data from insecure to secure cluster via hftp doesn't 
 work  (branch-2 only)
 

 Key: HDFS-7037
 URL: https://issues.apache.org/jira/browse/HDFS-7037
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: security, tools
Affects Versions: 2.6.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
 Attachments: HDFS-7037.001.patch


 This is a branch-2 only issue since hftp is only supported there. 
 Issuing distcp hftp://insecureCluster hdfs://secureCluster gave the 
 following failure exception:
 {code}
 14/09/13 22:07:40 INFO tools.DelegationTokenFetcher: Error when dealing 
 remote token:
 java.io.IOException: Error when dealing remote token: Internal Server Error
   at 
 org.apache.hadoop.hdfs.tools.DelegationTokenFetcher.run(DelegationTokenFetcher.java:375)
   at 
 org.apache.hadoop.hdfs.tools.DelegationTokenFetcher.getDTfromRemote(DelegationTokenFetcher.java:238)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$2.run(HftpFileSystem.java:252)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$2.run(HftpFileSystem.java:247)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.getDelegationToken(HftpFileSystem.java:247)
   at 
 org.apache.hadoop.hdfs.web.TokenAspect.ensureTokenInitialized(TokenAspect.java:140)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.addDelegationTokenParam(HftpFileSystem.java:337)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.openConnection(HftpFileSystem.java:324)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:457)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.getFileStatus(HftpFileSystem.java:472)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.getFileStatus(HftpFileSystem.java:501)
   at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
   at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1623)
   at 
 org.apache.hadoop.tools.GlobbedCopyListing.doBuildListing(GlobbedCopyListing.java:77)
   at org.apache.hadoop.tools.CopyListing.buildListing(CopyListing.java:81)
   at 
 org.apache.hadoop.tools.DistCp.createInputFileListing(DistCp.java:342)
   at org.apache.hadoop.tools.DistCp.execute(DistCp.java:154)
   at org.apache.hadoop.tools.DistCp.run(DistCp.java:121)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
   at org.apache.hadoop.tools.DistCp.main(DistCp.java:390)
 14/09/13 22:07:40 WARN security.UserGroupInformation: 
 PriviledgedActionException as:hadoopu...@xyz.com (auth:KERBEROS) 
 cause:java.io.IOException: Unable to obtain remote token
 14/09/13 22:07:40 ERROR tools.DistCp: Exception encountered 
 java.io.IOException: Unable to obtain remote token
   at 
 org.apache.hadoop.hdfs.tools.DelegationTokenFetcher.getDTfromRemote(DelegationTokenFetcher.java:249)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$2.run(HftpFileSystem.java:252)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$2.run(HftpFileSystem.java:247)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.getDelegationToken(HftpFileSystem.java:247)
   at 
 org.apache.hadoop.hdfs.web.TokenAspect.ensureTokenInitialized(TokenAspect.java:140)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.addDelegationTokenParam(HftpFileSystem.java:337)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.openConnection(HftpFileSystem.java:324)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:457)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.getFileStatus(HftpFileSystem.java:472)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.getFileStatus(HftpFileSystem.java:501)
   at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
   at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1623)
   at 
 

[jira] [Commented] (HDFS-7037) Using distcp to copy data from insecure to secure cluster via hftp doesn't work (branch-2 only)

2015-01-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285848#comment-14285848
 ] 

Hadoop QA commented on HDFS-7037:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12668640/HDFS-7037.001.patch
  against trunk revision 6b17eb9.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9294//console

This message is automatically generated.

 Using distcp to copy data from insecure to secure cluster via hftp doesn't 
 work  (branch-2 only)
 

 Key: HDFS-7037
 URL: https://issues.apache.org/jira/browse/HDFS-7037
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: security, tools
Affects Versions: 2.6.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
 Attachments: HDFS-7037.001.patch


 This is a branch-2 only issue since hftp is only supported there. 
 Issuing distcp hftp://insecureCluster hdfs://secureCluster gave the 
 following failure exception:
 {code}
 14/09/13 22:07:40 INFO tools.DelegationTokenFetcher: Error when dealing 
 remote token:
 java.io.IOException: Error when dealing remote token: Internal Server Error
   at 
 org.apache.hadoop.hdfs.tools.DelegationTokenFetcher.run(DelegationTokenFetcher.java:375)
   at 
 org.apache.hadoop.hdfs.tools.DelegationTokenFetcher.getDTfromRemote(DelegationTokenFetcher.java:238)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$2.run(HftpFileSystem.java:252)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$2.run(HftpFileSystem.java:247)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.getDelegationToken(HftpFileSystem.java:247)
   at 
 org.apache.hadoop.hdfs.web.TokenAspect.ensureTokenInitialized(TokenAspect.java:140)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.addDelegationTokenParam(HftpFileSystem.java:337)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.openConnection(HftpFileSystem.java:324)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:457)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.getFileStatus(HftpFileSystem.java:472)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.getFileStatus(HftpFileSystem.java:501)
   at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
   at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1623)
   at 
 org.apache.hadoop.tools.GlobbedCopyListing.doBuildListing(GlobbedCopyListing.java:77)
   at org.apache.hadoop.tools.CopyListing.buildListing(CopyListing.java:81)
   at 
 org.apache.hadoop.tools.DistCp.createInputFileListing(DistCp.java:342)
   at org.apache.hadoop.tools.DistCp.execute(DistCp.java:154)
   at org.apache.hadoop.tools.DistCp.run(DistCp.java:121)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
   at org.apache.hadoop.tools.DistCp.main(DistCp.java:390)
 14/09/13 22:07:40 WARN security.UserGroupInformation: 
 PriviledgedActionException as:hadoopu...@xyz.com (auth:KERBEROS) 
 cause:java.io.IOException: Unable to obtain remote token
 14/09/13 22:07:40 ERROR tools.DistCp: Exception encountered 
 java.io.IOException: Unable to obtain remote token
   at 
 org.apache.hadoop.hdfs.tools.DelegationTokenFetcher.getDTfromRemote(DelegationTokenFetcher.java:249)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$2.run(HftpFileSystem.java:252)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$2.run(HftpFileSystem.java:247)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.getDelegationToken(HftpFileSystem.java:247)
   at 
 org.apache.hadoop.hdfs.web.TokenAspect.ensureTokenInitialized(TokenAspect.java:140)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.addDelegationTokenParam(HftpFileSystem.java:337)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem.openConnection(HftpFileSystem.java:324)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:457)
   at 
 org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.getFileStatus(HftpFileSystem.java:472)
   at 
 

[jira] [Commented] (HDFS-7496) Fix FsVolume removal race conditions on the DataNode by reference-counting the volume instances

2015-01-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285780#comment-14285780
 ] 

Hudson commented on HDFS-7496:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2031 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2031/])
HDFS-7496. Fix FsVolume removal race conditions on the DataNode by 
reference-counting the volume instances (lei via cmccabe) (cmccabe: rev 
b7f4a3156c0f5c600816c469637237ba6c9b330c)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/extdataset/ExternalVolumeImpl.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/RamDiskAsyncLazyPersistService.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBlockRecovery.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/extdataset/ExternalDatasetImpl.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestSimulatedFSDataset.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/ReplicaHandler.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeList.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/proto/datatransfer.proto
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/FsDatasetSpi.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/FsVolumeReference.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetAsyncDiskService.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeListTest.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestWriteToReplica.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockSender.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/ReplicaInputStreams.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/FsVolumeSpi.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/SimulatedFSDataset.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeImpl.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDirectoryScanner.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestWriteBlockGetsBlockLengthHint.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeHotSwapVolumes.java
HDFS-7496: add to CHANGES.txt (cmccabe: rev 
73b72a048f70c275051747d13dc948845f4cef17)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


 Fix FsVolume removal race conditions on the DataNode by reference-counting 
 the volume instances
 ---

 Key: HDFS-7496
 URL: https://issues.apache.org/jira/browse/HDFS-7496
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
Assignee: Lei (Eddy) Xu
 Fix For: 3.0.0

 Attachments: HDFS-7496.000.patch, HDFS-7496.001.patch, 
 HDFS-7496.002.patch, HDFS-7496.003.patch, HDFS-7496.003.patch, 
 HDFS-7496.004.patch, HDFS-7496.005.patch, HDFS-7496.006.patch, 
 HDFS-7496.007.patch


 We discussed a few FsVolume removal race conditions on the DataNode in 
 HDFS-7489.  We should figure out a way to make removing an FsVolume safe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7548) Corrupt block reporting delayed until datablock scanner thread detects it

2015-01-21 Thread Rushabh S Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh S Shah updated HDFS-7548:
-
Status: Patch Available  (was: Open)

 Corrupt block reporting delayed until datablock scanner thread detects it
 -

 Key: HDFS-7548
 URL: https://issues.apache.org/jira/browse/HDFS-7548
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Rushabh S Shah
Assignee: Rushabh S Shah
 Attachments: HDFS-7548-v2.patch, HDFS-7548-v3.patch, 
 HDFS-7548-v4.patch, HDFS-7548-v5.patch, HDFS-7548.patch


 When there is one datanode holding the block and that block happened to be
 corrupt, namenode would keep on trying to replicate the block repeatedly but 
 it would report the block as corrupt only when the data block scanner 
 thread of the datanode picks up this bad block.
 Requesting improvement in namenode reporting so that corrupt replica would be 
 reported when there is only 1 replica and the replication of that replica 
 keeps on failing with the checksum error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7548) Corrupt block reporting delayed until datablock scanner thread detects it

2015-01-21 Thread Rushabh S Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh S Shah updated HDFS-7548:
-
Attachment: HDFS-7548-v5.patch

Attaching a new patch to address the findbugs warning.
Added the synchronized keyword to the newly added method.

 Corrupt block reporting delayed until datablock scanner thread detects it
 -

 Key: HDFS-7548
 URL: https://issues.apache.org/jira/browse/HDFS-7548
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Rushabh S Shah
Assignee: Rushabh S Shah
 Attachments: HDFS-7548-v2.patch, HDFS-7548-v3.patch, 
 HDFS-7548-v4.patch, HDFS-7548-v5.patch, HDFS-7548.patch


 When there is one datanode holding the block and that block happened to be
 corrupt, namenode would keep on trying to replicate the block repeatedly but 
 it would report the block as corrupt only when the data block scanner 
 thread of the datanode picks up this bad block.
 Requesting improvement in namenode reporting so that corrupt replica would be 
 reported when there is only 1 replica and the replication of that replica 
 keeps on failing with the checksum error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-4929) [NNBench mark] Lease mismatch error when running with multiple mappers

2015-01-21 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula updated HDFS-4929:
---
Attachment: HDFS4929.patch

 [NNBench mark] Lease mismatch error when running with multiple mappers
 --

 Key: HDFS-4929
 URL: https://issues.apache.org/jira/browse/HDFS-4929
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: benchmarks
Reporter: Brahma Reddy Battula
Assignee: Brahma Reddy Battula
Priority: Critical
 Attachments: HDFS4929.patch


 Command :
 ./yarn jar 
 ../share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.1-tests.jar 
 nnbench -operation create_write -numberOfFiles 1000 -blockSize 268435456 
 -bytesToWrite 102400 -baseDir /benchmarks/NNBench`hostname -s` 
 -replicationFactorPerFile 3 -maps 100 -reduces 10
 Trace :
 2013-06-21 10:44:53,763 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
 7 on 9005, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 
 192.168.105.214:36320: error: 
 org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: Lease mismatch 
 on /benchmarks/NNBenchlinux-185/data/file_linux-214__0 owned by 
 DFSClient_attempt_1371782327901_0001_m_48_0_1383437860_1 but is accessed 
 by DFSClient_attempt_1371782327901_0001_m_84_0_1880545303_1
 org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: Lease mismatch 
 on /benchmarks/NNBenchlinux-185/data/file_linux-214__0 owned by 
 DFSClient_attempt_1371782327901_0001_m_48_0_1383437860_1 but is accessed 
 by DFSClient_attempt_1371782327901_0001_m_84_0_1880545303_1
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2351)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2098)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2019)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:501)
   at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:213)
   at 
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:52012)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:435)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:925)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1710)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1706)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7594) Add isFileClosed and IsInSafeMode APIs in o.a.h.hdfs.client.HdfsAdmin

2015-01-21 Thread Uma Maheswara Rao G (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uma Maheswara Rao G updated HDFS-7594:
--
Attachment: HDFS-7594.patch

Attached a simple patch. Since this is just API delegation, I did not add any new 
tests.

 Add isFileClosed and IsInSafeMode APIs in o.a.h.hdfs.client.HdfsAdmin
 -

 Key: HDFS-7594
 URL: https://issues.apache.org/jira/browse/HDFS-7594
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs-client
Affects Versions: 3.0.0, 2.6.0
Reporter: Uma Maheswara Rao G
Assignee: Uma Maheswara Rao G
 Attachments: HDFS-7594.patch


 DistributedFileSystem has exposed isFileClosed and isInSafeMode.
 Applications like HBase use these APIs directly from the 
 DistributedFileSystem class. The public HdfsAdmin class was added partly to 
 expose APIs from DistributedFileSystem/DFSClient to such 
 applications.
 So it would be nice to add these two APIs to that public class as well.
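
 A minimal sketch of what such delegation could look like, assuming the existing 
 {{DistributedFileSystem#isFileClosed}} and {{DistributedFileSystem#isInSafeMode}} 
 methods mentioned above; the wrapper class and method signatures here are 
 illustrative and may differ from the actual HdfsAdmin patch.
 {code}
 import java.io.IOException;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hdfs.DistributedFileSystem;

 // Pure delegation: the admin facade just forwards to DistributedFileSystem.
 class HdfsAdminSketch {
   private final DistributedFileSystem dfs;

   HdfsAdminSketch(DistributedFileSystem dfs) { this.dfs = dfs; }

   boolean isFileClosed(Path path) throws IOException {
     return dfs.isFileClosed(path);
   }

   boolean isInSafeMode() throws IOException {
     return dfs.isInSafeMode();
   }
 }
 {code}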



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-4929) [NNBench mark] Lease mismatch error when running with multiple mappers

2015-01-21 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula updated HDFS-4929:
---
Status: Patch Available  (was: Open)

Attached a patch with my proposal (having a unique name for each file).
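
A small sketch of the "unique name per file" idea (illustrative only; the actual 
NNBench patch may derive the suffix differently). Giving each mapper its own 
file name means two DFSClients never hold a lease on the same path, which avoids 
the lease mismatch shown below.
{code}
import java.net.InetAddress;
import java.util.UUID;
import org.apache.hadoop.fs.Path;

class UniqueNNBenchName {
  // host + task id + random suffix -> no two mappers collide on a path
  static Path uniqueFile(Path baseDir, int taskId) throws Exception {
    String host = InetAddress.getLocalHost().getHostName();
    String name = String.format("file_%s_%d_%s", host, taskId,
        UUID.randomUUID().toString().substring(0, 8));
    return new Path(baseDir, name);
  }
}
{code}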

 [NNBench mark] Lease mismatch error when running with multiple mappers
 --

 Key: HDFS-4929
 URL: https://issues.apache.org/jira/browse/HDFS-4929
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: benchmarks
Reporter: Brahma Reddy Battula
Assignee: Brahma Reddy Battula
Priority: Critical
 Attachments: HDFS4929.patch


 Command :
 ./yarn jar 
 ../share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.1-tests.jar 
 nnbench -operation create_write -numberOfFiles 1000 -blockSize 268435456 
 -bytesToWrite 102400 -baseDir /benchmarks/NNBench`hostname -s` 
 -replicationFactorPerFile 3 -maps 100 -reduces 10
 Trace :
 2013-06-21 10:44:53,763 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
 7 on 9005, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 
 192.168.105.214:36320: error: 
 org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: Lease mismatch 
 on /benchmarks/NNBenchlinux-185/data/file_linux-214__0 owned by 
 DFSClient_attempt_1371782327901_0001_m_48_0_1383437860_1 but is accessed 
 by DFSClient_attempt_1371782327901_0001_m_84_0_1880545303_1
 org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: Lease mismatch 
 on /benchmarks/NNBenchlinux-185/data/file_linux-214__0 owned by 
 DFSClient_attempt_1371782327901_0001_m_48_0_1383437860_1 but is accessed 
 by DFSClient_attempt_1371782327901_0001_m_84_0_1880545303_1
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2351)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2098)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2019)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:501)
   at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:213)
   at 
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:52012)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:435)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:925)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1710)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1706)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-5183) Combine ReplicaPlacementPolicy with VolumeChoosingPolicy together to have a global view in choosing DN storage for replica.

2015-01-21 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286183#comment-14286183
 ] 

Arpit Agarwal commented on HDFS-5183:
-

Also, thanks for filing this and for the suggested solutions, [~djp] and [~sirianni].

This is another of the issues that got pushed out as part of Heterogeneous 
Storages phase-2 work and was later covered under the Archival Storage feature.

 Combine ReplicaPlacementPolicy with VolumeChoosingPolicy together to have a 
 global view in choosing DN storage for replica.
 ---

 Key: HDFS-5183
 URL: https://issues.apache.org/jira/browse/HDFS-5183
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, namenode, performance
Affects Versions: Heterogeneous Storage (HDFS-2832)
Reporter: Junping Du

 Per discussion in HDFS-5157, There are two different ways to handle 
 BlockPlacementPolicy and ReplicaChoosingPolicy in case of multiple storage 
 types:
  1. Client specifies the required storage type when calling addBlock(..) to 
 NN. BlockPlacementPolicy in NN chooses a set of datanodes accounting for the 
 storage type. Then, client passes the required storage type to the datanode 
 set and each datanode chooses a particular storage using a 
 VolumeChoosingPolicy.
  2. Same as before, client specifies the required storage type when calling 
 addBlock(..) to NN. Now, BlockPlacementPolicy in NN chooses a set of storages 
 (instead of datanodes). Then, client writes to the corresponding storages. 
 VolumeChoosingPolicy is no longer needed and it should be removed.
 We think #2 is more powerful as it will bring global view to volume choosing 
 or bring storage status into consideration in replica choosing, so we propose 
 to combine two polices together.
 One concern here is it may increase the load of NameNode as previously volume 
 choosing is decided by DN. We may verify it later (that's why I put 
 performance in component).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7353) Raw Erasure Coder API for concrete encoding and decoding

2015-01-21 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286264#comment-14286264
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7353:
---

 ... has potential to contain more APIs some of which I'm going to add.

Do you have something in mind?

 Raw Erasure Coder API for concrete encoding and decoding
 

 Key: HDFS-7353
 URL: https://issues.apache.org/jira/browse/HDFS-7353
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Kai Zheng
Assignee: Kai Zheng
 Fix For: HDFS-EC

 Attachments: HDFS-7353-v1.patch


 This is to abstract and define a raw erasure coder API across different coding 
 algorithms such as RS, XOR, etc. Such an API can be implemented by utilizing 
 various library support, such as the Intel ISA library and the Jerasure library.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6833) DirectoryScanner should not register a deleting block with memory of DataNode

2015-01-21 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286390#comment-14286390
 ] 

Tsz Wo Nicholas Sze commented on HDFS-6833:
---

Patch looks good in general.  Some comments:
- Do not add removeDeletedBlocks to FsDatasetSpi.  FsDatasetAsyncDiskService is 
a part of the FsDataset implementation.  Simply pass FsDatasetImpl in the 
constructor. 

- The ReplicaInfo added to the new deletingBlock ReplicaMap is never used.  How 
about simply using Map<String, Set<Long>>?  (See the sketch after these comments.)

- In FsDatasetAsyncDiskService.updateDeletedBlockId(..), use entrySet() to 
avoid multiple lookups.

- Check if debug is enabled before calling LOG.debug(..).

- The patch does not apply anymore.  Need to update it.
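
A minimal sketch of the Map<String, Set<Long>> bookkeeping suggested above (field 
and method names are illustrative, not the actual FsDatasetImpl members): blocks 
handed to the async deletion service are remembered per block pool so the 
DirectoryScanner can skip them instead of re-adding them to memory.
{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class DeletingBlockTracker {
  private final Map<String, Set<Long>> deletingBlocks = new HashMap<>();

  synchronized void markDeleting(String bpid, long blockId) {
    Set<Long> ids = deletingBlocks.get(bpid);
    if (ids == null) {
      ids = new HashSet<>();
      deletingBlocks.put(bpid, ids);
    }
    ids.add(blockId);
  }

  // DirectoryScanner consults this before registering an on-disk block.
  synchronized boolean isDeleting(String bpid, long blockId) {
    Set<Long> ids = deletingBlocks.get(bpid);
    return ids != null && ids.contains(blockId);
  }

  // Called after the block file has actually been removed from disk.
  synchronized void removeDeleted(String bpid, long blockId) {
    Set<Long> ids = deletingBlocks.get(bpid);
    if (ids != null) {
      ids.remove(blockId);
      if (ids.isEmpty()) {
        deletingBlocks.remove(bpid);
      }
    }
  }
}
{code}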


 DirectoryScanner should not register a deleting block with memory of DataNode
 -

 Key: HDFS-6833
 URL: https://issues.apache.org/jira/browse/HDFS-6833
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 3.0.0, 2.5.0, 2.5.1
Reporter: Shinichi Yamashita
Assignee: Shinichi Yamashita
Priority: Critical
 Attachments: HDFS-6833-10.patch, HDFS-6833-11.patch, 
 HDFS-6833-12.patch, HDFS-6833-13.patch, HDFS-6833-14.patch, 
 HDFS-6833-6-2.patch, HDFS-6833-6-3.patch, HDFS-6833-6.patch, 
 HDFS-6833-7-2.patch, HDFS-6833-7.patch, HDFS-6833.8.patch, HDFS-6833.9.patch, 
 HDFS-6833.patch, HDFS-6833.patch, HDFS-6833.patch, HDFS-6833.patch, 
 HDFS-6833.patch


 When a block is deleted in DataNode, the following messages are usually 
 output.
 {code}
 2014-08-07 17:53:11,606 INFO 
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
  Scheduling blk_1073741825_1001 file 
 /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
  for deletion
 2014-08-07 17:53:11,617 INFO 
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
  Deleted BP-1887080305-172.28.0.101-1407398838872 blk_1073741825_1001 file 
 /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
 {code}
 However, DirectoryScanner may be executed when DataNode deletes the block in 
 the current implementation. And the following messages are output.
 {code}
 2014-08-07 17:53:30,519 INFO 
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
  Scheduling blk_1073741825_1001 file 
 /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
  for deletion
 2014-08-07 17:53:31,426 INFO 
 org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
 BP-1887080305-172.28.0.101-1407398838872 Total blocks: 1, missing metadata 
 files:0, missing block files:0, missing blocks in memory:1, mismatched 
 blocks:0
 2014-08-07 17:53:31,426 WARN 
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Added 
 missing block to memory FinalizedReplica, blk_1073741825_1001, FINALIZED
   getNumBytes() = 21230663
   getBytesOnDisk()  = 21230663
   getVisibleLength()= 21230663
   getVolume()   = /hadoop/data1/dfs/data/current
   getBlockFile()= 
 /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
   unlinked  =false
 2014-08-07 17:53:31,531 INFO 
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
  Deleted BP-1887080305-172.28.0.101-1407398838872 blk_1073741825_1001 file 
 /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
 {code}
 The deleting block's information is thus registered in the DataNode's memory, 
 and when the DataNode sends a block report, the NameNode receives wrong block 
 information.
 For example, when we run recommissioning or change the replication factor, the 
 NameNode may delete the valid block as ExcessReplicate because of this 
 problem, and under-replicated and missing blocks occur.
 When the DataNode runs DirectoryScanner, it should not register a block that is 
 being deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7548) Corrupt block reporting delayed until datablock scanner thread detects it

2015-01-21 Thread Rushabh S Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286224#comment-14286224
 ] 

Rushabh S Shah commented on HDFS-7548:
--

[~daryn] and [~kihwal]  Thanks for the comments and review.

 Corrupt block reporting delayed until datablock scanner thread detects it
 -

 Key: HDFS-7548
 URL: https://issues.apache.org/jira/browse/HDFS-7548
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Rushabh S Shah
Assignee: Rushabh S Shah
 Attachments: HDFS-7548-v2.patch, HDFS-7548-v3.patch, 
 HDFS-7548-v4.patch, HDFS-7548-v5.patch, HDFS-7548.patch


 When there is one datanode holding the block and that block happened to be
 corrupt, namenode would keep on trying to replicate the block repeatedly but 
 it would report the block as corrupt only when the data block scanner 
 thread of the datanode picks up this bad block.
 Requesting improvement in namenode reporting so that corrupt replica would be 
 reported when there is only 1 replica and the replication of that replica 
 keeps on failing with the checksum error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7548) Corrupt block reporting delayed until datablock scanner thread detects it

2015-01-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286238#comment-14286238
 ] 

Hudson commented on HDFS-7548:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6906 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6906/])
HDFS-7548. Corrupt block reporting delayed until datablock scanner thread 
detects it. Contributed by Rushabh Shah. (kihwal: rev 
c0af72c7f74b6925786e24543cac433b906dd6d3)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataBlockScanner.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockSender.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeBlockScanner.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceScanner.java


 Corrupt block reporting delayed until datablock scanner thread detects it
 -

 Key: HDFS-7548
 URL: https://issues.apache.org/jira/browse/HDFS-7548
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Rushabh S Shah
Assignee: Rushabh S Shah
 Fix For: 2.7.0

 Attachments: HDFS-7548-v2.patch, HDFS-7548-v3.patch, 
 HDFS-7548-v4.patch, HDFS-7548-v5.patch, HDFS-7548.patch


 When there is one datanode holding the block and that block happened to be
 corrupt, namenode would keep on trying to replicate the block repeatedly but 
 it would report the block as corrupt only when the data block scanner 
 thread of the datanode picks up this bad block.
 Requesting improvement in namenode reporting so that corrupt replica would be 
 reported when there is only 1 replica and the replication of that replica 
 keeps on failing with the checksum error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-5183) Combine ReplicaPlacementPolicy with VolumeChoosingPolicy together to have a global view in choosing DN storage for replica.

2015-01-21 Thread Arpit Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Agarwal resolved HDFS-5183.
-
  Resolution: Implemented
Hadoop Flags:   (was: Incompatible change)

Resolving this as Implemented.

As part of HDFS-6584 the first of your two approaches was chosen.
bq. 1. Client specifies the required storage type when calling addBlock(..) to 
NN. BlockPlacementPolicy in NN chooses a set of datanodes accounting for the 
storage type. Then, client passes the required storage type to the datanode set 
and each datanode chooses a particular storage using a VolumeChoosingPolicy.


 Combine ReplicaPlacementPolicy with VolumeChoosingPolicy together to have a 
 global view in choosing DN storage for replica.
 ---

 Key: HDFS-5183
 URL: https://issues.apache.org/jira/browse/HDFS-5183
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, namenode, performance
Affects Versions: Heterogeneous Storage (HDFS-2832)
Reporter: Junping Du

 Per discussion in HDFS-5157, There are two different ways to handle 
 BlockPlacementPolicy and ReplicaChoosingPolicy in case of multiple storage 
 types:
  1. Client specifies the required storage type when calling addBlock(..) to 
 NN. BlockPlacementPolicy in NN chooses a set of datanodes accounting for the 
 storage type. Then, client passes the required storage type to the datanode 
 set and each datanode chooses a particular storage using a 
 VolumeChoosingPolicy.
  2. Same as before, client specifies the required storage type when calling 
 addBlock(..) to NN. Now, BlockPlacementPolicy in NN chooses a set of storages 
 (instead of datanodes). Then, client writes to the corresponding storages. 
 VolumeChoosingPolicy is no longer needed and it should be removed.
 We think #2 is more powerful as it will bring global view to volume choosing 
 or bring storage status into consideration in replica choosing, so we propose 
 to combine two polices together.
 One concern here is it may increase the load of NameNode as previously volume 
 choosing is decided by DN. We may verify it later (that's why I put 
 performance in component).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832

2015-01-21 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286230#comment-14286230
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7575:
---

 Can you respond to the practical points I made above?

If there is no layout format change, the practical points seem irrelevant.  
Anyway, let me comment on them.

 Changing the storage ID during startup basically changes storage ID from 
 being a permanent identifier to a temporary one... 

We only change a storage ID when it is invalid; we do not change storage IDs 
arbitrarily.  Valid storage IDs are permanent.

 With approach #1, we have to carry the burden of the dedupe code forever.

The code is for validating storage IDs (but not for de-duplication) and is very 
simple.  It is good to keep.

 ... If you somehow end up with two volumes that map to the same directory, 
 the code silently does the wrong thing.

Is this a practical error?  Have you seen it in practice?
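
For concreteness, a sketch of the kind of validation being described, with 
made-up method names and a simplified storage-directory representation (the real 
DataStorage code differs): valid, unique IDs are left untouched and only missing, 
empty, or duplicated ones are regenerated.
{code}
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

class StorageIdValidator {
  // Each element is {storageDir, storageId}; returns how many IDs were rewritten.
  static int fixStorageIds(List<String[]> storageDirs) {
    Set<String> seen = new HashSet<>();
    int changed = 0;
    for (String[] dir : storageDirs) {
      String id = dir[1];
      boolean invalid = id == null || id.isEmpty() || !seen.add(id);
      if (invalid) {
        dir[1] = "DS-" + UUID.randomUUID();   // new unique ID for this directory
        seen.add(dir[1]);
        changed++;
      }
    }
    return changed;
  }
}
{code}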

 NameNode not handling heartbeats properly after HDFS-2832
 -

 Key: HDFS-7575
 URL: https://issues.apache.org/jira/browse/HDFS-7575
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0, 2.5.0, 2.6.0
Reporter: Lars Francke
Assignee: Arpit Agarwal
Priority: Critical
 Attachments: HDFS-7575.01.patch, HDFS-7575.02.patch, 
 HDFS-7575.03.binary.patch, HDFS-7575.03.patch, HDFS-7575.04.binary.patch, 
 HDFS-7575.04.patch, HDFS-7575.05.binary.patch, HDFS-7575.05.patch, 
 testUpgrade22via24GeneratesStorageIDs.tgz, 
 testUpgradeFrom22GeneratesStorageIDs.tgz, 
 testUpgradeFrom24PreservesStorageId.tgz


 Before HDFS-2832 each DataNode would have a unique storageId which included 
 its IP address. Since HDFS-2832 the DataNodes have a unique storageId per 
 storage directory which is just a random UUID.
 They send reports per storage directory in their heartbeats. This heartbeat 
 is processed on the NameNode in the 
 {{DatanodeDescriptor#updateHeartbeatState}} method. Pre HDFS-2832 this would 
 just store the information per Datanode. After the patch though each DataNode 
 can have multiple different storages so it's stored in a map keyed by the 
 storage Id.
 This works fine for all clusters that have been installed post HDFS-2832 as 
 they get a UUID for their storage Id. So a DN with 8 drives has a map with 8 
 different keys. On each Heartbeat the Map is searched and updated 
 ({{DatanodeStorageInfo storage = storageMap.get(s.getStorageID());}}):
 {code:title=DatanodeStorageInfo}
   void updateState(StorageReport r) {
 capacity = r.getCapacity();
 dfsUsed = r.getDfsUsed();
 remaining = r.getRemaining();
 blockPoolUsed = r.getBlockPoolUsed();
   }
 {code}
 On clusters that were upgraded from a pre HDFS-2832 version though the 
 storage Id has not been rewritten (at least not on the four clusters I 
 checked) so each directory will have the exact same storageId. That means 
 there'll be only a single entry in the {{storageMap}} and it'll be 
 overwritten by a random {{StorageReport}} from the DataNode. This can be seen 
 in the {{updateState}} method above. This just assigns the capacity from the 
 received report; instead, it should probably sum it up per received heartbeat.
 The Balancer seems to be one of the only things that actually uses this 
 information so it now considers the utilization of a random drive per 
 DataNode for balancing purposes.
 Things get even worse when a drive has been added or replaced as this will 
 now get a new storage Id so there'll be two entries in the storageMap. As new 
 drives are usually empty, it skews the balancer's decision in a way that this 
 node will never be considered over-utilized.
 Another problem is that old StorageReports are never removed from the 
 storageMap. So if I replace a drive and it gets a new storage Id the old one 
 will still be in place and used for all calculations by the Balancer until a 
 restart of the NameNode.
 I can try providing a patch that does the following:
 * Instead of using a Map I could just store the array we receive or instead 
 of storing an array sum up the values for reports with the same Id
 * On each heartbeat clear the map (so we know we have up to date information)
 Does that sound sensible?
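
 A sketch of the "sum instead of overwrite" option from the proposal above (class 
 and field names are simplified, not the real DatanodeDescriptor/DatanodeStorageInfo): 
 aggregating every report in a heartbeat keeps duplicate storage IDs from silently 
 overwriting each other.
 {code}
 class HeartbeatTotalsSketch {
   static class Report {
     final String storageId;
     final long capacity, dfsUsed, remaining;
     Report(String id, long cap, long used, long rem) {
       storageId = id; capacity = cap; dfsUsed = used; remaining = rem;
     }
   }

   long totalCapacity, totalDfsUsed, totalRemaining;

   // Rebuild the totals from scratch on every heartbeat, summing all reports,
   // so a report with a colliding storage ID still contributes its numbers.
   void updateHeartbeat(Report[] reports) {
     long cap = 0, used = 0, rem = 0;
     for (Report r : reports) {
       cap += r.capacity;
       used += r.dfsUsed;
       rem += r.remaining;
     }
     totalCapacity = cap;
     totalDfsUsed = used;
     totalRemaining = rem;
   }
 }
 {code}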



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7652) Process block reports for erasure coded blocks

2015-01-21 Thread Zhe Zhang (JIRA)
Zhe Zhang created HDFS-7652:
---

 Summary: Process block reports for erasure coded blocks
 Key: HDFS-7652
 URL: https://issues.apache.org/jira/browse/HDFS-7652
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Zhe Zhang
Assignee: Zhe Zhang


HDFS-7339 adds support in NameNode for persisting block groups. For memory 
efficiency, erasure coded blocks under the striping layout are not stored in 
{{BlockManager#blocksMap}}. Instead, entire block groups are stored in 
{{BlockGroupManager#blockGroups}}. When a block report arrives from the 
DataNode, it should be processed under the block group that it belongs to. The 
following naming protocol is used to calculate the group of a given block:
{code}
 * HDFS-EC introduces a hierarchical protocol to name blocks and groups:
 * Contiguous: {reserved block IDs | flag | block ID}
 * Striped: {reserved block IDs | flag | block group ID | index in group}
 *
 * Following n bits of reserved block IDs, The (n+1)th bit in an ID
 * distinguishes contiguous (0) and striped (1) blocks. For a striped block,
 * bits (n+2) to (64-m) represent the ID of its block group, while the last m
 * bits represent its index of the group. The value m is determined by the
 * maximum number of blocks in a group (MAX_BLOCKS_IN_GROUP).
{code}
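
A small sketch of decoding that layout (the constants here are assumptions for 
illustration, e.g. m = 4 so MAX_BLOCKS_IN_GROUP = 16; the actual HDFS-EC masks 
may differ): the group ID is obtained by clearing the low m index bits of a 
striped block's ID.
{code}
class BlockGroupIdSketch {
  static final int INDEX_BITS = 4;                        // m
  static final long INDEX_MASK = (1L << INDEX_BITS) - 1;  // low m bits

  static long groupIdOf(long blockId) {
    return blockId & ~INDEX_MASK;      // clear the index bits -> block group ID
  }

  static int indexInGroup(long blockId) {
    return (int) (blockId & INDEX_MASK);
  }

  static boolean isStriped(long blockId, long stripedFlagMask) {
    // The caller supplies the mask for the (n+1)th "striped" flag bit.
    return (blockId & stripedFlagMask) != 0;
  }
}
{code}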



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7430) Refactor the BlockScanner to use O(1) memory and use multiple threads

2015-01-21 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286408#comment-14286408
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7430:
---

It seems that the entire BlockScanner is rewritten.  This is not a code 
refactoring.  Is that correct?

If yes, how about making it configurable so that it is possible to use the old 
scanner?  The new scanner will need some time to stabilize.

 Refactor the BlockScanner to use O(1) memory and use multiple threads
 -

 Key: HDFS-7430
 URL: https://issues.apache.org/jira/browse/HDFS-7430
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-7430.002.patch, HDFS-7430.003.patch, 
 HDFS-7430.004.patch, HDFS-7430.005.patch, HDFS-7430.006.patch, 
 HDFS-7430.007.patch, HDFS-7430.008.patch, HDFS-7430.009.patch, 
 HDFS-7430.010.patch, HDFS-7430.011.patch, memory.png


 We should update the BlockScanner to use a constant amount of memory by 
 keeping track of what block was scanned last, rather than by tracking the 
 scan status of all blocks in memory.  Also, instead of having just one 
 thread, we should have a verification thread per hard disk (or other volume), 
 scanning at a configurable rate of bytes per second.
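
 A hedged sketch of that scanning model (names are illustrative, not the actual 
 patch): one scanner thread per volume, a single cursor recording the last block 
 scanned, and a crude bytes-per-second throttle.
 {code}
 class VolumeScannerSketch implements Runnable {
   interface Volume {
     String nextBlockAfter(String cursor);   // null means wrap to the beginning
     long scanBlock(String blockId);         // verifies checksums, returns bytes read
   }

   private final Volume volume;
   private final long bytesPerSecond;
   private volatile String cursor;           // the only per-volume scan state kept

   VolumeScannerSketch(Volume volume, long bytesPerSecond) {
     this.volume = volume;
     this.bytesPerSecond = bytesPerSecond;
   }

   @Override
   public void run() {
     try {
       while (!Thread.currentThread().isInterrupted()) {
         cursor = volume.nextBlockAfter(cursor);
         if (cursor == null) {               // volume empty or wrapped around
           Thread.sleep(1000);
           continue;
         }
         long start = System.nanoTime();
         long bytes = volume.scanBlock(cursor);
         // Sleep long enough that the average rate stays <= bytesPerSecond.
         long minNanos = bytes * 1_000_000_000L / Math.max(1, bytesPerSecond);
         long sleepNanos = minNanos - (System.nanoTime() - start);
         if (sleepNanos > 0) {
           Thread.sleep(sleepNanos / 1_000_000L, (int) (sleepNanos % 1_000_000L));
         }
       }
     } catch (InterruptedException e) {
       Thread.currentThread().interrupt();   // exit cleanly on shutdown
     }
   }
 }
 {code}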



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832

2015-01-21 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286217#comment-14286217
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7575:
---

 This patch does change the layout format. It changes it from one where 
 storage ID may or may not be unique to one where it definitely is.

So, you claim that the current format is a layout, where some storage IDs could 
be the same?
{code}
ADD_DATANODE_AND_STORAGE_UUIDS(-49, "Replace StorageID with DatanodeUuid."
    + " Use distinct StorageUuid per storage directory."),
{code}
It is clearly specified in the LV -49 that the IDs must be distinct.

 NameNode not handling heartbeats properly after HDFS-2832
 -

 Key: HDFS-7575
 URL: https://issues.apache.org/jira/browse/HDFS-7575
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0, 2.5.0, 2.6.0
Reporter: Lars Francke
Assignee: Arpit Agarwal
Priority: Critical
 Attachments: HDFS-7575.01.patch, HDFS-7575.02.patch, 
 HDFS-7575.03.binary.patch, HDFS-7575.03.patch, HDFS-7575.04.binary.patch, 
 HDFS-7575.04.patch, HDFS-7575.05.binary.patch, HDFS-7575.05.patch, 
 testUpgrade22via24GeneratesStorageIDs.tgz, 
 testUpgradeFrom22GeneratesStorageIDs.tgz, 
 testUpgradeFrom24PreservesStorageId.tgz


 Before HDFS-2832 each DataNode would have a unique storageId which included 
 its IP address. Since HDFS-2832 the DataNodes have a unique storageId per 
 storage directory which is just a random UUID.
 They send reports per storage directory in their heartbeats. This heartbeat 
 is processed on the NameNode in the 
 {{DatanodeDescriptor#updateHeartbeatState}} method. Pre HDFS-2832 this would 
 just store the information per Datanode. After the patch though each DataNode 
 can have multiple different storages so it's stored in a map keyed by the 
 storage Id.
 This works fine for all clusters that have been installed post HDFS-2832 as 
 they get a UUID for their storage Id. So a DN with 8 drives has a map with 8 
 different keys. On each Heartbeat the Map is searched and updated 
 ({{DatanodeStorageInfo storage = storageMap.get(s.getStorageID());}}):
 {code:title=DatanodeStorageInfo}
   void updateState(StorageReport r) {
 capacity = r.getCapacity();
 dfsUsed = r.getDfsUsed();
 remaining = r.getRemaining();
 blockPoolUsed = r.getBlockPoolUsed();
   }
 {code}
 On clusters that were upgraded from a pre HDFS-2832 version though the 
 storage Id has not been rewritten (at least not on the four clusters I 
 checked) so each directory will have the exact same storageId. That means 
 there'll be only a single entry in the {{storageMap}} and it'll be 
 overwritten by a random {{StorageReport}} from the DataNode. This can be seen 
 in the {{updateState}} method above. This just assigns the capacity from the 
 received report; instead, it should probably sum it up per received heartbeat.
 The Balancer seems to be one of the only things that actually uses this 
 information so it now considers the utilization of a random drive per 
 DataNode for balancing purposes.
 Things get even worse when a drive has been added or replaced as this will 
 now get a new storage Id so there'll be two entries in the storageMap. As new 
 drives are usually empty, it skews the balancer's decision in a way that this 
 node will never be considered over-utilized.
 Another problem is that old StorageReports are never removed from the 
 storageMap. So if I replace a drive and it gets a new storage Id the old one 
 will still be in place and used for all calculations by the Balancer until a 
 restart of the NameNode.
 I can try providing a patch that does the following:
 * Instead of using a Map I could just store the array we receive or instead 
 of storing an array sum up the values for reports with the same Id
 * On each heartbeat clear the map (so we know we have up to date information)
 Does that sound sensible?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7548) Corrupt block reporting delayed until datablock scanner thread detects it

2015-01-21 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-7548:
-
   Resolution: Fixed
Fix Version/s: 2.7.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

Thanks for working on the bug, Rushabh. I've committed this to trunk and 
branch-2.

 Corrupt block reporting delayed until datablock scanner thread detects it
 -

 Key: HDFS-7548
 URL: https://issues.apache.org/jira/browse/HDFS-7548
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Rushabh S Shah
Assignee: Rushabh S Shah
 Fix For: 2.7.0

 Attachments: HDFS-7548-v2.patch, HDFS-7548-v3.patch, 
 HDFS-7548-v4.patch, HDFS-7548-v5.patch, HDFS-7548.patch


 When there is one datanode holding the block and that block happened to be
 corrupt, namenode would keep on trying to replicate the block repeatedly but 
 it would report the block as corrupt only when the data block scanner 
 thread of the datanode picks up this bad block.
 Requesting improvement in namenode reporting so that corrupt replica would be 
 reported when there is only 1 replica and the replication of that replica 
 keeps on failing with the checksum error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7566) Remove obsolete entries from hdfs-default.xml

2015-01-21 Thread Ray Chiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286284#comment-14286284
 ] 

Ray Chiang commented on HDFS-7566:
--

RE: findbugs

No code changes in this patch.

RE: Failing unit tests

Unit test passes in my tree.

 Remove obsolete entries from hdfs-default.xml
 -

 Key: HDFS-7566
 URL: https://issues.apache.org/jira/browse/HDFS-7566
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: supportability
 Attachments: HDFS-7566.001.patch


 So far, I've found these five properties which may be obsolete in 
 hdfs-default.xml:
 - dfs.https.enable
 - dfs.namenode.edits.journal-plugin.qjournal
 - dfs.namenode.logging.level
 - dfs.ha.namenodes.EXAMPLENAMESERVICE
   + Should this be kept in the .xml file?
 - dfs.support.append
   + Removed with HDFS-6246
 I'd like to get feedback about the state of any of the above properties.
 This is the HDFS equivalent of MAPREDUCE-6057 and YARN-2460.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7548) Corrupt block reporting delayed until datablock scanner thread detects it

2015-01-21 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286221#comment-14286221
 ] 

Kihwal Lee commented on HDFS-7548:
--

+1, the patch looks good.
The test case fails even without this patch, and the patch does not change 
anything related to it.

 Corrupt block reporting delayed until datablock scanner thread detects it
 -

 Key: HDFS-7548
 URL: https://issues.apache.org/jira/browse/HDFS-7548
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Rushabh S Shah
Assignee: Rushabh S Shah
 Attachments: HDFS-7548-v2.patch, HDFS-7548-v3.patch, 
 HDFS-7548-v4.patch, HDFS-7548-v5.patch, HDFS-7548.patch


 When there is only one datanode holding a block and that block happens to be 
 corrupt, the namenode keeps trying to replicate the block repeatedly, but the 
 block is only reported as corrupt when the data block scanner thread of the 
 datanode picks up the bad block.
 Requesting an improvement in namenode reporting so that a corrupt replica is 
 reported as soon as there is only 1 replica and the replication of that replica 
 keeps failing with a checksum error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-3689) Add support for variable length block

2015-01-21 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-3689:

Attachment: HDFS-3689.008.patch

Thanks for the comments, Colin. Using CreateFlag is a good suggestion. I have 
updated the patch to address your comments.

Currently the patch targets 3.0. We can merge it into 2.x after making sure 
this feature does not break existing applications' functionality. So far I have 
only checked {{FileInputFormat}}, and it looks like variable length blocks may 
only affect its performance but will not break its functionality.

 Add support for variable length block
 -

 Key: HDFS-3689
 URL: https://issues.apache.org/jira/browse/HDFS-3689
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, hdfs-client, namenode
Affects Versions: 3.0.0
Reporter: Suresh Srinivas
Assignee: Jing Zhao
 Attachments: HDFS-3689.000.patch, HDFS-3689.001.patch, 
 HDFS-3689.002.patch, HDFS-3689.003.patch, HDFS-3689.003.patch, 
 HDFS-3689.004.patch, HDFS-3689.005.patch, HDFS-3689.006.patch, 
 HDFS-3689.007.patch, HDFS-3689.008.patch


 Currently HDFS supports fixed length blocks. Supporting variable length block 
 will allow new use cases and features to be built on top of HDFS. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7430) Refactor the BlockScanner to use O(1) memory and use multiple threads

2015-01-21 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286495#comment-14286495
 ] 

Colin Patrick McCabe commented on HDFS-7430:


It is fair to call this a rewrite of major parts of the block scanner.

I don't think it makes sense to maintain two block scanners in parallel.  There 
would have to be a lot of glue code and extra interfaces to get both working.  
Let's let this soak in trunk for a while and then merge to branch-2 when it is 
stabilized, the same as we did with other things such as truncate.

 Refactor the BlockScanner to use O(1) memory and use multiple threads
 -

 Key: HDFS-7430
 URL: https://issues.apache.org/jira/browse/HDFS-7430
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-7430.002.patch, HDFS-7430.003.patch, 
 HDFS-7430.004.patch, HDFS-7430.005.patch, HDFS-7430.006.patch, 
 HDFS-7430.007.patch, HDFS-7430.008.patch, HDFS-7430.009.patch, 
 HDFS-7430.010.patch, HDFS-7430.011.patch, HDFS-7430.012.patch, memory.png


 We should update the BlockScanner to use a constant amount of memory by 
 keeping track of what block was scanned last, rather than by tracking the 
 scan status of all blocks in memory.  Also, instead of having just one 
 thread, we should have a verification thread per hard disk (or other volume), 
 scanning at a configurable rate of bytes per second.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7647) DatanodeManager.sortLocatedBlocks() sorts DatanodeInfos but not StorageIDs

2015-01-21 Thread Milan Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milan Desai updated HDFS-7647:
--
Summary: DatanodeManager.sortLocatedBlocks() sorts DatanodeInfos but not 
StorageIDs  (was: DatanodeManager.sortLocatedBlocks() sorts DatanodeIDs but not 
StorageIDs)

 DatanodeManager.sortLocatedBlocks() sorts DatanodeInfos but not StorageIDs
 --

 Key: HDFS-7647
 URL: https://issues.apache.org/jira/browse/HDFS-7647
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Milan Desai
Assignee: Milan Desai
 Attachments: HDFS-7647.patch


 DatanodeManager.sortLocatedBlocks() sorts the array of DatanodeIDs inside 
 each LocatedBlock, but does not touch the array of StorageIDs and 
 StorageTypes. As a result, the DatanodeIDs and StorageIDs/StorageTypes are 
 mismatched. The method is called by FSNamesystem.getBlockLocations(), so the 
 client will not know which StorageID/Type corresponds to which DatanodeID.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7647) DatanodeManager.sortLocatedBlocks() sorts DatanodeInfos but not StorageIDs

2015-01-21 Thread Milan Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milan Desai updated HDFS-7647:
--
Description: DatanodeManager.sortLocatedBlocks() sorts the array of 
DatanodeInfos inside each LocatedBlock, but does not touch the array of 
StorageIDs and StorageTypes. As a result, the DatanodeInfos and 
StorageIDs/StorageTypes are mismatched. The method is called by 
FSNamesystem.getBlockLocations(), so the client will not know which 
StorageID/Type corresponds to which DatanodeInfo.  (was: 
DatanodeManager.sortLocatedBlocks() sorts the array of DatanodeIDs inside each 
LocatedBlock, but does not touch the array of StorageIDs and StorageTypes. As a 
result, the DatanodeIDs and StorageIDs/StorageTypes are mismatched. The method 
is called by FSNamesystem.getBlockLocations(), so the client will not know 
which StorageID/Type corresponds to which DatanodeID.)

 DatanodeManager.sortLocatedBlocks() sorts DatanodeInfos but not StorageIDs
 --

 Key: HDFS-7647
 URL: https://issues.apache.org/jira/browse/HDFS-7647
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Milan Desai
Assignee: Milan Desai
 Attachments: HDFS-7647.patch


 DatanodeManager.sortLocatedBlocks() sorts the array of DatanodeInfos inside 
 each LocatedBlock, but does not touch the array of StorageIDs and 
 StorageTypes. As a result, the DatanodeInfos and StorageIDs/StorageTypes are 
 mismatched. The method is called by FSNamesystem.getBlockLocations(), so the 
 client will not know which StorageID/Type corresponds to which DatanodeInfo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832

2015-01-21 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286583#comment-14286583
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7575:
---

For the so-called practical points you made (say, "Again, if I accidentally 
duplicate a directory on a datanode, ..."), how could updating the layout 
version help?

 NameNode not handling heartbeats properly after HDFS-2832
 -

 Key: HDFS-7575
 URL: https://issues.apache.org/jira/browse/HDFS-7575
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0, 2.5.0, 2.6.0
Reporter: Lars Francke
Assignee: Arpit Agarwal
Priority: Critical
 Attachments: HDFS-7575.01.patch, HDFS-7575.02.patch, 
 HDFS-7575.03.binary.patch, HDFS-7575.03.patch, HDFS-7575.04.binary.patch, 
 HDFS-7575.04.patch, HDFS-7575.05.binary.patch, HDFS-7575.05.patch, 
 testUpgrade22via24GeneratesStorageIDs.tgz, 
 testUpgradeFrom22GeneratesStorageIDs.tgz, 
 testUpgradeFrom24PreservesStorageId.tgz


 Before HDFS-2832 each DataNode would have a unique storageId which included 
 its IP address. Since HDFS-2832 the DataNodes have a unique storageId per 
 storage directory which is just a random UUID.
 They send reports per storage directory in their heartbeats. This heartbeat 
 is processed on the NameNode in the 
 {{DatanodeDescriptor#updateHeartbeatState}} method. Pre HDFS-2832 this would 
 just store the information per Datanode. After the patch though each DataNode 
 can have multiple different storages so it's stored in a map keyed by the 
 storage Id.
 This works fine for all clusters that have been installed post HDFS-2832 as 
 they get a UUID for their storage Id. So a DN with 8 drives has a map with 8 
 different keys. On each Heartbeat the Map is searched and updated 
 ({{DatanodeStorageInfo storage = storageMap.get(s.getStorageID());}}):
 {code:title=DatanodeStorageInfo}
  void updateState(StorageReport r) {
    capacity = r.getCapacity();
    dfsUsed = r.getDfsUsed();
    remaining = r.getRemaining();
    blockPoolUsed = r.getBlockPoolUsed();
  }
 {code}
 On clusters that were upgraded from a pre-HDFS-2832 version, though, the 
 storage Id has not been rewritten (at least not on the four clusters I 
 checked), so each directory will have the exact same storageId. That means 
 there'll be only a single entry in the {{storageMap}} and it'll be 
 overwritten by a random {{StorageReport}} from the DataNode. This can be seen 
 in the {{updateState}} method above: it just assigns the capacity from the 
 received report, when it should probably sum the values up per received heartbeat.
 The Balancer seems to be one of the only things that actually uses this 
 information so it now considers the utilization of a random drive per 
 DataNode for balancing purposes.
 Things get even worse when a drive has been added or replaced, as it will 
 now get a new storage Id, so there'll be two entries in the storageMap. As new 
 drives are usually empty, this skews the balancer's decision in a way that this 
 node will never be considered over-utilized.
 Another problem is that old StorageReports are never removed from the 
 storageMap. So if I replace a drive and it gets a new storage Id the old one 
 will still be in place and used for all calculations by the Balancer until a 
 restart of the NameNode.
 I can try providing a patch that does the following:
 * Instead of using a Map I could just store the array we receive or instead 
 of storing an array sum up the values for reports with the same Id
 * On each heartbeat clear the map (so we know we have up to date information)
 Does that sound sensible?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7430) Rewrite the BlockScanner to use O(1) memory and use multiple threads

2015-01-21 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286588#comment-14286588
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7430:
---

bq. It is fair to call this a rewrite of major parts of the block scanner.

Then, could you reuse the old class and keep the old code you are using so that 
it is easier to review?

Since this is not a small patch, how about working this in a branch?

 Rewrite the BlockScanner to use O(1) memory and use multiple threads
 

 Key: HDFS-7430
 URL: https://issues.apache.org/jira/browse/HDFS-7430
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-7430.002.patch, HDFS-7430.003.patch, 
 HDFS-7430.004.patch, HDFS-7430.005.patch, HDFS-7430.006.patch, 
 HDFS-7430.007.patch, HDFS-7430.008.patch, HDFS-7430.009.patch, 
 HDFS-7430.010.patch, HDFS-7430.011.patch, HDFS-7430.012.patch, memory.png


 We should update the BlockScanner to use a constant amount of memory by 
 keeping track of what block was scanned last, rather than by tracking the 
 scan status of all blocks in memory.  Also, instead of having just one 
 thread, we should have a verification thread per hard disk (or other volume), 
 scanning at a configurable rate of bytes per second.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-3458) Convert Forrest docs to APT

2015-01-21 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer resolved HDFS-3458.

  Resolution: Fixed
   Fix Version/s: 2.0.0-alpha
Target Version/s:   (was: )

Closing as fixed, since this was done eons ago.

 Convert Forrest docs to APT
 ---

 Key: HDFS-3458
 URL: https://issues.apache.org/jira/browse/HDFS-3458
 Project: Hadoop HDFS
  Issue Type: Task
  Components: documentation
Affects Versions: 2.0.0-alpha
Reporter: Eli Collins
  Labels: newbie
 Fix For: 2.0.0-alpha


 HDFS side of HADOOP-8427. The src/main/docs/src/documentation/content/xdocs 
 contents need to be converted to APT and removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-3689) Add support for variable length block

2015-01-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286436#comment-14286436
 ] 

Hadoop QA commented on HDFS-3689:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12693724/HDFS-3689.008.patch
  against trunk revision 0742591.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9295//console

This message is automatically generated.

 Add support for variable length block
 -

 Key: HDFS-3689
 URL: https://issues.apache.org/jira/browse/HDFS-3689
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, hdfs-client, namenode
Affects Versions: 3.0.0
Reporter: Suresh Srinivas
Assignee: Jing Zhao
 Attachments: HDFS-3689.000.patch, HDFS-3689.001.patch, 
 HDFS-3689.002.patch, HDFS-3689.003.patch, HDFS-3689.003.patch, 
 HDFS-3689.004.patch, HDFS-3689.005.patch, HDFS-3689.006.patch, 
 HDFS-3689.007.patch, HDFS-3689.008.patch


 Currently HDFS supports fixed length blocks. Supporting variable length block 
 will allow new use cases and features to be built on top of HDFS. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7611) TestFileTruncate.testTruncateEditLogLoad times out waiting for Mini HDFS Cluster to start

2015-01-21 Thread Byron Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286458#comment-14286458
 ] 

Byron Wong commented on HDFS-7611:
--

Found it.

The problem occurs in how we do {{FSImage$loadEdits}}.
The gist of it looks like:
{code}
private long loadEdits(...) {
  try {
    // replay the edit log segments
    loadEdits();
  } finally {
    // then recompute the quota/namespace counts, even if replay failed
    updateCountForQuota();
  }
}
{code}

In {{TestFileTruncate$testUpgradeAndRestart()}}, notice that we do:
{code}
saveNamespace();
restart();
deleteSnapshot();
{code}

Since there are no edits to load directly after the restart, we immediately call 
{{updateCountForQuota()}}, which will set the namespace count for the root 
directory from 1 to 5. Then deleting the snapshot will decrement the count from 
5 to 2.

However, we also do a restart in 
{{TestFileTruncate$testTruncateEditLogLoad()}}. In this case, there is an edit 
to replay, namely the {{deleteSnapshot()}}. This will decrement the namespace 
count from 1 to -1, and afterwards {{updateCountForQuota()}} will set it back 
to 2.
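
The ordering can be traced in a few lines; the numbers are the ones from the two 
scenarios above, and the class is only an illustration of the arithmetic, not 
HDFS code.
{code:title=Toy trace of the two restart scenarios (illustration only)}
public class QuotaReplayTrace {
  public static void main(String[] args) {
    // testUpgradeAndRestart: no edits to replay, so updateCountForQuota() runs first.
    long count = 1;      // root namespace count right after loading the fsimage
    count = 5;           // updateCountForQuota(): 1 -> 5
    count -= 3;          // deleteSnapshot() afterwards: 5 -> 2
    System.out.println("testUpgradeAndRestart: " + count);

    // testTruncateEditLogLoad: the deleteSnapshot edit is replayed before the recount.
    long replayed = 1;   // root namespace count right after loading the fsimage
    replayed -= 2;       // replaying the deleteSnapshot edit: 1 -> -1 (goes negative)
    replayed = 2;        // updateCountForQuota() then sets the correct value, 2
    System.out.println("testTruncateEditLogLoad: " + replayed);
  }
}
{code}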

 TestFileTruncate.testTruncateEditLogLoad times out waiting for Mini HDFS 
 Cluster to start
 -

 Key: HDFS-7611
 URL: https://issues.apache.org/jira/browse/HDFS-7611
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Konstantin Shvachko
Assignee: Byron Wong
 Attachments: testTruncateEditLogLoad.log


 I've seen it failing on Jenkins a couple of times. Somehow the cluster is not 
 coming ready after the NN restart.
 Not sure if it is truncate specific, as I've seen the same behaviour with other 
 tests that restart the NameNode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7584) Enable Quota Support for Storage Types (SSD)

2015-01-21 Thread Xiaoyu Yao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoyu Yao updated HDFS-7584:
-
Attachment: HDFS-7584.0.patch

 Enable Quota Support for Storage Types (SSD) 
 -

 Key: HDFS-7584
 URL: https://issues.apache.org/jira/browse/HDFS-7584
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, namenode
Reporter: Xiaoyu Yao
Assignee: Xiaoyu Yao
 Attachments: HDFS-7584 Quota by Storage Type - 01202015.pdf, 
 HDFS-7584.0.patch


 Phase II of the Heterogeneous Storage features was completed by HDFS-6584. 
 This JIRA is opened to enable Quota support for different storage types in 
 terms of storage space usage. This is more important for certain storage 
 types such as SSD, as it is precious and more performant. 
 As described in the design doc of HDFS-5682, we plan to add a new 
 quotaByStorageType command and a new name node RPC protocol for it. The quota 
 by storage type feature is applied at the HDFS directory level, similar to the 
 traditional HDFS space quota. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7647) DatanodeManager.sortLocatedBlocks() sorts DatanodeInfos but not StorageIDs

2015-01-21 Thread Milan Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milan Desai updated HDFS-7647:
--
Status: Patch Available  (was: Open)

 DatanodeManager.sortLocatedBlocks() sorts DatanodeInfos but not StorageIDs
 --

 Key: HDFS-7647
 URL: https://issues.apache.org/jira/browse/HDFS-7647
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Milan Desai
Assignee: Milan Desai
 Attachments: HDFS-7647.patch


 DatanodeManager.sortLocatedBlocks() sorts the array of DatanodeInfos inside 
 each LocatedBlock, but does not touch the array of StorageIDs and 
 StorageTypes. As a result, the DatanodeInfos and StorageIDs/StorageTypes are 
 mismatched. The method is called by FSNamesystem.getBlockLocations(), so the 
 client will not know which StorageID/Type corresponds to which DatanodeInfo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6874) Add GET_BLOCK_LOCATIONS operation to HttpFS

2015-01-21 Thread Charles Lamb (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286613#comment-14286613
 ] 

Charles Lamb commented on HDFS-6874:


[~lianggz],

Thanks for working on this.

In general the patch looks good. I have a few minor comments.

The patch on the trunk needs to be rebased. I didn't check the branch-2 patch, 
so it may need to be rebased too.

In general, lots of lines exceed the 80 char limit.

FSOperations.java

s/private static Map  blockLocationsToJSON/private static Map 
blockLocationsToJSON/
You may want to add javadoc for the @param and @return of that method.

HttpFSFileSystem.java

getFileBlockLocations should have javadoc for the @return. In this method, the 
call to HttpFSUtils.validateResponse should probably be changed to 
HttpExceptionUtils.validateResponse().

HttpFSServer.java
s/offset,len/offset, len/
Is it correct that passing a len=0 implies Long.MAX_VALUE?

JsonUtil.java
The javadoc formatting for toBlockLocations is messed up a little.
s/IOException{/IOException {/

WebHdfsFileSystem.java
for isWebHDFSJson, s/json){/json) {/ and s/m!=null/m != null/. Also, the 
javadoc needs filling in.

Charles


 Add GET_BLOCK_LOCATIONS operation to HttpFS
 ---

 Key: HDFS-6874
 URL: https://issues.apache.org/jira/browse/HDFS-6874
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Gao Zhong Liang
Assignee: Gao Zhong Liang
 Attachments: HDFS-6874-branch-2.6.0.patch, HDFS-6874.patch


 The GET_BLOCK_LOCATIONS operation, which is already supported in WebHDFS, is 
 missing in HttpFS. For a GETFILEBLOCKLOCATIONS request in 
 org.apache.hadoop.fs.http.server.HttpFSServer, BAD_REQUEST is currently returned:
 ...
  case GETFILEBLOCKLOCATIONS: {
    response = Response.status(Response.Status.BAD_REQUEST).build();
    break;
  }
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7611) TestFileTruncate.testTruncateEditLogLoad times out waiting for Mini HDFS Cluster to start

2015-01-21 Thread Byron Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Byron Wong updated HDFS-7611:
-
Attachment: blocksNotDeletedTest.patch

Attached blocksNotDeletedTest.patch.
Includes a test that reproduces this issue.
The test fails both before and after truncate was committed.

Also, I forgot to mention before that quotas must be enabled for the directory 
to reproduce the error. Otherwise, the {{last.computeQuotaUsage()}} call in 
{{unprotectedDelete()}} will actually count the namespace objects instead of 
extracting the number from the {{DirectoryWithQuotaFeature}} object.

 TestFileTruncate.testTruncateEditLogLoad times out waiting for Mini HDFS 
 Cluster to start
 -

 Key: HDFS-7611
 URL: https://issues.apache.org/jira/browse/HDFS-7611
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Konstantin Shvachko
Assignee: Byron Wong
 Attachments: blocksNotDeletedTest.patch, testTruncateEditLogLoad.log


 I've seen it failing on Jenkins a couple of times. Somehow the cluster is not 
 coming ready after the NN restart.
 Not sure if it is truncate specific, as I've seen the same behaviour with other 
 tests that restart the NameNode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7611) deleteSnapshot and delete of a file can leave orphaned blocks in the blocksMap on NameNode restart.

2015-01-21 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-7611:
--
  Component/s: namenode
  Description: If quotas are enabled, a combination of the operations 
*deleteSnapshot* and *delete* of a file can leave orphaned blocks in the 
blocksMap on NameNode restart. They are counted as missing on the NameNode, 
can prevent the NameNode from coming out of safeMode, and could cause a memory 
leak during startup.  (was: I've seen it failing on Jenkins a couple of times. 
Somehow the cluster is not comming ready after NN restart.
Not sure if it is truncate specific, as I've seen same behaviour with other 
tests that restart the NameNode.)
 Priority: Critical  (was: Major)
 Target Version/s: 2.7.0  (was: 3.0.0)
Affects Version/s: (was: 3.0.0)
   2.6.0
  Summary: deleteSnapshot and delete of a file can leave orphaned 
blocks in the blocksMap on NameNode restart.  (was: 
TestFileTruncate.testTruncateEditLogLoad times out waiting for Mini HDFS 
Cluster to start)

Changed the description. It used to be: 
TestFileTruncate.testTruncateEditLogLoad times out waiting for Mini HDFS 
Cluster to start.

 deleteSnapshot and delete of a file can leave orphaned blocks in the 
 blocksMap on NameNode restart.
 ---

 Key: HDFS-7611
 URL: https://issues.apache.org/jira/browse/HDFS-7611
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.6.0
Reporter: Konstantin Shvachko
Assignee: Byron Wong
Priority: Critical
 Attachments: blocksNotDeletedTest.patch, testTruncateEditLogLoad.log


 If quotas are enabled, a combination of the operations *deleteSnapshot* and 
 *delete* of a file can leave orphaned blocks in the blocksMap on NameNode 
 restart. They are counted as missing on the NameNode, can prevent the 
 NameNode from coming out of safeMode, and could cause a memory leak during 
 startup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-3519) Checkpoint upload may interfere with a concurrent saveNamespace

2015-01-21 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-3519:

Target Version/s: 2.7.0

 Checkpoint upload may interfere with a concurrent saveNamespace
 ---

 Key: HDFS-3519
 URL: https://issues.apache.org/jira/browse/HDFS-3519
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Todd Lipcon
Assignee: Ming Ma
Priority: Critical
 Attachments: HDFS-3519-2.patch, HDFS-3519-3.patch, HDFS-3519.patch, 
 test-output.txt


 TestStandbyCheckpoints failed in [precommit build 
 2620|https://builds.apache.org/job/PreCommit-HDFS-Build/2620//testReport/] 
 due to the following issue:
 - both nodes were in Standby state, and configured to checkpoint as fast as 
 possible
 - NN1 starts to save its own namespace
 - NN2 starts to upload a checkpoint for the same txid. So, both threads are 
 writing to the same file fsimage.ckpt_12, but the actual file contents 
 correspond to the uploading thread's data.
 - NN1 finished its saveNamespace operation while NN2 was still uploading. So, 
 it renamed the ckpt file. However, the contents of the file are still empty 
 since NN2 hasn't sent any bytes
 - NN2 finishes the upload, and the rename() call fails, which causes the 
 directory to be marked failed, etc.
 The result is that there is a file fsimage_12 which appears to be a finalized 
 image but in fact is incompletely transferred. When the transfer completes, 
 the problem heals itself so there wouldn't be persistent corruption unless 
 the machine crashes at the same time. And even then, we'd still have the 
 earlier checkpoint to restore from.
 This same race could occur in a non-HA setup if a user puts the NN in safe 
 mode and issues saveNamespace operations concurrent with a 2NN checkpointing, 
 I believe.
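 A generic way to avoid two writers colliding on one checkpoint file is to write 
 to a uniquely named temporary file and only rename it into place once it is 
 complete. The sketch below shows that pattern in plain java.nio; it is an 
 illustration of the idea, not the HDFS-3519 patch.
 {code:title=Generic write-then-rename sketch (not the HDFS-3519 patch)}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicPublish {
  /**
   * Write the image bytes to a uniquely named temporary file in the target
   * directory, then atomically rename it to its final name. Because each writer
   * uses its own temporary name, nobody can rename a half-written file that
   * belongs to someone else.
   */
  static void publish(Path finalImage, byte[] contents) throws IOException {
    Path tmp = Files.createTempFile(finalImage.getParent(), "fsimage.ckpt_", ".tmp");
    Files.write(tmp, contents);
    Files.move(tmp, finalImage, StandardCopyOption.ATOMIC_MOVE);
  }
}
 {code}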



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-3519) Checkpoint upload may interfere with a concurrent saveNamespace

2015-01-21 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-3519:

Hadoop Flags: Reviewed

+1 for the patch.  Ming, thank you for incorporating the feedback.  Could you 
also please provide a branch-2 patch?  There is a small difference in 
{{FSImage}} on branch-2 that prevents me from applying the trunk patch.

 Checkpoint upload may interfere with a concurrent saveNamespace
 ---

 Key: HDFS-3519
 URL: https://issues.apache.org/jira/browse/HDFS-3519
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Todd Lipcon
Assignee: Ming Ma
Priority: Critical
 Attachments: HDFS-3519-2.patch, HDFS-3519-3.patch, HDFS-3519.patch, 
 test-output.txt


 TestStandbyCheckpoints failed in [precommit build 
 2620|https://builds.apache.org/job/PreCommit-HDFS-Build/2620//testReport/] 
 due to the following issue:
 - both nodes were in Standby state, and configured to checkpoint as fast as 
 possible
 - NN1 starts to save its own namespace
 - NN2 starts to upload a checkpoint for the same txid. So, both threads are 
 writing to the same file fsimage.ckpt_12, but the actual file contents 
 correspond to the uploading thread's data.
 - NN1 finished its saveNamespace operation while NN2 was still uploading. So, 
 it renamed the ckpt file. However, the contents of the file are still empty 
 since NN2 hasn't sent any bytes
 - NN2 finishes the upload, and the rename() call fails, which causes the 
 directory to be marked failed, etc.
 The result is that there is a file fsimage_12 which appears to be a finalized 
 image but in fact is incompletely transferred. When the transfer completes, 
 the problem heals itself so there wouldn't be persistent corruption unless 
 the machine crashes at the same time. And even then, we'd still have the 
 earlier checkpoint to restore from.
 This same race could occur in a non-HA setup if a user puts the NN in safe 
 mode and issues saveNamespace operations concurrent with a 2NN checkpointing, 
 I believe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7430) Rewrite the BlockScanner to use O(1) memory and use multiple threads

2015-01-21 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286617#comment-14286617
 ] 

Colin Patrick McCabe commented on HDFS-7430:


bq. Then, could you reuse the old class and keep the old code you are using so 
that it is easier to review?

Maybe I was not clear.  You said that this was a rewrite.  I agreed that it was 
a rewrite.  Since it is a rewrite, there is no reason to keep the old code.

bq. Since this is not a small patch, how about working this in a branch?

I don't see a reason to do this in a branch.  The patch is already done and has 
a +1.  It is a big change, to be sure, but it is not adding a new feature.  It 
is just changing some existing code.  There is no follow-up work to do, aside 
from fixing any issues we find in trunk in the next few days prior to the merge 
to branch-2.

 Rewrite the BlockScanner to use O(1) memory and use multiple threads
 

 Key: HDFS-7430
 URL: https://issues.apache.org/jira/browse/HDFS-7430
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-7430.002.patch, HDFS-7430.003.patch, 
 HDFS-7430.004.patch, HDFS-7430.005.patch, HDFS-7430.006.patch, 
 HDFS-7430.007.patch, HDFS-7430.008.patch, HDFS-7430.009.patch, 
 HDFS-7430.010.patch, HDFS-7430.011.patch, HDFS-7430.012.patch, memory.png


 We should update the BlockScanner to use a constant amount of memory by 
 keeping track of what block was scanned last, rather than by tracking the 
 scan status of all blocks in memory.  Also, instead of having just one 
 thread, we should have a verification thread per hard disk (or other volume), 
 scanning at a configurable rate of bytes per second.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7611) TestFileTruncate.testTruncateEditLogLoad times out waiting for Mini HDFS Cluster to start

2015-01-21 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286697#comment-14286697
 ] 

Konstantin Shvachko commented on HDFS-7611:
---

Byron, good job investigating and reproducing the bug.
Sounds like a serious problem. I also confirmed it on branch-2 with your test.

_So if quotas are enabled, a combination of the operations *deleteSnapshot* and 
*delete* of a file can leave orphaned blocks in the blocksMap on NameNode 
restart. They are counted as missing on the NameNode, and can prevent the 
NameNode from coming out of safeMode and could cause a memory leak (at least 
during startup)._

I'll rename the jira and unlink from HDFS-3107 as it is not related to truncate.

 TestFileTruncate.testTruncateEditLogLoad times out waiting for Mini HDFS 
 Cluster to start
 -

 Key: HDFS-7611
 URL: https://issues.apache.org/jira/browse/HDFS-7611
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Konstantin Shvachko
Assignee: Byron Wong
 Attachments: blocksNotDeletedTest.patch, testTruncateEditLogLoad.log


 I've seen it failing on Jenkins a couple of times. Somehow the cluster is not 
 coming ready after the NN restart.
 Not sure if it is truncate specific, as I've seen the same behaviour with other 
 tests that restart the NameNode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HDFS-147) Stack trace on spaceQuota excced .

2015-01-21 Thread Xiaoyu Yao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoyu Yao reassigned HDFS-147:
---

Assignee: Xiaoyu Yao

 Stack trace on spaceQuota excced .
 --

 Key: HDFS-147
 URL: https://issues.apache.org/jira/browse/HDFS-147
 Project: Hadoop HDFS
  Issue Type: Bug
 Environment: All
Reporter: Ravi Phulari
Assignee: Xiaoyu Yao
  Labels: newbie

 Currently the disk space quota exceeded exception spits out a stack trace. It 
 would be better to show an error message instead of a stack trace.
 {code}
 somehost:Hadoop guesti$ bin/hdfs dfsadmin -setSpaceQuota 2 2344
 somehost:Hadoop guest$ bin/hadoop fs -put conf 2344
 09/06/19 16:44:30 WARN hdfs.DFSClient: DataStreamer Exception: 
 org.apache.hadoop.hdfs.protocol.QuotaExceededException: 
 org.apache.hadoop.hdfs.protocol.QuotaExceededException: The quota of 
 /user/guest/2344 is exceeded: namespace quota=-1 file count=4, diskspace 
 quota=2 diskspace=67108864
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 ..
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-147) Stack trace on spaceQuota excced .

2015-01-21 Thread Xiaoyu Yao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286711#comment-14286711
 ] 

Xiaoyu Yao commented on HDFS-147:
-

I just checked, and it can still be reproduced with the space quota but not 
with the namespace quota. I will take a look at it. 

 Stack trace on spaceQuota excced .
 --

 Key: HDFS-147
 URL: https://issues.apache.org/jira/browse/HDFS-147
 Project: Hadoop HDFS
  Issue Type: Bug
 Environment: All
Reporter: Ravi Phulari
  Labels: newbie

 Currently the disk space quota exceeded exception spits out a stack trace. It 
 would be better to show an error message instead of a stack trace.
 {code}
 somehost:Hadoop guesti$ bin/hdfs dfsadmin -setSpaceQuota 2 2344
 somehost:Hadoop guest$ bin/hadoop fs -put conf 2344
 09/06/19 16:44:30 WARN hdfs.DFSClient: DataStreamer Exception: 
 org.apache.hadoop.hdfs.protocol.QuotaExceededException: 
 org.apache.hadoop.hdfs.protocol.QuotaExceededException: The quota of 
 /user/guest/2344 is exceeded: namespace quota=-1 file count=4, diskspace 
 quota=2 diskspace=67108864
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 ..
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-3689) Add support for variable length block

2015-01-21 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-3689:

Attachment: HDFS-3689.008.patch

 Add support for variable length block
 -

 Key: HDFS-3689
 URL: https://issues.apache.org/jira/browse/HDFS-3689
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, hdfs-client, namenode
Affects Versions: 3.0.0
Reporter: Suresh Srinivas
Assignee: Jing Zhao
 Attachments: HDFS-3689.000.patch, HDFS-3689.001.patch, 
 HDFS-3689.002.patch, HDFS-3689.003.patch, HDFS-3689.003.patch, 
 HDFS-3689.004.patch, HDFS-3689.005.patch, HDFS-3689.006.patch, 
 HDFS-3689.007.patch, HDFS-3689.008.patch, HDFS-3689.008.patch


 Currently HDFS supports fixed length blocks. Supporting variable length block 
 will allow new use cases and features to be built on top of HDFS. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7339) Allocating and persisting block groups in NameNode

2015-01-21 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HDFS-7339:

Attachment: HDFS-7339-006.patch

Addressing Vinay's comments.

 Allocating and persisting block groups in NameNode
 --

 Key: HDFS-7339
 URL: https://issues.apache.org/jira/browse/HDFS-7339
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Zhe Zhang
Assignee: Zhe Zhang
 Attachments: HDFS-7339-001.patch, HDFS-7339-002.patch, 
 HDFS-7339-003.patch, HDFS-7339-004.patch, HDFS-7339-005.patch, 
 HDFS-7339-006.patch, Meta-striping.jpg, NN-stripping.jpg


 All erasure codec operations center around the concept of _block group_; they 
 are formed in initial encoding and looked up in recoveries and conversions. A 
 lightweight class {{BlockGroup}} is created to record the original and parity 
 blocks in a coding group, as well as a pointer to the codec schema (pluggable 
 codec schemas will be supported in HDFS-7337). With the striping layout, the 
 HDFS client needs to operate on all blocks in a {{BlockGroup}} concurrently. 
 Therefore we propose to extend a file’s inode to switch between _contiguous_ 
 and _striping_ modes, with the current mode recorded in a binary flag. An 
 array of BlockGroups (or BlockGroup IDs) is added, which remains empty for 
 “traditional” HDFS files with contiguous block layout.
 The NameNode creates and maintains {{BlockGroup}} instances through the new 
 {{ECManager}} component; the attached figure has an illustration of the 
 architecture. As a simple example, when a {_Striping+EC_} file is created and 
 written to, it will serve requests from the client to allocate new 
 {{BlockGroups}} and store them under the {{INodeFile}}. In the current phase, 
 {{BlockGroups}} are allocated both in initial online encoding and in the 
 conversion from replication to EC. {{ECManager}} also facilitates the lookup 
 of {{BlockGroup}} information for block recovery work.
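 A minimal sketch of the {{BlockGroup}} record described above; the field names 
 are assumptions for illustration, and the real definition is in the attached 
 patches.
 {code:title=Minimal sketch only (field names assumed)}
// Lightweight record of the original (data) and parity blocks in one coding
// group, plus a reference to the codec schema (see HDFS-7337 for pluggable schemas).
class BlockGroupSketch {
  final long groupId;          // identifier stored under the INodeFile
  final long[] dataBlockIds;   // the original blocks of the stripe
  final long[] parityBlockIds; // the parity blocks produced by the codec
  final String codecSchema;    // reference to the codec schema in use

  BlockGroupSketch(long groupId, long[] dataBlockIds, long[] parityBlockIds,
      String codecSchema) {
    this.groupId = groupId;
    this.dataBlockIds = dataBlockIds;
    this.parityBlockIds = parityBlockIds;
    this.codecSchema = codecSchema;
  }
}
 {code}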



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7339) Allocating and persisting block groups in NameNode

2015-01-21 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286473#comment-14286473
 ] 

Zhe Zhang commented on HDFS-7339:
-

Thanks for the review, [~vinayrpet], it was very helpful. The updated patch 
addresses most of the comments. Please find some additional notes below:

bq.  I believe numBytes will be set at client side and later committed to NN
That's right; {{numBytes}} will be updated when the client patch (HDFS-7545) 
comes in.

bq. But genstamp should be generated initially at the namenode side itself.
Thanks for pointing it out. The new patch uses {{genStamp}} from 
{{BlockIdManager}} instead of creating a new {{GenerationStamp}} pool for block 
groups. Thoughts?

bq. BlockGroupManager#chooseNewGroupTargets rejects the allocation if min 
number of nodes selected is less than groupSize. Is there anyway to continue 
with less number of nodes?
That's a very good point. This is the main change in this new patch; basically 
it uses a {{minStripeSize}} variable to mimic {{BlockManager#minReplication}} 
(see the sketch at the end of this comment).

bq. Where the below mentioned method of identifying blockGroup from blockId 
will be used?
It will be used to locate the block group when a block report comes in. I 
created HDFS-7652 as a placeholder to implement this logic (after the client 
patch is in).
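
Regarding the {{minStripeSize}} point, the relaxed check can be pictured roughly 
as follows; the method and parameter names are assumptions, not the patch code.
{code:title=Illustrative allocation check (names assumed)}
public class GroupTargetCheck {
  // Reject the allocation only when fewer than minStripeSize targets were found,
  // instead of requiring a full groupSize (mirroring BlockManager#minReplication).
  static void checkTargets(int chosen, int groupSize, int minStripeSize)
      throws java.io.IOException {
    if (chosen < minStripeSize) {
      throw new java.io.IOException("Cannot allocate block group: only " + chosen
          + " target(s) found, need at least " + minStripeSize + " of " + groupSize);
    }
  }
}
{code}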

 Allocating and persisting block groups in NameNode
 --

 Key: HDFS-7339
 URL: https://issues.apache.org/jira/browse/HDFS-7339
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Zhe Zhang
Assignee: Zhe Zhang
 Attachments: HDFS-7339-001.patch, HDFS-7339-002.patch, 
 HDFS-7339-003.patch, HDFS-7339-004.patch, HDFS-7339-005.patch, 
 HDFS-7339-006.patch, Meta-striping.jpg, NN-stripping.jpg


 All erasure codec operations center around the concept of _block group_; they 
 are formed in initial encoding and looked up in recoveries and conversions. A 
 lightweight class {{BlockGroup}} is created to record the original and parity 
 blocks in a coding group, as well as a pointer to the codec schema (pluggable 
 codec schemas will be supported in HDFS-7337). With the striping layout, the 
 HDFS client needs to operate on all blocks in a {{BlockGroup}} concurrently. 
 Therefore we propose to extend a file’s inode to switch between _contiguous_ 
 and _striping_ modes, with the current mode recorded in a binary flag. An 
 array of BlockGroups (or BlockGroup IDs) is added, which remains empty for 
 “traditional” HDFS files with contiguous block layout.
 The NameNode creates and maintains {{BlockGroup}} instances through the new 
 {{ECManager}} component; the attached figure has an illustration of the 
 architecture. As a simple example, when a {_Striping+EC_} file is created and 
 written to, it will serve requests from the client to allocate new 
 {{BlockGroups}} and store them under the {{INodeFile}}. In the current phase, 
 {{BlockGroups}} are allocated both in initial online encoding and in the 
 conversion from replication to EC. {{ECManager}} also facilitates the lookup 
 of {{BlockGroup}} information for block recovery work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7430) Refactor the BlockScanner to use O(1) memory and use multiple threads

2015-01-21 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-7430:
---
Attachment: HDFS-7430.012.patch

 Refactor the BlockScanner to use O(1) memory and use multiple threads
 -

 Key: HDFS-7430
 URL: https://issues.apache.org/jira/browse/HDFS-7430
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-7430.002.patch, HDFS-7430.003.patch, 
 HDFS-7430.004.patch, HDFS-7430.005.patch, HDFS-7430.006.patch, 
 HDFS-7430.007.patch, HDFS-7430.008.patch, HDFS-7430.009.patch, 
 HDFS-7430.010.patch, HDFS-7430.011.patch, HDFS-7430.012.patch, memory.png


 We should update the BlockScanner to use a constant amount of memory by 
 keeping track of what block was scanned last, rather than by tracking the 
 scan status of all blocks in memory.  Also, instead of having just one 
 thread, we should have a verification thread per hard disk (or other volume), 
 scanning at a configurable rate of bytes per second.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7584) Enable Quota Support for Storage Types (SSD)

2015-01-21 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286520#comment-14286520
 ] 

Zhe Zhang commented on HDFS-7584:
-

This is a very useful feature and thanks [~xyao] for initiating the work. A few 
comments:
# bq. For example, when SSD is not available but Quota of SSD is available, the 
write will fallback to DISK and the storage usage deducted from both SSD and 
the cumulative disk space quota of the directory even no SSD space is being 
consumed. 
This is an interesting scenario and is worth more discussion. It is a 
conservative and safe policy to deduct from both SSD and DISK quotas. However, 
it doesn't fully comply with the principle of _quota based on intended usage_, 
which might make it appear counter-intuitive to users (e.g. "why am I double 
charged?"). As an extreme example, what if the user doesn't have any DISK quota?
# How about calculating quota truly based on intended usage? The charged quota 
might be different from the usage, but so is the case with the existing quota 
logic. What are the other disadvantages?
# If we do want to charge by actual usage (5.2), maybe we should allow 
different quota "currencies" to be exchanged? Something like 1GB of SSD = 2GB 
of DISK = 4GB of ARCHIVAL. Or at least allow a user with _only_ 1GB of SSD quota 
to use 1GB of DISK space (a rough sketch of this idea follows below).
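
The exchange-rate idea in the last point could look roughly like the following; 
the rates and all names are made up for illustration.
{code:title=Illustrative quota conversion (rates and names are assumptions)}
import java.util.EnumMap;
import java.util.Map;

// Toy model of charging quota by an exchange rate across storage types,
// e.g. 1 GB of SSD counts as 2 GB of DISK or 4 GB of ARCHIVE.
class QuotaExchangeModel {
  enum StorageType { SSD, DISK, ARCHIVE }

  private static final Map<StorageType, Long> WEIGHT = new EnumMap<>(StorageType.class);
  static {
    WEIGHT.put(StorageType.SSD, 4L);      // most expensive per byte
    WEIGHT.put(StorageType.DISK, 2L);
    WEIGHT.put(StorageType.ARCHIVE, 1L);  // cheapest per byte
  }

  /** Convert bytes consumed on one storage type into the equivalent charge on another. */
  static long convert(long bytes, StorageType from, StorageType to) {
    return bytes * WEIGHT.get(from) / WEIGHT.get(to);
  }
}
{code}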

 Enable Quota Support for Storage Types (SSD) 
 -

 Key: HDFS-7584
 URL: https://issues.apache.org/jira/browse/HDFS-7584
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, namenode
Reporter: Xiaoyu Yao
Assignee: Xiaoyu Yao
 Attachments: HDFS-7584 Quota by Storage Type - 01202015.pdf


 Phase II of the Heterogeneous Storage features was completed by HDFS-6584. 
 This JIRA is opened to enable Quota support for different storage types in 
 terms of storage space usage. This is more important for certain storage 
 types such as SSD, as it is precious and more performant. 
 As described in the design doc of HDFS-5682, we plan to add a new 
 quotaByStorageType command and a new name node RPC protocol for it. The quota 
 by storage type feature is applied at the HDFS directory level, similar to the 
 traditional HDFS space quota. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7647) DatanodeManager.sortLocatedBlocks() sorts DatanodeIDs but not StorageIDs

2015-01-21 Thread Milan Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milan Desai updated HDFS-7647:
--
Attachment: HDFS-7647.patch

Thanks [~shv] and [~arpitagarwal]. I was unable to use the index-arrays method, 
as that would require changing the way networktopology.sortByDistance works. In 
the attached patch I create a map to associate the DatanodeInfos with 
StorageIDs/Types and use the map after the sorting.
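
Roughly, that approach reads like the self-contained sketch below; the types are 
simplified stand-ins, not the actual patch.
{code:title=Sketch of the map-based re-association (simplified, not the patch)}
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Before sorting the datanode array, remember which storage ID and type belong to
// which node, then rebuild the parallel arrays so index i still describes the same
// node in all three arrays.
class LocatedBlockSortModel {
  static void sortKeepingStoragesAligned(String[] nodes, String[] storageIds,
      String[] storageTypes) {
    Map<String, String> idByNode = new HashMap<>();
    Map<String, String> typeByNode = new HashMap<>();
    for (int i = 0; i < nodes.length; i++) {
      idByNode.put(nodes[i], storageIds[i]);
      typeByNode.put(nodes[i], storageTypes[i]);
    }
    Arrays.sort(nodes);                       // stands in for sortByDistance()
    for (int i = 0; i < nodes.length; i++) {  // re-derive the parallel arrays
      storageIds[i] = idByNode.get(nodes[i]);
      storageTypes[i] = typeByNode.get(nodes[i]);
    }
  }
}
{code}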

 DatanodeManager.sortLocatedBlocks() sorts DatanodeIDs but not StorageIDs
 

 Key: HDFS-7647
 URL: https://issues.apache.org/jira/browse/HDFS-7647
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Milan Desai
Assignee: Milan Desai
 Attachments: HDFS-7647.patch


 DatanodeManager.sortLocatedBlocks() sorts the array of DatanodeIDs inside 
 each LocatedBlock, but does not touch the array of StorageIDs and 
 StorageTypes. As a result, the DatanodeIDs and StorageIDs/StorageTypes are 
 mismatched. The method is called by FSNamesystem.getBlockLocations(), so the 
 client will not know which StorageID/Type corresponds to which DatanodeID.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7647) DatanodeManager.sortLocatedBlocks() sorts DatanodeInfos but not StorageIDs

2015-01-21 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286567#comment-14286567
 ] 

Arpit Agarwal commented on HDFS-7647:
-

Hi Milan, while this solution would work, I'd prefer the other approach you 
suggested. We should fix LocatedBlock to group the three fields in a new class. 
Existing code (except sortLocatedBlocks) can be shielded from the change by 
keeping the public interface the same. What do you think?
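
For reference, the grouping could be sketched as below; the class and field 
names are assumptions, not the eventual API.
{code:title=Illustrative grouping only (names assumed)}
// One object per replica location instead of three parallel arrays, so sorting a
// single list can never leave the fields mismatched.
class ReplicaLocationSketch {
  final String datanodeInfo;  // stand-in for DatanodeInfo
  final String storageId;
  final String storageType;   // stand-in for StorageType

  ReplicaLocationSketch(String datanodeInfo, String storageId, String storageType) {
    this.datanodeInfo = datanodeInfo;
    this.storageId = storageId;
    this.storageType = storageType;
  }
}
{code}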

 DatanodeManager.sortLocatedBlocks() sorts DatanodeInfos but not StorageIDs
 --

 Key: HDFS-7647
 URL: https://issues.apache.org/jira/browse/HDFS-7647
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Milan Desai
Assignee: Milan Desai
 Attachments: HDFS-7647.patch


 DatanodeManager.sortLocatedBlocks() sorts the array of DatanodeInfos inside 
 each LocatedBlock, but does not touch the array of StorageIDs and 
 StorageTypes. As a result, the DatanodeInfos and StorageIDs/StorageTypes are 
 mismatched. The method is called by FSNamesystem.getBlockLocations(), so the 
 client will not know which StorageID/Type corresponds to which DatanodeInfo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7430) Rewrite the BlockScanner to use O(1) memory and use multiple threads

2015-01-21 Thread Tsz Wo Nicholas Sze (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated HDFS-7430:
--
Summary: Rewrite the BlockScanner to use O(1) memory and use multiple 
threads  (was: Refactor the BlockScanner to use O(1) memory and use multiple 
threads)

 Rewrite the BlockScanner to use O(1) memory and use multiple threads
 

 Key: HDFS-7430
 URL: https://issues.apache.org/jira/browse/HDFS-7430
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-7430.002.patch, HDFS-7430.003.patch, 
 HDFS-7430.004.patch, HDFS-7430.005.patch, HDFS-7430.006.patch, 
 HDFS-7430.007.patch, HDFS-7430.008.patch, HDFS-7430.009.patch, 
 HDFS-7430.010.patch, HDFS-7430.011.patch, HDFS-7430.012.patch, memory.png


 We should update the BlockScanner to use a constant amount of memory by 
 keeping track of what block was scanned last, rather than by tracking the 
 scan status of all blocks in memory.  Also, instead of having just one 
 thread, we should have a verification thread per hard disk (or other volume), 
 scanning at a configurable rate of bytes per second.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832

2015-01-21 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286527#comment-14286527
 ] 

Colin Patrick McCabe commented on HDFS-7575:


bq. So, you claim that the current format is a layout, where some storage IDs 
could be the same?... It is clearly specified in the LV -49 that the IDs must 
be distinct.

What's important is what was implemented, not what was written in the comment 
about the layout version.  And what was implemented does allow duplicate 
storage IDs.

bq. Is \[two volumes that map to the same directory\] a practical error? Have 
you seen it in practice?

Yes.  Recently we had a cluster with two datanodes connected to the same shared 
storage accidentally.  I guess you could argue that lock files should prevent 
problems here.  However, I do not like the idea of datanodes modifying VERSION 
on startup at all.  If one of the DNs had terminated before the other one tried 
to lock the directory, it would have succeeded.  And with the retry failed 
volume stuff, we probably have a wide window for this to happen.

bq. We only change a storage ID when it is invalid but not changing the storage 
ID arbitrarily. Valid storage IDs are permanent.

Again, if I accidentally duplicate a directory on a datanode, then the storage 
ID morphs for one of the directories.  That doesn't sound permanent to me.

bq. The code is for validating storage IDs (but not for de-duplication) and is 
very simple. It is good to keep.

I agree that it is good to validate the storage IDs are unique.  But this is 
the same as when we validate that the cluster ID is correct, or the layout 
version is correct.  We don't change incorrect values to fix them.  If 
they're wrong then we need to find out why, not sweep the problem under the rug.

Are there any practical arguments in favor of not doing a layout version 
change?  The main argument I see in favor of not changing the layout here is 
basically that this isn't a big enough change to merit a new LV.  But that 
seems irrelevant to me: the question is which approach is better for error 
handling and more maintainable.

 NameNode not handling heartbeats properly after HDFS-2832
 -

 Key: HDFS-7575
 URL: https://issues.apache.org/jira/browse/HDFS-7575
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0, 2.5.0, 2.6.0
Reporter: Lars Francke
Assignee: Arpit Agarwal
Priority: Critical
 Attachments: HDFS-7575.01.patch, HDFS-7575.02.patch, 
 HDFS-7575.03.binary.patch, HDFS-7575.03.patch, HDFS-7575.04.binary.patch, 
 HDFS-7575.04.patch, HDFS-7575.05.binary.patch, HDFS-7575.05.patch, 
 testUpgrade22via24GeneratesStorageIDs.tgz, 
 testUpgradeFrom22GeneratesStorageIDs.tgz, 
 testUpgradeFrom24PreservesStorageId.tgz


 Before HDFS-2832 each DataNode would have a unique storageId which included 
 its IP address. Since HDFS-2832 the DataNodes have a unique storageId per 
 storage directory which is just a random UUID.
 They send reports per storage directory in their heartbeats. This heartbeat 
 is processed on the NameNode in the 
 {{DatanodeDescriptor#updateHeartbeatState}} method. Pre HDFS-2832 this would 
 just store the information per Datanode. After the patch though each DataNode 
 can have multiple different storages so it's stored in a map keyed by the 
 storage Id.
 This works fine for all clusters that have been installed post HDFS-2832 as 
 they get a UUID for their storage Id. So a DN with 8 drives has a map with 8 
 different keys. On each Heartbeat the Map is searched and updated 
 ({{DatanodeStorageInfo storage = storageMap.get(s.getStorageID());}}):
 {code:title=DatanodeStorageInfo}
  void updateState(StorageReport r) {
    capacity = r.getCapacity();
    dfsUsed = r.getDfsUsed();
    remaining = r.getRemaining();
    blockPoolUsed = r.getBlockPoolUsed();
  }
 {code}
 On clusters that were upgraded from a pre-HDFS-2832 version, though, the 
 storage Id has not been rewritten (at least not on the four clusters I 
 checked), so each directory will have the exact same storageId. That means 
 there'll be only a single entry in the {{storageMap}} and it'll be 
 overwritten by a random {{StorageReport}} from the DataNode. This can be seen 
 in the {{updateState}} method above: it just assigns the capacity from the 
 received report, when it should probably sum the values up per received heartbeat.
 The Balancer seems to be one of the only things that actually uses this 
 information so it now considers the utilization of a random drive per 
 DataNode for balancing purposes.
 Things get even worse when a drive has been added or replaced, as it will 
 now get a new storage Id, so there'll be two entries in the storageMap. As new 
 drives are usually empty, this skews the balancer's decision 

[jira] [Commented] (HDFS-7645) Rolling upgrade is restoring blocks from trash multiple times

2015-01-21 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286576#comment-14286576
 ] 

Colin Patrick McCabe commented on HDFS-7645:


bq. DN layout changes will be rare for minor/point releases. I am wary of 
eliminating trash without some numbers showing hard link performance with 
millions of blocks is on par with trash. Even a few seconds per DN adds up to 
many hours/days when upgrading thousands of DNs sequentially. Once we fix this 
issue raised by Nathan the overhead of trash as compared to regular startup is 
nil.

Yeah.  Startup time during an upgrade is important.  Our numbers for creating 
the previous directory in HDFS-6482 were about 1 second per 100,000 blocks.  
We also parallelized the hard link process across all volumes.  So I would 
expect it to be very quick for the average DN, which has about 200k-400k blocks 
split across 10 storage directories.

Anyway, I don't feel strongly about this... if we can make trash work, then 
so be it.  It sounds like the fix is not that difficult.

 Rolling upgrade is restoring blocks from trash multiple times
 -

 Key: HDFS-7645
 URL: https://issues.apache.org/jira/browse/HDFS-7645
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Nathan Roberts

 When performing an HDFS rolling upgrade, the trash directory is getting 
 restored twice when under normal circumstances it shouldn't need to be 
 restored at all. If I understand correctly, the only time these blocks should 
 be restored is if we need to roll back a rolling upgrade. 
 On a busy cluster, this can cause significant and unnecessary block churn 
 both on the datanodes, and more importantly in the namenode.
 The two times this happens are:
 1) restart of DN onto new software
 {code}
   private void doTransition(DataNode datanode, StorageDirectory sd,
       NamespaceInfo nsInfo, StartupOption startOpt) throws IOException {
     if (startOpt == StartupOption.ROLLBACK && sd.getPreviousDir().exists()) {
       Preconditions.checkState(!getTrashRootDir(sd).exists(),
           sd.getPreviousDir() + " and " + getTrashRootDir(sd) + " should not " +
           "both be present.");
       doRollback(sd, nsInfo); // rollback if applicable
     } else {
       // Restore all the files in the trash. The restored files are retained
       // during rolling upgrade rollback. They are deleted during rolling
       // upgrade downgrade.
       int restored = restoreBlockFilesFromTrash(getTrashRootDir(sd));
       LOG.info("Restored " + restored + " block files from trash.");
     }
 {code}
 2) When the heartbeat response no longer indicates a rolling upgrade is in progress
 {code}
   /**
    * Signal the current rolling upgrade status as indicated by the NN.
    * @param inProgress true if a rolling upgrade is in progress
    */
   void signalRollingUpgrade(boolean inProgress) throws IOException {
     String bpid = getBlockPoolId();
     if (inProgress) {
       dn.getFSDataset().enableTrash(bpid);
       dn.getFSDataset().setRollingUpgradeMarker(bpid);
     } else {
       dn.getFSDataset().restoreTrash(bpid);
       dn.getFSDataset().clearRollingUpgradeMarker(bpid);
     }
   }
 {code}
 HDFS-6800 and HDFS-6981 modified this behavior, making it not completely 
 clear whether this is somehow intentional. 
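 
 As an illustration only, here is one shape such a guard could take, assuming a 
 hypothetical per-block-pool flag; this is not the fix being discussed in this 
 JIRA, it just shows how the two call sites above could share a single restore:
 {code}
 // Illustrative only: a hypothetical guard that runs the trash restore at most
 // once per block pool, so the startup path and the heartbeat path cannot both
 // trigger it. Not the actual HDFS-7645 change.
 import java.util.Set;
 import java.util.concurrent.ConcurrentHashMap;

 public class TrashRestoreGuard {
   private final Set<String> restoredBlockPools = ConcurrentHashMap.newKeySet();

   /** Runs the restore at most once per block pool for a given DN lifetime. */
   public void restoreOnce(String bpid, Runnable restoreTrash) {
     if (restoredBlockPools.add(bpid)) {
       restoreTrash.run();   // first caller (doTransition or heartbeat path) wins
     }
     // later callers see the flag and skip the redundant restore
   }
 }
 {code}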



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832

2015-01-21 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286574#comment-14286574
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7575:
---

 What's important is what was implemented, not what was written in the comment 
 about the layout version. And what was implemented does allow duplicate 
 storage IDs.

I disagree.  The implementation is a bug: it is supposed to change the old IDs 
(in the old ID format) to use the new UUID format.  The entire heterogeneous 
storage design requires storage IDs to be unique.  Which implementation works 
correctly with duplicate storage IDs?  


 NameNode not handling heartbeats properly after HDFS-2832
 -

 Key: HDFS-7575
 URL: https://issues.apache.org/jira/browse/HDFS-7575
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0, 2.5.0, 2.6.0
Reporter: Lars Francke
Assignee: Arpit Agarwal
Priority: Critical
 Attachments: HDFS-7575.01.patch, HDFS-7575.02.patch, 
 HDFS-7575.03.binary.patch, HDFS-7575.03.patch, HDFS-7575.04.binary.patch, 
 HDFS-7575.04.patch, HDFS-7575.05.binary.patch, HDFS-7575.05.patch, 
 testUpgrade22via24GeneratesStorageIDs.tgz, 
 testUpgradeFrom22GeneratesStorageIDs.tgz, 
 testUpgradeFrom24PreservesStorageId.tgz


 Before HDFS-2832 each DataNode would have a unique storageId which included 
 its IP address. Since HDFS-2832 the DataNodes have a unique storageId per 
 storage directory which is just a random UUID.
 They send reports per storage directory in their heartbeats. This heartbeat 
 is processed on the NameNode in the 
 {{DatanodeDescriptor#updateHeartbeatState}} method. Pre HDFS-2832 this would 
 just store the information per Datanode. After the patch though each DataNode 
 can have multiple different storages so it's stored in a map keyed by the 
 storage Id.
 This works fine for all clusters that have been installed post HDFS-2832 as 
 they get a UUID for their storage Id. So a DN with 8 drives has a map with 8 
 different keys. On each Heartbeat the Map is searched and updated 
 ({{DatanodeStorageInfo storage = storageMap.get(s.getStorageID());}}):
 {code:title=DatanodeStorageInfo}
   void updateState(StorageReport r) {
     capacity = r.getCapacity();
     dfsUsed = r.getDfsUsed();
     remaining = r.getRemaining();
     blockPoolUsed = r.getBlockPoolUsed();
   }
 {code}
 On clusters that were upgraded from a pre HDFS-2832 version though the 
 storage Id has not been rewritten (at least not on the four clusters I 
 checked) so each directory will have the exact same storageId. That means 
 there'll be only a single entry in the {{storageMap}} and it'll be 
 overwritten by a random {{StorageReport}} from the DataNode. This can be seen 
 in the {{updateState}} method above. This just assigns the capacity from the 
 received report; instead, it should probably sum the values up per received heartbeat.
 The Balancer seems to be one of the only things that actually uses this 
 information so it now considers the utilization of a random drive per 
 DataNode for balancing purposes.
 Things get even worse when a drive has been added or replaced as this will 
 now get a new storage Id so there'll be two entries in the storageMap. As new 
 drives are usually empty, this skews the balancer's decision in a way that this 
 node will never be considered over-utilized.
 Another problem is that old StorageReports are never removed from the 
 storageMap. So if I replace a drive and it gets a new storage Id the old one 
 will still be in place and used for all calculations by the Balancer until a 
 restart of the NameNode.
 I can try providing a patch that does the following:
 * Instead of using a Map I could just store the array we receive or, instead 
 of storing an array, sum up the values for reports with the same Id
 * On each heartbeat clear the map (so we know we have up to date information)
 Does that sound sensible?
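 
 A minimal sketch of the aggregation idea above, using hypothetical stand-in 
 types rather than the real StorageReport/DatanodeStorageInfo classes; it is not 
 the attached patch, it only shows summing values for reports that share a 
 storage Id and rebuilding the map on every heartbeat instead of letting the 
 last report win:
 {code}
 // Sketch only: hypothetical types standing in for StorageReport and the
 // per-storage stats kept on the NameNode.
 import java.util.HashMap;
 import java.util.Map;

 public class HeartbeatAggregationSketch {
   static final class Stats {
     long capacity, dfsUsed, remaining, blockPoolUsed;
   }

   static final class Report {                      // stand-in for StorageReport
     final String storageId;
     final long capacity, dfsUsed, remaining, blockPoolUsed;
     Report(String id, long c, long u, long r, long b) {
       storageId = id; capacity = c; dfsUsed = u; remaining = r; blockPoolUsed = b;
     }
   }

   /** Rebuild the per-storage map from scratch on every heartbeat, summing duplicates. */
   static Map<String, Stats> aggregate(Report[] heartbeat) {
     Map<String, Stats> byStorageId = new HashMap<>();
     for (Report r : heartbeat) {
       Stats s = byStorageId.computeIfAbsent(r.storageId, k -> new Stats());
       s.capacity += r.capacity;          // sum instead of overwrite, so duplicate
       s.dfsUsed += r.dfsUsed;            // pre-HDFS-2832 storage Ids still account
       s.remaining += r.remaining;        // for every directory's space
       s.blockPoolUsed += r.blockPoolUsed;
     }
     return byStorageId;
   }
 }
 {code}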



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7647) DatanodeManager.sortLocatedBlocks() sorts DatanodeInfos but not StorageIDs

2015-01-21 Thread Milan Desai (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286594#comment-14286594
 ] 

Milan Desai commented on HDFS-7647:
---

Hi [~arpitagarwal], sounds good - I will give it a shot.

 DatanodeManager.sortLocatedBlocks() sorts DatanodeInfos but not StorageIDs
 --

 Key: HDFS-7647
 URL: https://issues.apache.org/jira/browse/HDFS-7647
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Milan Desai
Assignee: Milan Desai
 Attachments: HDFS-7647.patch


 DatanodeManager.sortLocatedBlocks() sorts the array of DatanodeInfos inside 
 each LocatedBlock, but does not touch the array of StorageIDs and 
 StorageTypes. As a result, the DatanodeInfos and StorageIDs/StorageTypes are 
 mismatched. The method is called by FSNamesystem.getBlockLocations(), so the 
 client will not know which StorageID/Type corresponds to which DatanodeInfo.
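 
 As an illustration of the general fix shape (compute one permutation and apply 
 it to every parallel array so they stay aligned), here is a sketch using plain 
 Strings rather than the real DatanodeInfo/StorageType classes; it is not the 
 attached patch:
 {code}
 // Illustrative parallel-array sort: sort an index permutation once, then apply
 // it to nodes, storage IDs and storage types together.
 import java.util.Arrays;
 import java.util.Comparator;
 import java.util.stream.IntStream;

 public class ParallelSortSketch {
   static void sortTogether(String[] nodes, String[] storageIds, String[] storageTypes,
                            Comparator<String> nodeOrder) {
     Integer[] perm = IntStream.range(0, nodes.length).boxed().toArray(Integer[]::new);
     Arrays.sort(perm, (i, j) -> nodeOrder.compare(nodes[i], nodes[j]));

     String[] n = nodes.clone(), s = storageIds.clone(), t = storageTypes.clone();
     for (int k = 0; k < perm.length; k++) {
       nodes[k] = n[perm[k]];           // every array gets the same reordering,
       storageIds[k] = s[perm[k]];      // so index i still refers to one
       storageTypes[k] = t[perm[k]];    // (node, storage ID, storage type) triple
     }
   }

   public static void main(String[] args) {
     String[] nodes = {"dn2", "dn1"}, ids = {"s2", "s1"}, types = {"SSD", "DISK"};
     sortTogether(nodes, ids, types, Comparator.naturalOrder());
     System.out.println(Arrays.toString(nodes) + Arrays.toString(ids) + Arrays.toString(types));
     // -> [dn1, dn2][s1, s2][DISK, SSD]
   }
 }
 {code}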



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832

2015-01-21 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286637#comment-14286637
 ] 

Colin Patrick McCabe commented on HDFS-7575:


bq. I disagree. The implementation is a bug: it is supposed to change the old IDs 
(in the old ID format) to use the new UUID format. The entire heterogeneous storage 
design requires storage IDs to be unique. Which implementation works correctly 
with duplicate storage IDs?

I don't think it's productive to argue about whether this represents a "true 
layout version change" or whether it is "layout version change-y" enough.  
Clearly we both agree that doing an LV change here would work and solve the 
problem.  At the end of the day, we have to make the decision based on which 
way is more maintainable.

This does bring up a practical point, though.  It will be easier to backport 
the "silently modify the VERSION file" patch to 2.6.1 than the LV change.  In 
view of this, I think it's fine to backport the "silently change VERSION" fix 
to 2.6.1.  I just don't want to have to support it forever in 3.0 and onward.

bq. For the so-called practical points you made (say, "Again, if I accidentally 
duplicate a directory on a datanode, ..."), how could updating layout version 
help?

If we check for directories with duplicate storage IDs and exclude them, then 
the system administrator becomes aware that there is a problem.  It helps by 
not harming: by not changing the VERSION file when we don't know for sure 
why the VERSION file is wrong.
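
For illustration, here is a rough sketch of the "check for duplicate storage IDs 
and exclude them" idea; the types and method names are made up and this is not 
code from any HDFS-7575 patch:
{code}
// Hypothetical check: keep the first directory seen for each storage ID and
// warn about the rest, so the operator notices the duplicate instead of the
// NameNode silently collapsing the directories into one map entry.
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class DuplicateStorageIdCheck {
  /** Returns the storage directories to keep, skipping any whose ID repeats. */
  static Map<String, String> excludeDuplicates(Map<String, String> dirToStorageId) {
    Map<String, String> keep = new LinkedHashMap<>();
    Set<String> seen = new HashSet<>();
    for (Map.Entry<String, String> e : dirToStorageId.entrySet()) {
      if (seen.add(e.getValue())) {
        keep.put(e.getKey(), e.getValue());
      } else {
        System.err.println("WARN: duplicate storage ID " + e.getValue()
            + " in " + e.getKey() + "; excluding it");
      }
    }
    return keep;
  }

  public static void main(String[] args) {
    Map<String, String> dirs = new LinkedHashMap<>();
    dirs.put("/data/1", "DS-legacy-id");
    dirs.put("/data/2", "DS-legacy-id");   // a pre-HDFS-2832 upgrade left a duplicate
    System.out.println(excludeDuplicates(dirs).keySet());  // [/data/1]
  }
}
{code}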

 NameNode not handling heartbeats properly after HDFS-2832
 -

 Key: HDFS-7575
 URL: https://issues.apache.org/jira/browse/HDFS-7575
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0, 2.5.0, 2.6.0
Reporter: Lars Francke
Assignee: Arpit Agarwal
Priority: Critical
 Attachments: HDFS-7575.01.patch, HDFS-7575.02.patch, 
 HDFS-7575.03.binary.patch, HDFS-7575.03.patch, HDFS-7575.04.binary.patch, 
 HDFS-7575.04.patch, HDFS-7575.05.binary.patch, HDFS-7575.05.patch, 
 testUpgrade22via24GeneratesStorageIDs.tgz, 
 testUpgradeFrom22GeneratesStorageIDs.tgz, 
 testUpgradeFrom24PreservesStorageId.tgz


 Before HDFS-2832 each DataNode would have a unique storageId which included 
 its IP address. Since HDFS-2832 the DataNodes have a unique storageId per 
 storage directory which is just a random UUID.
 They send reports per storage directory in their heartbeats. This heartbeat 
 is processed on the NameNode in the 
 {{DatanodeDescriptor#updateHeartbeatState}} method. Pre HDFS-2832 this would 
 just store the information per Datanode. After the patch though each DataNode 
 can have multiple different storages so it's stored in a map keyed by the 
 storage Id.
 This works fine for all clusters that have been installed post HDFS-2832 as 
 they get a UUID for their storage Id. So a DN with 8 drives has a map with 8 
 different keys. On each Heartbeat the Map is searched and updated 
 ({{DatanodeStorageInfo storage = storageMap.get(s.getStorageID());}}):
 {code:title=DatanodeStorageInfo}
   void updateState(StorageReport r) {
     capacity = r.getCapacity();
     dfsUsed = r.getDfsUsed();
     remaining = r.getRemaining();
     blockPoolUsed = r.getBlockPoolUsed();
   }
 {code}
 On clusters that were upgraded from a pre HDFS-2832 version though the 
 storage Id has not been rewritten (at least not on the four clusters I 
 checked) so each directory will have the exact same storageId. That means 
 there'll be only a single entry in the {{storageMap}} and it'll be 
 overwritten by a random {{StorageReport}} from the DataNode. This can be seen 
 in the {{updateState}} method above. This just assigns the capacity from the 
 received report; instead, it should probably sum the values up per received heartbeat.
 The Balancer seems to be one of the only things that actually uses this 
 information so it now considers the utilization of a random drive per 
 DataNode for balancing purposes.
 Things get even worse when a drive has been added or replaced as this will 
 now get a new storage Id so there'll be two entries in the storageMap. As new 
 drives are usually empty, this skews the balancer's decision in a way that this 
 node will never be considered over-utilized.
 Another problem is that old StorageReports are never removed from the 
 storageMap. So if I replace a drive and it gets a new storage Id the old one 
 will still be in place and used for all calculations by the Balancer until a 
 restart of the NameNode.
 I can try providing a patch that does the following:
 * Instead of using a Map I could just store the array we receive or, instead 
 of storing an array, sum up the values for reports with the same Id
 * On each heartbeat clear the map (so we know we have up to date information)

[jira] [Updated] (HDFS-4173) If the NameNode has already been formatted, but a QuorumJournal has not, auto-format it on startup

2015-01-21 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-4173:
---
Resolution: Won't Fix
Status: Resolved  (was: Patch Available)

We currently require a format for both local file and QJM-based edit log 
directories, so the behavior is consistent without this change.  Closing as 
Won't Fix... we can reopen later if we reconsider.

 If the NameNode has already been formatted, but a QuorumJournal has not, 
 auto-format it on startup
 --

 Key: HDFS-4173
 URL: https://issues.apache.org/jira/browse/HDFS-4173
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: journal-node, namenode
Affects Versions: 2.0.3-alpha
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: HDFS-4173.001.patch


 If we have multiple edit log directories, and some of them are formatted, but 
 others are not, we format the unformatted ones.  However, when we implemented 
 QuorumJournalManager, we did not extend this behavior to it.  It makes sense 
 to do this.
 One use case is if you want to add a QuorumJournalManager URI 
 ({{journal://}}) to an existing {{NameNode}}, without reformatting 
 everything.  There is currently no easy way to do this, since {{namenode 
 \-format}} will nuke everything, and there's no other way to format the 
 {{JournalNodes}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7654) TestFileTruncate#testTruncateEditLogLoad fails intermittently

2015-01-21 Thread Colin Patrick McCabe (JIRA)
Colin Patrick McCabe created HDFS-7654:
--

 Summary: TestFileTruncate#testTruncateEditLogLoad fails 
intermittently
 Key: HDFS-7654
 URL: https://issues.apache.org/jira/browse/HDFS-7654
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe


TestFileTruncate#testTruncateEditLogLoad fails intermittently with an error 
message like this: 
{code}
java.io.IOException: Timed out waiting for Mini HDFS Cluster to start
at 
org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1194)
at 
org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1819)
at 
org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1780)
at 
org.apache.hadoop.hdfs.server.namenode.TestFileTruncate.testTruncateEditLogLoad(TestFileTruncate.java:500)
{code}

Also, FSNamesystem ERROR logs appear in the test run log even when the test 
passes.  Example:
{code}
2015-01-21 18:52:36,474 ERROR namenode.NameNode 
(DirectoryWithQuotaFeature.java:checkDiskspace(82)) - BUG: Inconsistent 
diskspace for directory /test. Cached = 48 != Computed = 54
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7430) Rewrite the BlockScanner to use O(1) memory and use multiple threads

2015-01-21 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286857#comment-14286857
 ] 

Colin Patrick McCabe commented on HDFS-7430:


TestBalancer#testUnknownDatanode failure is HDFS-7267, not caused by this 
patch.  TestFileTruncate#testTruncateEditLogLoad is HDFS-7654.

 Rewrite the BlockScanner to use O(1) memory and use multiple threads
 

 Key: HDFS-7430
 URL: https://issues.apache.org/jira/browse/HDFS-7430
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-7430.002.patch, HDFS-7430.003.patch, 
 HDFS-7430.004.patch, HDFS-7430.005.patch, HDFS-7430.006.patch, 
 HDFS-7430.007.patch, HDFS-7430.008.patch, HDFS-7430.009.patch, 
 HDFS-7430.010.patch, HDFS-7430.011.patch, HDFS-7430.012.patch, memory.png


 We should update the BlockScanner to use a constant amount of memory by 
 keeping track of what block was scanned last, rather than by tracking the 
 scan status of all blocks in memory.  Also, instead of having just one 
 thread, we should have a verification thread per hard disk (or other volume), 
 scanning at a configurable rate of bytes per second.
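 
 A rough sketch of the approach described above: one scanner thread per volume 
 that remembers only the last scanned block and throttles to a configurable 
 bytes-per-second rate. The class and method names are assumptions, not the 
 HDFS-7430 code.
 {code}
 // Sketch only: the placeholder methods stand in for real volume iteration and
 // checksum verification. Assumes bytesPerSecond > 0.
 public class VolumeScannerSketch implements Runnable {
   private final long bytesPerSecond;
   private volatile long lastScannedBlockId = -1;   // O(1) cursor instead of per-block state

   public VolumeScannerSketch(long bytesPerSecond) {
     this.bytesPerSecond = bytesPerSecond;
   }

   @Override
   public void run() {
     while (!Thread.currentThread().isInterrupted()) {
       long blockId = nextBlockAfter(lastScannedBlockId); // resume from the cursor
       long scannedBytes = scanBlock(blockId);            // verify checksums
       lastScannedBlockId = blockId;
       try {
         // crude throttle: sleep long enough to stay under the configured rate
         Thread.sleep(Math.max(1, scannedBytes * 1000 / bytesPerSecond));
       } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
       }
     }
   }

   private long nextBlockAfter(long id) { return id + 1; }          // placeholder
   private long scanBlock(long id) { return 128L * 1024 * 1024; }   // placeholder
 }
 {code}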



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7647) DatanodeManager.sortLocatedBlocks() sorts DatanodeInfos but not StorageIDs

2015-01-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286858#comment-14286858
 ] 

Hadoop QA commented on HDFS-7647:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12693747/HDFS-7647.patch
  against trunk revision 0742591.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.namenode.TestCacheDirectives

  The following test timeouts occurred in 
hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistFiles

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9299//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9299//console

This message is automatically generated.

 DatanodeManager.sortLocatedBlocks() sorts DatanodeInfos but not StorageIDs
 --

 Key: HDFS-7647
 URL: https://issues.apache.org/jira/browse/HDFS-7647
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Milan Desai
Assignee: Milan Desai
 Attachments: HDFS-7647.patch


 DatanodeManager.sortLocatedBlocks() sorts the array of DatanodeInfos inside 
 each LocatedBlock, but does not touch the array of StorageIDs and 
 StorageTypes. As a result, the DatanodeInfos and StorageIDs/StorageTypes are 
 mismatched. The method is called by FSNamesystem.getBlockLocations(), so the 
 client will not know which StorageID/Type corresponds to which DatanodeInfo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7647) DatanodeManager.sortLocatedBlocks() sorts DatanodeInfos but not StorageIDs

2015-01-21 Thread Milan Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milan Desai updated HDFS-7647:
--
Status: Open  (was: Patch Available)

 DatanodeManager.sortLocatedBlocks() sorts DatanodeInfos but not StorageIDs
 --

 Key: HDFS-7647
 URL: https://issues.apache.org/jira/browse/HDFS-7647
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Milan Desai
Assignee: Milan Desai
 Attachments: HDFS-7647.patch


 DatanodeManager.sortLocatedBlocks() sorts the array of DatanodeInfos inside 
 each LocatedBlock, but does not touch the array of StorageIDs and 
 StorageTypes. As a result, the DatanodeInfos and StorageIDs/StorageTypes are 
 mismatched. The method is called by FSNamesystem.getBlockLocations(), so the 
 client will not know which StorageID/Type corresponds to which DatanodeInfo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work started] (HDFS-7647) DatanodeManager.sortLocatedBlocks() sorts DatanodeInfos but not StorageIDs

2015-01-21 Thread Milan Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-7647 started by Milan Desai.
-
 DatanodeManager.sortLocatedBlocks() sorts DatanodeInfos but not StorageIDs
 --

 Key: HDFS-7647
 URL: https://issues.apache.org/jira/browse/HDFS-7647
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Milan Desai
Assignee: Milan Desai
 Attachments: HDFS-7647.patch


 DatanodeManager.sortLocatedBlocks() sorts the array of DatanodeInfos inside 
 each LocatedBlock, but does not touch the array of StorageIDs and 
 StorageTypes. As a result, the DatanodeInfos and StorageIDs/StorageTypes are 
 mismatched. The method is called by FSNamesystem.getBlockLocations(), so the 
 client will not know which StorageID/Type corresponds to which DatanodeInfo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-7392) org.apache.hadoop.hdfs.DistributedFileSystem open invalid URI forever

2015-01-21 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286987#comment-14286987
 ] 

Yi Liu edited comment on HDFS-7392 at 1/22/15 5:47 AM:
---

[~peschd], I think this issue is not important, for the following reasons:
*1.* You are connecting to an incorrect fs URI; if you connect to a correct one, 
then there is no issue.
*2.* Even with an incorrect fs URI, it will indeed retry again and again in the 
specific environment you described, but you can see the failure in the log.

Could you tell me whether this blocks your usage and whether you are unable to 
use a correct fs URI? I think this issue is mostly caused by using an incorrect 
fs URI.


was (Author: hitliuyi):
[~peschd], I think this issue is not important, for the following reasons:
*1.* You are connecting to an incorrect fs URI; if you connect to a correct one, 
then there is no issue.
*2.* Even with an incorrect fs URI, it will indeed retry again and again, but 
you can see the failure in the log.

Could you tell me whether this blocks your usage and whether you are unable to 
use a correct fs URI? I think this issue is mostly caused by using an incorrect 
fs URI.

 org.apache.hadoop.hdfs.DistributedFileSystem open invalid URI forever
 -

 Key: HDFS-7392
 URL: https://issues.apache.org/jira/browse/HDFS-7392
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Reporter: Frantisek Vacek
Assignee: Yi Liu
 Attachments: 1.png, 2.png, HDFS-7392.diff


 In some specific circumstances, 
 org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out 
 and lasts forever.
 The specific circumstances are:
 1) The HDFS URI (hdfs://share.example.com:8020/someDir/someFile.txt) should point 
 to a valid IP address, but with no name node service running on it.
 2) There should be at least 2 IP addresses for such a URI. See output below:
 {quote}
 [~/proj/quickbox]$ nslookup share.example.com
 Server: 127.0.1.1
 Address:127.0.1.1#53
 share.example.com canonical name = 
 internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com.
 Name:   internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
 Address: 192.168.1.223
 Name:   internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
 Address: 192.168.1.65
 {quote}
 In such a case org.apache.hadoop.ipc.Client.Connection.updateAddress() 
 sometimes returns true (even though the address didn't actually change, see 
 img. 1) and the timeoutFailures counter is reset to 0 (see img. 2). 
 maxRetriesOnSocketTimeouts (45) is never reached and the connection attempt is 
 repeated forever.
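 
 The sketch below is illustrative only; it shows, with hypothetical names and 
 values, why resetting the failure counter whenever the resolved address appears 
 to change defeats the maxRetriesOnSocketTimeouts cap when DNS alternates 
 between two addresses. It is not the Hadoop IPC client code.
 {code}
 // Illustration of the unbounded retry: the counter is reset every time the
 // "fresh" DNS answer differs from the current one, which with two alternating
 // ELB addresses is nearly every attempt.
 import java.net.InetSocketAddress;

 public class RetryResetSketch {
   static final int MAX_RETRIES_ON_SOCKET_TIMEOUTS = 45;

   public static void main(String[] args) {
     InetSocketAddress[] answers = {                      // hypothetical values
         new InetSocketAddress("192.168.1.223", 8020),
         new InetSocketAddress("192.168.1.65", 8020)
     };
     InetSocketAddress current = answers[0];
     int timeoutFailures = 0;

     for (int attempt = 0; attempt < 10; attempt++) {     // bounded here for the demo
       // ... a connect attempt times out ...
       InetSocketAddress fresh = answers[attempt % 2];    // DNS alternates answers
       if (!fresh.equals(current)) {
         current = fresh;
         timeoutFailures = 0;           // reset => the 45-retry cap is never reached
       } else if (++timeoutFailures >= MAX_RETRIES_ON_SOCKET_TIMEOUTS) {
         throw new RuntimeException("would give up here");
       }
       System.out.println("attempt " + attempt + ", failures=" + timeoutFailures);
     }
   }
 }
 {code}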



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7584) Enable Quota Support for Storage Types (SSD)

2015-01-21 Thread Xiaoyu Yao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287001#comment-14287001
 ] 

Xiaoyu Yao commented on HDFS-7584:
--

Thanks [~zhz] for the feedback. 

bq. 1. This is an interesting scenario and is worth more discussion. It is a 
conservative and safe policy to deduct from both SSD and DISK quotas. However 
it doesn't fully comply with the principle of quota based on intended usage, 
which might make it appear counter-intuitive to users (e.g. why am I double 
charged?). As an extreme example, what if the user doesn't have any DISK quota?

Agreed. The "cumulative disk space quota" in the spec means the traditional space 
quota, not the quota of the DISK storage type. I will update it to avoid confusion. 
We are actually calculating quota based on intended usage, by not deducting extra 
DISK quota when the allocation falls back due to the policy.

For example, we have a directory /ssd1 with the ONE_SSD policy enabled. The 
directory currently has 2 blocks of SSD quota and 3 blocks of DISK quota 
remaining.
If we want to create a file of 1 block with a replication factor of 3 under 
/ssd1 but the actual available SSD storage is 0, the creation falls back to 
DISK. 
After that, the remaining SSD and DISK quota should be 1 block and 1 block 
instead of 2 blocks and 0 blocks, respectively. 

If the fallback still can't be satisfied because DISK quota is unavailable, the 
user will get a QuotaByStorageTypeExceeded exception.
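
A tiny arithmetic sketch of the "charge by intended usage" rule for the /ssd1 
example above; the class and field names are made up for illustration and are 
not the HDFS-7584 API:
{code}
// ONE_SSD intends 1 replica on SSD and the remaining replicas on DISK, so the
// quota is deducted by intent even if the SSD replica falls back to DISK.
public class OneSsdQuotaSketch {
  public static void main(String[] args) {
    long ssdQuota = 2, diskQuota = 3;      // remaining quota, in blocks
    int replication = 3, blocks = 1;

    long intendedSsd = blocks;                              // one replica intended on SSD
    long intendedDisk = (long) blocks * (replication - 1);  // the rest intended on DISK

    ssdQuota -= intendedSsd;    // 2 -> 1
    diskQuota -= intendedDisk;  // 3 -> 1 (not 3 -> 0, as charging by actual usage would give)
    System.out.println("remaining: SSD=" + ssdQuota + " DISK=" + diskQuota);
  }
}
{code}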

bq. 2. How about calculating quota truly based on intended usage? The charged 
quota might be different than the usage, but so is the case with existing quota 
logic. What are other disadvantages?

Quota calculation based on intended usage saves the replication monitor from 
updating traditional space/namespace quota usage for under/over replicated 
blocks. For quota by storage type, it can similarly save the Mover from 
updating quota when the blocks are moved across storage tiers to meet the 
policy requirement.

bq. 3. If we do want to charge by actual usage (5.2), maybe we should allow 
different quota currencies to be exchanged? Something like 1GB of SSD = 2GB 
of DISK = 4GB of ARCHIVAL. Or at least allow a user with only 1GB SSD quota to 
use 1GB DISK space.

As mentioned above, we prefer charging by intended usage for its simplicity and 
consistency. Correlating quota across different storage types looks interesting; 
it may require additional tuning to get appropriate currency rates between 
storage types for different user scenarios.  

 Enable Quota Support for Storage Types (SSD) 
 -

 Key: HDFS-7584
 URL: https://issues.apache.org/jira/browse/HDFS-7584
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, namenode
Reporter: Xiaoyu Yao
Assignee: Xiaoyu Yao
 Attachments: HDFS-7584 Quota by Storage Type - 01202015.pdf, 
 HDFS-7584.0.patch


 Phase II of the heterogeneous storage feature was completed by HDFS-6584. 
 This JIRA is opened to enable quota support for different storage types in 
 terms of storage space usage. This is more important for certain storage 
 types such as SSD, as it is precious and more performant. 
 As described in the design doc of HDFS-5682, we plan to add a new 
 quotaByStorageType command and a new name node RPC protocol for it. The quota 
 by storage type feature is applied at the HDFS directory level, similar to 
 the traditional HDFS space quota. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7633) When Datanode has too many blocks, BlockPoolSliceScanner.getNewBlockScanTime throws IllegalArgumentException

2015-01-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287002#comment-14287002
 ] 

Hadoop QA commented on HDFS-7633:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12693787/HDFS-7633.patch
  against trunk revision ee7d22e.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistFiles
  org.apache.hadoop.hdfs.server.namenode.TestFileTruncate

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9301//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9301//console

This message is automatically generated.

 When Datanode has too many blocks, BlockPoolSliceScanner.getNewBlockScanTime 
 throws IllegalArgumentException
 

 Key: HDFS-7633
 URL: https://issues.apache.org/jira/browse/HDFS-7633
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Walter Su
Assignee: Walter Su
Priority: Minor
 Attachments: HDFS-7633.patch


 issue:
 When the total block count on one of my DNs reaches 33554432, it refuses to 
 accept more blocks. This is the ERROR:
 2015-01-16 15:21:44,571 | ERROR | DataXceiver for client  at /172.1.1.8:50490 
 [Receiving block 
 BP-1976278848-172.1.1.2-1419846518085:blk_1221043436_147936990] | 
 datasight-198:25009:DataXceiver error processing WRITE_BLOCK operation  src: 
 /172.1.1.8:50490 dst: /172.1.1.11:25009 | 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:250)
 java.lang.IllegalArgumentException: n must be positive
 at java.util.Random.nextInt(Random.java:300)
 at 
 org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.getNewBlockScanTime(BlockPoolSliceScanner.java:263)
 at 
 org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.addBlock(BlockPoolSliceScanner.java:276)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.addBlock(DataBlockScanner.java:193)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.closeBlock(DataNode.java:1733)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:765)
 at 
 org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
 at 
 org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
 at java.lang.Thread.run(Thread.java:745)
 analysis:
 in function 
 org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.getNewBlockScanTime(),
 when blockMap.size() is too big:
 Math.max(blockMap.size(),1) * 600 is int type, and negative
 Math.max(blockMap.size(),1) * 600 * 1000L is long type, and negative
 (int)period is Integer.MIN_VALUE
 Math.abs((int)period) is Integer.MIN_VALUE, which is negative
 DFSUtil.getRandom().nextInt(periodInt) will throw IllegalArgumentException
 I use Java HotSpot (build 1.7.0_05-b05)
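 
 A standalone sketch reproducing the overflow chain above with 
 blockMap.size() = 33554432, plus one possible way to avoid it (do the 
 arithmetic in long and clamp before nextInt); this is illustrative and is not 
 the attached patch.
 {code}
 // Demo of the int overflow: 33554432 * 600 wraps to a negative int before the
 // widening to long, and Math.abs(Integer.MIN_VALUE) stays negative.
 import java.util.Random;

 public class ScanTimeOverflowDemo {
   public static void main(String[] args) {
     int blocks = 33554432;                     // reported DN block count
     long period = Math.max(blocks, 1) * 600    // int * int overflows first...
         * 1000L;                               // ...then widens to a negative long
     int periodInt = Math.abs((int) period);    // (int) period == Integer.MIN_VALUE,
                                                // so Math.abs keeps it negative
     System.out.println("period=" + period + " periodInt=" + periodInt);
     // new Random().nextInt(periodInt);        // would throw IllegalArgumentException

     // One possible fix: compute in long from the start and clamp before nextInt().
     long safePeriod = Math.min((long) Math.max(blocks, 1) * 600L * 1000L,
         Integer.MAX_VALUE);
     int offset = new Random().nextInt((int) safePeriod);
     System.out.println("offset=" + offset);
   }
 }
 {code}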



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7392) org.apache.hadoop.hdfs.DistributedFileSystem open invalid URI forever

2015-01-21 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286987#comment-14286987
 ] 

Yi Liu commented on HDFS-7392:
--

[~peschd], I think this issue is not important, for the following reasons:
*1.* You are connecting to an incorrect fs URI; if you connect to a correct one, 
then there is no issue.
*2.* Even with an incorrect fs URI, it will indeed retry again and again, but 
you can see the failure in the log.

Could you tell me whether this blocks your usage and whether you are unable to 
use a correct fs URI? I think this issue is mostly caused by using an incorrect 
fs URI.

 org.apache.hadoop.hdfs.DistributedFileSystem open invalid URI forever
 -

 Key: HDFS-7392
 URL: https://issues.apache.org/jira/browse/HDFS-7392
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Reporter: Frantisek Vacek
Assignee: Yi Liu
 Attachments: 1.png, 2.png, HDFS-7392.diff


 In some specific circumstances, 
 org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out 
 and lasts forever.
 The specific circumstances are:
 1) The HDFS URI (hdfs://share.example.com:8020/someDir/someFile.txt) should point 
 to a valid IP address, but with no name node service running on it.
 2) There should be at least 2 IP addresses for such a URI. See output below:
 {quote}
 [~/proj/quickbox]$ nslookup share.example.com
 Server: 127.0.1.1
 Address:127.0.1.1#53
 share.example.com canonical name = 
 internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com.
 Name:   internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
 Address: 192.168.1.223
 Name:   internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
 Address: 192.168.1.65
 {quote}
 In such a case org.apache.hadoop.ipc.Client.Connection.updateAddress() 
 sometimes returns true (even though the address didn't actually change, see 
 img. 1) and the timeoutFailures counter is reset to 0 (see img. 2). 
 maxRetriesOnSocketTimeouts (45) is never reached and the connection attempt is 
 repeated forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7339) Allocating and persisting block groups in NameNode

2015-01-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286776#comment-14286776
 ] 

Hadoop QA commented on HDFS-7339:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12693731/HDFS-7339-006.patch
  against trunk revision 0742591.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9297//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9297//console

This message is automatically generated.

 Allocating and persisting block groups in NameNode
 --

 Key: HDFS-7339
 URL: https://issues.apache.org/jira/browse/HDFS-7339
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Zhe Zhang
Assignee: Zhe Zhang
 Attachments: HDFS-7339-001.patch, HDFS-7339-002.patch, 
 HDFS-7339-003.patch, HDFS-7339-004.patch, HDFS-7339-005.patch, 
 HDFS-7339-006.patch, Meta-striping.jpg, NN-stripping.jpg


 All erasure codec operations center around the concept of _block group_; they 
 are formed in initial encoding and looked up in recoveries and conversions. A 
 lightweight class {{BlockGroup}} is created to record the original and parity 
 blocks in a coding group, as well as a pointer to the codec schema (pluggable 
 codec schemas will be supported in HDFS-7337). With the striping layout, the 
 HDFS client needs to operate on all blocks in a {{BlockGroup}} concurrently. 
 Therefore we propose to extend a file’s inode to switch between _contiguous_ 
 and _striping_ modes, with the current mode recorded in a binary flag. An 
 array of BlockGroups (or BlockGroup IDs) is added, which remains empty for 
 “traditional” HDFS files with contiguous block layout.
 The NameNode creates and maintains {{BlockGroup}} instances through the new 
 {{ECManager}} component; the attached figure has an illustration of the 
 architecture. As a simple example, when a {_Striping+EC_} file is created and 
 written to, it will serve requests from the client to allocate new 
 {{BlockGroups}} and store them under the {{INodeFile}}. In the current phase, 
 {{BlockGroups}} are allocated both in initial online encoding and in the 
 conversion from replication to EC. {{ECManager}} also facilitates the lookup 
 of {{BlockGroup}} information for block recovery work.
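 
 A minimal sketch of the kind of bookkeeping class described above; the field 
 names and types are illustrative assumptions, not the HDFS-7339 implementation.
 {code}
 // Records the original and parity blocks of one coding group plus the codec
 // schema they were encoded with (field shapes are assumptions).
 import java.util.Arrays;

 public class BlockGroupSketch {
   private final long groupId;
   private final long[] dataBlockIds;    // original blocks in the group
   private final long[] parityBlockIds;  // parity blocks produced by encoding
   private final String codecSchema;     // e.g. "RS-6-3" (pluggable, see HDFS-7337)

   public BlockGroupSketch(long groupId, long[] data, long[] parity, String schema) {
     this.groupId = groupId;
     this.dataBlockIds = data.clone();
     this.parityBlockIds = parity.clone();
     this.codecSchema = schema;
   }

   @Override
   public String toString() {
     return "BlockGroup#" + groupId + " data=" + Arrays.toString(dataBlockIds)
         + " parity=" + Arrays.toString(parityBlockIds) + " codec=" + codecSchema;
   }
 }
 {code}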



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7653) Block Readers and Writers used in both client side and datanode side

2015-01-21 Thread Li Bo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Bo updated HDFS-7653:

Remaining Estimate: (was: 336h)
 Original Estimate: (was: 336h)

 Block Readers and Writers used in both client side and datanode side
 

 Key: HDFS-7653
 URL: https://issues.apache.org/jira/browse/HDFS-7653
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo

 There are a lot of block read/write operations in HDFS-EC. For example, when a 
 client writes a file in the striping layout, it has to write several blocks 
 to several different datanodes; if a datanode wants to do an 
 encoding/decoding task, it has to read several blocks from itself and other 
 datanodes, and write one or more blocks to itself or other datanodes.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7353) Raw Erasure Coder API for concrete encoding and decoding

2015-01-21 Thread Kai Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286827#comment-14286827
 ] 

Kai Zheng commented on HDFS-7353:
-

Sure. Let me update my patch.

 Raw Erasure Coder API for concrete encoding and decoding
 

 Key: HDFS-7353
 URL: https://issues.apache.org/jira/browse/HDFS-7353
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Kai Zheng
Assignee: Kai Zheng
 Fix For: HDFS-EC

 Attachments: HDFS-7353-v1.patch


 This is to abstract and define a raw erasure coder API across different code 
 algorithms like RS, XOR, etc. Such an API can be implemented by utilizing 
 various library support, such as the Intel ISA library and the Jerasure library.
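 
 A hedged sketch of what such a raw coder abstraction might look like, with a 
 trivial XOR flavor to show how one concrete coder could plug in; the method 
 names and shapes are assumptions, not the HDFS-7353 API.
 {code}
 // Sketch only: a raw encoder/decoder pair operating on ByteBuffers.
 import java.nio.ByteBuffer;

 interface RawErasureEncoder {
   /** Produce parity buffers from the data buffers (e.g. RS or XOR). */
   void encode(ByteBuffer[] dataUnits, ByteBuffer[] parityUnits);
 }

 interface RawErasureDecoder {
   /** Reconstruct the units listed in erasedIndexes from the surviving inputs. */
   void decode(ByteBuffer[] inputs, int[] erasedIndexes, ByteBuffer[] outputs);
 }

 // A trivial XOR flavor; assumes all buffers have the same number of remaining bytes.
 class XorRawEncoder implements RawErasureEncoder {
   @Override
   public void encode(ByteBuffer[] dataUnits, ByteBuffer[] parityUnits) {
     ByteBuffer parity = parityUnits[0];
     for (int pos = 0; pos < parity.remaining(); pos++) {
       byte b = 0;
       for (ByteBuffer data : dataUnits) {
         b ^= data.get(data.position() + pos);   // XOR across all data units
       }
       parity.put(parity.position() + pos, b);
     }
   }
 }
 {code}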



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7633) When Datanode has too many blocks, BlockPoolSliceScanner.getNewBlockScanTime throws IllegalArgumentException

2015-01-21 Thread Walter Su (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Walter Su updated HDFS-7633:

Attachment: (was: h7633_20150116.patch)

 When Datanode has too many blocks, BlockPoolSliceScanner.getNewBlockScanTime 
 throws IllegalArgumentException
 

 Key: HDFS-7633
 URL: https://issues.apache.org/jira/browse/HDFS-7633
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Walter Su
Assignee: Walter Su
Priority: Minor
 Attachments: HDFS-7633.patch


 issue:
 When the total block count on one of my DNs reaches 33554432, it refuses to 
 accept more blocks. This is the ERROR:
 2015-01-16 15:21:44,571 | ERROR | DataXceiver for client  at /172.1.1.8:50490 
 [Receiving block 
 BP-1976278848-172.1.1.2-1419846518085:blk_1221043436_147936990] | 
 datasight-198:25009:DataXceiver error processing WRITE_BLOCK operation  src: 
 /172.1.1.8:50490 dst: /172.1.1.11:25009 | 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:250)
 java.lang.IllegalArgumentException: n must be positive
 at java.util.Random.nextInt(Random.java:300)
 at 
 org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.getNewBlockScanTime(BlockPoolSliceScanner.java:263)
 at 
 org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.addBlock(BlockPoolSliceScanner.java:276)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.addBlock(DataBlockScanner.java:193)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.closeBlock(DataNode.java:1733)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:765)
 at 
 org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
 at 
 org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
 at java.lang.Thread.run(Thread.java:745)
 analysis:
 in function 
 org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.getNewBlockScanTime(),
 when blockMap.size() is too big:
 Math.max(blockMap.size(),1) * 600 is int type, and negative
 Math.max(blockMap.size(),1) * 600 * 1000L is long type, and negative
 (int)period is Integer.MIN_VALUE
 Math.abs((int)period) is Integer.MIN_VALUE, which is negative
 DFSUtil.getRandom().nextInt(periodInt) will throw IllegalArgumentException
 I use Java HotSpot (build 1.7.0_05-b05)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7633) When Datanode has too many blocks, BlockPoolSliceScanner.getNewBlockScanTime throws IllegalArgumentException

2015-01-21 Thread Walter Su (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Walter Su updated HDFS-7633:

Attachment: HDFS-7633.patch

regenerating patch with 'git diff'

 When Datanode has too many blocks, BlockPoolSliceScanner.getNewBlockScanTime 
 throws IllegalArgumentException
 

 Key: HDFS-7633
 URL: https://issues.apache.org/jira/browse/HDFS-7633
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Walter Su
Assignee: Walter Su
Priority: Minor
 Attachments: HDFS-7633.patch


 issue:
 When the total block count on one of my DNs reaches 33554432, it refuses to 
 accept more blocks. This is the ERROR:
 2015-01-16 15:21:44,571 | ERROR | DataXceiver for client  at /172.1.1.8:50490 
 [Receiving block 
 BP-1976278848-172.1.1.2-1419846518085:blk_1221043436_147936990] | 
 datasight-198:25009:DataXceiver error processing WRITE_BLOCK operation  src: 
 /172.1.1.8:50490 dst: /172.1.1.11:25009 | 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:250)
 java.lang.IllegalArgumentException: n must be positive
 at java.util.Random.nextInt(Random.java:300)
 at 
 org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.getNewBlockScanTime(BlockPoolSliceScanner.java:263)
 at 
 org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.addBlock(BlockPoolSliceScanner.java:276)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.addBlock(DataBlockScanner.java:193)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.closeBlock(DataNode.java:1733)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:765)
 at 
 org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
 at 
 org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
 at java.lang.Thread.run(Thread.java:745)
 analysis:
 in function 
 org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.getNewBlockScanTime(),
 when blockMap.size() is too big:
 Math.max(blockMap.size(),1) * 600 is int type, and negative
 Math.max(blockMap.size(),1) * 600 * 1000L is long type, and negative
 (int)period is Integer.MIN_VALUE
 Math.abs((int)period) is Integer.MIN_VALUE, which is negative
 DFSUtil.getRandom().nextInt(periodInt) will throw IllegalArgumentException
 I use Java HotSpot (build 1.7.0_05-b05)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7653) Block Readers and Writers used in both client side and datanode side

2015-01-21 Thread Li Bo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286851#comment-14286851
 ] 

Li Bo commented on HDFS-7653:
-

This patch includes the block readers and writers that will be used by the client 
and the datanode. Because there already exists a BlockReader interface, we named 
the new interface UnifiedBlockReader. Its sub-classes include: ClientBlockReader 
(read a block from the client), LocalDatanodeBlockReader (read a local block, used 
inside the datanode), RemoteDatanodeBlockReader (read a block from a remote 
datanode, used inside the datanode). BlockWriter's sub-classes include: 
ClientBlockWriter (write a block from the client), LocalDatanodeBlockWriter (write 
a block locally, used inside the datanode), RemoteDatanodeBlockWriter (write a 
block to a remote datanode, used inside the datanode). Two unit tests are also 
uploaded.
Feedback is welcome.
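
For readers skimming the thread, here is a rough sketch of the class split 
described above; the signatures are assumptions for illustration, not the 
contents of BlockReadersWriters.patch:
{code}
// Sketch of the reader/writer hierarchy named in the comment. Real
// implementations would carry block, datanode and stream state.
import java.io.IOException;
import java.nio.ByteBuffer;

interface UnifiedBlockReader {
  int read(ByteBuffer buf) throws IOException;   // read block bytes into buf
  void close() throws IOException;
}

interface BlockWriter {
  void write(ByteBuffer buf) throws IOException; // write block bytes from buf
  void close() throws IOException;
}

// Client side: read/write striped blocks over the wire.
abstract class ClientBlockReader implements UnifiedBlockReader {}
abstract class ClientBlockWriter implements BlockWriter {}

// Datanode side: local volume access for encode/decode work.
abstract class LocalDatanodeBlockReader implements UnifiedBlockReader {}
abstract class LocalDatanodeBlockWriter implements BlockWriter {}

// Datanode side: fetch from / push to a remote datanode.
abstract class RemoteDatanodeBlockReader implements UnifiedBlockReader {}
abstract class RemoteDatanodeBlockWriter implements BlockWriter {}
{code}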

 Block Readers and Writers used in both client side and datanode side
 

 Key: HDFS-7653
 URL: https://issues.apache.org/jira/browse/HDFS-7653
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo
 Attachments: BlockReadersWriters.patch


 There are a lot of block read/write operations in HDFS-EC. For example, when a 
 client writes a file in the striping layout, it has to write several blocks 
 to several different datanodes; if a datanode wants to do an 
 encoding/decoding task, it has to read several blocks from itself and other 
 datanodes, and write one or more blocks to itself or other datanodes.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7430) Rewrite the BlockScanner to use O(1) memory and use multiple threads

2015-01-21 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-7430:
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   Status: Resolved  (was: Patch Available)

Committed to trunk.  It took 3 months, and 12 revisions, but we got it!  Thanks 
for the reviews, Andrew.

As per Nicholas' suggestion, let's let this soak for a week or two in trunk 
before we backport to branch-2.  If we spot any bugs this will give us a chance 
to fix them.

 Rewrite the BlockScanner to use O(1) memory and use multiple threads
 

 Key: HDFS-7430
 URL: https://issues.apache.org/jira/browse/HDFS-7430
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Fix For: 3.0.0

 Attachments: HDFS-7430.002.patch, HDFS-7430.003.patch, 
 HDFS-7430.004.patch, HDFS-7430.005.patch, HDFS-7430.006.patch, 
 HDFS-7430.007.patch, HDFS-7430.008.patch, HDFS-7430.009.patch, 
 HDFS-7430.010.patch, HDFS-7430.011.patch, HDFS-7430.012.patch, memory.png


 We should update the BlockScanner to use a constant amount of memory by 
 keeping track of what block was scanned last, rather than by tracking the 
 scan status of all blocks in memory.  Also, instead of having just one 
 thread, we should have a verification thread per hard disk (or other volume), 
 scanning at a configurable rate of bytes per second.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-3107) HDFS truncate

2015-01-21 Thread Plamen Jeliazkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Plamen Jeliazkov updated HDFS-3107:
---
Attachment: HDFS-3107.15_branch2.patch

Attaching patch for branch-2 port. Ran tests locally with latest editsStored 
already attached to this JIRA. Everything passed locally.

 HDFS truncate
 -

 Key: HDFS-3107
 URL: https://issues.apache.org/jira/browse/HDFS-3107
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, namenode
Reporter: Lei Chang
Assignee: Plamen Jeliazkov
 Fix For: 3.0.0

 Attachments: HDFS-3107-13.patch, HDFS-3107-14.patch, 
 HDFS-3107-15.patch, HDFS-3107-HDFS-7056-combined.patch, HDFS-3107.008.patch, 
 HDFS-3107.15_branch2.patch, HDFS-3107.patch, HDFS-3107.patch, 
 HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, 
 HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, 
 HDFS-3107.patch, HDFS_truncate.pdf, HDFS_truncate.pdf, HDFS_truncate.pdf, 
 HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf, 
 editsStored, editsStored.xml

   Original Estimate: 1,344h
  Remaining Estimate: 1,344h

 Systems with transaction support often need to undo changes made to the 
 underlying storage when a transaction is aborted. Currently HDFS does not 
 support truncate (a standard POSIX operation), which is the reverse operation 
 of append, and this makes upper-layer applications use ugly workarounds (such 
 as keeping track of the discarded byte range per file in a separate metadata 
 store, and periodically running a vacuum process to rewrite compacted files) 
 to overcome this limitation of HDFS.
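 
 A hedged usage sketch of the truncate API this JIRA adds (FileSystem#truncate); 
 the path and length are made up, and the comments reflect the intended 
 semantics (true when the truncate completes immediately, false while the last 
 block is still being recovered).
 {code}
 // Usage sketch: truncate a file back to its first kilobyte.
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;

 public class TruncateExample {
   public static void main(String[] args) throws Exception {
     FileSystem fs = FileSystem.get(new Configuration());
     Path file = new Path("/tmp/txn.log");          // hypothetical file
     long newLength = 1024L;                        // keep the first 1 KB

     // true: the file is already at newLength and usable;
     // false: the last block is still being adjusted in the background.
     boolean done = fs.truncate(file, newLength);
     System.out.println("truncate completed synchronously: " + done);
   }
 }
 {code}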



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7633) When Datanode has too many blocks, BlockPoolSliceScanner.getNewBlockScanTime throws IllegalArgumentException

2015-01-21 Thread Walter Su (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286901#comment-14286901
 ] 

Walter Su commented on HDFS-7633:
-


{color:red}-1 overall{color}.  

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version ) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.


 When Datanode has too many blocks, BlockPoolSliceScanner.getNewBlockScanTime 
 throws IllegalArgumentException
 

 Key: HDFS-7633
 URL: https://issues.apache.org/jira/browse/HDFS-7633
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Walter Su
Assignee: Walter Su
Priority: Minor
 Attachments: HDFS-7633.patch


 issue:
 When the total block count on one of my DNs reaches 33554432, it refuses to 
 accept more blocks. This is the ERROR:
 2015-01-16 15:21:44,571 | ERROR | DataXceiver for client  at /172.1.1.8:50490 
 [Receiving block 
 BP-1976278848-172.1.1.2-1419846518085:blk_1221043436_147936990] | 
 datasight-198:25009:DataXceiver error processing WRITE_BLOCK operation  src: 
 /172.1.1.8:50490 dst: /172.1.1.11:25009 | 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:250)
 java.lang.IllegalArgumentException: n must be positive
 at java.util.Random.nextInt(Random.java:300)
 at 
 org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.getNewBlockScanTime(BlockPoolSliceScanner.java:263)
 at 
 org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.addBlock(BlockPoolSliceScanner.java:276)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.addBlock(DataBlockScanner.java:193)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.closeBlock(DataNode.java:1733)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:765)
 at 
 org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
 at 
 org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
 at java.lang.Thread.run(Thread.java:745)
 analysis:
 in function 
 org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.getNewBlockScanTime(),
 when blockMap.size() is too big:
 Math.max(blockMap.size(),1) * 600 is int type, and negative
 Math.max(blockMap.size(),1) * 600 * 1000L is long type, and negative
 (int)period is Integer.MIN_VALUE
 Math.abs((int)period) is Integer.MIN_VALUE, which is negative
 DFSUtil.getRandom().nextInt(periodInt) will throw IllegalArgumentException
 I use Java HotSpot (build 1.7.0_05-b05)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-3443) Unable to catch up edits during standby to active switch due to NPE

2015-01-21 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286911#comment-14286911
 ] 

Vinayakumar B commented on HDFS-3443:
-

Thanks [~szetszwo] for reviews and commit

 Unable to catch up edits during standby to active switch due to NPE
 ---

 Key: HDFS-3443
 URL: https://issues.apache.org/jira/browse/HDFS-3443
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: auto-failover, ha
Reporter: suja s
Assignee: Vinayakumar B
 Fix For: 2.6.1

 Attachments: HDFS-3443-003.patch, HDFS-3443-004.patch, 
 HDFS-3443-005.patch, HDFS-3443-006.patch, HDFS-3443-007.patch, 
 HDFS-3443_1.patch, HDFS-3443_1.patch


 Start NN
 Let NN standby services be started.
 Before the editLogTailer is initialised, start ZKFC and allow the 
 active services startup to proceed further.
 Here editLogTailer.catchupDuringFailover() will throw an NPE.
 {noformat}
 void startActiveServices() throws IOException {
   LOG.info("Starting services required for active state");
   writeLock();
   try {
     FSEditLog editLog = dir.fsImage.getEditLog();

     if (!editLog.isOpenForWrite()) {
       // During startup, we're already open for write during initialization.
       editLog.initJournalsForWrite();
       // May need to recover
       editLog.recoverUnclosedStreams();

       LOG.info("Catching up to latest edits from old active before " +
           "taking over writer role in edits logs.");
       editLogTailer.catchupDuringFailover();
 {noformat}
 {noformat}
 2012-05-18 16:51:27,585 WARN org.apache.hadoop.ipc.Server: IPC Server 
 Responder, call org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from 
 XX.XX.XX.55:58003: output error
 2012-05-18 16:51:27,586 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
 8 on 8020, call org.apache.hadoop.ha.HAServiceProtocol.transitionToActive 
 from XX.XX.XX.55:58004: error: java.lang.NullPointerException
 java.lang.NullPointerException
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:602)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
   at 
 org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
   at 
 org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
 2012-05-18 16:51:27,586 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
 9 on 8020 caught an exception
 java.nio.channels.ClosedChannelException
   at 
 sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
   at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2092)
   at org.apache.hadoop.ipc.Server.access$2000(Server.java:107)
   at 
 org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:930)
   at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:994)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1738)
 {noformat}
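 For illustration only (not the fix in the attached patches), the toy model below reproduces the ordering problem described above: the active transition can run before the standby-side initialization has assigned editLogTailer, so a guard or stricter ordering is needed. Apart from editLogTailer and the two start methods, all names here are assumptions.
 {code}
 public class FailoverOrderingSketch {
   private volatile Object editLogTailer;    // stands in for EditLogTailer

   void startStandbyServices() {
     editLogTailer = new Object();           // assigned late during standby startup
   }

   void startActiveServices() {
     if (editLogTailer == null) {
       // Fail fast so the caller (ZKFC in the real system) could retry,
       // instead of dereferencing null and throwing an NPE.
       throw new IllegalStateException("standby services not fully started");
     }
     // editLogTailer.catchupDuringFailover() would run here in the real code
   }

   public static void main(String[] args) {
     FailoverOrderingSketch nn = new FailoverOrderingSketch();
     try {
       // Transitioning to active before standby startup completes models the
       // window the stack trace above falls into.
       nn.startActiveServices();
     } catch (IllegalStateException expected) {
       System.out.println("rejected early transition: " + expected.getMessage());
     }
     nn.startStandbyServices();
     nn.startActiveServices();               // now succeeds
   }
 }
 {code}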



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7430) Rewrite the BlockScanner to use O(1) memory and use multiple threads

2015-01-21 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286837#comment-14286837
 ] 

Andrew Wang commented on HDFS-7430:
---

+1 again if you need it; this patch has been through the wringer review-wise 
and I think it's good to go. The failed tests look unrelated to me, but I would 
appreciate a double check.

 Rewrite the BlockScanner to use O(1) memory and use multiple threads
 

 Key: HDFS-7430
 URL: https://issues.apache.org/jira/browse/HDFS-7430
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-7430.002.patch, HDFS-7430.003.patch, 
 HDFS-7430.004.patch, HDFS-7430.005.patch, HDFS-7430.006.patch, 
 HDFS-7430.007.patch, HDFS-7430.008.patch, HDFS-7430.009.patch, 
 HDFS-7430.010.patch, HDFS-7430.011.patch, HDFS-7430.012.patch, memory.png


 We should update the BlockScanner to use a constant amount of memory by 
 keeping track of what block was scanned last, rather than by tracking the 
 scan status of all blocks in memory.  Also, instead of having just one 
 thread, we should have a verification thread per hard disk (or other volume), 
 scanning at a configurable rate of bytes per second.
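 As a rough illustration of that design (class, method, and field names here are assumptions, not the ones introduced by the patch), a per-volume scanner only needs a cursor for the last block scanned plus a byte budget:
 {code}
 import java.util.Iterator;
 import java.util.List;

 class VolumeScannerSketch implements Runnable {
   private final List<String> blocksOnVolume;  // block ids stored on this volume
   private final long bytesPerSecond;          // configured scan rate
   private volatile String lastScannedBlock;   // the only per-volume scan state kept

   VolumeScannerSketch(List<String> blocksOnVolume, long bytesPerSecond) {
     this.blocksOnVolume = blocksOnVolume;
     this.bytesPerSecond = bytesPerSecond;
   }

   @Override
   public void run() {
     Iterator<String> it = blocksOnVolume.iterator();
     while (it.hasNext() && !Thread.currentThread().isInterrupted()) {
       String block = it.next();
       long bytesScanned = scanAndVerify(block);
       lastScannedBlock = block;               // advance the cursor, nothing else
       try {
         // crude throttle: sleep long enough to stay under bytesPerSecond
         Thread.sleep(bytesScanned * 1000L / Math.max(1, bytesPerSecond));
       } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
       }
     }
   }

   private long scanAndVerify(String block) {
     // placeholder for reading the block and verifying its checksums;
     // returns the number of bytes read
     return 128L * 1024 * 1024;
   }
 }
 {code}
 A DataNode would then run one such thread per volume instead of a single global scanner, which is what bounds both the memory footprint and the per-disk I/O.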



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-3519) Checkpoint upload may interfere with a concurrent saveNamespace

2015-01-21 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-3519:
--
Attachment: HDFS-3519-branch-2.patch

Thanks, Chris. Here is the patch for branch-2.

 Checkpoint upload may interfere with a concurrent saveNamespace
 ---

 Key: HDFS-3519
 URL: https://issues.apache.org/jira/browse/HDFS-3519
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Todd Lipcon
Assignee: Ming Ma
Priority: Critical
 Attachments: HDFS-3519-2.patch, HDFS-3519-3.patch, 
 HDFS-3519-branch-2.patch, HDFS-3519.patch, test-output.txt


 TestStandbyCheckpoints failed in [precommit build 
 2620|https://builds.apache.org/job/PreCommit-HDFS-Build/2620//testReport/] 
 due to the following issue:
 - both nodes were in Standby state, and configured to checkpoint as fast as 
 possible
 - NN1 starts to save its own namespace
 - NN2 starts to upload a checkpoint for the same txid. So, both threads are 
 writing to the same file fsimage.ckpt_12, but the actual file contents 
 correspond to the uploading thread's data.
 - NN1 finishes its saveNamespace operation while NN2 is still uploading, so 
 it renames the ckpt file. However, the contents of the file are still empty 
 since NN2 hasn't sent any bytes.
 - NN2 finishes the upload, and the rename() call fails, which causes the 
 directory to be marked failed, etc.
 The result is that there is a file fsimage_12 which appears to be a finalized 
 image but in fact is incompletely transferred. When the transfer completes, 
 the problem heals itself so there wouldn't be persistent corruption unless 
 the machine crashes at the same time. And even then, we'd still have the 
 earlier checkpoint to restore from.
 This same race could occur in a non-HA setup if a user puts the NN in safe 
 mode and issues saveNamespace operations concurrently with a 2NN checkpoint, 
 I believe.
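 One way to picture a guard against this race (an assumption for illustration, not the approach taken in the attached patches) is to track which checkpoint txids currently have a writer, so the local saveNamespace path and the upload servlet can never both write fsimage.ckpt_N:
 {code}
 import java.util.Collections;
 import java.util.Set;
 import java.util.concurrent.ConcurrentHashMap;

 class CheckpointWriteGuard {
   private final Set<Long> txidsBeingWritten =
       Collections.newSetFromMap(new ConcurrentHashMap<Long, Boolean>());

   /** Call before creating fsimage.ckpt_txid; false means another writer owns it. */
   boolean tryStart(long txid) {
     return txidsBeingWritten.add(txid);
   }

   /** Call after the ckpt file has been renamed to its final name or discarded. */
   void finish(long txid) {
     txidsBeingWritten.remove(txid);
   }
 }
 {code}
 Both writers would call tryStart before touching the ckpt file and skip or fail fast when it returns false, so a half-written checkpoint could never be renamed to a finalized fsimage_N.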



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7430) Rewrite the BlockScanner to use O(1) memory and use multiple threads

2015-01-21 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286896#comment-14286896
 ] 

Yi Liu commented on HDFS-7430:
--

{quote}
As per Nicholas' suggestion, let's let this soak for a week or two in trunk 
before we backport to branch-2. If we spot any bugs this will give us a chance 
to fix them.
{quote}
Thanks Colin, it's nice work, and I'd like to be one of the volunteers to do 
some tests.

 Rewrite the BlockScanner to use O(1) memory and use multiple threads
 

 Key: HDFS-7430
 URL: https://issues.apache.org/jira/browse/HDFS-7430
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Fix For: 3.0.0

 Attachments: HDFS-7430.002.patch, HDFS-7430.003.patch, 
 HDFS-7430.004.patch, HDFS-7430.005.patch, HDFS-7430.006.patch, 
 HDFS-7430.007.patch, HDFS-7430.008.patch, HDFS-7430.009.patch, 
 HDFS-7430.010.patch, HDFS-7430.011.patch, HDFS-7430.012.patch, memory.png


 We should update the BlockScanner to use a constant amount of memory by 
 keeping track of what block was scanned last, rather than by tracking the 
 scan status of all blocks in memory.  Also, instead of having just one 
 thread, we should have a verification thread per hard disk (or other volume), 
 scanning at a configurable rate of bytes per second.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

