[jira] [Commented] (HDFS-5439) Fix TestPendingReplication
[ https://issues.apache.org/jira/browse/HDFS-5439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814688#comment-13814688 ] Junping Du commented on HDFS-5439: -- Thanks Arpit for the explanation. The patch looks good to me. I verified that it fixes TestPendingReplication and some other test failures (e.g. TestBlockReport), so +1. However, I think we still need to fix StorageReceivedDeletedBlocks to consistently use either the storage UUID or the datanode UUID to initialize it (right now the two get mixed up in different places). Fix TestPendingReplication -- Key: HDFS-5439 URL: https://issues.apache.org/jira/browse/HDFS-5439 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode Affects Versions: Heterogeneous Storage (HDFS-2832) Reporter: Arpit Agarwal Assignee: Arpit Agarwal Fix For: Heterogeneous Storage (HDFS-2832) Attachments: HDFS-5439-demo1.patch, h5439.04.patch {{TestPendingReplication}} fails with the following exception:
{code}
java.lang.AssertionError: expected:<4> but was:<3>
  at org.junit.Assert.fail(Assert.java:93)
  at org.junit.Assert.failNotEquals(Assert.java:647)
  at org.junit.Assert.assertEquals(Assert.java:128)
  at org.junit.Assert.assertEquals(Assert.java:472)
  at org.junit.Assert.assertEquals(Assert.java:456)
  at org.apache.hadoop.hdfs.server.blockmanagement.TestPendingReplication.testBlockReceived(TestPendingReplication.java:186)
{code}
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5458) Datanode failed volume threshold ignored if exception is thrown in getDataDirsFromURIs
[ https://issues.apache.org/jira/browse/HDFS-5458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814788#comment-13814788 ] Hudson commented on HDFS-5458: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #384 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/384/]) HDFS-5458. Datanode failed volume threshold ignored if exception is thrown in getDataDirsFromURIs. Contributed by Mike Mellenthin. (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1539091) * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java Datanode failed volume threshold ignored if exception is thrown in getDataDirsFromURIs -- Key: HDFS-5458 URL: https://issues.apache.org/jira/browse/HDFS-5458 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 3.0.0 Reporter: Andrew Wang Assignee: Mike Mellenthin Fix For: 2.2.1 Attachments: HDFS-5458-1.patch Saw a stacktrace of datanode startup with a bad volume, where even listing directories would throw an IOException. The failed volume threshold was set to 1, but it would fatally error out in {{File#getCanonicalPath}} in {{getDataDirsFromURIs}}:
{code}
File dir = new File(dirURI.getPath());
try {
  dataNodeDiskChecker.checkDir(localFS, new Path(dir.toURI()));
  dirs.add(dir);
} catch (IOException ioe) {
  LOG.warn("Invalid " + DFS_DATANODE_DATA_DIR_KEY + " " + dir + " : ", ioe);
  invalidDirs.append("\"").append(dir.getCanonicalPath()).append("\" ");
}
{code}
Since {{getCanonicalPath}} may need to do I/O and can thus throw an IOException, this catch clause doesn't properly protect startup from a failed volume. -- This message was sent by Atlassian JIRA (v6.1#6144)
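To make the failure mode concrete, here is a sketch of a hardened version of that catch clause (an illustration only, not necessarily the committed patch): the path appended to {{invalidDirs}} must be computed without I/O, e.g. via {{File#getAbsolutePath}}, so a dead disk cannot throw from inside the catch block.
{code}
File dir = new File(dirURI.getPath());
try {
  dataNodeDiskChecker.checkDir(localFS, new Path(dir.toURI()));
  dirs.add(dir);
} catch (IOException ioe) {
  LOG.warn("Invalid " + DFS_DATANODE_DATA_DIR_KEY + " " + dir + " : ", ioe);
  // getAbsolutePath() is pure string manipulation and never performs I/O,
  // unlike getCanonicalPath(), so this append cannot throw; the bad volume
  // is simply counted against the failed-volume threshold.
  invalidDirs.append("\"").append(dir.getAbsolutePath()).append("\" ");
}
{code}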
[jira] [Commented] (HDFS-5458) Datanode failed volume threshold ignored if exception is thrown in getDataDirsFromURIs
[ https://issues.apache.org/jira/browse/HDFS-5458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814860#comment-13814860 ] Hudson commented on HDFS-5458: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1601 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1601/]) HDFS-5458. Datanode failed volume threshold ignored if exception is thrown in getDataDirsFromURIs. Contributed by Mike Mellenthin. (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1539091) * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java Datanode failed volume threshold ignored if exception is thrown in getDataDirsFromURIs -- Key: HDFS-5458 URL: https://issues.apache.org/jira/browse/HDFS-5458 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 3.0.0 Reporter: Andrew Wang Assignee: Mike Mellenthin Fix For: 2.2.1 Attachments: HDFS-5458-1.patch Saw a stacktrace of datanode startup with a bad volume, where even listing directories would throw an IOException. The failed volume threshold was set to 1, but it would fatally error out in {{File#getCanonicalPath}} in {{getDataDirsFromURIs}}:
{code}
File dir = new File(dirURI.getPath());
try {
  dataNodeDiskChecker.checkDir(localFS, new Path(dir.toURI()));
  dirs.add(dir);
} catch (IOException ioe) {
  LOG.warn("Invalid " + DFS_DATANODE_DATA_DIR_KEY + " " + dir + " : ", ioe);
  invalidDirs.append("\"").append(dir.getCanonicalPath()).append("\" ");
}
{code}
Since {{getCanonicalPath}} may need to do I/O and can thus throw an IOException, this catch clause doesn't properly protect startup from a failed volume. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5458) Datanode failed volume threshold ignored if exception is thrown in getDataDirsFromURIs
[ https://issues.apache.org/jira/browse/HDFS-5458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814879#comment-13814879 ] Hudson commented on HDFS-5458: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1575 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1575/]) HDFS-5458. Datanode failed volume threshold ignored if exception is thrown in getDataDirsFromURIs. Contributed by Mike Mellenthin. (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1539091) * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java Datanode failed volume threshold ignored if exception is thrown in getDataDirsFromURIs -- Key: HDFS-5458 URL: https://issues.apache.org/jira/browse/HDFS-5458 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 3.0.0 Reporter: Andrew Wang Assignee: Mike Mellenthin Fix For: 2.2.1 Attachments: HDFS-5458-1.patch Saw a stacktrace of datanode startup with a bad volume, where even listing directories would throw an IOException. The failed volume threshold was set to 1, but it would fatally error out in {{File#getCanonicalPath}} in {{getDataDirsFromURIs}}:
{code}
File dir = new File(dirURI.getPath());
try {
  dataNodeDiskChecker.checkDir(localFS, new Path(dir.toURI()));
  dirs.add(dir);
} catch (IOException ioe) {
  LOG.warn("Invalid " + DFS_DATANODE_DATA_DIR_KEY + " " + dir + " : ", ioe);
  invalidDirs.append("\"").append(dir.getCanonicalPath()).append("\" ");
}
{code}
Since {{getCanonicalPath}} may need to do I/O and can thus throw an IOException, this catch clause doesn't properly protect startup from a failed volume. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5411) Update Bookkeeper dependency to 4.2.1
[ https://issues.apache.org/jira/browse/HDFS-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814951#comment-13814951 ] Robert Rati commented on HDFS-5411: --- 4.2.2 causes compilation and test issues; porting to 4.2.2 will require additional work beyond this patch. Update Bookkeeper dependency to 4.2.1 - Key: HDFS-5411 URL: https://issues.apache.org/jira/browse/HDFS-5411 Project: Hadoop HDFS Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Robert Rati Priority: Minor Attachments: HDFS-5411.patch Update the bookkeeper dependency to 4.2.1. This eases compilation on Fedora platforms. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HDFS-5469) Add configuration property for the sub-directory export path
Brandon Li created HDFS-5469: Summary: Add configuration property for the sub-directory export path Key: HDFS-5469 URL: https://issues.apache.org/jira/browse/HDFS-5469 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Brandon Li Assignee: Brandon Li Currently only the HDFS root is exported. Adding this property is the first step toward supporting sub-directory mounting. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5469) Add configuration property for the sub-directory export path
[ https://issues.apache.org/jira/browse/HDFS-5469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815052#comment-13815052 ] Brandon Li commented on HDFS-5469: -- Mounting a sub-directory is especially useful for some Windows NFS clients, which can't mount the root export, possibly due to a client-side bug. Add configuration property for the sub-directory export path Key: HDFS-5469 URL: https://issues.apache.org/jira/browse/HDFS-5469 Project: Hadoop HDFS Issue Type: Sub-task Components: nfs Reporter: Brandon Li Assignee: Brandon Li Currently only the HDFS root is exported. Adding this property is the first step toward supporting sub-directory mounting. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5461) fallback to non-SSR (local short-circuit reads) when OOM is detected
[ https://issues.apache.org/jira/browse/HDFS-5461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815149#comment-13815149 ] Todd Lipcon commented on HDFS-5461: --- Adding some kind of limit makes sense. But I'm curious why we ended up with 7GB worth of buffers in the referenced HBase ticket. Is it because each open stream holds a buffer, and we have hundreds of open streams? Without direct buffers, wouldn't we just end up with a similar amount of memory usage in byte[] buffers, and OOME on the non-native heap? fallback to non-SSR (local short-circuit reads) when OOM is detected - Key: HDFS-5461 URL: https://issues.apache.org/jira/browse/HDFS-5461 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 3.0.0, 2.2.0 Reporter: Liang Xie Currently, the DirectBufferPool used by the SSR feature doesn't seem to have an upper-bound limit other than the direct-memory VM option, so there's a risk of hitting a direct-memory OOM; see HBASE-8143 for an example. IMHO, maybe we could improve it a bit: 1) detect OOM, or a configured upper limit being reached by the caller, and then fall back to non-SSR (a rough sketch follows this message); 2) add a new metric for the currently consumed raw direct memory size. -- This message was sent by Atlassian JIRA (v6.1#6144)
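As a sketch of the kind of limit being proposed (all names here are hypothetical, not the actual HDFS {{DirectBufferPool}} API): allocations are charged against a configured direct-memory budget, and anything over budget, or any direct-memory OOM, falls back to an ordinary on-heap buffer, mirroring the proposed fallback from SSR to non-SSR reads.
{code}
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: cap direct-memory usage and fall back to heap buffers.
class BoundedDirectAllocator {
  private final long maxDirectBytes;            // configured upper bound
  private final AtomicLong directBytes = new AtomicLong();

  BoundedDirectAllocator(long maxDirectBytes) {
    this.maxDirectBytes = maxDirectBytes;
  }

  ByteBuffer allocate(int size) {
    if (directBytes.addAndGet(size) <= maxDirectBytes) {
      try {
        return ByteBuffer.allocateDirect(size);
      } catch (OutOfMemoryError oom) {
        // direct memory exhausted despite our accounting; fall through
      }
    }
    directBytes.addAndGet(-size);      // not charged: we are going on-heap
    return ByteBuffer.allocate(size);  // heap fallback (analogous to non-SSR)
  }

  void release(ByteBuffer buf) {
    if (buf.isDirect()) {
      directBytes.addAndGet(-buf.capacity());
    }
  }
}
{code}
The {{directBytes}} counter would also be a natural source for the raw consumed direct memory metric suggested in 2).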
[jira] [Updated] (HDFS-5326) add modifyDirective to cacheAdmin
[ https://issues.apache.org/jira/browse/HDFS-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-5326: --- Attachment: HDFS-5326.004.patch add modifyDirective to cacheAdmin - Key: HDFS-5326 URL: https://issues.apache.org/jira/browse/HDFS-5326 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Affects Versions: 3.0.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-5326.003.patch, HDFS-5326.004.patch We should add a way of modifying cache directives on the command-line, similar to how modifyCachePool works. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5326) add modifyDirective to cacheAdmin
[ https://issues.apache.org/jira/browse/HDFS-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815326#comment-13815326 ] Colin Patrick McCabe commented on HDFS-5326: * {{TestClientNamenodeProtocolServerSideTranslatorPB}}: we don't need this test any more since validation is done server-side. Rationale: I would rather keep the validation in one place than have it spread out across client and server. It *must* be on the server, since we can't trust arbitrary clients, so let's just put it all there and unit test it well. * {{TestOfflineEditsViewer}}: I think this test failure happened because Jenkins didn't apply the git binary diff to the {{editsStored}} file. I don't think the version of GNU patch used by Jenkins supports git binary diffs. We've seen this in the past when updating this test. * Added a modifyPBCD test to {{TestPathBasedCacheRequests}}. * Fixed a bug where we were assuming that all modify requests came with a path. add modifyDirective to cacheAdmin - Key: HDFS-5326 URL: https://issues.apache.org/jira/browse/HDFS-5326 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Affects Versions: 3.0.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-5326.003.patch, HDFS-5326.004.patch We should add a way of modifying cache directives on the command-line, similar to how modifyCachePool works. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5326) add modifyDirective to cacheAdmin
[ https://issues.apache.org/jira/browse/HDFS-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815325#comment-13815325 ] Andrew Wang commented on HDFS-5326: --- Hey Colin, thanks for another mondo patch. I like the overall idea here, and I like the new PB code and user-facing API. All of the comments here are minor; I'll do another pass on an updated patch but expect to quickly be +1. Nitty: * IdNotFoundException: Can we say what this is used for, rather than just EINVAL? * Update hdfs-default.xml too with the changed config name * ClientProtocol still refers to "cache descriptor" in removePathBasedCacheDescriptor * Reusing PBCD for the {{filterInfo}} in {{DFS#listPBCD}} is a little too loosey goosey for me. Normally a PBCD always has a path and pool set, with just the repl and id being optional. This is a third form of usage, and we'll probably never want to filter on more than path and pool. How about keeping the old method signature? We can still use PBCD after DFS for simplicity if you like. It should probably be named just {{filter}}, too, since the type isn't named {{PBCDInfo}}. * Maybe we should rename CachePoolInfo to CachePool so the public APIs and classes line up, e.g. you {{addPathBasedCacheDirective}} a {{PBCD}}, and {{addCachePool}} a {{CachePool}}. Or we could have a PBCDi class at the risk of shaming on Hacker News ;) If we went with PBCDi, listPBCD would still take a {{filterInfo}}. * DFS#listPBCD, DFS#addPathBasedCacheDirective javadoc needs to be updated with new params/return values * DFS#removePBCD: can just say "id of" instead of "id id of" * PBCD#getId: javadoc says it gets the path, not the id * PBCD.Builder#setId: javadoc param descriptions are off * Organizationally, do you mind moving the new modify stuff in the FSEditLog, Loader, Op, etc, so it goes add/modify/remove for directives, add/modify/remove for pools? Compat isn't a concern yet. * CacheManager has some lines beyond 80 chars due to the new indent Other: * FSN#modifyPBCD: need to move the FSPermissionChecker get and the checkOperation above the retry cache check. We can't throw any exceptions after the retry cache check that don't also set the retry cache state. It's also, I think, normally checkOperation first and then the pc get, for consistency. (A sketch of this ordering follows this message.) * We should add the new modify directive to DFSTestUtil so it gets tested too * Seems like a lot of these checks in CacheManager are now very similar. Since there's now a try/catch wrapping everything, we no longer need to have the method name in the exception text, and it should also be in the stack trace. So, can we consolidate some of these into shared validation methods that throw generic exceptions? add modifyDirective to cacheAdmin - Key: HDFS-5326 URL: https://issues.apache.org/jira/browse/HDFS-5326 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Affects Versions: 3.0.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-5326.003.patch, HDFS-5326.004.patch We should add a way of modifying cache directives on the command-line, similar to how modifyCachePool works. -- This message was sent by Atlassian JIRA (v6.1#6144)
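For the retry cache point above, the requested ordering looks roughly like the following skeleton (illustrative only; the helper names follow the usual {{FSNamesystem}} pattern, and the real method body is elided). Anything that can throw must either run before {{RetryCache.waitForCompletion}} or record its outcome through {{RetryCache.setState}}:
{code}
void modifyPathBasedCacheDirective(PathBasedCacheDirective directive)
    throws IOException {
  checkOperation(OperationCategory.WRITE);               // may throw: fine, runs first
  final FSPermissionChecker pc = getPermissionChecker(); // may throw: also fine
  CacheEntry cacheEntry = RetryCache.waitForCompletion(retryCache);
  if (cacheEntry != null && cacheEntry.isSuccess()) {
    return;  // a retry of a request that already succeeded
  }
  boolean success = false;
  try {
    // ... perform the modification under the namesystem write lock ...
    success = true;
  } finally {
    // every exit path after the retry-cache check records its outcome
    RetryCache.setState(cacheEntry, success);
  }
}
{code}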
[jira] [Commented] (HDFS-5326) add modifyDirective to cacheAdmin
[ https://issues.apache.org/jira/browse/HDFS-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815327#comment-13815327 ] Colin Patrick McCabe commented on HDFS-5326: Oops, looks like we commented at the same time. Patch 4 doesn't address your comments, just the test failures. add modifyDirective to cacheAdmin - Key: HDFS-5326 URL: https://issues.apache.org/jira/browse/HDFS-5326 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Affects Versions: 3.0.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-5326.003.patch, HDFS-5326.004.patch We should add a way of modifying cache directives on the command-line, similar to how modifyCachePool works. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5326) add modifyDirective to cacheAdmin
[ https://issues.apache.org/jira/browse/HDFS-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815330#comment-13815330 ] Andrew Wang commented on HDFS-5326: --- yea np, I figured the test fixups weren't going to be a big deal. I'll let you commit this one too when it's ready so you can ensure that editsStored is updated correctly. add modifyDirective to cacheAdmin - Key: HDFS-5326 URL: https://issues.apache.org/jira/browse/HDFS-5326 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Affects Versions: 3.0.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-5326.003.patch, HDFS-5326.004.patch We should add a way of modifying cache directives on the command-line, similar to how modifyCachePool works. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5326) add modifyDirective to cacheAdmin
[ https://issues.apache.org/jira/browse/HDFS-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815366#comment-13815366 ] Colin Patrick McCabe commented on HDFS-5326: bq. IdNotFoundException: Can we say what this is used for, rather than just EINVAL? ok. I added a comment similar to the one in {{PathNotFoundException}}. bq. Update hdfs-default.xml too with the changed config name good call bq. ClientProtocol still refers to "cache descriptor" in removePathBasedCacheDescriptor fixed bq. Reusing PBCD for the filterInfo in DFS#listPBCD is a little too loosey goosey for me. Normally a PBCD always has a path and pool set, with just the repl and id being optional. This is a third form of usage, and we'll probably never want to filter on more than path and pool. How about keeping the old method signature? We can still use PBCD after DFS for simplicity if you like. It should probably be named just filter, too, since the type isn't named PBCDInfo. I don't know. I'm kind of worried about the number of {{listPathBasedCacheDirectives}} overloads multiplying, the way the number of {{FileSystem#create}} overloads multiplied. It seems cleaner to have one function that can handle any of these combinations (a usage sketch follows this message). "filter" does seem like a better name than "filterInfo", though... bq. Maybe we should rename CachePoolInfo to CachePool so the public APIs and classes line up, e.g. you addPathBasedCacheDirective a PBCD, and addCachePool a CachePool. Or we could have a PBCDi class at the risk of shaming on Hacker News. If we went with PBCDi, listPBCD would still take a filterInfo. I think your instinct is right here. PBCDi is just too long, whatever other merits it has. But let's talk about possible renaming on another JIRA if we can think of something better, since this patch is already kinda big... bq. DFS#listPBCD, DFS#addPathBasedCacheDirective javadoc needs to be updated with new params/return values Done. I also added "list all directives visible to us" to the Javadoc. Directives in pools that we don't have read permission on will never be listed. bq. DFS#removePBCD: can just say "id of" instead of "id id of" ok bq. PBCD#getId / setId javadoc off fixed bq. CacheManager has some lines beyond 80 chars due to the new indent fixed bq. FSN#modifyPBCD: need to move the FSPermissionChecker get and the checkOperation above the retry cache check. We can't throw any exceptions after the retry cache check that don't also set the retry cache state. It's also, I think, normally checkOperation first and then the pc get, for consistency. good catch bq. We should add the new modify directive to DFSTestUtil so it gets tested too ok bq. Seems like a lot of these checks in CacheManager are now very similar. Since there's now a try/catch wrapping everything, we no longer need to have the method name in the exception text, and it should also be in the stack trace. So, can we consolidate some of these into shared validation methods that throw generic exceptions? There sort of aren't as many commonalities as it seems. The add operation checks that everything is set -- nothing can be null. In contrast, modify allows everything to be null, except the ID. I feel like trying to factor out methods might make it confusing. The big things, like {{DFSUtil#isValidName}}, are already common code, so I don't feel too bad about it.
add modifyDirective to cacheAdmin - Key: HDFS-5326 URL: https://issues.apache.org/jira/browse/HDFS-5326 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Affects Versions: 3.0.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-5326.003.patch, HDFS-5326.004.patch We should add a way of modifying cache directives on the command-line, similar to how modifyCachePool works. -- This message was sent by Atlassian JIRA (v6.1#6144)
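For context on the filter discussion above, the single-method style being defended would be used roughly like this (a hypothetical sketch; the builder and method names approximate the API under review, and unset filter fields match everything):
{code}
PathBasedCacheDirective filter = new PathBasedCacheDirective.Builder()
    .setPool("research")          // only directives in this pool
    .setPath(new Path("/warm"))   // ... and with this path
    .build();
RemoteIterator<PathBasedCacheDirective> it =
    dfs.listPathBasedCacheDirectives(filter);
while (it.hasNext()) {
  System.out.println(it.next());  // one method covers every filter combination
}
{code}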
[jira] [Commented] (HDFS-2832) Enable support for heterogeneous storages in HDFS
[ https://issues.apache.org/jira/browse/HDFS-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815376#comment-13815376 ] Konstantin Shvachko commented on HDFS-2832: --- Arpit, I think we just agreed that collisions among UUIDs are possible but have low probability. This is a concern for me. Even though unlikely, a collision, if it happens, creates a serious problem for system integrity. Does it concern you? In my previous comment I tried to explain that in the distributed case the randomness is the main problem. Forget for a moment about PRNGs. Assume that the UUID is an incremental counter (such as the generation stamp, and now the block id), which is incremented by each node independently, but at startup each node chooses a random number to start from. On a single node ++ can go on without collisions for a long enough time to guarantee I will never see one; a Y4K bug is fine with me. But if you take a second node and randomly choose a starting number, it could be close to (say, 1000 apart from) the starting point of the first node. Then the second node can only generate 1000 storageIDs before colliding with those generated by the other node. The same holds with a PRNG: you just replace ++ with next(). A long period doesn't matter if you choose your starting points randomly. Enable support for heterogeneous storages in HDFS - Key: HDFS-2832 URL: https://issues.apache.org/jira/browse/HDFS-2832 Project: Hadoop HDFS Issue Type: New Feature Affects Versions: 0.24.0 Reporter: Suresh Srinivas Assignee: Suresh Srinivas Attachments: 20130813-HeterogeneousStorage.pdf, h2832_20131023.patch, h2832_20131023b.patch, h2832_20131025.patch, h2832_20131028.patch, h2832_20131028b.patch, h2832_20131029.patch, h2832_20131103.patch, h2832_20131104.patch, h2832_20131105.patch HDFS currently supports a configuration where storages are a list of directories. Typically each of these directories corresponds to a volume with its own file system. All these directories are homogeneous and therefore identified as a single storage at the namenode. I propose a change from the current model, where a Datanode *is a* storage, to one where a Datanode *is a collection of* storages. -- This message was sent by Atlassian JIRA (v6.1#6144)
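The argument can be made concrete with a toy model (illustrative Java, not HDFS code): each node picks a random starting point in the ID space and increments from there, so the number of collision-free IDs is bounded by the random gap between starting points, not by the generator's period.
{code}
import java.util.Random;

public class StartingPointGap {
  // Shrunken ID space so the numbers are easy to read; the argument is
  // identical for 64-bit or 128-bit spaces, only the expected gap grows.
  private static final long SPACE = 1L << 32;

  public static void main(String[] args) {
    Random rand = new Random();
    long startA = Math.floorMod(rand.nextLong(), SPACE);
    long startB = Math.floorMod(rand.nextLong(), SPACE);
    // Node A collides as soon as its counter reaches node B's starting
    // point, i.e. after the (mod SPACE) distance between the two starts.
    long safeIds = Math.floorMod(startB - startA, SPACE);
    System.out.println("node A can mint " + safeIds
        + " IDs before reusing one of node B's");
  }
}
{code}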
[jira] [Created] (HDFS-5470) Add back trunk's reportDiff algorithm to HDFS-2832
Tsz Wo (Nicholas), SZE created HDFS-5470: Summary: Add back trunk's reportDiff algorithm to HDFS-2832 Key: HDFS-5470 URL: https://issues.apache.org/jira/browse/HDFS-5470 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Tsz Wo (Nicholas), SZE Assignee: Tsz Wo (Nicholas), SZE As reminded by [~shv] in HDFS-5464, the report diff algorithm uses less memory. It also has a faster running time. We should add it back to the branch. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5464) Simplify block report diff calculation
[ https://issues.apache.org/jira/browse/HDFS-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815385#comment-13815385 ] Konstantin Shvachko commented on HDFS-5464: --- bq. you won't argue that the new code is simpler than the existing Agreed on simpler. :-) bq. see if I could come up with a better solution Sure, would be interesting to see. I doubt much can be done in this respect. We need to find blocks that did not appear in the report, in one pass and with constant memory overhead. Could be an interview question. Simplify block report diff calculation -- Key: HDFS-5464 URL: https://issues.apache.org/jira/browse/HDFS-5464 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Tsz Wo (Nicholas), SZE Assignee: Tsz Wo (Nicholas), SZE Priority: Minor Attachments: h5464_20131105.patch, h5464_20131105b.patch, h5464_20131105c.patch The current calculation in BlockManager.reportDiff(..) is unnecessarily complicated. We could simplify the calculation. -- This message was sent by Atlassian JIRA (v6.1#6144)
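For reference, the one-pass, constant-memory approach (essentially what trunk's reportDiff does) leans on the intrusive per-storage block list: insert a delimiter block, relink every reported block in front of it, and whatever remains behind it was not reported. A rough sketch follows; the method names only approximate the real {{BlockManager}}/{{DatanodeStorageInfo}} helpers, and {{blocksAfter}} is invented for illustration.
{code}
// 1) Park a delimiter block at the head of this storage's block list.
BlockInfo delimiter = new BlockInfo(new Block(), 1);
storage.addBlock(delimiter);

// 2) One pass over the report: relink each known block ahead of the
//    delimiter ("seen" region); unknown blocks are additions.
for (Block reported : report) {
  BlockInfo stored = blocksMap.getStoredBlock(reported);
  if (stored != null) {
    storage.moveBlockToHead(stored);
  } else {
    toAdd.add(reported);
  }
}

// 3) Everything still behind the delimiter was absent from the report.
for (BlockInfo notReported : storage.blocksAfter(delimiter)) {
  toRemove.add(notReported);
}
storage.removeBlock(delimiter);  // O(1) extra memory throughout
{code}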
[jira] [Updated] (HDFS-5470) Add back trunk's reportDiff algorithm to HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo (Nicholas), SZE updated HDFS-5470: - Attachment: h5470_20131106.patch h5470_20131106.patch: add back the trunk code with storage Add back trunk's reportDiff algorithm to HDFS-2832 -- Key: HDFS-5470 URL: https://issues.apache.org/jira/browse/HDFS-5470 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Tsz Wo (Nicholas), SZE Assignee: Tsz Wo (Nicholas), SZE Attachments: h5470_20131106.patch As reminded by [~shv] in HDFS-5464, the report diff algorithm uses less memory. It also has a faster running time. We should add it back to the branch. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5470) Add back trunk's reportDiff algorithm to HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815426#comment-13815426 ] Arpit Agarwal commented on HDFS-5470: - Nicholas, any benefit to making {{DatanodeStorageInfo#BlockIterator}} an inner class? Can it be a static nested class like {{DatanodeDescriptor#BlockIterator}}? Add back trunk's reportDiff algorithm to HDFS-2832 -- Key: HDFS-5470 URL: https://issues.apache.org/jira/browse/HDFS-5470 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Tsz Wo (Nicholas), SZE Assignee: Tsz Wo (Nicholas), SZE Attachments: h5470_20131106.patch As reminded by [~shv] in HDFS-5464, the report diff algorithm uses less memory. It also has a faster running time. We should add it back to the branch. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5470) Add back trunk's reportDiff algorithm to HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815434#comment-13815434 ] Tsz Wo (Nicholas), SZE commented on HDFS-5470: -- In trunk, it is a static class with a DatanodeDescriptor field. It is better to make it non-static and use DatanodeDescriptor.this. Add back trunk's reportDiff algorithm to HDFS-2832 -- Key: HDFS-5470 URL: https://issues.apache.org/jira/browse/HDFS-5470 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Tsz Wo (Nicholas), SZE Assignee: Tsz Wo (Nicholas), SZE Attachments: h5470_20131106.patch As reminded by [~shv] in HDFS-5464, the report diff algorithm uses less memory. It also has a faster running time. We should add it back to the branch. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5470) Add back trunk's reportDiff algorithm to HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815436#comment-13815436 ] Arpit Agarwal commented on HDFS-5470: - The patch looks good to me, but I'm just curious: what is the advantage? The other way, it would have a DatanodeDescriptor field initialized at construction. Thanks. Add back trunk's reportDiff algorithm to HDFS-2832 -- Key: HDFS-5470 URL: https://issues.apache.org/jira/browse/HDFS-5470 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Tsz Wo (Nicholas), SZE Assignee: Tsz Wo (Nicholas), SZE Attachments: h5470_20131106.patch As reminded by [~shv] in HDFS-5464, the report diff algorithm uses less memory. It also has a faster running time. We should add it back to the branch. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5326) add modifyDirective to cacheAdmin
[ https://issues.apache.org/jira/browse/HDFS-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-5326: --- Attachment: HDFS-5326.006.patch add modifyDirective to cacheAdmin - Key: HDFS-5326 URL: https://issues.apache.org/jira/browse/HDFS-5326 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Affects Versions: 3.0.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-5326.003.patch, HDFS-5326.004.patch, HDFS-5326.006.patch We should add a way of modifying cache directives on the command-line, similar to how modifyCachePool works. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5470) Add back trunk's reportDiff algorithm to HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815452#comment-13815452 ] Tsz Wo (Nicholas), SZE commented on HDFS-5470: -- The advantage of a (non-static) inner class is that it can access the enclosing class object via the DatanodeDescriptor.this reference. Just as with methods, we could make all methods static and pass the enclosing object as a parameter. I bet you won't think that is a good design. BTW, Java's ArrayList.Itr is also non-static: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7-b147/java/util/ArrayList.java#ArrayList.Itr Add back trunk's reportDiff algorithm to HDFS-2832 -- Key: HDFS-5470 URL: https://issues.apache.org/jira/browse/HDFS-5470 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Tsz Wo (Nicholas), SZE Assignee: Tsz Wo (Nicholas), SZE Attachments: h5470_20131106.patch As reminded by [~shv] in HDFS-5464, the report diff algorithm uses less memory. It also has a faster running time. We should add it back to the branch. -- This message was sent by Atlassian JIRA (v6.1#6144)
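The two shapes under discussion, in miniature (the class names echo the discussion, not the real HDFS sources):
{code}
// Static nested class: the enclosing object must be threaded through by hand.
class DatanodeDescriptorA {
  static class BlockIterator {
    private final DatanodeDescriptorA node;  // explicit field
    BlockIterator(DatanodeDescriptorA node) { this.node = node; }
  }
  BlockIterator iterator() { return new BlockIterator(this); }
}

// Non-static inner class: the enclosing instance is captured implicitly and
// is reachable as DatanodeDescriptorB.this; same footprint, less ceremony.
class DatanodeDescriptorB {
  class BlockIterator {
    DatanodeDescriptorB owner() { return DatanodeDescriptorB.this; }
  }
  BlockIterator iterator() { return new BlockIterator(); }
}
{code}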
[jira] [Commented] (HDFS-5470) Add back trunk's reportDiff algorithm to HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815456#comment-13815456 ] Arpit Agarwal commented on HDFS-5470: - I was thinking of the extra inner-object allocation when it may not be needed by the caller, but it makes sense for code simplicity. +1 for the patch. Add back trunk's reportDiff algorithm to HDFS-2832 -- Key: HDFS-5470 URL: https://issues.apache.org/jira/browse/HDFS-5470 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Tsz Wo (Nicholas), SZE Assignee: Tsz Wo (Nicholas), SZE Attachments: h5470_20131106.patch As reminded by [~shv] in HDFS-5464, the report diff algorithm uses less memory. It also has a faster running time. We should add it back to the branch. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5428) under construction files deletion after snapshot+checkpoint+NN restart leads NN to safemode
[ https://issues.apache.org/jira/browse/HDFS-5428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-5428: Attachment: HDFS-5428.000.patch Continuing the discussion from HDFS-5443 here. HDFS-5428.000.patch is a simple patch that implements an idea similar to the one mentioned in HDFS-5443: 1) Record extra information in the fsimage to indicate INodeFileUC files that are only in snapshots. To keep compatibility we keep the information in the under-construction-files section of the fsimage, and just use ".snapshot" as their paths. 2) Identify these snapshot files while loading the fsimage, and temporarily store them in a map in SnapshotManager. 3) When calculating the total block number while starting the NN, besides deducting for the files recorded in the lease map, also deduct for the files recorded in 2) (see the sketch after this message). In general the idea is very similar to Vinay's patch. The difference is that we do not keep and maintain records in the lease map, and only handle these files when starting the NN. We can even clear the records in SnapshotManager after computing the total number of blocks. One more thing we may need to handle: if we remove the 0-sized blocks (HDFS-5443), it is possible to have an under-construction file in a snapshot with no corresponding blockUC for the file. In that case we should not record extra information in the fsimage for this kind of INodeFileUC. The current patch is just for demonstration. It passes the new unit tests in Vinay's patch. If folks think the general idea is OK, we can continue our work based on this patch. under construction files deletion after snapshot+checkpoint+NN restart leads NN to safemode Key: HDFS-5428 URL: https://issues.apache.org/jira/browse/HDFS-5428 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0, 2.2.0 Reporter: Vinay Assignee: Vinay Attachments: HDFS-5428-v2.patch, HDFS-5428.000.patch, HDFS-5428.patch 1. allow snapshots under dir /foo 2. create a file /foo/test/bar and start writing to it 3. create a snapshot s1 under /foo after the block is allocated and some data has been written to it 4. delete the directory /foo/test 5. wait till checkpoint or do saveNameSpace 6. restart the NN; the NN enters safemode. Analysis: Snapshot inodes loaded from the fsimage are always complete and all blocks will be in COMPLETE state. So when the Datanode reports RBW blocks, those will not be updated in the blocksmap. Some of the FINALIZED blocks will be marked as corrupt due to length mismatch. -- This message was sent by Atlassian JIRA (v6.1#6144)
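The block-count adjustment in step 3) would look roughly like the following (hypothetical accessor names; only the shape of the computation is the point). Blocks of under-construction files, whether tracked in the lease map or recorded as snapshot-only in step 2), are excluded from the safemode total because their last block may never be reported as COMPLETE.
{code}
// Inside NN startup, when setting the safemode block threshold:
long completeBlocks = blocksMap.size()
    - leaseManager.getNumUnderConstructionBlocks()   // files in the lease map
    - snapshotManager.getNumSnapshotOnlyUCBlocks();  // files from step 2)
setBlockTotal(completeBlocks);
// After this point the SnapshotManager's temporary map can be cleared.
{code}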
[jira] [Commented] (HDFS-5470) Add back trunk's reportDiff algorithm to HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815466#comment-13815466 ] Tsz Wo (Nicholas), SZE commented on HDFS-5470: -- bq. I was thinking of the extra inner object allocation ... In our case, the object won't be null, so there is no extra object allocation. Add back trunk's reportDiff algorithm to HDFS-2832 -- Key: HDFS-5470 URL: https://issues.apache.org/jira/browse/HDFS-5470 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Tsz Wo (Nicholas), SZE Assignee: Tsz Wo (Nicholas), SZE Attachments: h5470_20131106.patch As reminded by [~shv] in HDFS-5464, the report diff algorithm uses less memory. It also has a faster running time. We should add it back to the branch. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5326) add modifyDirective to cacheAdmin
[ https://issues.apache.org/jira/browse/HDFS-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815474#comment-13815474 ] Hadoop QA commented on HDFS-5326: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612445/HDFS-5326.004.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/5347//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/5347//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5347//console This message is automatically generated. add modifyDirective to cacheAdmin - Key: HDFS-5326 URL: https://issues.apache.org/jira/browse/HDFS-5326 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Affects Versions: 3.0.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-5326.003.patch, HDFS-5326.004.patch, HDFS-5326.006.patch We should add a way of modifying cache directives on the command-line, similar to how modifyCachePool works. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Resolved] (HDFS-5470) Add back trunk's reportDiff algorithm to HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo (Nicholas), SZE resolved HDFS-5470. -- Resolution: Fixed Fix Version/s: Heterogeneous Storage (HDFS-2832) Hadoop Flags: Reviewed Thanks Arpit for reviewing the patch. I have committed this. Add back trunk's reportDiff algorithm to HDFS-2832 -- Key: HDFS-5470 URL: https://issues.apache.org/jira/browse/HDFS-5470 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Tsz Wo (Nicholas), SZE Assignee: Tsz Wo (Nicholas), SZE Fix For: Heterogeneous Storage (HDFS-2832) Attachments: h5470_20131106.patch As reminded by [~shv] in HDFS-5464, the report diff algorithm uses less memory. It also has a faster running time. We should add it back to the branch. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HDFS-5471) CacheAdmin -listPools fails when pools exist that the user does not have permissions to
Stephen Chu created HDFS-5471: - Summary: CacheAdmin -listPools fails when pools exist that the user does not have permissions to Key: HDFS-5471 URL: https://issues.apache.org/jira/browse/HDFS-5471 Project: Hadoop HDFS Issue Type: Bug Components: tools Affects Versions: 3.0.0 Reporter: Stephen Chu When a user does not have read permissions to a cache pool and executes {{hdfs cacheadmin -listPools}}, the command will error, complaining about missing required fields, with something like:
{code}
[schu@hdfs-nfs ~]$ hdfs cacheadmin -listPools
Exception in thread "main" org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): Message missing required fields: ownerName, groupName, mode, weight
  at com.google.protobuf.AbstractMessage$Builder.newUninitializedMessageException(AbstractMessage.java:770)
  at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ListCachePoolsResponseElementProto$Builder.build(ClientNamenodeProtocolProtos.java:51722)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.listCachePools(ClientNamenodeProtocolServerSideTranslatorPB.java:1200)
  at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:932)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2057)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1515)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2051)
  at org.apache.hadoop.hdfs.tools.CacheAdmin$ListCachePoolsCommand.run(CacheAdmin.java:675)
  at org.apache.hadoop.hdfs.tools.CacheAdmin.run(CacheAdmin.java:85)
  at org.apache.hadoop.hdfs.tools.CacheAdmin.main(CacheAdmin.java:90)
[schu@hdfs-nfs ~]$
{code}
In this example, the pool "root" has 750 permissions, and the root superuser is able to successfully -listPools:
{code}
[root@hdfs-nfs ~]# hdfs cacheadmin -listPools
Found 4 results.
NAME  OWNER  GROUP  MODE       WEIGHT
bar   root   root   rwxr-xr-x  100
foo   root   root   rwxr-xr-x  100
root  root   root   rwxr-x---  100
schu  root   root   rwxr-xr-x  100
[root@hdfs-nfs ~]#
{code}
When we modify the "root" pool to mode 755, the schu user can now -listPools successfully without error:
{code}
[schu@hdfs-nfs ~]$ hdfs cacheadmin -listPools
Found 4 results.
NAME  OWNER  GROUP  MODE       WEIGHT
bar   root   root   rwxr-xr-x  100
foo   root   root   rwxr-xr-x  100
root  root   root   rwxr-xr-x  100
schu  root   root   rwxr-xr-x  100
[schu@hdfs-nfs ~]$
{code}
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5252) Stable write is not handled correctly in some places
[ https://issues.apache.org/jira/browse/HDFS-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Li updated HDFS-5252: - Attachment: HDFS-5252.002.patch Uploaded a new patch to address Jing's comments, and also added a unit test. Stable write is not handled correctly in some places -- Key: HDFS-5252 URL: https://issues.apache.org/jira/browse/HDFS-5252 Project: Hadoop HDFS Issue Type: Sub-task Components: nfs Reporter: Brandon Li Assignee: Brandon Li Attachments: HDFS-5252.001.patch, HDFS-5252.002.patch When the client asks for a stable write but the prerequisite writes have not been transferred to the NFS gateway, the stableness can't be honored. The NFS gateway has to treat the write as an unstable write and set the flag to UNSTABLE in the write response. One bug was found during testing with an Ubuntu client when copying a 1KB file. For small files like a 1KB file, the Ubuntu client does one stable write (with the FILE_SYNC flag). However, the NFS gateway missed one place (OpenFileCtx#doSingleWrite) where it sends the response with the flag NOT updated to UNSTABLE. With this bug, the client thinks the write is on disk and thus doesn't send a COMMIT anymore. The following test tries to read the data back and of course fails to do so, since the data was not synced. -- This message was sent by Atlassian JIRA (v6.1#6144)
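The rule being implemented can be sketched as follows (simplified, with hypothetical field names; this is not the actual {{OpenFileCtx}} code): whenever a stable write cannot be honored because earlier bytes have not arrived, the response must advertise UNSTABLE so the client knows a COMMIT is still required.
{code}
WriteStableHow how = request.getStableHow();
if (how != WriteStableHow.UNSTABLE && request.getOffset() > nextOffsetToWrite) {
  // Prerequisite writes are missing, so FILE_SYNC/DATA_SYNC cannot be
  // honored; downgrade and let the client COMMIT later.
  how = WriteStableHow.UNSTABLE;
}
WRITE3Response response = new WRITE3Response(Nfs3Status.NFS3_OK,
    fileWcc, count, how, Nfs3Constant.WRITE_COMMIT_VERF);
{code}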
[jira] [Commented] (HDFS-5252) Stable write is not handled correctly in some places
[ https://issues.apache.org/jira/browse/HDFS-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815513#comment-13815513 ] Hadoop QA commented on HDFS-5252: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612487/HDFS-5252.002.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-nfs hadoop-hdfs-project/hadoop-hdfs-nfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/5350//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5350//console This message is automatically generated. Stable write is not handled correctly in some places -- Key: HDFS-5252 URL: https://issues.apache.org/jira/browse/HDFS-5252 Project: Hadoop HDFS Issue Type: Sub-task Components: nfs Reporter: Brandon Li Assignee: Brandon Li Attachments: HDFS-5252.001.patch, HDFS-5252.002.patch When the client asks for a stable write but the prerequisite writes have not been transferred to the NFS gateway, the stableness can't be honored. The NFS gateway has to treat the write as an unstable write and set the flag to UNSTABLE in the write response. One bug was found during testing with an Ubuntu client when copying a 1KB file. For small files like a 1KB file, the Ubuntu client does one stable write (with the FILE_SYNC flag). However, the NFS gateway missed one place (OpenFileCtx#doSingleWrite) where it sends the response with the flag NOT updated to UNSTABLE. With this bug, the client thinks the write is on disk and thus doesn't send a COMMIT anymore. The following test tries to read the data back and of course fails to do so, since the data was not synced. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5472) Fix TestDatanodeManager, TestSafeMode and TestNNThroughputBenchmark
[ https://issues.apache.org/jira/browse/HDFS-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo (Nicholas), SZE updated HDFS-5472: - Attachment: h5472_20131106.patch h5472_20131106.patch: simple fixes Fix TestDatanodeManager, TestSafeMode and TestNNThroughputBenchmark --- Key: HDFS-5472 URL: https://issues.apache.org/jira/browse/HDFS-5472 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode Reporter: Tsz Wo (Nicholas), SZE Assignee: Tsz Wo (Nicholas), SZE Attachments: h5472_20131106.patch - DatanodeDescriptor should be initialized with updateHeartbeat for updating the timestamps. - NNThroughputBenchmark should create DatanodeRegistrations with real datanode UUIDs. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HDFS-5472) Fix TestDatanodeManager, TestSafeMode and TestNNThroughputBenchmark
Tsz Wo (Nicholas), SZE created HDFS-5472: Summary: Fix TestDatanodeManager, TestSafeMode and TestNNThroughputBenchmark Key: HDFS-5472 URL: https://issues.apache.org/jira/browse/HDFS-5472 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode Reporter: Tsz Wo (Nicholas), SZE Assignee: Tsz Wo (Nicholas), SZE Attachments: h5472_20131106.patch - DatanodeDescriptor should be initialized with updateHeartbeat for updating the timestamps. - NNThroughputBenchmark should create DatanodeRegistrations with real datanode UUIDs. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5326) add modifyDirective to cacheAdmin
[ https://issues.apache.org/jira/browse/HDFS-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815533#comment-13815533 ] Andrew Wang commented on HDFS-5326: --- Thanks for the bump, looks almost 100%. Just a few comments: bq. But let's talk about possible renaming on another JIRA if we can think of something better, since this patch is already kinda big... Sure, I'll file it. bq. Organizationally, do you mind moving the new modify stuff in the FSEditLog, Loader, Op, etc, so it goes add/modify/remove for directives, add/modify/remove for pools? Compat isn't a concern yet. This wasn't addressed; would you mind shuffling this around? I guess redoing the opcodes is optional (though appreciated), but I'd like to see all the methods/cases organized. bq. There sort of aren't as many commonalities as it seems. I took a hack at this and it ended up being less code and IMO cleaner. I can do this in a follow-on if you like, but: * Add and modify aren't that different besides the difference in required, optional, and default fields. I just first validate all present fields in the directive, then enforce required fields, then fill in default values. * Modify and remove have the same checks for an existing entry * Add and modify have the same checks for an existing cache pool * All three do write checks on a cache pool; moving this into FSPermissionChecker or a shared method was an easy savings I think we should still remove the method name from the exception text everywhere (and capitalize like a sentence). Also a few other things here: * need to add a space:
{code}
throw new IOException("addDirective: replication'" + replication +
throw new IOException("modifyDirective: replication'" + replication +
{code}
* success/fail logs are inconsistently formatted. I'd like something like e.g. {{"methodName: successfully <verb> directive " + directive}} and {{"methodName: failed to <verb> <noun> " + parameters, e}}:
{code}
LOG.warn("addDirective " + directive + ": failed", e);
LOG.info("addDirective " + directive + ": succeeded.");
...
LOG.warn("modifyDirective " + idString + ": error", e);
LOG.info("modifyDirective " + idString + ": applied " + directive);
...
LOG.warn("removeDirective " + id + " failed", e);
LOG.info("removeDirective " + id + ": removed");
{code}
* I feel like we could dedupe the various PC exception texts by throwing the AccessControlException in pc#checkPermission itself. I think it's a straightforward change. * Unrelated, but I noticed that CacheManager#listPBCDs does a pc check without first checking if pc is null; want to fix that here? * I also noticed we have some unused imports in FSEditLog and CacheManager. add modifyDirective to cacheAdmin - Key: HDFS-5326 URL: https://issues.apache.org/jira/browse/HDFS-5326 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Affects Versions: 3.0.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-5326.003.patch, HDFS-5326.004.patch, HDFS-5326.006.patch We should add a way of modifying cache directives on the command-line, similar to how modifyCachePool works. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HDFS-5473) Consistent naming of user-visible caching classes and methods
Andrew Wang created HDFS-5473: - Summary: Consistent naming of user-visible caching classes and methods Key: HDFS-5473 URL: https://issues.apache.org/jira/browse/HDFS-5473 Project: Hadoop HDFS Issue Type: Sub-task Affects Versions: 3.0.0 Reporter: Andrew Wang Assignee: Andrew Wang It's kind of warty that (after HDFS-5326 goes in) DistributedFileSystem has {{*CachePool}} methods that take a {{CachePoolInfo}} and {{*PathBasedCacheDirective}} methods that take a {{PathBasedCacheDirective}}. We should consider renaming {{CachePoolInfo}} to {{CachePool}} for consistency. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5428) under construction files deletion after snapshot+checkpoint+NN restart leads NN to safemode
[ https://issues.apache.org/jira/browse/HDFS-5428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815536#comment-13815536 ] sathish commented on HDFS-5428: --- Continuing the discussion from HDFS-5443 here. As we discussed yesterday, I verified this patch against HDFS-5443. With this patch the issue still reproduces, i.e. after restart the NN goes into safemode. I am not sure where the flow is missing. under construction files deletion after snapshot+checkpoint+NN restart leads NN to safemode Key: HDFS-5428 URL: https://issues.apache.org/jira/browse/HDFS-5428 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0, 2.2.0 Reporter: Vinay Assignee: Vinay Attachments: HDFS-5428-v2.patch, HDFS-5428.000.patch, HDFS-5428.patch 1. allow snapshots under dir /foo 2. create a file /foo/test/bar and start writing to it 3. create a snapshot s1 under /foo after the block is allocated and some data has been written to it 4. delete the directory /foo/test 5. wait till checkpoint or do saveNameSpace 6. restart the NN; the NN enters safemode. Analysis: Snapshot inodes loaded from the fsimage are always complete and all blocks will be in COMPLETE state. So when the Datanode reports RBW blocks, those will not be updated in the blocksmap. Some of the FINALIZED blocks will be marked as corrupt due to length mismatch. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5428) under construction files deletion after snapshot+checkpoint+NN restart leads NN to safemode
[ https://issues.apache.org/jira/browse/HDFS-5428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815551#comment-13815551 ] Hadoop QA commented on HDFS-5428: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612477/HDFS-5428.000.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/5348//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5348//console This message is automatically generated. under construction files deletion after snapshot+checkpoint+NN restart leads NN to safemode Key: HDFS-5428 URL: https://issues.apache.org/jira/browse/HDFS-5428 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0, 2.2.0 Reporter: Vinay Assignee: Vinay Attachments: HDFS-5428-v2.patch, HDFS-5428.000.patch, HDFS-5428.patch 1. allow snapshots under dir /foo 2. create a file /foo/test/bar and start writing to it 3. create a snapshot s1 under /foo after the block is allocated and some data has been written to it 4. delete the directory /foo/test 5. wait till checkpoint or do saveNameSpace 6. restart the NN; the NN enters safemode. Analysis: Snapshot inodes loaded from the fsimage are always complete and all blocks will be in COMPLETE state. So when the Datanode reports RBW blocks, those will not be updated in the blocksmap. Some of the FINALIZED blocks will be marked as corrupt due to length mismatch. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5428) under construction files deletion after snapshot+checkpoint+NN restart leads NN to safemode
[ https://issues.apache.org/jira/browse/HDFS-5428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815558#comment-13815558 ] Vinay commented on HDFS-5428: - Hi Jing, thanks for posting the simplified patch. The patch looks quite good and makes all the unit tests in my patch pass. Small improvements are required to satisfy the points below as well. bq. (From the issue description) So when the Datanode reports RBW blocks, those will not be updated in the blocksmap. Some of the FINALIZED blocks will be marked as corrupt due to length mismatch. This problem is still there, because while loading the fsimage, snapshot inodes are not replaced with a UC inode and the last block is COMPLETE. In this case, after reloading from the fsimage we will not be able to read the last block. Replacing such inodes with a UC inode is required. under construction files deletion after snapshot+checkpoint+NN restart leads NN to safemode Key: HDFS-5428 URL: https://issues.apache.org/jira/browse/HDFS-5428 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0, 2.2.0 Reporter: Vinay Assignee: Vinay Attachments: HDFS-5428-v2.patch, HDFS-5428.000.patch, HDFS-5428.patch 1. allow snapshots under dir /foo 2. create a file /foo/test/bar and start writing to it 3. create a snapshot s1 under /foo after the block is allocated and some data has been written to it 4. delete the directory /foo/test 5. wait till checkpoint or do saveNameSpace 6. restart the NN; the NN enters safemode. Analysis: Snapshot inodes loaded from the fsimage are always complete and all blocks will be in COMPLETE state. So when the Datanode reports RBW blocks, those will not be updated in the blocksmap. Some of the FINALIZED blocks will be marked as corrupt due to length mismatch. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5428) under construction files deletion after snapshot+checkpoint+nn restart leads nn safemode
[ https://issues.apache.org/jira/browse/HDFS-5428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815559#comment-13815559 ] sathish commented on HDFS-5428: ---
{quote} But I am a little uncomfortable managing leases for snapshotted files, as they are read-only files; no need of leases. If all others are ok on that point, I will not object. {quote}
After this point, Uma and I discussed the same points that Jing has addressed in the HDFS-5428.000.patch. It is better to maintain the leases for snapshot files in the SnapshotManager: the responsibility of the LeaseManager is to maintain leases for files open for write, and in the current implementation snapshots are read-only, so there is no need to keep leases for snapshotted files in the LeaseManager. So it is better to maintain the leases for snapshotted files in the SnapshotManager.
+1, the patch looks good. I will verify this patch in my env once.
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5428) under construction files deletion after snapshot+checkpoint+nn restart leads nn safemode
[ https://issues.apache.org/jira/browse/HDFS-5428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815562#comment-13815562 ] Vinay commented on HDFS-5428: -
I think the current updated patch, HDFS-5428.000.patch, can solve HDFS-5443 as well: the NN will exit safemode even without removing the 0-sized blocks. But removing the 0-sized blocks would be an added advantage.
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5443) Namenode can stuck in safemode on restart if it crashes just after addblock logsync and after taking snapshot for such file.
[ https://issues.apache.org/jira/browse/HDFS-5443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815566#comment-13815566 ] sathish commented on HDFS-5443: ---
Thanks Jing for the patch. I verified this patch in my env and it is working correctly. This patch wipes out the zero-sized blocks, so the NN comes out of safemode.
Along with this patch, if we merge HDFS-5428-v2.patch, I feel it will clear all the problems for under-construction files within a snapshot.
Namenode can stuck in safemode on restart if it crashes just after addblock logsync and after taking snapshot for such file. Key: HDFS-5443 URL: https://issues.apache.org/jira/browse/HDFS-5443 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Affects Versions: 3.0.0, 2.2.0 Reporter: Uma Maheswara Rao G Assignee: sathish Attachments: 5443-test.patch, HDFS-5443.000.patch
This issue is reported by Prakash and Sathish. On looking into the issue, the following things are happening:
1) Client added a block at the NN and just did logsync, so the NN has the block ID persisted.
2) Before returning the addBlock response to the client, take a snapshot of the root or a parent directory of that file.
3) Delete the parent directory of that file.
4) Now crash the NN without responding success to the client for that addBlock call.
Now on restart, the Namenode will be stuck in safemode.
-- This message was sent by Atlassian JIRA (v6.1#6144)
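As a rough illustration of what "wiping out the zero-sized blocks" amounts to, here is a hedged sketch; it is not the actual HDFS-5443.000.patch, and the HDFS method names (getLastBlock, removeLastBlock, removeBlock) are assumptions for illustration:
{code}
// Hedged sketch: while loading the namespace, discard a trailing
// zero-length block that was persisted by addBlock's logSync but never
// written to any DataNode, so it no longer counts toward the safemode
// block threshold. Method names are illustrative assumptions.
void wipeZeroSizedLastBlock(INodeFile file, BlocksMap blocksMap) {
  BlockInfo last = file.getLastBlock();
  if (last != null && last.getNumBytes() == 0) {
    file.removeLastBlock(last);   // drop it from the inode's block list
    blocksMap.removeBlock(last);  // and from the NameNode's block map
  }
}
{code}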
[jira] [Commented] (HDFS-5326) add modifyDirective to cacheAdmin
[ https://issues.apache.org/jira/browse/HDFS-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815572#comment-13815572 ] Hadoop QA commented on HDFS-5326: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612476/HDFS-5326.006.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:red}-1 eclipse:eclipse{color}. The patch failed to build with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/5349//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/5349//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5349//console This message is automatically generated. add modifyDirective to cacheAdmin - Key: HDFS-5326 URL: https://issues.apache.org/jira/browse/HDFS-5326 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Affects Versions: 3.0.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HDFS-5326.003.patch, HDFS-5326.004.patch, HDFS-5326.006.patch We should add a way of modifying cache directives on the command-line, similar to how modifyCachePool works. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5428) under construction files deletion after snapshot+checkpoint+nn restart leads nn safemode
[ https://issues.apache.org/jira/browse/HDFS-5428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815580#comment-13815580 ] Vinay commented on HDFS-5428: -
I tried to update the patch according to my previous comment. But to replace the exact inode we need the full snapshot path, and since the full snapshot path is not tracked anywhere in the current code, we cannot replace the INode. We need a way to track the full path of the snapshot INode and replace the INode with an INodeFileUC.
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5428) under construction files deletion after snapshot+checkpoint+nn restart leads nn safemode
[ https://issues.apache.org/jira/browse/HDFS-5428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815581#comment-13815581 ] Vinay commented on HDFS-5428: -
I just tried the following to read the file after restart. It failed with BlockMissingException.
{code}
@Test
public void testWithCheckpoint() throws Exception {
  Path path = new Path("/test");
  doWriteAndAbort(fs, path);
  fs.delete(new Path("/test/test"), true);
  NameNode nameNode = cluster.getNameNode();
  NameNodeAdapter.enterSafeMode(nameNode, false);
  NameNodeAdapter.saveNamespace(nameNode);
  NameNodeAdapter.leaveSafeMode(nameNode);
  cluster.restartNameNode(true);
  // read snapshot file after restart
  String test2snapshotPath = Snapshot.getSnapshotPath(path.toString(),
      "s1/test/test2");
  DFSTestUtil.readFile(fs, new Path(test2snapshotPath));
  String test3snapshotPath = Snapshot.getSnapshotPath(path.toString(),
      "s1/test/test3");
  DFSTestUtil.readFile(fs, new Path(test3snapshotPath));
}
{code}
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5326) add modifyDirective to cacheAdmin
[ https://issues.apache.org/jira/browse/HDFS-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-5326: --- Attachment: HDFS-5326.007.patch
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5326) add modifyDirective to cacheAdmin
[ https://issues.apache.org/jira/browse/HDFS-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815585#comment-13815585 ] Colin Patrick McCabe commented on HDFS-5326:
bq. This wasn't addressed, would you mind shuffling this around? I guess redoing the opcodes is optional (though appreciated), but I'd like to see all the methods/cases organized.
I reordered the opcodes. I suppose it does make sense to do.
bq. I took a hack at this and it ended up being less code and IMO cleaner. I can do this in a follow-on if you like, but:
Let's do this as part of HDFS-5471 if it looks good... similarly with refactoring pc#checkPermission.
bq. need to add a space
Fixed.
bq. Unrelated, but I noticed that CacheManager#listPBCDs does a pc check without first checking if pc is null, want to fix that here?
Fixed.
-- This message was sent by Atlassian JIRA (v6.1#6144)
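For the {{listPBCDs}} point above, the fix presumably amounts to guarding the permission check; a hedged sketch, since the surrounding method is not quoted here and {{checkPermission(pool)}} is an illustrative name rather than the exact API:
{code}
// Hedged sketch of the null guard: pc is null when permissions are
// disabled, so the check must be skipped rather than dereferenced.
// checkPermission(pool) is an illustrative name, not the exact API.
if (pc != null) {
  pc.checkPermission(pool);
}
{code}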
[jira] [Updated] (HDFS-5326) add modifyDirective to cacheAdmin
[ https://issues.apache.org/jira/browse/HDFS-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-5326: --- Attachment: HDFS-5326.007.patch
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5326) add modifyDirective to cacheAdmin
[ https://issues.apache.org/jira/browse/HDFS-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-5326: --- Attachment: (was: HDFS-5326.007.patch)
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5461) fallback to non-ssr(local short circuit reads) while oom detected
[ https://issues.apache.org/jira/browse/HDFS-5461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang Xie updated HDFS-5461: Attachment: HDFS-5461.txt
bq. It's because each open stream holds a buffer, and we have hundreds of open streams?
I am not 100% sure, but I agree with you: this OOM is easy to reproduce when there are lots of open storefiles to read (e.g. when compaction can't catch up sometimes).
Oh, I see, it seems the fallback is only meaningful for a config like mine: a big Xmx and a small MaxDirectMemorySize :)
I attached a patch with more logging of the in-use/pooled direct buffer sizes. In my opinion, that could be useful when resetting the log level to trace online while the OOM occurs. The patch also adds a simple try/catch fallback to handle the OOM without introducing any new config value; to me, this way seems more reasonable :)
fallback to non-ssr(local short circuit reads) while oom detected - Key: HDFS-5461 URL: https://issues.apache.org/jira/browse/HDFS-5461 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 3.0.0, 2.2.0 Reporter: Liang Xie Attachments: HDFS-5461.txt
Currently, the DirectBufferPool used by the SSR feature doesn't seem to have an upper bound other than the direct-memory VM option, so there's a risk of a direct memory OOM; see HBASE-8143 for an example. IMHO, maybe we could improve it a bit: 1) detect OOM, or hitting a configured upper limit, in the caller, then fall back to non-SSR; 2) add a new metric for the currently consumed raw direct memory size.
-- This message was sent by Atlassian JIRA (v6.1#6144)
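A minimal sketch of the try/catch fallback described above, assuming the read path can treat a null buffer as "use the non-short-circuit path"; this is not the attached HDFS-5461.txt itself:
{code}
import java.nio.ByteBuffer;

public class SsrFallback {
  /**
   * Try to allocate a direct buffer for a short-circuit read. Returns null
   * when direct memory (-XX:MaxDirectMemorySize) is exhausted, so the
   * caller can fall back to the ordinary (non-SSR) read path instead of
   * propagating an OutOfMemoryError.
   */
  static ByteBuffer tryAllocateDirect(int size) {
    try {
      return ByteBuffer.allocateDirect(size);
    } catch (OutOfMemoryError oom) {
      return null;  // caller falls back to non-short-circuit reads
    }
  }
}
{code}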
[jira] [Commented] (HDFS-5428) under construction files deletion after snapshot+checkpoint+nn restart leads nn safemode
[ https://issues.apache.org/jira/browse/HDFS-5428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815641#comment-13815641 ] sathish commented on HDFS-5428: ---
Vinay, as I observed while debugging the scenario with your patch, there is a path mismatch when counting the blocks of a snapshot file under construction; because of this, those blocks are not removed from the block threshold.
{code}
String fileSnapshotPath = StringUtils.replaceOnce(
    file, snapshottableDir,
    Snapshot.getSnapshotPath(snapshottableDir,
        Snapshot.getSnapshotName(snapshot)));
{code}
StringUtils is not producing the correct replaced path. Logs for this:
2013-11-07 01:05:15,103 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
java.lang.RuntimeException: java.io.FileNotFoundException: File does not exist: /.snapshot/snap_6ran/_temporary/0/_temporary/attempt_local1866843415_0001_m_00_0/part-m-0
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCompleteBlocksTotal(FSNamesystem.java:5068)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startCommonServices(FSNamesystem.java:853)
at org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:540)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:482)
-- This message was sent by Atlassian JIRA (v6.1#6144)
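The mismatch is easy to reproduce with {{replaceOnce}} alone when the snapshottable directory is the root: replacing the first occurrence of "/" eats the path separator. A standalone demo; the snapshot-root string below is a hypothetical value, not necessarily what {{Snapshot.getSnapshotPath}} returned:
{code}
import org.apache.commons.lang.StringUtils;

public class SnapshotPathDemo {
  public static void main(String[] args) {
    String file = "/foo/bar";        // hypothetical file path
    String snapshottableDir = "/";   // the root itself is snapshottable
    String snapshotRoot = "/.snapshot/s1";
    // replaceOnce substitutes only the first "/", losing the separator:
    System.out.println(
        StringUtils.replaceOnce(file, snapshottableDir, snapshotRoot));
    // prints "/.snapshot/s1foo/bar" instead of "/.snapshot/s1/foo/bar"
  }
}
{code}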
[jira] [Commented] (HDFS-5443) Namenode can stuck in safemode on restart if it crashes just after addblock logsync and after taking snapshot for such file.
[ https://issues.apache.org/jira/browse/HDFS-5443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815653#comment-13815653 ] Uma Maheswara Rao G commented on HDFS-5443: ---
+1, the patch looks good. Thanks Jing, Vinay and Sathish for your efforts.
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5326) add modifyDirective to cacheAdmin
[ https://issues.apache.org/jira/browse/HDFS-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815659#comment-13815659 ] Hadoop QA commented on HDFS-5326: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612514/HDFS-5326.007.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/5351//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/5351//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5351//console This message is automatically generated.
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5326) add modifyDirective to cacheAdmin
[ https://issues.apache.org/jira/browse/HDFS-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815669#comment-13815669 ] Hadoop QA commented on HDFS-5326: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612516/HDFS-5326.007.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/5352//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/5352//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5352//console This message is automatically generated.
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5252) Stable write is not handled correctly in someplace
[ https://issues.apache.org/jira/browse/HDFS-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815698#comment-13815698 ] Jing Zhao commented on HDFS-5252: -
The new patch looks great to me. +1.
Stable write is not handled correctly in someplace -- Key: HDFS-5252 URL: https://issues.apache.org/jira/browse/HDFS-5252 Project: Hadoop HDFS Issue Type: Sub-task Components: nfs Reporter: Brandon Li Assignee: Brandon Li Attachments: HDFS-5252.001.patch, HDFS-5252.002.patch
When the client asks for a stable write but the prerequisite writes have not been transferred to the NFS gateway, the stableness can't be honored. The NFS gateway has to treat the write as an unstable write and set the flag to UNSTABLE in the write response. One bug was found during a test with an Ubuntu client copying a 1KB file. For small files like a 1KB file, the Ubuntu client does one stable write (with the FILE_SYNC flag). However, the NFS gateway missed one place ({{OpenFileCtx#doSingleWrite}}) where it sends the response with the flag NOT updated to UNSTABLE. With this bug, the client thinks the write is on disk and thus doesn't send a COMMIT anymore. The following test tries to read the data back and of course fails to do so since the data was not synced.
-- This message was sent by Atlassian JIRA (v6.1#6144)
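A hedged illustration of the fix the description implies; the real change is in {{OpenFileCtx#doSingleWrite}}, which is not quoted here, and the types below are modeled on the NFSv3 protocol rather than the gateway's exact classes:
{code}
// Hedged sketch: when a FILE_SYNC/DATA_SYNC write cannot be honored
// because prerequisite writes have not arrived, the WRITE reply must
// advertise UNSTABLE so the client still sends a COMMIT later.
enum StableHow { UNSTABLE, DATA_SYNC, FILE_SYNC }

static StableHow grantedStability(StableHow requested,
                                  boolean prerequisitesOnDisk) {
  if (requested != StableHow.UNSTABLE && !prerequisitesOnDisk) {
    return StableHow.UNSTABLE;  // the bug was echoing the requested flag
  }
  return requested;
}
{code}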
[jira] [Created] (HDFS-5474) Deletesnapshot can make Namenode in safemode on NN restarts.
Uma Maheswara Rao G created HDFS-5474: - Summary: Deletesnapshot can make Namenode in safemode on NN restarts. Key: HDFS-5474 URL: https://issues.apache.org/jira/browse/HDFS-5474 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Reporter: Uma Maheswara Rao G Assignee: sathish
When we deleteSnapshot, we delete the blocks associated with that snapshot, and only after that do we logSync the deleteSnapshot op to the editlog. There is a chance that, after the blocks are removed from the blocks map but before the log sync, a block report arrives; the NN may then find that a block does not exist in the blocks map and invalidate it. The invalidation info can also go out as part of a heartbeat. After these steps, if the Namenode shuts down before actually doing the logsync, on restart it will still have the snapshot inodes and will expect the blocks to be reported from the DNs.
The simple solution is to move the block removal down to after the logsync, similar to the delete op.
-- This message was sent by Atlassian JIRA (v6.1#6144)
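A sketch of the reordering proposed above, mirroring how a plain delete defers block removal until the edit is durable; all names here are illustrative assumptions, not the eventual patch:
{code}
// Hedged sketch: log and sync the DeleteSnapshot op before touching the
// blocks map, so a crash before logSync cannot leave block reports racing
// against blocks removed for an op that was never persisted.
List<Block> collectedBlocks =
    snapshotManager.deleteSnapshot(snapshotRoot, snapshotName); // collect only
getEditLog().logDeleteSnapshot(snapshotRoot, snapshotName);
getEditLog().logSync();          // 1) make the op durable
removeBlocks(collectedBlocks);   // 2) only now update the blocks map
{code}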
[jira] [Commented] (HDFS-5428) under construction files deletion after snapshot+checkpoint+nn restart leads nn safemode
[ https://issues.apache.org/jira/browse/HDFS-5428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815718#comment-13815718 ] Vinay commented on HDFS-5428: -
Hi [~sathish.gurram], you are right: the replacement is wrong if the snapshottable dir is /. I will update the patch if necessary. ;)
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5443) Namenode can stuck in safemode on restart if it crashes just after addblock logsync and after taking snapshot for such file.
[ https://issues.apache.org/jira/browse/HDFS-5443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815721#comment-13815721 ] Jing Zhao commented on HDFS-5443: -
Thanks Uma, Sathish and Vinay! I will commit the patch tomorrow morning if there are no more comments.
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5428) under construction files deletion after snapshot+checkpoint+nn restart leads nn safemode
[ https://issues.apache.org/jira/browse/HDFS-5428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13815727#comment-13815727 ] Jing Zhao commented on HDFS-5428: -
bq. But to replace the exact inode we need to have the full snapshot path. in the current case since the full snapshot path is not tracked anywhere we cannot replace the INode.
Yeah, in our current implementation it is hard (sometimes impossible) to get the full path for a given snapshot inode, so it will be hard to replace the whole INodeFile. My question here is whether it's possible to just replace the last block of the snapshot INode with a BlockInfoUC (but without replacing the INodeFile with an INodeFileUC)?
-- This message was sent by Atlassian JIRA (v6.1#6144)
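A sketch of what that suggestion might look like: converting only the last block during fsimage loading, so RBW replica reports can attach to it without needing the full snapshot path required to swap the whole INodeFile. The constructor and setter names are assumptions modeled on the branch's blockmanagement classes, not a committed change:
{code}
// Hedged sketch: replace only the last (COMPLETE) block of the snapshot
// copy with an under-construction block while loading the fsimage.
BlockInfo last = file.getLastBlock();
if (last != null && last.isComplete() && fileWasUnderConstruction) {
  BlockInfoUnderConstruction uc = new BlockInfoUnderConstruction(
      last, file.getBlockReplication());  // keeps block ID and genstamp
  file.setBlock(file.numBlocks() - 1, uc);
  blocksMap.replaceBlock(uc);  // so RBW replica reports match up
}
{code}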