[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347472#comment-16347472 ]
Rakesh R edited comment on HDFS-10285 at 2/1/18 5:07 AM:
---------------------------------------------------------
Thank you very much [~daryn] for your time and useful comments/thoughts. My replies follow; please take a look.

+Comment-1)+
{quote}BlockManager Shouldn’t spsMode be volatile? Although I question why it’s here. {quote}
[Rakesh's reply] Agreed, will do the changes.

+Comment-2)+
{quote}Adding SPS methods to this class implies an unexpected coupling of the SPS service to the block manager. Please move them out to prove it’s not tightly coupled. {quote}
[Rakesh's reply] Agreed. We are planning to create {{StoragePolicySatisfyManager}} and keep all the related APIs there.

+Comment-3)+
{quote}BPServiceActor Is it actually sending back the moved blocks? Aren’t IBRs sufficient? BlockStorageMovementCommand/BlocksStorageMoveAttemptFinished Again, not sure that a new DN command is necessary, and why does it specifically report back successful moves instead of relying on IBRs? I would actually expect the DN to be completely ignorant of a SPS move vs any other move. {quote}
[Rakesh's reply] We have explored the IBR approach and the required code changes. If SPS relied on IBRs, it would need an *extra* check to determine whether each new block arrived because of an SPS move or some other operation, and that check would run frequently, since other operations are far more common than SPS block moves. Currently the DN sends back a separate {{blksMovementsFinished}} list, so each finished movement can be quickly recognized by the Satisfier on the NN side, which then updates its tracking details. If you agree this *extra* check is not an issue, we would be happy to implement the IBR approach. Secondly, BlockStorageMovementCommand was added to carry the block vs. src/target pairs, which is needed for the move operation, and it helps decouple the SPS code.

+Comment-4)+
{quote}DataNode Why isn’t this just a block transfer?
How is transferring between DNs any different than across storages? {quote}
[Rakesh's reply] I can see that Mover also uses the {{REPLACE_BLOCK}} call, and we simply followed the same approach in SPS. Am I missing anything here?

+Comment-5)+
{quote}DatanodeDescriptor Why use a synchronized linked list to offer/poll instead of BlockingQueue? {quote}
[Rakesh's reply] Agreed, will do the changes.

+Comment-6)+
{quote}DatanodeManager I know it’s configurable, but realistically, when would you ever want to give storage movement tasks equal footing with under-replication? Is there really a use case for not valuing durability? {quote}
[Rakesh's reply] We don't have a concrete use case. One scenario we considered: a user configures SSDs and they fill up quickly; in that situation, cleaning up SSD space could be treated as high priority. If you feel this is not a realistic case, I'm OK with removing this config so that SPS always uses only the remaining slots.

+Comment-7)+
{quote}Adding getDatanodeStorageReport is concerning. getDatanodeListForReport is already a very bad method that should be avoided for anything but jmx – even then it’s a concern. I eliminated calls to it years ago. All it takes is a nscd/dns hiccup and you’re left holding the fsn lock for an excessive length of time. Beyond that, the response is going to be pretty large and tagging all the storage reports is not going to be cheap. verifyTargetDatanodeHasSpaceForScheduling does it really need the namesystem lock? Can’t DatanodeDescriptor#chooseStorage4Block synchronize on its storageMap? Appears to be calling getLiveDatanodeStorageReport for every file. As mentioned earlier, this is NOT cheap. The SPS should be able to operate on a fuzzy/cached state of the world. Then it gets another datanode report to determine the number of live nodes to decide if it should sleep before processing the next path. The number of nodes from the prior cached view of the world should suffice.
{quote}
[Rakesh's reply] Good point. Some time back, Uma and I discussed a cache-based approach. We currently depend on this API for the datanode storage types and remaining-space details. I think it requires two different mechanisms for internal and external SPS. For internal SPS, how about referring directly to {{DatanodeManager#datanodeMap}} for every file? For external SPS, IIUC you are suggesting a cache mechanism. How about this: fetch the storage report once and cache it in ExternalContext, refreshing the local cache periodically. Say, after every 5 minutes (just an arbitrary number; if you have some period in mind, please suggest it), a call to getDatanodeStorageReport treats the cache as expired and fetches fresh data; within the 5 minutes it serves from the cache. Does this make sense to you? Another point we thought of: right now, checking whether a node has enough space goes to the NN. With the above caching, we could check the space constraints against the cached report only, so we would not go to the NN for every DN space check. If you are fine with the above caching implementation, I will incorporate this point as well.

+Comment-8)+
{quote}DFSUtil DFSUtil.removeOverlapBetweenStorageTypes and {{DFSUtil.getSPSWorkMultiplier}}. These aren’t generally useful methods so why are they in DFSUtil? Why aren’t they in the only calling class StoragePolicySatisfier? {quote}
[Rakesh's reply] Agreed, will do the changes.

+Comment-9)+
{quote}FSDirXAttrOp Not fond of the SPS queue updates being edge triggered by xattr changes. Food for thought: why add more rpc methods if the client can just twiddle the xattr? {quote}
[Rakesh's reply] SPS depends on the xattr. On each user satisfy call, it updates the xattr and adds the inodeId to the {{spsPathIds}} queue. Likewise, on NN restart, it checks for the xattr key and, if it exists, adds the inodeId to the {{spsPathIds}} queue. This in-memory queue is maintained for quick lookup and is used for tracking (scheduling/completion/retry logic).
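To illustrate the flow just described, here is a minimal, hypothetical sketch of the xattr-then-enqueue ordering (class and method names such as {{SpsPathIdsSketch}} are illustrative only, not the actual HDFS code):
{code:java}
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: the persisted xattr is the source of truth; the
// in-memory queue is rebuilt from it on restart.
class SpsPathIdsSketch {
  private final Set<Long> xattrs = new HashSet<>();          // stands in for the persisted sps xattrs
  private final Queue<Long> spsPathIds = new ConcurrentLinkedQueue<>();

  // User satisfy call: persist the xattr marker first, then enqueue for tracking.
  void satisfyStoragePolicy(long inodeId) {
    xattrs.add(inodeId);
    spsPathIds.offer(inodeId);
  }

  // On NN restart: rebuild the in-memory queue from the persisted xattrs.
  void reloadOnRestart() {
    spsPathIds.clear();
    for (long id : xattrs) {
      spsPathIds.offer(id);
    }
  }

  Long poll() {
    return spsPathIds.poll();
  }
}
{code}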
With this flow, we decided to place the {{spsPathIds}} entry addition alongside the xattr update. Please feel free to correct me if I missed anything.

+Comment-10)+
{quote}FSNamesystem / NamenodeProtocolTranslatorPB Most of the api changes appear unnecessary. No need for the new getFileInfo when the existing getLocatedFileInfo appears to do exactly what the new method does? No need for the new getFilePath. It’s only used by IntraSPSNameNodeContext#getFileInfo to get the path for an inode number, followed by calling the new getFileInfo cited above. IntraSPSNameNodeContext#getFileInfo swallows all IOEs, based on assumption that any and all IOEs means FNF which probably isn’t the intention during rpc exceptions. Should be able to replace it all with getLocatedFileInfo(“/.reserved/.inodes/XXX”, false) which avoids changing the public apis. {quote}
[Rakesh's reply] Adding a little background: to minimize memory usage, we maintain a list of {{inodeIds}} rather than path objects, so we need to fetch the file details by passing the {{inodeId}}; that is the reason for the new getFilePath(inodeId). For getFileInfo, I've used the getLocatedFileInfo API in {{ExternalSPSContext#getFileInfo}}. {{IntraSPSNameNodeContext}} keeps a Namesystem reference, and that interface does not have a getFileInfo API. Can SPS keep an FSNamesystem reference instead of the interface reference? We will do some more research on this part.

+Comment-11)+
{quote}HdfsServerConstants The xattr is called user.hdfs.sps.xattr. Why does the xattr name actually contain the word “xattr”? {quote}
[Rakesh's reply] Sure, will remove the “xattr” word.

+Comment-12)+
{quote}NameNode Super trivial but using the plural pronoun “we” in this exception message is odd. Changing the value isn’t a joint activity. For enabling or disabling storage policy satisfier, we must pass either none/internal/external string value only {quote}
[Rakesh's reply] Oops, sorry for the mistake. Will change it.
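(Addendum to Comment-10 above) If we go with the suggested getLocatedFileInfo("/.reserved/.inodes/XXX", false) approach, the lookup could be sketched roughly as below; {{ReservedInodePath}} and {{resolveByInodeId}} are hypothetical helper names, only the reserved-path construction is shown, and the resulting path would be handed to the existing getLocatedFileInfo API:
{code:java}
// Hypothetical sketch for Comment-10: resolve a file by inode id through the
// reserved path instead of adding a new getFilePath/getFileInfo RPC.
class ReservedInodePath {
  static final String DOT_RESERVED_INODES = "/.reserved/.inodes";

  // Build the reserved path for an inode id; the result can be passed to the
  // existing getLocatedFileInfo(path, false) call.
  static String resolveByInodeId(long inodeId) {
    return DOT_RESERVED_INODES + "/" + inodeId;
  }
}
{code}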
+Comment-13)+
{quote}StoragePolicySatisfier It appears to make back-to-back calls to hasLowRedundancyBlocks and getFileInfo for every file. Haven’t fully groked the code, but if low redundancy is not the common case, then it shouldn’t be called unless/until needed. It looks like files that are under replicated are re-queued again? {quote}
[Rakesh's reply] Yes, under-replicated files are queued up again. For external SPS, I can think of another approach: iterate over all the blocks and check the length of each block's location array; if it is less than the replication factor, mark the file as under-replicated. Does this make sense to you?

+Comment-14)+
{quote}Appears to be calling getStoragePolicy for every file. It’s not like the byte values change all the time, why call and cache all of them via FSN#getStoragePolicies? {quote}
[Rakesh's reply] Below is the current logic we use. Sorry, I failed to understand the need to cache all the policies. Could you please elaborate a bit? Thanks!
{code:java}
// ExternalSPSContext.java
@Override
public BlockStoragePolicy getStoragePolicy(byte policyId) {
  return createDefaultSuite.getPolicy(policyId);
}

// IntraSPSNameNodeContext.java
public BlockStoragePolicy getStoragePolicy(byte policyID) {
  return blockManager.getStoragePolicy(policyID);
}
{code}

+Comment-15)+
{quote}FSDirSatisfyStoragePolicyOp satisfyStoragePolicy errors if the xattr is already present. Should this be a no-op? A client re-requesting a storage policy correction probably shouldn't fail. unprotectedSatisfyStoragePolicy is called prior to xattr updates, which calls addSPSPathId. To avoid race conditions or inconsistent state if the xattr fails, should call addSPSPathId after xattrs are successfully updated. inodeHasSatisfyXAttr calls getXAttrFeature then immediately shorts out if the inode isn't a file. Should do file check first to avoid unnecessary computation. In general, not fond of unnecessarily guava.
Instead of newArrayListWithCapacity + add, standard Arrays.asList(item) is more succinct. {quote}
[Rakesh's reply] Agreed, will do the changes.

+Comment-16)+
{quote}FSDirStatAndListOp Not sure why javadoc was changed to add needLocation. It's already present and now doubled up. {quote}
[Rakesh's reply] Agreed, will correct it.

+Comment-17)+
{quote}// TODO: Needs to manage the number of concurrent moves per DataNode. Shouldn't this be fixed now? {quote}
[Rakesh's reply] I think we have throttling in place on the namenode side, so less work reaches the datanode. By default, {{dfs.namenode.replication.max-streams-hard-limit}} is 2; with that, only the delta of move tasks goes to the DN in one heartbeat interval (3-second interval). We skipped the case of a very large hard-limit value being configured, which is why we postponed this task.

+Comment-18)+
{quote}DFS_MOVER_MOVERTHREADS_DEFAULT is 1000 per DN? If the DN is concurrently doing 1000 moves, it's not in a good state, disk io is probably saturated, and this will only make it much worse. 10 is probably more than sufficient. {quote}
[Rakesh's reply] Agreed, will reduce it to the smaller value 10.

+Comment-19)+
{quote}The executor uses a CallerRunsPolicy? If the thread pool is exhausted, WHY should the heartbeat thread process the operation? That's a terrible idea. {quote}
[Rakesh's reply] Agreed, will remove this CallerRunsPolicy and keep it simple. Does this make sense to you?
{code:java}
ThreadPoolExecutor moverThreadPool = new ThreadPoolExecutor(1, num, 60,
    TimeUnit.SECONDS, new SynchronousQueue<Runnable>(),
    new Daemon.DaemonFactory() {
      private final AtomicInteger threadIndex = new AtomicInteger(0);

      @Override
      public Thread newThread(Runnable r) {
        Thread t = super.newThread(r);
        t.setName("BlockMoverTask-" + threadIndex.getAndIncrement());
        return t;
      }
    });
{code}

+Comment-20)+
{quote}I was under the impression the DN did an optimized movement between its storages?
Does it actually make a connection to itself? If yes, when the DN xceiver pool is very busy, the mover thread pool will exhaust and cause the heartbeat thread to block. {quote}
[Rakesh's reply] Good catch. Agreed, we can optimize by doing a local move, and will make the changes. FYI, initially we implemented C-DN to track all block moves, but later we refactored the code and somehow missed the local-movement logic.

+Comment-21)+
{quote}BlockStorageMovementTracker Many data structures are riddled with non-threadsafe race conditions and risk of CMEs. Ex. The moverTaskFutures map. Adding new blocks and/or adding to a block's list of futures is synchronized. However the run loop does an unsynchronized block get, unsynchronized future remove, unsynchronized isEmpty, possibly another unsynchronized get, only then does it do a synchronized remove of the block. The whole chunk of code should be synchronized. Is the problematic moverTaskFutures even needed? It's aggregating futures per-block for seemingly no reason. Why track all the futures at all instead of just relying on the completion service? As best I can tell: It's only used to determine if a future from the completion service should be ignored during shutdown. Shutdown sets the running boolean to false and clears the entire datastructure so why not use the running boolean like a check just a little further down? As synchronization to sleep up to 2 seconds before performing a blocking moverCompletionService.take, but only when it thinks there are no active futures. I'll ignore the missed notify race that the bounded wait masks, but the real question is why not just do the blocking take? Why all the complexity? Am I missing something? BlocksMovementsStatusHandler Suffers same type of thread safety issues as StoragePolicySatisfyWorker. Ex. blockIdVsMovementStatus is inconsistent synchronized. Does synchronize to return an unmodifiable list which sadly does nothing to protect the caller from CME.
handle is iterating over a non-thread safe list. {quote}
[Rakesh's reply] Agreed, will correct it and address the concerns.

+Comment-22)+
{quote}StoragePolicySatisifier Should handleException use a double-checked lock to avoid synchronization? Unexpected exceptions should be a rarity, right? Speaking of which, it’s not safe to ignore all Throwable in the run loop! You have no idea if data structures are in a sane or consistent state. {quote}
[Rakesh's reply] Agreed, will do the changes.

+Comment-23)+
{quote}FSTreeTraverser {quote}
This SPS branch doesn't implement any new logic for tree traversal; it simply uses the slice tree traversal logic implemented via the HDFS-10899 jira. We have attempted to generalize the existing code so that HDFS-10285, HDFS-12090 and the EDEK features can make use of it. I'd appreciate it if you could add your comments/thoughts to the improvement task HDFS-13095 to correct the existing code in trunk; that would really help us avoid maintaining those changes in this branch. During the discussion/comments in that jira, if some bug specific to SPS comes up, we will surely fix it to improve the SPS logic.
{quote}traverseDir may NPE if it's traversing a tree in a snapshot and one of the ancestors is deleted. {quote}
Good catch. As per the analysis/discussion in HDFS-13095, this is a special case only for SPS, and we will fix it in SPS.

> Storage Policy Satisfier in Namenode
> ------------------------------------
>
>                 Key: HDFS-10285
>                 URL: https://issues.apache.org/jira/browse/HDFS-10285
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: HDFS-10285
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>            Priority: Major
>         Attachments: HDFS-10285-consolidated-merge-patch-00.patch, HDFS-10285-consolidated-merge-patch-01.patch, HDFS-10285-consolidated-merge-patch-02.patch, HDFS-10285-consolidated-merge-patch-03.patch, HDFS-10285-consolidated-merge-patch-04.patch, HDFS-10285-consolidated-merge-patch-05.patch, HDFS-SPS-TestReport-20170708.pdf, SPS Modularization.pdf, Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, Storage-Policy-Satisfier-in-HDFS-May10.pdf, Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
> Heterogeneous storage in HDFS introduced the concept of storage policy. These policies can be set on a directory/file to specify the user preference for where to store the physical blocks. When the user sets the storage policy before writing data, the blocks can take advantage of the storage policy preferences and be stored accordingly.
> If the user sets the storage policy after writing and completing the file, the blocks will have been written with the default storage policy (nothing but DISK).
> The user has to run the ‘Mover tool’ explicitly by specifying all such file names as a list. In some distributed system scenarios (ex: HBase) it would be difficult to collect all the files and run the tool, as different nodes can write files separately and the files can have different paths.
> Another scenario: when the user renames a file from a directory with one effective storage policy (inherited from the parent directory) into a directory with a different storage policy, the inherited storage policy is not copied from the source; the file takes its policy from the destination file/dir parent. This rename operation is just a metadata change in the Namenode; the physical blocks still remain with the source storage policy.
> So, tracking all such business-logic-based file names from distributed nodes (ex: region servers) and running the Mover tool could be difficult for admins.
> Here the proposal is to provide an API from the Namenode itself to trigger storage policy satisfaction. A daemon thread inside the Namenode should track such calls and send movement commands to the DNs.
> Will post the detailed design thoughts document soon.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)