[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368648#comment-16368648 ] Rakesh R edited comment on HDFS-10285 at 2/18/18 7:37 PM:
--
We have worked on the comments and the following is a quick update.
{{Comment-1)}} => DONE via HDFS-13097
{{Comment-2)}} => DONE via HDFS-13097
{{Comment-5)}} => DONE via HDFS-13097
{{Comment-8)}} => DONE via HDFS-13097
{{Comment-10)}} => DONE via HDFS-13110
{{Comment-11)}} => DONE via HDFS-13097
{{Comment-12)}} => DONE via HDFS-13097
{{Comment-13)}} => DONE via HDFS-13110
{{Comment-15)}} => DONE via HDFS-13097
{{Comment-16)}} => DONE via HDFS-13097
{{Comment-18)}} => DONE via HDFS-13097
{{Comment-19)}} => DONE via HDFS-13097
{{Comment-22)}} => DONE via HDFS-13097
*For the below comments*, it would be great to hear your thoughts. Please let me know your feedback on my reply.
{{Comment-3)}} => This comment has two parts: IBR and data transfer. IBR will be explored and implemented via the HDFS-13165 sub-task, but the data transfer part is not concluded yet. How do we incorporate a local move into this, given that the data transfer path currently has no such logic? IIUC, DNA_TRANSFER is used to send a copy of a block to another datanode. Also, the mover tool uses replaceBlock() for block movement, which already supports moving a block to a different storage within the same datanode. How about using the {{replaceBlock}} pattern in SPS as well?
{{Comment-4)}} => Depends on Comment-3.
{{Comment-6, Comment-9, Comment-14, Comment-17)}} => We need to understand these further.
{{Comment-20)}} => Depends on Comment-3.
*In Progress tasks:*
{{Comment-3)}} => HDFS-13165; this jira will only implement the logic to collect back the moved blocks via IBR.
{{Comment-21)}} => HDFS-13165
{{Comment-7)}} => HDFS-13166
> Storage Policy Satisfier in Namenode
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
> Affects Versions: HDFS-10285
> Reporter: Uma Maheswara Rao G
> Assignee: Uma Maheswara Rao G
> Priority: Major
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch,
> HDFS-10285-consolidated-merge-patch-01.patch,
> HDFS-10285-consolidated-merge-patch-02.patch,
> HDFS-10285-consolidated-merge-patch-03.patch,
> HDFS-10285-consolidated-merge-patch-04.patch,
> HDFS-10285-consolidated-merge-patch-05.patch,
> HDFS-SPS-TestReport-20170708.pdf, SPS Modularization.pdf,
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf,
> Storage-Policy-Satisfier-in-HDFS-May10.pdf,
> Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
> Heterogeneous storage in HDFS introduced the concept of storage policy. These policies can be set on a directory/file to specify the user preference for where to store the physical blocks. When the user sets the storage policy before writing data, the blocks can take advantage of the storage policy preferences and be stored accordingly.
> If the user sets the storage policy after writing and completing the file, then the blocks would have been written with the default storage policy
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347472#comment-16347472 ] Rakesh R edited comment on HDFS-10285 at 2/1/18 1:58 PM:
-
Thank you very much [~daryn] for your time and useful comments/thoughts. My reply follows; please take a look.
+Comment-1)+
{quote}BlockManager Shouldn’t spsMode be volatile? Although I question why it’s here. {quote}
[Rakesh's reply] Agreed, will do the changes.
+Comment-2)+
{quote}Adding SPS methods to this class implies an unexpected coupling of the SPS service to the block manager. Please move them out to prove it’s not tightly coupled. {quote}
[Rakesh's reply] Agreed. We are planning to create {{StoragePolicySatisfyManager}} and keep all the related APIs there.
+Comment-3)+
{quote}BPServiceActor Is it actually sending back the moved blocks? Aren’t IBRs sufficient? BlockStorageMovementCommand/BlocksStorageMoveAttemptFinished Again, not sure that a new DN command is necessary, and why does it specifically report back successful moves instead of relying on IBRs? I would actually expect the DN to be completely ignorant of a SPS move vs any other move. {quote}
[Rakesh's reply] We have explored the IBR approach and the required code changes. If SPS relied on IBRs, it would require an *extra* check to know whether a new block arrived due to an SPS move or some other operation, and this check would run quite often since other ops greatly outnumber SPS block moves. Currently, the DN sends back a separate {{blksMovementsFinished}} list, so each finished block movement can be easily and quickly recognized by the Satisfier on the NN side, which then updates the tracking details. If you agree this *extra* check is not an issue, then we would be happy to implement the IBR approach. Secondly, BlockStorageMovementCommand was added to carry the block vs src/target pairs, which is needed for the move operation and for returning the result.
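The *extra* check discussed in the Comment-3 reply could be sketched roughly as below. This is a hypothetical illustration, not actual Hadoop code: {{SpsMoveTracker}} and its method names are assumptions; the idea is simply that the Satisfier keeps a set of block IDs it has scheduled, and the per-IBR cost is a single lookup in that set.

```java
// Hypothetical sketch: how the Satisfier could recognize SPS-triggered
// moves from plain IBRs, assuming it records every block ID it schedules.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class SpsMoveTracker {
    // Block IDs with an SPS move attempt currently in flight.
    private final Set<Long> pendingMoves = ConcurrentHashMap.newKeySet();

    void scheduleMove(long blockId) {
        pendingMoves.add(blockId);
    }

    /**
     * Called for every incremental block report entry; returns true only
     * when the received block completes an SPS-scheduled move. This is the
     * "extra check" that would run on every IBR.
     */
    boolean notifyBlockReceived(long blockId) {
        return pendingMoves.remove(blockId);
    }
}
```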
On second thought, if we change the response flow to IBR, then I understand we can consider using the block transfer flow. Following is my analysis on that:
- I could see that DNA_TRANSFER is used to send a copy of a block to another datanode. In our case, we need to support a {{local_move}} and would have to incorporate this {{local_move}} into the block transfer logic; I hope that is fine?
- Secondly, we presently use {{replaceBlock}}, which has an additional {{delHint}} notification mechanism to the NN. IIUC, {{transfer block}} doesn't have this hint, and the namenode will eventually delete the over-replicated block on the next block report. Is that fine? Meanwhile, I will make changes based on {{ibr/data transfer}} for the internal SPS movement and analyse further.
{code:java}
// notify name node
final Replica r = blockReceiver.getReplica();
datanode.notifyNamenodeReceivedBlock(
    block, delHint, r.getStorageUuid(), r.isOnTransientStorage());
LOG.info("Moved " + block + " from " + peer.getRemoteAddressString()
    + ", delHint=" + delHint);
{code}
- So, the internal SPS will use {{transfer block}} and the external SPS will use {{replace block}}; am I missing anything?
+Comment-4)+
{quote}DataNode Why isn’t this just a block transfer? How is transferring between DNs any different than across storages? {quote}
[Rakesh's reply] Please see my reply to {{Comment-3}}; I have incorporated this case there. Thanks!
+Comment-5)+
{quote}DatanodeDescriptor Why use a synchronized linked list to offer/poll instead of BlockingQueue? {quote}
[Rakesh's reply] Agreed, will do the changes.
+Comment-6)+
{quote}DatanodeManager I know it’s configurable, but realistically, when would you ever want to give storage movement tasks equal footing with under-replication? Is there really a use case for not valuing durability? {quote}
[Rakesh's reply] We don't have any particular use case. One scenario we considered: a user configures SSDs, and they fill up quickly.
In that case, there could be situations where cleaning up is a high priority. If you feel this is not a real case, then I'm OK to remove this config, and SPS will always use only the remaining slots.
+Comment-7)+
{quote}Adding getDatanodeStorageReport is concerning. getDatanodeListForReport is already a very bad method that should be avoided for anything but jmx – even then it’s a concern. I eliminated calls to it years ago. All it takes is a nscd/dns hiccup and you’re left holding the fsn lock for an excessive length of time. Beyond that, the response is going to be pretty large and tagging all the storage reports is not going to be cheap. verifyTargetDatanodeHasSpaceForScheduling does it really need the namesystem lock? Can’t DatanodeDescriptor#chooseStorage4Block synchronize on its storageMap? Appears to be calling getLiveDatanodeStorageReport for every file. As mentioned earlier, this is NOT cheap.
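Returning to Comment-5 above: the agreed switch from an externally synchronized linked list to a BlockingQueue could look roughly like this. A minimal sketch only; the class name, task type, and queue bound are illustrative, not from the actual patch.

```java
// Sketch of the Comment-5 change: a BlockingQueue gives thread-safe
// offer/poll without explicit synchronization around a LinkedList.
import java.util.Queue;
import java.util.concurrent.LinkedBlockingQueue;

class BlockMovementTaskQueue {
    private static final int QUEUE_LIMIT = 1000; // illustrative bound

    // LinkedBlockingQueue handles its own locking internally.
    private final Queue<String> tasks = new LinkedBlockingQueue<>(QUEUE_LIMIT);

    boolean add(String task) {
        return tasks.offer(task); // returns false when the queue is full
    }

    String next() {
        return tasks.poll(); // returns null when empty; no external lock
    }
}
```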
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347472#comment-16347472 ] Rakesh R edited comment on HDFS-10285 at 2/1/18 5:07 AM:
-
Thank you very much [~daryn] for your time and useful comments/thoughts. My reply follows; please take a look.
+Comment-1)+
{quote}BlockManager Shouldn’t spsMode be volatile? Although I question why it’s here. {quote}
[Rakesh's reply] Agreed, will do the changes.
+Comment-2)+
{quote}Adding SPS methods to this class implies an unexpected coupling of the SPS service to the block manager. Please move them out to prove it’s not tightly coupled. {quote}
[Rakesh's reply] Agreed. We are planning to create {{StoragePolicySatisfyManager}} and keep all the related APIs there.
+Comment-3)+
{quote}BPServiceActor Is it actually sending back the moved blocks? Aren’t IBRs sufficient? BlockStorageMovementCommand/BlocksStorageMoveAttemptFinished Again, not sure that a new DN command is necessary, and why does it specifically report back successful moves instead of relying on IBRs? I would actually expect the DN to be completely ignorant of a SPS move vs any other move. {quote}
[Rakesh's reply] We have explored the IBR approach and the required code changes. If SPS relied on IBRs, it would require an *extra* check to know whether a new block arrived due to an SPS move or some other operation, and this check would run quite often since other ops greatly outnumber SPS block moves. Currently, the DN sends back a separate {{blksMovementsFinished}} list, so each finished block movement can be easily and quickly recognized by the Satisfier on the NN side, which then updates the tracking details. If you agree this *extra* check is not an issue, then we would be happy to implement the IBR approach. Secondly, BlockStorageMovementCommand was added to carry the block vs src/target pairs, which is needed for the move operation, and we tried to decouple the SPS code using this command.
+Comment-4)+
{quote}DataNode Why isn’t this just a block transfer? How is transferring between DNs any different than across storages? {quote}
[Rakesh's reply] I could see that Mover also uses the {{REPLACE_BLOCK}} call, and we just followed the same approach in SPS. Am I missing anything here?
+Comment-5)+
{quote}DatanodeDescriptor Why use a synchronized linked list to offer/poll instead of BlockingQueue? {quote}
[Rakesh's reply] Agreed, will do the changes.
+Comment-6)+
{quote}DatanodeManager I know it’s configurable, but realistically, when would you ever want to give storage movement tasks equal footing with under-replication? Is there really a use case for not valuing durability? {quote}
[Rakesh's reply] We don't have any particular use case. One scenario we considered: a user configures SSDs, and they fill up quickly. In that case, there could be situations where cleaning up is a high priority. If you feel this is not a real case, then I'm OK to remove this config, and SPS will always use only the remaining slots.
+Comment-7)+
{quote}Adding getDatanodeStorageReport is concerning. getDatanodeListForReport is already a very bad method that should be avoided for anything but jmx – even then it’s a concern. I eliminated calls to it years ago. All it takes is a nscd/dns hiccup and you’re left holding the fsn lock for an excessive length of time. Beyond that, the response is going to be pretty large and tagging all the storage reports is not going to be cheap. verifyTargetDatanodeHasSpaceForScheduling does it really need the namesystem lock? Can’t DatanodeDescriptor#chooseStorage4Block synchronize on its storageMap? Appears to be calling getLiveDatanodeStorageReport for every file. As mentioned earlier, this is NOT cheap. The SPS should be able to operate on a fuzzy/cached state of the world. Then it gets another datanode report to determine the number of live nodes to decide if it should sleep before processing the next path.
The number of nodes from the prior cached view of the world should suffice. {quote}
[Rakesh's reply] Good point. Some time back, Uma and I thought about the caching part. Actually, we depend on this API for the datanode storage types and remaining-space details. I think it requires two different mechanisms for internal and external SPS. For internal SPS, how about directly referring to {{DatanodeManager#datanodeMap}} for every file? For external SPS, IIUC you are suggesting a cache mechanism. How about this: get the storage report once and cache it at ExternalContext, refreshing the local cache periodically. Say, after every 5 mins (just an arbitrary number I put here; if you have some period in mind, please suggest it), when getDatanodeStorageReport is called, the cache is treated as expired and fetched freshly; within 5 mins, it serves from the cache. Does this make sense to you? Another point we thought of is, right now for checking whether
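The refresh-on-expiry idea in the reply above could look roughly like this. A minimal sketch under stated assumptions: {{CachedStorageReport}} is a made-up name, the {{Supplier}} stands in for the expensive getDatanodeStorageReport RPC, and the 5-minute TTL is the arbitrary number from the discussion.

```java
// Illustrative TTL cache for an expensive report fetch: serve the cached
// value within the TTL window, refetch once it has expired.
import java.util.function.Supplier;

class CachedStorageReport<T> {
    private final Supplier<T> fetcher; // stands in for the expensive NN call
    private final long ttlMillis;
    private T cached;
    private long fetchedAt = Long.MIN_VALUE;

    CachedStorageReport(Supplier<T> fetcher, long ttlMillis) {
        this.fetcher = fetcher;
        this.ttlMillis = ttlMillis;
    }

    synchronized T get(long nowMillis) {
        if (cached == null || nowMillis - fetchedAt >= ttlMillis) {
            cached = fetcher.get(); // expired: fetch freshly
            fetchedAt = nowMillis;
        }
        return cached;              // within TTL: serve from cache
    }
}
```

Time is passed in explicitly only to keep the sketch testable; a real version would read the clock itself.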
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347575#comment-16347575 ] Surendra Singh Lilhore edited comment on HDFS-10285 at 1/31/18 8:38 PM:
Thanks [~daryn] for the reviews. Created Part 1 jira HDFS-13097 to fix a few comments.
> Storage Policy Satisfier in Namenode
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
> Affects Versions: HDFS-10285
> Reporter: Uma Maheswara Rao G
> Assignee: Uma Maheswara Rao G
> Priority: Major
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch,
> HDFS-10285-consolidated-merge-patch-01.patch,
> HDFS-10285-consolidated-merge-patch-02.patch,
> HDFS-10285-consolidated-merge-patch-03.patch,
> HDFS-10285-consolidated-merge-patch-04.patch,
> HDFS-10285-consolidated-merge-patch-05.patch,
> HDFS-SPS-TestReport-20170708.pdf, SPS Modularization.pdf,
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf,
> Storage-Policy-Satisfier-in-HDFS-May10.pdf,
> Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
> Heterogeneous storage in HDFS introduced the concept of storage policy. These policies can be set on a directory/file to specify the user preference for where to store the physical blocks. When the user sets the storage policy before writing data, the blocks can take advantage of the storage policy preferences and be stored accordingly.
> If the user sets the storage policy after writing and completing the file, then the blocks would have been written with the default storage policy (nothing but DISK). The user has to run the ‘Mover tool’ explicitly, specifying all such file names as a list. In some distributed system scenarios (e.g. HBase) it would be difficult to collect all the files and run the tool, as different nodes can write files separately and files can have different paths.
> Another scenario is, when the user renames a file from a storage-policy-affected file (inherited policy from the parent directory) into another storage-policy-affected directory, the inherited storage policy is not copied from the source; the policy takes effect from the destination file/dir parent storage policy. This rename operation is just a metadata change in the Namenode; the physical blocks still remain with the source storage policy.
> So, tracking all such business-logic-based file names could be difficult for admins across distributed nodes (e.g. region servers) and running the Mover tool.
> The proposal here is to provide an API from the Namenode itself to trigger storage policy satisfaction. A daemon thread inside the Namenode should track such calls and send movement commands to the DNs.
> Will post the detailed design thoughts document soon.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300516#comment-16300516 ] Virajith Jalaparti edited comment on HDFS-10285 at 12/21/17 8:16 PM:
-
Hi [~umamaheswararao], thanks for the meeting summary. Having both options (SPS within the NN and SPS as a service) would be great.
bq. it may be necessary to start SPS RPC with its own IP:port (within NN or outside), so clients can always talk to SPS on that port, irrespective of where its running.
Quick question about this: what are the clients you are referring to here? Are these for admin operations? If the SPS is going to run outside the NN, I would think it should be decoupled from, and not depend on, the FSNamesystem lock and the NN-DN heartbeat protocol. The current implementation/design has a tight coupling between the SPS and both these components.
[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300516#comment-16300516 ] Virajith Jalaparti edited comment on HDFS-10285 at 12/21/17 8:15 PM: - Hi [~umamaheswararao], Thanks for the meeting summary. Having both the options (SPS within NN and SPS as service) would be great. bq. it may be necessary to start SPS RPC with its own IP:port (within NN or outside), so clients can always talk to SPS on that port, irrespective of where its running. Quick question about this: what are clients you are referring to here? Are these for admin operations? If the SPS is going to run outside the NN, I would think it is going to be decoupled from/not depend on the FSNamesystem lock and the NN-DN heartbeat protocol. The current implementation/design has a tight coupling between the SPS and both these components. was (Author: virajith): Hi [~umamaheswararao], Thanks for the meeting summary. I agree that having both the options (SPS within NN and SPS as service) would be great to have. bq. it may be necessary to start SPS RPC with its own IP:port (within NN or outside), so clients can always talk to SPS on that port, irrespective of where its running. Quick question about this: what are clients you are referring to here? Are these for admin operations? If the SPS is going to run outside the NN, I would think it is going to be decoupled from/not depend on the FSNamesystem lock and the NN-DN heartbeat protocol. The current implementation/design has a tight coupling between the SPS and both these components. 
> Storage Policy Satisfier in Namenode > > > Key: HDFS-10285 > URL: https://issues.apache.org/jira/browse/HDFS-10285 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, namenode >Affects Versions: HDFS-10285 >Reporter: Uma Maheswara Rao G >Assignee: Uma Maheswara Rao G > Attachments: HDFS-10285-consolidated-merge-patch-00.patch, > HDFS-10285-consolidated-merge-patch-01.patch, > HDFS-10285-consolidated-merge-patch-02.patch, > HDFS-10285-consolidated-merge-patch-03.patch, > HDFS-SPS-TestReport-20170708.pdf, > Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, > Storage-Policy-Satisfier-in-HDFS-May10.pdf, > Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf > > > Heterogeneous storage in HDFS introduced the concept of storage policy. These > policies can be set on a directory/file to specify the user's preference for where > the physical blocks should be stored. When the user sets the storage policy before writing > data, the blocks can take advantage of the storage policy preference and are > stored accordingly. > If the user sets the storage policy after writing and completing the file, the > blocks will already have been written with the default storage policy (nothing but > DISK). The user then has to run the 'Mover tool' explicitly, specifying all such > file names as a list. In some distributed-system scenarios (ex: HBase) it > would be difficult to collect all the files and run the tool, as different > nodes can write files separately and files can have different paths. > Another scenario is when the user renames a file from a directory with an effective storage > policy (inherited from the parent directory) to a directory affected by another storage > policy: the rename does not copy the inherited storage policy from the > source, so the policy takes effect from the destination file/dir's parent. This rename operation is just a metadata change in the Namenode. The > physical blocks still remain placed per the source storage policy. 
[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284692#comment-16284692 ] Uma Maheswara Rao G edited comment on HDFS-10285 at 12/9/17 10:33 AM: -- {quote} Conceptually, blocks violating the HSM policy are a form of mis-replication that doesn't satisfy a placement policy – which would truly prevent performance issues if the feature isn't needed. The NN's repl monitor ignorantly handles the moves as a low priority transfer (if/since it's sufficiently replicated). The changes to the NN are minimalistic. DNs need to support/honor storages in transfer requests. Transfers to itself become moves. Now HSM "just works", eventually, similar to increasing the repl factor. {quote} Thank you for the proposal. The scoping seems reasonable to me, IIUC. To be clear, the current SPS intends only to satisfy the basic HSM feature. {quote} An external SPS can provide fancier policies for accelerating the processing for those users like hbase. {quote} The fancier policy implementation proposals are at HDFS-7343 (smart storage management), not part of the SPS proposal. So, if I understand correctly, what you are saying is that the Namenode can handle satisfying storages using an RM-like mechanism, similar to how it handles replication. Yes, the current SPS works in a similar fashion, except for how movements are discovered when a policy changes: in this phase, SPS schedules the movements on the {{#satisfyStoragePolicy(path)}} API call. In general, the mismatch logic could live in the RM: the RM itself would detect the storage policy mismatches and schedule block movement commands, just like replication commands are sent in the replication case. But since the RM is a critical service, we chose not to touch it, and instead optimized a bit in SPS considering its semantics. Let me try to explain a little more about that. 
Actually, for storage mismatches, block finding should happen collectively for the replica set, since the policy applies to a block's whole replica set. Another point: how does SPS collect {{to_be_storage_movement_needed_blocks}}? To simplify things, we planned to expose a new API where the user specifies a path and internally we trigger satisfaction only for that path. To handle retries/cluster restarts we save an Xattr until SPS finishes its work. The Xattr overhead is kept minimal by some deduplication (1), which I will explain below. So, instead of loading all blocks into memory for the mismatch check, SPS loads blocks only when it is actually checking them. When SPS is invoked to satisfy a path, we track only the file InodeID. In the replication case it is good to track at block level, because any single replica can be missed and there is no need to check the other blocks in that file. In the SPS case, a policy change applies to all blocks in the file, so it makes sense to track just the file id in the queues. The general usage would be that the directories where the user changes the storage policy are the ones qualified for SPS processing. The recommendation is to set storage policies as optimally as possible; it may be more efficient to set them on directories instead of on individual files unless really necessary. This also avoids a large number of Xattrs in HSM. Since SPS is pointed at those same directories to satisfy, the directory Q keeps only the InodeID (long) list on which SPS intends to work to satisfy the mismatched blocks. It will not recursively load the files/blocks under that directory into memory immediately. The SPS thread picks elements (file Inodes to process) from an intermediate Q, whose capacity is bounded to 1000 elements. The front-Q processor fills up the intermediate Q only when it has empty slots; otherwise it will not load the file Inodes under the directory. So we do not unnecessarily load every file-Inode id into memory. 
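For illustration only, the two-level queueing described above (a directory Q of inode IDs plus a bounded intermediate Q of file inode IDs that is refilled only when it has empty slots) might be sketched roughly as follows; the class and method names are invented for this sketch and are not the branch code:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.LongFunction;

// Toy sketch (not HDFS branch code) of SPS's two-level queueing: directories
// are tracked by inode ID only, and file inode IDs are loaded lazily into a
// bounded intermediate queue, so a directory's children are not all pulled
// into memory up front.
public class SpsQueueSketch {
    static final int CAPACITY = 1000; // the bound mentioned in the comment

    final Queue<Long> directoryQ = new ArrayDeque<>();    // dirs marked for SPS
    final Queue<Long> intermediateQ = new ArrayDeque<>(); // file inodes awaiting the SPS thread

    /** Front-Q processor: pull the next directory only while there are empty slots. */
    void refill(LongFunction<long[]> listFileInodesUnder) {
        while (!directoryQ.isEmpty() && intermediateQ.size() < CAPACITY) {
            for (long fileId : listFileInodesUnder.apply(directoryQ.poll())) {
                intermediateQ.add(fileId); // real code would checkpoint progress for restarts
            }
        }
    }

    /** SPS thread: take the next file inode to check for policy mismatches. */
    Long next() {
        return intermediateQ.poll();
    }
}
```

The point of the sketch is only that memory is bounded by the intermediate Q capacity plus the per-directory ID list, not by the total number of files under the marked directories.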
Once the mismatches are identified for the set of blocks in a file, they are added to the Datanode descriptors as NN-to-DN commands; this is exactly the same as ReplicationMonitor. The DN then receives these commands and moves the blocks, similar to a transfer, as you explained. So, conceptually the approach is exactly the same as RM, with small optimizations like throttling. When assigning the tasks to a DN, it respects durability tasks: if replication tasks are already pending, they are given preference, and SPS block movements are not assigned as high-priority tasks. *How can we keep Xattr load minimal:* (1) The Xattr we add just marks the directory for SPS processing in case of restarts; otherwise it would be hard to scan the entire namespace to find mismatches. The Xattr object has a NameSpace enum, a String name, and a byte[] value. In this case the enum and name are the same for any directory we set, and the value is null; it is effectively a constant object. So we can create only one Xattr object per NN and use the same object ref for
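As a rough illustration of the deduplication idea above (invented class names, not the actual FSNamesystem/XAttr code), a single immutable marker object can be shared by reference across every marked directory:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch (not HDFS code): since the SPS marker xattr has a constant
// namespace/name and a null value, one immutable instance can be shared by
// every marked directory, so each directory costs only a reference.
public class SharedXAttrSketch {
    static final class XAttr {
        final String ns;
        final String name;
        final byte[] value;
        XAttr(String ns, String name, byte[] value) {
            this.ns = ns; this.name = name; this.value = value;
        }
    }

    // The single shared marker, analogous to "system.hdfs.sps" with a null value.
    static final XAttr SPS_MARKER = new XAttr("system", "hdfs.sps", null);

    final Map<Long, XAttr> markedDirs = new HashMap<>(); // inodeId -> shared marker

    void mark(long inodeId) {
        markedDirs.put(inodeId, SPS_MARKER); // no new XAttr allocation per directory
    }
}
```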
[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283102#comment-16283102 ] Uma Maheswara Rao G edited comment on HDFS-10285 at 12/8/17 8:44 AM: - {quote} Here's a rhetorical question: If managing multiple services is hard, why not bundle oozie, spark, storm, sqoop, kafka, ranger, knox, hive server, etc in the same process? Or ZK so HA is easier to deploy/manage? {quote} A few of my thoughts on this question: each of those projects is built for its own purpose, with its own spec, not just to help HDFS or any other single project. And none of those projects needs to access another project's internal data structures, whereas SPS functions only for HDFS and accesses its internal data structures. Even if forcibly separated out, we would need to expose 'for SPS only' RPC APIs. This prompts me to put the question the other way as well: does it make sense to separate ReplicationMonitor into its own process? Is it fine to start the EDEK work as a separate one? Is it ok to start other threads (like the decommissioning task) as separate processes and coordinate via RPC, so that the NameSystem class becomes very lightweight? I think value vs. cost decides whether to separate or merge into a single process. Coming to the ZK part: as ZK is not built only for HDFS, I don't think such thoughts apply; it is a general-purpose coordination system. Technically we can't keep monitoring services inside the NN, because the worry itself is that the NN may die and need failover, so an external process is needed to monitor it. Anyway, I think the whole discussion is about services inside a project, not across projects, IMHO. Here SPS provides only the missing functionality of HSM, that is, end-to-end policy satisfaction. So, IMV, it may not be worth it for users to manage an additional process to get that missing functionality for one particular feature. {quote} Today, I looked at the code more closely. It can hold the lock (read lock, but still) way too long. 
Notably, but not limited to, you can’t hold the lock while doing block placement. {quote} Appreciate your review, Daryn. I think it should be easy to address. We will make sure to address the comment before the merge; does that make sense? {quote} I should start sending bills to everyone who makes this fraudulent claim. FSDirectory#addToInodeMap imposes a nontrivial performance penalty even when SPS is not enabled. We had to hack out the similar EZ check because it had a noticeable performance impact esp. on startup. However now that we support EZ, I need to revisit optimizing it. {quote} Thanks for the review! Nice find. Fundamentally, if SPS is disabled we don't even need to load things into the queues, as no one will process them. So, adding a check for whether SPS is enabled can avoid even those enqueuing calls in the disabled case; it ends up as one extra boolean check when disabled. With this change the impact should be negligible, IIUC. We will take this comment. Thanks. {quote} I’m curious why it isn’t just part of the standard replication monitoring. If the DN is told to replicate to itself, it just does the storage movement. {quote} That's a good question. The overall approach is exactly the same as RM. RM has its own queue built up for redundant blocks, and the under-replication scan/check happens at block level, which makes sense there. Whereas in SPS, the policy changes per file, so all blocks in that file need movement, and the policy check should happen in coordination with where the replica blocks are currently stored. So we track the queues at file level here, and scan/check all blocks of a file together at once. Also, we wanted to provide an on-the-fly reconfigure feature, and we carefully considered that we don’t want to interfere with replication: replication logic should be given higher priority than SPS work. While scheduling blocks, we respect the xmits counts, which are shared between RM and SPS to control DN load. When sending tasks to a DN, assignment priority is given to replication/EC blocks first, then SPS blocks. 
So, as part of the impact analysis, we thought keeping SPS in its own thread would be cleaner and safer than running it in the same loop as RM.
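The assignment order described above (durability work first, SPS moves only in leftover xmit capacity) can be caricatured as follows; this is a toy model with invented names, not the actual NN/DatanodeManager code:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model (not HDFS code) of handing tasks to one datanode: pending
// replication/EC work consumes the shared xmits budget first, and SPS
// block moves only fill whatever slots remain.
public class TaskAssignmentSketch {
    final Queue<String> replicationTasks = new ArrayDeque<>(); // replication + EC reconstruction
    final Queue<String> spsTasks = new ArrayDeque<>();         // SPS block moves

    /** maxXmits models the per-DN transfer budget shared between RM and SPS. */
    List<String> assign(int maxXmits) {
        List<String> out = new ArrayList<>();
        while (out.size() < maxXmits && !replicationTasks.isEmpty()) {
            out.add(replicationTasks.poll()); // durability first
        }
        while (out.size() < maxXmits && !spsTasks.isEmpty()) {
            out.add(spsTasks.poll()); // SPS only with leftover capacity
        }
        return out;
    }
}
```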
[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281503#comment-16281503 ] Rakesh R edited comment on HDFS-10285 at 12/7/17 1:17 PM: -- Thanks a lot [~anu] for your time and comments. bq. This is the most critical concern that I have. In one of the discussions with SPS developers, they pointed out to me that they want to make sure an SPS move happens within a reasonable time. Apparently, I was told that this is a requirement from HBase. If you have such a need, then the first thing an admin will do is to increase this queue size. Slowly, but steadily SPS will eat into more and more memory of Namenode Increasing the Namenode Q will not help speed up block movements. It is the Datanode that does the actual block movements, so one needs to tune Datanode bandwidth to speed them up. Hence there is no sense in increasing the Namenode Q; in fact, that would simply add to the pending tasks on the Namenode side. Let me try to put down the memory usage of the Namenode Q. Assume there are 1 million directories and users invoked the {{dfs#satisfyStoragePolicy(path)}} API on all of them, which is a huge data movement and not a regular case. Again, assume that, without knowing the (lack of) benefit of increasing the Q size, a user sets the Q size to a high value of 1,000,000. Each API call adds an {{Xattr}} to represent the pending movement, and the NN maintains a list of pending dir InodeIds to satisfy the policy, each of which is a {{Long}} value. Each Xattr takes 15 chars {{"system.hdfs.sps"}} for the marking (note: the branch code uses {{system.hdfs.satisfy.storage.policy}}; we will shorten it to {{system.hdfs.sps}}). With that, the total space occupied is (xattr + inodeId) size. *(1) Xattr entry* Xattr: 12 bytes (object overhead) + 4 bytes (String reference) + 4 bytes (byte array reference) = aligned 24 bytes. String "system.hdfs.sps": 40 bytes (String object) + 15 bytes (chars) = 56 bytes. 
It is not creating a new String("system.hdfs.sps") object every time, so ideally the 56-byte count need not be charged per entry; still, I'm counting it. byte[]: 4 bytes. Total 84 bytes = (aligned 88 bytes * 1,000,000) = 83.923MB. Whether we keep SPS outside or inside the Namenode, this much memory will be occupied, as the xattribute is used to mark the pending item. *(2) Namenode Q* LinkedList entry = 24 bytes. Long object = 12 bytes (object overhead) + 8 bytes = aligned 24 bytes -- 48 bytes * 1,000,000 = 45.78MB -- approx 46MB, which I feel is a small percentage, and this would only occur in the misconfigured scenario where that many {{InodeIds}} are queued up. The recommended default Q size will be 1,000 or even 10,000: 48 bytes * 10,000 = 468.75KB, i.e. about 469KB, keeping the memory usage very low. Again, I am open to changing the default value (increase/decrease) based on feedback. Please feel free to correct me if I missed anything. Thanks! bq. We have an existing pattern Balancer, Mover, DiskBalancer where we have the "scan and move tools" as an external feature to namenode. I am not able to see any convincing reason for breaking this pattern. - {{Scanning}} - For scanning, CPU is the most consumed resource. IIUC, from your previous comments, I'm glad you agreed that CPU is not an issue; hence scanning is not a concern. If we run SPS outside, it has to make additional RPC calls for the SPS work, and on failover the newly active SPS-HA service would have to blindly scan the entire namespace to find the xattrs. To handle those switching scenarios, we would have to come up with some kind of awkward tweaking logic, like writing the xattr state somewhere in a file that the new active SPS service reads and continues from. With this, I feel it is better to keep the scanning logic in the NN. FYI, the NN has the existing EDEK feature which also does scanning, and we reuse the same code in SPS. Also, I'm re-iterating the point that SPS does not scan files on its own; the user has to call the API to satisfy a particular file. 
- {{Moving blocks}} - This essentially assigns the responsibility to the Datanode. Presently, the Namenode has several pieces of logic that do block movement - ReplicationMonitor, EC reconstruction, decommissioning, etc. We have also added a throttling mechanism for the SPS block movements, so as not to affect the existing data movements. - AFAIK, DiskBalancer runs completely on the Datanode and looks like a Datanode utility, so I don't think it compares with SPS. Coming to the Balancer: it doesn't need any input file paths and balances the HDFS cluster based on utilization. The Balancer can run independently as it doesn't
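The back-of-envelope figures in this comment can be re-derived mechanically; the per-entry byte counts below are the comment's own assumptions (aligned object sizes), not measured values:

```java
// Re-derives the memory estimates from the comment above. The per-entry
// byte counts are the comment's assumptions, not measurements.
public class SpsMemoryEstimate {
    static final long XATTR_ENTRY_BYTES = 88; // aligned: 24 (XAttr) + 56 (name String) + 4 (byte[] ref)
    static final long QUEUE_ENTRY_BYTES = 48; // 24 (LinkedList entry) + 24 (boxed Long)

    static double toMB(long bytes) { return bytes / 1024.0 / 1024.0; }
    static double toKB(long bytes) { return bytes / 1024.0; }

    public static void main(String[] args) {
        // ~83.92 MB of xattrs if 1,000,000 directories are marked
        System.out.printf("xattrs, 1M dirs: %.3f MB%n", toMB(XATTR_ENTRY_BYTES * 1_000_000L));
        // ~45.78 MB if the NN queue is (mis)configured to hold 1,000,000 inode IDs
        System.out.printf("queue, 1M ids:   %.2f MB%n", toMB(QUEUE_ENTRY_BYTES * 1_000_000L));
        // ~468.75 KB at the recommended default of 10,000
        System.out.printf("queue, 10k ids:  %.2f KB%n", toKB(QUEUE_ENTRY_BYTES * 10_000L));
    }
}
```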
[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16280634#comment-16280634 ] Anu Engineer edited comment on HDFS-10285 at 12/6/17 6:36 PM: -- @[~andrew.wang] Thanks for the comments. bq. Adding a new service requires adding support in management frameworks like Cloudera Manager or Ambari. This means support for deployment, configuration, monitoring, rolling upgrade, and log collection. I am not very familiar with these tools; I prefer to deploy my clusters without them. So help me here a bit: are you suggesting that we should decide whether a feature belongs inside the Namenode based on how inflexible these tools are? Why is it so hard for, say (I am presuming you are more familiar with Cloudera Manager), Cloudera Manager to configure a new service? Isn't the sole purpose of these tools to do this kind of management action? I am hopeful (again, my understanding of these tools is minimal) that they already have all the requisite framework in place, and that it is not as onerous as you describe to support a daemon running in the cluster. IMHO, if we base decisions about what goes into the Namenode on the code-modification complexity of these tools, I worry that we are putting an unusually complex burden on the Namenode. I suggest we do the right thing for the Namenode based on the constraints of our layer, and not bother about layers far above us. @[~vinayrpet] Thank you for sharing your perspective. bq. I'm coming at this from the standpoint of supporting Cloudera's Hadoop customers. Since I work for Hortonworks, I have a wealth of perspective on how customers tend to use these features. 
Most customers will start off with this tool as is; then they will discover that the queue length is not adequate for the move to happen in a reasonable time, they will increase the queue length, and then we will discover that the Namenode is running out of memory. The next step is that they will want us to run SPS based on various policies, like move the blocks if the blocks are older than 3 hours, or if the load on the Namenode is less than X, or if the number of YARN containers in the cluster is less than X. Slowly but steadily, customers will want complex policies. Here is the kicker: if SPS is inside the Namenode, each time some feature is added we are going to step into this huge argument about whether we should have these complex features inside the Namenode. So experience from Hortonworks customers tells me that we should prioritize scale and the future needs of this feature rather than ease of code change for management tools. bq. IIUC, main concerns to keep SPS (or any other such feature) in NameNode are the following. I think you missed a critical argument: all scan-and-move functions of HDFS today are outside the Namenode. I am proposing that we keep it that way. SPS is not unique in any way, and we have a well-known pattern that works. In my mind, management tools like Ambari should be able to address the ease-of-use part. For people like me who are willing to use the shell, this does not seem to be an additional burden. bq. 1. Locking Load This same process can be done from outside the Namenode. Hence we are proposing that we move it outside. bq. SPS should have a client-facing RPC server to learn about the paths to satisfy policy. This comes with a lot of deployment overhead as already mentioned above by Andrew. I seriously question this assertion. From a shell perspective, we can check whether this config value is set and start the daemon from start-dfs.sh. Why is this such a complicated task for Cloudera Manager or Ambari? I do not buy this argument.
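The "check if this config value is set and start this daemon" point above can be sketched in a few lines. This is a hypothetical illustration only: the key name `dfs.sps.enabled` and the `SpsService`/`SpsLauncher` classes are invented stand-ins, not actual Hadoop configuration keys or classes.

```java
import java.util.Properties;

// Hedged sketch of conditionally starting an SPS daemon based on a config
// flag, the way start-dfs.sh gates optional services. All names are
// illustrative assumptions, not real Hadoop identifiers.
public class SpsLauncher {

    // Stand-in for the SPS daemon; a real service would host an RPC server.
    static class SpsService extends Thread {
        @Override public void run() { /* serve satisfy-policy requests */ }
    }

    /** Start the SPS daemon only when the admin has enabled it in config. */
    static SpsService maybeStartSps(Properties conf) {
        if (!Boolean.parseBoolean(conf.getProperty("dfs.sps.enabled", "false"))) {
            return null; // feature off: nothing extra to deploy or monitor
        }
        SpsService sps = new SpsService();
        sps.start();
        return sps;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(maybeStartSps(conf) == null); // true: off by default
        conf.setProperty("dfs.sps.enabled", "true");
        System.out.println(maybeStartSps(conf) != null); // true: daemon started
    }
}
```

A management tool would only need to surface the single boolean, which is the substance of the "5 lines of code" claim.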
How can something that can be done in 5 lines of code in Hadoop become a complex task whose code path we would want to avoid in Cloudera Manager? I am sorry, that makes no sense to me. bq. if SPS doesn't have its own RPC server, then it needs to scan the targets by checking for xattr recursively from the root (/) directory What prevents us from adding this? We should do what is technically required. The problem I think you are missing is that the current SPS has no policy control over when it should run. But I posit that it is not too far off that we will have to build various kinds of policies to control it. I am not suggesting that we need to do that before the merge. Being an independent service allows for this kind of flexibility. bq. Memory This is the most critical concern that I have. In one of the discussions with SPS developers, they pointed out to me that they want to make sure an SPS move happens within a reasonable time. Apparently, I was told that this is a requirement from HBase. If you have such a need, then the first thing an admin will do is to increase this queue size. Slowly, but steadily SPS will eat into more and
[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278844#comment-16278844 ] Uma Maheswara Rao G edited comment on HDFS-10285 at 12/5/17 4:52 PM: - [~chris.douglas] do you have an opinion here on how to move forward? Appreciate others' thoughts as well. # One is to keep SPS running in the Namenode, as is done now. This avoids any operational cost and additional process maintenance (compared to keeping it outside). Throttling would try to control the additional burden on the Namenode. SPS is a kind of extension to the HSM feature for more usability; as HSM is a Namenode feature, it makes sense to keep it in the Namenode itself. # The other thought is to run SPS outside as an independent process, to avoid any burden on the Namenode due to SPS. Like the Balancer/DiskBalancer, SPS also moves blocks, so it makes sense to run it as a separate process. On the other side, this could increase RPC calls to the Namenode for getting file metadata while processing and for other coordination, and it adds process maintenance cost for this additional process from the deployment perspective. Many other points are discussed above for more information.
> Storage Policy Satisfier in Namenode > > > Key: HDFS-10285 > URL: https://issues.apache.org/jira/browse/HDFS-10285 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, namenode >Affects Versions: HDFS-10285 >Reporter: Uma Maheswara Rao G >Assignee: Uma Maheswara Rao G > Attachments: HDFS-10285-consolidated-merge-patch-00.patch, > HDFS-10285-consolidated-merge-patch-01.patch, > HDFS-10285-consolidated-merge-patch-02.patch, > HDFS-10285-consolidated-merge-patch-03.patch, > HDFS-SPS-TestReport-20170708.pdf, > Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, > Storage-Policy-Satisfier-in-HDFS-May10.pdf, > Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf > > > Heterogeneous storage in HDFS introduced the concept of storage policy. These > policies can be set on directory/file to specify the user preference, where > to store the physical block. When user set the storage policy before writing > data, then the blocks could take advantage of storage policy preferences and > stores physical block accordingly. > If user set the storage policy after writing and completing the file, then > the blocks would have been written with default storage policy (nothing but > DISK). User has to run the ‘Mover tool’ explicitly by specifying all such > file names as a list. In some distributed system scenarios (ex: HBase) it > would be difficult to collect all the files and run the tool as different > nodes can write files separately and file can have different paths. 
> Another scenarios is, when user rename the files from one effected storage > policy file (inherited policy from parent directory) to another storage > policy effected directory, it will not copy inherited storage policy from > source. So it will take effect from destination file/dir parent storage > policy. This rename operation is just a metadata change in Namenode. The > physical blocks still remain with source storage policy. > So, Tracking all such business logic based file names could be difficult for > admins from distributed nodes(ex: region servers) and running the Mover tool. > Here the proposal is to provide an API from Namenode itself for trigger the > storage policy satisfaction. A Daemon thread inside Namenode should track > such calls and process to DN as movement commands. > Will post the detailed design thoughts document soon. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
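The proposal quoted above (an API that triggers satisfaction, plus a daemon thread inside the Namenode that tracks such calls and turns them into DN movement commands) can be sketched roughly as a producer/consumer queue. All names here (`SatisfierQueue`, `buildMoveCommand`, the `MOVE_BLOCKS` string) are illustrative assumptions, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Rough sketch: the client-facing call only enqueues a path; a daemon thread
// would drain the queue and issue movement commands to Datanodes. This is a
// hedged illustration of the design described in the JIRA, not SPS code.
public class SatisfierQueue {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();

    /** Client-facing trigger: asynchronous, returns immediately. */
    public void satisfyStoragePolicy(String path) {
        pending.add(path);
    }

    /** Stand-in for translating a path into a DN movement command. */
    static String buildMoveCommand(String path) {
        return "MOVE_BLOCKS " + path;
    }

    /** What the daemon thread would do on each pass: drain and dispatch. */
    public List<String> drainToCommands() {
        List<String> cmds = new ArrayList<>();
        String p;
        while ((p = pending.poll()) != null) {
            cmds.add(buildMoveCommand(p));
        }
        return cmds;
    }

    public static void main(String[] args) {
        SatisfierQueue q = new SatisfierQueue();
        q.satisfyStoragePolicy("/hbase/wal");
        q.satisfyStoragePolicy("/archive/logs");
        System.out.println(q.drainToCommands());
        // [MOVE_BLOCKS /hbase/wal, MOVE_BLOCKS /archive/logs]
    }
}
```

The asynchrony debated throughout this thread falls out of this shape: the caller gets no completion signal, only the enqueue.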
[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274977#comment-16274977 ] Anu Engineer edited comment on HDFS-10285 at 12/1/17 9:08 PM: -- bq. Is it trivial? I think we still need some type of fencing so there's only one active SPS. Does this use zookeeper, like NN HA? Yes, that would be the simplest approach to getting SPS HA. bq. If there's an SPS failover, how does the new active know where to resume? Once the active knows it is the leader, it can read the state from the NN and continue. The issues of continuity are exactly the same whether it is inside the NN or outside. bq. I'm also wondering how progress is tracked, so we can resume without iterating over significant portions of the namespace. As soon as a block is moved, the move call updates the status of the block move; that is, the NN is up to date with that info. Each time there is a call to the SPS API, the NN will keep track of it, and the updates after each move let us filter the remaining blocks. bq. I also like centralized control when it comes to coordinating block work. The NN schedules and prioritizes block work on the cluster. Already it's annoying to users to have to configure a separate set of resource throttles for the balancer work, and it makes the system less reactive to cluster health events. We'd much rather have a single resource allocation for all cluster maintenance work, which the NN can use however it wants based on its priority. By that argument, the Balancer should be the first tool that moves into the Namenode, and then the DiskBalancer. Right now, the SPS approach follows what we already do in the HDFS world: block moves are achieved through an async mechanism. If you would like to provide a generic block-mover mechanism in the Namenode and then port the Balancer and DiskBalancer, you are most welcome; I will be glad to move SPS to that framework when we have it. bq. What is the concern about NN overhead, for this feature in particular? 
This is similar to what I asked Uma earlier about the coordinator DN; I don't think it meaningfully shifts work off the NN. There are a couple of concerns: # Following the established pattern of Balancer, Mover, DiskBalancer, etc. # Memory and CPU overhead in the Namenode. # Future directions -- if we have to support finer mechanisms like smart storage management, moving data into provided blocks, etc., it is better for this to run as an independent service. And most important, we are just accelerating an SPS future work item; making SPS separate has been the plan all along, so we are just achieving that goal before the merge. Nothing fundamentally changes about SPS.
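The fencing question raised above ("only one active SPS") is typically answered with an epoch or fencing token: each newly elected leader takes a strictly larger epoch, and the NN drops work tagged with a stale one. A minimal sketch under that assumption, with invented names and no real ZooKeeper wiring:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hedged illustration of epoch-based fencing for SPS HA. Class and method
// names (SpsFencing, becomeLeader, acceptWork) are hypothetical; a real
// deployment would obtain the epoch from a leader election (e.g. ZooKeeper,
// as for NN HA) rather than from a local counter.
public class SpsFencing {
    private final AtomicLong currentEpoch = new AtomicLong(0);

    /** Called by a newly elected SPS instance after winning the election. */
    public long becomeLeader() {
        return currentEpoch.incrementAndGet();
    }

    /** NN-side check: drop work items from a deposed (stale-epoch) SPS. */
    public boolean acceptWork(long senderEpoch) {
        return senderEpoch == currentEpoch.get();
    }

    public static void main(String[] args) {
        SpsFencing nn = new SpsFencing();
        long oldLeader = nn.becomeLeader();   // epoch 1
        long newLeader = nn.becomeLeader();   // failover: epoch 2
        System.out.println(nn.acceptWork(oldLeader)); // false: fenced off
        System.out.println(nn.acceptWork(newLeader)); // true
    }
}
```

This is why "read the state from NN and continue" is safe even if the old active briefly lingers: its stale-epoch requests are rejected.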
[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220276#comment-16220276 ] Rakesh R edited comment on HDFS-10285 at 10/26/17 10:35 AM: Uploaded another version of the SPS design document; tried to capture the details to reflect the recent code changes. Feedback welcome! Thank you [~umamaheswararao] for co-authoring the doc. Thank you [~andrew.wang], [~surendrasingh], [~eddyxu], [~xiaochen], [~anoop.hbase], [~ram_krish] for the useful discussions/comments.
[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129525#comment-16129525 ] Uma Maheswara Rao G edited comment on HDFS-10285 at 8/17/17 5:36 AM: - Hi [~andrew.wang] Thank you for helping us a lot in reviews. Really great points. {quote} This would be a user periodically asking for status. From what I know of async API design, callbacks are preferred over polling since it solves the question about how long the server needs to hold the status. I'd be open to any proposal here, I just think the current "isSpsRunning" API is insufficient. Did you end up filing a ticket to track this? {quote} From an async API design perspective, I agree that systems would have callback-registration APIs. I think we don't have that callback mechanism design in place in HDFS. In this particular case, the user is not waiting on any processing; this is just a trigger to the system to start some built-in functionality. In fact, the isSpsRunning API was added just for users to make sure the built-in SPS is not running if they want to run the Mover tool explicitly. I filed JIRA HDFS-12310 to discuss this more. I really don't know if it's a good idea to encourage users to periodically poll the system for this status. IMO, if movements are really failing (probably some storages are unavailable or some storages failed, etc.), administrator action is definitely required, rather than the user component knowing the status and taking action itself. So, I strongly believe reporting failures as metrics will definitely bring them to the admin's attention. Since we don't want to enable it as automatic movement in the first stage, there should be some trigger to start the movement. Some work related to an async HDFS API is happening at HDFS-9924; probably we could take some design thoughts from there, once they are in, for a status API? Another argument is that we already have async-fashioned APIs, for example delete or setReplication.
Even from the NN-call perspective these may be sync calls, but from the user's perspective a lot of work still happens asynchronously. If we delete a file, the NN does its cleanup and adds the blocks for deletion; all the block deletions happen asynchronously. Users trust HDFS that the data will be cleaned, and we don't have a status-reporting API for that. If we change the replication, we change it at the NN and replication is eventually triggered; I don't think users poll on whether replication is done or not. As it is HDFS's functionality to replicate, they just rely on it. If replications are failing, then admin action is definitely required to fix them; usually admins depend on fsck or metrics. Let's discuss more on JIRA HDFS-12310? In the end, I am not saying we should not have status reporting; I feel that's a good-to-have requirement. Do you have some use cases on how the application system (ex: HBase; [~anoopsamjohn] has provided some use cases above for SPS) would react to status results? {quote} If I were to paraphrase, the NN is the ultimate arbiter, and the operations being performed by C-DNs are idempotent, so duplicate work gets dropped safely. I think this still makes it harder to reason about from a debugging POV, particularly if we want to extend this to something like EC conversion that might not be idempotent. {quote} We already do reconstruction work in EC in a way similar to the C-DN: all block-group blocks are reconstructed at one DN, and there too that node is chosen loosely. Here we just named it a C-DN and send more blocks as a logical batch (in this case, all blocks associated with a file); in the EC case, we send a block group's blocks. Coming to idempotency: even today we do EC reconstruction in an idempotent way. I feel we can definitely handle such cases, as conversion of the whole file should complete before we convert contiguous blocks to stripe mode at the NN. Whoever finishes first, that result will be updated to the NN.
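The "whoever finishes first wins" idempotency argued above can be sketched as a putIfAbsent on the NN side: the first reported result for a block is recorded, and later duplicates are dropped harmlessly. All names here are hypothetical illustrations, not NN code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hedged sketch of first-finisher-wins commit semantics: re-executing the
// same C-DN work is safe because only the first report per block id lands.
public class IdempotentCommit {
    private final ConcurrentMap<Long, String> committed = new ConcurrentHashMap<>();

    /** Returns true only for the first reported result of a given block id. */
    public boolean reportConverted(long blockId, String resultLocation) {
        return committed.putIfAbsent(blockId, resultLocation) == null;
    }

    public static void main(String[] args) {
        IdempotentCommit nn = new IdempotentCommit();
        System.out.println(nn.reportConverted(1001L, "dn1:ARCHIVE")); // true
        System.out.println(nn.reportConverted(1001L, "dn2:ARCHIVE")); // false: duplicate dropped
        System.out.println(nn.reportConverted(1002L, "dn3:ARCHIVE")); // true
    }
}
```

Under this scheme, duplicate or racing C-DN work resolves deterministically without coordination between the C-DNs themselves.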
Once the NN has already converted the blocks, it should not accept newly converted block groups. But this should anyway be a different discussion. I just wanted to point out another use case, HDFS-12090; I see that JIRA wants to adopt this model for move work. {quote} I like the idea of offloading work in the abstract, but I don't know how much work we really offload in this situation. The NN still needs to track everything at the file level, which is the same order of magnitude as the block level. The NN is still doing block management and processing IBRs for the block movement. Distributing tracking work to the C-DNs adds latency and makes the system more complicated. {quote} I don't really see any extra latencies involved. Anyway, work has to be sent to DNs individually. Along with that, we send a batch to one DN first; that DN does its own work as well as asking other DNs to transfer the blocks. Handling block level still keeps the requirement of tracking
[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106283#comment-16106283 ] Uma Maheswara Rao G edited comment on HDFS-10285 at 7/31/17 5:46 PM: - [~andrew.wang] Thanks a lot, Andrew, for spending time on review and for the very valuable comments. Please find my replies to the comments/questions. {quote}The "-satisfyStoragePolicy" command is asynchronous. One difficulty for async APIs is status reporting. "-isSpsRunning" doesn't give much insight. How does a client track the progress of their request? How are errors propagated? A client like HBase can't read the NN log to find a stacktrace. Section 5.3 lists some possible errors for block movement on the DN. It might be helpful to think about NN-side errors too: out of quota, out of capacity, other BPP failure, slow/stuck SPS tasks, etc. {quote} Interesting question, and we thought about this, but it's pretty hard to communicate statuses back to the user. IMO, this async API is basically a facility for the user to trigger HDFS to start satisfying the blocks as per the storage policy set. For example, if we enable automatic movements in the future, error statuses will not be reported to users; it is HDFS's responsibility to satisfy the policy as far as possible when it is changed. One possible way for admins to notice failures would be via metrics reporting. I am also thinking of providing an option in the fsck command to check the current pending/in-progress status. I understand this kind of status tracking may be useful for SSM-like systems to act upon, say raising alarm alerts, etc. But an HBase-like system may not take any action in its business logic even if movement statuses are failures. Right now, HDFS itself will keep retrying until it satisfies the policy. {quote}It might be helpful to think about NN-side errors too: out of quota, out of capacity, other BPP failure, slow/stuck SPS tasks, etc.{quote} Sure, let me think about whether there are possible conditions here.
Ideally SPS does not deal with namespace changes (except adding an Xattr for internal-use purposes), but it does move data to different volumes at the DN. We will think about collecting possible metrics from the NN side as well, specifically in ERROR conditions. {quote}Rather than using the acronym (which a user might not know), maybe rename "-isSpsRunning" to "-isSatisfierRunning" ?{quote} Makes sense; we will change that. {quote}How is leader election done for the C-DN? Is there some kind of lease system so an old C-DN aborts if it can't reach the NN? This prevents split brain.{quote} Here we choose the C-DN loosely: we just pick the first source in the list. The C-DN sends back IN_PROGRESS pings every 5 minutes (via heartbeat). If there are no IN_PROGRESS pings and the timeout dfs.storage.policy.satisfier.self.retry.timeout.millis elapses, then the NN will just choose another C-DN and reschedule. Even if the older C-DN comes back, on re-registration we send a dropSPSWork request to the DNs, which prevents two C-DNs running. {quote}Any plans to trigger the satisfier automatically on events like rename or setStoragePolicy? When I explain HSM to users, they're often surprised that they need to trigger movement manually. Here, it's easier since it's Mover-as-a-service, but still manually triggered. {quote} Actually, this is our long-term plan. To simplify the solution, we target implementing the first phase with manual triggering. Once the current code base is performing well and is stable enough, in follow-up work we will enable automatic triggering. To avoid missing requirements, I will add this task in a follow-up JIRA. {quote}Docs say that right now the user has to trigger SPS tasks recursively for a directory. Why? I believe the Mover works recursively. xiaojian is doing some work on HDFS-10899 that involves an efficient recursive directory iteration, maybe can take some ideas from there. {quote} IIRC, we actually intentionally restricted recursive operation.
We wanted to be more careful about NN overheads. If some user accidentally calls it on the root directory, it may trigger a lot of unnecessary overlapping scans. In the Mover's case, it runs outside, so all scan overheads stay outside the NN. So here, if a user really requires recursive policy satisfaction, they can invoke it recursively themselves (which can't happen accidentally). I agree that allowing recursive invocation would make things much easier for users when they need recursive execution; the only constraint we considered was keeping the operation as lightweight as possible. Also, I looked at HDFS-10899. If I understand correctly, zones are already available there and a re-encryption zone is expected to be an existing zone, which makes life easier. {code} + * Re-encrypts the given encryption zone path. If the given path is not the + * root of an encryption zone, an exception is thrown. + */ + XAttr reencryptEncryptionZone(final
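The C-DN liveness scheme described earlier in this comment (IN_PROGRESS pings every 5 minutes; reschedule on another C-DN once dfs.storage.policy.satisfier.self.retry.timeout.millis elapses with no ping) can be sketched as a last-ping table. Class and method names here are illustrative assumptions, not SPS code.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the NN-side timeout check for a coordinator Datanode:
// record the time of each IN_PROGRESS ping, and declare the batch eligible
// for rescheduling once the configured timeout passes with no ping.
public class CdnLivenessTracker {
    private final long timeoutMillis;
    private final Map<String, Long> lastPing = new HashMap<>();

    CdnLivenessTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Record an IN_PROGRESS ping carried on the C-DN's heartbeat. */
    void recordInProgressPing(String cdn, long nowMillis) {
        lastPing.put(cdn, nowMillis);
    }

    /** True when the batch should be handed to a different C-DN. */
    boolean shouldReschedule(String cdn, long nowMillis) {
        Long last = lastPing.get(cdn);
        return last == null || nowMillis - last > timeoutMillis;
    }

    public static void main(String[] args) {
        CdnLivenessTracker nn = new CdnLivenessTracker(300_000); // 5 minutes
        nn.recordInProgressPing("dn-7", 0);
        System.out.println(nn.shouldReschedule("dn-7", 60_000));  // false: recent ping
        System.out.println(nn.shouldReschedule("dn-7", 400_000)); // true: timed out
    }
}
```

The dropSPSWork-on-re-registration step mentioned in the comment is the complementary half: it ensures a timed-out C-DN that returns does not keep executing the old batch.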
[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106283#comment-16106283 ] Uma Maheswara Rao G edited comment on HDFS-10285 at 7/31/17 5:40 PM: - [~andrew.wang] Thanks a lot Andrew for spending time on review and for very valuable comments. Please find my replies to the comments/questions. {quote}The "-satisfyStoragePolicy" command is asynchronous. One difficulty for async APIs is status reporting. "-isSpsRunning" doesn't give much insight. How does a client track the progress of their request? How are errors propagated? A client like HBase can't read the NN log to find a stacktrace. Section 5.3 lists some possible errors for block movement on the DN. It might be helpful to think about NN-side errors too: out of quota, out of capacity, other BPP failure, slow/stuck SPS tasks, etc. {quote} Interesting question and we thought about this, but its pretty hard to communicate back to user about statuses. IMO, this async api is basically a facility to user to trigger HDFS to start satisfying the blocks as per the storage policy set. Example if we enable automatic movements in future, errors status will not be reported to users. Its HDFS responsibility to satisfy as possible as when policy changed. One possible way for admins to notice the failures would be via metrics reporting. I am also thinking to provide option in fsck command to check the current pending/in-progress status. I understand, this kind of status tracking may be useful in the case of SSM kind of systems to act upon, say raising alarm alerts etc. But HBase kind of system may not take any action from its business logic even of movement statuses are failures. Right now, HDFS itself will keep retry until it satisfies. {quote}It might be helpful to think about NN-side errors too: out of quota, out of capacity, other BPP failure, slow/stuck SPS tasks, etc.{quote} Sure, let me think on this if there are possible conditions. 
Ideally SPS does not change the namespace (except for adding an Xattr for internal bookkeeping), but it does move data between volumes at the DN. We will think about collecting possible metrics from the NN side as well, specifically for ERROR conditions.
{quote}Rather than using the acronym (which a user might not know), maybe rename "-isSpsRunning" to "-isSatisfierRunning" ?{quote}
Makes sense, we will change that.
{quote}How is leader election done for the C-DN? Is there some kind of lease system so an old C-DN aborts if it can't reach the NN? This prevents split brain.{quote}
Here we choose the C-DN loosely: we just pick the first source in the list. The C-DN sends back IN_PROGRESS pings every 5 minutes (via heartbeat). If no IN_PROGRESS ping arrives and the dfs.storage.policy.satisfier.self.retry.timeout.millis timeout elapses, the NN will simply choose another C-DN and reschedule. Even if the older C-DN comes back, on re-registration we send a dropSPSWork request to the DNs, which prevents two C-DNs from running at once.
{quote}Any plans to trigger the satisfier automatically on events like rename or setStoragePolicy? When I explain HSM to users, they're often surprised that they need to trigger movement manually. Here, it's easier since it's Mover-as-a-service, but still manually triggered. {quote}
Actually, this is our long-term plan. To simplify the solution, we are targeting a first phase with manual triggering. Once the current code base is performing well and is stable enough, we will work on enabling automatic triggering as a follow-up. To avoid losing this requirement, I will add the task to a follow-up JIRA.
{quote}Docs say that right now the user has to trigger SPS tasks recursively for a directory. Why? I believe the Mover works recursively. xiaojian is doing some work on HDFS-10899 that involves an efficient recursive directory iteration, maybe can take some ideas from there. {quote}
IIRC, we intentionally restricted the recursive operation.
We wanted to be careful about NN overheads. If some user accidentally calls it on the root directory, it may trigger a lot of unnecessary overlapping scans. In the Mover case, the tool runs outside the NN, so all scan overheads stay outside the NN. Here, if a user really requires recursive policy satisfaction, they can invoke it recursively themselves (so it cannot happen accidentally). I agree that allowing recursion would make things much easier for users who need recursive execution; the only constraint we considered was keeping the operation as lightweight as possible. I also looked at HDFS-10899. If I understand correctly, zones are already available there and re-encryption expects the path to be the root of an existing zone, which makes life easier in that case. {code} + * Re-encrypts the given encryption zone path. If the given path is not the + * root of an encryption zone, an exception is thrown. + */ + XAttr reencryptEncryptionZone(final
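The C-DN liveness scheme described above (IN_PROGRESS heartbeat pings, self-retry timeout, reschedule to another source) can be sketched as follows. This is a minimal, self-contained illustration with hypothetical class and method names, not the actual SPS implementation: the NN remembers when the last IN_PROGRESS report arrived and, once the timeout elapses without one, moves on to the next candidate datanode.

```java
/**
 * Sketch (hypothetical names) of the coordinator-datanode (C-DN) liveness
 * check: the C-DN reports IN_PROGRESS via heartbeat; if no report arrives
 * within the self-retry timeout, the NN picks the next candidate source
 * and restarts the clock.
 */
class CoordinatorTracker {
    private final long timeoutMillis;       // e.g. dfs.storage.policy.satisfier.self.retry.timeout.millis
    private long lastInProgressMillis;      // time of the last IN_PROGRESS ping
    private int coordinatorIndex = 0;       // position in the source-DN list

    CoordinatorTracker(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastInProgressMillis = nowMillis;
    }

    /** Called when an IN_PROGRESS ping arrives from the current C-DN. */
    void onInProgress(long nowMillis) {
        lastInProgressMillis = nowMillis;
    }

    /**
     * Called periodically by the NN; if the timeout elapsed with no ping,
     * advance to the next source datanode and report that a reschedule
     * happened.
     */
    boolean maybeReschedule(long nowMillis, int numCandidates) {
        if (nowMillis - lastInProgressMillis < timeoutMillis) {
            return false;  // current C-DN is still considered alive
        }
        coordinatorIndex = (coordinatorIndex + 1) % numCandidates;
        lastInProgressMillis = nowMillis;
        return true;       // work rescheduled on a new C-DN
    }

    int currentCoordinator() {
        return coordinatorIndex;
    }
}
```

The dropSPSWork-on-re-registration step mentioned above is what keeps a returning old C-DN from competing with the newly chosen one; this sketch only models the timeout side.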
[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104722#comment-16104722 ] Rakesh R edited comment on HDFS-10285 at 7/28/17 9:44 AM: --
Thanks a lot [~andrew.wang] for the reviews.
bq. Should dfs.storage.policy.satisfier.activate default to false for now?
During NN startup, the SPS feature initializes the satisfier thread, which stays idle until a user asks to satisfy the storage policy for a given path. So there is no overhead to the Namenode process while the thread is idle. Also, the system test report (attached to the jira) and the Jenkins unit testing results show the feature is stable. Since we have a dynamic on/off switch via reconfig, I'm OK with disabling it by default if there is any concern; should I?
bq. Might also rename this to "enabled" rather than "activate" to align with other previous config keys.
Agreed, I will raise a sub-task and change the configuration name asap.
bq. What happens during a rolling upgrade? Will DNs ignore the unknown message, and NN handle this correctly?
Yes, the DN will ignore the unknown message. On the other side, the NN will wait a configured amount of time for the block movement response; if there is no response after this period, the NN will retry scheduling the block movement. So there are no issues with a rolling upgrade.
bq. On downgrade, I assume the xattrs just stay there ignored.
Yes, exactly, they will be ignored.
> Storage Policy Satisfier in Namenode > > > Key: HDFS-10285 > URL: https://issues.apache.org/jira/browse/HDFS-10285 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, namenode >Affects Versions: HDFS-10285 >Reporter: Uma Maheswara Rao G >Assignee: Uma Maheswara Rao G > Attachments: HDFS-10285-consolidated-merge-patch-00.patch, > HDFS-10285-consolidated-merge-patch-01.patch, > HDFS-SPS-TestReport-20170708.pdf, > Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, > Storage-Policy-Satisfier-in-HDFS-May10.pdf > > > Heterogeneous storage in HDFS introduced the concept of storage policies. These > policies can be set on a directory/file to specify the user's preference for where > to store the physical blocks. When the user sets the storage policy before writing > data, the blocks can take advantage of the storage policy preference and be > stored accordingly. > If the user sets the storage policy after writing and completing the file, the > blocks would have been written with the default storage policy (nothing but > DISK). The user has to run the ‘Mover tool’ explicitly, specifying all such > file names as a list.
In some distributed system scenarios (e.g. HBase) it > would be difficult to collect all the files and run the tool, as different > nodes can write files separately and the files can have different paths. > Another scenario is when a user renames a file with one storage policy in effect > (a policy inherited from the parent directory) into a directory with a different storage > policy in effect: the inherited storage policy is not copied from the > source, so the destination file/dir's parent storage policy takes > effect. This rename operation is just a metadata change in the Namenode; the > physical blocks still remain placed per the source storage policy. > So, tracking all such business-logic-driven file names from distributed > nodes (e.g. region servers) and running the Mover tool could be difficult for admins. > Here the proposal is to provide an API in the Namenode itself to trigger > storage policy satisfaction. A Daemon thread inside the Namenode should track > such calls and send movement commands to the DNs. > Will post the detailed design thoughts document soon.
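The rolling-upgrade answer above relies on a simple NN-side mechanism: every scheduled block movement is tracked, and if no response arrives within a configured window (for instance because an older-release DN silently ignored the unknown command), the movement is re-queued. A minimal sketch with hypothetical names, not the actual SPS classes:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/**
 * Sketch (hypothetical names) of the retry behavior described above: the NN
 * records when each block movement was scheduled; entries with no reported
 * result inside the response window are moved to a retry queue for
 * rescheduling.
 */
class BlockMovementRetryQueue {
    private final long responseTimeoutMillis;
    private final Map<Long, Long> pending = new HashMap<>(); // blockId -> schedule time
    private final Deque<Long> retryQueue = new ArrayDeque<>();

    BlockMovementRetryQueue(long responseTimeoutMillis) {
        this.responseTimeoutMillis = responseTimeoutMillis;
    }

    /** NN dispatched a movement command for this block. */
    void schedule(long blockId, long nowMillis) {
        pending.put(blockId, nowMillis);
    }

    /** DN reported a result (success or failure); the movement is no longer pending. */
    void onResult(long blockId) {
        pending.remove(blockId);
    }

    /** Move timed-out entries to the retry queue; returns how many were re-queued. */
    int expireAndRequeue(long nowMillis) {
        int requeued = 0;
        Iterator<Map.Entry<Long, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Long> e = it.next();
            if (nowMillis - e.getValue() >= responseTimeoutMillis) {
                retryQueue.add(e.getKey());
                it.remove();
                requeued++;
            }
        }
        return requeued;
    }

    Deque<Long> retryQueue() {
        return retryQueue;
    }
}
```

Because retries are idempotent (re-moving an already-satisfied block is a no-op), a DN that ignored the first command is simply retried until it understands it, which is why the upgrade order does not matter here.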
[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16087186#comment-16087186 ] Rakesh R edited comment on HDFS-10285 at 7/16/17 5:09 AM: --
Thank you all the contributors in making this feature. I have finished a pass rebasing all the changes made in HDFS-10285 sub-tasks. Uploading the consolidated patch to the umbrella jira to get the QA report.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403588#comment-15403588 ] Yuanbo Liu edited comment on HDFS-10285 at 8/2/16 8:17 AM: ---
[~umamaheswararao] Great proposal, thanks for your work. I have two questions about your design:
1. {quote} When user calls satisfyStoragePolicy(src) API {quote} Is this API only available from a Java program, or is it also invoked by default when the user runs this command: {code} hdfs storagepolicies -setStoragePolicy -path <path> -policy <policy> {code} ?
2. What if inodes exist in toBeSatisfiedStoragePolicyList while the "mover tool" takes effect on the directory that contains those inodes?
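For readers following the design-doc terminology in the question above, here is a minimal sketch of how a tracking list like toBeSatisfiedStoragePolicyList could behave. The class and method names are hypothetical (the real design keeps this state inside the NN and persists it via xattrs): satisfyStoragePolicy(src) records the inode once, duplicate requests are no-ops, and the satisfier daemon drains the list in arrival order.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;

/**
 * Sketch (hypothetical names) of the NN-side tracking list quoted above:
 * satisfyStoragePolicy(src) adds the inode; the satisfier daemon thread
 * polls entries in FIFO order and turns them into DN movement commands.
 */
class SatisfyStoragePolicyList {
    // LinkedHashSet gives both dedup and insertion-order draining.
    private final LinkedHashSet<Long> toBeSatisfied = new LinkedHashSet<>();

    /** Returns false if the inode is already tracked (duplicate request). */
    synchronized boolean add(long inodeId) {
        return toBeSatisfied.add(inodeId);
    }

    /** Removes and returns the oldest tracked inode, or null when empty. */
    synchronized Long poll() {
        Iterator<Long> it = toBeSatisfied.iterator();
        if (!it.hasNext()) {
            return null;
        }
        Long id = it.next();
        it.remove();
        return id;
    }

    synchronized int size() {
        return toBeSatisfied.size();
    }
}
```

This does not answer the concurrency question raised in the comment; it only illustrates the data structure the question refers to.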
[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281164#comment-15281164 ] Uma Maheswara Rao G edited comment on HDFS-10285 at 5/13/16 4:51 AM: -
Attached the initial version of the document. Please help review it, and we can improve the document based on feedback. Thanks [~rakeshr] for co-authoring the design doc. Thanks [~anoopsamjohn], [~drankye], [~ram_krish], [~jingcheng...@intel.com] for helping with reviews. Thanks, Uma & Rakesh