[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2018-02-18 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368648#comment-16368648
 ] 

Rakesh R edited comment on HDFS-10285 at 2/18/18 7:37 PM:
--

We have worked on the comments; the following is a quick update.

{{Comment-1) }} => DONE via HDFS-13097
 {{Comment-2) }} => DONE via HDFS-13097
 {{Comment-5)}}   => DONE via HDFS-13097
 {{Comment-8) }} => DONE via HDFS-13097
 {{Comment-10)}} => DONE via HDFS-13110
 {{Comment-11)}} => DONE via HDFS-13097
 {{Comment-12)}} => DONE via HDFS-13097
 {{Comment-13)}} => DONE via HDFS-13110
 {{Comment-15)}} => DONE via HDFS-13097
 {{Comment-16)}} => DONE via HDFS-13097
 {{Comment-18)}} => DONE via HDFS-13097
 {{Comment-19)}} => DONE via HDFS-13097
 {{Comment-22)}} => DONE via HDFS-13097

 *For the comments below*, it would be great to hear your thoughts. Please let me know your feedback on my replies.
 {{Comment-3)}} => This comment has two parts: IBR and data transfer. The IBR part will be explored and implemented via the HDFS-13165 sub-task. But the data transfer part is not concluded yet. How do we incorporate a local move into this? Currently the data transfer path has no such logic. IIUC, DNA_TRANSFER is used to send a copy of a block to another datanode. Also, the mover tool uses replaceBlock() for block movement, which already supports moving a block to a different storage within the same datanode. How about using the {{replaceBlock}} pattern here in SPS as well?
 {{Comment-4)}} => Depends on comment-3
 {{Comment-6, Comment-9, Comment-14, Comment-17)}} => Need to understand these more.
 {{Comment-20}} => Depends on comment-3

*In Progress tasks:*
 {{Comment-3)}} => HDFS-13165; this jira will only implement the logic to collect back the moved blocks via IBR.
 {{Comment-21)}} => HDFS-13165
 {{Comment-7)}} => HDFS-13166


was (Author: rakeshr):
We have worked on the comments and following is quick update. 

{{Comment-1)}} => DONE via HDFS-1309
 {{Comment-2)}} => DONE via HDFS-13097
 {{Comment-5)}} => DONE via HDFS-13097
 {{Comment-8)}} => DONE via HDFS-13097
 {{Comment-10)}} => DONE via HDFS-13110
 {{Comment-11)}} => DONE via HDFS-13097
 {{Comment-12)}} => DONE via HDFS-13097
 {{Comment-13)}} => DONE via HDFS-13110
 {{Comment-15)}} => DONE via HDFS-13097
 {{Comment-16)}} => DONE via HDFS-13097
 {{Comment-18)}} => DONE via HDFS-13097
 {{Comment-19)}} => DONE via HDFS-13097
 {{Comment-22)}} => DONE via HDFS-13097

 *For the below comments*, it would be great to hear your thoughts. Please let 
me know your feedback on my reply.
 {{Comment-3)}} => This comment has two parts ibr and data transfer. IBR will 
be explored and implemented via HDFS-13165 sub-task. But, data transfer part is 
not concluded yet. How do we incorporate local move into this, currently data 
transfer is not having such logic. IIUC, DNA_TRANSFER is used to send a copy of 
a block to another datanode. Also, mover tool uses replaceBlock() for block 
movement, which already has block movement to a different storage within the 
same datanode. How abt using \{{replaceBlock}} pattern here in sps as well?
 {{Comment-4}} => Depends on comment-3
 {{Comment-6, Comment-9, Comment-14, Comment-17}} => Needs to understand more 
on this.
 {{Comment-20}} => Depends on comment-3

*In Progress tasks:*
 {{Comment-3)}} => HDFS-13165, this jira will only implement logic to collects 
back the moved block via IBR.
 {{Comment-21)}} => HDFS-13165
 {{Comment-7)}} => HDFS-13166

> Storage Policy Satisfier in Namenode
> 
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Affects Versions: HDFS-10285
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
>Priority: Major
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch, 
> HDFS-10285-consolidated-merge-patch-01.patch, 
> HDFS-10285-consolidated-merge-patch-02.patch, 
> HDFS-10285-consolidated-merge-patch-03.patch, 
> HDFS-10285-consolidated-merge-patch-04.patch, 
> HDFS-10285-consolidated-merge-patch-05.patch, 
> HDFS-SPS-TestReport-20170708.pdf, SPS Modularization.pdf, 
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, 
> Storage-Policy-Satisfier-in-HDFS-May10.pdf, 
> Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policy. These 
> policies can be set on directory/file to specify the user preference, where 
> to store the physical block. When user set the storage policy before writing 
> data, then the blocks could take advantage of storage policy preferences and 
> stores physical block accordingly. 
> If user set the storage policy after writing and completing the file, then 
> the blocks would have been written with default storage policy 

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2018-02-01 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347472#comment-16347472
 ] 

Rakesh R edited comment on HDFS-10285 at 2/1/18 1:58 PM:
-

Thank you very much [~daryn] for your time and useful comments/thoughts. My 
reply follows, please take a look at it.

+Comment-1)+
{quote}BlockManager
 Shouldn’t spsMode be volatile? Although I question why it’s here.
{quote}
[Rakesh's reply] Agreed, will do the changes.

+Comment-2)+
{quote}Adding SPS methods to this class implies an unexpected coupling of the 
SPS service to the block manager. Please move them out to prove it’s not 
tightly coupled.
{quote}
[Rakesh's reply] Agreed. We are planning to create 
{{StoragePolicySatisfyManager}} and keep all the related apis over there.
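
Just to make the direction concrete, here is a rough sketch of the surface we have in mind; all method names below are placeholders made up for illustration, not the final API:
{code:java}
// Rough sketch only: the SPS entry points we plan to pull out of BlockManager.
// Every name here is a placeholder and may change in the actual patch.
public interface StoragePolicySatisfyManager {
  /** Queue the given inode for storage policy satisfaction. */
  void addPathId(long inodeId);

  /** Remove any pending SPS state for the given inode. */
  void clearPathId(long inodeId);

  /** Start/stop the satisfier thread when SPS is toggled via reconfiguration. */
  void start();
  void stop();

  /** @return true if the satisfier is currently enabled and running. */
  boolean isRunning();
}
{code}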

+Comment-3)+
{quote}BPServiceActor
 Is it actually sending back the moved blocks? Aren’t IBRs sufficient?

BlockStorageMovementCommand/BlocksStorageMoveAttemptFinished
 Again, not sure that a new DN command is necessary, and why does it 
specifically report back successful moves instead of relying on IBRs? I would 
actually expect the DN to be completely ignorant of a SPS move vs any other 
move.
{quote}
[Rakesh's reply] We have explored the IBR approach and the required code changes. If SPS relies on it, an *extra* check would be needed to know whether a new block arrived due to an SPS move or some other operation, and that check would fire quite often given how many other ops there are compared to SPSBlockMove ops. Currently, the DN sends back the {{blksMovementsFinished}} list separately, so each finished movement can be easily and quickly recognized by the Satisfier on the NN side, which then updates its tracking details. If you agree that this *extra* check is not an issue, we would be happy to implement the IBR approach. Secondly, BlockStorageMovementCommand was added to carry the block vs. src/target pairs, which is needed for the move operation and for returning the result. On second thought, if we change the response flow to IBR, then I understand we can consider using the block transfer flow. Following is my analysis on that:
 - I could see that DNA_TRANSFER is used to send a copy of a block to another datanode. In our case, we need to support {{local_move}} and would have to incorporate it into this block transfer logic; I hope that is fine?
 - Secondly, we presently use {{replaceBlock}}, which has an additional {{delHint}} notification mechanism to the NN. IIUC, {{transfer block}} doesn't have this hint, so the namenode will delete the over-replicated block eventually on the next block report. Is that fine? Meantime, I will make the changes based on {{ibr/data transfer}} for the internal SPS movement and analyse further.
{code:java}
// notify name node
final Replica r = blockReceiver.getReplica();
datanode.notifyNamenodeReceivedBlock(
block, delHint, r.getStorageUuid(), r.isOnTransientStorage());

LOG.info("Moved " + block + " from " + peer.getRemoteAddressString()
+ ", delHint=" + delHint);
{code}

 - So, for internal SPS we will use {{transfer block}} and for external SPS we will use {{replace block}}. Am I missing anything?

+Comment-4)+
{quote}DataNode
 Why isn’t this just a block transfer? How is transferring between DNs any 
different than across storages?
{quote}
[Rakesh's reply] Please see my reply in {{Comment-3}}. I just incorporated this 
case over there. Thanks!

+Comment-5)+
{quote}DatanodeDescriptor
 Why use a synchronized linked list to offer/poll instead of BlockingQueue?
{quote}
[Rakesh's reply] Agreed, will do the changes.
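
For reference, a minimal illustration of the kind of change we mean here (the {{Long}} element type and class name are just stand-ins for the real queue entries):
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BlockMoveQueueSketch {
  // A BlockingQueue gives the same offer/poll semantics as a synchronized
  // LinkedList, but without external synchronization and with blocking polls.
  private final BlockingQueue<Long> pendingItems = new LinkedBlockingQueue<>();

  public boolean offer(long item) {
    return pendingItems.offer(item);
  }

  public Long poll(long timeoutMs) throws InterruptedException {
    // Waits up to timeoutMs for an element instead of busy-polling the list.
    return pendingItems.poll(timeoutMs, TimeUnit.MILLISECONDS);
  }
}
{code}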

+Comment-6)+
{quote}DatanodeManager
 I know it’s configurable, but realistically, when would you ever want to give 
storage movement tasks equal footing with under-replication? Is there really a 
use case for not valuing durability?
{quote}
[Rakesh's reply] We don't have any particular use case, though. One scenario we considered is a user-configured SSD tier that fills up quickly; in that case, cleaning it up could be treated as a high priority. If you feel this is not a real case, then I'm OK with removing this config so that SPS always uses only the remaining slots.

+Comment-7)+
{quote}Adding getDatanodeStorageReport is concerning. getDatanodeListForReport 
is already a very bad method that should be avoided for anything but jmx – even 
then it’s a concern. I eliminated calls to it years ago. All it takes is a 
nscd/dns hiccup and you’re left holding the fsn lock for an excessive length of 
time. Beyond that, the response is going to be pretty large and tagging all the 
storage reports is not going to be cheap.

verifyTargetDatanodeHasSpaceForScheduling does it really need the namesystem 
lock? Can’t DatanodeDescriptor#chooseStorage4Block synchronize on its 
storageMap?

Appears to be calling getLiveDatanodeStorageReport for every file. As mentioned 
earlier, this is NOT cheap. 

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2018-02-01 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347472#comment-16347472
 ] 

Rakesh R edited comment on HDFS-10285 at 2/1/18 1:52 PM:
-

Thank you very much [~daryn] for your time and useful comments/thoughts. My 
reply follows, please take a look at it.

+Comment-1)+
{quote}BlockManager
 Shouldn’t spsMode be volatile? Although I question why it’s here.
{quote}
[Rakesh's reply] Agreed, will do the changes.

+Comment-2)+
{quote}Adding SPS methods to this class implies an unexpected coupling of the 
SPS service to the block manager. Please move them out to prove it’s not 
tightly coupled.
{quote}
[Rakesh's reply] Agreed. We are planning to create 
{{StoragePolicySatisfyManager}} and keep all the related apis over there.

+Comment-3)+
{quote}BPServiceActor
 Is it actually sending back the moved blocks? Aren’t IBRs sufficient?

BlockStorageMovementCommand/BlocksStorageMoveAttemptFinished
 Again, not sure that a new DN command is necessary, and why does it 
specifically report back successful moves instead of relying on IBRs? I would 
actually expect the DN to be completely ignorant of a SPS move vs any other 
move.
{quote}
[Rakesh's reply] We have explored the IBR approach and the required code changes. If SPS relies on it, an *extra* check would be needed to know whether a new block arrived due to an SPS move or some other operation, and that check would fire quite often given how many other ops there are compared to SPSBlockMove ops. Currently, the DN sends back the {{blksMovementsFinished}} list separately, so each finished movement can be easily and quickly recognized by the Satisfier on the NN side, which then updates its tracking details. If you agree that this *extra* check is not an issue, we would be happy to implement the IBR approach. Secondly, BlockStorageMovementCommand was added to carry the block vs. src/target pairs, which is needed for the move operation and for returning the result. On second thought, if we change the response flow to IBR, then I understand we can consider using the block transfer flow. Following is my analysis on that:
 - I could see that DNA_TRANSFER is used to send a copy of a block to another datanode. In our case, we need to support {{local_move}} and would have to incorporate it into this block transfer logic; I hope that is fine?
 - Secondly, we presently use {{replaceBlock}}, which has an additional {{delHint}} notification mechanism to the NN. IIUC, {{transfer block}} doesn't have this hint, so the namenode will delete the over-replicated block eventually on the next block report. Any thoughts?
{code:java}
// notify name node
final Replica r = blockReceiver.getReplica();
datanode.notifyNamenodeReceivedBlock(
block, delHint, r.getStorageUuid(), r.isOnTransientStorage());

LOG.info("Moved " + block + " from " + peer.getRemoteAddressString()
+ ", delHint=" + delHint);
{code}

 - So, for internal SPS we will use {{transfer block}} and for external SPS we will use {{replace block}}. Am I missing anything?

+Comment-4)+
{quote}DataNode
 Why isn’t this just a block transfer? How is transferring between DNs any 
different than across storages?
{quote}
[Rakesh's reply] Please see my reply in {{Comment-3}}. I just incorporated this 
case over there. Thanks!

+Comment-5)+
{quote}DatanodeDescriptor
 Why use a synchronized linked list to offer/poll instead of BlockingQueue?
{quote}
[Rakesh's reply] Agreed, will do the changes.

+Comment-6)+
{quote}DatanodeManager
 I know it’s configurable, but realistically, when would you ever want to give 
storage movement tasks equal footing with under-replication? Is there really a 
use case for not valuing durability?
{quote}
[Rakesh's reply] We don't have any particular use case, though. One scenario we considered is a user-configured SSD tier that fills up quickly; in that case, cleaning it up could be treated as a high priority. If you feel this is not a real case, then I'm OK with removing this config so that SPS always uses only the remaining slots.

+Comment-7)+
{quote}Adding getDatanodeStorageReport is concerning. getDatanodeListForReport 
is already a very bad method that should be avoided for anything but jmx – even 
then it’s a concern. I eliminated calls to it years ago. All it takes is a 
nscd/dns hiccup and you’re left holding the fsn lock for an excessive length of 
time. Beyond that, the response is going to be pretty large and tagging all the 
storage reports is not going to be cheap.

verifyTargetDatanodeHasSpaceForScheduling does it really need the namesystem 
lock? Can’t DatanodeDescriptor#chooseStorage4Block synchronize on its 
storageMap?

Appears to be calling getLiveDatanodeStorageReport for every file. As mentioned 
earlier, this is NOT cheap. The SPS should be able to operate on a fuzzy/cached 
state of the world. Then it gets another datanode 

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2018-02-01 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347472#comment-16347472
 ] 

Rakesh R edited comment on HDFS-10285 at 2/1/18 1:49 PM:
-

Thank you very much [~daryn] for your time and useful comments/thoughts. My 
reply follows, please take a look at it.

+Comment-1)+
{quote}BlockManager
 Shouldn’t spsMode be volatile? Although I question why it’s here.
{quote}
[Rakesh's reply] Agreed, will do the changes.

+Comment-2)+
{quote}Adding SPS methods to this class implies an unexpected coupling of the 
SPS service to the block manager. Please move them out to prove it’s not 
tightly coupled.
{quote}
[Rakesh's reply] Agreed. We are planning to create 
{{StoragePolicySatisfyManager}} and keep all the related apis over there.

+Comment-3)+
{quote}BPServiceActor
 Is it actually sending back the moved blocks? Aren’t IBRs sufficient?

BlockStorageMovementCommand/BlocksStorageMoveAttemptFinished
 Again, not sure that a new DN command is necessary, and why does it 
specifically report back successful moves instead of relying on IBRs? I would 
actually expect the DN to be completely ignorant of a SPS move vs any other 
move.
{quote}
[Rakesh's reply] We have explored the IBR approach and the required code changes. If SPS relies on it, an *extra* check would be needed to know whether a new block arrived due to an SPS move or some other operation, and that check would fire quite often given how many other ops there are compared to SPSBlockMove ops. Currently, the DN sends back the {{blksMovementsFinished}} list separately, so each finished movement can be easily and quickly recognized by the Satisfier on the NN side, which then updates its tracking details. If you agree that this *extra* check is not an issue, we would be happy to implement the IBR approach. Secondly, BlockStorageMovementCommand was added to carry the block vs. src/target pairs, which is needed for the move operation and for returning the result. On second thought, if we change the response flow to IBR, then I understand we can consider using the block transfer flow. Following is my analysis on that:
- I could see that DNA_TRANSFER is used to send a copy of a block to another datanode. In our case, we need to support {{local_move}} and would have to incorporate it into this block transfer logic; I hope that is fine?
- Secondly, we presently use {{replaceBlock}}, which has an additional {{delHint}} notification mechanism to the NN. IIUC, {{transfer block}} doesn't have this hint, so the namenode will eventually delete the over-replicated block. Any thoughts?
{code}
// notify name node
final Replica r = blockReceiver.getReplica();
datanode.notifyNamenodeReceivedBlock(
block, delHint, r.getStorageUuid(), r.isOnTransientStorage());

LOG.info("Moved " + block + " from " + peer.getRemoteAddressString()
+ ", delHint=" + delHint);
{code}
- For internal SPS we will use {{transfer block}} and for external SPS we will use {{replace block}}. Am I missing anything?

+Comment-4)+
{quote}DataNode
 Why isn’t this just a block transfer? How is transferring between DNs any 
different than across storages?
{quote}
[Rakesh's reply] Please see my reply in {{Comment-3}}. I just incorporated this 
case over there. Thanks!

+Comment-5)+
{quote}DatanodeDescriptor
 Why use a synchronized linked list to offer/poll instead of BlockingQueue?
{quote}
[Rakesh's reply] Agreed, will do the changes.

+Comment-6)+
{quote}DatanodeManager
 I know it’s configurable, but realistically, when would you ever want to give 
storage movement tasks equal footing with under-replication? Is there really a 
use case for not valuing durability?
{quote}
[Rakesh's reply] We don't have any particular use case, though. One scenario we considered is a user-configured SSD tier that fills up quickly; in that case, cleaning it up could be treated as a high priority. If you feel this is not a real case, then I'm OK with removing this config so that SPS always uses only the remaining slots.

+Comment-7)+
{quote}Adding getDatanodeStorageReport is concerning. getDatanodeListForReport 
is already a very bad method that should be avoided for anything but jmx – even 
then it’s a concern. I eliminated calls to it years ago. All it takes is a 
nscd/dns hiccup and you’re left holding the fsn lock for an excessive length of 
time. Beyond that, the response is going to be pretty large and tagging all the 
storage reports is not going to be cheap.

verifyTargetDatanodeHasSpaceForScheduling does it really need the namesystem 
lock? Can’t DatanodeDescriptor#chooseStorage4Block synchronize on its 
storageMap?

Appears to be calling getLiveDatanodeStorageReport for every file. As mentioned 
earlier, this is NOT cheap. The SPS should be able to operate on a fuzzy/cached 
state of the world. Then it gets another datanode report to determine the 

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2018-01-31 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347472#comment-16347472
 ] 

Rakesh R edited comment on HDFS-10285 at 2/1/18 5:07 AM:
-

Thank you very much [~daryn] for your time and useful comments/thoughts. My 
reply follows, please take a look at it.

+Comment-1)+
{quote}BlockManager
 Shouldn’t spsMode be volatile? Although I question why it’s here.
{quote}
[Rakesh's reply] Agreed, will do the changes.

+Comment-2)+
{quote}Adding SPS methods to this class implies an unexpected coupling of the 
SPS service to the block manager. Please move them out to prove it’s not 
tightly coupled.
{quote}
[Rakesh's reply] Agreed. We are planning to create 
{{StoragePolicySatisfyManager}} and keep all the related apis over there.

+Comment-3)+
{quote}BPServiceActor
 Is it actually sending back the moved blocks? Aren’t IBRs sufficient?

BlockStorageMovementCommand/BlocksStorageMoveAttemptFinished
 Again, not sure that a new DN command is necessary, and why does it 
specifically report back successful moves instead of relying on IBRs? I would 
actually expect the DN to be completely ignorant of a SPS move vs any other 
move.
{quote}
[Rakesh's reply] We have explored the IBR approach and the required code changes. If SPS relies on it, an *extra* check would be needed to know whether a new block arrived due to an SPS move or some other operation, and that check would fire quite often given how many other ops there are compared to SPSBlockMove ops. Currently, the DN sends back the {{blksMovementsFinished}} list separately, so each finished movement can be easily and quickly recognized by the Satisfier on the NN side, which then updates its tracking details. If you agree that this *extra* check is not an issue, we would be happy to implement the IBR approach. Secondly, BlockStorageMovementCommand was added to carry the block vs. src/target pairs, which is needed for the move operation, and we tried to decouple the SPS code using this command.

+Comment-4)+
{quote}DataNode
 Why isn’t this just a block transfer? How is transferring between DNs any 
different than across storages?
{quote}
[Rakesh's reply] I could see that Mover also uses the {{REPLACE_BLOCK}} call, and we just followed the same approach in SPS. Am I missing anything here?

+Comment-5)+
{quote}DatanodeDescriptor
 Why use a synchronized linked list to offer/poll instead of BlockingQueue?
{quote}
[Rakesh's reply] Agreed, will do the changes.

+Comment-6)+
{quote}DatanodeManager
 I know it’s configurable, but realistically, when would you ever want to give 
storage movement tasks equal footing with under-replication? Is there really a 
use case for not valuing durability?
{quote}
[Rakesh's reply] We don't have any particular use case, though. One scenario we considered is a user-configured SSD tier that fills up quickly; in that case, cleaning it up could be treated as a high priority. If you feel this is not a real case, then I'm OK with removing this config so that SPS always uses only the remaining slots.

+Comment-7)+
{quote}Adding getDatanodeStorageReport is concerning. getDatanodeListForReport 
is already a very bad method that should be avoided for anything but jmx – even 
then it’s a concern. I eliminated calls to it years ago. All it takes is a 
nscd/dns hiccup and you’re left holding the fsn lock for an excessive length of 
time. Beyond that, the response is going to be pretty large and tagging all the 
storage reports is not going to be cheap.

verifyTargetDatanodeHasSpaceForScheduling does it really need the namesystem 
lock? Can’t DatanodeDescriptor#chooseStorage4Block synchronize on its 
storageMap?

Appears to be calling getLiveDatanodeStorageReport for every file. As mentioned 
earlier, this is NOT cheap. The SPS should be able to operate on a fuzzy/cached 
state of the world. Then it gets another datanode report to determine the 
number of live nodes to decide if it should sleep before processing the next 
path. The number of nodes from the prior cached view of the world should 
suffice.
{quote}
[Rakesh's reply] Good point. Some time back Uma and I thought about the cache part. Actually, we depend on this API for the datanode storage types and remaining-space details. I think it requires two different mechanisms for internal and external SPS. For internal SPS, how about referring directly to {{DatanodeManager#datanodeMap}} for every file? For external SPS, IIUC you are suggesting a cache mechanism. How about getting the storageReport once and caching it at ExternalContext? This local cache can be refreshed periodically, say every 5 mins (just an arbitrary number; if you have some period in mind, please suggest). When getDatanodeStorageReport is called after that period, the cache is treated as expired and fetched freshly; within the 5 mins it is served from the cache. Does this make sense to you? A rough sketch of the idea follows.
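
A rough sketch of that refresh-on-expiry idea, only to illustrate; the class, the supplier wiring and the 5-minute period are assumptions for this example, not actual branch code:
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Minimal time-based cache sketch for the external SPS context. */
public class CachedStorageReport<T> {
  private final Supplier<T> fetcher;   // e.g. whatever call fetches the live storage report
  private final long ttlNanos;
  private T cached;
  private long lastFetchNanos;

  public CachedStorageReport(Supplier<T> fetcher, long ttl, TimeUnit unit) {
    this.fetcher = fetcher;
    this.ttlNanos = unit.toNanos(ttl);
  }

  public synchronized T get() {
    long now = System.nanoTime();
    if (cached == null || now - lastFetchNanos > ttlNanos) {
      cached = fetcher.get();          // expired: fetch a fresh report
      lastFetchNanos = now;
    }
    return cached;                     // within the TTL: serve from the cache
  }
}
{code}
It would be constructed with the external context's report-fetching call and a TTL of, say, 5 minutes.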

Another point we thought of is, right now for checking whether 

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2018-01-31 Thread Surendra Singh Lilhore (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347575#comment-16347575
 ] 

Surendra Singh Lilhore edited comment on HDFS-10285 at 1/31/18 8:38 PM:


Thanks [~daryn] for the reviews.

Created Part 1 jira HDFS-13097 to fix a few of the comments.


was (Author: surendrasingh):
Thanks [~daryn] for reviews.

Create Part1 Jira HDFS-13097 to fix few comments

> Storage Policy Satisfier in Namenode
> 
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Affects Versions: HDFS-10285
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
>Priority: Major
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch, 
> HDFS-10285-consolidated-merge-patch-01.patch, 
> HDFS-10285-consolidated-merge-patch-02.patch, 
> HDFS-10285-consolidated-merge-patch-03.patch, 
> HDFS-10285-consolidated-merge-patch-04.patch, 
> HDFS-10285-consolidated-merge-patch-05.patch, 
> HDFS-SPS-TestReport-20170708.pdf, SPS Modularization.pdf, 
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, 
> Storage-Policy-Satisfier-in-HDFS-May10.pdf, 
> Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policy. These 
> policies can be set on directory/file to specify the user preference, where 
> to store the physical block. When user set the storage policy before writing 
> data, then the blocks could take advantage of storage policy preferences and 
> stores physical block accordingly. 
> If user set the storage policy after writing and completing the file, then 
> the blocks would have been written with default storage policy (nothing but 
> DISK). User has to run the ‘Mover tool’ explicitly by specifying all such 
> file names as a list. In some distributed system scenarios (ex: HBase) it 
> would be difficult to collect all the files and run the tool as different 
> nodes can write files separately and file can have different paths.
> Another scenarios is, when user rename the files from one effected storage 
> policy file (inherited policy from parent directory) to another storage 
> policy effected directory, it will not copy inherited storage policy from 
> source. So it will take effect from destination file/dir parent storage 
> policy. This rename operation is just a metadata change in Namenode. The 
> physical blocks still remain with source storage policy.
> So, Tracking all such business logic based file names could be difficult for 
> admins from distributed nodes(ex: region servers) and running the Mover tool. 
> Here the proposal is to provide an API from Namenode itself for trigger the 
> storage policy satisfaction. A Daemon thread inside Namenode should track 
> such calls and process to DN as movement commands. 
> Will post the detailed design thoughts document soon. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-12-21 Thread Virajith Jalaparti (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300516#comment-16300516
 ] 

Virajith Jalaparti edited comment on HDFS-10285 at 12/21/17 8:16 PM:
-

Hi [~umamaheswararao], Thanks for the meeting summary. Having both the options 
(SPS within NN and SPS as service) would be great.

bq. it may be necessary to start SPS RPC with its own IP:port (within NN or 
outside), so clients can always talk to SPS on that port, irrespective of where 
its running.

Quick question about this -- which clients are you referring to here? Are these for admin operations? If the SPS is going to run outside the NN, I would
think it is going to be decoupled from/not depend on the FSNamesystem lock and 
the NN-DN heartbeat protocol. The current implementation/design has a tight 
coupling between the SPS and both these components.


was (Author: virajith):
Hi [~umamaheswararao], Thanks for the meeting summary. Having both the options 
(SPS within NN and SPS as service) would be great.

bq. it may be necessary to start SPS RPC with its own IP:port (within NN or 
outside), so clients can always talk to SPS on that port, irrespective of where 
its running.

Quick question about this: what are clients you are referring to here? Are 
these for admin operations? If the SPS is going to run outside the NN, I would 
think it is going to be decoupled from/not depend on the FSNamesystem lock and 
the NN-DN heartbeat protocol. The current implementation/design has a tight 
coupling between the SPS and both these components.

> Storage Policy Satisfier in Namenode
> 
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Affects Versions: HDFS-10285
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch, 
> HDFS-10285-consolidated-merge-patch-01.patch, 
> HDFS-10285-consolidated-merge-patch-02.patch, 
> HDFS-10285-consolidated-merge-patch-03.patch, 
> HDFS-SPS-TestReport-20170708.pdf, 
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, 
> Storage-Policy-Satisfier-in-HDFS-May10.pdf, 
> Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policy. These 
> policies can be set on directory/file to specify the user preference, where 
> to store the physical block. When user set the storage policy before writing 
> data, then the blocks could take advantage of storage policy preferences and 
> stores physical block accordingly. 
> If user set the storage policy after writing and completing the file, then 
> the blocks would have been written with default storage policy (nothing but 
> DISK). User has to run the ‘Mover tool’ explicitly by specifying all such 
> file names as a list. In some distributed system scenarios (ex: HBase) it 
> would be difficult to collect all the files and run the tool as different 
> nodes can write files separately and file can have different paths.
> Another scenarios is, when user rename the files from one effected storage 
> policy file (inherited policy from parent directory) to another storage 
> policy effected directory, it will not copy inherited storage policy from 
> source. So it will take effect from destination file/dir parent storage 
> policy. This rename operation is just a metadata change in Namenode. The 
> physical blocks still remain with source storage policy.
> So, Tracking all such business logic based file names could be difficult for 
> admins from distributed nodes(ex: region servers) and running the Mover tool. 
> Here the proposal is to provide an API from Namenode itself for trigger the 
> storage policy satisfaction. A Daemon thread inside Namenode should track 
> such calls and process to DN as movement commands. 
> Will post the detailed design thoughts document soon. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-12-21 Thread Virajith Jalaparti (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300516#comment-16300516
 ] 

Virajith Jalaparti edited comment on HDFS-10285 at 12/21/17 8:15 PM:
-

Hi [~umamaheswararao], Thanks for the meeting summary. Having both the options 
(SPS within NN and SPS as service) would be great.

bq. it may be necessary to start SPS RPC with its own IP:port (within NN or 
outside), so clients can always talk to SPS on that port, irrespective of where 
its running.

Quick question about this: which clients are you referring to here? Are these for admin operations? If the SPS is going to run outside the NN, I would
think it is going to be decoupled from/not depend on the FSNamesystem lock and 
the NN-DN heartbeat protocol. The current implementation/design has a tight 
coupling between the SPS and both these components.


was (Author: virajith):
Hi [~umamaheswararao], Thanks for the meeting summary. I agree that having both 
the options (SPS within NN and SPS as service) would be great to have.

bq. it may be necessary to start SPS RPC with its own IP:port (within NN or 
outside), so clients can always talk to SPS on that port, irrespective of where 
its running.

Quick question about this: what are clients you are referring to here? Are 
these for admin operations? If the SPS is going to run outside the NN, I would 
think it is going to be decoupled from/not depend on the FSNamesystem lock and 
the NN-DN heartbeat protocol. The current implementation/design has a tight 
coupling between the SPS and both these components.

> Storage Policy Satisfier in Namenode
> 
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Affects Versions: HDFS-10285
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch, 
> HDFS-10285-consolidated-merge-patch-01.patch, 
> HDFS-10285-consolidated-merge-patch-02.patch, 
> HDFS-10285-consolidated-merge-patch-03.patch, 
> HDFS-SPS-TestReport-20170708.pdf, 
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, 
> Storage-Policy-Satisfier-in-HDFS-May10.pdf, 
> Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policy. These 
> policies can be set on directory/file to specify the user preference, where 
> to store the physical block. When user set the storage policy before writing 
> data, then the blocks could take advantage of storage policy preferences and 
> stores physical block accordingly. 
> If user set the storage policy after writing and completing the file, then 
> the blocks would have been written with default storage policy (nothing but 
> DISK). User has to run the ‘Mover tool’ explicitly by specifying all such 
> file names as a list. In some distributed system scenarios (ex: HBase) it 
> would be difficult to collect all the files and run the tool as different 
> nodes can write files separately and file can have different paths.
> Another scenarios is, when user rename the files from one effected storage 
> policy file (inherited policy from parent directory) to another storage 
> policy effected directory, it will not copy inherited storage policy from 
> source. So it will take effect from destination file/dir parent storage 
> policy. This rename operation is just a metadata change in Namenode. The 
> physical blocks still remain with source storage policy.
> So, Tracking all such business logic based file names could be difficult for 
> admins from distributed nodes(ex: region servers) and running the Mover tool. 
> Here the proposal is to provide an API from Namenode itself for trigger the 
> storage policy satisfaction. A Daemon thread inside Namenode should track 
> such calls and process to DN as movement commands. 
> Will post the detailed design thoughts document soon. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-12-09 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284692#comment-16284692
 ] 

Uma Maheswara Rao G edited comment on HDFS-10285 at 12/9/17 10:33 AM:
--

{quote}
Conceptually, blocks violating the HSM policy are a form of mis-replication 
that doesn't satisfy a placement policy – which would truly prevent performance 
issues if the feature isn't needed. The NN's repl monitor ignorantly handles 
the moves as a low priority transfer (if/since it's sufficiently replicated). 
The changes to the NN are minimalistic.
DNs need to support/honor storages in transfer requests. Transfers to itself 
become moves. Now HSM "just works", eventually, similar to increasing the repl 
factor.
{quote}
Thank you for the proposal. The scoping seems reasonable to me, IIUC.
To be clear, the current SPS is intended only to satisfy the basic HSM feature.

{quote}
An external SPS can provide fancier policies for accelerating the processing 
for those users like hbase.
{quote}
The fancier policy implementation proposals are at HDFS-7343 (smart storage management), not as part of the SPS proposal.
 
So, if I understand correctly, what you are saying is that the Namenode can handle satisfying storages using an RM-like mechanism, the same way it does for replication. Yes, the current SPS works in a similar fashion, except for how movements are found when the policy changes; in this phase, SPS instead schedules the movements on the {{#satisfyStoragePolicy(path)}} API.
 
In general, the mismatch logic could be kept in RM: RM itself would detect the storage policy mismatch and schedule the block movement command, just as a replication command is sent in the replication case. Since RM is a critical service, we thought it better not to touch it, and we tried to optimize SPS a bit considering its semantics. Let me try to explain a little more. For storage mismatches, block finding should happen collectively for the replica set, as the policy applies to the whole set of a block's replicas. Another point is how SPS collects {{to_be_storage_movement_needed_blocks}}: to simplify things, we plan to expose a new API where the user specifies a path, and internally we trigger satisfaction only for that path. To handle retries/cluster restarts, we save an Xattr until SPS finishes its work; the Xattr overhead is kept minimal by doing some deduplication (1), which I will explain below. So, instead of loading all blocks into memory for the mismatch check, SPS loads blocks only when it is actually checking them. When SPS is invoked to satisfy, we track only the file InodeID. In the replication case, it is good to track at the block level, because any single replica can be missing and there is no need to check the other blocks in that file. In the SPS case, a policy change applies to all blocks in the file, so it makes sense to track just the file id in the queues. The general usage would be that the directories where the user changes the storage policy are the ones qualified for SPS processing. The recommendation for storage policies is to set them as optimally as possible; it may be more efficient to set them on directories instead of on individual files unless really necessary. This also avoids a large number of Xattrs in HSM.
 
Since SPS picks up the same directories to satisfy, the directory Q keeps only the list of InodeIDs (longs) on which SPS intends to work to satisfy the mismatched blocks. It will not recursively load the files/blocks under that dir into memory immediately. The SPS thread picks elements (file Inodes to process) from an intermediate Q whose capacity is bounded to 1000 elements. The front-Q processor fills up the intermediate Q only when it has empty slots; otherwise it will not load the file Inodes under the directory. So we do not unnecessarily load every file-Inode id into memory (a rough sketch of this fill-up is included below).
Once the mismatches are identified for the set of blocks in a file, they are added to the Datanode descriptors as NN-to-DN commands, exactly the same as ReplicationMonitor does. Then the DN receives these commands and moves the blocks, similar to a transfer, just as you explained. So, conceptually the approach is exactly the same as RM, with small optimizations like throttling.
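
Coming back to the bounded intermediate Q, a rough sketch of that fill-up, only to illustrate the flow; the class and method names are made up and the directory walk is stubbed out, so this is not the actual NN code:
{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SpsQueueSketch {
  // Directory-level Q: only the inode ids of dirs the user asked to satisfy.
  private final BlockingQueue<Long> dirInodeQ = new LinkedBlockingQueue<>();
  // Intermediate Q of file inode ids, bounded to 1000 elements.
  private final BlockingQueue<Long> fileInodeQ = new ArrayBlockingQueue<>(1000);

  /** Front-Q processor: expand a directory only while free slots remain. */
  void fillIntermediateQueue() throws InterruptedException {
    while (fileInodeQ.remainingCapacity() > 0) {
      Long dirId = dirInodeQ.poll();
      if (dirId == null) {
        return;                        // nothing pending to expand
      }
      for (long fileId : listFileInodesUnder(dirId)) {
        fileInodeQ.put(fileId);        // blocks if the 1000 slots fill up
      }
    }
  }

  // Placeholder: in the NN this would walk the directory's children.
  private long[] listFileInodesUnder(long dirInodeId) {
    return new long[0];
  }
}
{code}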
When assigning tasks to a DN, it respects the durability tasks: if replication tasks are already pending, they are given preference, and SPS block movements are not assigned as high-priority tasks.
 
*How can we keep the Xattr load minimal:*
(1) The Xattr we add just marks the directory for SPS processing in case of restarts; otherwise it would be hard to scan the entire namespace to find the mismatches.
The Xattr object has a NameSpace enum, a String name and a byte[] value.
In this case the enum and name will be the same for any directory we set, and the value will be null. It is just like a constant object. Now we can create only one Xattr object per NN and use the same object ref for

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-12-08 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283102#comment-16283102
 ] 

Uma Maheswara Rao G edited comment on HDFS-10285 at 12/8/17 8:44 AM:
-

{quote}
Here's a rhetorical question: If managing multiple services is hard, why not 
bundle oozie, spark, storm, sqoop, kafka, ranger, knox, hive server, etc in the 
same process? Or ZK so HA is easier to deploy/manage?
{quote}
A few of my thoughts on this question: each of these projects is built for its own purpose, with its own spec, not just for helping HDFS or any other single project. And none of those projects needs to access another project's internal data structures, whereas SPS functions only for HDFS and accesses its internal data structures. Even if it were forcibly separated out, we would need to expose ‘for SPS only’ RPC APIs. This prompts me to put the question the other way as well: does it make sense to separate ReplicationMonitor into its own process? Is it fine to start EDEK as a separate one? Is it OK to start other threads (like the decommissioning task) as separate processes and coordinate via RPC, so that the NameSystem class becomes very lightweight? I think the value vs. cost will decide whether to separate or to merge into a single process.

Coming to the ZK part: as ZK is not built only for HDFS, I don't think such thoughts apply there; it is a general-purpose coordination system. Technically we can't keep monitoring services inside the NN, because the worry is precisely that the NN may die and need failover, so an external process is needed to monitor it. Anyway, I think the whole discussion is about services inside a project, not across projects, IMHO.
Here SPS provides only the missing functionality of HSM, that is, end-to-end policy satisfaction. So, IMV, it may not be worth it for users to manage an additional process just to get that missing functionality for a particular feature.

{quote}
Today, I looked at the code more closely. It can hold the lock (read lock, but 
still) way too long. Notably, but not limited to, you can’t hold the lock while 
doing block placement.
{quote}

Appreciate your review, Daryn. I think it should be easy to address, and we will make sure to address the comment before the merge; does that make sense?

{quote}
I should start sending bills to everyone who makes this fraudulent claim. . 
FSDirectory#addToInodeMap imposes a nontrivial performance penalty even when 
SPS is not enabled. We had to hack out the similar EZ check because it had a 
noticeable performance impact esp. on startup. However now that we support EZ, 
I need to revisit optimizing it.
{quote}
Thanks for the review! Nice find. Fundamentally, if SPS is disabled we don't even need to load the items into the Qs, as no one will process them. So adding a check on the enabled flag can avoid even those enqueuing calls in the disabled case; it ends up as one extra boolean check when disabled (a minimal illustration is below). With this change the impact should be negligible, IIUC. We will take this comment. Thanks.
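
Just to illustrate the kind of guard we mean (class, field and method names here are made up for the example, not the actual patch):
{code:java}
public class SpsEnqueueSketch {
  private volatile boolean spsEnabled;   // reconfigurable on/off flag

  /** Called from the inode-map update path. */
  public void addPathForSatisfy(long inodeId) {
    if (!spsEnabled) {
      return;                            // disabled: skip the enqueue entirely
    }
    enqueue(inodeId);
  }

  private void enqueue(long inodeId) {
    // queue the inode id for the satisfier thread to process
  }
}
{code}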

{quote}
I’m curious why it isn’t just part of the standard replication monitoring. If 
the DN is told to replicate to itself, it just does the storage movement.
{quote}
That's a good question. The overall approach is exactly the same as RM. RM has its own queue built up for redundant blocks, and the under-replication scan/check happens at the block level, which makes sense there. Whereas in SPS, the policy changes for a file, so all blocks in that file need movement, and the policy check should happen in coordination with where the replicas are currently stored. So we track the queues at the file level here and scan/check all blocks of a file together at once. Also, we wanted to provide an on-the-fly reconfigure feature, and we carefully considered that we don't want to interfere with replication; replication logic should be given more priority than SPS work. While scheduling blocks, we respect the xmits counts, which are shared between RM and SPS for controlling DN load. Assignment priority is given to replication/EC blocks first, then SPS blocks, when sending tasks to a DN. So, as part of the impact analysis, we felt that keeping SPS in its own thread rather than in the RM thread would be cleaner and safer than running in the same RM loop.


was (Author: umamaheswararao):
{quote}
Here's a rhetorical question: If managing multiple services is hard, why not 
bundle oozie, spark, storm, sqoop, kafka, ranger, knox, hive server, etc in the 
same process? Or ZK so HA is easier to deploy/manage?
{quote}
Few of my thoughts on this question, each of these projects build for their own 
purpose, with its own spec, not for just for helping HDFS or any other single 
project. And none of that projects need to access other project internal data 
structures. Where as SPS is only functions for HDFS and access internal data 
structures. Even forcibly separated out, we need to expose ‘for SPS only’ RPC 
APIs. This strikes me to put a question in other way as well, is it 

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-12-07 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283102#comment-16283102
 ] 

Uma Maheswara Rao G edited comment on HDFS-10285 at 12/8/17 7:35 AM:
-

{quote}
Here's a rhetorical question: If managing multiple services is hard, why not 
bundle oozie, spark, storm, sqoop, kafka, ranger, knox, hive server, etc in the 
same process? Or ZK so HA is easier to deploy/manage?
{quote}
A few of my thoughts on this question: each of these projects is built for its own purpose, with its own spec, not just for helping HDFS or any other single project. And none of those projects needs to access another project's internal data structures, whereas SPS functions only for HDFS and accesses its internal data structures. Even if it were forcibly separated out, we would need to expose ‘for SPS only’ RPC APIs. This prompts me to put the question the other way as well: does it make sense to separate ReplicationMonitor into its own process? Is it fine to start EDEK as a separate one? Is it OK to start other threads (like the decommissioning task) as separate processes and coordinate via RPC, so that the NameSystem class becomes very lightweight? I think the value vs. cost will decide whether to separate or to merge into a single process.

Coming to the ZK part: as ZK is not built only for HDFS, I don't think such thoughts apply there; it is a general-purpose coordination system. Technically we can't keep monitoring services inside the NN, because the worry is precisely that the NN may die and need failover, so an external process is needed to monitor it. Anyway, I think the whole discussion is about services inside a project, not across projects, IMHO.
Here SPS provides only the missing functionality of HSM, that is, end-to-end policy satisfaction. So, IMV, it may not be worth it for users to manage an additional process just to get that missing functionality for a particular feature.

{quote}
Today, I looked at the code more closely. It can hold the lock (read lock, but 
still) way too long. Notably, but not limited to, you can’t hold the lock while 
doing block placement.
{quote}

Appreciate your review, Daryn. I think it should be easy to address, and we will make sure to address the comment before the merge; does that make sense?


{quote}
I’m curious why it isn’t just part of the standard replication monitoring. If 
the DN is told to replicate to itself, it just does the storage movement.
{quote}
That's a good question. The overall approach is exactly the same as RM. RM has its own queue built up for redundant blocks, and the under-replication scan/check happens at the block level, which makes sense there. Whereas in SPS, the policy changes for a file, so all blocks in that file need movement, and the policy check should happen in coordination with where the replicas are currently stored. So we track the queues at the file level here and scan/check all blocks of a file together at once. Also, we wanted to provide an on-the-fly reconfigure feature, and we carefully considered that we don't want to interfere with replication; replication logic should be given more priority than SPS work. While scheduling blocks, we respect the xmits counts, which are shared between RM and SPS for controlling DN load. Assignment priority is given to replication/EC blocks first, then SPS blocks, when sending tasks to a DN. So, as part of the impact analysis, we felt that keeping SPS in its own thread rather than in the RM thread would be cleaner and safer than running in the same RM loop.


was (Author: umamaheswararao):
{quote}
Here's a rhetorical question: If managing multiple services is hard, why not 
bundle oozie, spark, storm, sqoop, kafka, ranger, knox, hive server, etc in the 
same process? Or ZK so HA is easier to deploy/manage?
{quote}
Few of my thoughts on this question, each of these projects build for their own 
purpose, with its own spec, not for just for helping HDFS or any other single 
project. And none of that projects need to access other project internal data 
structures. Where as SPS is only functions for HDFS and access internal data 
structures. Even forcibly separated out, we need to expose ‘for SPS only’ RPC 
APIs. This strikes me to put a question in other way as well, is it make sense 
to separate ReplicationMonitor as one separate process? is it fine to start 
EDEK as one separate? is it ok to start other threads (like decommissioning 
task) as separate processes and co-ordinate via RPC? so that NameSystem class 
may become very light weight? I think its the value vs cost will decide whether 
to separate or merge into single. 

Coming to ZK part, As ZK is not build only for HDFS, I don’t think to have any 
such thoughts. Its general purpose co-ordination system. Technically we can’t 
keep monitoring services inside NN, because the worry itself is, NN may die, 
need failover and so need external process to monitor. Anyway. I think the 
whole discussion is about services inside a project, but not cross 

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-12-07 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281503#comment-16281503
 ] 

Rakesh R edited comment on HDFS-10285 at 12/7/17 1:17 PM:
--

Thanks a lot [~anu] for your time and comments.

bq. This is the most critical concern that I have. In one of the discussions 
with SPS developers, they pointed out to me that they want to make sure an SPS 
move happens within a reasonable time. Apparently, I was told that this is a 
requirement from HBase. If you have such a need, then the first thing an admin 
will do is to increase this queue size. Slowly, but steadily SPS will eat into 
more and more memory of Namenode
Increasing the Namenode Q will not help to speed up the block movements. It is the Datanode that does the actual block movements, and one needs to tune the Datanode bandwidth to speed them up. Hence there is no sense in increasing the Namenode Q; in fact, that will simply add to the pending tasks on the Namenode side.

Let me try to lay out the memory usage of the Namenode Q.
Assume there are 1 million directories and users invoked the {{dfs#satisfyStoragePolicy(path)}} API on all of them, which is a huge data movement and may not be a regular case. Again, assume that some user, without knowing the (lack of) advantage of increasing the Q size, sets it to a higher value of 1,000,000. Each API call will add an {{Xattr}} to represent the pending movement, and the NN maintains a list of pending dir InodeIds to satisfy the policy, each a {{Long}} value. Each Xattr takes 15 chars, {{"system.hdfs.sps"}}, for the marking (note: the branch code uses {{system.hdfs.satisfy.storage.policy}}; we will shorten it to {{system.hdfs.sps}}). With that, the total space occupied is the (xattr + inodeId) size.

*(1) Xattr entry*
Xattr: 12 bytes (object overhead) + 4 bytes (String reference) + 4 bytes (byte[] reference) = aligned 24 bytes.
String "system.hdfs.sps": 40 bytes (String object) + 15 bytes (chars) = 56 bytes. A new String("system.hdfs.sps") object is not created every time, so ideally these 56 bytes need not be counted per entry; still, I'm counting them.
byte[] value: 4 bytes

Total 84 bytes, aligned to 88 bytes; 88 bytes * 1,000,000 = 83.92 MB

Whether we keep SPS outside or inside the Namenode, this much memory will be occupied, as the xattr is used to mark the pending item.

*(2) Namenode Q*
LinkedList entry = 24 bytes
Long object = 12 bytes (object overhead) + 8 bytes = aligned 24 bytes
48 bytes * 1,000,000 = 45.78 MB

Roughly 46 MB, which I feel is a small percentage, and this would only occur in the misconfigured scenario where that many {{InodeIds}} are queued up.

The recommended default Q size will be 1,000 or even 10,000; at 10,000 that is 48 bytes * 10,000 = 468.75 KB, i.e. about 469 KB, keeping the memory usage very low. We are open to changing the default value (increase/decrease) based on feedback; a quick arithmetic check is below.
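
For reference, a quick back-of-the-envelope check of the numbers above (the per-entry byte sizes are the estimates from this comment, not measured values):
{code:java}
public class SpsMemoryEstimate {
  public static void main(String[] args) {
    long entries = 1_000_000L;
    long xattrBytes = 88;      // aligned Xattr entry estimate from above
    long queueBytes = 24 + 24; // LinkedList node + boxed Long estimate

    System.out.printf("Xattr total: %.2f MB%n",
        entries * xattrBytes / (1024.0 * 1024.0));
    System.out.printf("Queue total: %.2f MB%n",
        entries * queueBytes / (1024.0 * 1024.0));
    System.out.printf("Queue at 10,000 entries: %.2f KB%n",
        10_000L * queueBytes / 1024.0);
  }
}
{code}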

Please feel free to correct me if I missed anything. Thanks!

bq. We have an existing pattern Balancer, Mover, DiskBalancer where we have the 
"scan and move tools" as an external feature to namenode. I am not able to see 
any convincing reason for breaking this pattern.
- {{Scanning}} - For scanning, CPU is the most consumed resource. IIUC, from your previous comments, I'm glad you agreed that CPU is not an issue, so scanning is not a concern. If we run SPS outside, it has to make additional RPC calls for the SPS work, and on a switch-over the new active SPS service would have to blindly scan the entire namespace to figure out the xattrs. To handle such switching scenarios we would have to come up with some awkward tweaking logic, like writing the xattr state somewhere in a file that the new active SPS service reads and continues from. With this, I feel the scanning logic should stay in the NN.
FYI, the NN has the existing EDEK feature, which also does scanning, and we reuse the same code in SPS.
Also, I'm re-iterating the point that SPS does not scan files on its own; the user has to call the API to satisfy a particular file.

- {{Moving blocks}} - This is about assigning the responsibility to the Datanode. Presently, the Namenode has several pieces of logic that do block movement - ReplicationMonitor, EC reconstruction, decommissioning, etc. We have added a throttling mechanism for the SPS block movements as well, so as not to affect the existing data movements.

- AFAIK, DiskBalancer runs completely at the Datanode and looks like a Datanode utility, so I don't think it compares with SPS. Coming to the Balancer: it doesn't need any input file paths and balances the HDFS cluster based on utilization. The Balancer can run independently as it doesn't 

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-12-07 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281503#comment-16281503
 ] 

Rakesh R edited comment on HDFS-10285 at 12/7/17 11:26 AM:
---

Thanks a lot [~anu] for your time and comments.

bq. This is the most critical concern that I have. In one of the discussions 
with SPS developers, they pointed out to me that they want to make sure an SPS 
move happens within a reasonable time. Apparently, I was told that this is a 
requirement from HBase. If you have such a need, then the first thing an admin 
will do is to increase this queue size. Slowly, but steadily SPS will eat into 
more and more memory of Namenode
Increasing the Namenode Q will not help speed up the block movements. It is
the Datanode that does the actual block movement, and one needs to tune the
Datanode bandwidth to speed up the movements. Hence there is no sense in
increasing the Namenode Q; in fact, that will simply pile up pending tasks on
the Namenode side.

Let me try putting down the memory usage of the Namenode Q:
Assume there are 1 million directories and users invoked the
{{dfs#satisfyStoragePolicy(path)}} API on all of them, which is a huge data
movement and may not be a regular case. Again, assume that, without knowing
the advantage of increasing the Q size, a careless user sets it to a much
higher value such as 1,000,000. Each API call will add an {{Xattr}} to
represent the pending movement, and the NN maintains a list of pending
directory InodeIds to satisfy the policy, each of which is a {{Long}} value.
Each Xattr takes the 15-char name {{"system.hdfs.sps"}} for the marking
(note: the branch code currently uses {{system.hdfs.satisfy.storage.policy}};
we will shorten it to {{system.hdfs.sps}}). With that, the total space
occupied is (xattr + inodeId) size.

*(1) Xattr entry*
Xattr: 12bytes(Object overhead) + 4bytes(String reference) + 4bytes(byte array)
= aligned 24bytes.
String "system.hdfs.sps": 40bytes(String object) + 15bytes(chars) = 56bytes.
A new String("system.hdfs.sps") object is not created every time, so ideally
the 56 bytes need not be counted for every entry; still, I'm counting it to be
conservative.
byte[]: 4bytes

Total: 84 bytes per entry, aligned to 88 bytes; 88bytes * 1,000,000 = 83.923MB

Whether we keep SPS inside or outside the Namenode, this much memory will be
occupied, since the xattr is what marks the pending item.

*(2) Namenode Q*
LinkedList entry = 24bytes
Long object = 12bytes(Object overhead) + 8bytes = aligned 24bytes
--
48bytes * 1,000,000 = 45.78MB
--

Approx. 46MB, which I feel is a small fraction, and it would occur only in the
misconfigured scenario where that many {{InodeIds}} are queued up.

The recommended default Q size will be 10,000: 48bytes * 10,000 = 468.75KB,
i.e. about 469KB.

Please feel free to correct me if I missed anything. Thanks!

bq. We have an existing pattern Balancer, Mover, DiskBalancer where we have the 
"scan and move tools" as an external feature to namenode. I am not able to see 
any convincing reason for breaking this pattern.
- {{Scanning}} - For scanning, CPU is the main resource consumed. IIUC, from
your previous comments you agreed that CPU is not an issue, so scanning is not
a concern. If we run SPS outside, it has to make additional RPC calls to the
NN for the SPS work, and on an SPS-HA switchover the new active service has to
blindly scan the entire namespace to figure out the xattrs. To handle those
switchover scenarios we would have to come up with some awkward workaround,
such as persisting the xattr progress somewhere in a file so that the new
active SPS service can read it and continue. Given this, I feel we should keep
the scanning logic in the NN.
FYI, the NN's existing EDEK re-encryption feature also does scanning, and we
reuse the same code in SPS.
Also, I'm reiterating the point that SPS does not scan files on its own; the
user has to call the API to satisfy a particular file.

- {{Moving blocks}} - This delegates the block-movement work to the Datanodes.
Presently, the Namenode already has several pieces of logic that drive block
movement - ReplicationMonitor, EC reconstruction, decommissioning, etc. We have
also added a throttling mechanism for the SPS block movements so that they do
not affect the existing data movements.

- AFAIK, DiskBalancer runs entirely at the Datanode and looks like a Datanode
utility, so I don't think it compares with SPS. Coming to the Balancer, it
doesn't need any input file paths and balances the HDFS cluster based on
utilization. Balancer can run independently as it doesn't take any input file
path argument, and the user may not be waiting for the balancing work to
finish, whereas SPS is exposed to the user via the HSM feature. HSM is
completely

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-12-07 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281503#comment-16281503
 ] 

Rakesh R edited comment on HDFS-10285 at 12/7/17 8:58 AM:
--

Thanks a lot [~anu] for your time and comments.

bq. This is the most critical concern that I have. In one of the discussions 
with SPS developers, they pointed out to me that they want to make sure an SPS 
move happens within a reasonable time. Apparently, I was told that this is a 
requirement from HBase. If you have such a need, then the first thing an admin 
will do is to increase this queue size. Slowly, but steadily SPS will eat into 
more and more memory of Namenode
Increasing the Namenode Q will not help speed up the block movements. It is
the Datanode that does the actual block movement, and one needs to tune the
Datanode bandwidth to speed up the movements. Hence there is no sense in
increasing the Namenode Q; in fact, that will simply pile up pending tasks on
the Namenode side.

Let me try putting down the memory usage of the Namenode Q:
Assume there are 1 million directories and users invoked the
{{dfs#satisfyStoragePolicy(path)}} API on all of them, which is a huge data
movement and may not be a regular case. Again, assume that, without knowing
the advantage of increasing the Q size, a careless user sets it to a much
higher value such as 1,000,000. Each API call will add an xattr to represent
the pending movement, and the NN maintains a list of pending directory
InodeIds to satisfy the policy, each of which is a Long (IIUC, a Long is 8
bytes of object overhead plus 8 more bytes for the actual long value). Each
Xattr takes the 15-char name {{"system.hdfs.sps"}} for the marking (note: the
branch code currently uses {{system.hdfs.satisfy.storage.policy}}; we will
shorten it to {{system.hdfs.sps}}). With that, the total space occupied is
(xattr + inodeId) size:

1,000,000 * (30bytes + 16bytes) = 1,000,000 * 46bytes = 46,000,000bytes =
43.87MB, approx. 44MB, which I feel is a small fraction and would occur only
in the misconfigured scenario where that many InodeIds are queued up.

bq. We have an existing pattern Balancer, Mover, DiskBalancer where we have the 
"scan and move tools" as an external feature to namenode. I am not able to see 
any convincing reason for breaking this pattern.
- {{Scanning}} - For scanning, CPU is the main resource consumed. IIUC, from
your previous comments you agreed that CPU is not an issue, so scanning is not
a concern. If we run SPS outside, it has to make additional RPC calls to the
NN for the SPS work, and on an SPS-HA switchover the new active service has to
blindly scan the entire namespace to figure out the xattrs. To handle those
switchover scenarios we would have to come up with some awkward workaround,
such as persisting the xattr progress somewhere in a file so that the new
active SPS service can read it and continue. Given this, I feel we should keep
the scanning logic in the NN.
FYI, the NN's existing EDEK re-encryption feature also does scanning, and we
reuse the same code in SPS.
Also, I'm reiterating the point that SPS does not scan files on its own; the
user has to call the API to satisfy a particular file.

- {{Moving blocks}} - This delegates the block-movement work to the Datanodes.
Presently, the Namenode already has several pieces of logic that drive block
movement - ReplicationMonitor, EC reconstruction, decommissioning, etc. We have
also added a throttling mechanism for the SPS block movements so that they do
not affect the existing data movements.

- AFAIK, DiskBalancer runs entirely at the Datanode and looks like a Datanode
utility, so I don't think it compares with SPS. Coming to the Balancer, it
doesn't need any input file paths and balances the HDFS cluster based on
utilization. Balancer can run independently as it doesn't take any input file
path argument, and the user may not be waiting for the balancing work to
finish, whereas SPS is exposed to the user via the HSM feature. HSM is
completely bound to the Namenode, which today only allows users to set the
storage policy; that just changes state at the NN, and the NN takes no action
to satisfy the policy. For the HSM feature, starting another service may be a
real overhead, and HSM adoption may suffer. My personal opinion: the fact that
Balancer/DiskBalancer run outside is not, by itself, a good reason for keeping
SPS outside.


was (Author: rakeshr):
Thanks a lot [~anu] for your time and comments.

bq. This is the most critical concern that I have. In one of the discussions 
with SPS developers, they pointed out to me that they want to make sure an SPS 
move happens within a reasonable time. Apparently, I was told that this is a 
requirement from HBase. If you have such a need, then the first thing an admin 
will do is to increase this queue size. Slowly, but steadily SPS will eat into 
more and more memory of Namenode
Increasing Namenode Q will not help to speedup the block movements. It is the 
Datanode 

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-12-06 Thread Anu Engineer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16280634#comment-16280634
 ] 

Anu Engineer edited comment on HDFS-10285 at 12/6/17 6:36 PM:
--

@[~andrew.wang] Thanks for the comments.
bq. Adding a new service requires adding support in management frameworks like 
Cloudera Manager or Ambari. This means support for deployment, configuration, 
monitoring, rolling upgrade, and log collection. 

I am not very familiar with these tools; I prefer to deploy my clusters
without them. So help me here a bit: are you suggesting that we should decide
whether a feature belongs inside the Namenode based on how inflexible these
tools are? Why is it so hard for, say, Cloudera Manager (I am just presuming
you will be more familiar with it) to configure a new service? Isn't the sole
purpose of these tools to perform exactly this kind of management action?

I am hopeful (again, my understanding of these tools is minimal) that they
already have all the requisite framework in place, and that it is not as
onerous as you describe to support a daemon running in the cluster.

IMHO, if we base the decision about whether a feature should go into the
Namenode on the code-modification complexity of these tools, I am worried that
we are putting an unusually heavy burden on the Namenode.

I suggest that we do what is right for the Namenode based on the constraints
of our layer, and not worry about layers far above us.

@[~vinayrpet] Thank you for sharing your perspective. 

bq. Im coming at this from the standpoint of supporting Cloudera's Hadoop 
customers. 

Since I work for Hortonworks, I have a wealth of perspective on how customers
tend to use these features. Most customers will start off with this tool as
is; then they will discover that the queue length is not adequate for the move
to happen in a reasonable time; they will increase the queue length, and then
we will discover that the Namenode is running out of memory. The next step is
that they will want us to run SPS based on various policies, like moving the
blocks only if they are older than 3 hours, or if the load on the Namenode is
less than x, or if the number of YARN containers in the cluster is less than X.

Slowly but steadily, customers will want complex policies.
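Purely to illustrate the kind of pluggable policy anticipated here (nothing
like this exists on the SPS branch; the name and parameters are invented), a
hypothetical sketch:
{code}
// Hypothetical sketch only: illustrates the "complex policies" argument above.
// No such interface exists in the SPS branch; names and parameters are invented.
public interface SpsSchedulingPolicy {
  /**
   * Decide whether a pending SPS move should be dispatched now, based on the
   * kinds of criteria mentioned above (block age, NN load, cluster activity).
   */
  boolean shouldMoveNow(long blockAgeMillis, double namenodeLoad,
      int runningYarnContainers);
}
{code}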

Here is the kicker: if SPS is inside the Namenode, then each time such a
feature is added we are going to step into this huge argument about whether we
should have these complex features inside the Namenode.

So experience with Hortonworks customers tells me that we should prioritize
scale and the future needs of this feature rather than ease of code change for
management tools.


bq. IIUC, Main concerns to keep SPS (or any other such feature) in NameNode are 
following.

I think you missed a critical argument: all scan-and-move functions of HDFS
today are outside the Namenode. I am proposing that we keep it that way. SPS
is not unique in any way, and we have a well-known pattern that works. In my
mind, management tools like Ambari should be able to address the ease-of-use
part. For people like me who are willing to use the shell, this does not seem
to be an additional burden.

bq. 1. Locking Load 
This same process can be done from outside the Namenode; hence we are
proposing that we move it outside.

bq. SPS should have client facing RPC server to learn about the paths to 
satisfy policy. This comes with lot of deployment overhead as already mentioned 
above by Andrew.
I seriously question this assertion. From a shell perspective, we can check
whether this config value is set and start the daemon from start-dfs.sh. Why
is this such a complicated task for Cloudera Manager or Ambari? I do not buy
this argument. How can something that can be done in 5 lines of code in Hadoop
become a task so complex that we would want to avoid that code path in
Cloudera Manager? I am sorry, that makes no sense to me.

bq. if SPS doesnt have its own RPC server, then it needs to scan the targets by 
checking for xattr recursively from root( / ) directory

What prevents us from adding this? We should do what is technically required.

The problem I think you are missing is that the current SPS has no policy
control over when it should run. But I posit that it is not too far off that
we will have to build various kinds of policies to control it. I am not
suggesting that we need to do that before the merge. Being an independent
service allows for this kind of flexibility.

bq. Memory

This is the most critical concern that I have. In one of the discussions with 
SPS developers, they pointed out to me that they want to make sure an SPS move 
happens within a reasonable time. Apparently, I was told that this is a 
requirement from HBase. If you have such a need, then the first thing an admin 
will do is to increase this queue size. Slowly, but steadily  SPS will eat into 
more and 

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-12-05 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278844#comment-16278844
 ] 

Uma Maheswara Rao G edited comment on HDFS-10285 at 12/5/17 4:52 PM:
-

[~chris.douglas] do you have some opinion here on how to move forward?
Appreciate others' thoughts as well.
# One option is to keep SPS running inside the Namenode, as it is done now.
This avoids any operational cost and additional process maintenance (compared
to keeping it outside). Throttling would control the additional burden on the
Namenode. SPS is a kind of extension to the HSM feature for better usability;
since HSM is a Namenode feature, it makes sense to keep it in the Namenode
itself.
# The other option is to run SPS outside as an independent process, to avoid
any burden on the Namenode due to SPS. Compared to Balancer/DiskBalancer, SPS
also moves blocks, so it makes sense to run it as a separate process. On the
other side, this could increase the RPC calls to the Namenode for fetching
file metadata during processing and for other coordination, and it adds extra
process-maintenance cost from a deployment perspective.
Many other points are discussed above for more information.


was (Author: umamaheswararao):
[~chris.douglas] do you have some opinion here to move forward ? Appreciate 
others thoughts as well.
# One is to keep SPS running in Namenode, as it is done now. This avoids any 
operational cost and additional process maintenance (against to keeping 
outside) . A Throttling would try to control the additional over burden on 
Namenode. SPS is kind of extension to HSM feature for more usability, as HSM is 
Namenode's feature, it make sense to keep in Namenode itself..
# Other thought is to run SPS out side as an independent process to avoid any 
burdens on Namenode due to SPS. COmparing to Balancer/DiskBalancers, SPS also 
moving blocks, so make sense to run as separate process. In other ide, this 
could increase RPC calls to Namenode for getting meta information of file while 
processing and for other co-ordinations. And extra process maintenance cost to 
this additional process for the deployments perspective.   
Many other points discussed above for more information.

> Storage Policy Satisfier in Namenode
> 
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Affects Versions: HDFS-10285
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch, 
> HDFS-10285-consolidated-merge-patch-01.patch, 
> HDFS-10285-consolidated-merge-patch-02.patch, 
> HDFS-10285-consolidated-merge-patch-03.patch, 
> HDFS-SPS-TestReport-20170708.pdf, 
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, 
> Storage-Policy-Satisfier-in-HDFS-May10.pdf, 
> Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policy. These 
> policies can be set on directory/file to specify the user preference, where 
> to store the physical block. When user set the storage policy before writing 
> data, then the blocks could take advantage of storage policy preferences and 
> stores physical block accordingly. 
> If user set the storage policy after writing and completing the file, then 
> the blocks would have been written with default storage policy (nothing but 
> DISK). User has to run the ‘Mover tool’ explicitly by specifying all such 
> file names as a list. In some distributed system scenarios (ex: HBase) it 
> would be difficult to collect all the files and run the tool as different 
> nodes can write files separately and file can have different paths.
> Another scenarios is, when user rename the files from one effected storage 
> policy file (inherited policy from parent directory) to another storage 
> policy effected directory, it will not copy inherited storage policy from 
> source. So it will take effect from destination file/dir parent storage 
> policy. This rename operation is just a metadata change in Namenode. The 
> physical blocks still remain with source storage policy.
> So, Tracking all such business logic based file names could be difficult for 
> admins from distributed nodes(ex: region servers) and running the Mover tool. 
> Here the proposal is to provide an API from Namenode itself for trigger the 
> storage policy satisfaction. A Daemon thread inside Namenode should track 
> such calls and process to DN as movement commands. 
> Will post the detailed design thoughts document soon. 




[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-12-01 Thread Anu Engineer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274977#comment-16274977
 ] 

Anu Engineer edited comment on HDFS-10285 at 12/1/17 9:08 PM:
--

bq. Is it trivial? I think we still need some type of fencing so there's only 
one active SPS. Does this use zookeeper, like NN HA? 
Yes, that would be the simplest approach to getting SPS HA.
bq. If there's an SPS failover, how does the new active know where to resume?
Once the active knows it is the leader, it can read the state from NN and 
continue. The issues of continuity are exactly same whether it is inside NN or 
outside.
bq.  I'm also wondering how progress is tracked, so we can resume without 
iterating over significant portions of the namespace.
As soon as a block is moved, the move call updates the status of the block
move; that is, the NN is up to date with that info. Each time there is a call
to the SPS API, the NN keeps track of it, and the post-move updates let us
filter out the remaining blocks.
bq. I also like centralized control when it comes to coordinating block work. 
The NN schedules and prioritizes block work on the cluster. Already it's 
annoying to users to have configure a separate set of resource throttles for 
the balancer work, and it makes the system less reactive to cluster health 
events. We'd much rather have a single resource allocation for all cluster 
maintenance work, which the NN can use however it wants based on its priority.
By that argument, the Balancer should be the first tool to move into the
Namenode, and then DiskBalancer. Right now, the SPS approach follows what we
already do in the HDFS world, that is, block moves are achieved through an
async mechanism. If you would like to provide a generic block-mover mechanism
in the Namenode and then port Balancer and DiskBalancer to it, you are most
welcome; I will be glad to move SPS to that framework when we have it.

bq. What is the concern about NN overhead, for this feature in particular? This 
is similar to what I asked Uma earlier about the coordinator DN; I don't think 
it meaningfully shifts work off the NN
There are a couple of concerns:
# Following the established pattern of Balancer, Mover, DiskBalancer, etc.
# Memory and CPU overhead in the Namenode.
# Future directions -- if we have to support finer mechanisms like smart
storage management, moving data into provided blocks, etc., it is better for
this to run as an independent service.


And most important, we are just accelerating an SPS future work item; it has
always been the plan to make SPS separate, so we are simply achieving that
goal before the merge. Nothing fundamentally changes about SPS.


was (Author: anu):
bq. Is it trivial? I think we still need some type of fencing so there's only 
one active SPS. Does this use zookeeper, like NN HA? 
Yes, that would be the simplest approach to getting SPS HA.
bq. If there's an SPS failover, how does the new active know where to resume?
Once the active knows it is the leader, it can read the state from NN and 
continue. The issues of continuity are exactly same whether it is inside NN or 
outside.
bq.  I'm also wondering how progress is tracked, so we can resume without 
iterating over significant portions of the namespace.
As soon as a block is moved, the move call updates the status of the block 
move, that is NN is up to date with that info. Each time there is a call to SPS 
API, NN will keep track of it and the updates after move lets us filter the 
remaining blocks.
bq. I also like centralized control when it comes to coordinating block work. 
The NN schedules and prioritizes block work on the cluster. Already it's 
annoying to users to have configure a separate set of resource throttles for 
the balancer work, and it makes the system less reactive to cluster health 
events. We'd much rather have a single resource allocation for all cluster 
maintenance work, which the NN can use however it wants based on its priority.
By that argument, Balancer should be the first tool that move into the Namenode 
and then DiskBalancer. Right now, SPS approach follows what we are doing in 
HDFS world, that is block moves are achieved thru an async mechanism. If you 
would like to provide a generic block mover mechanism in Namenode and then port 
balancer and diskBalancer, you are most welcome. I will be glad to move SPS to 
that framework when we have it.

bq. What is the concern about NN overhead, for this feature in particular? This 
is similar to what I asked Uma earlier about the coordinator DN; I don't think 
it meaningfully shifts work off the NN
There are a couple of concerns:
 # Following an established pattern of Balancer, Mover, DiskBalancer etc. 
# Memory and CPU overhead in Namenode.
# Future Directions -- if we have to support more finer mechanisms like smart 
storage management, moving data into provided block etc. It is better for this 
to be run as 

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-10-26 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220276#comment-16220276
 ] 

Rakesh R edited comment on HDFS-10285 at 10/26/17 10:35 AM:


Uploaded another version of the SPS design document; I tried to capture the
details to reflect the recent code changes. Feedback welcome!

Thank you [~umamaheswararao] for co-authoring the doc.
Thank you [~andrew.wang], [~surendrasingh], [~eddyxu], [~xiaochen],
[~anoop.hbase], [~ram_krish] for the useful discussions/comments.


was (Author: rakeshr):
Uploaded another version of SPS design document, tried capturing the details to 
reflect recent code changes. Welcome feedback!

Thank you [~umamaheswararao] for co-authoring the doc.
Thank you [~andrew.wang], [~surendrasingh], [~eddyxu], [~xiaochen], 
[~anoop.hbase] for the useful discussions/comments.

> Storage Policy Satisfier in Namenode
> 
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Affects Versions: HDFS-10285
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch, 
> HDFS-10285-consolidated-merge-patch-01.patch, 
> HDFS-SPS-TestReport-20170708.pdf, 
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, 
> Storage-Policy-Satisfier-in-HDFS-May10.pdf, 
> Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policy. These 
> policies can be set on directory/file to specify the user preference, where 
> to store the physical block. When user set the storage policy before writing 
> data, then the blocks could take advantage of storage policy preferences and 
> stores physical block accordingly. 
> If user set the storage policy after writing and completing the file, then 
> the blocks would have been written with default storage policy (nothing but 
> DISK). User has to run the ‘Mover tool’ explicitly by specifying all such 
> file names as a list. In some distributed system scenarios (ex: HBase) it 
> would be difficult to collect all the files and run the tool as different 
> nodes can write files separately and file can have different paths.
> Another scenarios is, when user rename the files from one effected storage 
> policy file (inherited policy from parent directory) to another storage 
> policy effected directory, it will not copy inherited storage policy from 
> source. So it will take effect from destination file/dir parent storage 
> policy. This rename operation is just a metadata change in Namenode. The 
> physical blocks still remain with source storage policy.
> So, Tracking all such business logic based file names could be difficult for 
> admins from distributed nodes(ex: region servers) and running the Mover tool. 
> Here the proposal is to provide an API from Namenode itself for trigger the 
> storage policy satisfaction. A Daemon thread inside Namenode should track 
> such calls and process to DN as movement commands. 
> Will post the detailed design thoughts document soon. 






[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-08-16 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129525#comment-16129525
 ] 

Uma Maheswara Rao G edited comment on HDFS-10285 at 8/17/17 5:36 AM:
-

Hi [~andrew.wang] Thank you for helping us a lot in reviews. Really great 
points.
{quote}
This would be a user periodically asking for status. From what I know of async 
API design, callbacks are preferred over polling since it solves the question 
about how long the server needs to hold the status.
I'd be open to any proposal here, I just think the current "isSpsRunning" API 
is insufficient. Did you end up filing a ticket to track this?
{quote}
From an async API design perspective, I agree that such systems would have
callback-registration APIs. I think we don't have that callback mechanism
designed in HDFS. In this particular case, we don't actually process anything
the user is waiting on; this is just a trigger telling the system to start
some built-in functionality. In fact, the isSpsRunning API was added just so
users can make sure the built-in SPS is not running if they want to run the
Mover tool explicitly. I filed JIRA HDFS-12310 to discuss this further. I
really don't know whether it's a good idea to encourage users to periodically
poll the system for this status. IMO, if movements are really failing
(probably some storages are unavailable or have failed, etc.), administrator
action is definitely required, rather than the user component knowing the
status and taking action itself. So I strongly believe that reporting failures
as metrics will definitely get the admin's attention. Since we don't want to
enable automatic movement in the first stage, there should be some trigger to
start the movement. Some work related to async HDFS APIs is happening at
HDFS-9924; perhaps we could take some design ideas from there, once they are
in, for a status API?
Another argument is that we already have async-style APIs, for example delete
or setReplication. From the NN-call perspective they may be sync calls, but
from the user's perspective a lot of work still happens asynchronously. If we
delete a file, the NN cleans up and queues the blocks for deletion; all the
block deletions happen asynchronously. The user trusts HDFS that the data will
be cleaned up, and we don't have a status-reporting API for that. If we change
the replication factor, we change it at the NN and replication is eventually
triggered; I don't think users poll on whether replication is done or not. As
it is HDFS functionality to replicate, they just rely on it. If replications
are failing, then admin action is definitely required to fix them; usually
admins depend on fsck or metrics. Let's discuss more on that JIRA,
HDFS-12310?
In the end, I am not saying we should not have status reporting; I feel
that's a good-to-have requirement.
Do you have some use cases for how the application system (e.g. HBase;
[~anoopsamjohn] has provided some use cases above for using SPS) would react
to status results?

{quote}
If I were to paraphrase, the NN is the ultimate arbiter, and the operations 
being performed by C-DNs are idempotent, so duplicate work gets dropped safely. 
I think this still makes it harder to reason about from a debugging POV, 
particularly if we want to extend this to something like EC conversion that 
might not be idempotent.
{quote}
We already do EC reconstruction work in a way similar to the C-DN approach:
all blocks of a block group are reconstructed at one DN, and there too that
node is chosen loosely. Here we just name it a C-DN and send more blocks as a
logical batch (in this case, all blocks associated with a file); in the EC
case, we send the blocks of a block group. Coming to idempotency, even today
EC reconstruction is done in an idempotent way. I feel we can definitely
handle those cases: conversion of the whole file should complete, and only
then can we convert contiguous blocks to striped mode at the NN. Whichever
finishes first gets updated at the NN; once the NN has converted the blocks,
it should not accept newly converted block groups. But that should be a
separate discussion anyway. I just wanted to point out another use case,
HDFS-12090; I see that JIRA wants to adopt this model for moving work.

{quote}
I like the idea of offloading work in the abstract, but I don't know how much 
work we really offload in this situation. The NN still needs to track 
everything at the file level, which is the same order of magnitude as the block 
level. The NN is still doing blockmanagement and processing IBRs for the block 
movement. Distributing tracking work to the C-DNs adds latency and makes the 
system more complicated.
{quote}
I don't really see any extra latency involved. Either way, work has to be
sent to DNs individually. Along with that, we send a batch to one DN first;
that DN does its own work and also asks the other DNs to transfer the blocks.
Handling things at the block level still keeps the requirement of tracking

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-08-16 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129525#comment-16129525
 ] 

Uma Maheswara Rao G edited comment on HDFS-10285 at 8/17/17 5:22 AM:
-

Hi [~andrew.wang] Thank you for helping us a lot in reviews. Really great 
points.
{quote}
This would be a user periodically asking for status. From what I know of async 
API design, callbacks are preferred over polling since it solves the question 
about how long the server needs to hold the status.
I'd be open to any proposal here, I just think the current "isSpsRunning" API 
is insufficient. Did you end up filing a ticket to track this?
{quote}
From an async API design perspective, I agree that such systems would have
callback-registration APIs. I think we don't have that callback mechanism
designed in HDFS. In this particular case, we don't actually process anything
the user is waiting on; this is just a trigger telling the system to start
some built-in functionality. In fact, the isSpsRunning API was added just so
users can make sure the built-in SPS is not running if they want to run the
Mover tool explicitly. I filed JIRA HDFS-12310 to discuss this further. I
really don't know whether it's a good idea to encourage users to periodically
poll the system for this status. IMO, if movements are really failing
(probably some storages are unavailable or have failed, etc.), administrator
action is definitely required, rather than the user component knowing the
status and taking action itself. So I strongly believe that reporting failures
as metrics will definitely get the admin's attention. Since we don't want to
enable automatic movement in the first stage, there should be some trigger to
start the movement. Some work related to async HDFS APIs is happening at
HDFS-9924; perhaps we could take some design ideas from there, once they are
in, for a status API?
Another argument is that we already have async-style APIs, for example delete
or setReplication. From the NN-call perspective they may be sync calls, but
from the user's perspective a lot of work still happens asynchronously. If we
delete a file, the NN cleans up and queues the blocks for deletion; all the
block deletions happen asynchronously. The user trusts HDFS that the data will
be cleaned up, and we don't have a status-reporting API for that. If we change
the replication factor, we change it at the NN and replication is eventually
triggered; I don't think users poll on whether replication is done or not. As
it is HDFS functionality to replicate, they just rely on it. If replications
are failing, then admin action is definitely required to fix them; usually
admins depend on fsck or metrics. Let's discuss more on that JIRA,
HDFS-12310?
In the end, I am not saying we should not have status reporting; I feel
that's a good-to-have requirement.
Do you have some use cases for how the application system (e.g. HBase;
[~anoopsamjohn] has provided some use cases above for using SPS) would react
to status results?

{quote}
If I were to paraphrase, the NN is the ultimate arbiter, and the operations 
being performed by C-DNs are idempotent, so duplicate work gets dropped safely. 
I think this still makes it harder to reason about from a debugging POV, 
particularly if we want to extend this to something like EC conversion that 
might not be idempotent.
{quote}
We already do EC reconstruction work in a way similar to the C-DN approach:
all blocks of a block group are reconstructed at one DN, and there too that
node is chosen loosely. Here we just name it a C-DN and send more blocks as a
logical batch (in this case, all blocks associated with a file); in the EC
case, we send the blocks of a block group. Coming to idempotency, even today
EC reconstruction is done in an idempotent way. I feel we can definitely
handle those cases: conversion of the whole file should complete, and only
then can we convert contiguous blocks to striped mode at the NN. Whichever
finishes first gets updated at the NN; once the NN has converted the blocks,
it should not accept newly converted block groups. But that should be a
separate discussion anyway. I just wanted to point out another use case,
HDFS-12090; I see that JIRA wants to adopt this model for moving work.

{quote}
I like the idea of offloading work in the abstract, but I don't know how much 
work we really offload in this situation. The NN still needs to track 
everything at the file level, which is the same order of magnitude as the block 
level. The NN is still doing blockmanagement and processing IBRs for the block 
movement. Distributing tracking work to the C-DNs adds latency and makes the 
system more complicated.
{quote}
I don't really see any extra latency involved. Either way, work has to be
sent to DNs individually. Along with that, we send a batch to one DN first;
that DN does its own work and also asks the other DNs to transfer the blocks.
Handling things at the block level still keeps the requirement of tracking

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-07-31 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106283#comment-16106283
 ] 

Uma Maheswara Rao G edited comment on HDFS-10285 at 7/31/17 5:46 PM:
-

[~andrew.wang] Thanks a lot Andrew for spending time on review and for very 
valuable comments.

Please find my replies to the comments/questions.
{quote}The "-satisfyStoragePolicy" command is asynchronous. One difficulty for 
async APIs is status reporting. "-isSpsRunning" doesn't give much insight. 
How does a client track the progress of their request? How are errors 
propagated? A client like HBase can't read the NN log to find a stacktrace. 
Section 5.3 lists some possible errors for block movement on the DN. It might 
be helpful to think about NN-side errors too: out of quota, out of capacity, 
other BPP failure, slow/stuck SPS tasks, etc.
{quote}
Interesting question, and we thought about this, but it's pretty hard to
communicate statuses back to the user. IMO, this async API is basically a
facility for the user to trigger HDFS to start satisfying the blocks as per
the storage policy that has been set. For example, if we enable automatic
movement in the future, error statuses will not be reported to users; it is
HDFS's responsibility to satisfy the policy as far as possible once it has
changed.
One possible way for admins to notice failures would be via metrics reporting.
I am also thinking of providing an option in the fsck command to check the
current pending/in-progress status. I understand this kind of status tracking
may be useful for SSM-like systems to act upon, say by raising alarm alerts,
etc. But an HBase-like system may not take any action in its business logic
even if the movement statuses are failures. Right now, HDFS itself will keep
retrying until the policy is satisfied.
{quote}It might be helpful to think about NN-side errors too: out of quota, out 
of capacity, other BPP failure, slow/stuck SPS tasks, etc.{quote}
Sure, let me think about whether there are such conditions. Ideally SPS does
not deal with namespace changes (except adding the Xattr for internal-use
purposes), but it does move data to different volumes at the DN. We will also
consider collecting possible metrics from the NN side, specifically for ERROR
conditions.

{quote}Rather than using the acronym (which a user might not know), maybe 
rename "-isSpsRunning" to "-isSatisfierRunning" ?{quote}
Makes sense; we will change that.

{quote}How is leader election done for the C-DN? Is there some kind of lease 
system so an old C-DN aborts if it can't reach the NN? This prevents split 
brain.{quote}
Here we choose the C-DN loosely: we just pick the first source in the list.
The C-DN sends back IN_PROGRESS pings every 5 minutes (via heartbeat). If
there are no IN_PROGRESS pings and the
dfs.storage.policy.satisfier.self.retry.timeout.millis timeout elapses, the NN
will simply choose another C-DN and reschedule. Even if the older C-DN comes
back, on re-registration we send a dropSPSWork request to the DNs, which
prevents two C-DNs from running.
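A minimal, self-contained sketch of the NN-side rescheduling rule described
above; the class and method names are illustrative assumptions rather than the
actual branch code, and only the timeout behaviour mirrors this comment.
{code}
// Illustrative sketch of the timeout/reschedule rule above; not the branch code.
import java.util.concurrent.TimeUnit;

class SpsAttemptTracker {
  // Mirrors dfs.storage.policy.satisfier.self.retry.timeout.millis; the value
  // here is an assumption for the sketch, not the actual default.
  private final long retryTimeoutMs = TimeUnit.MINUTES.toMillis(5);
  private long lastInProgressReportMs;

  void onInProgressHeartbeat(long nowMs) {
    lastInProgressReportMs = nowMs;   // C-DN pinged IN_PROGRESS via heartbeat
  }

  boolean shouldRescheduleToAnotherCdn(long nowMs) {
    // No IN_PROGRESS ping within the timeout: drop this C-DN, pick another one.
    return nowMs - lastInProgressReportMs > retryTimeoutMs;
  }
}
{code}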

{quote}Any plans to trigger the satisfier automatically on events like rename 
or setStoragePolicy? When I explain HSM to users, they're often surprised that 
they need to trigger movement manually. Here, it's easier since it's 
Mover-as-a-service, but still manually triggered.
{quote}
Actually, this is our long-term plan. To simplify the solution, we are
targeting a first phase with manual triggering. Once the current code base is
performing well and is stable enough, we will enable automatic triggering as
follow-up work. To avoid missing requirements, I will add this task to a
follow-up JIRA.


{quote}Docs say that right now the user has to trigger SPS tasks recursively 
for a directory. Why? I believe the Mover works recursively. xiaojian is doing 
some work on HDFS-10899 that involves an efficient recursive directory 
iteration, maybe can take some ideas from there.
{quote}
IIRC, we actually restricted the recursive operation intentionally; we wanted
to be careful about NN overheads. If some user accidentally calls it on the
root directory, it may trigger a lot of unnecessary overlapping scans.
In the Mover's case it runs outside, so all the scan overheads stay outside
the NN. So here, if the user really requires recursive policy satisfaction,
they can drive it recursively from the client side (which can't happen
accidentally), as sketched below.
I agree that allowing recursion would make things much easier for users who
need recursive execution; the only constraint we had in mind was keeping the
operation as lightweight as possible.
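A hedged client-side sketch of the point above: the recursion can stay on the
client, which walks the directory and triggers satisfy per path, so the
listing cost never lands on the NN. It assumes the satisfyStoragePolicy(Path)
method exposed by this feature branch; the path is illustrative.
{code}
// Hedged sketch: recursive satisfy driven from the client, as discussed above.
// Assumes the DistributedFileSystem#satisfyStoragePolicy(Path) branch method.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class RecursiveSatisfy {
  public static void main(String[] args) throws Exception {
    Path dir = new Path("/warm/data");   // illustrative path
    DistributedFileSystem dfs =
        (DistributedFileSystem) dir.getFileSystem(new Configuration());
    RemoteIterator<LocatedFileStatus> files = dfs.listFiles(dir, true);
    while (files.hasNext()) {
      // Trigger policy satisfaction file by file; the NN only marks each inode.
      dfs.satisfyStoragePolicy(files.next().getPath());
    }
  }
}
{code}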

Also, I looked at HDFS-10899. If I understand correctly, zones are already
available there and re-encryption expects the path to be the root of an
existing zone, which makes life easier in that case. {code}
+   * Re-encrypts the given encryption zone path. If the given path is not the
+   * root of an encryption zone, an exception is thrown.
+   */
+  XAttr reencryptEncryptionZone(final 

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-07-31 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106283#comment-16106283
 ] 

Uma Maheswara Rao G edited comment on HDFS-10285 at 7/31/17 5:40 PM:
-

[~andrew.wang] Thanks a lot Andrew for spending time on review and for very 
valuable comments.

Please find my replies to the comments/questions.
{quote}The "-satisfyStoragePolicy" command is asynchronous. One difficulty for 
async APIs is status reporting. "-isSpsRunning" doesn't give much insight. 
How does a client track the progress of their request? How are errors 
propagated? A client like HBase can't read the NN log to find a stacktrace. 
Section 5.3 lists some possible errors for block movement on the DN. It might 
be helpful to think about NN-side errors too: out of quota, out of capacity, 
other BPP failure, slow/stuck SPS tasks, etc.
{quote}
Interesting question, and we thought about this, but it's pretty hard to
communicate statuses back to the user. IMO, this async API is basically a
facility for the user to trigger HDFS to start satisfying the blocks as per
the storage policy that has been set. For example, if we enable automatic
movement in the future, error statuses will not be reported to users; it is
HDFS's responsibility to satisfy the policy as far as possible once it has
changed.
One possible way for admins to notice failures would be via metrics reporting.
I am also thinking of providing an option in the fsck command to check the
current pending/in-progress status. I understand this kind of status tracking
may be useful for SSM-like systems to act upon, say by raising alarm alerts,
etc. But an HBase-like system may not take any action in its business logic
even if the movement statuses are failures. Right now, HDFS itself will keep
retrying until the policy is satisfied.
{quote}It might be helpful to think about NN-side errors too: out of quota, out 
of capacity, other BPP failure, slow/stuck SPS tasks, etc.{quote}
Sure, let me think about whether there are such conditions. Ideally SPS does
not deal with namespace changes (except adding the Xattr for internal-use
purposes), but it does move data to different volumes at the DN. We will also
consider collecting possible metrics from the NN side, specifically for ERROR
conditions.

{quote}Rather than using the acronym (which a user might not know), maybe 
rename "-isSpsRunning" to "-isSatisfierRunning" ?{quote}
Makes sense; we will change that.

{quote}How is leader election done for the C-DN? Is there some kind of lease 
system so an old C-DN aborts if it can't reach the NN? This prevents split 
brain.{quote}
Here we choose the C-DN loosely: we just pick the first source in the list.
The C-DN sends back IN_PROGRESS pings every 5 minutes (via heartbeat). If
there are no IN_PROGRESS pings and the
dfs.storage.policy.satisfier.self.retry.timeout.millis timeout elapses, the NN
will simply choose another C-DN and reschedule. Even if the older C-DN comes
back, on re-registration we send a dropSPSWork request to the DNs, which
prevents two C-DNs from running.

{quote}Any plans to trigger the satisfier automatically on events like rename 
or setStoragePolicy? When I explain HSM to users, they're often surprised that 
they need to trigger movement manually. Here, it's easier since it's 
Mover-as-a-service, but still manually triggered.
{quote}
Actually, this is our long-term plan. To simplify the solution, we are
targeting a first phase with manual triggering. Once the current code base is
performing well and is stable enough, we will enable automatic triggering as
follow-up work. To avoid missing requirements, I will add this task to a
follow-up JIRA.


{quote}Docs say that right now the user has to trigger SPS tasks recursively 
for a directory. Why? I believe the Mover works recursively. xiaojian is doing 
some work on HDFS-10899 that involves an efficient recursive directory 
iteration, maybe can take some ideas from there.
{quote}
IIRC, we actually restricted the recursive operation intentionally; we wanted
to be careful about NN overheads. If some user accidentally calls it on the
root directory, it may trigger a lot of unnecessary overlapping scans.
In the Mover's case it runs outside, so all the scan overheads stay outside
the NN. So here, if the user really requires recursive policy satisfaction,
they can do it recursively themselves (which can't happen accidentally).
I agree that allowing recursion would make things much easier for users who
need recursive execution; the only constraint we had in mind was keeping the
operation as lightweight as possible.

Also, I looked at HDFS-10899. If I understand correctly, zones are already
available there and re-encryption expects the path to be the root of an
existing zone, which makes life easier in that case. {code}
+   * Re-encrypts the given encryption zone path. If the given path is not the
+   * root of an encryption zone, an exception is thrown.
+   */
+  XAttr reencryptEncryptionZone(final 

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-07-28 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104722#comment-16104722
 ] 

Rakesh R edited comment on HDFS-10285 at 7/28/17 9:44 AM:
--

Thanks a lot [~andrew.wang] for the reviews.

bq. Should dfs.storage.policy.satisfier.activate default to false for now?
During NN startup, the SPS feature initializes the satisfier thread, which
stays idle until a user asks to satisfy the storage policy for a given path.
So there won't be any overhead on the Namenode process while the thread is
idle. Also, the system test report (attached to the jira) and the Jenkins
unit-test results show the feature is stable. Since we have a dynamic on/off
switch via reconfig, I'm OK with disabling it by default if there is any
concern; should I?

bq. Might also rename this to "enabled" rather than "activate" to align with 
other previous config keys.
Agreed, I will raise a sub-task and change the configuration name asap.

bq. What happens during a rolling upgrade? Will DNs ignore the unknown message, 
and NN handle this correctly?
Yes, the DN will ignore the unknown message. On the other side, the NN will
wait a configured amount of time for the block-movement response; after this
period, if there is no response, the NN will retry scheduling the block
movement. So there are no issues with rolling upgrade.

bq. On downgrade, I assume the xattrs just stay there ignored.
Yes, exactly; they will be ignored.


was (Author: rakeshr):
Thanks a lot [~andrew.wang] for the reviews.

bq. Should dfs.storage.policy.satisfier.activate default to false for now?
During NN startup, SPS feature initializes satisfy thread and will stay idle 
until user tells to satisfy storage policy for a given path. So, there won't be 
much overhead to the Namenode process as the thread will be idle. Also, the 
system test report(attached to the jira) and Jenkins unit testing results shows 
the feature is stable. Since we have dynamic switch on/off mechanism via 
reconfig, I'm OK to disable by default if there is any concern, should I ?

bq. Might also rename this to "enabled" rather than "activate" to align with 
other previous config keys.
Agreed, I will raise a sub-task and change the configuration name asap.

bq. What happens during a rolling upgrade? Will DNs ignore the unknown message, 
and NN handle this correctly?
Yes, DN will ignore the unknown message. On the othe other side, NN will wait 
for certain configured amount of time for the block movement response. After 
this period, if there is no response NN will retry scheduling the block 
movement. So, no issues with rolling upgrade.

bq. On downgrade, I assume the xattrs just stay there ignored.
Yes, exactly it will be ignored.

> Storage Policy Satisfier in Namenode
> 
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Affects Versions: HDFS-10285
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch, 
> HDFS-10285-consolidated-merge-patch-01.patch, 
> HDFS-SPS-TestReport-20170708.pdf, 
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, 
> Storage-Policy-Satisfier-in-HDFS-May10.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policy. These 
> policies can be set on directory/file to specify the user preference, where 
> to store the physical block. When user set the storage policy before writing 
> data, then the blocks could take advantage of storage policy preferences and 
> stores physical block accordingly. 
> If user set the storage policy after writing and completing the file, then 
> the blocks would have been written with default storage policy (nothing but 
> DISK). User has to run the ‘Mover tool’ explicitly by specifying all such 
> file names as a list. In some distributed system scenarios (ex: HBase) it 
> would be difficult to collect all the files and run the tool as different 
> nodes can write files separately and file can have different paths.
> Another scenarios is, when user rename the files from one effected storage 
> policy file (inherited policy from parent directory) to another storage 
> policy effected directory, it will not copy inherited storage policy from 
> source. So it will take effect from destination file/dir parent storage 
> policy. This rename operation is just a metadata change in Namenode. The 
> physical blocks still remain with source storage policy.
> So, Tracking all such business logic based file names could be difficult for 
> admins from distributed nodes(ex: region servers) and running the Mover tool. 
> Here the proposal is to provide an API from Namenode itself for trigger the 
> storage policy satisfaction. 

[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2017-07-15 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16087186#comment-16087186
 ] 

Rakesh R edited comment on HDFS-10285 at 7/16/17 5:09 AM:
--

Thank you to all the contributors who made this feature happen. I have
finished a pass rebasing all the changes made in the HDFS-10285 sub-tasks. I'm
uploading the consolidated patch to the umbrella jira to get the QA report.


was (Author: rakeshr):
Thank you all the contributors in making this feature. I have finished a pass 
rebasing all the changes made in HDFS-10285 sub-tasks. Uploading the 
consolidated patch to the umbrella jira to get the QA report.

Thanks [~umamaheswararao] for the offline discussions.

> Storage Policy Satisfier in Namenode
> 
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Affects Versions: HDFS-10285
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch, 
> HDFS-SPS-TestReport-20170708.pdf, 
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, 
> Storage-Policy-Satisfier-in-HDFS-May10.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policy. These 
> policies can be set on directory/file to specify the user preference, where 
> to store the physical block. When user set the storage policy before writing 
> data, then the blocks could take advantage of storage policy preferences and 
> stores physical block accordingly. 
> If user set the storage policy after writing and completing the file, then 
> the blocks would have been written with default storage policy (nothing but 
> DISK). User has to run the ‘Mover tool’ explicitly by specifying all such 
> file names as a list. In some distributed system scenarios (ex: HBase) it 
> would be difficult to collect all the files and run the tool as different 
> nodes can write files separately and file can have different paths.
> Another scenarios is, when user rename the files from one effected storage 
> policy file (inherited policy from parent directory) to another storage 
> policy effected directory, it will not copy inherited storage policy from 
> source. So it will take effect from destination file/dir parent storage 
> policy. This rename operation is just a metadata change in Namenode. The 
> physical blocks still remain with source storage policy.
> So, Tracking all such business logic based file names could be difficult for 
> admins from distributed nodes(ex: region servers) and running the Mover tool. 
> Here the proposal is to provide an API from Namenode itself for trigger the 
> storage policy satisfaction. A Daemon thread inside Namenode should track 
> such calls and process to DN as movement commands. 
> Will post the detailed design thoughts document soon. 






[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2016-08-02 Thread Yuanbo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403588#comment-15403588
 ] 

Yuanbo Liu edited comment on HDFS-10285 at 8/2/16 8:17 AM:
---

[~umamaheswararao] Great proposal, thanks for your work.
I have two questions about your design:
1.
{quote}
When user calls satisfyStoragePolicy(src) API
{quote}
Is this API only available from a Java program, or is it also invoked by
default when the user runs this command?
{code}
hdfs storagepolicies -setStoragePolicy -path  -policy 
{code}
2.
What if inodes exist in toBeSatisfiedStoragePolicyList while the "Mover tool"
is operating on the directory that contains those inodes?


was (Author: yuanbo):
[~umamaheswararao] Great proposal, thanks for your work.
I have two questions about your design:
1.
{quote}
When user calls satisfyStoragePolicy(src) API
{quote}
Is this API only available to Java programs, or is it also invoked by default 
when the user runs this command?
{code}
hdfs storagepolicies -setStoragePolicy -path <path> -policy <policy>
{code}
2.
What happens if inodes exist in toBeSatisfiedStoragePolicyList while the "mover 
tool" is operating on the directory which contains those inodes?
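
Regarding question 1 above, here is a hedged sketch of how the proposed 
satisfyStoragePolicy(src) call might look from a Java client, assuming the API ends 
up exposed on DistributedFileSystem as the design doc suggests; the directory path 
is hypothetical.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SatisfyStoragePolicyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical directory whose storage policy was changed after the
    // data was written (e.g. by a region server).
    Path src = new Path("/hbase/data/archive");

    // Assumes the default filesystem is HDFS, so the cast below succeeds.
    try (FileSystem fs = FileSystem.get(conf)) {
      // The proposed API asks the Namenode to schedule block movements so
      // that existing replicas match the directory's current policy.
      ((DistributedFileSystem) fs).satisfyStoragePolicy(src);
    }
  }
}
{code}
Whether the -setStoragePolicy CLI command should also trigger this call implicitly 
is exactly the open point of question 1.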

> Storage Policy Satisfier in Namenode
> 
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Affects Versions: 2.7.2
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Attachments: Storage-Policy-Satisfier-in-HDFS-May10.pdf
>






[jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode

2016-05-12 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281164#comment-15281164
 ] 

Uma Maheswara Rao G edited comment on HDFS-10285 at 5/13/16 4:51 AM:
-

Attached the initial version of the document. Please help review it, and we can 
improve the document based on the feedback.

Thanks [~rakeshr] for co-authoring the design doc. Thanks [~anoopsamjohn], 
[~drankye], [~ram_krish], [~jingcheng...@intel.com] for helping with the reviews.

Thanks,
Uma & Rakesh


was (Author: umamaheswararao):
Attached the initial version of the document. Please help review it, and we can 
improve the document based on the feedback.

> Storage Policy Satisfier in Namenode
> 
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Affects Versions: 2.7.2
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Attachments: Storage-Policy-Satisfier-in-HDFS-May10.pdf
>


