[
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347472#comment-16347472
]
Rakesh R edited comment on HDFS-10285 at 2/1/18 1:52 PM:
---------------------------------------------------------
Thank you very much [~daryn] for your time and the useful comments/thoughts. My
replies follow; please take a look.
+Comment-1)+
{quote}BlockManager
Shouldn’t spsMode be volatile? Although I question why it’s here.
{quote}
[Rakesh's reply] Agreed, will do the changes.
+Comment-2)+
{quote}Adding SPS methods to this class implies an unexpected coupling of the
SPS service to the block manager. Please move them out to prove it’s not
tightly coupled.
{quote}
[Rakesh's reply] Agreed. We are planning to create
{{StoragePolicySatisfyManager}} and keep all the related APIs there.
+Comment-3)+
{quote}BPServiceActor
Is it actually sending back the moved blocks? Aren’t IBRs sufficient?
BlockStorageMovementCommand/BlocksStorageMoveAttemptFinished
Again, not sure that a new DN command is necessary, and why does it
specifically report back successful moves instead of relying on IBRs? I would
actually expect the DN to be completely ignorant of a SPS move vs any other
move.
{quote}
[Rakesh's reply] We have explored the IBR approach and the required code
changes. If SPS relied on IBRs, it would need an *extra* check to know whether
a new block arrived due to an SPS move or something else, and this check would
run quite often, since other ops are far more frequent than SPS block moves.
Currently the DN sends back the {{blksMovementsFinished}} list separately, so
each finished block movement can be easily/quickly recognized by the Satisfier
on the NN side, which then updates its tracking details. If you agree this
*extra* check is not an issue, we would be happy to implement the IBR approach.
Secondly, BlockStorageMovementCommand was added to carry the block vs
src/target pairs, which is needed for the move operation and for returning the
result. On second thought, if we change the response flow to IBRs, then I
understand we could use the block transfer flow. Following is my analysis of
that:
- I can see that DNA_TRANSFER is used to send a copy of a block to another
datanode. In our case we also need to support a {{local_move}} and would need
to incorporate this {{local_move}} into the block transfer logic; I hope that
is fine?
- Secondly, we presently use {{replaceBlock}}, which has an additional
{{delHint}} notification mechanism to the NN. IIUC, {{transfer block}} doesn't
have this hint, so the namenode would delete the over-replicated block
eventually, on the next block report. Any thoughts?
{code:java}
// notify name node
final Replica r = blockReceiver.getReplica();
datanode.notifyNamenodeReceivedBlock(
    block, delHint, r.getStorageUuid(), r.isOnTransientStorage());
LOG.info("Moved " + block + " from " + peer.getRemoteAddressString()
    + ", delHint=" + delHint);
{code}
- So, internal SPS would use {{transfer block}} and external SPS would use
{{replace block}}; am I missing anything?
+Comment-4)+
{quote}DataNode
Why isn’t this just a block transfer? How is transferring between DNs any
different than across storages?
{quote}
[Rakesh's reply] Please see my reply to {{Comment-3}}; I incorporated this case
there. Thanks!
+Comment-5)+
{quote}DatanodeDescriptor
Why use a synchronized linked list to offer/poll instead of BlockingQueue?
{quote}
[Rakesh's reply] Agreed, will do the changes; a quick sketch of the change is
below.
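A minimal sketch of the replacement, with illustrative names (the real field
lives in DatanodeDescriptor, and the element type here is arbitrary):
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BlockMoveTaskQueue {
  // BlockingQueue is internally thread-safe, so offer/poll need no external
  // synchronized blocks, unlike the current synchronized LinkedList.
  private final BlockingQueue<Runnable> pendingMoveTasks =
      new LinkedBlockingQueue<>();

  /** Enqueue a task; never blocks on an unbounded queue. */
  public boolean offer(Runnable task) {
    return pendingMoveTasks.offer(task);
  }

  /** Dequeue the next task, or null when none is pending. */
  public Runnable poll() {
    return pendingMoveTasks.poll();
  }
}
{code}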
+Comment-6)+
{quote}DatanodeManager
I know it’s configurable, but realistically, when would you ever want to give
storage movement tasks equal footing with under-replication? Is there really a
use case for not valuing durability?
{quote}
[Rakesh's reply] We don't have any particular use case, though. One scenario we
considered: a user configures SSDs and they fill up quickly; in that case,
cleanup could be treated as high priority. If you feel this is not a real case,
I'm OK with removing this config so that SPS always uses only the remaining
slots.
+Comment-7)+
{quote}Adding getDatanodeStorageReport is concerning. getDatanodeListForReport
is already a very bad method that should be avoided for anything but jmx – even
then it’s a concern. I eliminated calls to it years ago. All it takes is a
nscd/dns hiccup and you’re left holding the fsn lock for an excessive length of
time. Beyond that, the response is going to be pretty large and tagging all the
storage reports is not going to be cheap.
verifyTargetDatanodeHasSpaceForScheduling does it really need the namesystem
lock? Can’t DatanodeDescriptor#chooseStorage4Block synchronize on its
storageMap?
Appears to be calling getLiveDatanodeStorageReport for every file. As mentioned
earlier, this is NOT cheap. The SPS should be able to operate on a fuzzy/cached
state of the world. Then it gets another datanode report to determine the
number of live nodes to decide if it should sleep before processing the next
path. The number of nodes from the prior cached view of the world should
suffice.
{quote}
[Rakesh's reply] Good point. Some time back Uma and I thought about the caching
part. We depend on this API for the datanode storage types and remaining-space
details. I think it requires two different mechanisms for internal and external
SPS. For internal SPS, how about referring directly to
{{DatanodeManager#datanodeMap}} for every file? For external SPS, IIUC you are
suggesting a cache mechanism. How about getting the storage report once and
caching it in ExternalContext? This local cache would be refreshed
periodically, say every 5 mins (just an arbitrary number; if you have a period
in mind, please suggest it): when getDatanodeStorageReport is called after
5 mins, the cache is treated as expired and fetched fresh; within 5 mins it is
served from the cache. Does this make sense to you? A sketch of the idea
follows below.
Another point we thought of: right now, checking whether a node has enough
space goes to the NN. With the above caching we could check the space
constraints against the cached report only, rather than going to the NN for
every DN space check. If you are fine with the above caching implementation, I
will incorporate this point as well.
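A hedged sketch of that external-SPS cache; the class name, the Supplier-based
fetcher, and the 5-min expiry are all illustrative assumptions, not the patch:
{code:java}
import java.util.function.Supplier;
import org.apache.hadoop.hdfs.server.protocol.DatanodeStorageReport;
import org.apache.hadoop.util.Time;

public class CachedStorageReports {
  private final Supplier<DatanodeStorageReport[]> fetcher; // the actual NN RPC
  private final long expiryMs;                             // e.g. 5 * 60 * 1000
  private DatanodeStorageReport[] cached;
  private long lastFetchMs = Long.MIN_VALUE;

  public CachedStorageReports(Supplier<DatanodeStorageReport[]> fetcher,
      long expiryMs) {
    this.fetcher = fetcher;
    this.expiryMs = expiryMs;
  }

  /** Serve from cache within the expiry window, otherwise refetch. */
  public synchronized DatanodeStorageReport[] getReports() {
    long now = Time.monotonicNow();
    if (cached == null || now - lastFetchMs >= expiryMs) {
      cached = fetcher.get(); // expired: fetch fresh from the NN
      lastFetchMs = now;
    }
    return cached;            // fresh enough: no NN round trip
  }
}
{code}
Space checks per DN could then read storage types and remaining space from the
same cached reports instead of making a trip to the NN for each file.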
+Comment-8)+
{quote}DFSUtil
DFSUtil.removeOverlapBetweenStorageTypes and {{DFSUtil.getSPSWorkMultiplier}}.
These aren’t generally useful methods so why are they in DFSUtil? Why aren’t
they in the only calling class StoragePolicySatisfier?
{quote}
[Rakesh's reply] Agreed, will do the changes.
+Comment-9)+
{quote}FSDirXAttrOp
Not fond of the SPS queue updates being edge triggered by xattr changes. Food
for thought: why add more rpc methods if the client can just twiddle the xattr?
{quote}
[Rakesh's reply] SPS depends on the xattr. On each user satisfy call it updates
the xattr and adds the inodeId to the {{spsPathIds}} queue. Again, on NN
restart it checks for the xattr key and, if it exists, adds the inodeId to the
{{spsPathIds}} queue. This in-memory queue is maintained for quick lookup and
is used for tracking (scheduling/completion/retry logic). With this flow, we
decided to place the {{spsPathIds}} entry addition alongside the xattr update
(sketched below). Please feel free to correct me if I missed anything.
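For clarity, a minimal sketch of that flow; the class and method names are
illustrative, not from the patch:
{code:java}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class SpsPathTracker {
  // The xattr is the durable marker; spsPathIds is the in-memory queue the
  // Satisfier polls for scheduling/completion/retry tracking.
  private final Queue<Long> spsPathIds = new ConcurrentLinkedQueue<>();

  /** On each user satisfy call: persist the xattr, then queue the inode id. */
  public void onSatisfyCall(long inodeId) {
    persistSpsXAttr(inodeId); // stand-in for the real FSDirXAttrOp update
    spsPathIds.add(inodeId);
  }

  /** On NN restart: re-queue every inode that still carries the SPS xattr. */
  public void onRestart(Iterable<Long> inodesWithSpsXAttr) {
    for (long id : inodesWithSpsXAttr) {
      spsPathIds.add(id);
    }
  }

  private void persistSpsXAttr(long inodeId) {
    // hypothetical hook; in the branch this happens alongside the xattr write
  }
}
{code}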
+Comment-10)+
{quote}FSNamesystem / NamenodeProtocolTranslatorPB
Most of the api changes appear unnecessary.
No need for the new getFileInfo when the existing getLocatedFileInfo appears to
do exactly what the new method does?
No need for the new getFilePath. It’s only used by
IntraSPSNameNodeContext#getFileInfo to get the path for an inode number,
followed by calling the new getFileInfo cited above.
IntraSPSNameNodeContext#getFileInfo swallows all IOEs, based on assumption that
any and all IOEs means FNF which probably isn’t the intention during rpc
exceptions.
Should be able to replace it all with
getLocatedFileInfo(“/.reserved/.inodes/XXX”, false) which avoids changing the
public apis.
{quote}
[Rakesh's reply] Adding a little background: to minimize memory usage, we
maintain a list of {{inodeIds}} rather than path objects, so we need to fetch
the file details by passing the {{inodeId}}, which is the reason for the new
getFilePath(inodeId). For getFileInfo, I've used the getLocatedFileInfo API in
{{ExternalSPSContext#getFileInfo}}. {{IntraSPSNameNodeContext}} keeps a
Namesystem reference, and the getFileInfo API does not exist on it. Can SPS
keep an FSNamesystem ref instead of the interface ref? We will also do some
more research on this part; a sketch of your suggestion as I read it is below.
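A hedged sketch of the suggested replacement, assuming the
getLocatedFileInfo(String, boolean) signature from the comment; the helper
name and the FNF handling are illustrative only:
{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.hdfs.DFSClient;
import org.apache.hadoop.hdfs.protocol.HdfsFileStatus;

public final class InodePathLookup {
  private InodePathLookup() {}

  /** Resolve file info for an inode id via the reserved-inodes path. */
  public static HdfsFileStatus getFileInfo(DFSClient client, long inodeId)
      throws IOException {
    final String reservedPath = "/.reserved/.inodes/" + inodeId;
    try {
      return client.getLocatedFileInfo(reservedPath, false);
    } catch (FileNotFoundException e) {
      return null; // only FNF means "gone"; other IOEs still propagate
    }
  }
}
{code}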
+Comment-11)+
{quote}HdfsServerConstants
The xattr is called user.hdfs.sps.xattr. Why does the xattr name actually
contain the word “xattr”?
{quote}
[Rakesh's reply] Sure, will remove “xattr” word.
+Comment-12)+
{quote}NameNode
Super trivial but using the plural pronoun “we” in this exception message is
odd. Changing the value isn’t a joint activity.
For enabling or disabling storage policy satisfier, we must pass either
none/internal/external string value only
{quote}
[Rakesh's reply] oops, sorry for the mistake. Will change it.
+Comment-13)+
{quote}StoragePolicySatisfier
It appears to make back-to-back calls to hasLowRedundancyBlocks and
getFileInfo for every file. Haven’t fully grokked the code, but if low
redundancy is not the common case, then it shouldn’t be called unless/until
needed. It looks like files that are under replicated are re-queued again?
{quote}
[Rakesh's reply] Yes, under-replicated files are queued up again. For external
SPS, I can think of another approach: iterate over all the blocks and check the
block location array length; if it is < the replication factor, mark the file
as under-replicated (see the sketch below). Does this make sense to you?
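A minimal sketch of that check (names illustrative; striped/EC files would need
different handling):
{code:java}
import org.apache.hadoop.hdfs.protocol.LocatedBlock;
import org.apache.hadoop.hdfs.protocol.LocatedBlocks;

public final class ReplicationCheck {
  private ReplicationCheck() {}

  /** True if any block has fewer locations than the expected replication. */
  public static boolean isUnderReplicated(LocatedBlocks blocks,
      short expectedReplication) {
    for (LocatedBlock lb : blocks.getLocatedBlocks()) {
      if (lb.getLocations().length < expectedReplication) {
        return true;
      }
    }
    return false;
  }
}
{code}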
+Comment-14)+
{quote}Appears to be calling getStoragePolicy for every file. It’s not like the
byte values change all the time, why call and cache all of them via
FSN#getStoragePolicies?
{quote}
[Rakesh's reply] Below is the current logic we use. Sorry, I failed to
understand the need to cache all the policies. Could you please elaborate a
bit? Thanks!
{code:java}
// ExternalSPSContext.java
@Override
public BlockStoragePolicy getStoragePolicy(byte policyId) {
  return createDefaultSuite.getPolicy(policyId);
}

// IntraSPSNameNodeContext.java
public BlockStoragePolicy getStoragePolicy(byte policyID) {
  return blockManager.getStoragePolicy(policyID);
}
{code}
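If the intent is simply to avoid the per-file lookup, here is a hedged sketch
of what caching all policies once might look like (illustrative names; priming
presumably via FSN#getStoragePolicies, as you mention):
{code:java}
import org.apache.hadoop.hdfs.protocol.BlockStoragePolicy;

public class CachedPolicyLookup {
  // Policy ids are bytes, so a 256-slot array covers every possible id.
  private final BlockStoragePolicy[] byId = new BlockStoragePolicy[256];

  /** Prime the cache once from the full policy list. */
  public CachedPolicyLookup(BlockStoragePolicy[] allPolicies) {
    for (BlockStoragePolicy p : allPolicies) {
      byId[p.getId() & 0xFF] = p;
    }
  }

  /** The per-file lookup becomes a local array index, no call into FSN/BM. */
  public BlockStoragePolicy getStoragePolicy(byte policyId) {
    return byId[policyId & 0xFF];
  }
}
{code}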
+Comment-15)+
{quote}FSDirSatisfyStoragePolicyOp
satisfyStoragePolicy errors if the xattr is already present. Should this be a
no-op? A client re-requesting a storage policy correction probably shouldn't
fail.
unprotectedSatisfyStoragePolicy is called prior to xattr updates, which calls
addSPSPathId. To avoid race conditions or inconsistent state if the xattr
fails, should call addSPSPathId after xattrs are successfully updated.
inodeHasSatisfyXAttr calls getXAttrFeature then immediately shorts out if the
inode isn't a file. Should do file check first to avoid unnecessary computation.
In general, not fond of unnecessary guava. Instead of
newArrayListWithCapacity + add, standard Arrays.asList(item) is more succinct.
{quote}
[Rakesh's reply] Agreed, will do the changes; a quick sketch for the guava
point is below.
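A tiny sketch of the suggested simplification (XAttr construction elided):
{code:java}
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.fs.XAttr;

public final class XAttrListExample {
  private XAttrListExample() {}

  // Arrays.asList returns a fixed-size list, which is fine here since the
  // single-element list is not mutated afterwards.
  public static List<XAttr> singleXAttrList(XAttr satisfyXAttr) {
    return Arrays.asList(satisfyXAttr);
  }
}
{code}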
+Comment-16)+
{quote}FSDirStatAndListOp
Not sure why javadoc was changed to add needLocation. It's already present and
now doubled up.
{quote}
[Rakesh's reply] Agreed, will correct it.
+Comment-17)+
{quote}// TODO: Needs to manage the number of concurrent moves per DataNode.
Shouldn't this be fixed now?
{quote}
[Rakesh's reply] I think we have throttling in place on the namenode side, so
less work reaches the datanode. By default
{{dfs.namenode.replication.max-streams-hard-limit}} is 2, so only that delta of
move tasks goes to a DN in one heartbeat interval (3 sec). We have skipped the
case of a very large hard-limit value, which is the reason for postponing this
task.
+Comment-18)+
{quote}DFS_MOVER_MOVERTHREADS_DEFAULT is 1000 per DN? If the DN is concurrently
doing 1000 moves, it's not in a good state, disk io is probably saturated, and
this will only make it much worse. 10 is probably more than sufficient.
{quote}
[Rakesh's reply] Agreed, will make it a smaller value, 10.
+Comment-19)+
{quote}The executor uses a CallerRunsPolicy? If the thread pool is exhausted,
WHY should the heartbeat thread process the operation? That's a terrible idea.
{quote}
[Rakesh's reply] Agreed, will remove the CallerRunsPolicy and keep it simple.
Does this make sense to you?
{code:java}
ThreadPoolExecutor moverThreadPool = new ThreadPoolExecutor(1, num, 60,
    TimeUnit.SECONDS, new SynchronousQueue<Runnable>(),
    new Daemon.DaemonFactory() {
      private final AtomicInteger threadIndex = new AtomicInteger(0);

      @Override
      public Thread newThread(Runnable r) {
        Thread t = super.newThread(r);
        t.setName("BlockMoverTask-" + threadIndex.getAndIncrement());
        return t;
      }
    });
{code}
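Note: with a SynchronousQueue and no explicit rejection handler, the pool above
falls back to ThreadPoolExecutor's default AbortPolicy, so once all {{num}}
threads are busy a new submit throws RejectedExecutionException instead of
running on the heartbeat thread; the submitting code would need to catch that
and retry or drop the task.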
+Comment-20)+
{quote}I was under the impression the DN did an optimized movement between its
storages? Does it actually make a connection to itself? If yes, when the DN
xceiver pool is very busy, the mover thread pool will exhaust and cause the
heartbeat thread to block.
{quote}
[Rakesh's reply] Good catch. Agreed, we can optimize by doing a local move and
will make the changes. FYI, initially we implemented C-DN to track all block
moves, but we later refactored the code and somehow missed the local-movement
logic.
+Comment-21)+
{quote}BlockStorageMovementTracker
Many data structures are riddled with non-threadsafe race conditions and risk
of CMEs.
Ex. The moverTaskFutures map. Adding new blocks and/or adding to a block's list
of futures is synchronized. However the run loop does an unsynchronized block
get, unsynchronized future remove, unsynchronized isEmpty, possibly another
unsynchronized get, only then does it do a synchronized remove of the block.
The whole chunk of code should be synchronized.
Is the problematic moverTaskFutures even needed? It's aggregating futures
per-block for seemingly no reason. Why track all the futures at all instead of
just relying on the completion service? As best I can tell:
It's only used to determine if a future from the completion service should be
ignored during shutdown. Shutdown sets the running boolean to false and clears
the entire datastructure so why not use the running boolean like a check just a
little further down?
As synchronization to sleep up to 2 seconds before performing a blocking
moverCompletionService.take, but only when it thinks there are no active
futures. I'll ignore the missed notify race that the bounded wait masks, but
the real question is why not just do the blocking take?
Why all the complexity? Am I missing something?
BlocksMovementsStatusHandler
Suffers same type of thread safety issues as StoragePolicySatisfyWorker. Ex.
blockIdVsMovementStatus is inconsistently synchronized. It does synchronize to
return an unmodifiable list, which sadly does nothing to protect the caller
from CME.
handle is iterating over a non-thread safe list.
{quote}
[Rakesh's reply] Agreed, will correct it and address the concerns; a sketch of
the simplified tracker loop is below.
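A minimal sketch of the simplification described above, assuming the per-block
future map is dropped entirely; names are illustrative, not the final patch:
{code:java}
import java.util.concurrent.CompletionService;
import java.util.concurrent.Future;

public class BlockMoveResultLoop implements Runnable {
  private final CompletionService<Boolean> moverCompletionService;
  private volatile boolean running = true;

  public BlockMoveResultLoop(CompletionService<Boolean> svc) {
    this.moverCompletionService = svc;
  }

  public void stop() {
    running = false;
  }

  @Override
  public void run() {
    while (running) {
      try {
        Future<Boolean> f = moverCompletionService.take(); // blocking take
        if (!running) {
          break;               // shutting down: ignore the result
        }
        handleResult(f.get()); // single consumer: no shared mutable map
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      } catch (Exception e) {
        // a failed move: log/handle it, keep the loop alive
      }
    }
  }

  private void handleResult(boolean moveSucceeded) {
    // stand-in for reporting the finished move back to the NN
  }
}
{code}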
+Comment-22)+
{quote}StoragePolicySatisfier
Should handleException use a double-checked lock to avoid synchronization?
Unexpected exceptions should be a rarity, right?
Speaking of which, it’s not safe to ignore all Throwable in the run loop! You
have no idea if data structures are in a sane or consistent state.
{quote}
[Rakesh's reply] Agreed, will do the changes; a quick sketch of the
double-checked pattern is below.
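A hedged sketch of the double-checked pattern as I understand the suggestion
(illustrative names):
{code:java}
public class SatisfierErrorState {
  private volatile boolean unexpectedError = false;
  private final Object lock = new Object();

  public void handleException(Throwable t) {
    if (!unexpectedError) {       // cheap unsynchronized check on the hot path
      synchronized (lock) {
        if (!unexpectedError) {   // re-check under the lock
          unexpectedError = true;
          // record/log the first unexpected failure here
        }
      }
    }
  }
}
{code}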
+Comment-23)+
{quote}FSTreeTraverser
{quote}
This SPS branch doesn't implement any new logic for tree traversal; it simply
uses the slice tree traversal logic implemented in the HDFS-10899 jira. We
attempted to generalize the existing code so that HDFS-10285, HDFS-12090 and
the EDEK feature can all make use of it. I'd appreciate it if you could add
your comments/thoughts to the improvement task HDFS-13095 to correct the
existing code in trunk; that would really help us avoid maintaining those
changes in this branch. If some bug specific to SPS comes up during the
discussion in that jira, we will surely fix it to improve the SPS logic.
{quote}traverseDir may NPE if it's traversing a tree in a snapshot and one of
the ancestors is deleted.
{quote}
Good catch; as per the analysis/discussion in HDFS-13095, this is a special
case that applies only to SPS, and we will fix it in SPS.
> Storage Policy Satisfier in Namenode
> ------------------------------------
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
> Affects Versions: HDFS-10285
> Reporter: Uma Maheswara Rao G
> Assignee: Uma Maheswara Rao G
> Priority: Major
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch,
> HDFS-10285-consolidated-merge-patch-01.patch,
> HDFS-10285-consolidated-merge-patch-02.patch,
> HDFS-10285-consolidated-merge-patch-03.patch,
> HDFS-10285-consolidated-merge-patch-04.patch,
> HDFS-10285-consolidated-merge-patch-05.patch,
> HDFS-SPS-TestReport-20170708.pdf, SPS Modularization.pdf,
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf,
> Storage-Policy-Satisfier-in-HDFS-May10.pdf,
> Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policy. These
> policies can be set on a directory/file to specify the user's preference for
> where the physical blocks should be stored. When the user sets the storage
> policy before writing data, the blocks can take advantage of the storage
> policy preferences and are stored accordingly.
> If the user sets the storage policy after writing and completing the file,
> the blocks would already have been written with the default storage policy
> (nothing but DISK). The user has to run the 'Mover tool' explicitly,
> specifying all such file names as a list. In some distributed-system
> scenarios (ex: HBase) it would be difficult to collect all the files and run
> the tool, as different nodes can write files separately and files can have
> different paths.
> Another scenario: when the user renames a file from a directory with one
> effective storage policy (inherited from the parent directory) into a
> directory with another storage policy, the inherited storage policy is not
> copied from the source; the file instead takes the storage policy of the
> destination file/dir parent. This rename operation is just a metadata change
> in the Namenode; the physical blocks still remain under the source storage
> policy.
> So, tracking all such business-logic-based file names across distributed
> nodes (ex: region servers) and running the Mover tool could be difficult for
> admins.
> The proposal here is to provide an API in the Namenode itself to trigger
> storage policy satisfaction. A daemon thread inside the Namenode would track
> such calls and send movement commands to the DNs.
> Will post the detailed design thoughts document soon.