[
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347472#comment-16347472
]
Rakesh R edited comment on HDFS-10285 at 2/1/18 1:52 PM:
---------------------------------------------------------
Thank you very much [~daryn] for your time and the useful comments/thoughts. My
replies follow; please take a look.
+Comment-1)+
{quote}BlockManager
Shouldn’t spsMode be volatile? Although I question why it’s here.
{quote}
[Rakesh's reply] Agreed, will do the changes.
+Comment-2)+
{quote}Adding SPS methods to this class implies an unexpected coupling of the
SPS service to the block manager. Please move them out to prove it’s not
tightly coupled.
{quote}
[Rakesh's reply] Agreed. We are planning to create
{{StoragePolicySatisfyManager}} and keep all the related APIs there.
+Comment-3)+
{quote}BPServiceActor
Is it actually sending back the moved blocks? Aren’t IBRs sufficient?
BlockStorageMovementCommand/BlocksStorageMoveAttemptFinished
Again, not sure that a new DN command is necessary, and why does it
specifically report back successful moves instead of relying on IBRs? I would
actually expect the DN to be completely ignorant of a SPS move vs any other
move.
{quote}
[Rakesh's reply] We have explored the IBR approach and the required code
changes. If SPS relied on IBRs, it would need an *extra* check to know whether
a new block arrived due to an SPS move or something else, and this check would
run quite often, since other ops are far more frequent than SPS block moves.
Currently the DN sends back the {{blksMovementsFinished}} list separately, so
each finished block movement can be easily/quickly recognized by the Satisfier
on the NN side, which then updates its tracking details. If you agree this
*extra* check is not an issue, we would be happy to implement the IBR approach.
Secondly, BlockStorageMovementCommand was added to carry the block vs
src/target pairs, which is needed for the move operation and for returning the
result. On second thought, if we change the response flow to IBRs, then I
understand we could use the block transfer flow. Following is my analysis of
that:
- I can see that DNA_TRANSFER is used to send a copy of a block to another
datanode. In our case we also need to support a {{local_move}} and would need
to incorporate this {{local_move}} into the block transfer logic; I hope that
is fine?
- Secondly, we presently use {{replaceBlock}}, which has an additional
{{delHint}} notification mechanism to the NN. IIUC, {{transfer block}} doesn't
have this hint, so the namenode would delete the over-replicated block
eventually, on the next block report. Any thoughts?
{code:java}
// notify name node
final Replica r = blockReceiver.getReplica();
datanode.notifyNamenodeReceivedBlock(
    block, delHint, r.getStorageUuid(), r.isOnTransientStorage());
LOG.info("Moved " + block + " from " + peer.getRemoteAddressString()
    + ", delHint=" + delHint);
{code}
- So, internal SPS would use {{transfer block}} and external SPS would use
{{replace block}}; am I missing anything?
+Comment-4)+
{quote}DataNode
Why isn’t this just a block transfer? How is transferring between DNs any
different than across storages?
{quote}
[Rakesh's reply] Please see my reply to {{Comment-3}}; I incorporated this case
there. Thanks!
+Comment-5)+
{quote}DatanodeDescriptor
Why use a synchronized linked list to offer/poll instead of BlockingQueue?
{quote}
[Rakesh's reply] Agreed, will do the changes; a quick sketch of the change is
below.
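A minimal sketch of the replacement, with illustrative names (the real field
lives in DatanodeDescriptor, and the element type here is arbitrary):
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BlockMoveTaskQueue {
  // BlockingQueue is internally thread-safe, so offer/poll need no external
  // synchronized blocks, unlike the current synchronized LinkedList.
  private final BlockingQueue<Runnable> pendingMoveTasks =
      new LinkedBlockingQueue<>();

  /** Enqueue a task; never blocks on an unbounded queue. */
  public boolean offer(Runnable task) {
    return pendingMoveTasks.offer(task);
  }

  /** Dequeue the next task, or null when none is pending. */
  public Runnable poll() {
    return pendingMoveTasks.poll();
  }
}
{code}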
+Comment-6)+
{quote}DatanodeManager
I know it’s configurable, but realistically, when would you ever want to give
storage movement tasks equal footing with under-replication? Is there really a
use case for not valuing durability?
{quote}
[Rakesh's reply] We don't have any particular use case, though. One scenario we
considered: a user configures SSDs and they fill up quickly; in that case,
cleanup could be treated as high priority. If you feel this is not a real case,
I'm OK with removing this config so that SPS always uses only the remaining
slots.
+Comment-7)+
{quote}Adding getDatanodeStorageReport is concerning. getDatanodeListForReport
is already a very bad method that should be avoided for anything but jmx – even
then it’s a concern. I eliminated calls to it years ago. All it takes is a
nscd/dns hiccup and you’re left holding the fsn lock for an excessive length of
time. Beyond that, the response is going to be pretty large and tagging all the
storage reports is not going to be cheap.
verifyTargetDatanodeHasSpaceForScheduling does it really need the namesystem
lock? Can’t DatanodeDescriptor#chooseStorage4Block synchronize on its
storageMap?
Appears to be calling getLiveDatanodeStorageReport for every file. As mentioned
earlier, this is NOT cheap. The SPS should be able to operate on a fuzzy/cached
state of the world. Then it gets another datanode report to determine the
number of live nodes to decide if it should sleep before processing the next
path. The number of nodes from the prior cached view of the world should
suffice.
{quote}
[Rakesh's reply] Good point. Some time back Uma and I thought about the caching
part. We depend on this API for the datanode storage types and remaining-space
details. I think it requires two different mechanisms for internal and external
SPS. For internal SPS, how about referring directly to
{{DatanodeManager#datanodeMap}} for every file? For external SPS, IIUC you are
suggesting a cache mechanism. How about getting the storage report once and
caching it in ExternalContext? This local cache would be refreshed
periodically, say every 5 mins (just an arbitrary number; if you have a period
in mind, please suggest it): when getDatanodeStorageReport is called after
5 mins, the cache is treated as expired and fetched fresh; within 5 mins it is
served from the cache. Does this make sense to you? A sketch of the idea
follows below.
Another point we thought of: right now, checking whether a node has enough
space goes to the NN. With the above caching we could check the space
constraints against the cached report only, rather than going to the NN for
every DN space check. If you are fine with the above caching implementation, I
will incorporate this point as well.
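A hedged sketch of that external-SPS cache; the class name, the Supplier-based
fetcher, and the 5-min expiry are all illustrative assumptions, not the patch:
{code:java}
import java.util.function.Supplier;
import org.apache.hadoop.hdfs.server.protocol.DatanodeStorageReport;
import org.apache.hadoop.util.Time;

public class CachedStorageReports {
  private final Supplier<DatanodeStorageReport[]> fetcher; // the actual NN RPC
  private final long expiryMs;                             // e.g. 5 * 60 * 1000
  private DatanodeStorageReport[] cached;
  private long lastFetchMs = Long.MIN_VALUE;

  public CachedStorageReports(Supplier<DatanodeStorageReport[]> fetcher,
      long expiryMs) {
    this.fetcher = fetcher;
    this.expiryMs = expiryMs;
  }

  /** Serve from cache within the expiry window, otherwise refetch. */
  public synchronized DatanodeStorageReport[] getReports() {
    long now = Time.monotonicNow();
    if (cached == null || now - lastFetchMs >= expiryMs) {
      cached = fetcher.get(); // expired: fetch fresh from the NN
      lastFetchMs = now;
    }
    return cached;            // fresh enough: no NN round trip
  }
}
{code}
Space checks per DN could then read storage types and remaining space from the
same cached reports instead of making a trip to the NN for each file.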
+Comment-8)+
{quote}DFSUtil
DFSUtil.removeOverlapBetweenStorageTypes and {{DFSUtil.getSPSWorkMultiplier}}.
These aren’t generally useful methods so why are they in DFSUtil? Why aren’t
they in the only calling class StoragePolicySatisfier?
{quote}
[Rakesh's reply] Agreed, will do the changes.
+Comment-9)+
{quote}FSDirXAttrOp
Not fond of the SPS queue updates being edge triggered by xattr changes. Food
for thought: why add more rpc methods if the client can just twiddle the xattr?
{quote}
[Rakesh's reply] SPS depends on the xattr. On each user satisfy call it updates
the xattr and adds the inodeId to the {{spsPathIds}} queue. Again, on NN
restart it checks for the xattr key and, if it exists, adds the inodeId to the
{{spsPathIds}} queue. This in-memory queue is maintained for quick lookup and
is used for tracking (scheduling/completion/retry logic). With this flow, we
decided to place the {{spsPathIds}} entry addition alongside the xattr update
(sketched below). Please feel free to correct me if I missed anything.
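For clarity, a minimal sketch of that flow; the class and method names are
illustrative, not from the patch:
{code:java}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class SpsPathTracker {
  // The xattr is the durable marker; spsPathIds is the in-memory queue the
  // Satisfier polls for scheduling/completion/retry tracking.
  private final Queue<Long> spsPathIds = new ConcurrentLinkedQueue<>();

  /** On each user satisfy call: persist the xattr, then queue the inode id. */
  public void onSatisfyCall(long inodeId) {
    persistSpsXAttr(inodeId); // stand-in for the real FSDirXAttrOp update
    spsPathIds.add(inodeId);
  }

  /** On NN restart: re-queue every inode that still carries the SPS xattr. */
  public void onRestart(Iterable<Long> inodesWithSpsXAttr) {
    for (long id : inodesWithSpsXAttr) {
      spsPathIds.add(id);
    }
  }

  private void persistSpsXAttr(long inodeId) {
    // hypothetical hook; in the branch this happens alongside the xattr write
  }
}
{code}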
+Comment-10)+
{quote}FSNamesystem / NamenodeProtocolTranslatorPB
Most of the api changes appear unnecessary.
No need for the new getFileInfo when the existing getLocatedFileInfo appears to
do exactly what the new method does?
No need for the new getFilePath. It’s only used by
IntraSPSNameNodeContext#getFileInfo to get the path for an inode number,
followed by calling the new getFileInfo cited above.
IntraSPSNameNodeContext#getFileInfo swallows all IOEs, based on assumption that
any and all IOEs means FNF which probably isn’t the intention during rpc
exceptions.
Should be able to replace it all with
getLocatedFileInfo(“/.reserved/.inodes/XXX”, false) which avoids changing the
public apis.
{quote}
[Rakesh's reply] Adding a little background: to minimize memory usage, we
maintain a list of {{inodeIds}} rather than path objects, so we need to fetch
the file details by passing the {{inodeId}}, which is the reason for the new
getFilePath(inodeId). For getFileInfo, I've used the getLocatedFileInfo API in
{{ExternalSPSContext#getFileInfo}}. {{IntraSPSNameNodeContext}} keeps a
Namesystem reference, and the getFileInfo API does not exist on it. Can SPS
keep an FSNamesystem ref instead of the interface ref? We will also do some
more research on this part; a sketch of your suggestion as I read it is below.
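A hedged sketch of the suggested replacement, assuming the
getLocatedFileInfo(String, boolean) signature from the comment; the helper
name and the FNF handling are illustrative only:
{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.hdfs.DFSClient;
import org.apache.hadoop.hdfs.protocol.HdfsFileStatus;

public final class InodePathLookup {
  private InodePathLookup() {}

  /** Resolve file info for an inode id via the reserved-inodes path. */
  public static HdfsFileStatus getFileInfo(DFSClient client, long inodeId)
      throws IOException {
    final String reservedPath = "/.reserved/.inodes/" + inodeId;
    try {
      return client.getLocatedFileInfo(reservedPath, false);
    } catch (FileNotFoundException e) {
      return null; // only FNF means "gone"; other IOEs still propagate
    }
  }
}
{code}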
+Comment-11)+
{quote}HdfsServerConstants
The xattr is called user.hdfs.sps.xattr. Why does the xattr name actually
contain the word “xattr”?
{quote}
[Rakesh's reply] Sure, will remove “xattr” word.
+Comment-12)+
{quote}NameNode
Super trivial but using the plural pronoun “we” in this exception message is
odd. Changing the value isn’t a joint activity.
For enabling or disabling storage policy satisfier, we must pass either
none/internal/external string value only
{quote}
[Rakesh's reply] oops, sorry for the mistake. Will change it.
+Comment-13)+
{quote}StoragePolicySatisfier
It appears to make back-to-back calls to hasLowRedundancyBlocks and
getFileInfo for every file. Haven’t fully grokked the code, but if low
redundancy is not the common case, then it shouldn’t be called unless/until
needed. It looks like files that are under replicated are re-queued again?
{quote}
[Rakesh's reply] Yes, under-replicated files are queued up again. For external
SPS, I can think of another approach: iterate over all the blocks and check the
block location array length; if it is < the replication factor, mark the file
as under-replicated (see the sketch below). Does this make sense to you?
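A minimal sketch of that check (names illustrative; striped/EC files would need
different handling):
{code:java}
import org.apache.hadoop.hdfs.protocol.LocatedBlock;
import org.apache.hadoop.hdfs.protocol.LocatedBlocks;

public final class ReplicationCheck {
  private ReplicationCheck() {}

  /** True if any block has fewer locations than the expected replication. */
  public static boolean isUnderReplicated(LocatedBlocks blocks,
      short expectedReplication) {
    for (LocatedBlock lb : blocks.getLocatedBlocks()) {
      if (lb.getLocations().length < expectedReplication) {
        return true;
      }
    }
    return false;
  }
}
{code}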
+Comment-14)+
{quote}Appears to be calling getStoragePolicy for every file. It’s not like the
byte values change all the time, why call and cache all of them via
FSN#getStoragePolicies?
{quote}
[Rakesh's reply] Below is the current logic we use. Sorry, I failed to
understand the need to cache all the policies. Could you please elaborate a
bit? Thanks!
{code:java}
// ExternalSPSContext.java
@Override
public BlockStoragePolicy getStoragePolicy(byte policyId) {
  return createDefaultSuite.getPolicy(policyId);
}

// IntraSPSNameNodeContext.java
public BlockStoragePolicy getStoragePolicy(byte policyID) {
  return blockManager.getStoragePolicy(policyID);
}
{code}
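If the intent is simply to avoid the per-file lookup, here is a hedged sketch
of what caching all policies once might look like (illustrative names; priming
presumably via FSN#getStoragePolicies, as you mention):
{code:java}
import org.apache.hadoop.hdfs.protocol.BlockStoragePolicy;

public class CachedPolicyLookup {
  // Policy ids are bytes, so a 256-slot array covers every possible id.
  private final BlockStoragePolicy[] byId = new BlockStoragePolicy[256];

  /** Prime the cache once from the full policy list. */
  public CachedPolicyLookup(BlockStoragePolicy[] allPolicies) {
    for (BlockStoragePolicy p : allPolicies) {
      byId[p.getId() & 0xFF] = p;
    }
  }

  /** The per-file lookup becomes a local array index, no call into FSN/BM. */
  public BlockStoragePolicy getStoragePolicy(byte policyId) {
    return byId[policyId & 0xFF];
  }
}
{code}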
+Comment-15)+
{quote}FSDirSatisfyStoragePolicyOp
satisfyStoragePolicy errors if the xattr is already present. Should this be a
no-op? A client re-requesting a storage policy correction probably shouldn't
fail.
unprotectedSatisfyStoragePolicy is called prior to xattr updates, which calls
addSPSPathId. To avoid race conditions or inconsistent state if the xattr
fails, should call addSPSPathId after xattrs are successfully updated.
inodeHasSatisfyXAttr calls getXAttrFeature then immediately shorts out if the
inode isn't a file. Should do file check first to avoid unnecessary computation.
In general, not fond of unnecessary guava. Instead of
newArrayListWithCapacity + add, standard Arrays.asList(item) is more succinct.
{quote}
[Rakesh's reply] Agreed, will do the changes; a quick sketch for the guava
point is below.
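A tiny sketch of the suggested simplification (XAttr construction elided):
{code:java}
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.fs.XAttr;

public final class XAttrListExample {
  private XAttrListExample() {}

  // Arrays.asList returns a fixed-size list, which is fine here since the
  // single-element list is not mutated afterwards.
  public static List<XAttr> singleXAttrList(XAttr satisfyXAttr) {
    return Arrays.asList(satisfyXAttr);
  }
}
{code}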
+Comment-16)+
{quote}FSDirStatAndListOp
Not sure why javadoc was changed to add needLocation. It's already present and
now doubled up.
{quote}
[Rakesh's reply] Agreed, will correct it.
+Comment-17)+
{quote}// TODO: Needs to manage the number of concurrent moves per DataNode.
Shouldn't this be fixed now?
{quote}
[Rakesh's reply] I think we have throttling in place on the namenode side, so
less work reaches the datanode. By default
{{dfs.namenode.replication.max-streams-hard-limit}} is 2, so only that delta of
move tasks goes to a DN in one heartbeat interval (3 sec). We have skipped the
case of a very large hard-limit value, which is the reason for postponing this
task.
+Comment-18)+
{quote}DFS_MOVER_MOVERTHREADS_DEFAULT is 1000 per DN? If the DN is concurrently
doing 1000 moves, it's not in a good state, disk io is probably saturated, and
this will only make it much worse. 10 is probably more than sufficient.
{quote}
[Rakesh's reply] Agreed, will make it a smaller value, 10.
+Comment-19)+
{quote}The executor uses a CallerRunsPolicy? If the thread pool is exhausted,
WHY should the heartbeat thread process the operation? That's a terrible idea.
{quote}
[Rakesh's reply] Agreed, will remove the CallerRunsPolicy and keep it simple.
Does this make sense to you?
{code:java}
ThreadPoolExecutor moverThreadPool = new ThreadPoolExecutor(1, num, 60,
    TimeUnit.SECONDS, new SynchronousQueue<Runnable>(),
    new Daemon.DaemonFactory() {
      private final AtomicInteger threadIndex = new AtomicInteger(0);

      @Override
      public Thread newThread(Runnable r) {
        Thread t = super.newThread(r);
        t.setName("BlockMoverTask-" + threadIndex.getAndIncrement());
        return t;
      }
    });
{code}
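Note: with a SynchronousQueue and no explicit rejection handler, the pool above
falls back to ThreadPoolExecutor's default AbortPolicy, so once all {{num}}
threads are busy a new submit throws RejectedExecutionException instead of
running on the heartbeat thread; the submitting code would need to catch that
and retry or drop the task.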
+Comment-20)+
{quote}I was under the impression the DN did an optimized movement between its
storages? Does it actually make a connection to itself? If yes, when the DN
xceiver pool is very busy, the mover thread pool will exhaust and cause the
heartbeat thread to block.
{quote}
[Rakesh's reply] Good catch. Agreed, we can optimize by doing a local move and
will make the changes. FYI, initially we implemented C-DN to track all block
moves, but we later refactored the code and somehow missed the local-movement
logic.
+Comment-21)+
{quote}BlockStorageMovementTracker
Many data structures are riddled with non-threadsafe race conditions and risk
of CMEs.
Ex. The moverTaskFutures map. Adding new blocks and/or adding to a block's list
of futures is synchronized. However the run loop does an unsynchronized block
get, unsynchronized future remove, unsynchronized isEmpty, possibly another
unsynchronized get, only then does it do a synchronized remove of the block.
The whole chunk of code should be synchronized.
Is the problematic moverTaskFutures even needed? It's aggregating futures
per-block for seemingly no reason. Why track all the futures at all instead of
just relying on the completion service? As best I can tell:
It's only used to determine if a future from the completion service should be
ignored during shutdown. Shutdown sets the running boolean to false and clears
the entire datastructure so why not use the running boolean like a check just a
little further down?
As synchronization to sleep up to 2 seconds before performing a blocking
moverCompletionService.take, but only when it thinks there are no active
futures. I'll ignore the missed notify race that the bounded wait masks, but
the real question is why not just do the blocking take?
Why all the complexity? Am I missing something?
BlocksMovementsStatusHandler
Suffers same type of thread safety issues as StoragePolicySatisfyWorker. Ex.
blockIdVsMovementStatus is inconsistently synchronized. It does synchronize to
return an unmodifiable list, which sadly does nothing to protect the caller
from CME.
handle is iterating over a non-thread safe list.
{quote}
[Rakesh's reply] Agreed, will correct it and address the concerns; a sketch of
the simplified tracker loop is below.
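A minimal sketch of the simplification described above, assuming the per-block
future map is dropped entirely; names are illustrative, not the final patch:
{code:java}
import java.util.concurrent.CompletionService;
import java.util.concurrent.Future;

public class BlockMoveResultLoop implements Runnable {
  private final CompletionService<Boolean> moverCompletionService;
  private volatile boolean running = true;

  public BlockMoveResultLoop(CompletionService<Boolean> svc) {
    this.moverCompletionService = svc;
  }

  public void stop() {
    running = false;
  }

  @Override
  public void run() {
    while (running) {
      try {
        Future<Boolean> f = moverCompletionService.take(); // blocking take
        if (!running) {
          break;               // shutting down: ignore the result
        }
        handleResult(f.get()); // single consumer: no shared mutable map
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      } catch (Exception e) {
        // a failed move: log/handle it, keep the loop alive
      }
    }
  }

  private void handleResult(boolean moveSucceeded) {
    // stand-in for reporting the finished move back to the NN
  }
}
{code}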
+Comment-22)+
{quote}StoragePolicySatisfier
Should handleException use a double-checked lock to avoid synchronization?
Unexpected exceptions should be a rarity, right?
Speaking of which, it’s not safe to ignore all Throwable in the run loop! You
have no idea if data structures are in a sane or consistent state.
{quote}
[Rakesh's reply] Agreed, will do the changes; a quick sketch of the
double-checked pattern is below.
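A hedged sketch of the double-checked pattern as I understand the suggestion
(illustrative names):
{code:java}
public class SatisfierErrorState {
  private volatile boolean unexpectedError = false;
  private final Object lock = new Object();

  public void handleException(Throwable t) {
    if (!unexpectedError) {       // cheap unsynchronized check on the hot path
      synchronized (lock) {
        if (!unexpectedError) {   // re-check under the lock
          unexpectedError = true;
          // record/log the first unexpected failure here
        }
      }
    }
  }
}
{code}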
+Comment-23)+
{quote}FSTreeTraverser
{quote}
This SPS branch doesn't implement any new logic for tree traversal; it simply
uses the slice tree traversal logic implemented in the HDFS-10899 jira. We
attempted to generalize the existing code so that HDFS-10285, HDFS-12090 and
the EDEK feature can all make use of it. I'd appreciate it if you could add
your comments/thoughts to the improvement task HDFS-13095 to correct the
existing code in trunk; that would really help us avoid maintaining those
changes in this branch. If some bug specific to SPS comes up during the
discussion in that jira, we will surely fix it to improve the SPS logic.
{quote}traverseDir may NPE if it's traversing a tree in a snapshot and one of
the ancestors is deleted.
{quote}
Good catch; as per the analysis/discussion in HDFS-13095, this is a special
case that applies only to SPS, and we will fix it in SPS.
> Storage Policy Satisfier in Namenode
> ------------------------------------
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
> Affects Versions: HDFS-10285
> Reporter: Uma Maheswara Rao G
> Assignee: Uma Maheswara Rao G
> Priority: Major
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch,
> HDFS-10285-consolidated-merge-patch-01.patch,
> HDFS-10285-consolidated-merge-patch-02.patch,
> HDFS-10285-consolidated-merge-patch-03.patch,
> HDFS-10285-consolidated-merge-patch-04.patch,
> HDFS-10285-consolidated-merge-patch-05.patch,
> HDFS-SPS-TestReport-20170708.pdf, SPS Modularization.pdf,
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf,
> Storage-Policy-Satisfier-in-HDFS-May10.pdf,
> Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policy. These
> policies can be set on a directory/file to specify the user's preference for
> where the physical blocks should be stored. When the user sets the storage
> policy before writing data, the blocks can take advantage of the storage
> policy preferences and are stored accordingly.
> If the user sets the storage policy after writing and completing the file,
> the blocks would already have been written with the default storage policy
> (nothing but DISK). The user has to run the 'Mover tool' explicitly,
> specifying all such file names as a list. In some distributed-system
> scenarios (ex: HBase) it would be difficult to collect all the files and run
> the tool, as different nodes can write files separately and files can have
> different paths.
> Another scenario: when the user renames a file from a directory with one
> effective storage policy (inherited from the parent directory) into a
> directory with another storage policy, the inherited storage policy is not
> copied from the source; the file instead takes the storage policy of the
> destination file/dir parent. This rename operation is just a metadata change
> in the Namenode; the physical blocks still remain under the source storage
> policy.
> So, tracking all such business-logic-based file names across distributed
> nodes (ex: region servers) and running the Mover tool could be difficult for
> admins.
> The proposal here is to provide an API in the Namenode itself to trigger
> storage policy satisfaction. A daemon thread inside the Namenode would track
> such calls and send movement commands to the DNs.
> Will post the detailed design thoughts document soon.