[ https://issues.apache.org/jira/browse/HDFS-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345572#comment-14345572 ]
Todd Lipcon edited comment on HDFS-6658 at 3/3/15 7:26 PM:
-----------------------------------------------------------
{quote}
1 the balancer
2 datanode manager start/stop of decommission
3 decommission manager scans
4 removal of failed storages
5 removal of a dead node (i.e. all its storages)
Balancing and decommissioning already have abysmal effects on performance. We
cannot afford for either to be any worse.
{quote}
Lemme play devil's advocate for a minute (it's always easier to solve these
problems in a JIRA comment than in the actual code, so I'll defer to your
judgment here, given I've been pretty inactive for the last year or so):
*Balancer*: the main operation it needs to do is "give me some blocks from
storage X which total at least Y bytes", right? Given that, we don't need to
scan the entirety of the blocksmap. We can scan the blocksmap in hash order
until we accumulate the right number of bytes. Even though we have to "skip
over" blocks from the wrong storage, it's still a single linear scan whose cost
is bounded by the total number of blocks. Plus, the balancer is typically
removing blocks from many DNs at the same time (the overloaded ones), and the
scan work can be combined pretty easily. I haven't thought about it quite
enough, but my intuition is that, in a typical balancer run, a substantial
fraction of the DNs are acting as "source" nodes, so the combined scan may end
up about as efficient as several separate DN-specific scans.
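To make that concrete, here's a rough sketch of what a combined hash-order scan
could look like. The BlockLike interface and the pickBlocks/getStorageIds names
are invented stand-ins for illustration, not the actual BlockManager or Balancer
API:
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative stand-in for an entry in the blocks map; not the real BlockInfo API. */
interface BlockLike {
  long getNumBytes();
  List<String> getStorageIds();   // ids of the storages holding a replica of this block
}

class CombinedBalancerScan {
  /**
   * One hash-order pass over the blocks map that serves every overloaded source
   * storage at once, instead of one full scan per source.
   */
  static Map<String, List<BlockLike>> pickBlocks(
      Iterable<BlockLike> blocksMap,          // iterated in its natural (hash) order
      Map<String, Long> bytesStillNeeded) {   // storageId -> bytes left to move off it
    Map<String, List<BlockLike>> picked = new HashMap<>();
    for (BlockLike block : blocksMap) {       // single linear scan
      if (bytesStillNeeded.isEmpty()) {
        break;                                // every source already has its Y bytes
      }
      for (String storageId : block.getStorageIds()) {
        Long needed = bytesStillNeeded.get(storageId);
        if (needed == null) {
          continue;                           // this replica is not on an overloaded source
        }
        picked.computeIfAbsent(storageId, s -> new ArrayList<>()).add(block);
        if (needed <= block.getNumBytes()) {
          bytesStillNeeded.remove(storageId); // done with this source
        } else {
          bytesStillNeeded.put(storageId, needed - block.getNumBytes());
        }
        break;                                // charge each block to one source only
      }
    }
    return picked;
  }
}
{code}
Each block gets charged to at most one source, so a single pass can keep
feeding every source that still needs bytes.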
*Decom*: while we don't want decom to affect other NN operations, it doesn't
seem like the actual latency of the decom work on the NN is particularly
important, right? So long as the work can be chunked up or done concurrently with
other work, it's OK if the "start decom" process takes several seconds to scan
the entire block map. Again, this work could be batched across all nodes that
need decommissioning.
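A minimal sketch of what a chunked, batched "start decom" scan might look like,
reusing the BlockLike stand-in from the sketch above; the chunk size, lock
handling, and markNeededReplication helper are invented for illustration, not
the real DecommissionManager/FSNamesystem code:
{code:java}
import java.util.Iterator;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Illustrative chunked decom scan: one pass serves every node being
 * decommissioned, and the namesystem lock is released between chunks so other
 * NN operations can run in the meantime.
 */
class ChunkedDecomScan {
  private static final int BLOCKS_PER_CHUNK = 100_000; // tune so each chunk holds the lock briefly

  static void scheduleReplication(Iterator<BlockLike> blocksMap,
                                  Set<String> decommissioningStorages,
                                  ReentrantReadWriteLock namesystemLock) {
    while (blocksMap.hasNext()) {
      namesystemLock.writeLock().lock();       // hold the lock for one chunk at a time
      try {
        for (int i = 0; i < BLOCKS_PER_CHUNK && blocksMap.hasNext(); i++) {
          BlockLike block = blocksMap.next();
          for (String storageId : block.getStorageIds()) {
            if (decommissioningStorages.contains(storageId)) {
              markNeededReplication(block);    // queue re-replication off the decommissioning node
              break;
            }
          }
        }
      } finally {
        namesystemLock.writeLock().unlock();   // let other NN operations in between chunks
      }
      // a real implementation would have to tolerate blocks added/removed between chunks
    }
  }

  private static void markNeededReplication(BlockLike block) {
    // placeholder for whatever the decommission manager would actually do here
  }
}
{code}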
Curious if you did any experiments with this, or just back-of-the-envelope math?
FWIW I definitely agree this work could be done separately - didn't mean to
derail your JIRA, just figured this would be an appropriate place to have the
discussion.
> Namenode memory optimization - Block replicas list
> ---------------------------------------------------
>
> Key: HDFS-6658
> URL: https://issues.apache.org/jira/browse/HDFS-6658
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 2.4.1
> Reporter: Amir Langer
> Assignee: Daryn Sharp
> Attachments: BlockListOptimizationComparison.xlsx, BlocksMap redesign.pdf,
> HDFS-6658.patch, Namenode Memory Optimizations - Block replicas list.docx
>
>
> Part of the memory consumed by every BlockInfo object in the Namenode is a
> linked list of block references for every DatanodeStorageInfo (called
> "triplets").
> We propose to change the way we store the list in memory.
> Using primitive integer indexes instead of object references will reduce the
> memory needed for every block replica (when compressed oops is disabled) and
> in our new design the list overhead will be per DatanodeStorageInfo and not
> per block replica.
> See the attached design doc for details and evaluation results.
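For illustration only, a bare-bones sketch of the kind of per-storage,
primitive-index replica list the description above is getting at; the names are
invented here, and the actual layout is the one in the attached design doc:
{code:java}
import java.util.Arrays;

/**
 * Illustrative layout: each DatanodeStorageInfo keeps its replica list as a
 * growable primitive int array of block indexes, so there are no per-replica
 * object references and the list overhead is per storage, not per block replica.
 */
class StorageReplicaList {
  private int[] blockIndexes = new int[64];   // indexes into a global block table
  private int size = 0;

  void add(int blockIndex) {
    if (size == blockIndexes.length) {
      blockIndexes = Arrays.copyOf(blockIndexes, size * 2);
    }
    blockIndexes[size++] = blockIndex;
  }

  int get(int i) {
    return blockIndexes[i];
  }

  int size() {
    return size;
  }
}
{code}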