[
https://issues.apache.org/jira/browse/HDFS-7846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355701#comment-14355701
]
Colin Patrick McCabe commented on HDFS-7846:
--------------------------------------------
bq. Yi asked: The original BlockInfo includes triplets, which let us easily
get the blocks belonging to some DatanodeStorage. In this patch it seems not.
Will we use a different approach in other JIRAs? Maybe I need to wait for HDFS-7848.
That's a good question. I've been looking into whether we actually need to
support iterating over all the blocks on a given datanode. I think that this
requirement can be removed for the full block report. We can simply increment
the epoch each time we get an FBR, and ignore blocks which belong to old
epochs. In the background, we can have a scanner thread that collects the
garbage (shrinks block infos or removes block infos that are unreferenced).
This frees up some memory which we would otherwise need to use on the implicit
linked lists. It also makes insertion and deletion easier and slightly faster.
The places where we still need some kind of full iteration are decommissioning
and running the balancer. However, just scanning the hash tables should be
good enough in those use cases. They are rare operations, and they require
moving blocks around the cluster, which takes much more time than iterating
over an in-memory hash table, even when there are a lot of non-relevant slots.
We rate-limit decommissioning anyway, and the limit is low enough that the
extra overhead of full table iteration is not important.
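To sketch the epoch idea concretely, here is roughly what I have in mind. All
of the names below are made up for illustration (not the actual patch), it
models a single storage's reports, and it ignores the off-heap encoding and
locking entirely:
{code:java}
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Rough sketch only: on-heap stand-ins for the off-heap structures.
class EpochBlockMapSketch {
  static class BlockEntry {
    long epoch;  // epoch of the last full block report that mentioned this block
  }

  private final Map<Long, BlockEntry> blocks = new HashMap<Long, BlockEntry>();
  private long currentEpoch = 0;

  // Each full block report bumps the epoch and stamps every reported block.
  void processFullBlockReport(long[] reportedBlockIds) {
    long epoch = ++currentEpoch;
    for (long blockId : reportedBlockIds) {
      BlockEntry e = blocks.get(blockId);
      if (e == null) {
        e = new BlockEntry();
        blocks.put(blockId, e);
      }
      e.epoch = epoch;
    }
  }

  // The background scanner drops entries whose epoch is older than the
  // latest report, instead of walking the implicit per-datanode linked
  // lists (the triplets) that the on-heap BlockInfo needed.
  void collectGarbage() {
    Iterator<BlockEntry> it = blocks.values().iterator();
    while (it.hasNext()) {
      if (it.next().epoch < currentEpoch) {
        it.remove();
      }
    }
  }
}
{code}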
bq. 1. In BlockMap#Builder#build, duplicate check for this.mman != null
Fixed, thanks
bq. 2. Please add InterfaceAudience and InterfaceStability to public class
ok
bq. 3. unnecessary imports in newly added class files
fixed
bq. Another comment is about block replicas: the replication factor of a file
can be changed, and then we will re-allocate off-heap memory for the blocks of
the file? A better way would be to allocate separate off-heap memory for the
replicas of a block, so then we just need to change that part?
This data structure is optimized for minimizing memory consumption, rather than
minimizing copying of memory when things change. I think that's the correct
tradeoff to make, since the replication factor of a file doesn't change that
often. We can also allocate space based on the desired number of replicas,
rather than the actual number. So we don't need to re-allocate when one
replica is lost... we only re-allocate when the replication factor itself is
changed.
To make this more concrete, keep in mind that with separate allocations, we'd
have something like this:
fixed size block info -> replica info R1 -> replica info R2 -> replica info R3
Even with a singly (not doubly) linked list, that's still an extra 8 bytes in
the fixed size block info for the pointer to R1, and an extra 8 bytes in each R
for the pointer to the next. It will also tend to fragment the heap a lot more
because we'll be making a lot of small allocations (R1, R2, and R3 will each be
around 24 bytes).
In contrast, having a variable-sized block info saves at least 32 bytes because
we have no pointers. It has better cache locality because we're not hopping
all over memory to read the replica list. And copying is only needed when the
replication factor is changed by {{setReplication}} or similar. There is less
fragmentation because we make fewer, larger calls to {{malloc}}.
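As a rough sketch of what that contiguous, variable-sized layout looks like
(the field names and sizes here are made up for illustration, not the actual
encoding in the patch):
{code:java}
import java.nio.ByteBuffer;

// Illustration only: one contiguous record per block, sized by the
// *desired* replication, with no per-replica objects and no next-pointers.
class VariableSizedBlockRecordSketch {
  // header: blockId (8) + genStamp (8) + numBytes (8) + desiredReplication (2)
  private static final int HEADER_SIZE = 8 + 8 + 8 + 2;
  private static final int REPLICA_SLOT_SIZE = 8;  // e.g. one storage ID per slot

  static int recordSize(short desiredReplication) {
    return HEADER_SIZE + desiredReplication * REPLICA_SLOT_SIZE;
  }

  // One allocation per block.  Losing or gaining a replica just rewrites a
  // slot in place; no re-allocation is needed.
  static ByteBuffer allocate(long blockId, long genStamp, long numBytes,
                             short desiredReplication) {
    ByteBuffer buf = ByteBuffer.allocateDirect(recordSize(desiredReplication));
    buf.putLong(blockId).putLong(genStamp).putLong(numBytes)
       .putShort(desiredReplication);
    return buf;
  }

  // Only a change to the replication factor (setReplication) needs a copy.
  static ByteBuffer setReplication(ByteBuffer old, short newDesiredReplication) {
    ByteBuffer buf = ByteBuffer.allocateDirect(recordSize(newDesiredReplication));
    for (int i = 0; i < HEADER_SIZE - 2; i++) {
      buf.put(old.get(i));  // copy blockId, genStamp, numBytes
    }
    buf.putShort(newDesiredReplication);
    // replica slots would be re-filled from the old record here
    return buf;
  }
}
{code}
With replication 3, that's a single ~50-byte record rather than a fixed-size
block info plus three ~24-byte replica objects linked together by pointers.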
bq. charles wrote: Yi mentioned unused imports, but there are also unnecessary
java.lang.{String, ClassCastException}
thanks, removed
bq. BlockId.equals: constructing a ClassCastException, and especially the
resulting call to fillInStackTrace, is an expensive way of checking the type. I
would think instanceof is preferred.
This code path is never expected to actually be taken, though. You shouldn't
be comparing BlockId objects with non-BlockId objects. If Java 1.0 had been
designed with generics, equals would probably be {{equals(T t) where T extends
MyType}}. Anyway, I used the standard Java interface because creating my own
version of equals seemed excessive.
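For reference, the {{instanceof}} form being suggested would look roughly like
this (the fields here are illustrative, not necessarily what the patch has):
{code:java}
// Illustration only: a mismatched type just returns false, with no
// ClassCastException and no fillInStackTrace.
class BlockIdSketch {
  private final long id;

  BlockIdSketch(long id) {
    this.id = id;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof BlockIdSketch)) {
      return false;
    }
    return id == ((BlockIdSketch) o).id;
  }

  @Override
  public int hashCode() {
    return (int) (id ^ (id >>> 32));
  }
}
{code}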
bq. Are you planning on doing something with Shard.name in the future?
It will be used for logging in the future.
bq. The indentation of the assignment to htable is off a bit.
It seems OK?
bq. Jenkins will ask you this question, but why no unit tests?
Fair question. I'm trying to write this as incrementally as I can, and it
seemed like unit tests could be done in a follow-on. The bulk of the work will
be integrating this into the main NN code.
bq. Oh, I forgot to mention there are three places where git apply flags the
patch for adding trailing whitespace.
ok
> Create off-heap BlocksMap and BlockData structures
> --------------------------------------------------
>
> Key: HDFS-7846
> URL: https://issues.apache.org/jira/browse/HDFS-7846
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Affects Versions: HDFS-7836
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-7846-scl.001.patch
>
>
> Create off-heap BlocksMap, BlockInfo, and DataNodeInfo structures. The
> BlocksMap will use the off-heap hash table.