[
https://issues.apache.org/jira/browse/HDFS-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177280#comment-14177280
]
Colin Patrick McCabe commented on HDFS-7244:
--------------------------------------------
It's exciting to see progress on this, [~langera]!
There are a few questions we need to figure out here. One is fallback... if
{{ByteBuffer#allocDirect}} is not available on the JVM, what do we do? In my
earlier patch, I simply used {{ByteBuffer#alloc}}. I still like this approach,
but it does mean we can't chase raw pointers when implementing off-heap data
structures. I was trying to address this by using \{ 32-bit slab ID, 32-bit
slab offset \} tuples instead. This does require that we do a lookup in a
{{map<Long, Slab>}} whenever we chase a "pointer", though.
Another approach to fallback is to use raw pointers if they're available, and
\{ slabID, offset\} tuples if they're not. This is faster for the common case
of true off-heaping. The complication here is that theoretically one
{{allocDirect}} calls could fail while another succeeds. If we did this, we'd
probably want to create a configuration key like {{hadoop.use.off.heap}}, and
throw a hard failure whenever this was {{true}} but {{allocDirect}} failed.
What data structures are you planning on using to look up block data in the NN?
I was considering an off-heap hash map implementation.
If you look at the requirements for our BlocksMap, we need:
* fast lookup of \{ 64-bit blockID, string bpId \} to yield all DNs where this
block is replicated
* ability to iterate over all blocks which a DN holds
#1 is not too difficult, but #2 could be tricky. The obvious solution is just
to have a hash map from \{ blockID, bpID \} to a node structure which is a
member of a few implicit linked lists. This does mean the node structure has
variable size, which could be challenging to implement (It's basically the
{{malloc}} problem). There isn't any upper limit on the number of DNs a block
can be on.
A better way might be to have a hash map from \{ blockID, bpID, replicaIndex \}
so that we avoid implicit linked lists. So to find the first replica for
BlockID 123 in bpID "foo", you look up (123, foo, 0)... the second, (123, foo,
1), and so forth.
This also raises a few questions.
* should we create a lookup table for bpids? We clearly don't want to store
the string everywhere, and we can't use Java string interning when doing
off-heap. A 16-bit or 32-bit lookup table from string bpid -> bpid index would
certainly slim this down.
* similar for DNs... how do we identify them? The storage ID is too long to be
practical. The simplest way would be a 64-bit ID where we didn't reuse any
indices. If we have 32-bit or less DN IDs we'll have to figure out some
garbage collection strategy, which could be tricky.
Do you think we'll need a branch for this? I don't have a feeling yet for how
incremental it is. Clearly adding the Slab code can be done in trunk without
destabilizing anything else. I'm not as clear on how difficult the other
subtasks are going to be to do in an "incremental" way.
Do you have some code using the Slab code yet? It might be hard to know
exactly what API we want for Slab until we see how it works in action. Of
course we can always modify it later, but posting a combined patch would give
me a better feel for it.
> Reduce Namenode memory using Flyweight pattern
> ----------------------------------------------
>
> Key: HDFS-7244
> URL: https://issues.apache.org/jira/browse/HDFS-7244
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: Amir Langer
>
> Using the flyweight pattern can dramatically reduce memory usage in the
> Namenode. The pattern also abstracts the actual storage type and allows the
> decision of whether it is off-heap or not and what is the serialisation
> mechanism to be configured per deployment.
> The idea is to move all BlockInfo data (as a first step) to this storage
> using the Flyweight pattern. The cost to doing it will be in higher latency
> when accessing/modifying a block. The idea is that this will be offset with a
> reduction in memory and in the case of off-heap, a dramatic reduction in
> memory (effectively, memory used for BlockInfo would reduce to a very small
> constant value).
> This reduction will also have an huge impact on the latency as GC pauses will
> be reduced considerably and may even end up with better latency results than
> the original code.
> I wrote a stand-alone project as a proof of concept, to show the pattern, the
> data structure we can use and what will be the performance costs of this
> approach.
> see [Slab|https://github.com/langera/slab]
> and [Slab performance
> results|https://github.com/langera/slab/wiki/Performance-Results].
> Slab abstracts the storage, gives several storage implementations and
> implements the flyweight pattern for the application (Namenode in our case).
> The stages to incorporate Slab into the Namenode is outlined in the sub-tasks
> JIRAs.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)