[
https://issues.apache.org/jira/browse/HDFS-6581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106575#comment-14106575
]
Arpit Agarwal commented on HDFS-6581:
-------------------------------------
bq. So I would say, tmpfs is always worse for us. Swapping is just not
something we ever want, and memory limits are something we enforce ourselves,
so tmpfs's features don't help us.
We're in agreement on the relative merits of ramfs vs tmpfs, except that I am
assuming performance-sensitive deployments will run with swap disabled, which
negates the disadvantages of tmpfs. However, this is a decision that can be
left to the administrator and does not affect the feature design.
[~andrew.wang], responses to your questions below.
{quote}
Related to Colin's point about configuring separate pools of memory on the DN,
I'd really like to see integration with the cache pools from HDFS-4949. Memory
is ideally shareable between HDFS and YARN, and cache pools were designed with
that in mind. Simple storage quotas do not fit as well.
Quotas are also a very rigid policy and can result in under-utilization. Cache
pools are more flexible, and can be extended to support fair share and more
complex policies. Avoiding underutilization seems especially important for a
limited resource like memory.
{quote}
For now, existing diskspace quota checks will apply on block allocation. We
cannot skip this check since the blocks are expected to be written to disk in
short order. I agree that uniting the RAM disk size and
{{dfs.datanode.max.locked.memory}} configurations is desirable. Since tmpfs
grows dynamically, one approach is for the DN to limit \[RAM disk +
locked memory\] usage to the configured value. The recommendation to administrators
could be that they set the RAM disk size to the same value as
{{dfs.datanode.max.locked.memory}}. This also allows preferential eviction from
either the cache or tmpfs, as desired, to keep the total locked memory usage within
the limit. I'll need to think this through, but I will file a sub-task in the meantime.
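To make the combined limit concrete, here is a rough sketch (hypothetical class and field names, not actual DN code) of what a shared \[RAM disk + locked memory\] budget capped at {{dfs.datanode.max.locked.memory}} could look like:
{code:java}
// Hypothetical sketch only: enforce a single memory budget covering both
// lazy-persist (RAM disk) replicas and mlocked cache blocks, capped at
// dfs.datanode.max.locked.memory.
public class MemoryBudget {
  private final long maxLockedBytes;   // dfs.datanode.max.locked.memory
  private long ramDiskBytes;           // bytes of replicas currently on the RAM disk
  private long cacheLockedBytes;       // bytes mlocked by the HDFS-4949 cache

  public MemoryBudget(long maxLockedBytes) {
    this.maxLockedBytes = maxLockedBytes;
  }

  // Called before placing a new replica on the RAM disk.
  public synchronized boolean tryReserveRamDisk(long blockSize) {
    if (ramDiskBytes + cacheLockedBytes + blockSize > maxLockedBytes) {
      return false;  // caller falls back to a DISK volume, or evicts first
    }
    ramDiskBytes += blockSize;
    return true;
  }

  // Called once a replica has been lazily persisted and removed from the RAM disk.
  public synchronized void releaseRamDisk(long blockSize) {
    ramDiskBytes -= blockSize;
  }
}
{code}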
bq. Do you have any benchmarks? For the read side, we found checksum overhead
to be substantial, essentially the cost of a copy. If we use tmpfs, it can
swap, so we're forced to calculate checksums at both write and read time. My
guess is also that a normal 1-replication write will be fairly fast because of
the OS buffer cache, so it'd be nice to quantify the potential improvement.
tmpfs has become somewhat of a diversion. Let's assume the administrator
configures either ramfs or tmpfs with swap disabled (our implementation doesn't
care) so we don't have extra checksum generation beyond what we do today. I
would _really_ like to remove even the existing checksum calculation from the
write path for replicas that are being written to memory, and have the DN compute
checksums when it 'lazy persists' them to disk. I spent way more time looking into
this than I wanted to, and it is hard to do cleanly with the way the write pipeline
is set up today; I can explain the details if you are curious. I am wary of making
significant changes to the write pipeline here, but this is the first
optimization I want to address after the initial implementation.
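For illustration only, here is a minimal, self-contained sketch of the deferred-checksum idea: compute per-chunk checksums while the replica is lazily persisted, rather than in the write pipeline. It uses {{java.util.zip.CRC32}} and an invented meta-file layout purely to keep the example small; HDFS itself uses {{DataChecksum}} (CRC32C by default, 512 bytes per checksum) and its own block meta file format:
{code:java}
// Sketch only: checksum computation moved off the write path and into the
// lazy-persist copy from the RAM disk to a disk volume.
import java.io.*;
import java.util.zip.CRC32;

public class LazyPersistSketch {
  static final int BYTES_PER_CHECKSUM = 512;   // dfs.bytes-per-checksum default

  public static void persistWithChecksums(File ramDiskBlock, File diskBlock,
                                           File metaFile) throws IOException {
    byte[] chunk = new byte[BYTES_PER_CHECKSUM];
    CRC32 crc = new CRC32();
    try (InputStream in = new FileInputStream(ramDiskBlock);
         OutputStream out = new BufferedOutputStream(new FileOutputStream(diskBlock));
         DataOutputStream meta = new DataOutputStream(
             new BufferedOutputStream(new FileOutputStream(metaFile)))) {
      int n;
      while ((n = in.read(chunk)) > 0) {
        crc.reset();
        crc.update(chunk, 0, n);
        meta.writeInt((int) crc.getValue());   // checksum computed here, not in the pipeline
        out.write(chunk, 0, n);
      }
    }
  }
}
{code}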
bq. There's a mention of LAZY_PERSIST having a config option to unlink corrupt
TMP files. It seems better for this to be per-file rather than NN-wide, since
different clients might want different behavior.
That's a good idea, perhaps via an additional per-file flag. Can we keep the
system-wide option for the initial implementation and change it going forward?
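As a sketch of how per-file behavior could be surfaced to clients, assuming the {{EnumSet<CreateFlag>}} overload of {{FileSystem#create}} and a {{LAZY_PERSIST}}-style create flag (shown here as assumptions for illustration; the per-file 'unlink corrupt replica' flag itself is hypothetical and appears only as a comment):
{code:java}
import java.io.IOException;
import java.util.EnumSet;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class LazyPersistClientSketch {
  /**
   * Create a single-replica file whose blocks the DN may keep in memory.
   * A hypothetical additional flag could opt this particular file into
   * "unlink corrupt replicas" behavior instead of an NN-wide setting.
   */
  public static FSDataOutputStream createLazyPersist(FileSystem fs, Path path)
      throws IOException {
    EnumSet<CreateFlag> flags = EnumSet.of(CreateFlag.CREATE, CreateFlag.LAZY_PERSIST);
    return fs.create(path, FsPermission.getFileDefault(), flags,
        fs.getConf().getInt("io.file.buffer.size", 4096),
        (short) 1,                       // single replica
        fs.getDefaultBlockSize(path),    // default block size for this path
        null);                           // no Progressable
  }
}
{code}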
bq. 5.2.2 lists a con of mmaped files as not having control over page
writeback. Is this actually true when using mlock? Also not sure why memory
pressure is worse with mmaped files compared to tmpfs. mmap might make
eviction+SCR nicer too, since you can just drop the mlocks if you want to
evict, and the client has a hope of falling back gracefully.
Memory pressure is worse with mmapped files because we cannot control when the
pages will actually be freed: blocks can be unmapped faster than the memory
manager can write their dirty pages back to disk, so unmapping does not reclaim
memory as quickly as we evict. tmpfs has better characteristics: once we hit
the configured limit we can simply stop allocating new blocks in memory. A
related optimization I'd really like to have is to use unbuffered IO when
writing block files to disk so we don't churn the buffer cache.
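Continuing the earlier hypothetical {{MemoryBudget}} sketch, the tmpfs behavior maps to a simple placement decision: if the budget reservation fails, the replica goes straight to disk instead of forcing eviction through page writeback.
{code:java}
// Hypothetical continuation of the MemoryBudget sketch above.
public class PlacementSketch {
  private final MemoryBudget budget;    // budget from the earlier sketch

  public PlacementSketch(MemoryBudget budget) {
    this.budget = budget;
  }

  /** Returns true if the new replica should be written to the RAM disk. */
  public boolean placeInMemory(long blockSize) {
    // tryReserveRamDisk fails once [RAM disk + locked memory] hits the limit;
    // the caller then writes the replica to a DISK volume instead.
    return budget.tryReserveRamDisk(blockSize);
  }
}
{code}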
{quote}
Caveat, I'm not sure what the HSM APIs will look like, or how this will be
integrated, so some of these might be out of scope.
Will we support changing a file from DISK storage type to TMP storage type? I
would say no, since cache directives seem better for read caching when
something is already on disk.
Will we support writing a file on both TMP and another storage type? Similar to
the above, it also doesn't feel that useful.
{quote}
We are not setting the storage type on a file. HSM API work (HDFS-5682) has
been getting pushed out, most recently in favor of memory storage, but I'd like
to revisit it post-2.6. For now there is no dependence on HSM APIs and no concept
of a storage type on a file. CCM remains the preferred approach for reads, so no
change there.
Thank you for reading the doc and providing feedback.
> Write to single replica in memory
> ---------------------------------
>
> Key: HDFS-6581
> URL: https://issues.apache.org/jira/browse/HDFS-6581
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Arpit Agarwal
> Assignee: Arpit Agarwal
> Attachments: HDFSWriteableReplicasInMemory.pdf
>
>
> Per discussion with the community on HDFS-5851, we will implement writing to
> a single replica in DN memory via DataTransferProtocol.
> This avoids some of the issues with short-circuit writes, which we can
> revisit at a later time.
--
This message was sent by Atlassian JIRA
(v6.2#6252)