[
https://issues.apache.org/jira/browse/HDFS-6581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106575#comment-14106575
]
Arpit Agarwal commented on HDFS-6581:
-------------------------------------
bq. So I would say, tmpfs is always worse for us. Swapping is just not
something we ever want, and memory limits are something we enforce ourselves,
so tmpfs's features don't help us.
We're in agreement on the relative merits of ramfs vs tmpfs, except that I am
assuming performance-sensitive deployments will run with swap disabled, which
negates the disadvantages of tmpfs. However, this is a decision that can be
left to the administrator and does not affect the feature design.
[~andrew.wang], responses to your questions below.
{quote}
Related to Colin's point about configuring separate pools of memory on the DN,
I'd really like to see integration with the cache pools from HDFS-4949. Memory
is ideally shareable between HDFS and YARN, and cache pools were designed with
that in mind. Simple storage quotas do not fit as well.
Quotas are also a very rigid policy and can result in under-utilization. Cache
pools are more flexible, and can be extended to support fair share and more
complex policies. Avoiding underutilization seems especially important for a
limited resource like memory.
{quote}
For now, existing diskspace quota checks will apply on block allocation. We
cannot skip this check since the blocks are expected to be written to disk in
short order. I agree that uniting the RAM disk size and
{{dfs.datanode.max.locked.memory}} configurations is desirable. Since tmpfs
grows dynamically, one approach is for the DN to limit \[RAM disk +
locked memory\] usage to the configured value. The recommendation to administrators
could be that they set the RAM disk size to the same value as
{{dfs.datanode.max.locked.memory}}. This also allows preferential eviction from
either the cache or tmpfs, as desired, to keep the total locked memory usage within
the limit. I'll need to think this through, but I will file a sub-task in the meantime.
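To make the combined limit concrete, here is a rough sketch (hypothetical class and field names, not actual DN code) of what a shared \[RAM disk + locked memory\] budget capped at {{dfs.datanode.max.locked.memory}} could look like:
{code:java}
// Hypothetical sketch only: enforce a single memory budget covering both
// lazy-persist (RAM disk) replicas and mlocked cache blocks, capped at
// dfs.datanode.max.locked.memory.
public class MemoryBudget {
  private final long maxLockedBytes;   // dfs.datanode.max.locked.memory
  private long ramDiskBytes;           // bytes of replicas currently on the RAM disk
  private long cacheLockedBytes;       // bytes mlocked by the HDFS-4949 cache

  public MemoryBudget(long maxLockedBytes) {
    this.maxLockedBytes = maxLockedBytes;
  }

  // Called before placing a new replica on the RAM disk.
  public synchronized boolean tryReserveRamDisk(long blockSize) {
    if (ramDiskBytes + cacheLockedBytes + blockSize > maxLockedBytes) {
      return false;  // caller falls back to a DISK volume, or evicts first
    }
    ramDiskBytes += blockSize;
    return true;
  }

  // Called once a replica has been lazily persisted and removed from the RAM disk.
  public synchronized void releaseRamDisk(long blockSize) {
    ramDiskBytes -= blockSize;
  }
}
{code}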
bq. Do you have any benchmarks? For the read side, we found checksum overhead
to be substantial, essentially the cost of a copy. If we use tmpfs, it can
swap, so we're forced to calculate checksums at both write and read time. My
guess is also that a normal 1-replication write will be fairly fast because of
the OS buffer cache, so it'd be nice to quantify the potential improvement.
tmpfs has become somewhat of a diversion. Let's assume the administrator
configures either ramfs or tmpfs with swap disabled (our implementation doesn't
care) so we don't have extra checksum generation beyond what we do today. I
would _really_ like to remove even the existing checksum calculation from the
write path for replicas that are being written to memory, and have the DN compute
checksums when it 'lazy persists' them to disk. I spent way more time looking into
this than I wanted to, and it is hard to do cleanly with the way the write pipeline
is set up today; I can explain the details if you are curious. I am wary of making
significant changes to the write pipeline here, but this is the first
optimization I want to address after the initial implementation.
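For illustration only, here is a minimal, self-contained sketch of the deferred-checksum idea: compute per-chunk checksums while the replica is lazily persisted, rather than in the write pipeline. It uses {{java.util.zip.CRC32}} and an invented meta-file layout purely to keep the example small; HDFS itself uses {{DataChecksum}} (CRC32C by default, 512 bytes per checksum) and its own block meta file format:
{code:java}
// Sketch only: checksum computation moved off the write path and into the
// lazy-persist copy from the RAM disk to a disk volume.
import java.io.*;
import java.util.zip.CRC32;

public class LazyPersistSketch {
  static final int BYTES_PER_CHECKSUM = 512;   // dfs.bytes-per-checksum default

  public static void persistWithChecksums(File ramDiskBlock, File diskBlock,
                                           File metaFile) throws IOException {
    byte[] chunk = new byte[BYTES_PER_CHECKSUM];
    CRC32 crc = new CRC32();
    try (InputStream in = new FileInputStream(ramDiskBlock);
         OutputStream out = new BufferedOutputStream(new FileOutputStream(diskBlock));
         DataOutputStream meta = new DataOutputStream(
             new BufferedOutputStream(new FileOutputStream(metaFile)))) {
      int n;
      while ((n = in.read(chunk)) > 0) {
        crc.reset();
        crc.update(chunk, 0, n);
        meta.writeInt((int) crc.getValue());   // checksum computed here, not in the pipeline
        out.write(chunk, 0, n);
      }
    }
  }
}
{code}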
bq. There's a mention of LAZY_PERSIST having a config option to unlink corrupt
TMP files. It seems better for this to be per-file rather than NN-wide, since
different clients might want different behavior.
That's a good idea, perhaps via an additional per-file flag. Can we keep the
system-wide option for the initial implementation and change it going forward?
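As a sketch of how per-file behavior could be surfaced to clients, assuming the {{EnumSet<CreateFlag>}} overload of {{FileSystem#create}} and a {{LAZY_PERSIST}}-style create flag (shown here as assumptions for illustration; the per-file 'unlink corrupt replica' flag itself is hypothetical and appears only as a comment):
{code:java}
import java.io.IOException;
import java.util.EnumSet;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class LazyPersistClientSketch {
  /**
   * Create a single-replica file whose blocks the DN may keep in memory.
   * A hypothetical additional flag could opt this particular file into
   * "unlink corrupt replicas" behavior instead of an NN-wide setting.
   */
  public static FSDataOutputStream createLazyPersist(FileSystem fs, Path path)
      throws IOException {
    EnumSet<CreateFlag> flags = EnumSet.of(CreateFlag.CREATE, CreateFlag.LAZY_PERSIST);
    return fs.create(path, FsPermission.getFileDefault(), flags,
        fs.getConf().getInt("io.file.buffer.size", 4096),
        (short) 1,                       // single replica
        fs.getDefaultBlockSize(path),    // default block size for this path
        null);                           // no Progressable
  }
}
{code}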
bq. 5.2.2 lists a con of mmaped files as not having control over page
writeback. Is this actually true when using mlock? Also not sure why memory
pressure is worse with mmaped files compared to tmpfs. mmap might make
eviction+SCR nicer too, since you can just drop the mlocks if you want to
evict, and the client has a hope of falling back gracefully.
Memory pressure is worse with mmapped files because we cannot control when the
pages will actually be freed: blocks can be unmapped faster than the memory
manager can write their dirty pages back to disk, so unmapping does not reclaim
memory as quickly as we evict. tmpfs has better characteristics: once we hit
the configured limit we can simply stop allocating new blocks in memory. A
related optimization I'd really like to have is to use unbuffered IO when
writing block files to disk so we don't churn the buffer cache.
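Continuing the earlier hypothetical {{MemoryBudget}} sketch, the tmpfs behavior maps to a simple placement decision: if the budget reservation fails, the replica goes straight to disk instead of forcing eviction through page writeback.
{code:java}
// Hypothetical continuation of the MemoryBudget sketch above.
public class PlacementSketch {
  private final MemoryBudget budget;    // budget from the earlier sketch

  public PlacementSketch(MemoryBudget budget) {
    this.budget = budget;
  }

  /** Returns true if the new replica should be written to the RAM disk. */
  public boolean placeInMemory(long blockSize) {
    // tryReserveRamDisk fails once [RAM disk + locked memory] hits the limit;
    // the caller then writes the replica to a DISK volume instead.
    return budget.tryReserveRamDisk(blockSize);
  }
}
{code}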
{quote}
Caveat, I'm not sure what the HSM APIs will look like, or how this will be
integrated, so some of these might be out of scope.
Will we support changing a file from DISK storage type to TMP storage type? I
would say no, since cache directives seem better for read caching when
something is already on disk.
Will we support writing a file on both TMP and another storage type? Similar to
the above, it also doesn't feel that useful.
{quote}
We are not setting the storage type on a file. HSM API work (HDFS-5682) has
been getting pushed out, most recently in favor of memory storage, but I'd like
to revisit it post-2.6. For now there is no dependence on HSM APIs and no concept
of a storage type on a file. CCM remains the preferred approach for reads, so no
change there.
Thank you for reading the doc and providing feedback.
> Write to single replica in memory
> ---------------------------------
>
> Key: HDFS-6581
> URL: https://issues.apache.org/jira/browse/HDFS-6581
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Arpit Agarwal
> Assignee: Arpit Agarwal
> Attachments: HDFSWriteableReplicasInMemory.pdf
>
>
> Per discussion with the community on HDFS-5851, we will implement writing to
> a single replica in DN memory via DataTransferProtocol.
> This avoids some of the issues with short-circuit writes, which we can
> revisit at a later time.
--
This message was sent by Atlassian JIRA
(v6.2#6252)