[
https://issues.apache.org/jira/browse/HDFS-5810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893685#comment-13893685
]
Colin Patrick McCabe commented on HDFS-5810:
--------------------------------------------
bq. I don't ask this lightly, but I really would like some patch splitting
here. I see a lot of nice refactors, but it's hard to separate out the intent
of the changes. One potential easy split is making BlockReader into a builder,
since it's pretty mechanical and touches a lot of files.
I think splitting that off would be a lot harder than you think. Previously,
the {{BlockReader}} creation logic was all in {{DFSInputStream}}. Pulling it
out while still using the old cache APIs would require almost a full rewrite
of large parts of this patch.
Similarly, splitting off the ClientContext stuff would be difficult, since the
old FD cache was not meant to be shared between threads. I guess we could
split it off and just put PeerCache in there, but I don't think that would
reduce the size of this patch that much.
bq. ClientContext#confAsString needs to be updated to match the ClientContext
members, one dupe too.
updated, thanks
bq. hdfs-default.xml, the default value of stale.threshold.ms of 3000000 does
not match the DFSConfigKeys default value of 1800000.
fixed, thanks
bq. SCReplica#Key is almost identical to FsDatasetCache#Key, but I guess that's
okay unless you want to make a shared BlockKey class.
It's a pretty small class, and we might want to add something to the key in the
future, so I think we should probably keep it as-is.
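For reference, such a key is basically just a block ID plus block pool ID with
{{equals}} and {{hashCode}} so it can serve as a map key -- something like the
sketch below (class and field names are illustrative, not the actual
SCReplica#Key members):
{code:java}
// Illustrative sketch of a per-replica cache key: block ID + block pool ID.
final class BlockKey {
  private final long blockId;
  private final String bpid;

  BlockKey(long blockId, String bpid) {
    this.blockId = blockId;
    this.bpid = bpid;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof BlockKey)) {
      return false;
    }
    BlockKey other = (BlockKey) o;
    return blockId == other.blockId && bpid.equals(other.bpid);
  }

  @Override
  public int hashCode() {
    return 31 * (int) (blockId ^ (blockId >>> 32)) + bpid.hashCode();
  }
}
{code}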
bq. How did you decide on the default mmap cache size of 1024? I feel like
since we default to 4096 max datanode transfer threads, that'd be a more
appropriate number. If the default FD ulimit of 1024 is your concern, seems
like we should lower it below that for some headroom.
That's a very good question. I guess the answer is that when we did HDFS-4953,
{{dfs.client.mmap.cache.size}} got set to 1024 as a nice round number. I think
you are right that it will cause some problems with running out of file
descriptors with the default max file descriptor ulimit of 1024. It is a shame
to lower the default, but I think that's what we have to do to avoid a poor
user experience.
bq. These caches right now expire purely on a time basis, not actually based on
pressure like, say, the OS page cache. For the mmap cache at least, how about a
hybrid: have a really long timeout, but make eviction also pressure based? The
danger of caching forever seems to be holding on to a "bad fd" (e.g. DN deletes
the file), but with the forthcoming shm communication segment this is mitigated.
I guess the argument in favor of a long value for
{{dfs.client.read.shortcircuit.streams.cache.expiry.ms}} and
{{dfs.client.mmap.cache.timeout.ms}} is that:
* applications should plan for max fd usage anyway, so keeping the cache closer
to full for a while should not be a problem;
* now that the cache is shared between input streams, a longer timeout will
help more threads;
* now that we don't open multiple fds for the same replica, there will be less
FD pressure anyway;
* the kernel memory consumed by open fds is minimal; the page cache evicts
mostly based on usage, not on whether a file is open.
The argument against is:
* an application may stop using short-circuit reads for a while (or indeed HDFS
entirely), and want those fds back. We have no way for the application to
request them back right now.
I see staleness as a separate issue. We have two strategies for solving
staleness: timeout-based (when talking to pre-HDFS-5182 DataNodes), and
shm-based (when talking to newer DataNodes that have HDFS-5182).
On balance, I agree with your argument. Let's lengthen these timeouts, and
bump up {{dfs.client.read.shortcircuit.streams.cache.size}} a little bit. On
the other hand, the default for {{dfs.client.mmap.cache.size}} needs to be
reduced a bit to avoid overflowing the default fd ulimit.
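Concretely, the sort of tuning we're talking about looks like this (property
names are the ones discussed above; the values are illustrative only, not the
new defaults we'll settle on):
{code:java}
import org.apache.hadoop.conf.Configuration;

public class CacheTuningSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Longer expiry: open fds are cheap to keep around, and the shared
    // cache means more threads benefit from each cached replica.
    conf.setLong("dfs.client.read.shortcircuit.streams.cache.expiry.ms",
        5 * 60 * 1000);   // hypothetical value: 5 minutes
    conf.setLong("dfs.client.mmap.cache.timeout.ms",
        60 * 60 * 1000);  // hypothetical value: 1 hour
    // Keep the caches well under the default fd ulimit of 1024, leaving
    // headroom for sockets, jars, logs, and application files.
    conf.setInt("dfs.client.read.shortcircuit.streams.cache.size", 256);
    conf.setInt("dfs.client.mmap.cache.size", 256);
  }
}
{code}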
bq. s/mmaped/mmapped/g, good to be consistent. munmapped too, and case
insensitive.
Fixed
bq. Can use java's built-in TimeUnit to convert rather than the new
nanosecondsToMilliseconds function
OK
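For the record, the built-in replacement is a one-liner:
{code:java}
import java.util.concurrent.TimeUnit;

class TimeUnitExample {
  static long nanosecondsToMilliseconds(long ns) {
    // Java's built-in TimeUnit replaces the hand-rolled conversion.
    return TimeUnit.NANOSECONDS.toMillis(ns);
  }
}
{code}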
bq. In CacheCleaner, rather than a while loop, why not a for each? It'll use
the iterator order, which should be ascending insertion time.
Iterators could get tricky here, since we're calling functions that modify the
eviction maps. I'm not sure what the semantics are if you call a function that
removes an element while you're holding an iterator positioned at it. It seems
safer just to call firstEntry, as in the sketch below. The current
construction also lets us refactor this function later to release the lock
between evictions, if need be.
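Roughly this shape (map and method names are placeholders, not the patch's
actual identifiers):
{code:java}
import java.util.Map;
import java.util.TreeMap;

class CacheCleanerSketch {
  // Evictable replicas keyed by insertion time in nanoseconds, ascending.
  private final TreeMap<Long, Object> evictable = new TreeMap<Long, Object>();

  void evictExpired(long nowNs, long timeoutNs) {
    while (true) {
      Map.Entry<Long, Object> eldest = evictable.firstEntry();
      if (eldest == null || nowNs - eldest.getKey() < timeoutNs) {
        return; // map is empty, or the eldest entry is still fresh
      }
      purge(eldest.getKey());
    }
  }

  private void purge(Long insertionTimeNs) {
    // Purging mutates the eviction map; since we never hold an iterator,
    // there is no ConcurrentModificationException risk, and the loop
    // could later release the cache lock between evictions.
    evictable.remove(insertionTimeNs);
    // ... close the replica's fds / munmap it here ...
  }
}
{code}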
bq. Could probably split out the two eviction loops into a single parameterized
function.
I feel like there would be too many parameters. The mmap loop does slightly
different stuff...
bq. ref, unref, fetchOrCreate javadoc say you need to hold the lock, but they
themselves take the cache lock.
Oops-- that's outdated javadoc. Fixed.
bq. close, the try/finally lock idiom takes the lock inside the try, normally
this is outside?
fixed
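For reference, the standard idiom acquires the lock just before the try, so a
failed {{lock()}} can't lead to an unpaired {{unlock()}} in the finally block:
{code:java}
import java.util.concurrent.locks.ReentrantLock;

class LockIdiom {
  private final ReentrantLock lock = new ReentrantLock();

  void doWork() {
    // Acquire outside the try: if lock() threw, the finally clause
    // would otherwise call unlock() on a lock we never acquired.
    lock.lock();
    try {
      // ... critical section ...
    } finally {
      lock.unlock();
    }
  }
}
{code}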
bq. Propagating a string tag for trace logging is ugly. Can we walk the stack
instead to get the caller, e.g. via ExceptionUtils#getFullStackTrace or other?
I'm okay paying the stack construction cost if it's only for trace logging.
This was something I needed during debugging. For example, for {{unref}}, I
needed to know where the ultimate source was -- {{BlockReaderLocal#close}} or
{{ClientMmap#unref}}. You can't get that just by looking at the top of the
stack-- that will just tell you that {{ShortCircuitCache#purge}} or
{{ShortCircuitReplica#unref}}. Maybe there's a better way to do this... but I
can't think of it right now. I guess we could barf out the whole stack, but
that seems to make tracing a lot harder to read.
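To illustrate the limitation (a sketch, not code from the patch): even if
{{unref}} inspected the stack directly, it would only see its immediate
caller:
{code:java}
class TraceTagSketch {
  static String immediateCaller() {
    // stack[0] is getStackTrace, stack[1] is this method, stack[2] is
    // the immediate caller -- e.g. ShortCircuitCache#purge. The frames
    // that distinguish BlockReaderLocal#close from ClientMmap#unref sit
    // further up the stack and would have to be walked and filtered.
    StackTraceElement[] stack = Thread.currentThread().getStackTrace();
    return stack[2].getClassName() + "#" + stack[2].getMethodName();
  }
}
{code}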
bq. allocMmap, no longer a toClose param mentioned in the javadoc
fixed
bq. unref, we could get rid of newRefCount, it doesn't really improve
readability
OK
bq. ShortCircuitReplica#unref calls ShortCircuitCache#unref, but there's no
purge. Will this precondition abort?
The refCount can never reach 0 unless the replica has been purged. Purging
means that the cache no longer contains the replica (and hence does not hold a
reference to it).
On the other hand, purged replicas may have a refCount > 0 if someone is still
using them.
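So the precondition is safe. A sketch of the invariant (the fields here are
hypothetical stand-ins for the actual ShortCircuitReplica state):
{code:java}
import com.google.common.base.Preconditions;

class ReplicaRefCountSketch {
  // The cache itself holds one reference until the replica is purged.
  private int refCount;
  private boolean purged;

  void unref() {
    refCount--;
    // refCount can only reach 0 once the replica has been purged,
    // because an unpurged replica is still referenced by the cache.
    // (A purged replica may keep refCount > 0 while callers use it.)
    Preconditions.checkState(refCount > 0 || purged,
        "refCount reached 0 on a replica the cache still references");
    if (purged && refCount == 0) {
      close();
    }
  }

  private void close() {
    // ... free the fds and any mmap here ...
  }
}
{code}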
bq. Some references to eviction "lists" but they look to be maps now
Hmm. Quite a few, actually. Fixed.
> Unify mmap cache and short-circuit file descriptor cache
> --------------------------------------------------------
>
> Key: HDFS-5810
> URL: https://issues.apache.org/jira/browse/HDFS-5810
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs-client
> Affects Versions: 2.3.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-5810.001.patch, HDFS-5810.004.patch,
> HDFS-5810.006.patch, HDFS-5810.008.patch, HDFS-5810.015.patch
>
>
> We should unify the client mmap cache and the client file descriptor cache.
> Since mmaps are granted corresponding to file descriptors in the cache
> (currently FileInputStreamCache), they have to be tracked together to do
> "smarter" things like HDFS-5182.