[
https://issues.apache.org/jira/browse/HDFS-5810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893685#comment-13893685
]
Colin Patrick McCabe commented on HDFS-5810:
--------------------------------------------
bq. I don't ask this lightly, but I really would like some patch splitting
here. I see a lot of nice refactors, but it's hard to separate out the intent
of the changes. One potential easy split is making BlockReader into a builder,
since it's pretty mechanical and touches a lot of files.
I think splitting that off would be a lot harder than you think. Previously,
the {{BlockReader}} creation logic was all in {{DFSInputStream}}. Pulling it
out while still using the old cache APIs would require almost a full rewrite
of large parts of this patch.
Similarly, splitting off the ClientContext stuff would be difficult, since the
old FD cache was not meant to be shared between threads. I guess we could
split it off and just put PeerCache in there, but I don't think that would
reduce the size of this patch that much.
bq. ClientContext#confAsString needs to be updated to match the ClientContext
members, one dupe too.
updated, thanks
bq. hdfs-default.xml, the default value of stale.threshold.ms of 3000000 does
not match the DFSConfigKeys default value of 1800000.
fixed, thanks
bq. SCReplica#Key is almost identical to FsDatasetCache#Key, but I guess that's
okay unless you want to make a shared BlockKey class.
It's a pretty small class, and we might want to add something to the key in the
future, so I think we should probably keep it as-is.
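For reference, such a key is basically just a block ID plus block pool ID with
{{equals}} and {{hashCode}} so it can serve as a map key -- something like the
sketch below (class and field names are illustrative, not the actual
SCReplica#Key members):
{code:java}
// Illustrative sketch of a per-replica cache key: block ID + block pool ID.
final class BlockKey {
  private final long blockId;
  private final String bpid;

  BlockKey(long blockId, String bpid) {
    this.blockId = blockId;
    this.bpid = bpid;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof BlockKey)) {
      return false;
    }
    BlockKey other = (BlockKey) o;
    return blockId == other.blockId && bpid.equals(other.bpid);
  }

  @Override
  public int hashCode() {
    return 31 * (int) (blockId ^ (blockId >>> 32)) + bpid.hashCode();
  }
}
{code}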
bq. How did you decide on the default mmap cache size of 1024? I feel like
since we default to 4096 max datanode transfer threads, that'd be a more
appropriate number. If the default FD ulimit of 1024 is your concern, seems
like we should lower it below that for some headroom.
That's a very good question. I guess the answer is that when we did HDFS-4953,
{{dfs.client.mmap.cache.size}} got set to 1024 as a nice round number. I think
you are right that it will cause some problems with running out of file
descriptors with the default max file descriptor ulimit of 1024. It is a shame
to lower the default, but I think that's what we have to do to avoid a poor
user experience.
bq. These caches right now expire purely on a time basis, not actually based on
pressure like, say, the OS page cache. For the mmap cache at least, how about a
hybrid: have a really long timeout, but make eviction also pressure based? The
danger of caching forever seems to be holding on to a "bad fd" (e.g. DN deletes
the file), but with the forthcoming shm communication segment this is mitigated.
I guess the argument in favor of a long value for
{{dfs.client.read.shortcircuit.streams.cache.expiry.ms}} and
{{dfs.client.mmap.cache.timeout.ms}} is that:
* applications should plan for max fd usage anyway, so keeping the cache closer
to full for a while should not be a problem;
* now that the cache is shared between input streams, a longer timeout will
help more threads;
* now that we don't open multiple fds for the same replica, there will be less
FD pressure anyway;
* the kernel memory consumed by open fds is minimal; the page cache evicts
mostly based on usage, not on whether a file is open.
The argument against is:
* an application may stop using short-circuit reads for a while (or indeed HDFS
entirely), and want those fds back. We have no way for the application to
request them back right now.
I see staleness as a separate issue. We have two strategies for solving
staleness: timeout-based (when talking to pre-HDFS-5182 DataNodes), and
shm-based (when talking to newer DataNodes that have HDFS-5182).
On balance, I agree with your argument. Let's lengthen these timeouts, and
bump up {{dfs.client.read.shortcircuit.streams.cache.size}} a little bit. On
the other hand, the default for {{dfs.client.mmap.cache.size}} needs to be
reduced a bit to avoid overflowing the default fd ulimit.
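Concretely, the sort of tuning we're talking about looks like this (property
names are the ones discussed above; the values are illustrative only, not the
new defaults we'll settle on):
{code:java}
import org.apache.hadoop.conf.Configuration;

public class CacheTuningSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Longer expiry: open fds are cheap to keep around, and the shared
    // cache means more threads benefit from each cached replica.
    conf.setLong("dfs.client.read.shortcircuit.streams.cache.expiry.ms",
        5 * 60 * 1000);   // hypothetical value: 5 minutes
    conf.setLong("dfs.client.mmap.cache.timeout.ms",
        60 * 60 * 1000);  // hypothetical value: 1 hour
    // Keep the caches well under the default fd ulimit of 1024, leaving
    // headroom for sockets, jars, logs, and application files.
    conf.setInt("dfs.client.read.shortcircuit.streams.cache.size", 256);
    conf.setInt("dfs.client.mmap.cache.size", 256);
  }
}
{code}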
bq. s/mmaped/mmapped/g, good to be consistent. munmapped too, and case
insensitive.
Fixed
bq. Can use java's built-in TimeUnit to convert rather than the new
nanosecondsToMilliseconds function
OK
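For the record, the built-in replacement is a one-liner:
{code:java}
import java.util.concurrent.TimeUnit;

class TimeUnitExample {
  static long nanosecondsToMilliseconds(long ns) {
    // Java's built-in TimeUnit replaces the hand-rolled conversion.
    return TimeUnit.NANOSECONDS.toMillis(ns);
  }
}
{code}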
bq. In CacheCleaner, rather than a while loop, why not a for each? It'll use
the iterator order, which should be ascending insertion time.
Iterators could get tricky here, since we're calling functions that modify the
eviction maps. I'm not sure what the semantics are if you call a function that
removes an element while you're holding an iterator positioned at it. It seems
safer just to call firstEntry, as in the sketch below. The current
construction also lets us refactor this function later to release the lock
between evictions, if need be.
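Roughly this shape (map and method names are placeholders, not the patch's
actual identifiers):
{code:java}
import java.util.Map;
import java.util.TreeMap;

class CacheCleanerSketch {
  // Evictable replicas keyed by insertion time in nanoseconds, ascending.
  private final TreeMap<Long, Object> evictable = new TreeMap<Long, Object>();

  void evictExpired(long nowNs, long timeoutNs) {
    while (true) {
      Map.Entry<Long, Object> eldest = evictable.firstEntry();
      if (eldest == null || nowNs - eldest.getKey() < timeoutNs) {
        return; // map is empty, or the eldest entry is still fresh
      }
      purge(eldest.getKey());
    }
  }

  private void purge(Long insertionTimeNs) {
    // Purging mutates the eviction map; since we never hold an iterator,
    // there is no ConcurrentModificationException risk, and the loop
    // could later release the cache lock between evictions.
    evictable.remove(insertionTimeNs);
    // ... close the replica's fds / munmap it here ...
  }
}
{code}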
bq. Could probably split out the two eviction loops into a single parameterized
function.
I feel like there would be too many parameters. The mmap loop does slightly
different stuff...
bq. ref, unref, fetchOrCreate javadoc say you need to hold the lock, but they
themselves take the cache lock.
Oops-- that's outdated javadoc. Fixed.
bq. close, the try/finally lock idiom takes the lock inside the try, normally
this is outside?
fixed
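For reference, the standard idiom acquires the lock just before the try, so a
failed {{lock()}} can't lead to an unpaired {{unlock()}} in the finally block:
{code:java}
import java.util.concurrent.locks.ReentrantLock;

class LockIdiom {
  private final ReentrantLock lock = new ReentrantLock();

  void doWork() {
    // Acquire outside the try: if lock() threw, the finally clause
    // would otherwise call unlock() on a lock we never acquired.
    lock.lock();
    try {
      // ... critical section ...
    } finally {
      lock.unlock();
    }
  }
}
{code}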
bq. Propagating a string tag for trace logging is ugly. Can we walk the stack
instead to get the caller, e.g. via ExceptionUtils#getFullStackTrace or other?
I'm okay paying the stack construction cost if it's only for trace logging.
This was something I needed during debugging. For example, for {{unref}}, I
needed to know where the ultimate source was -- {{BlockReaderLocal#close}} or
{{ClientMmap#unref}}. You can't get that just by looking at the top of the
stack-- that will just tell you that {{ShortCircuitCache#purge}} or
{{ShortCircuitReplica#unref}}. Maybe there's a better way to do this... but I
can't think of it right now. I guess we could barf out the whole stack, but
that seems to make tracing a lot harder to read.
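To illustrate the limitation (a sketch, not code from the patch): even if
{{unref}} inspected the stack directly, it would only see its immediate
caller:
{code:java}
class TraceTagSketch {
  static String immediateCaller() {
    // stack[0] is getStackTrace, stack[1] is this method, stack[2] is
    // the immediate caller -- e.g. ShortCircuitCache#purge. The frames
    // that distinguish BlockReaderLocal#close from ClientMmap#unref sit
    // further up the stack and would have to be walked and filtered.
    StackTraceElement[] stack = Thread.currentThread().getStackTrace();
    return stack[2].getClassName() + "#" + stack[2].getMethodName();
  }
}
{code}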
bq. allocMmap, no longer a toClose param mentioned in the javadoc
fixed
bq. unref, we could get rid of newRefCount, it doesn't really improve
readability
OK
bq. ShortCircuitReplica#unref calls ShortCircuitCache#unref, but there's no
purge. Will this precondition abort?
The refCount can never reach 0 unless the replica has been purged. Purging
means that the cache no longer contains the replica (and hence does not hold a
reference to it).
On the other hand, purged replicas may have a refCount > 0 if someone is still
using them.
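So the precondition is safe. A sketch of the invariant (the fields here are
hypothetical stand-ins for the actual ShortCircuitReplica state):
{code:java}
import com.google.common.base.Preconditions;

class ReplicaRefCountSketch {
  // The cache itself holds one reference until the replica is purged.
  private int refCount;
  private boolean purged;

  void unref() {
    refCount--;
    // refCount can only reach 0 once the replica has been purged,
    // because an unpurged replica is still referenced by the cache.
    // (A purged replica may keep refCount > 0 while callers use it.)
    Preconditions.checkState(refCount > 0 || purged,
        "refCount reached 0 on a replica the cache still references");
    if (purged && refCount == 0) {
      close();
    }
  }

  private void close() {
    // ... free the fds and any mmap here ...
  }
}
{code}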
bq. Some references to eviction "lists" but they look to be maps now
Hmm. Quite a few, actually. Fixed.
> Unify mmap cache and short-circuit file descriptor cache
> --------------------------------------------------------
>
> Key: HDFS-5810
> URL: https://issues.apache.org/jira/browse/HDFS-5810
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs-client
> Affects Versions: 2.3.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-5810.001.patch, HDFS-5810.004.patch,
> HDFS-5810.006.patch, HDFS-5810.008.patch, HDFS-5810.015.patch
>
>
> We should unify the client mmap cache and the client file descriptor cache.
> Since mmaps are granted corresponding to file descriptors in the cache
> (currently FileInputStreamCache), they have to be tracked together to do
> "smarter" things like HDFS-5182.