[
https://issues.apache.org/jira/browse/HDFS-5651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854594#comment-13854594
]
Colin Patrick McCabe commented on HDFS-5651:
--------------------------------------------
bq. Do you want to put the default value for the new config option in the
apt.vm file too?
OK.
bq. Can you add a message to the Precondition checks?
Added.
bq. I see you removed the TODO for pending/underCached/etc block stats, do you
want to file a follow-on for that?
I added the TODO back in :) This jira has a big enough scope already...
bq. Add a warn or info if the cachedBlocksPercent minimum kicks in.
Added. I think having the minimum is useful (rather than a hard error) because
it lets people who don't want caching just set the percent to 0 and forget
about it.
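
A minimal sketch of the minimum-with-a-warning behavior described above. The class name, method name, and the {{MIN_PERCENT}} value are hypothetical and are not taken from the patch; the real code would presumably use the NameNode's own logger rather than {{java.util.logging}}.

```java
import java.util.logging.Logger;

final class CachedBlocksPercentSketch {
  private static final Logger LOG =
      Logger.getLogger(CachedBlocksPercentSketch.class.getName());

  // Hypothetical floor; the actual minimum used by the patch may differ.
  static final float MIN_PERCENT = 0.001f;

  /**
   * Returns the configured percent, or MIN_PERCENT (with a warning) when the
   * configured value is below the floor. A value of 0 is therefore accepted
   * rather than rejected with a hard error.
   */
  static float clampPercent(float configured) {
    if (configured < MIN_PERCENT) {
      LOG.warning("Configured cachedBlocks percent " + configured
          + " is below the minimum; using " + MIN_PERCENT + " instead.");
      return MIN_PERCENT;
    }
    return configured;
  }
}
```

With this shape, someone who does not want caching can set the percent to 0, get a single warning at startup, and pay only for the smallest allowed GSet.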
bq. This locking scheme where we need to recheck shutdown and not modify cache
manager state feels like a potential landmine, especially since the waitFor's
need to be moved all the way up to FSN. Could we instead periodically check the
thread's interrupt status in CRM#rescan and throw InterruptedException, and go
back to joining on the CRM thread? waitFor could check CRM's shutdown status
and also throw InterruptedException. This might also let the waitFors move back
into CacheManager.
{{InterruptedException}} doesn't interrupt {{Condition#await}}, so the only
option would be to signal and release locks. I don't want to start releasing
locks that the caller holds because it might create weird situations like two
simultaneous HA transitions. The only other option is lock timeouts in CRM,
but that would extend the HA transition time, which is something we definitely
don't want.
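
A minimal sketch of the signal-and-recheck pattern under discussion, in which a waiter is woken by a signal on the condition and then rechecks the shutdown flag before doing anything else. All names here ({{RescanWaitSketch}}, {{waitForRescan}}, {{markScanFinished}}) are hypothetical stand-ins, and the single private lock is a simplification; in the patch the lock involved is one the caller already holds.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

final class RescanWaitSketch {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition scanFinished = lock.newCondition();
  private boolean shutdown = false;   // guarded by lock
  private long completedScans = 0;    // guarded by lock

  /** Called by the monitor thread after each rescan completes. */
  void markScanFinished() {
    lock.lock();
    try {
      completedScans++;
      scanFinished.signalAll();
    } finally {
      lock.unlock();
    }
  }

  /** Called on transition to standby; wakes every waiter so it can bail out. */
  void shutdown() {
    lock.lock();
    try {
      shutdown = true;
      scanFinished.signalAll();
    } finally {
      lock.unlock();
    }
  }

  /**
   * Waits for the next rescan to finish.
   * Returns true if a rescan completed, false if shutdown happened first.
   */
  boolean waitForRescan() throws InterruptedException {
    lock.lock();
    try {
      long target = completedScans + 1;
      // Recheck both conditions after every wakeup.
      while (!shutdown && completedScans < target) {
        scanFinished.await();
      }
      return completedScans >= target;
    } finally {
      lock.unlock();
    }
  }
}
```

A caller that gets false back must not touch cache manager state, which is the "recheck shutdown" obligation the review comment is worried about.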
bq. Should we wipe out the various cached stats when we go to standby? I think
the needed ones will be adjusted properly as the standby tails the edit log,
but the cached ones will just sit there.
OK. I will set them to 0.
bq. It seems like we really should have a test for transition-to-standby when a
long CRM rescan is happening. This feels doable with some test injection
functions to force sleeps. If you want to stick with multiple CRM threads,
maybe also test fluttering between standby/active repeatedly and checking for
thread cleanup and data consistency.
{{TestHAStateTransitions}} does flutter back and forth; that's how I caught
the original bug. The consistency guarantees we give in CRM are pretty loose.
I suppose we could join all CRM threads at the end to prove that they
terminate. And perhaps set the scan interval to something really small to
increase the likelihood we catch the case where we're in a CRM scan during the
transition.
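
A rough sketch of what such a check could look like: start and stop a monitor thread on every simulated transition, use a very small scan interval, and join every thread at the end. The {{Monitor}} class is a hypothetical stand-in for a CRM thread and the "transitions" are plain start/close calls, so this only illustrates the shape of the test, not the real HA machinery.

```java
import java.util.ArrayList;
import java.util.List;

final class MonitorFlutterSketch {
  /** Hypothetical stand-in for a CRM thread: loops until told to stop. */
  static final class Monitor extends Thread {
    private final long intervalMs;
    private volatile boolean shutdown = false;

    Monitor(long intervalMs) {
      this.intervalMs = intervalMs;
    }

    @Override
    public void run() {
      while (!shutdown) {
        try {
          // Stands in for one rescan plus the wait before the next one.
          Thread.sleep(intervalMs);
        } catch (InterruptedException e) {
          return;
        }
      }
    }

    void close() {
      shutdown = true;
      interrupt();
    }
  }

  /** Flutters active/standby and returns how many threads failed to exit. */
  static int flutter(int transitions, long intervalMs, long joinTimeoutMs)
      throws InterruptedException {
    List<Monitor> started = new ArrayList<>();
    for (int i = 0; i < transitions; i++) {
      Monitor m = new Monitor(intervalMs);  // simulated transition to active
      m.start();
      started.add(m);
      Thread.sleep(intervalMs);             // let it get into its scan loop
      m.close();                            // simulated transition to standby
    }
    int alive = 0;
    for (Monitor m : started) {
      m.join(joinTimeoutMs);
      if (m.isAlive()) {
        alive++;
      }
    }
    return alive;
  }
}
```

A test would then assert that {{flutter(...)}} returns 0, i.e. that every monitor thread created during the fluttering has terminated.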
> remove dfs.namenode.caching.enabled
> -----------------------------------
>
> Key: HDFS-5651
> URL: https://issues.apache.org/jira/browse/HDFS-5651
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: namenode
> Affects Versions: 3.0.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-5651.001.patch, HDFS-5651.002.patch,
> HDFS-5651.003.patch, HDFS-5651.004.patch, HDFS-5651.006.patch
>
>
> We can remove dfs.namenode.caching.enabled and simply always enable caching,
> similar to how we do with snapshots and other features. The main overhead is
> the size of the cachedBlocks GSet. However, we can simply make the size of
> this GSet configurable, and people who don't want caching can set it to a
> very small value.