[ https://issues.apache.org/jira/browse/HDFS-5651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854594#comment-13854594 ]

Colin Patrick McCabe commented on HDFS-5651:
--------------------------------------------

bq. Do you want to put the default value for the new config option in the 
apt.vm file too?

ok

bq. Can you add a message to the Precondition checks?

added

bq. I see you removed the TODO for pending/underCached/etc block stats, do you 
want to file a follow-on for that?

I added the TODO back in :)  This JIRA has a big enough scope already...

bq. Add a warn or info if the cachedBlocksPercent minimum kicks in.

Added.  I think having the minimum (rather than a hard error) is useful
because it lets people who don't want caching just set the percent to 0 and
forget about it.
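
For concreteness, here's a minimal sketch of the warn-and-clamp behavior (the
config key, default value, and minimum below are illustrative, not necessarily
what the patch uses):
{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;

public class CachedBlocksPercentCheck {
  private static final Log LOG =
      LogFactory.getLog(CachedBlocksPercentCheck.class);
  // Illustrative minimum, not necessarily the value used in the patch.
  private static final float MIN_PERCENT = 0.001f;

  static float getCachedBlocksPercent(Configuration conf) {
    float percent = conf.getFloat(
        "dfs.namenode.path.based.cache.block.map.allocation.percent", 0.25f);
    if (percent < MIN_PERCENT) {
      // Warn and clamp rather than failing hard, so setting the percent to 0
      // effectively disables caching without breaking NameNode startup.
      LOG.info("Clamping cachedBlocks map percent " + percent +
          " up to the minimum of " + MIN_PERCENT);
      percent = MIN_PERCENT;
    }
    return percent;
  }
}
{code}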

bq. This locking scheme where we need to recheck shutdown and not modify cache 
manager state feels like a potential landmine, especially since the waitFor's 
need to be moved all the way up to FSN. Could we instead periodically check the 
thread's interrupt status in CRM#rescan and throw InterruptedException, and go 
back to joining on the CRM thread? waitFor could check CRM's shutdown status 
and also throw InterruptedException. This might also let the waitFors move back 
into CacheManager.

{{InterruptedException}} doesn't interrupt {{Condition#await}}, so the only
option would be to signal and release locks.  I don't want to start releasing
locks that the caller holds, because it might create weird situations like two
simultaneous HA transitions.  The other alternative is lock timeouts in CRM,
but that would extend the HA transition time, which is something we definitely
don't want.
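
For reference, this is the kind of signal-and-recheck pattern I'm describing;
the class and field names below are made up for illustration and this is not
the actual CRM code:
{code:java}
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class RescanWaiter {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition scanFinished = lock.newCondition();
  private boolean shutdown = false;
  private long completedScanCount = 0;

  /** Wait until at least curScanCount rescans have finished, or we shut down. */
  public void waitForRescan(long curScanCount) {
    lock.lock();
    try {
      while (!shutdown && completedScanCount < curScanCount) {
        // Wake-ups come from signalAll(), not from thread interruption, so
        // the waiter has to recheck the shutdown flag each time it wakes up.
        scanFinished.awaitUninterruptibly();
      }
    } finally {
      lock.unlock();
    }
  }

  /** Called when transitioning to standby. */
  public void shutdown() {
    lock.lock();
    try {
      shutdown = true;
      scanFinished.signalAll();
    } finally {
      lock.unlock();
    }
  }

  /** Called by the rescan thread after it finishes a scan. */
  public void markScanFinished() {
    lock.lock();
    try {
      completedScanCount++;
      scanFinished.signalAll();
    } finally {
      lock.unlock();
    }
  }
}
{code}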

bq. Should we wipe out the various cached stats when we go to standby? I think 
the needed ones will be adjusted properly as the standby tails the edit log, 
but the cached ones will just sit there.

OK.  I will set them to 0.

bq. It seems like we really should have a test for transition-to-standby when a 
long CRM rescan is happening. This feels doable with some test injection 
functions to force sleeps. If you want to stick with multiple CRM threads, 
maybe also test fluttering between standby/active repeatedly and checking for 
thread cleanup and data consistency.

{{TestHAStateTransitions}} does flutter back and forth; that's how I caught
the original bug.  The consistency guarantees we give in CRM are pretty loose.
I suppose we could join all the CRM threads at the end to prove that they
terminate, and perhaps set the scan interval to something really small to
increase the likelihood that we catch the case where we're in a CRM scan
during the transition.
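
Something along these lines, sketched from memory (the config key and cluster
setup below may not exactly match what the final test looks like):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.MiniDFSNNTopology;
import org.junit.Test;

public class TestCacheManagerHAFlutter {
  @Test(timeout = 120000)
  public void testFlutterDuringRescan() throws Exception {
    Configuration conf = new Configuration();
    // Use a tiny rescan interval so a transition is likely to land in the
    // middle of a CRM scan.  Key name written from memory.
    conf.setLong("dfs.namenode.path.based.cache.refresh.interval.ms", 1L);
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .nnTopology(MiniDFSNNTopology.simpleHATopology())
        .numDataNodes(1)
        .build();
    try {
      cluster.waitActive();
      cluster.transitionToActive(0);
      // Flutter back and forth between the two NameNodes.
      for (int i = 0; i < 20; i++) {
        cluster.transitionToStandby(0);
        cluster.transitionToActive(1);
        cluster.transitionToStandby(1);
        cluster.transitionToActive(0);
      }
      // To prove the CRM threads terminate, we would join each spawned CRM
      // thread here (via a test hook); omitted in this sketch.
    } finally {
      cluster.shutdown();
    }
  }
}
{code}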

> remove dfs.namenode.caching.enabled
> -----------------------------------
>
>                 Key: HDFS-5651
>                 URL: https://issues.apache.org/jira/browse/HDFS-5651
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: namenode
>    Affects Versions: 3.0.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-5651.001.patch, HDFS-5651.002.patch, 
> HDFS-5651.003.patch, HDFS-5651.004.patch, HDFS-5651.006.patch
>
>
> We can remove dfs.namenode.caching.enabled and simply always enable caching, 
> similar to how we do with snapshots and other features.  The main overhead is 
> the size of the cachedBlocks GSet.  However, we can simply make the size of 
> this GSet configurable, and people who don't want caching can set it to a 
> very small value.
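>
> As a rough sketch of that sizing approach (the config key and default below
> are illustrative; {{LightWeightGSet#computeCapacity}} is the existing helper
> that turns a heap percentage into an entry count):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hdfs.util.LightWeightGSet;
>
> public class CachedBlocksMapSizing {
>   static int cachedBlocksMapCapacity(Configuration conf) {
>     // Illustrative key name; the real key is whatever the patch introduces.
>     float percent = conf.getFloat(
>         "dfs.namenode.path.based.cache.block.map.allocation.percent", 0.25f);
>     // computeCapacity() converts "percent of max heap" into a GSet capacity,
>     // so people who don't want caching can set the percent very low.
>     return LightWeightGSet.computeCapacity(percent, "cachedBlocks");
>   }
> }
> {code}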


