[ https://issues.apache.org/jira/browse/HDFS-13044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339120#comment-16339120 ]
Yiqun Lin commented on HDFS-13044: ---------------------------------- Hi [~elgoiri], there are some initial review comments from me. *DFSConfigKeys.java* 1. Can we rename config name of {{DFS_ROUTER_SAFEMODE_CACHE_EXPIRATION_MS}} from {{dfs.federation.router.safemode.cacheexpiration.count}} to {{dfs.federation.router.safemode.cache-expiration.interval}}? Why not use {{2 or 3 * DFS_ROUTER_CACHE_TIME_TO_LIVE_MS_DEFAULT}} as the default expiration interval time? Looks these two config are related, this will be better than just assigning a new value. 2. It would be better to make these configs support multiple time units suffixes. 3. Please document these configs in {{hdfs-default.xml}}. *Router.java* 1.Following lines need to update. {code:java} + if (this.safemodeService == null) { + // Router is running now + // TODO + } {code} 2. line125. This comment seems to be {{/** The start time of the Router. */}}. This was something I am missing caught before. *RouterRpcServer.java* line425: Why not define and throw a sub-exception of {{SafeModeException}}? line426: Missing "is" before {{in safe mode and cannot handle}}. line446: Can we rename {{isSafeMode}} to {{isInSafeMode}}? *RouterSafemodeService.java* line33 and 36: Seems {{FederationStateStoreService}} is non-exist. line55: Maybe we need to define a new field named {{enterTime}} to calculate elapsed time during safemoe. And {{enterTime}} will be updated in {{enter}} method every time. We may experience enter/leave safemode many times in real env. line82: {{synchronized}} is not needed since {{MutableGaugeInt#set}} is thread-safe to use. It uses AtomicInteger type to record value inside. line84: Can we print the elapsed time during safemode as well? line90: {{(int)}} is not needed, it already did in {{RouterMetrics#setSafeModeTime}}. line110 and 117: Can we print the interval time with {{MILLISECONDS}} format? There is a chance the value loosing accurate. For example, we set the interval time 999ms, it will return 0 when convert to second format. *TestRouterSafemode.java* line43: The javadoc looks confused. There are two duplicate {{the}}. {{SafeModeTimer}} is non-exist. line116: We are just for the testing, no need to sleep so long time. We can greatly decrease the interval time tested in this class. The same problem in {{testRouterEnterSafemode}}. Additional two test cases we may needed: 1. Specific test for {{dfs.federation.router.safemode.startup.interval}}. That is say, router won't leaf safemode until passing interval time. 2. The case that RouterRpcServer rejects the reuquests when it's in safemode. {quote}We could also add an API in the Router Admin to set the safe mode by hand. {quote} +1 for this proposal. We can file new JIRA for the tracking. In addition, we need to document the meaning of router states. At least I see the semantic of safemode in RBF is different with that in normal HDFS. > RBF: Add a safe mode for the Router > ----------------------------------- > > Key: HDFS-13044 > URL: https://issues.apache.org/jira/browse/HDFS-13044 > Project: Hadoop HDFS > Issue Type: Sub-task > Reporter: Íñigo Goiri > Assignee: Íñigo Goiri > Priority: Major > Attachments: HDFS-13004.000.patch > > > When a Router cannot communicate with the State Store, it should enter into a > safe mode that disallows certain operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org