On Tue, May 31, 2011 at 7:59 PM, Stack <[email protected]> wrote:
> Like you say, it should be gone in 0.92.x.
>
> On each regionserver report, we'd deserialize an HServerAddress
> instance. As part of deserialize, we'd make an InetSocketAddress
> instance. This act of creation would do a resolve. In HSA
> constructor, if InetSocketAddress failed resolve, we'd throw the below
> IllegalArgumentException.
>
> Not sure what you can do about it in 0.90.x w/o major surgery. I
> suppose you could just catch the exception and drop the report on the
> ground until resolve works again.
>
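Stack's description can be illustrated with plain java.net calls. Per the JDK docs, `new InetSocketAddress(host, port)` tries to resolve the hostname. If that fails, the object is still created, but `isUnresolved()` returns true.

The sketch below is not HBase code. `ResolveCheck`, `requireResolved` and `handleReport` are hypothetical names. The check only mimics the behavior described above: an unresolved address raises IllegalArgumentException, and the caller catches it and drops the report. `InetSocketAddress.createUnresolved` stands in for a failed DNS lookup without touching the network.

```java
import java.net.InetSocketAddress;

class ResolveCheck {
    // Hypothetical check modeled on the behavior described in the thread:
    // reject an address whose hostname could not be resolved.
    static InetSocketAddress requireResolved(InetSocketAddress addr) {
        if (addr.isUnresolved()) {
            throw new IllegalArgumentException(
                "Could not resolve the DNS name of " + addr.getHostString() + ":" + addr.getPort());
        }
        return addr;
    }

    // Hypothetical stand-in for handling one region server report.
    // Returns false (report dropped) instead of propagating the exception.
    static boolean handleReport(InetSocketAddress addr) {
        try {
            requireResolved(addr);
            return true;
        } catch (IllegalArgumentException e) {
            System.err.println("Dropping report: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // createUnresolved skips the DNS lookup, simulating a failed resolve.
        InetSocketAddress bad = InetSocketAddress.createUnresolved("c0505.hal.cloudera.com", 60000);
        // A literal IP address needs no DNS lookup, so it is always resolved.
        InetSocketAddress good = new InetSocketAddress("127.0.0.1", 60000);
        System.out.println(handleReport(bad));
        System.out.println(handleReport(good));
    }
}
```

Dropping the report only avoids the exception. The caller still has to wait before the next report, or the loop spins just as before.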
Yea, the question is why it got into a tight loop retrying instead of
either (a) sleeping between retries, or (b) shutting down after some
number of retries. The code looks like it's supposed to do (a), but the
log messages are only a few millis apart.

> On Tue, May 31, 2011 at 7:21 PM, Todd Lipcon <[email protected]> wrote:
> > We had a QA cluster which got left on for a while during some
> > maintenance to DNS/etc in our colo... everything is fine in the RS
> > logs until:
> >
> > 2011-05-14 23:11:46,154 ERROR org.apache.hadoop.hbase.HServerAddress: Could not resolve the DNS name of c0505.hal.cloudera.com:60000
> > 2011-05-14 23:11:46,154 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> > java.lang.IllegalArgumentException: Could not resolve the DNS name of c0505.hal.cloudera.com:60000
> >     at org.apache.hadoop.hbase.HServerAddress.checkBindAddressCanBeResolved(HServerAddress.java:105)
> >     at org.apache.hadoop.hbase.HServerAddress.<init>(HServerAddress.java:66)
> >     at org.apache.hadoop.hbase.MasterAddressTracker.getMasterAddress(MasterAddressTracker.java:63)
> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.getMasterAddress(HRegionServer.java:1469)
> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1442)
> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:742)
> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:591)
> >     at java.lang.Thread.run(Thread.java:619)
> > 2011-05-14 23:12:14,175 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at c0505.hal.cloudera.com:60000
> > 2011-05-14 23:12:14,177 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at c0505.hal.cloudera.com:60000
> > 2011-05-14 23:12:14,178 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at c0505.hal.cloudera.com:60000
> > 2011-05-14 23:12:14,179 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at c0505.hal.cloudera.com:60000
> >
> > followed by many GB of the above two messages alternating.
> >
> > This is something close to an 0.90.1 plus a few patches here and
> > there... this ring a bell for anyone or should I dig? Looks like in
> > trunk it's mostly rewritten by HBASE-3827/HBASE-1502. I do have
> > HBASE-3545 in the build.
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>

--
Todd Lipcon
Software Engineer, Cloudera
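Todd names two ways the retry loop could behave: (a) sleep between retries, or (b) give up after some number of retries. Both can be sketched as a generic bounded-retry helper.

This is a hypothetical illustration, not the HRegionServer loop. `BoundedRetry`, `retry`, `maxAttempts` and `sleepMillis` are made-up names, and the fixed sleep is an assumption; a real server might prefer exponential backoff.

```java
import java.util.concurrent.Callable;

class BoundedRetry {
    // Hypothetical helper: sleeps between failed attempts (option a) and
    // rethrows the last failure after maxAttempts (option b).
    static <T> T retry(Callable<T> action, int maxAttempts, long sleepMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                System.err.println("Attempt=" + attempt + " failed: " + e.getMessage());
                // Only sleep if another attempt will follow.
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        // Every attempt failed, so last is non-null here.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated operation that fails twice, then succeeds.
        String result = retry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new IllegalArgumentException("Could not resolve the DNS name");
            }
            return "connected";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

With this shape, a persistent DNS failure produces at most `maxAttempts` log lines spaced `sleepMillis` apart, not messages a few milliseconds apart.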
