On Wed, May 9, 2012 at 10:00 PM, Serge Blazhiyevskyy <serge.blazhiyevs...@nice.com> wrote:
> Looks like you have some under replicated blocks. Does that number
> decrease if you fsck multiple times?
>

Yes, since my last post it's now down to 353....

Status: HEALTHY
 Total size: 246983628437 B (Total open files size: 372 B)
 Total dirs: 15172
 Total files: 39637 (Files currently being written: 7)
 Total blocks (validated): 41046 (avg. block size 6017239 B) (Total open file blocks (not validated): 6)
 Minimally replicated blocks: 41046 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 353 (0.86001074 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 3.016981
 Corrupt blocks: 0
 Missing replicas: 1774 (1.4325514 %)
 Number of data-nodes: 5
 Number of racks: 1
FSCK ended at Wed May 09 21:26:40 UTC 2012 in 904 milliseconds

>
> Regards,
> Serge
>
> On 5/9/12 12:23 PM, "Darrell Taylor" <darrell.tay...@gmail.com> wrote:
>
> >On Wed, May 9, 2012 at 6:04 PM, Serge Blazhiyevskyy <
> >serge.blazhiyevs...@nice.com> wrote:
> >
> >>
> >> What's the response from fsck look like?
> >>
> >
> >[snip lots of stuff about under replicated blocks]
> >
> >......Status: HEALTHY
> > Total size: 246858876262 B (Total open files size: 372 B)
> > Total dirs: 14914
> > Total files: 39248 (Files currently being written: 4)
> > Total blocks (validated): 40657 (avg. block size 6071743 B) (Total open file blocks (not validated): 4)
> > Minimally replicated blocks: 40657 (100.0 %)
> > Over-replicated blocks: 0 (0.0 %)
> > Under-replicated blocks: 1410 (3.4680374 %)
> > Mis-replicated blocks: 0 (0.0 %)
> > Default replication factor: 3
> > Average block replication: 2.9911454
> > Corrupt blocks: 0
> > Missing replicas: 2831 (2.3279145 %)
> > Number of data-nodes: 5
> > Number of racks: 1
> >FSCK ended at Wed May 09 19:19:11 UTC 2012 in 980 milliseconds
> >
> >
> >Further information to add to this, it appears to be affecting 2 nodes in
> >the cluster, one more than the other though. In the last couple of hours
> >one of the nodes has also experienced high load, this has now dropped but
> >both of these nodes are now considered dead by the namenode. The first
> >box load is still increasing, currently 234! I think I might have to
> >reboot it via IPMI.
> >
> >
> >>
> >> hadoop fsck /
> >>
> >>
> >> It might be the case that some of the blocks are misreplicated
> >>
> >>
> >> Serge
> >>
> >> Hadoopway.blogspot.com
> >>
> >>
> >> On 5/9/12 9:58 AM, "Darrell Taylor" <darrell.tay...@gmail.com> wrote:
> >>
> >> >On Wed, May 9, 2012 at 5:56 PM, Serge Blazhiyevskyy <
> >> >serge.blazhiyevs...@nice.com> wrote:
> >> >
> >> >> Take a look at your data distribution for that cluster. Maybe, it is
> >> >> unbalanced.
> >> >>
> >> >>
> >> >> Run balancer, if it is
> >> >>
> >> >
> >> >The cluster is balanced, I ran balancer yesterday. Oddly enough the
> >> >problem started after I had run the balancer.
> >> >
> >> >I'm running CDH3 btw.
> >> >
> >> >
> >> >>
> >> >> Regards,
> >> >> Serge
> >> >>
> >> >> hadoopway.blogspot.com
> >> >>
> >> >>
> >> >> On 5/9/12 9:52 AM, "Darrell Taylor" <darrell.tay...@gmail.com> wrote:
> >> >>
> >> >> >Hi,
> >> >> >
> >> >> >I wonder if someone could give some pointers with a problem I'm having?
> >> >> >
> >> >> >I have a 7 machine cluster setup for testing and we have been pouring data
> >> >> >into it for a week without issue, have learnt several things along the way
> >> >> >and solved all the problems up to now by searching online, but now I'm
> >> >> >stuck. One of the data nodes decided to have a load of 70+ this morning,
> >> >> >stopping datanode and tasktracker brought it back to normal, but every time
> >> >> >I start the datanode again the load shoots through the roof, and all I get
> >> >> >in the logs is :
> >> >> >
> >> >> >STARTUP_MSG: Starting DataNode
> >> >> >STARTUP_MSG: host = pl464/10.20.16.64
> >> >> >STARTUP_MSG: args = []
> >> >> >STARTUP_MSG: version = 0.20.2-cdh3u3
> >> >> >STARTUP_MSG: build = file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze
> >> >> >-************************************************************/
> >> >> >
> >> >> >2012-05-09 16:12:05,925 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
> >> >> >
> >> >> >2012-05-09 16:12:06,139 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
> >> >> >
> >> >> >Nothing else.
> >> >> >
> >> >> >The load seems to max out only 1 of the CPUs, but the machine becomes
> >> >> >*very* unresponsive
> >> >> >
> >> >> >Anybody got any pointers of things I can try?
> >> >> >
> >> >> >Thanks
> >> >> >Darrell.
> >> >>
> >> >>
> >>
> >>
> >
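
For anyone wanting to track whether re-replication is making progress between fsck runs, the two summaries quoted in this thread can be compared with a few lines of Python. This is only a minimal sketch: it assumes the summary labels appear exactly as in the output quoted here, and the helper names parse_fsck_summary and under_replicated_delta are hypothetical, not part of Hadoop. It works on saved text and does not invoke hadoop itself.

```python
import re

# Summary labels of interest; spelled as they appear in the fsck output in this thread.
FIELDS = (
    "Under-replicated blocks",
    "Missing replicas",
    "Corrupt blocks",
    "Number of data-nodes",
)


def parse_fsck_summary(text):
    """Return {label: int} for each known summary label found in the text."""
    summary = {}
    for label in FIELDS:
        match = re.search(re.escape(label) + r":\s*(\d+)", text)
        if match:
            summary[label] = int(match.group(1))
    return summary


def under_replicated_delta(earlier, later):
    """Change in under-replicated blocks between two runs (negative means recovering)."""
    key = "Under-replicated blocks"
    return parse_fsck_summary(later)[key] - parse_fsck_summary(earlier)[key]


# Excerpts copied from the two fsck runs quoted in this thread.
RUN_1919 = """
 Under-replicated blocks: 1410 (3.4680374 %)
 Corrupt blocks: 0
 Missing replicas: 2831 (2.3279145 %)
 Number of data-nodes: 5
"""
RUN_2126 = """
 Under-replicated blocks: 353 (0.86001074 %)
 Corrupt blocks: 0
 Missing replicas: 1774 (1.4325514 %)
 Number of data-nodes: 5
"""

if __name__ == "__main__":
    # 353 - 1410 = -1057, i.e. the backlog shrank between the two runs.
    print(under_replicated_delta(RUN_1919, RUN_2126))
```

A negative delta across successive runs suggests the namenode is working through the backlog; a flat or growing count would point back at the two dead datanodes.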