Re: High load on datanode startup

Serge Blazhiyevskyy Wed, 09 May 2012 14:45:35 -0700

I would wait for that number to go down to 0

That could be a reason for your CPU utilization.
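
A minimal sketch of how one might track that number between fsck runs,
assuming the summary format shown in the quoted output below. The function
names parse_fsck_summary and estimate_seconds_to_zero are made up for
illustration and are not part of Hadoop. The straight-line extrapolation is
also only an assumption, since re-replication rarely runs at a constant rate.

```python
import re
from datetime import datetime

# Matches lines such as " Under-replicated blocks:       353 (0.86001074 %)"
UNDER_REPLICATED = re.compile(r"Under-replicated blocks:\s+(\d+)")
# Matches "FSCK ended at Wed May 09 21:26:40 UTC 2012 in 904 milliseconds"
FSCK_ENDED = re.compile(r"FSCK ended at (.+?) in \d+ milliseconds")


def parse_fsck_summary(text):
    """Return (end_time, under_replicated_count) from an fsck summary."""
    count = int(UNDER_REPLICATED.search(text).group(1))
    stamp = FSCK_ENDED.search(text).group(1)
    # "UTC" is matched as literal text; the result is a naive datetime.
    end_time = datetime.strptime(stamp, "%a %b %d %H:%M:%S UTC %Y")
    return end_time, count


def estimate_seconds_to_zero(earlier, later):
    """Linear extrapolation; returns None if the count is not falling."""
    (t0, n0), (t1, n1) = earlier, later
    elapsed = (t1 - t0).total_seconds()
    cleared = n0 - n1
    if elapsed <= 0 or cleared <= 0:
        return None
    return n1 * elapsed / cleared


# The two summaries quoted in this thread, trimmed to the relevant lines.
first = parse_fsck_summary(
    " Under-replicated blocks:       1410 (3.4680374 %)\n"
    "FSCK ended at Wed May 09 19:19:11 UTC 2012 in 980 milliseconds\n"
)
second = parse_fsck_summary(
    " Under-replicated blocks:       353 (0.86001074 %)\n"
    "FSCK ended at Wed May 09 21:26:40 UTC 2012 in 904 milliseconds\n"
)

remaining = estimate_seconds_to_zero(first, second)
if remaining is not None:
    print("estimated minutes until 0:", round(remaining / 60))
```

If the count stops falling between two runs, the function returns None, and
that would be the point to look at the datanode logs rather than keep waiting.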


Regards,
Serge


On 5/9/12 2:27 PM, "Darrell Taylor" <[email protected]> wrote:

>On Wed, May 9, 2012 at 10:00 PM, Serge Blazhiyevskyy <
>[email protected]> wrote:
>
>> Looks like you have some under-replicated blocks. Does that number
>> decrease if you run fsck multiple times?
>>
>
>Yes, since my last post it's now down to 353....
>
>Status: HEALTHY
> Total size:    246983628437 B (Total open files size: 372 B)
> Total dirs:    15172
> Total files:   39637 (Files currently being written: 7)
> Total blocks (validated):      41046 (avg. block size 6017239 B) (Total open file blocks (not validated): 6)
> Minimally replicated blocks:   41046 (100.0 %)
> Over-replicated blocks:        0 (0.0 %)
> Under-replicated blocks:       353 (0.86001074 %)
> Mis-replicated blocks:         0 (0.0 %)
> Default replication factor:    3
> Average block replication:     3.016981
> Corrupt blocks:                0
> Missing replicas:              1774 (1.4325514 %)
> Number of data-nodes:          5
> Number of racks:               1
>FSCK ended at Wed May 09 21:26:40 UTC 2012 in 904 milliseconds
>
>
>
>
>>
>>
>> Regards,
>> Serge
>>
>> On 5/9/12 12:23 PM, "Darrell Taylor" <[email protected]> wrote:
>>
>> >On Wed, May 9, 2012 at 6:04 PM, Serge Blazhiyevskyy <
>> >[email protected]> wrote:
>> >
>> >>
>> >> What does the response from fsck look like?
>> >>
>> >>
>> >[snip lots of stuff about under replicated blocks]
>> >
>> >......Status: HEALTHY
>> > Total size:    246858876262 B (Total open files size: 372 B)
>> > Total dirs:    14914
>> > Total files:   39248 (Files currently being written: 4)
>> > Total blocks (validated):      40657 (avg. block size 6071743 B) (Total open file blocks (not validated): 4)
>> > Minimally replicated blocks:   40657 (100.0 %)
>> > Over-replicated blocks:        0 (0.0 %)
>> > Under-replicated blocks:       1410 (3.4680374 %)
>> > Mis-replicated blocks:         0 (0.0 %)
>> > Default replication factor:    3
>> > Average block replication:     2.9911454
>> > Corrupt blocks:                0
>> > Missing replicas:              2831 (2.3279145 %)
>> > Number of data-nodes:          5
>> > Number of racks:               1
>> >FSCK ended at Wed May 09 19:19:11 UTC 2012 in 980 milliseconds
>> >
>> >
>> >Further information to add to this: it appears to be affecting 2 nodes in
>> >the cluster, one more than the other though.  In the last couple of hours
>> >one of the nodes has also experienced high load; this has now dropped, but
>> >both of these nodes are now considered dead by the namenode.  The load on
>> >the first box is still increasing, currently 234! I think I might have to
>> >reboot it via IPMI.
>> >
>> >
>> >>
>> >> hadoop fsck /
>> >>
>> >>
>> >> It might be the case that some of the blocks are misreplicated
>> >>
>> >>
>> >> Serge
>> >>
>> >> Hadoopway.blogspot.com
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On 5/9/12 9:58 AM, "Darrell Taylor" <[email protected]> wrote:
>> >>
>> >> >On Wed, May 9, 2012 at 5:56 PM, Serge Blazhiyevskyy <
>> >> >[email protected]> wrote:
>> >> >
>> >> >> Take a look at your data distribution for that cluster. Maybe it is
>> >> >> unbalanced.
>> >> >>
>> >> >>
>> >> >> Run balancer, if it is...
>> >> >>
>> >> >
>> >> >The cluster is balanced, I ran balancer yesterday.  Oddly enough the
>> >> >problem started after I had run the balancer.
>> >> >
>> >> >I'm running CDH3 btw.
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >> Regards,
>> >> >> Serge
>> >> >>
>> >> >> hadoopway.blogspot.com
>> >> >>
>> >> >>
>> >> >>
>> >> >> On 5/9/12 9:52 AM, "Darrell Taylor" <[email protected]> wrote:
>> >> >>
>> >> >> >Hi,
>> >> >> >
>> >> >> >I wonder if someone could give me some pointers on a problem I'm having?
>> >> >> >
>> >> >> >I have a 7-machine cluster set up for testing, and we have been pouring data
>> >> >> >into it for a week without issue. We have learnt several things along the way
>> >> >> >and solved all the problems up to now by searching online, but now I'm
>> >> >> >stuck.  One of the data nodes decided to have a load of 70+ this morning;
>> >> >> >stopping the datanode and tasktracker brought it back to normal, but every time
>> >> >> >I start the datanode again the load shoots through the roof, and all I get
>> >> >> >in the logs is:
>> >> >> >
>> >> >> >STARTUP_MSG: Starting DataNode
>> >> >> >
>> >> >> >
>> >> >> >STARTUP_MSG:   host = pl464/10.20.16.64
>> >> >> >
>> >> >> >
>> >> >> >STARTUP_MSG:   args = []
>> >> >> >
>> >> >> >
>> >> >> >STARTUP_MSG:   version = 0.20.2-cdh3u3
>> >> >> >
>> >> >> >
>> >> >> >STARTUP_MSG:   build =
>> >> >> >file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze
>> >> >> >-************************************************************/
>> >> >> >
>> >> >> >
>> >> >> >2012-05-09 16:12:05,925 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
>> >> >> >
>> >> >> >2012-05-09 16:12:06,139 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
>> >> >> >
>> >> >> >Nothing else.
>> >> >> >
>> >> >> >The load seems to max out only 1 of the CPUs, but the machine becomes
>> >> >> >*very* unresponsive.
>> >> >> >
>> >> >> >Has anybody got any pointers on things I can try?
>> >> >> >
>> >> >> >Thanks
>> >> >> >Darrell.
>> >> >>
>> >> >>
>> >>
>> >>
>>
>>
