Hi. Very strange, I see in limits.conf that it's upped. I attached the limits.conf, please have a look, maybe I did it wrong.
Best Regards.

On Thu, Oct 30, 2008 at 7:52 PM, stack <[EMAIL PROTECTED]> wrote:

> Thanks for the logs Slava. I notice that you have not upped the ulimit on
> your cluster. See the head of your logs where we print out the ulimit. It's
> 1024. This could be one cause of your grief, especially when you seemingly
> have many regions (>1000). Please try upping it.
> St.Ack
>
> Slava Gorelik wrote:
>
>> Hi.
>> I enabled the DEBUG log level and now I'm sending all logs (archived),
>> including the fsck run result.
>> Today my program started to fail a couple of minutes from the start; it's
>> very easy to reproduce the problem, and the cluster became very unstable.
>>
>> Best Regards.
>>
>> On Tue, Oct 28, 2008 at 11:05 PM, stack <[EMAIL PROTECTED]> wrote:
>>
>> See http://wiki.apache.org/hadoop/Hbase/FAQ#5
>>
>> St.Ack
>>
>> Slava Gorelik wrote:
>>
>> Hi. First of all I want to say thank you for your assistance!!!
>>
>> DEBUG on hadoop or hbase? And how can I enable it?
>> fsck said that HDFS is healthy.
>>
>> Best Regards and Thank You
>>
>> On Tue, Oct 28, 2008 at 8:45 PM, stack <[EMAIL PROTECTED]> wrote:
>>
>> Slava Gorelik wrote:
>>
>> Hi. HDFS capacity is about 800GB (8 datanodes) and the current usage is
>> about 30GB. This is after a total re-format of the HDFS that was made an
>> hour before.
>>
>> BTW, the logs I sent are from the first exception that I found in them.
>> Best Regards.
>>
>> Please enable DEBUG and retry. Send me all logs. What does the fsck on
>> HDFS say? There is something seriously wrong with your cluster that you
>> are having so much trouble getting it running. Let's try and figure it
>> out.
>>
>> St.Ack
>>
>> On Tue, Oct 28, 2008 at 7:12 PM, stack <[EMAIL PROTECTED]> wrote:
>>
>> I took a quick look Slava (Thanks for sending the files).
>> Here's a few notes:
>>
>> + The logs are from after the damage is done; the transition from good to
>> bad is missing. If I could see that, that would help.
>> + But what seems to be plain is that your HDFS is very sick. See this
>> from the head of one of the regionserver logs:
>>
>> 2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer
>> Exception: java.io.IOException: Unable to create new block.
>>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>>
>> 2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: Error
>> Recovery for block blk_-5188192041705782716_60000 bad datanode[0]
>> 2008-10-27 23:41:12,685 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>> Compaction/Split failed for region
>> BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
>> java.io.IOException: Could not get block locations. Aborting...
>>
>> If HDFS is ailing, hbase is too. In fact, the regionservers will shut
>> themselves down to protect themselves against damaging or losing data:
>>
>> 2008-10-27 23:41:12,688 FATAL org.apache.hadoop.hbase.regionserver.Flusher:
>> Replay of hlog required. Forcing server restart
>>
>> So, what's up with your HDFS? Not enough space allotted? What happens if
>> you run "./bin/hadoop fsck /"? Does that give you a clue as to what
>> happened? Dig in the datanode and namenode logs. Look for where the
>> exceptions start. It might give you a clue.
>>
>> + The suse regionserver log had garbage in it.
>>
>> St.Ack
>>
>> Slava Gorelik wrote:
>>
>> Hi.
>> My happiness was very short :-( After I successfully added 1M rows (50k
>> each row) I tried to add 10M rows.
>> And after 3-4 working hours it started dying. First one region server
>> died, then another one, and eventually the whole cluster was dead.
>>
>> I attached log files (relevant part, archived) from the region servers
>> and from the master.
>>
>> Best Regards.
>>
>> On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik <[EMAIL PROTECTED]> wrote:
>>
>> Hi.
>> So far so good: after changing the file descriptors and
>> dfs.datanode.socket.write.timeout, dfs.datanode.max.xcievers, my
>> cluster works stable.
>> Thank You and Best Regards.
>>
>> P.S. Regarding the missing delete-multiple-columns functionality, I
>> filed a jira: https://issues.apache.org/jira/browse/HBASE-961
>>
>> On Sun, Oct 26, 2008 at 12:58 AM, Michael Stack <[EMAIL PROTECTED]> wrote:
>>
>> Slava Gorelik wrote:
>>
>> Hi. Haven't tried them yet, I'll try tomorrow morning. In general the
>> cluster is working well; the problems begin if I'm trying to add 10M
>> rows. After 1.2M it happened.
>>
>> Anything else running beside the regionserver or datanodes that would
>> suck resources? When datanodes begin to slow, we begin to see the
>> issue Jean-Adrien's configurations address. Are you uploading using
>> MapReduce? Are TTs running on the same nodes as the datanode and
>> regionserver? How are you doing the upload? Describe what your
>> uploader looks like (Sorry if you've already done this).
>>
>> I already changed the limit of file descriptors.
>>
>> Good.
>>
>> I'll try to change the properties:
>>
>> <property>
>>   <name>dfs.datanode.socket.write.timeout</name>
>>   <value>0</value>
>> </property>
>>
>> <property>
>>   <name>dfs.datanode.max.xcievers</name>
>>   <value>1023</value>
>> </property>
>>
>> Yeah, try it.
>> And let you know. Are there any other prescriptions? Did I miss
>> something?
>>
>> BTW, off topic, but I sent an e-mail recently to the list and I can't
>> see it: is it possible to delete multiple columns in any way by regex,
>> for example colum_name_*?
>>
>> Not that I know of. If it's not in the API, it should be. Mind filing
>> a JIRA?
>>
>> Thanks Slava.
>> St.Ack
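The ulimit advice in the thread comes down to two things: raising `nofile` in /etc/security/limits.conf, and verifying the new limit is actually in effect for the session that launches the daemons (HBase prints the value at the head of its logs, which is how stack spotted the 1024). A minimal check, sketched below; the user name "hadoop" and the value 32768 are assumptions, not from the thread:

```shell
# Show the open-files limit the current shell will pass to anything it
# launches; HBase logs this same number at startup.
ulimit -n

# Sketch of the /etc/security/limits.conf entries typically added for the
# daemon user ("hadoop" and 32768 are assumed values):
#
#   hadoop  soft  nofile  32768
#   hadoop  hard  nofile  32768
#
# limits.conf is applied by PAM at login, so the daemons must be restarted
# from a fresh login session, or the old 1024 limit will still show up in
# the logs.
```

If `ulimit -n` still prints 1024 after editing limits.conf, the change was not picked up, which matches what stack saw in Slava's attached logs.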

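St.Ack's triage steps above (run fsck, then dig through the datanode and namenode logs for where the exceptions start) can be sketched as a small script. The HADOOP_HOME default and the log-file name patterns are assumptions for a tarball install of that era; adjust them for your layout:

```shell
#!/bin/sh
# HDFS triage sketch: health check first, then locate where trouble started.
hdfs_triage() {
  hh=${HADOOP_HOME:-/usr/local/hadoop}  # assumed default install path
  if [ -x "$hh/bin/hadoop" ]; then
    # Overall filesystem health; the summary line says HEALTHY or CORRUPT.
    "$hh/bin/hadoop" fsck / | tail -n 20

    # First exception in each daemon log. The earliest one is usually the
    # real cause; everything after it tends to be fallout.
    for log in "$hh"/logs/*-datanode-*.log "$hh"/logs/*-namenode-*.log; do
      [ -f "$log" ] && grep -n -m 1 "Exception" "$log" /dev/null
    done
  else
    echo "hadoop not found under $hh"
  fi
  return 0
}

hdfs_triage
```

The `/dev/null` second argument to grep forces it to prefix matches with the file name, so the output says which daemon's log the first exception came from.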