Thanks for the logs, Slava. I notice that you have not upped the ulimit on your cluster; see the head of your logs, where we print out the ulimit. It's 1024. This could be one cause of your grief, especially since you seemingly have many regions (>1000). Please try upping it.
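For reference, a quick way to check and raise the limit; the file path and the account name "hadoop" are assumptions, so adjust for your distro and for whichever user runs the daemons:

```shell
# Print the current per-process open-file limit; HBase and HDFS inherit
# this from the shell that launches them.
ulimit -n

# Raise it for this session (32768 is just a common choice, not a tuned value):
# ulimit -n 32768

# To make it permanent, add lines like these to /etc/security/limits.conf
# (the "hadoop" user name is an assumption):
#   hadoop  soft  nofile  32768
#   hadoop  hard  nofile  32768
```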
St.Ack



Slava Gorelik wrote:
Hi.
I enabled the DEBUG log level and am now sending all the logs (archived), including the fsck result. Today my program started to fail a couple of minutes in; the problem is very easy to reproduce, and the cluster has become very unstable.

Best Regards.


On Tue, Oct 28, 2008 at 11:05 PM, stack <[EMAIL PROTECTED]> wrote:

    See http://wiki.apache.org/hadoop/Hbase/FAQ#5

    St.Ack


    Slava Gorelik wrote:

        Hi. First of all, I want to say thank you for your assistance!


        DEBUG on Hadoop or HBase? And how can I enable it?
        fsck said that HDFS is healthy.

        Best Regards and Thank You


        On Tue, Oct 28, 2008 at 8:45 PM, stack <[EMAIL PROTECTED]> wrote:

            Slava Gorelik wrote:

                Hi. HDFS capacity is about 800GB (8 datanodes) and
                current usage is about 30GB. This is after a total
                re-format of HDFS done an hour before.

                BTW, the logs I sent start from the first exception I
                found in them.
                Best Regards.


            Please enable DEBUG and retry.  Send me all the logs.
            What does fsck say about HDFS?  There is something
            seriously wrong with your cluster if you are having this
            much trouble getting it running.  Let's try to figure it
            out.

            St.Ack





                On Tue, Oct 28, 2008 at 7:12 PM, stack <[EMAIL PROTECTED]> wrote:



                    I took a quick look, Slava (thanks for sending the
                    files).  Here are a few notes:

                    + The logs are from after the damage is done; the
                    transition from good to bad is missing.  If I
                    could see that, it would help.
                    + What seems plain is that your HDFS is very
                    sick.  See this, from the head of one of the
                    regionserver logs:

                    2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
                     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
                     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
                     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)

                    2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_-5188192041705782716_60000 bad datanode[0]
                    2008-10-27 23:41:12,685 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
                    java.io.IOException: Could not get block locations. Aborting...


                    If HDFS is ailing, HBase is too.  In fact, the
                    regionservers will shut themselves down to protect
                    against damaging or losing data:

                    2008-10-27 23:41:12,688 FATAL org.apache.hadoop.hbase.regionserver.Flusher: Replay of hlog required. Forcing server restart

                    So, what's up with your HDFS?  Not enough space
                    allotted?  What happens if you run "./bin/hadoop
                    fsck /"?  Does that give you a clue as to what
                    happened?  Dig in the datanode and namenode logs.
                    Look for where the exceptions start.  It might
                    give you a clue.

                    + The suse regionserver log had garbage in it.
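The fsck and log-digging steps above can be sketched as follows; the sample log below is fabricated for illustration, and the real log locations (typically under the Hadoop logs directory) will vary by install:

```shell
# Health check of the whole HDFS namespace (run from the Hadoop install dir).
# A healthy filesystem ends its report with
# "The filesystem under path '/' is HEALTHY"; look for CORRUPT or MISSING
# blocks otherwise.  Needs a running cluster, so it is commented out here:
# ./bin/hadoop fsck /

# Dig through datanode/namenode logs for the first exception.  This creates
# a tiny fabricated sample log just to demonstrate the search; point grep at
# your real logs instead.
cat > /tmp/sample-datanode.log <<'EOF'
2008-10-27 23:40:01,000 INFO org.apache.hadoop.dfs.DataNode: Starting DataNode
2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception: java.io.IOException
EOF

# Show the first line mentioning an exception, with its line number:
grep -n -m1 "Exception" /tmp/sample-datanode.log
```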

                    St.Ack


                    Slava Gorelik wrote:



                        Hi.
                        My happiness was very short :-(  After I
                        successfully added 1M rows (50KB each) I tried
                        to add 10M rows.  After 3-4 hours of work it
                        started dying: first one region server died,
                        then another, and eventually the whole cluster
                        was dead.

                        I attached log files (the relevant parts,
                        archived) from the region servers and from the
                        master.

                        Best Regards.



                        On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik <[EMAIL PROTECTED]> wrote:

                         Hi.
                         So far so good: after changing the file
                         descriptor limit,
                         dfs.datanode.socket.write.timeout, and
                         dfs.datanode.max.xcievers, my cluster runs
                         stably.
                         Thank You and Best Regards.

                         P.S. Regarding the missing functionality for
                         deleting multiple columns, I filed a JIRA:
                         https://issues.apache.org/jira/browse/HBASE-961
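For anyone following along, the two HDFS settings mentioned in this thread go in the Hadoop configuration file on each datanode (hadoop-site.xml on Hadoop of this era); the values below are simply the ones tried here, not tuned advice:

```xml
<!-- hadoop-site.xml: disable the datanode socket write timeout and raise
     the transceiver cap; values are the ones from this thread. -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>1023</value>
</property>
```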



                         On Sun, Oct 26, 2008 at 12:58 AM, Michael Stack <[EMAIL PROTECTED]> wrote:

                             Slava Gorelik wrote:

                                Hi. I haven't tried them yet; I'll
                                try tomorrow morning.  In general the
                                cluster works well; the problems begin
                                when I try to add 10M rows.  It
                                happened after 1.2M.

                             Is anything else running besides the
                             regionservers and datanodes that would
                             suck resources?  When datanodes begin to
                             slow, we begin to see the issue
                             Jean-Adrien's configurations address.
                             Are you uploading using MapReduce?  Are
                             TTs running on the same nodes as the
                             datanodes and regionservers?  How are you
                             doing the upload?  Describe what your
                             uploader looks like (sorry if you've
                             already done this).


                                 I already changed the limit on file
                                 descriptors,

                             Good.


                                 I'll try to change these properties:

                                 <property>
                                   <name>dfs.datanode.socket.write.timeout</name>
                                   <value>0</value>
                                 </property>

                                 <property>
                                   <name>dfs.datanode.max.xcievers</name>
                                   <value>1023</value>
                                 </property>


                             Yeah, try it.


                                 And I'll let you know.  Are there
                                 any other prescriptions?  Did I miss
                                 something?

                                 BTW, off topic, but I sent an e-mail
                                 to the list recently and I can't see
                                 it: is it possible to delete multiple
                                 columns by regex, for example
                                 column_name_*?

                             Not that I know of.  If it's not in the
                             API, it should be.  Mind filing a JIRA?

                             Thanks Slava.
                             St.Ack









