Slava Gorelik wrote:
Hi. I also noticed this exception.
Strange that this exception happens every time on the same regionserver.
I tried to find the directory hdfs://X:9000/hbase/BizDB/735893330 - it does not exist.
Very strange, but the history folder in hadoop is empty.
It is odd indeed that the system keeps trying to load a region that does not exist.

I don't think it's necessarily the same regionserver that is responsible. I'd think it's an attribute of the region that we're trying to deploy on that server.

Silly question: you did replace 'X' with your machine name in the above?

If you restart, it still tries to load this nonexistent region?

If so, the .META. table is not consistent with what's on the filesystem. They've gotten out of sync. Describing how to repair is involved.

Will reformatting HDFS help?

Do a "scan '.META.'" in the shell. Do you see your region listed (look at the encoded names attribute to find 735893330.

If your table is damaged -- I'd guess it's because ulimit was bad up to this point -- the best thing might be to start over.

One more thing at the last minute: I found that one node in the cluster has
a totally different time. Could this cause such problems?
We thought we'd fixed all problems that could arise from time skew, but you never know. In our requirements, clocks must be synced. Fix this too if you can before reloading.
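A minimal sketch of a one-shot sync, assuming ntpdate is installed on the nodes (running ntpd for continuous sync is the better long-term fix):

# as root, on each node
% ntpdate -u pool.ntp.org
% date    # spot-check that the nodes now agree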

P.S. About the logs, is it possible to send them to some email? Each log
file compressed is about 1MB, and I found exceptions in only 3 files.

There probably is such functionality but I'm not familiar with it. Can you put them under a webserver at your place so I can grab them? You can send me the URL offlist if you like.

Thanks for your patience Slava.  We'll figure it out.
St.Ack


On Thu, Oct 30, 2008 at 10:25 PM, stack <[EMAIL PROTECTED]> wrote:

Can you put them someplace that I can pull them?

I took another look at your logs.  I see that a region is missing files.
 That means it will never open and just keep trying.  Grep your logs for
FileNotFound.  You'll see this:

hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException:
File does not exist:
hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906/data
hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException:
File does not exist:
hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637/data

Try shutting down and removing these files.  Remove the following
directories:


hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906
hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/647541142630058906

hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637
hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/2243545870343537637
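With the cluster shut down, the removals would look something like the below from the hadoop install directory (a sketch; -rmr is the recursive delete in hadoop shells of this vintage):

% ./bin/hadoop fs -rmr hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906
% ./bin/hadoop fs -rmr hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/647541142630058906
% ./bin/hadoop fs -rmr hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637
% ./bin/hadoop fs -rmr hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/2243545870343537637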

Then retry restarting.

You can try to figure out how these files got lost by going back through your
history.


St.Ack



Slava Gorelik wrote:

Michael, I still have the problem, but the log files are very big (50MB
each); even compressed they are bigger than the limit for this mailing list.
Most of the problems happened during compaction (I see it in the log);
maybe I can send some parts of the logs?

Best Regards.

On Thu, Oct 30, 2008 at 8:49 PM, Slava Gorelik <[EMAIL PROTECTED]> wrote:

Sorry, my mistake, I did it for the wrong user name. Thanks, updating now;
I will try again soon.


On Thu, Oct 30, 2008 at 8:39 PM, Slava Gorelik <[EMAIL PROTECTED]> wrote:

Hi. Very strange, I see in limits.conf that it's upped.
I attached the limits.conf; please have a look, maybe I did it wrong.

Best Regards.


On Thu, Oct 30, 2008 at 7:52 PM, stack <[EMAIL PROTECTED]> wrote:



Thanks for the logs Slava.  I notice that you have not upped the ulimit
on your cluster.  See the head of your logs where we print out the ulimit.
It's 1024.  This could be one cause of your grief, especially when you
seemingly have many regions (>1000).  Please try upping it.
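A minimal sketch of upping it, assuming the daemons run as a user named "hadoop" (adjust the user name; PAM details vary by distro):

# /etc/security/limits.conf
hadoop  soft  nofile  32768
hadoop  hard  nofile  32768

Log the user out and back in, then verify with "ulimit -n".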
St.Ack




Slava Gorelik wrote:



Hi.
I enabled DEBUG log level and now I'm sending all logs (archived),
including the fsck run result.
Today my program started to fail a couple of minutes in; it's very
easy to reproduce the problem, and the cluster became very unstable.

Best Regards.


On Tue, Oct 28, 2008 at 11:05 PM, stack <[EMAIL PROTECTED]> wrote:

  See http://wiki.apache.org/hadoop/Hbase/FAQ#5
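   For reference, enabling DEBUG generally comes down to a log4j
   setting; a sketch for conf/log4j.properties on the hbase side
   (the hadoop side is analogous):

     log4j.logger.org.apache.hadoop.hbase=DEBUG

   Restart the daemons to pick it up.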

  St.Ack


  Slava Gorelik wrote:

      Hi. First of all I want to say thank you for your assistance!


      DEBUG on hadoop or hbase? And how can I enable it?
      fsck said that HDFS is healthy.

      Best Regards and Thank You


      On Tue, Oct 28, 2008 at 8:45 PM, stack <[EMAIL PROTECTED]> wrote:


          Slava Gorelik wrote:


              Hi. HDFS capacity is about 800GB (8 datanodes) and the
              current usage is about 30GB. This is after a total
              re-format of the HDFS that was made an hour before.

              BTW, the logs I sent are from the first exception that
              I found in them.
              Best Regards.



          Please enable DEBUG and retry.  Send me all logs.  What
          does the fsck on HDFS say?  There is something seriously
          wrong with your cluster if you are having so much trouble
          getting it running.  Let's try and figure it out.

          St.Ack






              On Tue, Oct 28, 2008 at 7:12 PM, stack <[EMAIL PROTECTED]> wrote:




                  I took a quick look Slava (Thanks for sending the
                  files).  Here's a few notes:

                  + The logs are from after the damage is done; the
                  transition from good to bad is missing.  If I could
                  see that, it would help.
                  + But what seems to be plain is that your HDFS is
                  very sick.  See this from the head of one of the
                  regionserver logs:

                  2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
                          at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
                          at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
                          at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)

                  2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_-5188192041705782716_60000 bad datanode[0]
                  2008-10-27 23:41:12,685 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
                  java.io.IOException: Could not get block locations. Aborting...


                  If HDFS is ailing, hbase is too.  In fact, the
                  regionservers will shut themselves down to protect
                  themselves against damaging or losing data:

                  2008-10-27 23:41:12,688 FATAL org.apache.hadoop.hbase.regionserver.Flusher: Replay of hlog required. Forcing server restart

                  So, what's up with your HDFS?  Not enough space
                  allotted?  What happens if you run "./bin/hadoop
                  fsck /"?  Does that give you a clue as to what
                  happened?  Dig in the datanode and namenode logs.
                  Look for where the exceptions start.  It might give
                  you a clue.
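                  On the fsck front, a more verbose run can help
                  localize the damage; a sketch using long-standing
                  hadoop fsck flags:

                    % ./bin/hadoop fsck / -files -blocks -locations

                  That lists each file with its blocks and the
                  datanodes holding them.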

                  + The suse regionserver log had garbage in it.

                  St.Ack


                  Slava Gorelik wrote:




                      Hi.
                      My happiness was very short :-( After I
                      successfully added 1M rows (50KB each row) I
                      tried to add 10M rows. And after 3-4 working
                      hours it started dying. First one region server
                      died, then another one, and eventually the
                      whole cluster was dead.

                      I attached log files (relevant part, archived)
                      from region servers and
                      from the master.

                      Best Regards.



                      On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik <[EMAIL PROTECTED]> wrote:

                       Hi.
                       So far so good: after changing the file
                       descriptors, dfs.datanode.socket.write.timeout,
                       and dfs.datanode.max.xcievers, my cluster
                       works stably.
                       Thank You and Best Regards.

                       P.S. Regarding the missing delete-multiple-columns
                       functionality, I filed a JIRA:
                       https://issues.apache.org/jira/browse/HBASE-961



                       On Sun, Oct 26, 2008 at 12:58 AM, Michael Stack <[EMAIL PROTECTED]> wrote:

                           Slava Gorelik wrote:

                                Hi. Haven't tried them yet; I'll try
                                tomorrow morning. In general the
                                cluster is working well; the problems
                                begin when I'm trying to add 10M
                                rows. After 1.2M it happened.

                            Anything else running besides the
                            regionserver or datanodes that would suck
                            resources?  When datanodes begin to slow,
                            we begin to see the issue Jean-Adrien's
                            configurations address.  Are you uploading
                            using MapReduce?  Are TTs running on the
                            same nodes as the datanode and
                            regionserver?  How are you doing the
                            upload?  Describe what your uploader looks
                            like (sorry if you've already done this).


                                 I already changed the limit of file
                                 descriptors,

                           Good.


                                 I'll try to change the properties:

                                 <property>
                                   <name>dfs.datanode.socket.write.timeout</name>
                                   <value>0</value>
                                 </property>

                                 <property>
                                   <name>dfs.datanode.max.xcievers</name>
                                   <value>1023</value>
                                 </property>


                           Yeah, try it.
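                            Those go in conf/hadoop-site.xml on each
                            node, if I recall the layout right, with
                            an HDFS restart to pick them up.  The 0
                            disables the datanode socket write
                            timeout; max.xcievers caps the datanode's
                            concurrent block-serving threads (worth
                            double-checking against your version's
                            defaults).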


                                And I'll let you know. Are there any
                                other prescriptions? Did I miss
                                something?

                                BTW, off topic, but I sent an e-mail
                                to the list recently and I can't see
                                it: is it possible to delete multiple
                                columns in any way by regex, for
                                example column_name_*?

                            Not that I know of.  If it's not in the
                            API, it should be.  Mind filing a JIRA?

                           Thanks Slava.
                           St.Ack
