Sorry, my mistake, I did it for the wrong user name. Thanks, updating now; I will try again soon.
On Thu, Oct 30, 2008 at 8:39 PM, Slava Gorelik <[EMAIL PROTECTED]> wrote:

> Hi. Very strange, I see in limits.conf that it's upped.
> I attached the limits.conf, please have a look, maybe I did it wrong.
>
> Best Regards.
>
> On Thu, Oct 30, 2008 at 7:52 PM, stack <[EMAIL PROTECTED]> wrote:
>
>> Thanks for the logs Slava. I notice that you have not upped the ulimit on
>> your cluster. See the head of your logs where we print out the ulimit. It's
>> 1024. This could be one cause of your grief, especially since you seemingly
>> have many regions (>1000). Please try upping it.
>> St.Ack
>>
>> Slava Gorelik wrote:
>>
>>> Hi.
>>> I enabled the DEBUG log level and now I'm sending all logs (archived),
>>> including the fsck run result.
>>> Today my program started to fail a couple of minutes from the start; it's
>>> very easy to reproduce the problem, and the cluster became very unstable.
>>>
>>> Best Regards.
>>>
>>> On Tue, Oct 28, 2008 at 11:05 PM, stack <[EMAIL PROTECTED]> wrote:
>>>
>>> See http://wiki.apache.org/hadoop/Hbase/FAQ#5
>>>
>>> St.Ack
>>>
>>> Slava Gorelik wrote:
>>>
>>> Hi. First of all I want to say thank you for your assistance!
>>>
>>> DEBUG on hadoop or hbase? And how can I enable it?
>>> fsck said that HDFS is healthy.
>>>
>>> Best Regards and Thank You
>>>
>>> On Tue, Oct 28, 2008 at 8:45 PM, stack <[EMAIL PROTECTED]> wrote:
>>>
>>> Slava Gorelik wrote:
>>>
>>> Hi. HDFS capacity is about 800GB (8 datanodes) and the current usage is
>>> about 30GB. This is after a total re-format of the HDFS that was made an
>>> hour before.
>>>
>>> BTW, the logs I sent are from the first exception that I found in them.
>>> Best Regards.
>>>
>>> Please enable DEBUG and retry. Send me all logs. What does the fsck on
>>> HDFS say? There is something seriously wrong with your cluster that you
>>> are having so much trouble getting it running.
>>> Let's try and figure it out.
>>>
>>> St.Ack
>>>
>>> On Tue, Oct 28, 2008 at 7:12 PM, stack <[EMAIL PROTECTED]> wrote:
>>>
>>> I took a quick look Slava (thanks for sending the files). Here's a few
>>> notes:
>>>
>>> + The logs are from after the damage is done; the transition from good
>>> to bad is missing. If I could see that, that would help.
>>> + But what seems to be plain is that your HDFS is very sick. See this
>>> from the head of one of the regionserver logs:
>>>
>>> 2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer
>>> Exception: java.io.IOException: Unable to create new block.
>>>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>>>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>>>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>>>
>>> 2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: Error
>>> Recovery for block blk_-5188192041705782716_60000 bad datanode[0]
>>> 2008-10-27 23:41:12,685 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>> Compaction/Split failed for region
>>> BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
>>> java.io.IOException: Could not get block locations. Aborting...
>>>
>>> If HDFS is ailing, hbase is too. In fact, the regionservers will shut
>>> themselves down to protect themselves against damaging or losing data:
>>>
>>> 2008-10-27 23:41:12,688 FATAL org.apache.hadoop.hbase.regionserver.Flusher:
>>> Replay of hlog required. Forcing server restart
>>>
>>> So, what's up with your HDFS? Not enough space allotted? What happens if
>>> you run "./bin/hadoop fsck /"? Does that give you a clue as to what
>>> happened? Dig in the datanode and namenode logs.
>>> Look for where the exceptions start. It might give you a clue.
>>>
>>> + The suse regionserver log had garbage in it.
>>>
>>> St.Ack
>>>
>>> Slava Gorelik wrote:
>>>
>>> Hi.
>>> My happiness was very short :-( After I successfully added 1M rows (50k
>>> each row) I tried to add 10M rows.
>>> And after 3-4 working hours it started dying. First one region server
>>> died, then another one, and eventually the whole cluster was dead.
>>>
>>> I attached log files (relevant part, archived) from the region servers
>>> and from the master.
>>>
>>> Best Regards.
>>>
>>> On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi.
>>> So far so good; after changing the file descriptors
>>> and dfs.datanode.socket.write.timeout, dfs.datanode.max.xcievers,
>>> my cluster works stably.
>>> Thank You and Best Regards.
>>>
>>> P.S. Regarding the missing delete-multiple-columns functionality, I
>>> filed a JIRA: https://issues.apache.org/jira/browse/HBASE-961
>>>
>>> On Sun, Oct 26, 2008 at 12:58 AM, Michael Stack <[EMAIL PROTECTED]> wrote:
>>>
>>> Slava Gorelik wrote:
>>>
>>> Hi. Haven't tried them yet; I'll try tomorrow morning. In general the
>>> cluster is working well; the problems begin when I try to add 10M rows.
>>> It happened after 1.2M.
>>>
>>> Anything else running beside the regionserver or datanodes that would
>>> suck resources? When datanodes begin to slow, we begin to see the issue
>>> Jean-Adrien's configurations address. Are you uploading using MapReduce?
>>> Are TTs running on the same nodes as the datanode and regionserver?
>>> How are you doing the upload? Describe what your uploader looks like
>>> (sorry if you've already done this).
>>> I already changed the limit of file descriptors.
>>>
>>> Good.
>>>
>>> I'll try to change the properties:
>>>
>>> <property>
>>>   <name>dfs.datanode.socket.write.timeout</name>
>>>   <value>0</value>
>>> </property>
>>>
>>> <property>
>>>   <name>dfs.datanode.max.xcievers</name>
>>>   <value>1023</value>
>>> </property>
>>>
>>> Yeah, try it.
>>>
>>> And I'll let you know. Are there any other prescriptions? Did I miss
>>> something?
>>>
>>> BTW, off topic, but I sent an e-mail recently to the list and I can't
>>> see it: is it possible to delete multiple columns in any way by regex,
>>> for example colum_name_* ?
>>>
>>> Not that I know of. If it's not in the API, it should be.
>>> Mind filing a JIRA?
>>>
>>> Thanks Slava.
>>> St.Ack
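[Editor's note: for readers hitting the same symptoms, the two checks stack asks for in the thread above can be run as follows. This is a sketch; the user name "hadoop" and the limit value 32768 are assumptions, not values from the thread -- pick whatever fits your cluster.]

```shell
# 1. Check the open-file limit the HBase/Hadoop daemons will inherit.
#    The thread's logs showed the default of 1024, which is too low for
#    a cluster with >1000 regions.
ulimit -n

# 2. To raise it, add lines like these to /etc/security/limits.conf for
#    the account that runs the datanode/regionserver (user name "hadoop"
#    and value 32768 are hypothetical examples):
#      hadoop  soft  nofile  32768
#      hadoop  hard  nofile  32768
#    Note the limit applies per user, so editing the wrong user name (as
#    happened at the top of this thread) has no effect.

# 3. Ask HDFS about block health, as stack suggests:
#      ./bin/hadoop fsck /
#    A healthy filesystem ends its report with:
#      "The filesystem under path '/' is HEALTHY"
```

A new login shell (or daemon restart) is needed before a changed limits.conf value shows up in `ulimit -n`.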

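[Editor's note: the two datanode properties quoted in the thread would go into the Hadoop site configuration (hadoop-site.xml in this 0.18-era setup) on every datanode, with a restart afterwards. Values below are the ones used in the thread; note the property name really is spelled "xcievers". A sketch, not a tuning recommendation:]

```xml
<!-- hadoop-site.xml fragment; values as discussed in the thread above -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <!-- 0 disables the write timeout entirely -->
  <value>0</value>
</property>
<property>
  <name>dfs.datanode.max.xcievers</name>
  <!-- raises the per-datanode cap on concurrent block transceiver threads -->
  <value>1023</value>
</property>
```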