Jean-Adrien wrote:
Hello.
I have some questions for the hbase-users (and developers) team. I'll post
them in different threads since they concern different subjects.
I read the JIRA issue tracker content, and I have a question about HBASE-728
( http://issues.apache.org/jira/browse/HBASE-728 ).
I was wondering in which cases data loss is possible, and what the impact
is on row contents.
Data can be lost in the present system if a regionserver does not shut down
cleanly (when hbase-728 goes in, our data-loss likelihood should greatly
diminish).
The data lost will be the edits applied to regions carried on the downed
regionserver since the last time its write-ahead log was rolled. The WAL is
rolled every 30k edits by default.
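If you want to narrow that window, the roll threshold should be tunable. Below
is a minimal Java sketch; I'm assuming the property behind the 30k default is
hbase.regionserver.maxlogentries (check your hbase-default.xml to be sure),
and in practice you would set it in hbase-site.xml on the regionservers rather
than in client code.
--- java sketch ---
import org.apache.hadoop.hbase.HBaseConfiguration;

public class WalRollTuning {
  public static void main(String[] args) {
    HBaseConfiguration conf = new HBaseConfiguration();

    // Assumption: this is the property behind the "rolled every 30k edits"
    // default mentioned above; verify the name against hbase-default.xml.
    int threshold = conf.getInt("hbase.regionserver.maxlogentries", 30000);
    System.out.println("Current WAL roll threshold (edits): " + threshold);

    // Rolling more often shrinks the set of unflushed edits that can be
    // lost if a regionserver dies before hbase-728 goes in.
    conf.setInt("hbase.regionserver.maxlogentries", 10000);
  }
}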
I run a little cluster (see below for configuration details), and I
discovered that a single column family was lost over about 50 rows, which
could correspond to a MapFile (?). Could this be linked to HBASE-728? Note
that the remaining data of those rows was still present.
If the updates against the column family went into the regionserver just
before the crash and before the regionserver had had time to flush them,
then lack of hbase-728 could explain this loss.
In order to avoid such a case in the future, I'm wondering what I did that
led to such a failure. I have a couple of ideas; maybe someone can tell me
whether these hypotheses are possible or not:
a) Killing the regionserver during the shutdown thread
- When I stopped the hbase cluster, I had to wait about 5 minutes for the
stop-hbase script to return. After that, one of my regionservers was still
running. Looking at top, it was using 99% CPU. I waited for a while
(about 15 minutes) and eventually decided to kill the process.
I noticed the following in the log.
Last lines before I killed (SIGINT) the process:
--- region server log ---
2008-10-14 13:07:13,606 INFO org.mortbay.util.Container: Stopped
[EMAIL PROTECTED]
2008-10-14 13:07:13,607 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread:
regionserver/0:0:0:0:0:0:0:0:60020.compactor exiting
2008-10-14 13:07:13,608 INFO org.apache.hadoop.hbase.Leases:
regionserver/0:0:0:0:0:0:0:0:60020.leaseChecker closing leases
2008-10-14 13:07:13,609 INFO org.apache.hadoop.hbase.Leases:
regionserver/0:0:0:0:0:0:0:0:60020.leaseChecker closed leases
This seems normal, except that the shutdown thread was not launched.
When I sent the INT signal, the following line was logged:
--- region server log ---
2008-10-14 13:23:03,948 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown
thread.
But (big mistake) I did not notice it, and since the process was still
running, I sent the KILL signal. The shutdown thread had no time to finish.
Was it a deadlock (HBASE-500, http://issues.apache.org/jira/browse/HBASE-500 )?
In that case, does a deadlock in Java use all the CPU? (Something like a
while { tryLock } loop in the code.)
Yes. Could have been deadlocked. Next time send the "-QUIT" signal and
you'll get a threaddump in the regionserver .out file.
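On the 99% CPU question: a plain deadlock on lock()/synchronized parks the
threads and burns no CPU; what pegs a core is exactly the while { tryLock }
shape you describe. Here is a small self-contained Java illustration (my own
sketch, not HBase code); a SIGQUIT thread dump would show the spinning thread
as RUNNABLE inside the loop.
--- java sketch ---
import java.util.concurrent.locks.ReentrantLock;

public class SpinVsBlock {
  public static void main(String[] args) throws InterruptedException {
    final ReentrantLock lock = new ReentrantLock();
    lock.lock(); // main holds the lock and never releases it

    // Blocking waiter: parks inside lock(), ~0% CPU, shows as WAITING
    // in a thread dump.
    Thread blocker = new Thread(new Runnable() {
      public void run() {
        lock.lock();
      }
    }, "blocking-waiter");

    // Spinning waiter: the "while { tryLock }" pattern; it never blocks,
    // so it burns a full core and shows as RUNNABLE in a thread dump.
    Thread spinner = new Thread(new Runnable() {
      public void run() {
        while (!lock.tryLock()) {
          // busy-wait, nothing but CPU
        }
      }
    }, "spinning-waiter");

    blocker.start();
    spinner.start();
    Thread.sleep(10000); // watch top: "spinning-waiter" pegs one core
    System.exit(0);
  }
}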
b) HDFS errors
I often noticed messages like the one below; I assumed they were usual and
could not lead to data loss.
--- region server log ---
2008-10-14 12:03:52,267 WARN org.apache.hadoop.dfs.DFSClient: Exception
while reading from blk_-9054609689772898417_200511 of /hbase/table-0.3/1
790941809/bytes/mapfiles/7306020330727690009/data from 192.168.1.15:50010:
java.io.IOException: Premeture EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
Disks full on your HDFS?
I see the above on occasion on our clusters too. I was going to spend some
time on it today; in particular, to see if HADOOP-3831 helps.
1GB of RAM, as Jon Gray has suggested, is not going to cut it in my
experience. With so little memory, I would suggest you run datanodes and
regionservers on their own machines.
While I'm giving out advice (smile), be sure to have upped your ulimit so
hbase can use more than 1024 file descriptors, and run with DEBUG enabled so
you can see more about what hbase is doing (see the FAQ for how to do both).
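For the DEBUG part the FAQ route is log4j.properties, but for completeness,
bumping the loggers programmatically with the log4j 1.x API looks roughly
like the sketch below (the two package prefixes are the ones showing up in
your log excerpts above):
--- java sketch ---
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class EnableDebugLogging {
  public static void main(String[] args) {
    // The package prefixes below match the loggers seen in the log excerpts
    // above (org.apache.hadoop.hbase.* and org.apache.hadoop.dfs.*).
    Logger.getLogger("org.apache.hadoop.hbase").setLevel(Level.DEBUG);
    Logger.getLogger("org.apache.hadoop.dfs").setLevel(Level.DEBUG);

    Logger log = Logger.getLogger(EnableDebugLogging.class);
    log.debug("DEBUG enabled for hbase and dfs loggers");
  }
}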
Thanks J-A,
St.Ack
Thanks for your work and your advice.
-- Jean-Adrien
Cluster setup:
4 regionservers / datanodes
1 is master / namenode as well.
java-6-sun
Total size of hdfs: 81.98 GB (replication factor 3)
fsck -> healthy
hadoop: 0.18.1
hbase: 0.18.0 (jar of hadoop replaced with 0.18.1)
1 GB RAM per node