Jean-Adrien wrote:
Hello.
I have some questions for the hbase-users (and developers) team. I'll post
them in separate threads since they concern different subjects.

I read the jira issue tracker content, and I have a question about HBASE-728 (http://issues.apache.org/jira/browse/HBASE-728).

I was wondering in which cases data loss is possible, and what the impact is
on row contents.
Data can be lost in the present system if a regionserver does not shut down cleanly. (When HBASE-728 goes in, the likelihood of data loss should greatly diminish.)

The data lost will be the edits applied to regions carried on the downed regionserver since the last time the regionserver's write-ahead log (WAL) was rolled. The WAL is rolled every 30k edits by default.
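
To make the window of loss concrete, here is a rough Java sketch (not HBase's actual HLog code, just an illustration of the idea described above): edits accumulate in the current WAL file, the log is rolled every 30,000 edits, and an unclean regionserver death can lose whatever was appended since the last roll.

--- java sketch (illustration only) ---
import java.util.ArrayList;
import java.util.List;

public class WalRollSketch {
    // Mirrors the "rolled every 30k edits" default mentioned above.
    private static final int ROLL_THRESHOLD = 30000;

    private final List<String> currentLog = new ArrayList<String>();
    private int editsSinceLastRoll = 0;

    public void append(String edit) {
        currentLog.add(edit);
        editsSinceLastRoll++;
        if (editsSinceLastRoll >= ROLL_THRESHOLD) {
            roll();
        }
    }

    private void roll() {
        // In the real system the old log file is closed and becomes
        // recoverable; here we just reset the window that is at risk.
        currentLog.clear();
        editsSinceLastRoll = 0;
    }

    // Edits that would be lost if the process died uncleanly right now.
    public int editsAtRisk() {
        return editsSinceLastRoll;
    }
}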


I run a little cluster (see below for configuration details), and I
discovered that a single column family was lost over about 50 rows, which
could correspond to a MapFile (?). Could this be linked to HBASE-728? Note that
the remaining data of those rows was present.

If the updates against the column family went into the regionserver just before the crash, and before the regionserver had had time to flush them, then the lack of HBASE-728 could explain this loss.

In order to avoid such a case, I'm asking what I might have done that led to
such a failure. I have a couple of ideas; maybe someone can tell me whether
these hypotheses are possible or not:

a) Kill regionserver during shutdown thread

- When I stopped the hbase cluster, I had to wait about 5 minutes for the stop-hbase
script to return. After that, one of my regionservers was still running.
Looking at top, it was working at 99% CPU usage. I waited for a while
(about 15 minutes) and eventually decided to kill the process.

I noticed the following in the log. These are the last lines before I sent SIGINT to the process:


--- region server log ---
2008-10-14 13:07:13,606 INFO org.mortbay.util.Container: Stopped
[EMAIL PROTECTED]
2008-10-14 13:07:13,607 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread:
regionserver/0:0:0:0:0:0:0:0:60020.compactor exiting
2008-10-14 13:07:13,608 INFO org.apache.hadoop.hbase.Leases:
regionserver/0:0:0:0:0:0:0:0:60020.leaseChecker closing leases
2008-10-14 13:07:13,609 INFO org.apache.hadoop.hbase.Leases:
regionserver/0:0:0:0:0:0:0:0:60020.leaseChecker closed leases


This seems to be normal, except that the shutdown thread was not launched. When I sent the INT signal, the following line was logged:


--- region server log ---
2008-10-14 13:23:03,948 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown
thread.


But (big mistake) I did not notice it, and since the process was still
running, I sent the KILL signal. The shutdown thread had no time to finish.

Was it a deadlock (HBASE-500, http://issues.apache.org/jira/browse/HBASE-500)? In that case, does a deadlock in Java use all the CPU? (A kind of while {
trylock } loop in the code?)

Yes, it could have been deadlocked. Next time, send the "-QUIT" signal and you'll get a thread dump in the regionserver's .out file.
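
For what it's worth, a plain busy spin on tryLock() is enough to produce the symptom you saw: the thread never makes progress, yet it pins a full core, which is what 99% CPU in top would show. The following is only an illustration of that pattern, not code taken from HBase; a "-QUIT" thread dump would show the spinning thread as RUNNABLE inside the loop rather than blocked.

--- java sketch (illustration only) ---
import java.util.concurrent.locks.ReentrantLock;

public class SpinOnTryLock {
    public static void main(String[] args) throws InterruptedException {
        final ReentrantLock lock = new ReentrantLock();

        // Holder thread grabs the lock and never releases it.
        Thread holder = new Thread(new Runnable() {
            public void run() {
                lock.lock();
                try {
                    Thread.sleep(Long.MAX_VALUE);
                } catch (InterruptedException ignored) {
                } finally {
                    lock.unlock();
                }
            }
        });
        holder.setDaemon(true);
        holder.start();
        Thread.sleep(100); // let the holder acquire the lock first

        // Busy spin: tryLock() returns false immediately, so this loop
        // burns CPU instead of parking the thread. A classic blocked
        // deadlock, by contrast, would sit at roughly 0% CPU.
        while (!lock.tryLock()) {
            // no sleep or backoff -> roughly 100% of one core
        }
    }
}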


b) hdfs errors

I have often noticed such messages; I guessed that this was usual and cannot lead to
data loss.

--- region server log ---
2008-10-14 12:03:52,267 WARN org.apache.hadoop.dfs.DFSClient: Exception
while reading from blk_-9054609689772898417_200511 of /hbase/table-0.3/1
790941809/bytes/mapfiles/7306020330727690009/data from 192.168.1.15:50010:
java.io.IOException: Premeture EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)


Disks full on your HDFS?

I see the above on occasion on our clusters too. I was going to spend some time on it today; in particular, to see if HADOOP-3831 helps.

1 GB of RAM, as Jon Gray has suggested, is not going to cut it in my experience. With so little memory, I would suggest you run datanodes and regionservers on their own machines.

While I'm giving out advice (smile): be sure to have upped your ulimit so hbase can use more than 1024 file descriptors, and run with DEBUG enabled so you can see more about what hbase is doing (see the FAQ for how to do both).
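
If you want to double-check what limit the regionserver JVM actually ended up with after raising ulimit, the Sun JVM you are running exposes the open and maximum file-descriptor counts through its OperatingSystemMXBean. This is just a quick verification sketch, not the FAQ's procedure (the ulimit itself is still raised in the shell before starting hbase):

--- java sketch (illustration only) ---
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdLimitCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // On a Sun/Oracle JVM on Unix, the bean is a UnixOperatingSystemMXBean
        // and reports file-descriptor counts; elsewhere we cannot tell.
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            System.out.println("open fds: " + unix.getOpenFileDescriptorCount());
            System.out.println("max fds:  " + unix.getMaxFileDescriptorCount());
        } else {
            System.out.println("file descriptor counts not exposed on this JVM");
        }
    }
}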

Thanks J-A,
St.Ack

Thanks for your work and your advice.

-- Jean-Adrien

Cluster setup:
4 regionservers / datanodes
1 is master / namenode as well.
java-6-sun
Total size of hdfs: 81.98 GB (replication factor 3)
fsck -> healthy
hadoop: 0.18.1
hbase: 0.18.0 (jar of hadoop replaced with 0.18.1)
1 GB RAM per node

