[
https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139986#comment-13139986
]
gaojinchao commented on HBASE-4695:
-----------------------------------
Tested with the latest trunk version; the test passed on a real cluster:
Region Server logs:
2011-10-31 03:32:42,922 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server
C3S31,20020,1320034091400
2011-10-31 03:32:46,974 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server
C3S31,20020,1320034091400; all regions closed.
2011-10-31 03:32:48,633 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
Moved 7 log files to /hbase/.oldlogs
2011-10-31 03:32:49,200 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server
C3S31,20020,1320034091400; zookeeper connection closed.
Namenode logs:
2011-10-31 03:32:46,988 INFO FSNamesystem.audit
(FSNamesystem.java:logAuditEvent(192)) - ugi=root,root,sfcb ip=/158.1.130.31
cmd=listStatus src=/hbase/.logs/C3S31,20020,1320034091400
perm=root:supergroup:rwxr-xr-x
2011-10-31 03:32:46,991 INFO FSNamesystem.audit
(FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb ip=/158.1.130.31
cmd=rename
src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320045179340
dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320045179340
perm=root:supergroup:rw-r--r--
2011-10-31 03:32:46,992 INFO FSNamesystem.audit
(FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb ip=/158.1.130.31
cmd=rename
src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046155808
dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046155808
perm=root:supergroup:rw-r--r--
2011-10-31 03:32:46,994 INFO FSNamesystem.audit
(FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb ip=/158.1.130.31
cmd=rename
src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046186294
dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046186294
perm=root:supergroup:rw-r--r--
2011-10-31 03:32:46,996 INFO FSNamesystem.audit
(FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb ip=/158.1.130.31
cmd=rename
src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046216288
dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046216288
perm=root:supergroup:rw-r--r--
2011-10-31 03:32:46,998 INFO FSNamesystem.audit
(FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb ip=/158.1.130.31
cmd=rename
src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046255166
dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046255166
perm=root:supergroup:rw-r--r--
2011-10-31 03:32:47,206 INFO FSNamesystem.audit
(FSNamesystem.java:logAuditEvent(192)) - ugi=webuser,webgroup ip=/158.1.130.33
cmd=listStatus src=/hbase/.logs/C3S31,20020,1320034091400
perm=root:supergroup:rwxr-xr-x
2011-10-31 03:32:48,518 INFO FSNamesystem.audit
(FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb ip=/158.1.130.31
cmd=rename
src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046295501
dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046295501
perm=root:supergroup:rw-r--r--
2011-10-31 03:32:48,633 INFO FSNamesystem.audit
(FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb ip=/158.1.130.31
cmd=rename
src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046325013
dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046325013
perm=root:supergroup:rw-r--r--
2011-10-31 03:32:48,650 INFO FSNamesystem.audit
(FSNamesystem.java:logAuditEvent(206)) - ugi=root,root,sfcb ip=/158.1.130.31
cmd=delete src=/hbase/.logs/C3S31,20020,1320034091400
2011-10-31 03:32:49,389 INFO FSNamesystem.audit
(FSNamesystem.java:logAuditEvent(206)) - ugi=root,root,sfcb ip=/158.1.130.32
cmd=delete src=/hbase/.META./1028785192/.tmp
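
The ordering in these logs can also be checked mechanically: the delete of the server's .logs directory should come only after the "all regions closed" message and after the last rename out of that directory. Below is a minimal sketch in Python, using a subset of the log lines above. The helper names (unwrap, fields, wal_dir_deleted_safely) are hypothetical and not part of HBase or Hadoop, and the parsing assumes the mail-wrapped layout shown here.

```python
import re
from datetime import datetime

# Timestamp prefix that starts every log record in the excerpts above.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")

def unwrap(text):
    """Join mail-wrapped continuation lines back into one record per entry."""
    records = []
    for line in text.strip().splitlines():
        line = line.strip()
        if TS.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records

def parse_ts(record):
    return datetime.strptime(TS.match(record).group(1), "%Y-%m-%d %H:%M:%S,%f")

def fields(record):
    """Extract key=value tokens (cmd, src, dst, ...) from an audit record."""
    return dict(tok.split("=", 1) for tok in record.split() if "=" in tok)

def wal_dir_deleted_safely(rs_log, audit_log, wal_dir):
    """True if wal_dir is deleted only after all regions closed and after
    every rename that moves a file out of it."""
    closed = [parse_ts(r) for r in unwrap(rs_log) if "all regions closed" in r]
    renames, deletes = [], []
    for rec in unwrap(audit_log):
        f = fields(rec)
        if f.get("cmd") == "rename" and f.get("src", "").startswith(wal_dir + "/"):
            renames.append(parse_ts(rec))
        elif f.get("cmd") == "delete" and f.get("src") == wal_dir:
            deletes.append(parse_ts(rec))
    if not closed or not deletes:
        return False
    first_delete = min(deletes)
    return min(closed) <= first_delete and all(t <= first_delete for t in renames)

WAL_DIR = "/hbase/.logs/C3S31,20020,1320034091400"

# Subset of the region server log above.
RS_LOG = """
2011-10-31 03:32:46,974 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server
C3S31,20020,1320034091400; all regions closed.
"""

# Subset of the namenode audit log above: first rename, last rename, delete.
AUDIT_LOG = """
2011-10-31 03:32:46,991 INFO FSNamesystem.audit
(FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb ip=/158.1.130.31
cmd=rename
src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320045179340
dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320045179340
perm=root:supergroup:rw-r--r--
2011-10-31 03:32:48,633 INFO FSNamesystem.audit
(FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb ip=/158.1.130.31
cmd=rename
src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046325013
dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046325013
perm=root:supergroup:rw-r--r--
2011-10-31 03:32:48,650 INFO FSNamesystem.audit
(FSNamesystem.java:logAuditEvent(206)) - ugi=root,root,sfcb ip=/158.1.130.31
cmd=delete src=/hbase/.logs/C3S31,20020,1320034091400
"""

print(wal_dir_deleted_safely(RS_LOG, AUDIT_LOG, WAL_DIR))
```

With the patch applied, the delete at 03:32:48,650 follows both the "all regions closed" message at 03:32:46,974 and the last rename at 03:32:48,633, which is the ordering the check looks for.
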
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.90.5
>
> Attachments: HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
>
>
> To replicate the problem, do the following:
> 1. Check the /hbase/.logs/XXXX directory to see if you have WAL logs for the
> region server you are shutting down.
> 2. Execute kill <pid> (where pid is the regionserver pid).
> 3. Watch the regionserver log as it starts flushing; it shows how many
> regions are left to flush:
> 09:36:54,665 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting
> on 489 regions to close
> 09:56:35,779 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting
> on 116 regions to close
> 4. Check /hbase/.logs/XXXX -- you will notice that it has disappeared.
> 5. Check namenode logs:
> 09:26:41,607 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
> ugi=root ip=/10.101.1.5 cmd=delete
> src=/hbase/.logs/rdaa5.prod.imageshack.com,60020,1319749
> Note that if you kill -9 the RS now and it crashes during the flush, you won't
> have any WAL logs to replay. We need to make sure that logs are deleted or moved
> out only when the RS has fully flushed. Otherwise it's possible to lose data.
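
The requirement in the description (move or delete WALs only once the region server has fully flushed) can be illustrated with a toy model. This is a sketch under stated assumptions, not the actual patch: the class and method names are hypothetical and are not HBase APIs.

```python
class ToyRegionServer:
    """Toy model only; class and method names are hypothetical, not HBase APIs."""

    def __init__(self, regions, wal_files):
        self.open_regions = set(regions)   # regions that may hold unflushed edits
        self.wal_files = list(wal_files)   # files under /hbase/.logs/<server>
        self.archived = []                 # files moved to /hbase/.oldlogs

    def close_region(self, region):
        # Closing a region flushes its memstore, so its edits no longer
        # depend on the WAL for recovery.
        self.open_regions.discard(region)

    def archive_wals(self):
        # Guard: refuse to move or delete WALs while any region is still open.
        if self.open_regions:
            raise RuntimeError(
                f"{len(self.open_regions)} regions still open; keeping WALs")
        self.archived.extend(self.wal_files)
        self.wal_files.clear()

    def shutdown(self):
        # Close (and thereby flush) every region first, then archive the WALs.
        for region in sorted(self.open_regions):
            self.close_region(region)
        self.archive_wals()


rs = ToyRegionServer(["region-a", "region-b"], ["wal.1", "wal.2"])
try:
    rs.archive_wals()          # too early: regions are still open
except RuntimeError as err:
    print("refused:", err)
rs.shutdown()                  # closes regions, then archives the WALs
print(rs.archived)
```

The point of the guard is that a kill -9 at any moment before the last region closes still leaves the WAL files in place for replay.
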