[ 
https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511733#comment-16511733
 ] 

Tak Lon (Stephen) Wu edited comment on HBASE-20723 at 6/13/18 10:24 PM:
------------------------------------------------------------------------

[~busbey] I went back to check; here are more details about the test.
 1. It was a cluster with 1 master (HMaster only) and 6 slaves (HDFS + 
RegionServer), running HBase 1.4.2 on EMR, with `hbase.rootdir` set to an S3 
bucket and multiwal using the identity region-grouping strategy.

2. I didn't use ITBLL to verify it; I have my own program that writes and 
verifies the data. Please find 
[SequentialBatchWrite|https://drive.google.com/open?id=1IIYP8ypkjslrwDJekr61AJcNRWQjjBJK]
 and 
[README|https://drive.google.com/open?id=1g7DsGa1cutWLziPCIN1ko71fazHX-QVtlHavxPNYFU0]
 here (sorry, my testing code may be dirty). It's not hard to reproduce.

3. `dfs.replication` was 2 (maybe this is why I saw data loss?), but I never 
turned off any datanode.

4. The client program didn't fail and continued writing until the end, 
although when I killed the assigned region server, the writes hung for a 
while before moving on to the next puts.

5. About the logs on HDFS or HBase: I can retest next week and capture those 
ERRORs or stack traces, but may I ask what you expect to see in the logs?

[[email protected]] The logs have been garbage collected, so I will need to 
rerun it and see if those lines exist.
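
For reference, the setup described in item 1 corresponds roughly to the 
following hbase-site.xml fragment. This is a sketch of my understanding, not 
the exact cluster config; the bucket name is a placeholder, and 
`dfs.replication` would normally live in hdfs-site.xml.

```xml
<!-- Sketch of the test setup; bucket name is a placeholder. -->
<property>
  <name>hbase.rootdir</name>
  <value>s3://my-hbase-bucket/hbase</value>
</property>
<property>
  <name>hbase.wal.provider</name>
  <value>multiwal</value>
</property>
<property>
  <name>hbase.wal.regiongrouping.strategy</name>
  <value>identity</value>
</property>
<!-- Item 3: reduced replication; usually set in hdfs-site.xml. -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```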



> WALSplitter uses the rootDir, which is walDir, as the tableDir root path.
> -------------------------------------------------------------------------
>
>                 Key: HBASE-20723
>                 URL: https://issues.apache.org/jira/browse/HBASE-20723
>             Project: HBase
>          Issue Type: Bug
>          Components: hbase
>    Affects Versions: 1.1.2
>            Reporter: Rohan Pednekar
>            Priority: Major
>         Attachments: logs.zip
>
>
> This is an Azure HDInsight HBase cluster with HDP 2.6 and HBase 
> 1.1.2.2.6.3.2-14.
> By default the underlying data is going to wasb://xxxxx@yyyyy/hbase 
>  I tried to move WAL folders to HDFS, which is the SSD mounted on each VM at 
> /mnt.
> hbase.wal.dir= hdfs://mycluster/walontest
> hbase.wal.dir.perms=700
> hbase.rootdir.perms=700
> hbase.rootdir= wasb://[email protected]/hbase
> Procedure to reproduce this issue:
> 1. create a table in hbase shell
> 2. insert a row in hbase shell
> 3. reboot the VM which hosts that region
> 4. scan the table in hbase shell and it is empty
> Looking at the region server logs:
> {code:java}
> 2018-06-12 22:08:40,455 INFO  [RS_LOG_REPLAY_OPS-wn2-duohba:16020-0-Writer-1] 
> wal.WALSplitter: This region's directory doesn't exist: 
> hdfs://mycluster/walontest/data/default/tb1/b7fd7db5694eb71190955292b3ff7648. 
> It is very likely that it was already split so it's safe to discard those 
> edits.
> {code}
> The log split/replay ignored the actual WAL edits because WALSplitter looks 
> for the region directory under the hbase.wal.dir we specified rather than 
> under the hbase.rootdir.
> Looking at the source code,
> https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALSplitter.java
>  it uses the rootDir, which here is the walDir, as the tableDir root path. 
> So if we use HBASE-17437 and the WAL dir and hbase.rootdir are different 
> paths, or even on different filesystems, then #5, which uses the walDir as 
> the tableDir, is apparently wrong.
> CC: [~zyork], [[email protected]]. Attached the logs for quick review.
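
To illustrate the path mix-up described in the report, here is a minimal, 
hypothetical sketch in plain Java (not the actual WALSplitter code). The 
region directory must be resolved under hbase.rootdir, while the buggy code 
path resolves it under hbase.wal.dir, so the lookup lands in a directory that 
does not exist and the edits are discarded.

```java
public class WalSplitPaths {
    // Hypothetical helper mirroring the on-disk region layout:
    // <base>/data/<namespace>/<table>/<encoded-region-name>
    static String regionDir(String baseDir, String table, String encodedRegion) {
        return baseDir + "/data/default/" + table + "/" + encodedRegion;
    }

    public static void main(String[] args) {
        String walDir  = "hdfs://mycluster/walontest";              // hbase.wal.dir
        String rootDir = "wasb://[email protected]/hbase"; // hbase.rootdir
        String region  = "b7fd7db5694eb71190955292b3ff7648";

        // Buggy: tableDir derived from the WAL dir, matching the
        // "This region's directory doesn't exist" log line above.
        System.out.println("buggy:   " + regionDir(walDir, "tb1", region));
        // Correct: tableDir must be derived from hbase.rootdir.
        System.out.println("correct: " + regionDir(rootDir, "tb1", region));
    }
}
```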



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
