[ https://issues.apache.org/jira/browse/HBASE-9338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760594#comment-13760594 ]

Elliott Clark edited comment on HBASE-9338 at 9/6/13 10:21 PM:
---------------------------------------------------------------

h1.What's happened

When the new Hadoop release was in RC and we were getting close to a 0.96 
release of our own, I changed all of the IT tests to run for longer. When I did 
this, IntegrationTestBigLinkedList and IntegrationTestLoadAndVerify started 
failing.  At first I thought this was due to Hadoop 2.1.0, but now I'm pretty 
sure that was a red herring, since the tests pass with a shorter run time. So I 
started creating new chaos monkeys with different actions turned on and off.  
Doing that I found that Online Schema Change was what was causing 
IntegrationTestLoadAndVerify to fail, so that was disabled in 0.96.

Then I started working on Big Linked List.  That one has been much harder.

h1.Observations

The calm monkey always passes.  The unbalance monkey almost always passes; when 
it doesn't, the failure is usually the assignment manager getting stuck, and 
there appears to be no data loss in that case.

Lots of the failures are like this one: 
{code}  Map-Reduce Framework
                Map input records=599999660
                Map output records=1199999320
                Map output bytes=38999977900
                Map output materialized bytes=41399980716
                Input split bytes=7360
                Combine input records=0
                Combine output records=0
                Reduce input groups=600000000
                Reduce shuffle bytes=41399980716
                Reduce input records=1199999320
                Reduce output records=680
                Spilled Records=3576554634
                Shuffled Maps =696
                Failed Shuffles=0
                Merged Map outputs=696
                GC time elapsed (ms)=96902
                CPU time spent (ms)=14269000
                Physical memory (bytes) snapshot=86934548480
                Virtual memory (bytes) snapshot=157839667200
                Total committed heap usage (bytes)=86516498432
        HBase Counters
                BYTES_IN_REMOTE_RESULTS=29267422740
                BYTES_IN_RESULTS=35999982060
                MILLIS_BETWEEN_NEXTS=15450135
                NOT_SERVING_REGION_EXCEPTION=27
                NUM_SCANNER_RESTARTS=0
                REGIONS_SCANNED=100
                REMOTE_RPC_CALLS=4878206
                REMOTE_RPC_RETRIES=90
                RPC_CALLS=6000370
                RPC_RETRIES=103
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Verify$Counts
                REFERENCED=599999320
                UNDEFINED=340
                UNREFERENCED=340
{code}

The number of map input records is too low; it should always be a multiple of 
30M. The number of unreferenced rows is equal to the number of undefined rows, 
and if you add that number to the number of map input records then all rows are 
accounted for (599,999,660 + 340 = 600,000,000, which is the expected 20 x 30M). 
That suggests that for this job there are lots of places in the chain where we 
are missing exactly one link.

However that pattern isn't 100%.  There are other runs where the number of 
unreferenced nodes is not equal to the number of undefined rows, so those runs 
have some places in the chain where more than one link is missing.  To me this 
suggests that it's not an off-by-one error, and that the problem is on the 
client side.
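
For context, here is roughly how I think about the Verify step's counters.  This is a 
paraphrase of the per-key decision, not the actual IntegrationTestBigLinkedList code, 
and the names are mine:

{code}
// Sketch of the per-key decision behind the REFERENCED / UNREFERENCED / UNDEFINED
// counters.  Paraphrase only; names are mine, not the real test code.
public class VerifyClassification {
  enum NodeState { REFERENCED, UNREFERENCED, UNDEFINED }

  static NodeState classify(boolean rowExists, int incomingRefs) {
    if (!rowExists) {
      // Some surviving row's meta:prev names this key, but the row itself is gone.
      return NodeState.UNDEFINED;
    }
    if (incomingRefs == 0) {
      // The row exists, but no surviving row points at it.
      return NodeState.UNREFERENCED;
    }
    return NodeState.REFERENCED;
  }

  public static void main(String[] args) {
    // A single missing row B in a chain A <- B <- C shows up twice:
    System.out.println(classify(false, 1)); // B: UNDEFINED, C's meta:prev still names it
    System.out.println(classify(true, 0));  // A: UNREFERENCED, its only ref came from B
  }
}
{code}

A single missing row leaves one UNDEFINED key (the next node's meta:prev still names 
it) and typically one UNREFERENCED row (the node it pointed at loses its only incoming 
reference), which is consistent with UNDEFINED == UNREFERENCED == 340 above.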

h1.Things I've tried

* Tried to see if it's just a badly written test
** The test is hard to follow and could be cleaned up
** But its logic seems solid.
* I thought maybe YARN was the one messing up our data, so I changed the output 
keys to match what IntegrationTestLoadAndVerify was doing
** This didn't seem to have any real impact once Online Schema Change was turned 
off.
* Tried to see if one map task was failing, so I changed the job to log more 
and to use a better client column.
** It's not just one map task that has failures
** The data that's missing was always written by the last generation job.
* Then I tried to see if it's just the beginning or the end of the linked list 
that is missing data
** It's not.  There is missing data everywhere in the chain.
* Running the Verify step with no monkey
** This failed, suggesting that we are really losing data
* Then I thought it might be the same bug that was causing replication to lose 
edits, so I re-ran the tests before and after JD's HLog reader change.
** Tests still fail with JD's patch
* Then I tried running the test with a monkey that kills RS and Master
** This passed
** I even ran this with more loop iterations and it still passed
* I thought maybe it was region merging, so I turned off merging regions
** The tests still failed.
* So I ran a job where, inside the generation job, I run a get right after the 
puts (a sketch of what this looked like is after this list).
** This was SLOW.
** Like really really really slow.  10x slower.  What has happened to our get 
speed?
** It failed.  To me this suggested that this was related to the client.  My 
thought was that the client was silently failing a put.
* Then I thought that it was possible that the gets were just arriving at the 
region servers after the data loss had happened.
* So I moved the get *directly* after the puts.
** This required flushing a lot more, so it was slow.
** The first two iterations passed, but it would have taken 3 days to finish all 
five iterations, so I stopped the run.
** Everything was passing though.
* Next I thought it could be a JVM issue.  1.7.15 had been a little flaky on 
me.
* So I upgraded to 1.7.25.
** This had no effect.
* Next run I started looking at where the data was lost.
** {color:red}All the data I lost in my last run was in the same region.{color}
** All of it was around the same time.
** Different mappers.
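
For reference, the "get right after the puts" check mentioned above looked roughly 
like this.  It's a minimal sketch against the 0.96-era client API; the table handling 
and the surrounding generator loop are simplified, so this is not the actual test code:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class PutThenGetCheck {
  private static final byte[] FAMILY = Bytes.toBytes("meta");
  private static final byte[] PREV = Bytes.toBytes("prev");

  // Write one node of the chain, then immediately read it back.
  static void writeAndVerify(HTable table, byte[] row, byte[] prev) throws Exception {
    Put put = new Put(row);
    put.add(FAMILY, PREV, prev);   // 0.96-era Put.add(); later versions use addColumn()
    table.put(put);
    table.flushCommits();          // the "get directly after the put" variant needs the
                                   // client write buffer pushed out before the read

    Result result = table.get(new Get(row));
    if (result.isEmpty()) {
      throw new IllegalStateException("Row not visible right after put: "
          + Bytes.toStringBinary(row));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "IntegrationTestBigLinkedList");
    try {
      writeAndVerify(table, Bytes.toBytes("example-row"), Bytes.toBytes("example-prev"));
    } finally {
      table.close();
    }
  }
}
{code}

Flushing and reading back after every put is what made that variant of the run so slow.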


h1.Sample Dangling Pointers
{code}
=> #<#<Class:0x2123dde6>:0x5f1840c3>
hbase(main):002:0> get 'IntegrationTestBigLinkedList', 
"Y7\xEC\xF6w\xDB\x92\x88L\x96\x90\xA0\x17\xCE\xB9\x92"
COLUMN                                              CELL                        
                                                                                
                                          
 meta:client                                        timestamp=1378465754975, 
value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000007_0   
                                             
 meta:count                                         timestamp=1378465754975, 
value=\x00\x00\x00\x00\x01:{\xE9                                                
                                             
 meta:prev                                          timestamp=1378465754975, 
value=\xAE\xE1\x9C\xDAY~",\x08\xE2L\xB8\x06\x13\x81H                            
                                             
3 row(s) in 0.0920 seconds

hbase(main):003:0> get 'IntegrationTestBigLinkedList', 
"\xB0\x85\x1D\xF4\x15%\xA2\xB2D>\x94\xE5\x9D\xE3\x9B\xA4"
COLUMN                                              CELL                        
                                                                                
                                          
 meta:client                                        timestamp=1378465679500, 
value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000001_0   
                                             
 meta:count                                         timestamp=1378465679500, 
value=\x00\x00\x00\x00\x01_\x97z                                                
                                             
 meta:prev                                          timestamp=1378465679500, 
value=\xAE\xDE\x05\xFDVT\xA2p\x08L\x15\x93+\xEB\xBD\x19                         
                                             
3 row(s) in 0.0940 seconds

hbase(main):004:0> get 'IntegrationTestBigLinkedList', 
"s\xC3Z\x8B\xAD*\xEF\xF3\x1C\xDAa\x5C#\xE29\x94"
COLUMN                                              CELL                        
                                                                                
                                          
 meta:client                                        timestamp=1378465746735, 
value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000002_0   
                                             
 meta:count                                         timestamp=1378465746735, 
value=\x00\x00\x00\x00\x01Z:\xEB                                                
                                             
 meta:prev                                          timestamp=1378465746735, 
value=\xAE\xD5GcLOJ\xBD\x1B\xC2\xEE\xDD\x02\xBCi\x0A                            
                                             
3 row(s) in 0.1850 seconds

hbase(main):005:0> get 'IntegrationTestBigLinkedList', 
"T\x91$\xE0\xB3~\xB6\x88\x03\x9Df\x99\xE1\xA7qX"
COLUMN                                              CELL                        
                                                                                
                                          
 meta:client                                        timestamp=1378465679216, 
value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000000_0   
                                             
 meta:count                                         timestamp=1378465679216, 
value=\x00\x00\x00\x00\x01P\xC1!                                                
                                             
 meta:prev                                          timestamp=1378465679216, 
value=\xAE\xCE\xA8\xCDX\x8A\xC7H(t\xBAx\xEF|\x0E\xCD                            
                                             
3 row(s) in 0.1400 seconds

hbase(main):006:0> get 'IntegrationTestBigLinkedList', 
"\xB1\xC3+h\xEE\xE0{\x03p\x0C\x83\x87L\x96\xA1\xBB"
COLUMN                                              CELL                        
                                                                                
                                          
 meta:client                                        timestamp=1378465684324, 
value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000003_0   
                                             
 meta:count                                         timestamp=1378465684324, 
value=\x00\x00\x00\x00\x01b\x12\xCA                                             
                                             
 meta:prev                                          timestamp=1378465684324, 
value=\xAE\xC5\x17H[\xD2o\x9C\xF5\xE0\x8F\xB87\x81\xA7C                         
                                             
3 row(s) in 0.0730 seconds

hbase(main):007:0> get 'IntegrationTestBigLinkedList', 
"|@\x1AL\x17\x9E\xF2s\xD8\xD3\x07%\x11l\xDA\x83"
COLUMN                                              CELL                        
                                                                                
                                          
 meta:client                                        timestamp=1378465730726, 
value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000000_0   
                                             
 meta:count                                         timestamp=1378465730726, 
value=\x00\x00\x00\x00\x01R\x99\x0D                                             
                                             
 meta:prev                                          timestamp=1378465730726, 
value=\xAD\xE10\x10t\xA9\x0D+]\xFC\xA4\xD3\x9A`\x06\xD6                         
                                             
3 row(s) in 0.1530 seconds

hbase(main):008:0> get 'IntegrationTestBigLinkedList', "|0\xFF8/\x86p 
^V\xA4\xD0@\x0A\x0E\xC1"
COLUMN                                              CELL                        
                                                                                
                                          
 meta:client                                        timestamp=1378465752879, 
value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000002_0   
                                             
 meta:count                                         timestamp=1378465752879, 
value=\x00\x00\x00\x00\x01Z\xC8%                                                
                                             
 meta:prev                                          timestamp=1378465752879, 
value=\xAE\x94)\x93+\xDEa{_\x81\x9Ft,\x9B^\xB6                                  
                                             
3 row(s) in 0.0780 seconds
{code}

All of the missing rows (the meta:prev targets above) start with "\xAE" or "\xAD".
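
A quick way to double check that these all land in the same region is to ask the 
client where each dangling prev key lives.  A hedged sketch using the 0.96-era 
HTable.getRegionLocation() call (this is not part of the test, just an illustration):

{code}
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class WhereAreTheMissingRows {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "IntegrationTestBigLinkedList");
    // Two of the dangling meta:prev keys from the shell session above.
    byte[][] missing = {
        Bytes.toBytesBinary("\\xAE\\xE1\\x9C\\xDAY~\",\\x08\\xE2L\\xB8\\x06\\x13\\x81H"),
        Bytes.toBytesBinary("\\xAD\\xE10\\x10t\\xA9\\x0D+]\\xFC\\xA4\\xD3\\x9A`\\x06\\xD6"),
    };
    for (byte[] row : missing) {
      HRegionLocation loc = table.getRegionLocation(row);
      System.out.println(Bytes.toStringBinary(row) + " -> "
          + loc.getRegionInfo().getRegionNameAsString() + " on " + loc.getHostname());
    }
    table.close();
  }
}
{code}

If all of the \xAD/\xAE keys map to the same region name, that matches the 
observation above that all the lost data was in one region.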

{code}
hbase(main):002:0> java.util.Date.new(1378465752879).toString
=> "Fri Sep 06 04:09:12 PDT 2013"
hbase(main):003:0> java.util.Date.new(1378465730726).toString
=> "Fri Sep 06 04:08:50 PDT 2013"
hbase(main):004:0> java.util.Date.new(1378465684324).toString
=> "Fri Sep 06 04:08:04 PDT 2013"
hbase(main):005:0> java.util.Date.new(1378465679216).toString
=> "Fri Sep 06 04:07:59 PDT 2013"
hbase(main):006:0> java.util.Date.new(1378465746735).toString
=> "Fri Sep 06 04:09:06 PDT 2013"
hbase(main):007:0> java.util.Date.new(1378465679500).toString
=> "Fri Sep 06 04:07:59 PDT 2013"
hbase(main):008:0> java.util.Date.new(1378465754975).toString
=> "Fri Sep 06 04:09:14 PDT 2013"
{code}

The dangling pointers were all written within 3 minutes of each other (04:07:59 to 04:09:14 PDT).

h2.Side Note

JDK7 seems to core dump a lot when starting up region servers.  Once things are 
all set up it's stable.  I need to turn on core dumps into a known folder so 
that Jenkins can archive them.
                
> Test Big Linked List fails on Hadoop 2.1.0
> ------------------------------------------
>
>                 Key: HBASE-9338
>                 URL: https://issues.apache.org/jira/browse/HBASE-9338
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.96.0
>            Reporter: Elliott Clark
>            Assignee: Elliott Clark
>            Priority: Blocker
>             Fix For: 0.96.0
>
>         Attachments: HBASE-HBASE-9338-TESTING.patch
>
>

