[
https://issues.apache.org/jira/browse/HBASE-9338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760594#comment-13760594
]
Elliott Clark edited comment on HBASE-9338 at 9/6/13 10:21 PM:
---------------------------------------------------------------
h1.What's happened
When the new Hadoop release was in RC and we were getting close to a 0.96
release of our own, I changed all of the IT tests to run for longer. When I did
this, IntegrationTestBigLinkedList and IntegrationTestLoadAndVerify started
failing. At first I thought this was due to Hadoop 2.1.0, but now I'm pretty
sure that was a red herring, since the tests pass with a shorter run time. So I
started creating new chaos monkeys with different actions turned on and off.
Doing that I found that Online Schema Change was what was causing Load And
Verify to fail, so that was disabled in 0.96.
Then I started working on Big Linked List. That one has been much harder.
h1.Observations
The calm monkey always passes. The unbalance monkey almost always passes; when
it doesn't, the failure is usually the assignment manager getting stuck, and
there appears to be no data loss in that case.
Lots of the failures look like this one:
{code}
Map-Reduce Framework
Map input records=599999660
Map output records=1199999320
Map output bytes=38999977900
Map output materialized bytes=41399980716
Input split bytes=7360
Combine input records=0
Combine output records=0
Reduce input groups=600000000
Reduce shuffle bytes=41399980716
Reduce input records=1199999320
Reduce output records=680
Spilled Records=3576554634
Shuffled Maps =696
Failed Shuffles=0
Merged Map outputs=696
GC time elapsed (ms)=96902
CPU time spent (ms)=14269000
Physical memory (bytes) snapshot=86934548480
Virtual memory (bytes) snapshot=157839667200
Total committed heap usage (bytes)=86516498432
HBase Counters
BYTES_IN_REMOTE_RESULTS=29267422740
BYTES_IN_RESULTS=35999982060
MILLIS_BETWEEN_NEXTS=15450135
NOT_SERVING_REGION_EXCEPTION=27
NUM_SCANNER_RESTARTS=0
REGIONS_SCANNED=100
REMOTE_RPC_CALLS=4878206
REMOTE_RPC_RETRIES=90
RPC_CALLS=6000370
RPC_RETRIES=103
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Verify$Counts
REFERENCED=599999320
UNDEFINED=340
UNREFERENCED=340
{code}
The number of map input records is too low; it should always be a multiple of
30M. The number of unreferenced rows is equal to the number of undefined rows,
and if you add that number to the number of map input records then all rows
are accounted for (599,999,660 + 340 = 600,000,000). That suggests that for
this job there are lots of places where we are missing exactly one link in the
chain.
However, that pattern isn't 100% consistent. There are other runs where the
number of unreferenced nodes is not equal to the number of undefined rows, so
those runs have some places in the chain where more than one link is missing.
To me this suggests that it's not an off-by-one error and that the problem is
on the client side.
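As a quick sanity check on that accounting, here is a back-of-the-envelope
snippet (illustrative only; the 600M total is my reading of this run's
counters, i.e. 20 generator waves of 30M rows):
{code}
// Rough check of the Verify counters above. The 600M expected total is an
// assumption based on "map input records should be a multiple of 30M".
public class CounterCheck {
  public static void main(String[] args) {
    long mapInput     = 599_999_660L;      // Map input records
    long undefined    = 340L;              // UNDEFINED
    long unreferenced = 340L;              // UNREFERENCED
    long referenced   = 599_999_320L;      // REFERENCED
    long expected     = 20L * 30_000_000L; // assumed number of generated rows

    // Every generated row is either read by the verify mappers or undefined.
    System.out.println(mapInput + undefined == expected);      // true
    // Every row that was read is either referenced or unreferenced.
    System.out.println(referenced + unreferenced == mapInput); // true
  }
}
{code}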
h1.Things I've tried
* Tried to see if it's just a badly written test
** The test is hard to follow and could be cleaned up
** But its logic seems solid.
* I thought maybe YARN was the one messing up our data, so I changed the output
keys to match what Load And Verify was doing.
** This didn't seem to have any real impact once Online Schema Change was
turned off.
* Tried to see if one map task was failing, so I changed the job to log more
and to use a better client column.
** It's not just one map task that has failures.
** The missing data was always written by the last generation job.
* Then I tried to see if it's just the beginning or the end of the linked list
that is missing data.
** It's not. There is missing data everywhere in the chain.
* Ran the Verify step with no monkey.
** This failed, suggesting that we really are losing data.
* Then I thought it might be the same bug that was causing replication to lose
edits, so I re-ran the tests before and after JD's HLog reader change.
** Tests fail with JD's patch.
* Then I tried running the test with a monkey that kills RS and Master.
** This passed.
** I even ran this with more loop iterations and it still passed.
* I thought maybe it was region merges, so I turned off merging regions.
** The tests still failed.
* So I ran a job where, inside the generation job, I run a get right after the
puts.
** This was SLOW.
** Like really, really slow. 10x slower. What has happened to our get speed?
** It failed. To me this suggested that this was related to the client. My
thought was that the client was silently failing a put.
* Then I thought it was possible that the gets were just arriving at the
region servers after the data loss had happened.
* So I moved the get *directly* after the puts (see the sketch after this
list).
** This required flushing a lot more, so it was slow.
** The first two iterations passed, but it would have taken 3 days to finish
all five iterations, so I stopped the run.
** Everything was passing though.
* Next I thought it could be a JVM issue. 1.7.15 had been a little flaky on
me.
* So I upgraded to 1.7.25.
** This had no effect.
* On the next run I started looking at where the data was lost.
** {color:red}All the data I lost in my last run was in the same region.{color}
** All of it was written around the same time.
** It came from different mappers.
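For reference, this is roughly the shape of the put-then-get check I wired
into the generator. It's a minimal stand-alone sketch against the stock 0.96
client API, not the actual test code; the table and column names are the ones
visible in the shell output below.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class PutThenGetCheck {
  private static final byte[] META = Bytes.toBytes("meta");
  private static final byte[] PREV = Bytes.toBytes("prev");

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "IntegrationTestBigLinkedList");
    try {
      table.setAutoFlush(false);                         // the generator buffers its writes

      byte[] row  = Bytes.toBytes("example-node");       // hypothetical key we just generated
      byte[] prev = Bytes.toBytes("example-prev-node");  // hypothetical previous node in the chain

      // Write one node of the chain, then force it out of the client buffer.
      Put put = new Put(row);
      put.add(META, PREV, prev);
      table.put(put);
      table.flushCommits();   // the extra flushing is what made this run slow

      // Read the row straight back; an empty result here means the client
      // silently dropped the put.
      Result result = table.get(new Get(row));
      if (result.isEmpty()) {
        System.err.println("Lost row: " + Bytes.toStringBinary(row));
      }
    } finally {
      table.close();
    }
  }
}
{code}
In the real run the gets were batched per flush; the only point here is where
the read-back sits relative to flushCommits().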
h1.Sample Dangling Pointers
{code}
hbase(main):002:0> get 'IntegrationTestBigLinkedList',
"Y7\xEC\xF6w\xDB\x92\x88L\x96\x90\xA0\x17\xCE\xB9\x92"
COLUMN CELL
meta:client timestamp=1378465754975,
value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000007_0
meta:count timestamp=1378465754975,
value=\x00\x00\x00\x00\x01:{\xE9
meta:prev timestamp=1378465754975,
value=\xAE\xE1\x9C\xDAY~",\x08\xE2L\xB8\x06\x13\x81H
3 row(s) in 0.0920 seconds
hbase(main):003:0> get 'IntegrationTestBigLinkedList',
"\xB0\x85\x1D\xF4\x15%\xA2\xB2D>\x94\xE5\x9D\xE3\x9B\xA4"
COLUMN CELL
meta:client timestamp=1378465679500,
value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000001_0
meta:count timestamp=1378465679500,
value=\x00\x00\x00\x00\x01_\x97z
meta:prev timestamp=1378465679500,
value=\xAE\xDE\x05\xFDVT\xA2p\x08L\x15\x93+\xEB\xBD\x19
3 row(s) in 0.0940 seconds
hbase(main):004:0> get 'IntegrationTestBigLinkedList',
"s\xC3Z\x8B\xAD*\xEF\xF3\x1C\xDAa\x5C#\xE29\x94"
COLUMN CELL
meta:client timestamp=1378465746735,
value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000002_0
meta:count timestamp=1378465746735,
value=\x00\x00\x00\x00\x01Z:\xEB
meta:prev timestamp=1378465746735,
value=\xAE\xD5GcLOJ\xBD\x1B\xC2\xEE\xDD\x02\xBCi\x0A
3 row(s) in 0.1850 seconds
hbase(main):005:0> get 'IntegrationTestBigLinkedList',
"T\x91$\xE0\xB3~\xB6\x88\x03\x9Df\x99\xE1\xA7qX"
COLUMN CELL
meta:client timestamp=1378465679216,
value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000000_0
meta:count timestamp=1378465679216,
value=\x00\x00\x00\x00\x01P\xC1!
meta:prev timestamp=1378465679216,
value=\xAE\xCE\xA8\xCDX\x8A\xC7H(t\xBAx\xEF|\x0E\xCD
3 row(s) in 0.1400 seconds
hbase(main):006:0> get 'IntegrationTestBigLinkedList',
"\xB1\xC3+h\xEE\xE0{\x03p\x0C\x83\x87L\x96\xA1\xBB"
COLUMN CELL
meta:client timestamp=1378465684324,
value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000003_0
meta:count timestamp=1378465684324,
value=\x00\x00\x00\x00\x01b\x12\xCA
meta:prev timestamp=1378465684324,
value=\xAE\xC5\x17H[\xD2o\x9C\xF5\xE0\x8F\xB87\x81\xA7C
3 row(s) in 0.0730 seconds
hbase(main):007:0> get 'IntegrationTestBigLinkedList',
"|@\x1AL\x17\x9E\xF2s\xD8\xD3\x07%\x11l\xDA\x83"
COLUMN CELL
meta:client timestamp=1378465730726,
value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000000_0
meta:count timestamp=1378465730726,
value=\x00\x00\x00\x00\x01R\x99\x0D
meta:prev timestamp=1378465730726,
value=\xAD\xE10\x10t\xA9\x0D+]\xFC\xA4\xD3\x9A`\x06\xD6
3 row(s) in 0.1530 seconds
hbase(main):008:0> get 'IntegrationTestBigLinkedList', "|0\xFF8/\x86p
^V\xA4\xD0@\x0A\x0E\xC1"
COLUMN CELL
meta:client timestamp=1378465752879,
value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000002_0
meta:count timestamp=1378465752879,
value=\x00\x00\x00\x00\x01Z\xC8%
meta:prev timestamp=1378465752879,
value=\xAE\x94)\x93+\xDEa{_\x81\x9Ft,\x9B^\xB6
3 row(s) in 0.0780 seconds
{code}
All of the missing rows (the keys in meta:prev above) start with "\xAE" or "\xAD".
{code}
hbase(main):002:0> java.util.Date.new(1378465752879).toString
=> "Fri Sep 06 04:09:12 PDT 2013"
hbase(main):003:0> java.util.Date.new(1378465730726).toString
=> "Fri Sep 06 04:08:50 PDT 2013"
hbase(main):004:0> java.util.Date.new(1378465684324).toString
=> "Fri Sep 06 04:08:04 PDT 2013"
hbase(main):005:0> java.util.Date.new(1378465679216).toString
=> "Fri Sep 06 04:07:59 PDT 2013"
hbase(main):006:0> java.util.Date.new(1378465746735).toString
=> "Fri Sep 06 04:09:06 PDT 2013"
hbase(main):007:0> java.util.Date.new(1378465679500).toString
=> "Fri Sep 06 04:07:59 PDT 2013"
hbase(main):008:0> java.util.Date.new(1378465754975).toString
=> "Fri Sep 06 04:09:14 PDT 2013"
{code}
The dangling pointers were all written within a couple of minutes of each
other (between 04:07:59 and 04:09:14).
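To spot-check one of these from Java instead of the shell, here's a minimal
sketch (illustrative only, same stock client API) that follows a row's
meta:prev pointer and reports whether the target row actually exists; the
default key is the first one from the shell session above.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class DanglingPointerCheck {
  public static void main(String[] args) throws Exception {
    // Key in the same \xHH escaping the shell uses (toStringBinary format).
    String escapedKey = args.length > 0 ? args[0]
        : "Y7\\xEC\\xF6w\\xDB\\x92\\x88L\\x96\\x90\\xA0\\x17\\xCE\\xB9\\x92";

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "IntegrationTestBigLinkedList");
    try {
      byte[] meta = Bytes.toBytes("meta");
      byte[] row  = Bytes.toBytesBinary(escapedKey);

      // Read the node itself and pull out the key it claims to follow.
      Result node = table.get(new Get(row));
      byte[] prev = node.getValue(meta, Bytes.toBytes("prev"));
      if (prev == null) {
        System.out.println("Row is missing or has no meta:prev");
        return;
      }

      // Look up the previous node; an empty result means the pointer dangles.
      Result target = table.get(new Get(prev));
      System.out.println(Bytes.toStringBinary(row) + " -> "
          + Bytes.toStringBinary(prev)
          + (target.isEmpty() ? " : MISSING" : " : present"));
    } finally {
      table.close();
    }
  }
}
{code}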
h2.Side Note
JDK7 seems to core dump a lot when starting up region servers. Once things are
all set up it's stable. I need to turn on core dumps into a known folder so
that Jenkins can archive them.
> Test Big Linked List fails on Hadoop 2.1.0
> ------------------------------------------
>
> Key: HBASE-9338
> URL: https://issues.apache.org/jira/browse/HBASE-9338
> Project: HBase
> Issue Type: Bug
> Components: test
> Affects Versions: 0.96.0
> Reporter: Elliott Clark
> Assignee: Elliott Clark
> Priority: Blocker
> Fix For: 0.96.0
>
> Attachments: HBASE-HBASE-9338-TESTING.patch
>
>