[
https://issues.apache.org/jira/browse/HBASE-16232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15440174#comment-15440174
]
Mikhail Antonov commented on HBASE-16232:
-----------------------------------------
An update on this one and a quick summary. After spending a lot of time
chasing this with different configurations, here's my conclusion so far:
- I can't reproduce it on a large distributed cluster with active Chaos
Monkeys and the default "hbase.test.regions-per-server" (3). This is a fairly
rigorous test and a good baseline.
- I've seen it very occasionally on the same cluster when I set
num-regions-per-server to 100. At that point sporadic test failures start to
creep in if I run the loop long enough, because occasionally some regions get
stuck in transition and the load generator or verifier MR tasks fail with a
retries-exhausted exception.
- I see it more often if I raise the number of regions to around 300 per
server, but that makes test iterations take longer and longer and the
aforementioned task crashes happen more often, making reproduction really
painful and unreliable.
- With 100 or 300 regions per machine I have also seen this on a build from
the tip of the then-1.2.1 branch. That makes me think something old may have
been lurking in the existing code for a long time.
- To the best of my knowledge, nobody else who ran ITBLL off 1.3 builds on
real clusters ([~stack]?) was able to reproduce it, unlike the previous issue
with fake keys, which was reasonably reproducible on a small distributed
cluster (I haven't seen any other jiras filed on that or much activity here).
So based on that, I'm going to lower the priority of this task to Major and
make it a non-blocker for the release, while continuing to look into it in the
background, because at this point I don't know of any reliable and repeatable
way to reproduce it in a reasonable amount of time on a reasonable cluster
setup. If anybody has such a repro, by all means please feel free to step up
and pick this one up.
> ITBLL fails on branch-1.3, now losing actual keys
> -------------------------------------------------
>
> Key: HBASE-16232
> URL: https://issues.apache.org/jira/browse/HBASE-16232
> Project: HBase
> Issue Type: Bug
> Components: dataloss, integration tests
> Affects Versions: 1.3.0
> Reporter: Mikhail Antonov
> Assignee: Mikhail Antonov
> Priority: Blocker
> Fix For: 1.3.0
>
>
> So I'm running ITBLL off branch-1.3 on a recent commit (after [~stack]'s fix
> for fake keys showing up in the scans) with an increased number of regions
> per regionserver and seeing the following.
> {quote}
> $Verify$Counts
> REFERENCED 0 4,999,999,994 4,999,999,994
> UNDEFINED 0 3 3
> UNREFERENCED 0 3 3
> {quote}
> So we're losing some keys. This time those aren't fake:
> {quote}
> undef
> \x89\x10\xE0\xBBx\xF1\xC4\xBAY`\xC4\xD77\x87\x84\x0F 0 1 1
> \x89\x11\x0F\xBA@\x0D8^\xAE \xB1\xCAh\xEB&\xE3 0 1 1
> \x89\x16waxv;\xB1\xE3Z\xE6"|\xFC\xBE\x9A 0 1 1
> unref
> \x15\x1F*f\x92i6\x86\x1D\x8E\xB7\xE1\xC1=\x96\xEF 0 1 1
> \xF4G\xC6E\xD6\xF1\xAB\xB7\xDB\xC0\x94\xF2\xE7mN\xEC 0 1 1
> U\x0F'\x88\x106\x19\x1C\x87Y"\xF3\xE6\xC1\xC8\x15
> {quote}
> Re-running the verify step with CM off still shows this issue. The Search
> tool reports:
> {quote}
> Total
> \x89\x11\x0F\xBA@\x0D8^\xAE \xB1\xCAh\xEB&\xE3 5 0 5
> \x89\x16waxv;\xB1\xE3Z\xE6"|\xFC\xBE\x9A 4 0 4
> CELL_WITH_MISSING_ROW 15 0 15
> {quote}
> Will post more as I dig into it.
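
As a rough sanity check on the Verify counts quoted above, the following
Python sketch models the linked list as a ring in which every row stores a
pointer to its predecessor, then drops three rows. It assumes the usual
meaning of the counters (UNDEFINED: referenced but not found; UNREFERENCED:
found but referenced by nobody). The classify helper, the 20-node ring and the
choice of lost rows are purely illustrative and are not taken from ITBLL
itself.

```python
def classify(prev, surviving):
    """Count nodes by state, given each node's predecessor pointer.

    prev      -- dict mapping every node to the node it points at
    surviving -- set of nodes whose rows can still be read
    (Hypothetical helper for illustration; not ITBLL code.)
    """
    defined = set(surviving)
    # Only rows that can still be read contribute references.
    referenced = {prev[node] for node in surviving}
    return {
        "REFERENCED": len(defined & referenced),
        "UNDEFINED": len(referenced - defined),      # pointed at, but row is gone
        "UNREFERENCED": len(defined - referenced),   # row exists, nobody points at it
    }


n = 20                                      # toy ring size (assumption)
prev = {i: (i - 1) % n for i in range(n)}   # node i points at node i - 1
lost = {3, 9, 15}                           # three non-adjacent lost rows (assumption)

counts = classify(prev, set(range(n)) - lost)
print(counts)

# The counters reported in the issue add up to a round total.
reported_total = 4_999_999_994 + 3 + 3
print(reported_total)
```

In this toy model each lost row yields one UNDEFINED entry (the row itself)
and one UNREFERENCED entry (the row it used to point at), which is the same
3/3 split as in the report. The three reported counters also sum to exactly
5,000,000,000.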