[
https://issues.apache.org/jira/browse/HBASE-12782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-12782:
--------------------------
Attachment: 12782v2.txt
Looks like this fix helps a lot. I ran my rig and it passed (nine times out of
ten it does not). I then doubled the counts so we did 250M instead of 125M and
it passed again. Will run some bigger tests over the weekend.
Here is the patch I'd like to apply. It has the fix, an obnoxious unit test to
verify the fix, and the tooling I used to find the issue. The patch is fat
because it includes a big data file of recovered.edits to replay in the unit
test.
The patch changes ITBLL to add better logging with more data around missing
rows. It also amends the verify step in ITBLL to emit the missing row keys in
binary along with the type of the missing data. This output is then usable by
a new tool, a search, which takes the missing rows from verify and searches
the WALs and oldWALs for them. This latter tool was good for figuring out
where the edits had gone missing (ante- or post-WAL).
The search tool emits a line each time it finds a key. This was useful for
narrowing in on the WALs that had the rows that were missing.
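To illustrate the search idea only: a minimal sketch, assuming each WAL can be
modelled as a name plus the list of row keys it contains. The class and method
names here are hypothetical and are not the ITBLL search tool or the WALPlayer
API; the real tool reads WAL files rather than in-memory lists.

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in for the search: not the tool from the patch.
final class MissingKeySearchSketch {
    /**
     * Scans every WAL (modelled as name -> row keys) for the missing keys,
     * prints a line per hit, and returns the number of hits per WAL name.
     */
    static Map<String, Integer> search(Map<String, List<byte[]>> wals,
                                       List<byte[]> missingKeys) {
        // byte[] has identity equals(), so wrap keys for content comparison.
        Set<ByteBuffer> wanted = new HashSet<>();
        for (byte[] key : missingKeys) {
            wanted.add(ByteBuffer.wrap(key));
        }
        Map<String, Integer> hitsPerWal = new TreeMap<>();
        for (Map.Entry<String, List<byte[]>> wal : wals.entrySet()) {
            for (byte[] row : wal.getValue()) {
                if (wanted.contains(ByteBuffer.wrap(row))) {
                    System.out.println("Found missing key in " + wal.getKey());
                    hitsPerWal.merge(wal.getKey(), 1, Integer::sum);
                }
            }
        }
        return hitsPerWal;
    }
}
```

WALs that never appear in the returned map can be ruled out, which is the
narrowing step described above.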
I'd then take the name of the WAL that had the edits and then go look at its
provenance. In this case, the WALs were opened just before a crash and no
flush had happened. The WALs would then be split to produce recovered.edits.
The patch includes a means of having recovered.edits files moved to the
archive when done rather than deleted (this is a change in HRegion). This was
useful for checking whether the WAL split had actually moved the missing edits
from the WAL to recovered.edits. It had in this case, so the replay of edits
was then suspect (of note, the recovered.edits files can be viewed with the
WALPrettyPrinter, which also has some improvements courtesy of this patch).
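A rough sketch of the archive-instead-of-delete idea, assuming a local
filesystem and java.nio.file rather than the Hadoop FileSystem that HRegion
actually uses. The class name, method name and the boolean switch are
hypothetical and do not reflect how the patch is wired up.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration only: not the HRegion change from the patch.
final class RecoveredEditsArchiveSketch {
    /**
     * Disposes of a replayed edits file. When archive is true the file is
     * moved into archiveDir so it can be inspected later; otherwise it is
     * deleted. Returns the archived path, or null if the file was deleted.
     */
    static Path retire(Path editsFile, Path archiveDir, boolean archive)
            throws IOException {
        if (!archive) {
            Files.delete(editsFile);
            return null;
        }
        Files.createDirectories(archiveDir);
        Path target = archiveDir.resolve(editsFile.getFileName());
        return Files.move(editsFile, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Keeping the file around is what makes it possible to check after the fact
whether a given edit made it from the WAL into recovered.edits.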
WALPlayer is used by the search tool in ITBLL. Added a filter method so I
could use the WALPlayer nearly directly when searching.
Made the logging of file removal from archive or wherever DEBUG level rather
than TRACE.
Made a minor improvement to recovered edits replay: it now checks at the
WALEdit level whether the edit is for THIS region rather than doing the check
per Cell. It will help some with the likes of the recovered.edits files I was
seeing in my cluster testing, where a single WALEdit had hundreds of Cells in
it.
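The shape of that change can be sketched as follows, assuming simplified
stand-in types. The Edit class below is hypothetical and is not HBase's
WALEdit or Cell; the point is only that the region comparison runs once per
edit instead of once per cell.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in types: not the HBase replay code.
final class ReplayCheckSketch {
    static final class Edit {
        final byte[] encodedRegionName; // region the whole edit belongs to
        final List<byte[]> cells;       // payload cells, possibly hundreds

        Edit(byte[] encodedRegionName, List<byte[]> cells) {
            this.encodedRegionName = encodedRegionName;
            this.cells = cells;
        }
    }

    /** Returns {cellsApplied, regionChecksPerformed}. */
    static int[] replay(List<Edit> edits, byte[] thisRegion) {
        int applied = 0;
        int checks = 0;
        for (Edit edit : edits) {
            // One comparison per edit, hoisted out of the per-cell loop.
            checks++;
            if (!Arrays.equals(edit.encodedRegionName, thisRegion)) {
                continue; // skip every cell of a foreign edit at once
            }
            for (byte[] cell : edit.cells) {
                applied++; // a real replay would apply the cell here
            }
        }
        return new int[] {applied, checks};
    }
}
```

With hundreds of cells per edit, the number of region comparisons drops from
the cell count to the edit count.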
The actual fix in HRegion was a simple one-liner (see above).
> ITBLL fails for me if generator does anything but 5M per maptask
> ----------------------------------------------------------------
>
> Key: HBASE-12782
> URL: https://issues.apache.org/jira/browse/HBASE-12782
> Project: HBase
> Issue Type: Bug
> Components: integration tests
> Affects Versions: 1.0.0
> Reporter: stack
> Priority: Critical
> Fix For: 1.0.1
>
> Attachments: 12782.fix.txt,
> 12782.search.plus.archive.recovered.edits.txt, 12782.search.plus.txt,
> 12782.search.txt, 12782.unit.test.and.it.test.txt,
> 12782.unit.test.writing.txt, 12782v2.txt
>
>
> Anyone else seeing this? If I do an ITBLL with generator doing 5M rows per
> maptask, all is good -- verify passes. I've been running 5 servers and had
> one slot per server. So below works:
> HADOOP_CLASSPATH="/home/stack/conf_hbase:`/home/stack/hbase/bin/hbase
> classpath`" ./hadoop/bin/hadoop --config ~/conf_hadoop
> org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList --monkey
> serverKilling Generator 5 5000000 g1.tmp
> or if I double the map tasks, it works:
> HADOOP_CLASSPATH="/home/stack/conf_hbase:`/home/stack/hbase/bin/hbase
> classpath`" ./hadoop/bin/hadoop --config ~/conf_hadoop
> org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList --monkey
> serverKilling Generator 10 5000000 g2.tmp
> ...but if I change the 5M to 50M or 25M, Verify fails.
> Looking into it.