[
https://issues.apache.org/jira/browse/HBASE-12782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-12782:
--------------------------
Attachment: 12782.search.plus.txt
There was a bug in my little search tool such that I was finding 'references'
rather than the keys that were missing. It sent me on a false trail.
After fixing the bug, the 'good' news is that the 'lost' keys can still be
found in WALs.
Following one trail, I can see -- using WALPrettyPrinter -- that a good clump
of missing edits are contiguous in a particular WAL. I can also see that they
are the bulk of a single WALEdit of 800+ cells. All cells in this batch have
the same sequenceid. A portion of this WALEdit's edits made it in.
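For reference, this is the sort of invocation I've been using to eyeball the WAL (paths, the row key, and the flags shown are illustrative; -p prints cell values and -w filters by row in the versions I've looked at):

```shell
# Dump a WAL file with WALPrettyPrinter, printing cell values (-p)
# and filtering to one row (-w). Paths and the row key are placeholders.
HADOOP_CLASSPATH="/home/stack/conf_hbase:`/home/stack/hbase/bin/hbase classpath`" \
  ./hbase/bin/hbase org.apache.hadoop.hbase.wal.WALPrettyPrinter \
  -p -w <row-key> hdfs://namenode/hbase/oldWALs/<wal-file>
```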
The server that received the missing edits crashed soon after. The edits are in
the WAL. The server had not had a chance to flush.
On replay, I can see a recovered edits file that should have the edits that
cover the missing block, but while I can preserve hfiles and WALs with
properties like those below, recovered.edits files are deleted when we are
done. Let me try and keep them around.
<property>
<name>hbase.master.logcleaner.ttl</name>
<value>600000000</value>
<description>Maximum time a WAL can stay in the .oldlogdir directory,
after which it will be cleaned by a Master thread.</description>
</property>
<property>
<name>hbase.master.hfilecleaner.ttl</name>
<value>600000000</value>
<description>Maximum time an HFile can stay in the archive directory,
after which it will be cleaned by a Master thread.</description>
</property>
Other notes:
+ The next WAL entry in the WAL makes it in fine. It's like we dropped the bulk
of a WAL entry in the middle of the WAL.
+ It takes 4 seconds to recover the lease. We need to fix that.
+ We report the length of a log file before we recover its lease. That length
is less than the final length. It might be worth a trip to the NameNode to find
the new length post lease recovery.
The patch is my search tool plus other cleanup of logs and a fixup for
WALPrettyPrinter.
> ITBLL fails for me if generator does anything but 5M per maptask
> ----------------------------------------------------------------
>
> Key: HBASE-12782
> URL: https://issues.apache.org/jira/browse/HBASE-12782
> Project: HBase
> Issue Type: Bug
> Components: integration tests
> Affects Versions: 1.0.0
> Reporter: stack
> Priority: Critical
> Fix For: 1.0.1
>
> Attachments: 12782.search.plus.txt, 12782.search.txt,
> 12782.unit.test.and.it.test.txt, 12782.unit.test.writing.txt
>
>
> Anyone else seeing this? If I do an ITBLL with the generator doing 5M rows per
> maptask, all is good -- verify passes. I've been running 5 servers and had
> one slot per server. So the below works:
> HADOOP_CLASSPATH="/home/stack/conf_hbase:`/home/stack/hbase/bin/hbase
> classpath`" ./hadoop/bin/hadoop --config ~/conf_hadoop
> org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList --monkey
> serverKilling Generator 5 5000000 g1.tmp
> or if I double the map tasks, it works:
> HADOOP_CLASSPATH="/home/stack/conf_hbase:`/home/stack/hbase/bin/hbase
> classpath`" ./hadoop/bin/hadoop --config ~/conf_hadoop
> org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList --monkey
> serverKilling Generator 10 5000000 g2.tmp
> ...but if I change the 5M to 50M or 25M, Verify fails.
> Looking into it.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)