Eric Newton created ACCUMULO-1685:
-------------------------------------
Summary: bench testing shows that the NN loses the WAL
Key: ACCUMULO-1685
URL: https://issues.apache.org/jira/browse/ACCUMULO-1685
Project: Accumulo
Issue Type: Bug
Components: tserver
Environment: Hadoop 1.0.4, single node dev't system
Reporter: Eric Newton
Assignee: Eric Newton
Priority: Critical
Fix For: 1.6.0
While doing some bench testing, I build Accumulo:
{noformat}
$ mvn -Pnative package -DskipTests
{noformat}
Then I go into the assembly area, configure, and run Accumulo:
{noformat}
$ cd assemble/target/accumulo-1.6.0-SNAPSHOT-dev/accumulo-1.6.0-SNAPSHOT
$ cp ~/conf/* conf
$ hadoop fs -rmr /accumulo
Moved to trash: hdfs://somehost:9000/accumulo
$ ( echo test ; echo Y ; echo secret ; echo secret ) | ./bin/accumulo init
2013-09-04 12:23:51,558 [util.Initialize] INFO : Hadoop Filesystem is
hdfs://somehost:9000
2013-09-04 12:23:51,559 [util.Initialize] INFO : Accumulo data dirs are
[hdfs://somehost:9000/accumulo]
2013-09-04 12:23:51,559 [util.Initialize] INFO : Zookeeper server is
localhost:2181
2013-09-04 12:23:51,559 [util.Initialize] INFO : Checking if Zookeeper is
available. If this hangs, then you need to make sure zookeeper is running
Instance name : test
Instance name "test" exists. Delete existing entry from zookeeper? [Y/N] : Y
Enter initial password for root (this may not be applicable for your security
setup): ******
Confirm initial password for root: ******
$ ./bin/start-all.sh
Starting monitor on localhost
Starting tablet servers .... done
Starting tablet server on localhost
2013-09-04 12:26:24,545 [server.Accumulo] INFO : Attempting to talk to zookeeper
2013-09-04 12:26:24,675 [server.Accumulo] INFO : Zookeeper connected and
initialized, attemping to talk to HDFS
2013-09-04 12:26:24,679 [server.Accumulo] INFO : Connected to HDFS
Starting master on localhost
Starting garbage collector on localhost
Starting tracer on localhost
{noformat}
Next, I create a table:
{noformat}
$ ./bin/accumulo shell -u root -p secret
2013-09-04 12:27:01,628 [shell.Shell] WARN : Specifying a raw password is
deprecated.
Shell - Apache Accumulo Interactive Shell
-
- version: 1.6.0-SNAPSHOT
- instance name: test
- instance id: 1967c1ec-cc0f-439b-b4da-4029debd16e3
-
- type 'help' for a list of available commands
-
root@test> createtable t
root@test t>
{noformat}
Then I check the tserver log for the write-ahead log created for this update
to the root table:
{noformat}
$ fgrep -a /wal/ logs/tserver_*.debug.log
2013-09-04 12:26:27,130 [log.DfsLogger] DEBUG: Got new write-ahead log:
localhost+9997/hdfs://somehost:9000/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9
2013-09-04 12:26:58,264 [tabletserver.Tablet] DEBUG: Logs for memory compacted:
!!R<<
localhost+9997/hdfs://somehost:9000/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9
{noformat}
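For reference, each of those tserver log entries pairs the logger address with the HDFS URI of the log file. A minimal Python sketch (the `<server>/<hdfs-uri>` entry format is assumed from the lines above, not taken from documentation) to pull out the filesystem path so it can be checked directly:

```python
# Split a WAL log entry of the (assumed) form "<server>/<hdfs-uri>" into its
# parts; the entry string is copied from the tserver log above.
entry = ("localhost+9997/hdfs://somehost:9000/accumulo/wal/"
         "localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9")

server, _, rest = entry.partition("/hdfs://")
hdfs_path = "hdfs://" + rest

print(server)     # logger address, e.g. localhost+9997
print(hdfs_path)  # path to hand to "hadoop fs -ls"
```

The `hdfs_path` printed here is exactly the path checked with `hadoop fs -ls` below.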
Now, let's check for the file:
{noformat}
$ hadoop fs -ls
hdfs://somehost:9000/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9
ls: Cannot access
hdfs://somehost:9000/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9:
No such file or directory.
{noformat}
What?
Check the NN logs:
{noformat}
$ fgrep 1dd2727f /some/log/dir/hadoop-ecnewt2-local-namenode-somehost.log
2013-09-04 12:26:27,075 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.allocateBlock:
/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9.
blk_-6011963215434912690_971163
2013-09-04 12:26:27,113 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.fsync: file
/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9 for
DFSClient_-787226921
{noformat}
So the NN creates the file, but it's gone by the time we go looking for it!
Here's my hdfs-site.xml file:
{noformat}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/local/ecn/data/hadoop/nn</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/disk01/data/hadoop/dn,/disk02/data/hadoop/dn,/disk03/data/hadoop/dn</value>
  </property>
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.data.synconclose</name>
    <value>true</value>
  </property>
</configuration>
{noformat}
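To rule out a config problem, it can help to dump the name/value pairs the site file actually sets (Hadoop silently ignores property names it doesn't recognize, so a typo'd key simply has no effect). A stdlib-only Python sketch, with the XML inlined from the file above for illustration:

```python
# Parse an hdfs-site.xml-style file and print each property it sets.
# The XML here mirrors (a subset of) the site file quoted above.
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0"?>
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>dfs.support.append</name><value>true</value></property>
  <property><name>dfs.data.synconclose</name><value>true</value></property>
</configuration>"""

root = ET.fromstring(xml_text)
props = {p.findtext("name"): p.findtext("value")
         for p in root.iter("property")}
for name, value in sorted(props.items()):
    print(f"{name} = {value}")
```

Pointing `ET.parse()` at the real `conf/hdfs-site.xml` instead of the inline string gives the same listing for the full file.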
I have written an integration test and dumped it into RestartIT.java, but it
doesn't seem to fail in the same way.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira