Eric Newton created ACCUMULO-1685:
-------------------------------------
Summary: bench testing shows that the NN loses the WAL
Key: ACCUMULO-1685
URL: https://issues.apache.org/jira/browse/ACCUMULO-1685
Project: Accumulo
Issue Type: Bug
Components: tserver
Environment: Hadoop 1.0.4, single node dev't system
Reporter: Eric Newton
Assignee: Eric Newton
Priority: Critical
Fix For: 1.6.0
While doing some bench testing, I build Accumulo:
{noformat}
$ mvn -Pnative package -DskipTests
{noformat}
Then I go into the assembly area, configure, and run Accumulo:
{noformat}
$ cd assemble/target/accumulo-1.6.0-SNAPSHOT-dev/accumulo-1.6.0-SNAPSHOT
$ cp ~/conf/* conf
$ hadoop fs -rmr /accumulo
Moved to trash: hdfs://somehost:9000/accumulo
$ ( echo test ; echo Y ; echo secret ; echo secret ) | ./bin/accumulo init
2013-09-04 12:23:51,558 [util.Initialize] INFO : Hadoop Filesystem is
hdfs://somehost:9000
2013-09-04 12:23:51,559 [util.Initialize] INFO : Accumulo data dirs are
[hdfs://somehost:9000/accumulo]
2013-09-04 12:23:51,559 [util.Initialize] INFO : Zookeeper server is
localhost:2181
2013-09-04 12:23:51,559 [util.Initialize] INFO : Checking if Zookeeper is
available. If this hangs, then you need to make sure zookeeper is running
Instance name : test
Instance name "test" exists. Delete existing entry from zookeeper? [Y/N] : Y
Enter initial password for root (this may not be applicable for your security
setup): ******
Confirm initial password for root: ******
$ ./bin/start-all.sh
Starting monitor on localhost
Starting tablet servers .... done
Starting tablet server on localhost
2013-09-04 12:26:24,545 [server.Accumulo] INFO : Attempting to talk to zookeeper
2013-09-04 12:26:24,675 [server.Accumulo] INFO : Zookeeper connected and
initialized, attemping to talk to HDFS
2013-09-04 12:26:24,679 [server.Accumulo] INFO : Connected to HDFS
Starting master on localhost
Starting garbage collector on localhost
Starting tracer on localhost
{noformat}
Next, I create a table:
{noformat}
$ ./bin/accumulo shell -u root -p secret
2013-09-04 12:27:01,628 [shell.Shell] WARN : Specifying a raw password is
deprecated.
Shell - Apache Accumulo Interactive Shell
-
- version: 1.6.0-SNAPSHOT
- instance name: test
- instance id: 1967c1ec-cc0f-439b-b4da-4029debd16e3
-
- type 'help' for a list of available commands
-
root@test> createtable t
root@test t>
{noformat}
Then I check the tserver log for the write-ahead log created for this update
to the root table:
{noformat}
$ fgrep -a /wal/ logs/tserver_*.debug.log
2013-09-04 12:26:27,130 [log.DfsLogger] DEBUG: Got new write-ahead log:
localhost+9997/hdfs://somehost:9000/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9
2013-09-04 12:26:58,264 [tabletserver.Tablet] DEBUG: Logs for memory compacted:
!!R<<
localhost+9997/hdfs://somehost:9000/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9
{noformat}
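For reference, each of those tserver log entries pairs the logger address with the HDFS URI of the log file. A minimal Python sketch (the `<server>/<hdfs-uri>` entry format is assumed from the lines above, not taken from documentation) to pull out the filesystem path so it can be checked directly:

```python
# Split a WAL log entry of the (assumed) form "<server>/<hdfs-uri>" into its
# parts; the entry string is copied from the tserver log above.
entry = ("localhost+9997/hdfs://somehost:9000/accumulo/wal/"
         "localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9")

server, _, rest = entry.partition("/hdfs://")
hdfs_path = "hdfs://" + rest

print(server)     # logger address, e.g. localhost+9997
print(hdfs_path)  # path to hand to "hadoop fs -ls"
```

The `hdfs_path` printed here is exactly the path checked with `hadoop fs -ls` below.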
Now, let's check for the file:
{noformat}
$ hadoop fs -ls
hdfs://somehost:9000/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9
ls: Cannot access
hdfs://somehost:9000/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9:
No such file or directory.
{noformat}
What?
Check the NN logs:
{noformat}
$ fgrep 1dd2727f /some/log/dir/hadoop-ecnewt2-local-namenode-somehost.log
2013-09-04 12:26:27,075 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.allocateBlock:
/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9.
blk_-6011963215434912690_971163
2013-09-04 12:26:27,113 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.fsync: file
/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9 for
DFSClient_-787226921
{noformat}
So the NN creates the file, but it's gone by the time we go looking for it!
Here's my hdfs-site.xml file:
{noformat}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/local/ecn/data/hadoop/nn</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/disk01/data/hadoop/dn,/disk02/data/hadoop/dn,/disk03/data/hadoop/dn</value>
  </property>
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.data.synconclose</name>
    <value>true</value>
  </property>
</configuration>
{noformat}
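To rule out a config problem, it can help to dump the name/value pairs the site file actually sets (Hadoop silently ignores property names it doesn't recognize, so a typo'd key simply has no effect). A stdlib-only Python sketch, with the XML inlined from the file above for illustration:

```python
# Parse an hdfs-site.xml-style file and print each property it sets.
# The XML here mirrors (a subset of) the site file quoted above.
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0"?>
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>dfs.support.append</name><value>true</value></property>
  <property><name>dfs.data.synconclose</name><value>true</value></property>
</configuration>"""

root = ET.fromstring(xml_text)
props = {p.findtext("name"): p.findtext("value")
         for p in root.iter("property")}
for name, value in sorted(props.items()):
    print(f"{name} = {value}")
```

Pointing `ET.parse()` at the real `conf/hdfs-site.xml` instead of the inline string gives the same listing for the full file.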
I have written an integration test and dumped it into RestartIT.java, but it
doesn't seem to fail in the same way.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira