[ https://issues.apache.org/jira/browse/ACCUMULO-939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Vines updated ACCUMULO-939:
--------------------------------
Description:
Attempting to test ACCUMULO-575 with the following test framework:
Test bench:
1 node running the hadoop namenode and 1 datanode
slave node running 1 datanode and the accumulo stack, with an 8GB in-memory map
Running a patched version of accumulo with the following patch to provide helper debug logging:
{code}
Index: server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java
===================================================================
--- server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java	(revision 1429057)
+++ server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java	(working copy)
@@ -81,6 +81,7 @@
   private FileSystem fs;
   protected KeyExtent extent;
   private List<IteratorSetting> iterators;
+  protected boolean minor = false;

   Compactor(Configuration conf, FileSystem fs, Map<String,DataFileValue> files, InMemoryMap imm, String outputFile, boolean propogateDeletes,
       TableConfiguration acuTableConf, KeyExtent extent, CompactionEnv env, List<IteratorSetting> iterators) {
@@ -158,7 +159,7 @@
         log.error("Verification of successful compaction fails!!! " + extent + " " + outputFile, ex);
         throw ex;
       }
-
+      log.info("Just completed minor? " + minor + " for table " + extent.getTableId());
       log.debug(String.format("Compaction %s %,d read | %,d written | %,6d entries/sec | %6.3f secs", extent, majCStats.getEntriesRead(), majCStats.getEntriesWritten(), (int) (majCStats.getEntriesRead() / ((t2 - t1) / 1000.0)), (t2 - t1) / 1000.0));
Index: server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java
===================================================================
--- server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java	(revision 1429057)
+++ server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java	(working copy)
@@ -88,6 +88,7 @@
     do {
       try {
+        this.minor = true;
         CompactionStats ret = super.call();

         // log.debug(String.format("MinC %,d recs in | %,d recs out | %,d recs/sec | %6.3f secs | %,d bytes ",map.size(), entriesCompacted,
{code}
I stood up a new instance and created a table named test. Then I ran the following:
{code}
tail -f accumulo-1.5.0-SNAPSHOT/logs/tserver_slave.debug.log | ./ifttt.sh
{code}
where ifttt.sh is
{code}
#!/bin/sh
dnpid=`jps -m | grep DataNode | awk '{print $1}'`
while [ -z "" ]; do
  if [ -e $1 ]; then read str; else str=$1; fi
  if [ -n "`echo $str | grep "Just completed minor? true for table 2"`" ]; then
    echo "I'm gonna kill datanode, pid $dnpid"
    kill -9 $dnpid
  fi
done
{code}
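For reference, the same trigger can be expressed in Java; this is a hedged sketch (the class name is made up, and it assumes jps and kill are on the PATH just as the shell version does):
{code}
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Hedged sketch of the same watcher in Java: read the tailed tserver debug log
// from stdin and kill -9 the local DataNode once the patched log line appears.
public class KillDataNodeOnMinc {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      if (!line.contains("Just completed minor? true for table 2"))
        continue;
      // Locate the DataNode pid the same way the script does: jps -m
      Process jps = Runtime.getRuntime().exec(new String[] {"jps", "-m"});
      BufferedReader jpsOut = new BufferedReader(new InputStreamReader(jps.getInputStream()));
      String proc;
      while ((proc = jpsOut.readLine()) != null) {
        if (proc.contains("DataNode")) {
          String pid = proc.split("\\s+")[0];
          System.out.println("I'm gonna kill datanode, pid " + pid);
          Runtime.getRuntime().exec(new String[] {"kill", "-9", pid});
        }
      }
    }
  }
}
{code}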
Then I ran the following:
{code}
accumulo org.apache.accumulo.server.test.TestIngest --table test --rows 65536 --cols 100 --size 8192 -z 172.16.101.220:2181 --batchMemory 100000000 --batchThreads 10
{code}
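For context, that TestIngest invocation amounts to roughly the following BatchWriter loop (a hedged sketch: the instance name, credentials, and the exact row/column/value layout are assumptions, not TestIngest's actual format):
{code}
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

// Hedged sketch of the ingest load: 65536 rows x 100 cols of 8KB values pushed
// through a BatchWriter with a 100MB buffer and 10 send threads, enough to fill
// the 8GB in-memory map and force minor compactions.
public class IngestSketch {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("test-instance", "172.16.101.220:2181")
        .getConnector("root", "secret".getBytes());                 // placeholder credentials
    BatchWriter bw = conn.createBatchWriter("test", 100000000L, 60000L, 10);
    byte[] value = new byte[8192];                                  // --size 8192
    for (int row = 0; row < 65536; row++) {                         // --rows 65536
      Mutation m = new Mutation(new Text(String.format("row_%08d", row)));
      for (int col = 0; col < 100; col++) {                         // --cols 100
        m.put(new Text("colf"), new Text(String.format("col_%05d", col)), new Value(value));
      }
      bw.addMutation(m);
    }
    bw.close();
  }
}
{code}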
Eventually the in-memory map filled, a minor compaction happened, the local datanode was killed, and things died. Logs filled with:
{code}
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /accumulo/wal/172.16.101.219+9997/08b9f1b4-26d5-4b07-a260-3334c2013576 could only be replicated to 0 nodes, instead of 1
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1556)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
	at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:416)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
{code}
and
{code}
Unexpected error writing to log, retrying attempt 1
java.io.IOException: DFSOutputStream is closed
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3666)
	at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
	at org.apache.accumulo.server.tabletserver.log.DfsLogger.defineTablet(DfsLogger.java:295)
	at org.apache.accumulo.server.tabletserver.log.TabletServerLogger$4.write(TabletServerLogger.java:333)
	at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:273)
	at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229)
	at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330)
	at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254)
	at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229)
	at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330)
	... repeats ...
{code}
Bringing the datanode back up did NOT fix it, either.
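One thing worth checking after restarting the datanode is whether the namenode actually sees it as live again; a hedged sketch against the hadoop 1.0.x API (the namenode address/port is an assumption for this test bench):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

// Hedged sketch: list the datanodes the namenode currently considers live,
// to confirm the restarted datanode re-registered before blaming the WAL.
public class LiveDataNodes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://172.16.101.220:9000"); // assumed namenode address
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
    for (DatanodeInfo dn : dfs.getDataNodeStats()) {
      System.out.println(dn.getName() + " remaining=" + dn.getRemaining());
    }
  }
}
{code}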
UPDATE: I reran the test and never killed the datanode, and it still died. So this isn't an issue with my killing the datanode; it's something with hadoop 1.0.1 and the new write-ahead logs.
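To help separate an Accumulo logger bug from a hadoop 1.0.1 one, a standalone writer exercising the same create/write/sync pattern DfsLogger uses could be run against the same HDFS; a hedged sketch (the output path and namenode address are assumptions):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: create a file in HDFS and repeatedly write/sync, mimicking the
// WAL write path, to see whether "DFSOutputStream is closed" shows up without
// Accumulo in the picture.
public class WalSyncProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://172.16.101.220:9000"); // assumed namenode address
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/tmp/wal-sync-probe"), true);
    byte[] buf = new byte[4096];
    for (int i = 0; i < 1000; i++) {
      out.write(buf);
      out.sync(); // same call that fails in DfsLogger.defineTablet
      Thread.sleep(100);
    }
    out.close();
  }
}
{code}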
Summary: WAL get stuck for unknown reasons (was: WAL get stuck when datanode dies)
> WAL get stuck for unknown reasons
> ---------------------------------
>
> Key: ACCUMULO-939
> URL: https://issues.apache.org/jira/browse/ACCUMULO-939
> Project: Accumulo
> Issue Type: Bug
> Components: tserver
> Affects Versions: 1.5.0
> Environment: hadoop-1.0.1
> Reporter: John Vines
> Assignee: Eric Newton
>
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira