Hi,
Qiao’s HBase log shows errors when HBase tried to open the table regions under
the “_MD_” schema. The error stack looks like this:
2016-09-08 16:44:36,327 ERROR [RS_OPEN_REGION-hadoop2slave7:60020-0] handler.OpenRegionHandler: Failed open of region=TRAFODION._MD_.COLUMNS,,1471946223350.b6191867e73d4203d3ac6fad3c860138., starting to roll back the global memstore size.
org.apache.hadoop.hbase.DroppedSnapshotException: region: TRAFODION._MD_.COLUMNS,,1471946223350.b6191867e73d4203d3ac6fad3c860138.
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2243)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1972)
    at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:3826)
    at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionStores(HRegion.java:969)
    at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:841)
    at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:814)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:5828)
    at org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion.openHRegion(TransactionalRegion.java:101)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:5794)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:5765)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:5721)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:5672)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:356)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:126)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.AssertionError: Key \xB9"b*M3c\x00ADMCKID
/#1:\x01/1473306352163/Put/vlen=8/seqid=1749 followed by a smaller key
\xB9"b*M3c\x00ADMCKID
/#1:\x01/1473306352163/Put/vlen=8/seqid=4003 in cf #1
    at org.apache.hadoop.hbase.regionserver.StoreScanner.checkScanOrder(StoreScanner.java:699)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:493)
    at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:115)
    at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:71)
    at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:940)
    at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2217)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2197)
    ... 17 more
I am not sure why this happened, and HBase itself works well now. Since the
metadata was not available and Qiao’s data is just test data, he reinitialized
Trafodion and it recovered.
I don’t have enough information to determine the root cause yet. The error
above happens during HBase startup, and there are some HDFS errors before the
HBase abort:
---------------------------------------------------------------------------------
2016-09-07 22:34:21,228 ERROR [regionserver/hadoop2slave7/10.1.1.22:60020] wal.ProtobufLogWriter: Got IOException while writing trailer
java.nio.channels.ClosedChannelException
    at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:1635)
    at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:104)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at com.google.protobuf.CodedOutputStream.refreshBuffer(CodedOutputStream.java:833)
    at com.google.protobuf.CodedOutputStream.flush(CodedOutputStream.java:843)
    at com.google.protobuf.AbstractMessageLite.writeTo(AbstractMessageLite.java:80)
    at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.writeWALTrailer(ProtobufLogWriter.java:157)
    at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.close(ProtobufLogWriter.java:130)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.shutdown(FSHLog.java:1149)
    at org.apache.hadoop.hbase.wal.DefaultWALProvider.shutdown(DefaultWALProvider.java:114)
    at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:215)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1248)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1003)
    at java.lang.Thread.run(Thread.java:745)
And
2016-09-07 22:34:20,765 ERROR [sync.4] wal.FSHLog: Error syncing, request close of wal
java.io.IOException: All datanodes DatanodeInfoWithStorage[10.1.1.23:50010,DS-ba16d69a-682c-42db-8a4f-e0d369e5d397,DISK] are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1218)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)
2016-09-07 22:34:20,767 WARN [RS_OPEN_META-hadoop2slave7:60020-0-MetaLogRoller] wal.FSHLog: Failed last sync but no outstanding unsync edits so falling through to close;
java.io.IOException: All datanodes DatanodeInfoWithStorage[10.1.1.23:50010,DS-ba16d69a-682c-42db-8a4f-e0d369e5d397,DISK] are bad. Aborting...
2016-09-07 22:34:20,767 ERROR [RS_OPEN_META-hadoop2slave7:60020-0-MetaLogRoller] wal.ProtobufLogWriter: Got IOException while writing trailer
java.io.IOException: All datanodes DatanodeInfoWithStorage[10.1.1.23:50010,DS-ba16d69a-682c-42db-8a4f-e0d369e5d397,DISK] are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1218)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)
2016-09-07 22:34:20,767 ERROR [RS_OPEN_META-hadoop2slave7:60020-0-MetaLogRoller] wal.FSHLog: Failed close of WAL writer hdfs://hadoop2slave7:8020/hbase/WALs/hadoop2slave7,60020,1473040797512/hadoop2slave7%2C60020%2C1473040797512..meta.1473255260637.meta, unflushedEntries=0
java.io.IOException: All datanodes DatanodeInfoWithStorage[10.1.1.23:50010,DS-ba16d69a-682c-42db-8a4f-e0d369e5d397,DISK] are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1218)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)
2016-09-07 22:34:20,767 FATAL [RS_OPEN_META-hadoop2slave7:60020-0-MetaLogRoller] regionserver.HRegionServer: ABORTING region server hadoop2slave7,60020,1473040797512: Failed log close in log roller
org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: hdfs://hadoop2slave7:8020/hbase/WALs/hadoop2slave7,60020,1473040797512/hadoop2slave7%2C60020%2C1473040797512..meta.1473255260637.meta, unflushedEntries=0
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:978)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:716)
    at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:137)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: All datanodes DatanodeInfoWithStorage[10.1.1.23:50010,DS-ba16d69a-682c-42db-8a4f-e0d369e5d397,DISK] are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1218)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)
2016-09-07 22:34:20,768 FATAL [RS_OPEN_META-hadoop2slave7:60020-0-MetaLogRoller] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.AggregateImplementation, org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint, org.apache.hadoop.hbase.coprocessor.transactional.TrxRegionObserver, org.apache.hadoop.hbase.coprocessor.transactional.TrxRegionEndpoint]
------------------------------------------------------------------------------------
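The “All datanodes … are bad” errors suggest the HDFS write pipeline to the datanodes failed before HBase aborted. A couple of standard HDFS commands can help narrow that down; this is only a diagnostic sketch, and it assumes shell access on a cluster node with “hdfs” as the HDFS superuser:

```shell
# Diagnostic sketch (assumes the HDFS superuser is named "hdfs"):
# summarize live/dead datanodes, then check block health under the HBase root.
sudo su hdfs --command "hdfs dfsadmin -report"
sudo su hdfs --command "hdfs fsck /hbase -files -blocks"
```

Checking the datanode logs on 10.1.1.23 around 2016-09-07 22:34 would also help correlate the pipeline failure.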
I am not sure whether this log information helps to find the root cause of the
metadata corruption; I am still investigating.
Thanks,
Ming
From: 乔彦克 [mailto:[email protected]]
Sent: Friday, September 09, 2016 11:27 AM
To: [email protected]; [email protected]
Cc: Amanda Moran <[email protected]>; Selva Govindarajan
<[email protected]>; Liu, Ming (Ming) <[email protected]>
Subject: Re: Load with log error rows gets Trafodion not work
Thanks to Selva and Amanda. I loaded three data sets from Hive into Trafodion
yesterday; two of them succeeded and the last one got the error.
This error left me unable to execute any query from trafci except "initialize
trafodion, drop" (thanks to @Liuming for telling me to do so). Ming analyzed
the HBase log and found that the data regions belonging to Trafodion could not
be opened.
After I initialized Trafodion again, I reloaded the three data sets and
everything went well.
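For reference, the bulk load statements were of this general form. This is only a sketch: the table names are placeholders, and the option spellings follow the Trafodion SQL Reference (LOAD supports both LOG ERROR ROWS and CONTINUE ON ERROR, the alternative Selva suggested):

```shell
# Sketch only: placeholder table names; run from a node with sqlci/trafci.
sqlci <<'EOF'
-- original form, which writes rejected rows to an error log in HDFS
LOAD WITH LOG ERROR ROWS INTO trafodion.seabase.t1 SELECT * FROM hive.hive.t1;
-- alternative that continues past bad rows without logging them
LOAD WITH CONTINUE ON ERROR INTO trafodion.seabase.t1 SELECT * FROM hive.hive.t1;
EOF
```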
@Selva, Trafodion and HBase are running normally; below is the output of
'sqvers -u':
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LC_CTYPE = "UTF-8",
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
cat: /opt/hptc/pdsh/nodes: No such file or directory
MY_SQROOT=/home/trafodion/apache-trafodion_server-2.0.1-incubating
who@host=trafodion@hadoop2slave7
JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
linux=2.6.32-220.el6.x86_64
redhat=6.2
NO patches
Most common Apache_Trafodion Release 2.0.1 (Build release [DEV], branch -, date
24Jun16)
UTT count is 2
[8] Apache_Trafodion Release 2.0.1 (Build release [DEV], branch
release2.0, date 24Jun16)
export/lib/hbase-trx-apache1_0_2-2.0.1.jar
export/lib/hbase-trx-hdp2_3-2.0.1.jar
export/lib/sqmanvers.jar
export/lib/trafodion-dtm-apache1_0_2-2.0.1.jar
export/lib/trafodion-dtm-hdp2_3-2.0.1.jar
export/lib/trafodion-sql-apache1_0_2-2.0.1.jar
export/lib/trafodion-sql-hdp2_3-2.0.1.jar
export/lib/trafodion-utility-2.0.1.jar
[3] Release 2.0.1 (Build release [DEV], branch release2.0, date 24Jun16)
export/lib/jdbcT2.jar
export/lib/jdbcT4.jar
export/lib/lib_mgmt.jar
@Amanda:
The HDFS /user directory contains no trafodion user directory, just root and
hive. But since I can load and insert data into Trafodion, I don't think the
problem is there.
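If the missing /user/trafodion directory does turn out to matter, it can usually be created as the HDFS superuser. This is a sketch only; the superuser name ("hdfs") and the trafodion:trafodion ownership are assumptions to adjust for your install:

```shell
# Sketch: create an HDFS home directory for the trafodion user.
# The "hdfs" superuser and trafodion:trafodion ownership are assumptions.
sudo su hdfs --command "hadoop fs -mkdir -p /user/trafodion"
sudo su hdfs --command "hadoop fs -chown trafodion:trafodion /user/trafodion"
```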
Thank you for your replies.
Many thanks again,
Qiao
Amanda Moran
<[email protected]<mailto:[email protected]>> wrote on Friday,
September 9, 2016, at 1:03 AM:
Please run this command:
sudo su hdfs --command "hadoop fs -ls /user"
Please verify you have the trafodion user id listed there.
Thanks!
Amanda
On Thu, Sep 8, 2016 at 8:08 AM, Selva Govindarajan <
[email protected]<mailto:[email protected]>> wrote:
> Hi Qiao,
>
>
>
> The JIRA you mentioned in the message is already fixed and merged to
> Trafodion on July 20th. It is unfortunate that the JIRA wasn’t marked as
> resolved; I have marked it as resolved now. This JIRA deals with the
> Trafodion process aborting when there is an error while logging the error
> rows. The error rows are logged to HDFS directly, so most likely the
> “trafodion” user has no write permission on the HDFS directory where they
> are logged.
>
>
>
> You can try “Load with continue on error … “ command instead and check if
> it works.
>
>
>
> Can you also please send the output of the command below to confirm if the
> version installed has the above fix.
>
>
>
> sqvers -u
>
>
>
> Can you also issue the following commands to confirm that Trafodion and
> HBase started successfully:
>
>
>
> hbcheck
>
> sqcheck
>
>
>
>
>
> Selva
>
> *From:* 乔彦克 [mailto:[email protected]<mailto:[email protected]>]
> *Sent:* Thursday, September 8, 2016 12:20 AM
> *To:* [email protected]<mailto:[email protected]>;
> [email protected]<mailto:[email protected]>
> *Subject:* Load with log error rows gets Trafodion not work
>
>
>
> Hi, all,
>
> I used LOAD WITH LOG ERROR ROWS to load data from Hive, and got the
> following error:
>
> [image: loaderr.png]
>
> which led to the HBase region server crashing.
>
> I restarted the HBase region server and Trafodion, but queries in Trafodion
> get no response, even the simplest ones such as "get tables;" or "get
> schemas;".
>
> Can someone help me get Trafodion back to normal?
>
> https://issues.apache.org/jira/browse/TRAFODION-2109: this JIRA describes
> the same problem.
>
>
>
> Any reply is appreciated.
>
> Thank you
>
> Qiao
>
--
Thanks,
Amanda Moran