May I see your logs, Kareem? What version of hbase are you running? Can I see your
config too?
Thanks,
St.Ack
Kareem Dana wrote:
> I am using Xen with Linux 2.6.18. dfs -put works fine. I can read data
> I have put, and all other dfs operations work. They work before I run
> the PE test, and after the PE test fails, dfs still works fine on
> its own. However, I found some more DFS errors in the logs that happen
> right before the PE test fails. My DFS datanodes are hadoop08-12.
>
> On hadoop08:
> 2007-11-15 19:13:52,751 INFO org.apache.hadoop.dfs.DataNode: Starting
> thread to transfer block blk_6384396336224061547 to
> [Lorg.apache.hadoop.dfs.DatanodeInfo;@1d349e2
> 2007-11-15 19:13:52,755 WARN org.apache.hadoop.dfs.DataNode: Failed to
> transfer blk_6384396336224061547 to 172.16.6.56:50010 got
> java.net.SocketException: Connection reset
>
> hadoop09:
> 2007-11-15 19:13:58,788 ERROR org.apache.hadoop.dfs.DataNode:
> DataXceiver: java.io.IOException: Block blk_6384396336224061547 has
> already been started (though not completed), and thus cannot be created.
>
> hadoop10:
> 2007-11-15 19:14:13,119 WARN org.apache.hadoop.dfs.DataNode:
> Unexpected error trying to delete block blk_-6070535147471430901.
> Block not found in blockMap.
> 2007-11-15 19:14:13,120 INFO org.apache.hadoop.dfs.DataNode: Deleting
> block blk_4063930368628711897 file
> /tmp/hadoop-kcd/dfs/data/current/blk_4063930368628711897
> 2007-11-15 19:14:13,136 INFO org.apache.hadoop.dfs.DataNode: Deleting
> block blk_-2206654761004087942 file
> /tmp/hadoop-kcd/dfs/data/current/blk_-2206654761004087942
> 2007-11-15 19:14:13,157 WARN org.apache.hadoop.dfs.DataNode:
> java.io.IOException: Error in deleting blocks.
>
> hadoop12:
> 2007-11-15 19:14:13,119 WARN org.apache.hadoop.dfs.DataNode:
> Unexpected error trying to delete block blk_-6070535147471430901.
> Block not found in blockMap.
> 2007-11-15 19:14:13,120 INFO org.apache.hadoop.dfs.DataNode: Deleting
> block blk_4063930368628711897 file
> /tmp/hadoop-kcd/dfs/data/current/blk_4063930368628711897
> 2007-11-15 19:14:13,136 INFO org.apache.hadoop.dfs.DataNode: Deleting
> block blk_-2206654761004087942 file
> /tmp/hadoop-kcd/dfs/data/current/blk_-2206654761004087942
> 2007-11-15 19:14:13,157 WARN org.apache.hadoop.dfs.DataNode:
> java.io.IOException: Error in deleting blocks.
>
> hadoop07 Namenode:
> 2007-11-15 19:10:33,090 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 3 on 54310, call
> open(/tmp/hadoop-kcd/hbase/hregion_TestTable,4204932,5347114880093364680/info/info/5829185525592087769,
> 0, 671088640) from 172.16.6.58:57409: error: java.io.IOException:
> Cannot open filename
> /tmp/hadoop-kcd/hbase/hregion_TestTable,4204932,5347114880093364680/info/info/5829185525592087769
> java.io.IOException: Cannot open filename
> /tmp/hadoop-kcd/hbase/hregion_TestTable,4204932,5347114880093364680/info/info/5829185525592087769
>
> It looks like something is wrong with DFS, but DFS is working fine
> otherwise, and when I run the PE with just 1 client it runs to
> completion. Does that put the same stress on DFS, or does a 2-client
> test effectively double the IO through DFS?
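>
> For reference, here's my rough mental model of what each PE client does per
> row, pieced together from the stack trace in my first mail (the class and
> method names below are my guess at the 0.15-era HTable API, not copied from
> the PerformanceEvaluation source, so they may be slightly off):
>
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.HTable;
> import org.apache.hadoop.io.Text;
>
> // Hypothetical sketch of one sequentialWrite client's per-row work; the
> // HTable calls are assumptions based on the testRow/commit frames in the trace.
> public class OneRowWrite {
>   public static void main(String[] args) throws Exception {
>     HTable table = new HTable(new HBaseConfiguration(), new Text("TestTable"));
>     long lockid = table.startUpdate(new Text("row-0000000001")); // one sequential row key
>     table.put(lockid, new Text("info:data"), new byte[1024]);    // dummy payload
>     table.commit(lockid); // commit() is the frame where NoServerForRegionException shows up
>   }
> }
>
> If that is roughly right, each extra client is another map task doing its own
> stream of these commits, which is why I'm wondering whether 2 clients simply
> doubles the write load.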
>
> Regards,
> Kareem
>
> On Nov 15, 2007 9:08 PM, 闫雪冰 <[EMAIL PROTECTED]> wrote:
>
>> Are you working on FreeBSD 4.11? Did you ever succeed in doing a 'dfs -put'
>> operation?
>>
>> I ran into a very similar problem a few days ago. In my case, I got an
>> "only be replicated to 0 nodes, instead of 1" message when I tried to run the
>> PE program. I found that I couldn't even manage to do a 'dfs -put', which
>> gave me the same error message, though I did succeed with 'dfs -mkdir'.
>>
>> The reason is that SecureRandom doesn't work on my FreeBSD 4.11. I ended up
>> with two solutions:
>> a) Go back to hadoop-0.14.3, which works fine with the same
>> configuration, or
>> b) Comment out the SecureRandom block as shown below:
>> ----------------------------------------------------------
>> /*
>> try {
>>   rand = SecureRandom.getInstance("SHA1PRNG").nextInt(Integer.MAX_VALUE);
>> } catch (NoSuchAlgorithmException e) {
>>   LOG.warn("Could not use SecureRandom");
>>   rand = (new Random()).nextInt(Integer.MAX_VALUE);
>> }
>> */
>> rand = (new Random()).nextInt(Integer.MAX_VALUE);
>> ----------------------------------------------------------
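>>
>> If you want to check whether SHA1PRNG works at all on your platform before
>> patching, a small standalone test of the same pattern is enough (class name
>> and messages below are just for illustration):
>>
>> import java.security.NoSuchAlgorithmException;
>> import java.security.SecureRandom;
>> import java.util.Random;
>>
>> public class SecureRandomCheck {
>>   public static void main(String[] args) {
>>     int rand;
>>     try {
>>       // Same call the Hadoop code makes; on a broken platform this may
>>       // throw or simply hang.
>>       rand = SecureRandom.getInstance("SHA1PRNG").nextInt(Integer.MAX_VALUE);
>>       System.out.println("SHA1PRNG ok: " + rand);
>>     } catch (NoSuchAlgorithmException e) {
>>       // The fallback from workaround (b): plain java.util.Random.
>>       rand = new Random().nextInt(Integer.MAX_VALUE);
>>       System.out.println("SHA1PRNG unavailable, using Random: " + rand);
>>     }
>>   }
>> }
>>
>> If that hangs or fails, I'd expect the Hadoop daemons to hit the same thing.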
>> Hope it helps.
>> -Xuebing Yan
>>
>> -----Original Message-----
>> From: Kareem Dana [mailto:[EMAIL PROTECTED]
>> Sent: November 16, 2007 9:32
>> To: [email protected]
>> Subject: Re: HBase PerformanceEvaluation failing
>>
>> My DFS appears healthy. After the PE fails, the datanodes are still
>> running but all the HRegionServers have exited. My initial concerns are
>> free hard drive space and memory. Each node has ~1.5GB of free space for
>> DFS and 400MB RAM/256MB swap. Is this enough for the PE? I tried
>> monitoring the free space as the PE ran; it never completely filled
>> up, but it is kind of tight.
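>> If it would help, I can also capture 'bin/hadoop dfsadmin -report' output
>> (assuming that is the right way to see per-datanode capacity and remaining
>> space) before and after the run.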
>>
>>
>> On Nov 15, 2007 8:01 PM, stack <[EMAIL PROTECTED]> wrote:
>>
>>> Your DFS is healthy? This seems odd: "File
>>> /tmp/hadoop-kcd/hbase/hregion_TestTable,2102165,6843477525281170954/info/mapfiles/6464987859396543981/data
>>> could only be replicated to 0 nodes, instead of 1;" In my experience,
>>> IIRC, it means no datanodes running.
>>>
>>> (I just tried the PE from TRUNK and it ran to completion).
>>>
>>> St.Ack
>>>
>>>
>>> Kareem Dana wrote:
>>>
>>>> I'm trying to run the HBase PerformanceEvaluation program on a cluster
>>>> of 5 hadoop nodes (on virtual machines).
>>>>
>>>> hadoop07 is a DFS Master and HBase master
>>>> hadoop08-12 are HBase region servers
>>>>
>>>> I start the test as follows:
>>>>
>>>> $ bin/hadoop jar
>>>> ${HADOOP_HOME}build/contrib/hbase/hadoop-0.15.0-dev-hbase-test.jar
>>>> sequentialWrite 2
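>>>> (For the single-client runs I mention below, I just pass 1 instead of 2 as
>>>> the last argument.)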
>>>>
>>>> This starts the sequentialWrite test with 2 clients. After about 25
>>>> minutes, with the map tasks about 25% complete and the reduce at 6%, the
>>>> test fails with the following error:
>>>> 2007-11-15 17:06:35,100 INFO org.apache.hadoop.mapred.TaskInProgress: TaskInProgress tip_200711151626_0001_m_000002 has failed 1 times.
>>>> 2007-11-15 17:06:35,100 INFO org.apache.hadoop.mapred.JobInProgress: Aborting job job_200711151626_0001
>>>> 2007-11-15 17:06:35,101 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200711151626_0001_m_000006_0: org.apache.hadoop.hbase.NoServerForRegionException: failed to find server for TestTable after 5 retries
>>>>         at org.apache.hadoop.hbase.HConnectionManager$TableServers.scanOneMetaRegion(HConnectionManager.java:761)
>>>>         at org.apache.hadoop.hbase.HConnectionManager$TableServers.findServersForTable(HConnectionManager.java:521)
>>>>         at org.apache.hadoop.hbase.HConnectionManager$TableServers.reloadTableServers(HConnectionManager.java:317)
>>>>         at org.apache.hadoop.hbase.HTable.commit(HTable.java:671)
>>>>         at org.apache.hadoop.hbase.HTable.commit(HTable.java:636)
>>>>         at org.apache.hadoop.hbase.PerformanceEvaluation$SequentialWriteTest.testRow(PerformanceEvaluation.java:493)
>>>>         at org.apache.hadoop.hbase.PerformanceEvaluation$Test.test(PerformanceEvaluation.java:356)
>>>>         at org.apache.hadoop.hbase.PerformanceEvaluation.runOneClient(PerformanceEvaluation.java:529)
>>>>         at org.apache.hadoop.hbase.PerformanceEvaluation$EvaluationMapTask.map(PerformanceEvaluation.java:184)
>>>>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>>>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>>>>
>>>> An HBase region server log shows these errors:
>>>> 2007-11-15 17:03:00,017 ERROR org.apache.hadoop.hbase.HRegionServer: error closing region TestTable,2102165,6843477525281170954
>>>> org.apache.hadoop.hbase.DroppedSnapshotException: java.io.IOException: File /tmp/hadoop-kcd/hbase/hregion_TestTable,2102165,6843477525281170954/info/mapfiles/6464987859396543981/data could only be replicated to 0 nodes, instead of 1
>>>>         at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1003)
>>>>         at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:293)
>>>>         at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>         at java.lang.reflect.Method.invoke(Method.java:585)
>>>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
>>>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)
>>>>
>>>>         at org.apache.hadoop.hbase.HRegion.internalFlushcache(HRegion.java:886)
>>>>         at org.apache.hadoop.hbase.HRegion.close(HRegion.java:388)
>>>>         at org.apache.hadoop.hbase.HRegionServer.closeAllRegions(HRegionServer.java:978)
>>>>         at org.apache.hadoop.hbase.HRegionServer.run(HRegionServer.java:593)
>>>>         at java.lang.Thread.run(Thread.java:595)
>>>> 2007-11-15 17:03:00,615 ERROR org.apache.hadoop.hbase.HRegionServer: error closing region TestTable,3147654,8929124532081908894
>>>> org.apache.hadoop.hbase.DroppedSnapshotException: java.io.IOException: File /tmp/hadoop-kcd/hbase/hregion_TestTable,3147654,8929124532081908894/info/mapfiles/3451857497397493742/data could only be replicated to 0 nodes, instead of 1
>>>>         at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1003)
>>>>         at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:293)
>>>>         at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>         at java.lang.reflect.Method.invoke(Method.java:585)
>>>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
>>>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)
>>>>
>>>>         at org.apache.hadoop.hbase.HRegion.internalFlushcache(HRegion.java:886)
>>>>         at org.apache.hadoop.hbase.HRegion.close(HRegion.java:388)
>>>>         at org.apache.hadoop.hbase.HRegionServer.closeAllRegions(HRegionServer.java:978)
>>>>         at org.apache.hadoop.hbase.HRegionServer.run(HRegionServer.java:593)
>>>>         at java.lang.Thread.run(Thread.java:595)
>>>> 2007-11-15 17:03:00,639 ERROR org.apache.hadoop.hbase.HRegionServer: Close and delete failed
>>>> java.io.IOException: java.io.IOException: File /tmp/hadoop-kcd/hbase/log_172.16.6.57_-3889232888673408171_60020/hlog.dat.005 could only be replicated to 0 nodes, instead of 1
>>>>         at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1003)
>>>>         at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:293)
>>>>         at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>         at java.lang.reflect.Method.invoke(Method.java:585)
>>>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
>>>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)
>>>>
>>>>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>>>>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>>>         at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
>>>>         at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:82)
>>>>         at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:48)
>>>>         at org.apache.hadoop.hbase.HRegionServer.run(HRegionServer.java:597)
>>>>         at java.lang.Thread.run(Thread.java:595)
>>>> 2007-11-15 17:03:00,640 INFO org.apache.hadoop.hbase.HRegionServer:
>>>> telling master that region server is shutting down at:
>>>> 172.16.6.57:60020
>>>> 2007-11-15 17:03:00,643 INFO org.apache.hadoop.hbase.HRegionServer:
>>>> stopping server at: 172.16.6.57:60020
>>>> 2007-11-15 17:03:00,643 INFO org.apache.hadoop.hbase.HRegionServer:
>>>> regionserver/0.0.0.0:60020 exiting
>>>>
>>>> I can provide more logs if necessary. Any ideas or suggestions
>>>> about how to track this down? Running the sequentialWrite test with just 1
>>>> client works fine, but using 2 or more clients causes these errors.
>>>>
>>>> Thanks for any help,
>>>> Kareem Dana
>>>>
>>>>
>>>