Hi J-D,
            yes, I did restart HBase after increasing the region-server lease
timeout. Initially I set dfs.datanode.socket.write.timeout to 0, but it
gave some problems on my local setup. I will try setting
dfs.datanode.socket.write.timeout to zero and test it again; if I face any
issues, I will let you know.
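
To be concrete, that means adding something like this to hadoop-site.xml
(the property name and value are the ones you gave):

  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>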

PS: I haven't seen the dfs.datanode.socket.write.timeout property in
hadoop-default. Also, even after the exception, all my datanodes and
tasktrackers are live; none of them are dead.


Thanks,
Raakhi

On Wed, Apr 8, 2009 at 5:38 PM, Jean-Daniel Cryans <[email protected]> wrote:

> Rakhi,
>
> Just to be sure, when you changed the RS lease timeout did you restart
> hbase?
>
> The datanode logs seem to imply that some channels are left open for
> too long. Please set dfs.datanode.socket.write.timeout to 0 in
> hadoop-site.
>
> J-D
>
> On Wed, Apr 8, 2009 at 7:57 AM, Rakhi Khatwani <[email protected]> wrote:
> > Hi,
> >     I came across the Scanner Timeout Exception again :(
> > This time I had a look at the tasktracker and datanode logs of the
> > machine where the task failed.
> >
> > The logs are as follows:
> >
> > TaskTracker:
> >
> > 2009-04-08 07:18:07,532 INFO org.apache.hadoop.mapred.TaskTracker:
> > org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> > taskTracker/jobcache/job_200904080539_0001/attempt_200904080539_0001_m_000001_0/output/file.out in any of the configured local directories
> > 2009-04-08 07:18:08,337 INFO org.apache.hadoop.mapred.TaskTracker:
> > attempt_200904080539_0001_m_000001_0 0.0% Starting Analysis...
> > 2009-04-08 07:18:12,565 INFO org.apache.hadoop.mapred.TaskTracker:
> > org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> > taskTracker/jobcache/job_200904080539_0001/attempt_200904080539_0001_m_000001_0/output/file.out in any of the configured local directories
> > 2009-04-08 07:18:14,399 INFO org.apache.hadoop.mapred.TaskTracker:
> > attempt_200904080539_0001_m_000001_0 0.0% Starting Analysis...
> > 2009-04-08 07:18:17,409 INFO org.apache.hadoop.mapred.TaskTracker:
> > attempt_200904080539_0001_m_000001_0 0.0% Starting Analysis...
> > 2009-04-08 07:18:17,583 INFO org.apache.hadoop.mapred.TaskTracker:
> > org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> > taskTracker/jobcache/job_200904080539_0001/attempt_200904080539_0001_m_000001_0/output/file.out in any of the configured local directories
> > 2009-04-08 07:18:19,763 INFO org.apache.hadoop.mapred.JvmManager: JVM :
> > jvm_200904080539_0001_m_-1878302273 exited. Number of tasks it ran: 0
> > 2009-04-08 07:18:22,587 INFO org.apache.hadoop.mapred.TaskTracker:
> > org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> > taskTracker/jobcache/job_200904080539_0001/attempt_200904080539_0001_m_000001_0/output/file.out in any of the configured local directories
> > 2009-04-08 07:18:22,779 INFO org.apache.hadoop.mapred.TaskRunner:
> > attempt_200904080539_0001_m_000001_0 done; removing files.
> > 2009-04-08 07:18:22,780 INFO org.apache.hadoop.mapred.TaskTracker:
> > addFreeSlot : current free slots : 3
> >
> >
> > At the Data Node:
> >
> >
> > 2009-04-08 07:19:01,153 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
> > src: /10.251.74.84:50010, dest: /10.251.74.84:59583, bytes: 1320960,
> > op: HDFS_READ, cliID: DFSClient_258286192, srvID:
> > DS-2059868082-10.251.74.84-50010-1239116275760, blockid: blk_-4896946973674546604_2508
> > 2009-04-08 07:19:01,154 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(10.251.74.84:50010,
> > storageID=DS-2059868082-10.251.74.84-50010-1239116275760, infoPort=50075,
> > ipcPort=50020):Got exception while serving blk_-4896946973674546604_2508 to /10.251.74.84:
> > java.net.SocketTimeoutException: 480000 millis timeout while waiting for
> > channel to be ready for write. ch : java.nio.channels.SocketChannel[connected
> > local=/10.251.74.84:50010 remote=/10.251.74.84:59583]
> >       at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
> >       at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> >       at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> >       at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
> >       at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
> >       at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
> >       at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
> >       at java.lang.Thread.run(Thread.java:619)
> >
> >
> > So ultimately it boils down to some problem with HDFS, but I am still not
> > able to figure out what the issue could be.
> >
> > Thanks,
> > Raakhi
> >
> >
> > On Wed, Apr 8, 2009 at 3:26 PM, Rakhi Khatwani <[email protected]> wrote:
> >
> >> Hi,
> >>    I am pasting the region server logs:
> >>
> >> 2009-04-08 00:06:26,378 INFO
> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> >> 5427021309867584920 lease expired
> >> 2009-04-08 00:16:23,641 INFO
> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> >> -3894991203345155244 lease expired
> >> 2009-04-08 00:29:08,402 INFO
> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> >> 651295424715622118 lease expired
> >> 2009-04-08 00:39:05,430 INFO
> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> >> -2734117247134548430 lease expired
> >> 2009-04-08 00:46:35,515 INFO
> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> >> 2810685965461882801 lease expired
> >> 2009-04-08 00:56:38,289 INFO
> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> >> 9085655909080042643 lease expired
> >> 2009-04-08 01:06:36,035 INFO
> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> >> 5701864683466148562 lease expired
> >> 2009-04-08 03:13:02,545 INFO
> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> >> -2157771707879192919 lease expired
> >> 2009-04-08 03:29:24,603 ERROR
> >> org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> org.apache.hadoop.hbase.UnknownScannerException: Name: -2157771707879192919
> >> 2009-04-08 03:29:24,606 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> >> handler 0 on 60020, call next(-2157771707879192919, 30) from
> >> 10.250.6.4:37602: error: org.apache.hadoop.hbase.UnknownScannerException:
> >> Name: -2157771707879192919
> >> org.apache.hadoop.hbase.UnknownScannerException: Name: -2157771707879192919
> >>        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1568)
> >>        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
> >>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>        at java.lang.reflect.Method.invoke(Method.java:597)
> >>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:632)
> >>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:895)
> >> 2009-04-08 03:29:24,655 ERROR
> >> org.apache.hadoop.hbase.regionserver.HRegionServer: org.apache.hadoop.h
> >>
> >> What I believe is that at the region server the scanner lease expires for
> >> scanner id SCANNER_ID [which happens at 3:13], and then my map-reduce
> >> program calls next() with this SCANNER_ID, hence we get this scanner
> >> timeout exception/unknown scanner exception [this happens at 3:29].
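> >>
> >> One idea I have is to catch the exception and re-open the scanner from the
> >> last row I processed, roughly like this (a rough sketch against the 0.19
> >> client API, with class and method names from memory, so treat them as
> >> assumptions; 'table1' and 'status:' are placeholders for my real names):
> >>
> >>   import org.apache.hadoop.hbase.HBaseConfiguration;
> >>   import org.apache.hadoop.hbase.HConstants;
> >>   import org.apache.hadoop.hbase.UnknownScannerException;
> >>   import org.apache.hadoop.hbase.client.HTable;
> >>   import org.apache.hadoop.hbase.client.Scanner;
> >>   import org.apache.hadoop.hbase.io.RowResult;
> >>   import org.apache.hadoop.hbase.util.Bytes;
> >>
> >>   HTable table = new HTable(new HBaseConfiguration(), "table1");
> >>   byte[][] columns = { Bytes.toBytes("status:") };
> >>   byte[] lastRow = HConstants.EMPTY_START_ROW;
> >>   Scanner scanner = table.getScanner(columns);
> >>   try {
> >>     while (true) {
> >>       RowResult row;
> >>       try {
> >>         row = scanner.next();
> >>       } catch (UnknownScannerException use) {
> >>         // the lease expired while the map was busy analyzing; re-open
> >>         // from the last row seen (note: this re-reads lastRow, so that
> >>         // row has to be skipped or the analysis made idempotent; the
> >>         // exception may also arrive wrapped in a RemoteException)
> >>         scanner.close();
> >>         scanner = table.getScanner(columns, lastRow);
> >>         continue;
> >>       }
> >>       if (row == null) break;          // end of table
> >>       lastRow = row.getRow();
> >>       // ... the long-running analysis + writes to table2 go here ...
> >>     }
> >>   } finally {
> >>     scanner.close();
> >>   }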
> >>
> >> How do I avoid such a situation?
> >>
> >> Thanks,
> >> Raakhi
> >>
> >>
> >>
> >> On Wed, Apr 8, 2009 at 2:03 PM, Rakhi Khatwani <[email protected]> wrote:
> >>
> >>> Hi,
> >>>       I am using hbase-0.19 on a 20-node EC2 cluster.
> >>>      I have a map-reduce program which performs some analysis on each row.
> >>> When I process about 17k rows on the EC2 cluster, my job fails after
> >>> completing 65%. After going through the logs in the UI, we found out that
> >>> the job failed because of a Scanner Timeout Exception.
> >>>
> >>> My map function reads data from one table, 'table1', and performs the
> >>> analysis; if the analysis is completed, I mark the status of the row as
> >>> 'analyzed' (table1 has a column family called status), and I write the
> >>> result of the analyzed data into table2. (All this happens in my map
> >>> function; I have no reduce for this.)
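> >>>
> >>> The writes look roughly like this (a sketch against the 0.19 client API
> >>> from memory; 'status:state' and 'results:data' are made-up column names,
> >>> row is the current row key, and table1/table2 are HTable instances opened
> >>> in configure()):
> >>>
> >>>   // mark the row in table1 as analyzed
> >>>   // (BatchUpdate is org.apache.hadoop.hbase.io.BatchUpdate)
> >>>   BatchUpdate mark = new BatchUpdate(row);
> >>>   mark.put(Bytes.toBytes("status:state"), Bytes.toBytes("analyzed"));
> >>>   table1.commit(mark);
> >>>
> >>>   // write the analysis result to table2 under the same row key
> >>>   BatchUpdate result = new BatchUpdate(row);
> >>>   result.put(Bytes.toBytes("results:data"), analysisBytes);
> >>>   table2.commit(result);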
> >>>
> >>> I did go through the archives, where someone mentioned increasing the
> >>> region lease period, so I increased the lease period to 3600000 ms (1
> >>> hour). Despite that, I came across the Scanner Timeout Exception.
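> >>>
> >>> For reference, the entry in my hbase-site.xml looks like this (assuming
> >>> hbase.regionserver.lease.period is the right property name for 0.19):
> >>>
> >>>   <property>
> >>>     <name>hbase.regionserver.lease.period</name>
> >>>     <value>3600000</value>
> >>>   </property>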
> >>>
> >>> Your help will be greatly appreciated, as this scanner timeout exception
> >>> is a blocker to my application.
> >>>
> >>> Thanks,
> >>> Raakhi
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >
>
