[ https://issues.apache.org/jira/browse/HADOOP-3707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12612672#action_12612672 ]

Raghu Angadi commented on HADOOP-3707:
--------------------------------------


> I think you don't need the approximation,

I don't see how it can possibly be accurate without a lot more code and more 
memory to maintain extra state in the NameNode. For example, say the NameNode 
returns 3 datanodes to a client to write to, and then the client immediately 
dies. When no such failures occur, the count is accurate.
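
To show what I mean by "approximate", here is a rough sketch (not the patch; 
the class name, method names, and the 10-minute roll interval are all made up 
for illustration): the NN bumps a per-datanode count when it schedules a write, 
decrements it when the block is reported, and rolls the count periodically so 
that increments leaked by dead clients age out instead of accumulating forever.

{noformat}
// Sketch only -- illustrates why the count can only be approximate.
// Increments from clients that die are never matched by a decrement,
// so the counter is rolled periodically and stale counts expire after
// roughly two intervals.
class ScheduledBlockCounter {                        // hypothetical name
  private static final long ROLL_INTERVAL_MS = 10 * 60 * 1000;  // assumed value

  private int curr;   // blocks scheduled since the last roll
  private int prev;   // blocks scheduled in the previous interval
  private long lastRoll = System.currentTimeMillis();

  // Called when the NN schedules a write to this datanode.
  synchronized void incBlocksScheduled() {
    maybeRoll();
    curr++;
  }

  // Called when the scheduled block is actually reported by the datanode.
  synchronized void decBlocksScheduled() {
    if (curr > 0) {
      curr--;
    } else if (prev > 0) {
      prev--;
    }
    // else: the block was scheduled long ago and has already aged out.
  }

  // Approximate number of blocks currently in flight to this datanode.
  synchronized int getBlocksScheduled() {
    maybeRoll();
    return curr + prev;
  }

  private void maybeRoll() {
    long now = System.currentTimeMillis();
    if (now - lastRoll > ROLL_INTERVAL_MS) {
      prev = curr;   // counts never decremented expire after the next roll
      curr = 0;
      lastRoll = now;
    }
  }
}
{noformat}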

bq. increment when block is scheduled for replication and when a new block is 
allocated, this can be in one common place at chooseTargets().

chooseTarget() is not always called to schedule writes, and the blocks 
allocated may not actually be used by the caller of chooseTargets(), so the 
patch makes the increment in two places instead of one. As an alternative, we 
could have a common method that is invoked whenever the NN asks a DN or a 
client to write to a DN.
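
To make that alternative concrete, something along these lines (again just a 
sketch with made-up names; blocksScheduledForWrite() and the reuse of the 
counter sketched above are illustrative only):

{noformat}
// Sketch of the "common method" alternative: every path on which the NN
// hands out write targets -- a client allocating a new block, or a
// replication request sent to a datanode -- funnels through one helper
// that bumps the per-target scheduled count, rather than hooking
// chooseTargets().
class WriteScheduling {
  /** Call wherever the NN tells a client or a DN to write to 'targets'. */
  static void blocksScheduledForWrite(ScheduledBlockCounter[] targets) {
    for (ScheduledBlockCounter target : targets) {
      target.incBlocksScheduled();   // counter from the sketch above
    }
  }
}
{noformat}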

> Maybe we can solve this using simpler things. Like
Maybe, though I don't see why the NameNode can't do this itself, like this patch does.

> slow down the replication scheduler.
I think slowing down activities is a pretty defensive approach. Also, 
replication is not the only cause: should we slow down client writes too?

I don't mean to say this is the best solution. Is there a better solution that 
is comparably simple and safe for 0.17 and 0.18?


> Frequent DiskOutOfSpaceException on almost-full datanodes
> ---------------------------------------------------------
>
>                 Key: HADOOP-3707
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3707
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.0
>            Reporter: Koji Noguchi
>            Assignee: Raghu Angadi
>            Priority: Blocker
>             Fix For: 0.17.2, 0.18.0, 0.19.0
>
>         Attachments: HADOOP-3707-branch-017.patch, 
> HADOOP-3707-branch-017.patch, HADOOP-3707-trunk.patch, 
> HADOOP-3707-trunk.patch, HADOOP-3707-trunk.patch
>
>
> On a datanode which is completely full (leaving only the reserved space), we 
> frequently see the target node reporting:
> {noformat}
> 2008-07-07 16:54:44,707 INFO org.apache.hadoop.dfs.DataNode: Receiving block 
> blk_3328886742742952100 src: /11.1.11.111:22222 dest: /11.1.11.111:22222
> 2008-07-07 16:54:44,708 INFO org.apache.hadoop.dfs.DataNode: writeBlock 
> blk_3328886742742952100 received exception 
> org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: Insufficient 
> space for an additional block
> 2008-07-07 16:54:44,708 ERROR org.apache.hadoop.dfs.DataNode: 
> 33.3.33.33:22222:DataXceiver: 
> org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: Insufficient 
> space for an additional block
>         at 
> org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getNextVolume(FSDataset.java:444)
>         at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:716)
>         at 
> org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:2187)
>         at 
> org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1113)
>         at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:976)
>         at java.lang.Thread.run(Thread.java:619)
> {noformat}
> Sender reporting 
> {noformat}
> 2008-07-07 16:54:44,712 INFO org.apache.hadoop.dfs.DataNode: 
> 11.1.11.111:22222:Exception writing block blk_3328886742742952100 to mirror 
> 33.3.33.33:22222
> java.io.IOException: Broken pipe
>         at sun.nio.ch.FileDispatcher.write0(Native Method)
>         at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
>         at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
>         at sun.nio.ch.IOUtil.write(IOUtil.java:75)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
>         at 
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:53)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>         at 
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:144)
>         at 
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:105)
>         at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>         at 
> org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveChunk(DataNode.java:2292)
>         at 
> org.apache.hadoop.dfs.DataNode$BlockReceiver.receivePacket(DataNode.java:2411)
>         at 
> org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2476)
>         at 
> org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1204)
>         at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:976)
>         at java.lang.Thread.run(Thread.java:619)
> {noformat}
> Since it's not happening constantly, my guess is that whenever the datanode 
> gets a little space available, the namenode over-assigns blocks to it, which 
> can fail the block pipeline.
> (Note: before 0.17, the namenode was much slower in assigning blocks.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
