[jira] Commented: (HADOOP-6072) distcp should place the file distcp_src_files in distributed cache

Doug Cutting (JIRA) Thu, 18 Jun 2009 12:58:33 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721437#action_12721437
 ]


Doug Cutting commented on HADOOP-6072:
--------------------------------------

> Is that still OK for namenode's perf?

This should not be a problem for the namenode.  It would be best to write the 
file first with normal replication, then increase its replication, to avoid an 
overly-long HDFS write pipeline.

The rationale for sqrt is that a two-stage fanout is done: first from the 
original to the replicas, then from the replicas to the maps.  Sqrt(maps) uses 
approximately the same fanout factor at each stage, minimizing the number of 
datanode clients (the presumed bottleneck here).
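
The sqrt rationale above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not code from DistCp: the class name, method names, and the clamping bounds (3 and 512) are hypothetical. With r replicas and m maps, the first stage fans out to r replicas and the second stage has roughly m / r readers per replica, so choosing r near sqrt(m) keeps the two fanouts about equal.

```java
// Hypothetical helper, not part of Hadoop: picks a replication factor
// close to sqrt(numMaps) so both fanout stages are about the same size.
final class ReplicationFanout {
    private ReplicationFanout() {}

    /** Replication near sqrt(numMaps), clamped to [minReplication, maxReplication]. */
    static short replicationFor(int numMaps, short minReplication, short maxReplication) {
        int r = (int) Math.ceil(Math.sqrt(Math.max(numMaps, 1)));
        return (short) Math.min(maxReplication, Math.max(minReplication, r));
    }

    /** Approximate number of map tasks reading from each replica (ceiling division). */
    static int readersPerReplica(int numMaps, int replication) {
        return (numMaps + replication - 1) / replication;
    }

    public static void main(String[] args) {
        int numMaps = 10_000;
        // The bounds 3 and 512 are illustrative assumptions, not values from the issue.
        short r = replicationFor(numMaps, (short) 3, (short) 512);
        System.out.println("replication=" + r
                + " readersPerReplica=" + readersPerReplica(numMaps, r));
    }
}
```

Following the suggestion above, the file would first be written with normal replication, and the computed factor applied afterwards, for example through Hadoop's `FileSystem.setReplication(Path, short)`.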

> distcp should place the file distcp_src_files in distributed cache
> ------------------------------------------------------------------
>
>                 Key: HADOOP-6072
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6072
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 0.21.0
>            Reporter: Ravi Gummadi
>             Fix For: 0.21.0
>
>
> When a large number of files are being copied by distcp, accessing 
> distcp_src_files seems to be an issue, as all map tasks would be accessing 
> this file. The error message seen is:
> 09/06/16 10:13:16 INFO mapred.JobClient: Task Id : attempt_200906040559_0110_m_003348_0, Status : FAILED
> java.io.IOException: Could not obtain block: blk_-4229860619941366534_1500174 file=/mapredsystem/hadoop/mapredsystem/distcp_7fiyvq/_distcp_src_files
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1757)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1585)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1712)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
>         at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
>         at org.apache.hadoop.tools.DistCp$CopyInputFormat.getRecordReader(DistCp.java:299)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:336)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
> This could be because of HADOOP-6038 and/or HADOOP-4681.
> If distcp places this special file distcp_src_files in distributed cache, 
> that could solve the problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
