IIRC, enabling symlink creation for your cached files should solve the problem.
Call DistributedCache.createSymlink(conf); on your job configuration before
submitting the job.
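Roughly like the sketch below (untested against 0.17, so treat it as a sketch; the `#data`/`#index` fragment names are my own choice, and the Hadoop-specific calls are shown in comments since they need the Hadoop jars). The fragment after `#` on a cache URI is what names the symlink in the task's working directory, and opening the file through the local filesystem avoids the DistributedFileSystem lookup your stack trace shows:

```java
import java.net.URI;

// Sketch only: the DistributedCache/MapFile calls are in comments because
// they need the Hadoop 0.17 jars; the URI handling itself is plain Java.
public class CacheSymlinkSketch {
    public static void main(String[] args) throws Exception {
        // Before submitting the job:
        //   DistributedCache.createSymlink(conf);
        //   DistributedCache.addCacheFile(new URI("/2008-12-19/url/data#data"), conf);
        //   DistributedCache.addCacheFile(new URI("/2008-12-19/url/index#index"), conf);
        // The fragment after '#' becomes the symlink name in the task's
        // working directory:
        URI dataUri = new URI("/2008-12-19/url/data#data");
        System.out.println("HDFS path:    " + dataUri.getPath());     // /2008-12-19/url/data
        System.out.println("symlink name: " + dataUri.getFragment()); // data

        // In the mapper's configure(), open the MapFile through the *local*
        // filesystem -- the stack trace in this thread shows
        // DistributedFileSystem resolving a local path, which is why the
        // file is reported missing. With both symlinks in the working
        // directory, "." is the MapFile directory:
        //   FileSystem localFs = FileSystem.getLocal(conf);
        //   MapFile.Reader reader = new MapFile.Reader(localFs, ".", conf);
    }
}
```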



On 12/25/08 10:40 AM, "Sean Shanny" <[email protected]> wrote:

> To all,
> 
> Version:  hadoop-0.17.2.1-core.jar
> 
> I created a MapFile on a local node.
> 
> I put the files into HDFS using the following commands:
> 
> $ bin/hadoop fs -copyFromLocal /tmp/ur/data    /2008-12-19/url/data
> $ bin/hadoop fs -copyFromLocal /tmp/ur/index  /2008-12-19/url/index
> 
> and placed them in the DistributedCache using the following calls in
> the JobConf class:
> 
> DistributedCache.addCacheFile(new URI("/2008-12-19/url/data"), conf);
> DistributedCache.addCacheFile(new URI("/2008-12-19/url/index"), conf);
> 
> What I cannot figure out is how to access the MapFile from within my
> map code.  I tried the following, but I am getting file-not-found
> errors when I try to run the job.
> 
> private FileSystem     fs;
> private MapFile.Reader myReader;
> private Path[]         localFiles;
> 
> ....
> 
>   public void configure(JobConf conf)
>      {
>          String[] s = conf.getStrings("map.input.file");
>          m_sFileName = s[0];
> 
>         try
>          {
>              localFiles = DistributedCache.getLocalCacheFiles(conf);
> 
>              for (Path localFile : localFiles)
>              {
>                  String sFileName = localFile.getName();
> 
>                  if (sFileName.equalsIgnoreCase("data"))
>                  {
>                      System.out.println("Full Path: " + localFile.toString());
>                      System.out.println("Parent: " + localFile.getParent().toString());
> 
>                      fs = FileSystem.get(localFile.toUri(), conf);
>                      myReader = new MapFile.Reader(fs, localFile.getParent().toString(), conf);
>                  }
>              }
>          }
>          catch (IOException e)
>          {
>              // TODO Auto-generated catch block
>              e.printStackTrace();
>          }
> 
> The following exception is thrown, and I cannot figure out why it is
> adding the extra "data" element at the end of the path.  The data is
> actually at
> 
> Task Logs: 'task_200812250002_0001_m_000000_0'
> 
> stdout logs
> Full Path: /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data/data
> Parent: /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data
> stderr logs
> java.io.FileNotFoundException: File does not exist: /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data/data
>     at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:369)
>     at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:628)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1426)
>     at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:301)
>     at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:283)
>     at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:272)
>     at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:259)
>     at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:252)
>     at com.TripResearch.warehouse.etl.EtlTestUrlMapLookup.configure(EtlTestUrlMapLookup.java:84)
>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>     at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:215)
>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)
> 
> The files do exist, but I don't understand why they were placed in
> their own directories.  I would have expected both files to exist at
> /2008-12-19/url/ not /2008-12-19/url/data/ and /2008-12-19/url/index/
> 
> ls -la /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data
> total 740640
> drwxr-xr-x 2 root root 4096 Dec 24 23:49 .
> drwxr-xr-x 4 root root 4096 Dec 24 23:49 ..
> -rwxr-xr-x 1 root root 751776245 Dec 24 23:49 data
> -rw-r--r-- 1 root root 5873260 Dec 24 23:49 .data.crc 
> 
> [r...@hdp01n warehouse]# ls -la /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/index
> total 2148
> drwxr-xr-x 2 root root    4096 Dec 25 00:04 .
> drwxr-xr-x 4 root root    4096 Dec 25 00:04 ..
> -rwxr-xr-x 1 root root 2165220 Dec 25 00:04 index
> -rw-r--r-- 1 root root   16924 Dec 25 00:04 .index.crc
> 
> ....
> 
> I know I must be doing something really stupid here, as I am sure this
> has been done by lots of folks prior to my feeble attempt.  I did a
> Google search but could not really come up with any examples of using
> a MapFile with the DistributedCache.
> 
> Thanks.
> 
> --sean

