[ https://issues.apache.org/jira/browse/NUTCH-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160972#comment-16160972 ]

ASF GitHub Bot commented on NUTCH-2375:
---------------------------------------

sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
URL: https://github.com/apache/nutch/pull/221#discussion_r138018731
 
 

 ##########
 File path: src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java
 ##########
 @@ -29,73 +29,84 @@
 import org.apache.hadoop.io.Writable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.io.SequenceFile.CompressionType;
-import org.apache.hadoop.mapred.FileOutputFormat;
+import org.apache.hadoop.util.Progressable;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.mapred.InvalidJobConfException;
-import org.apache.hadoop.mapred.OutputFormat;
-import org.apache.hadoop.mapred.RecordWriter;
-import org.apache.hadoop.mapred.JobConf;
-import org.apache.hadoop.mapred.Reporter;
-import org.apache.hadoop.mapred.SequenceFileOutputFormat;
+import org.apache.hadoop.mapreduce.OutputFormat;
+import org.apache.hadoop.mapreduce.RecordWriter;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.hadoop.mapreduce.JobContext;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapred.FileSplit;
 import org.apache.hadoop.util.Progressable;
 import org.apache.nutch.parse.Parse;
 import org.apache.nutch.parse.ParseOutputFormat;
 import org.apache.nutch.protocol.Content;
 
 /** Splits FetcherOutput entries into multiple map files. */
-public class FetcherOutputFormat implements OutputFormat<Text, NutchWritable> {
+public class FetcherOutputFormat extends FileOutputFormat<Text, NutchWritable> {
 
-  public void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException {
+  @Override
+  public void checkOutputSpecs(JobContext job) throws IOException {
+    Configuration conf = job.getConfiguration();
+    FileSystem fs = FileSystem.get(conf);
     Path out = FileOutputFormat.getOutputPath(job);
     if ((out == null) && (job.getNumReduceTasks() != 0)) {
-      throw new InvalidJobConfException("Output directory not set in JobConf.");
+      throw new InvalidJobConfException("Output directory not set in conf.");
     }
     if (fs == null) {
-      fs = out.getFileSystem(job);
+      fs = out.getFileSystem(conf);
     }
     if (fs.exists(new Path(out, CrawlDatum.FETCH_DIR_NAME)))
       throw new IOException("Segment already fetched!");
   }
 
-  public RecordWriter<Text, NutchWritable> getRecordWriter(final FileSystem fs,
-      final JobConf job, final String name, final Progressable progress)
+  @Override
+  public RecordWriter<Text, NutchWritable> getRecordWriter(TaskAttemptContext context)
           throws IOException {
 
-    Path out = FileOutputFormat.getOutputPath(job);
+    Configuration conf = context.getConfiguration();
+    String name = context.getJobName();//getTaskAttemptID().toString();
+    Path dir = FileOutputFormat.getOutputPath(context);
+    FileSystem fs = dir.getFileSystem(context.getConfiguration());
+    Path out = FileOutputFormat.getOutputPath(context);
 
 Review comment:
   This will change the output folder structure and will probably cause collisions of output folders when run in distributed mode (on a Hadoop cluster). The directory tree of a segment should look as before:
   ```
   crawl/segments/20170816093452/
   |-- content
   |   `-- part-00000
   |       |-- data
   |       `-- index
   |-- crawl_fetch
   |   `-- part-00000
   |       |-- data
   |       `-- index
   |-- crawl_generate
   |   `-- part-00000
   |-- crawl_parse
   |   `-- part-00000
   |-- parse_data
   |   `-- part-00000
   |       |-- data
   |       `-- index
   `-- parse_text
       `-- part-00000
           |-- data
           `-- index
   ```
   
   There will be changes due to the MapReduce upgrade (part-xxxxx -> part-r-xxxxx). The tree is now:
   ```
   crawl/segments/20170911103223/
   |-- content
   |   `-- FetchData
   |       |-- data
   |       `-- index
   |-- crawl_fetch
   |   `-- FetchData
   |       |-- data
   |       `-- index
   |-- crawl_generate
   |   `-- part-r-00000
   |-- crawl_parse
   |   `-- parse\ crawl
   |       `-- segments
   |           `-- 20170911103223
   |-- parse_data
   |   `-- parse\ crawl
   |       `-- segments
   |           `-- 20170911103223
   |               |-- data
   |               `-- index
   `-- parse_text
       `-- parse\ crawl
           `-- segments
               `-- 20170911103223
                   |-- data
                   `-- index
   ```
   
   which makes a crawl fail, e.g. with
   ```
   CrawlDb update: java.io.FileNotFoundException: File file:.../crawl/segments/20170911103223/crawl_parse/parse crawl/data does not exist
           at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
           at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
           at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
           at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
   ```
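   
   The collisions come from the `name` computed above: `context.getJobName()` returns the same string for every task of a job, while the old mapred API passed a per-partition name such as "part-00000". A minimal sketch of one possible fix (not the code in this PR) is to derive the name from the task attempt via `FileOutputFormat.getUniqueFile`, which yields the familiar `part-r-NNNNN`:
   ```
   import org.apache.hadoop.mapreduce.TaskAttemptContext;
   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
   
   // Inside getRecordWriter(TaskAttemptContext context):
   // getUniqueFile() builds "<base>-<m|r>-<5-digit partition>", e.g. "part-r-00000",
   // so each task writes its own map file and the segment layout stays as before.
   String name = FileOutputFormat.getUniqueFile(context, "part", "");
   ```
   This keeps names unique per task and would restore the expected directory tree.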
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Upgrade the code base from org.apache.hadoop.mapred to 
> org.apache.hadoop.mapreduce
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-2375
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2375
>             Project: Nutch
>          Issue Type: Improvement
>          Components: deployment
>            Reporter: Omkar Reddy
>
> Nutch is still using the org.apache.hadoop.mapred dependency, which has been 
> deprecated. It needs to be updated to the org.apache.hadoop.mapreduce 
> dependency.
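
For context, the upgrade replaces the old JobConf/JobClient driver style with the org.apache.hadoop.mapreduce Job API. Below is a minimal sketch of the new-style driver (illustrative only; NewApiJobSketch is a hypothetical class name, not Nutch's actual code):
```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewApiJobSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Old API: JobConf job = new JobConf(conf); ... JobClient.runJob(job);
    // New API: a single mapreduce.Job carries configuration and submission.
    Job job = Job.getInstance(conf, "new-api-sketch");
    job.setJarByClass(NewApiJobSketch.class);
    // Defaults (identity Mapper/Reducer over TextInputFormat) emit
    // LongWritable keys and Text values, so declare those output types.
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // waitForCompletion() replaces JobClient.runJob(); true prints progress.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```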



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
