[ https://issues.apache.org/jira/browse/NUTCH-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160973#comment-16160973 ]

ASF GitHub Bot commented on NUTCH-2375:
---------------------------------------

sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
URL: https://github.com/apache/nutch/pull/221#discussion_r138008013
 
 

 ##########
 File path: src/java/org/apache/nutch/crawl/CrawlDbReader.java
 ##########
 @@ -368,41 +367,46 @@ public void close() {
     closeReaders();
   }
 
-  private TreeMap<String, LongWritable> processStatJobHelper(String crawlDb, Configuration config, boolean sort) throws IOException{
+  private TreeMap<String, LongWritable> processStatJobHelper(String crawlDb, Configuration config, boolean sort)
+          throws IOException, InterruptedException, ClassNotFoundException{
         Path tmpFolder = new Path(crawlDb, "stat_tmp" + System.currentTimeMillis());
 
-         JobConf job = new NutchJob(config);
+         Job job = NutchJob.getInstance(config);
+          config = job.getConfiguration();
          job.setJobName("stats " + crawlDb);
-         job.setBoolean("db.reader.stats.sort", sort);
+         config.setBoolean("db.reader.stats.sort", sort);
 
         FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));
-         job.setInputFormat(SequenceFileInputFormat.class);
+         job.setInputFormatClass(SequenceFileInputFormat.class);
 
          job.setMapperClass(CrawlDbStatMapper.class);
          job.setCombinerClass(CrawlDbStatCombiner.class);
          job.setReducerClass(CrawlDbStatReducer.class);
 
          FileOutputFormat.setOutputPath(job, tmpFolder);
-         job.setOutputFormat(SequenceFileOutputFormat.class);
+         job.setOutputFormatClass(SequenceFileOutputFormat.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(LongWritable.class);
 
          // https://issues.apache.org/jira/browse/NUTCH-1029
-         job.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
-
-         JobClient.runJob(job);
+          config.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
 
+          try {
+            int complete = job.waitForCompletion(true)?0:1;
+          } catch (InterruptedException | ClassNotFoundException e) {
+            LOG.error(StringUtils.stringifyException(e));
+            throw e;
+          }
          // reading the result
          FileSystem fileSystem = tmpFolder.getFileSystem(config);
-         SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(config,
-                         tmpFolder);
+         MapFile.Reader[] readers = MapFileOutputFormat.getReaders(tmpFolder, config);
 
 Review comment:
  ... a SequenceFile.Reader is required to read it. Otherwise reading fails with
   ```
  Exception in thread "main" java.io.FileNotFoundException: File file:.../stat_tmp1505114561240/part-r-00000/data does not exist
          at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
          at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
          at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
          at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
          at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1820)
          at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:456)
          at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:429)
          at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:399)
          at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:408)
          at org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat.getReaders(MapFileOutputFormat.java:98)
          at org.apache.nutch.crawl.CrawlDbReader.processStatJobHelper(CrawlDbReader.java:402)
          at org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:444)
          at org.apache.nutch.crawl.CrawlDbReader.run(CrawlDbReader.java:740)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
          at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:792)
   ```
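
  One way the reviewer's point could be addressed is to read the job's SequenceFile output directly, since the new-API `org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat` offers no `getReaders()` helper. The following is only a sketch (the helper name `readStats` and the part-file filtering are illustrative, not from the PR), and it assumes Hadoop 2.x on the classpath:

  ```java
  import java.io.IOException;
  import java.util.TreeMap;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  // Hypothetical helper: open each part-r-* file with a SequenceFile.Reader
  // instead of MapFileOutputFormat.getReaders(), which expects MapFile output.
  static TreeMap<String, LongWritable> readStats(Path tmpFolder, Configuration config)
      throws IOException {
    TreeMap<String, LongWritable> stats = new TreeMap<>();
    FileSystem fs = tmpFolder.getFileSystem(config);
    for (FileStatus status : fs.listStatus(tmpFolder)) {
      if (!status.getPath().getName().startsWith("part-")) {
        continue; // skip _SUCCESS and other non-data files
      }
      try (SequenceFile.Reader reader = new SequenceFile.Reader(config,
          SequenceFile.Reader.file(status.getPath()))) {
        Text key = new Text();
        LongWritable value = new LongWritable();
        while (reader.next(key, value)) {
          stats.put(key.toString(), new LongWritable(value.get()));
        }
      }
    }
    return stats;
  }
  ```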
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-2375
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2375
>             Project: Nutch
>          Issue Type: Improvement
>          Components: deployment
>            Reporter: Omkar Reddy
>
> Nutch is still using the deprecated org.apache.hadoop.mapred API. It needs to 
> be updated to use the org.apache.hadoop.mapreduce API. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
