[ https://issues.apache.org/jira/browse/NUTCH-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453862#comment-16453862 ]

ASF GitHub Bot commented on NUTCH-2570:
---------------------------------------

sebastian-nagel closed pull request #323: NUTCH-2570 Deduplication job fails to install deduplicated CrawlDb
URL: https://github.com/apache/nutch/pull/323
 
 
   

This is a PR merged from a forked repository. Because GitHub hides the
original diff once a pull request from a fork is merged, the diff is
reproduced below for the sake of provenance:

diff --git a/src/java/org/apache/nutch/crawl/DeduplicationJob.java b/src/java/org/apache/nutch/crawl/DeduplicationJob.java
index 555f9e2eb..12ebd3c8b 100644
--- a/src/java/org/apache/nutch/crawl/DeduplicationJob.java
+++ b/src/java/org/apache/nutch/crawl/DeduplicationJob.java
@@ -265,7 +265,7 @@ public int run(String[] args) throws IOException {
     }
 
     String group = "none";
-    String crawldb = args[0];
+    Path crawlDb = new Path(args[0]);
     String compareOrder = "score,fetchTime,urlLength";
 
     for (int i = 1; i < args.length; i++) {
@@ -287,17 +287,16 @@ public int run(String[] args) throws IOException {
     long start = System.currentTimeMillis();
     LOG.info("DeduplicationJob: starting at " + sdf.format(start));
 
-    Path tempDir = new Path(getConf().get("mapreduce.cluster.temp.dir", ".")
-        + "/dedup-temp-"
+    Path tempDir = new Path(crawlDb, "dedup-temp-"
         + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
 
     Job job = NutchJob.getInstance(getConf());
     Configuration conf = job.getConfiguration();
-    job.setJobName("Deduplication on " + crawldb);
+    job.setJobName("Deduplication on " + crawlDb);
     conf.set(DEDUPLICATION_GROUP_MODE, group);
     conf.set(DEDUPLICATION_COMPARE_ORDER, compareOrder);
 
-    FileInputFormat.addInputPath(job, new Path(crawldb, CrawlDb.CURRENT_NAME));
+    FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));
     job.setInputFormatClass(SequenceFileInputFormat.class);
 
     FileOutputFormat.setOutputPath(job, tempDir);
@@ -341,28 +340,33 @@ public int run(String[] args) throws IOException {
      LOG.info("Deduplication: Updating status of duplicate urls into crawl db.");
     }
 
-    Path dbPath = new Path(crawldb);
-    Job mergeJob = CrawlDb.createJob(getConf(), dbPath);
+    Job mergeJob = CrawlDb.createJob(getConf(), crawlDb);
     FileInputFormat.addInputPath(mergeJob, tempDir);
     mergeJob.setReducerClass(StatusUpdateReducer.class);
+    mergeJob.setJarByClass(DeduplicationJob.class);
 
+    fs = crawlDb.getFileSystem(getConf());
+    Path outPath = FileOutputFormat.getOutputPath(job);
+    Path lock = CrawlDb.lock(getConf(), crawlDb, false);
     try {
-      boolean success = job.waitForCompletion(true);
+      boolean success = mergeJob.waitForCompletion(true);
       if (!success) {
         String message = "Crawl job did not succeed, job status:"
-            + job.getStatus().getState() + ", reason: "
-            + job.getStatus().getFailureInfo();
+            + mergeJob.getStatus().getState() + ", reason: "
+            + mergeJob.getStatus().getFailureInfo();
         LOG.error(message);
         fs.delete(tempDir, true);
+        NutchJob.cleanupAfterFailure(outPath, lock, fs);
         throw new RuntimeException(message);
       }
     } catch (IOException | InterruptedException | ClassNotFoundException e) {
       LOG.error("DeduplicationMergeJob: " + StringUtils.stringifyException(e));
       fs.delete(tempDir, true);
+      NutchJob.cleanupAfterFailure(outPath, lock, fs);
       return -1;
     }
 
-    CrawlDb.install(mergeJob, dbPath);
+    CrawlDb.install(mergeJob, crawlDb);
 
     // clean up
     fs.delete(tempDir, true);
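The core of the fix above is that the temporary output now lives under the CrawlDb directory itself, so the final rename performed by CrawlDb.install cannot cross filesystem boundaries. A standalone sketch of that idea in plain java.nio (not Hadoop and not Nutch code; all paths and names below are illustrative):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch: a rename-based "install" step only works reliably when the
// temp output and the target live on the same filesystem; creating the
// temp dir under the target directory guarantees exactly that.
public class InstallSketch {
  public static void main(String[] args) throws Exception {
    Path crawlDb = Files.createTempDirectory("crawldb");
    // Temp output goes under the CrawlDb itself, mirroring the patch.
    Path tempDir = Files.createDirectory(crawlDb.resolve("dedup-temp-42"));
    Files.writeString(tempDir.resolve("part-r-00000"), "deduped");
    // "Install": atomically move the new data into place; this is a
    // plain rename because both paths share one filesystem.
    Path current = crawlDb.resolve("current");
    Files.move(tempDir, current, StandardCopyOption.ATOMIC_MOVE);
    System.out.println(Files.exists(current.resolve("part-r-00000")));
  }
}
```

With the old code, the temp dir was resolved from mapreduce.cluster.temp.dir (defaulting to the local working directory), so the equivalent move could land on a different filesystem and fail.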


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


> Deduplication job fails to install deduplicated CrawlDb
> -------------------------------------------------------
>
>                 Key: NUTCH-2570
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2570
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Critical
>             Fix For: 1.15
>
>
> The DeduplicationJob ("nutch dedup") fails to install the deduplicated 
> CrawlDb and leaves only the "old" crawldb (if "db.preserve.backup" is true):
> {noformat}
> % tree crawldb
> crawldb
> ├── current
> │   └── part-r-00000
> │       ├── data
> │       └── index
> └── old
>     └── part-r-00000
>         ├── data
>         └── index
> % bin/nutch dedup crawldb
> DeduplicationJob: starting at 2018-04-22 21:48:08
> Deduplication: 6 documents marked as duplicates
> Deduplication: Updating status of duplicate urls into crawl db.
> Exception in thread "main" java.io.FileNotFoundException: File file:/tmp/crawldb/1742327020 does not exist
> at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
> at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:374)
> at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:613)
> at org.apache.nutch.util.FSUtils.replace(FSUtils.java:58)
> at org.apache.nutch.crawl.CrawlDb.install(CrawlDb.java:212)
> at org.apache.nutch.crawl.CrawlDb.install(CrawlDb.java:225)
> at org.apache.nutch.crawl.DeduplicationJob.run(DeduplicationJob.java:366)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.crawl.DeduplicationJob.main(DeduplicationJob.java:379)
> % tree crawldb
> crawldb
> └── old
>     └── part-r-00000
>         ├── data
>         └── index
> {noformat}
> In pseudo-distributed mode it's even worse: only the "old" CrawlDb is left, without any error being reported.
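The stack trace above bottoms out in a rename inside FSUtils.replace on a path that was created somewhere else. Besides relocating the temp dir, the patch also acquires the CrawlDb lock and, on a failed merge job, removes the partial output and releases the lock so a retry does not find stale state. A hedged, standalone sketch of that cleanup pattern in plain java.nio (the method name mirrors NutchJob.cleanupAfterFailure, but everything below is illustrative, not Nutch code):

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: on job failure, delete the partial output and release the
// lock file so the CrawlDb is left in a consistent, retryable state.
public class CleanupSketch {
  // Illustrative stand-in for the cleanup the patch wires in.
  static void cleanupAfterFailure(Path outPath, Path lock) throws Exception {
    Files.deleteIfExists(outPath); // partial job output
    Files.deleteIfExists(lock);    // CrawlDb lock file
  }

  public static void main(String[] args) throws Exception {
    Path db = Files.createTempDirectory("crawldb");
    Path lock = Files.createFile(db.resolve(".locked"));
    Path out = Files.createFile(db.resolve("partial-output"));
    cleanupAfterFailure(out, lock);
    System.out.println(Files.exists(out) + " " + Files.exists(lock));
  }
}
```

In the real job the same cleanup runs both when waitForCompletion returns false and when it throws, which is why the diff adds the call in both branches.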



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
