[ https://issues.apache.org/jira/browse/NUTCH-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16447358#comment-16447358 ]

ASF GitHub Bot commented on NUTCH-2570:
---------------------------------------

sebastian-nagel opened a new pull request #323: NUTCH-2570 Deduplication job 
fails to install deduplicated CrawlDb
URL: https://github.com/apache/nutch/pull/323
 
 
   - run a merge job to write the updated status of duplicates back to the CrawlDb
   - lock the CrawlDb while the merge job is running
   - clean up the temporary output if the merge job fails (flow sketched below)
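
In outline, the patch wraps the status-update step in a lock/run/cleanup
pattern. Below is a minimal sketch of that flow; DedupMergeSketch,
runMergeJob and installCrawlDb are hypothetical stand-ins for the real merge
job and CrawlDb.install(), not the actual patch:

{noformat}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DedupMergeSketch {

  public static void updateDuplicateStatus(Configuration conf, Path crawlDb,
      Path tempDir) throws IOException {
    FileSystem fs = crawlDb.getFileSystem(conf);
    Path lockFile = new Path(crawlDb, ".locked");

    // Lock the CrawlDb so no concurrent job rewrites it during the merge.
    if (!fs.createNewFile(lockFile)) {
      throw new IOException("CrawlDb " + crawlDb + " is already locked");
    }
    try {
      // Merge job: writes the CrawlDb with duplicates marked into tempDir.
      if (!runMergeJob(conf, crawlDb, tempDir)) {
        throw new IOException("merge job failed");
      }
      // Rotate tempDir into place ("current"), keeping "old" as backup.
      installCrawlDb(conf, crawlDb, tempDir);
    } catch (IOException e) {
      fs.delete(tempDir, true); // cleanup: remove partial merge output
      throw e;
    } finally {
      fs.delete(lockFile, false); // always release the lock
    }
  }

  // Stubs standing in for the real merge job and CrawlDb.install().
  private static boolean runMergeJob(Configuration conf, Path crawlDb,
      Path tempDir) {
    return true;
  }

  private static void installCrawlDb(Configuration conf, Path crawlDb,
      Path tempDir) {
  }
}
{noformat}

The important details are the try/finally around the lock and the deletion of
the temporary job output on failure, so a failed merge can no longer leave
the CrawlDb half-installed.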

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Deduplication job fails to install deduplicated CrawlDb
> -------------------------------------------------------
>
>                 Key: NUTCH-2570
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2570
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Critical
>             Fix For: 1.15
>
>
> The DeduplicationJob ("nutch dedup") fails to install the deduplicated 
> CrawlDb and leaves only the "old" crawldb (if "db.preserve.backup" is true):
> {noformat}
> % tree crawldb
> crawldb
> ├── current
> │   └── part-r-00000
> │       ├── data
> │       └── index
> └── old
>     └── part-r-00000
>         ├── data
>         └── index
> % bin/nutch dedup crawldb
> DeduplicationJob: starting at 2018-04-22 21:48:08
> Deduplication: 6 documents marked as duplicates
> Deduplication: Updating status of duplicate urls into crawl db.
> Exception in thread "main" java.io.FileNotFoundException: File file:/tmp/crawldb/1742327020 does not exist
>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
>         at org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:374)
>         at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:613)
>         at org.apache.nutch.util.FSUtils.replace(FSUtils.java:58)
>         at org.apache.nutch.crawl.CrawlDb.install(CrawlDb.java:212)
>         at org.apache.nutch.crawl.CrawlDb.install(CrawlDb.java:225)
>         at org.apache.nutch.crawl.DeduplicationJob.run(DeduplicationJob.java:366)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.crawl.DeduplicationJob.main(DeduplicationJob.java:379)
> % tree crawldb
> crawldb
> └── old
>     └── part-r-00000
>         ├── data
>         └── index
> {noformat}
> In pseudo-distributed mode it's even worse: only the "old" CrawlDb is left,
> and no error is reported at all.
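>
> Why does this leave only the "old" CrawlDb? Installing a new CrawlDb
> rotates directories: "current" is moved aside as the backup, then the
> job's temporary output is renamed into place. A minimal sketch of that
> rotation (InstallSketch is a hypothetical helper simplified from the
> stack trace, not the actual CrawlDb.install() code):
> {noformat}
> import java.io.IOException;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class InstallSketch {
>   public static void install(Configuration conf, Path crawlDb, Path tempDir)
>       throws IOException {
>     FileSystem fs = crawlDb.getFileSystem(conf);
>     Path current = new Path(crawlDb, "current");
>     Path old = new Path(crawlDb, "old");
>
>     fs.delete(old, true);    // drop the previous backup
>     fs.rename(current, old); // step 1 succeeds: "current" becomes "old"
>     // step 2 fails with FileNotFoundException when tempDir was never
>     // written (no merge job ran to produce it), leaving only "old":
>     fs.rename(tempDir, current);
>   }
> }
> {noformat}
> The stack trace above corresponds to step 2: FSUtils.replace() cannot
> rename the temporary directory file:/tmp/crawldb/1742327020 into place
> because it was never created.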



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
