[jira] [Commented] (NUTCH-2279) LinkRank fails when using Hadoop MR output compression

Hudson (Jira) Tue, 01 Oct 2019 08:00:18 -0700


    [ https://issues.apache.org/jira/browse/NUTCH-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942048#comment-16942048 ]


Hudson commented on NUTCH-2279:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-trunk #3649 (See 
[https://builds.apache.org/job/Nutch-trunk/3649/])
NUTCH-2279 LinkRank fails when using Hadoop MR output compression - read 
(snagel: 
[https://github.com/apache/nutch/commit/03475276204cb0a31f1f5f0b6a547d3c92c6a799])
* (edit) src/java/org/apache/nutch/scoring/webgraph/LinkRank.java


> LinkRank fails when using Hadoop MR output compression
> ------------------------------------------------------
>
>                 Key: NUTCH-2279
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2279
>             Project: Nutch
>          Issue Type: Bug
>          Components: webgraph
>    Affects Versions: 1.12
>            Reporter: Joseph Naegele
>            Assignee: Sebastian Nagel
>            Priority: Major
>              Labels: patch-available
>             Fix For: 1.16
>
>
> When using MapReduce job output compression, i.e. 
> {{mapreduce.output.fileoutputformat.compress=true}}, LinkRank can't read the 
> results of its {{Counter}} MR job due to the additional, generated file 
> extension.
> For example, using the default compression codec (which appears to be 
> DEFLATE), the counter file is written to 
> {{crawl/webgraph/_num_nodes_/part-00000.deflate}}. Then, the LinkRank job 
> attempts to manually read this file to obtain the number of links using the 
> following code:
> {code}
> FSDataInputStream readLinks = fs.open(new Path(numLinksPath, "part-00000"));
> {code}
> which fails because the file {{part-00000}} doesn't exist:
> {code}
> LinkAnalysis: java.io.FileNotFoundException: File crawl/webgraph/_num_nodes_/part-00000 does not exist
>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:819)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:596)
>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>         at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:124)
>         at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:633)
>         at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:713)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:680)
> {code}
> To reproduce, add {{-D mapreduce.output.fileoutputformat.compress=true}} to 
> the properties for {{bin/nutch linkrank ...}}
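
The failure described above comes from hard-coding the name part-00000
while the compression codec appends its own extension. The following is a
minimal sketch of one way to avoid that, not the change that was committed
to LinkRank.java. It uses only the JDK and the local file system, so the
class PartFileReader, its methods, and the "key<TAB>count" line layout are
all hypothetical names and assumptions made for illustration. It also
assumes that a .deflate file is a zlib stream that
java.util.zip.InflaterInputStream can read.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, not part of Nutch or Hadoop.
public class PartFileReader {

  /**
   * Opens baseName in dir, tolerating a codec extension on the file name.
   * Only the plain name, ".deflate" and ".gz" are handled in this sketch.
   */
  static InputStream openPartFile(Path dir, String baseName) throws IOException {
    // The glob matches "part-00000" and "part-00000.<ext>", but not hidden
    // checksum files, whose names start with a dot.
    try (DirectoryStream<Path> matches = Files.newDirectoryStream(dir, baseName + "*")) {
      for (Path p : matches) {
        String name = p.getFileName().toString();
        if (name.equals(baseName)) {
          return Files.newInputStream(p);
        } else if (name.equals(baseName + ".deflate")) {
          // Assumption: the file holds a zlib stream.
          return new InflaterInputStream(Files.newInputStream(p));
        } else if (name.equals(baseName + ".gz")) {
          return new GZIPInputStream(Files.newInputStream(p));
        }
      }
    }
    throw new NoSuchFileException(dir.resolve(baseName).toString());
  }

  /**
   * Reads the first line of the counter output and returns its last
   * whitespace-separated field as a number.
   * Assumption: the line looks like "key<TAB>count".
   */
  static long readCount(Path dir) throws IOException {
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        openPartFile(dir, "part-00000"), StandardCharsets.UTF_8))) {
      String line = reader.readLine();
      if (line == null) {
        throw new IOException("Empty counter file in " + dir);
      }
      String[] fields = line.trim().split("\\s+");
      return Long.parseLong(fields[fields.length - 1]);
    }
  }

  public static void main(String[] args) throws IOException {
    // Usage: java PartFileReader crawl/webgraph/_num_nodes_
    System.out.println(readCount(Paths.get(args[0])));
  }
}
```

Inside a Hadoop job the same idea would more naturally be expressed with
Hadoop's own classes, for example by looking up the codec for the actual
file name via CompressionCodecFactory.getCodec(Path) and wrapping the
stream returned by FileSystem.open, so that any configured codec is
covered rather than the two extensions handled here.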



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
