[ https://issues.apache.org/jira/browse/NUTCH-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941090#comment-16941090 ]

ASF GitHub Bot commented on NUTCH-2279:
---------------------------------------

sebastian-nagel commented on pull request #478: NUTCH-2279 LinkRank fails when using Hadoop MR output compression
URL: https://github.com/apache/nutch/pull/478
 
 
   - read the output directory of the link counter job to determine the output file name (fail if there is no file or more than one)
   - determine the output codec and use it to read the output (see the sketch below)
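
A minimal sketch of that approach, assuming Hadoop's {{FileSystem#listStatus}} and {{CompressionCodecFactory}} (both real Hadoop APIs); the class and method names are made up for illustration and this is not the actual PR code:

{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CounterOutputReader {

  /**
   * Reads the first line of the single data file in a job output
   * directory, decompressing it if an output codec was used.
   */
  static String readCounterOutput(Configuration conf, Path outputDir)
      throws IOException {
    FileSystem fs = outputDir.getFileSystem(conf);
    // skip hidden and marker files such as _SUCCESS
    FileStatus[] files = fs.listStatus(outputDir,
        p -> !p.getName().startsWith("_") && !p.getName().startsWith("."));
    if (files.length != 1) {
      throw new IOException("Expected exactly one output file in "
          + outputDir + " but found " + files.length);
    }
    Path part = files[0].getPath();
    // pick the codec from the file extension (e.g. ".deflate");
    // null means the output is uncompressed
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(part);
    InputStream in = (codec != null)
        ? codec.createInputStream(fs.open(part))
        : fs.open(part);
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(in, StandardCharsets.UTF_8))) {
      return reader.readLine();
    }
  }
}
{code}

Resolving the codec from the file extension keeps the reader independent of which compression codec (deflate, gzip, ...) is configured, and of whether compression is enabled at all.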
 


> LinkRank fails when using Hadoop MR output compression
> ------------------------------------------------------
>
>                 Key: NUTCH-2279
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2279
>             Project: Nutch
>          Issue Type: Bug
>          Components: webgraph
>    Affects Versions: 1.12
>            Reporter: Joseph Naegele
>            Assignee: Sebastian Nagel
>            Priority: Major
>              Labels: patch-available
>             Fix For: 1.16
>
>
> When using MapReduce job output compression, i.e. {{mapreduce.output.fileoutputformat.compress=true}}, LinkRank can't read the results of its {{Counter}} MR job because of the additional file extension generated by the compression codec.
> For example, with the default compression codec (which appears to be DEFLATE), the counter file is written to {{crawl/webgraph/_num_nodes_/part-00000.deflate}}. The LinkRank job then attempts to read this file directly to obtain the number of links, using the following code:
> {code}
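> // the part file name is hard-coded and ignores any codec extension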
> FSDataInputStream readLinks = fs.open(new Path(numLinksPath, "part-00000"));
> {code}
> which fails because the file {{part-00000}} doesn't exist:
> {code}
> LinkAnalysis: java.io.FileNotFoundException: File crawl/webgraph/_num_nodes_/part-00000 does not exist
>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:819)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:596)
>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>         at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:124)
>         at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:633)
>         at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:713)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:680)
> {code}
> To reproduce, add {{-D mapreduce.output.fileoutputformat.compress=true}} to the properties when running {{bin/nutch linkrank ...}}.


