[ 
https://issues.apache.org/jira/browse/NUTCH-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941269#comment-16941269
 ] 

ASF GitHub Bot commented on NUTCH-2279:
---------------------------------------

lewismc commented on issue #478: NUTCH-2279 LinkRank fails when using Hadoop MR 
output compression
URL: https://github.com/apache/nutch/pull/478#issuecomment-536709615
 
 
   I've never come across this issue. What does the error trace look like when 
it occurs? Is a test required?
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> LinkRank fails when using Hadoop MR output compression
> ------------------------------------------------------
>
>                 Key: NUTCH-2279
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2279
>             Project: Nutch
>          Issue Type: Bug
>          Components: webgraph
>    Affects Versions: 1.12
>            Reporter: Joseph Naegele
>            Assignee: Sebastian Nagel
>            Priority: Major
>              Labels: patch-available
>             Fix For: 1.16
>
>
> When using MapReduce job output compression, i.e. 
> {{mapreduce.output.fileoutputformat.compress=true}}, LinkRank can't read the 
> results of its {{Counter}} MR job due to the additional, generated file 
> extension.
> For example, using the default compression codec (which appears to be 
> DEFLATE), the counter file is written to 
> {{crawl/webgraph/_num_nodes_/part-00000.deflate}}. Then, the LinkRank job 
> attempts to manually read this file to obtain the number of links using the 
> following code:
> {code}
> FSDataInputStream readLinks = fs.open(new Path(numLinksPath, "part-00000"));
> {code}
> which fails because the file {{part-00000}} doesn't exist:
> {code}
> LinkAnalysis: java.io.FileNotFoundException: File crawl/webgraph/_num_nodes_/part-00000 does not exist
>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:819)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:596)
>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>         at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:124)
>         at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:633)
>         at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:713)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:680)
> {code}
> To reproduce, add {{-D mapreduce.output.fileoutputformat.compress=true}} to 
> the properties for {{bin/nutch linkrank ...}}
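
For illustration only, here is a minimal sketch of the general idea behind a possible workaround: locate the counter output by prefix rather than by exact name, and decompress it when the extension indicates DEFLATE. This is not the patch attached to the issue. It uses only the JDK on a local directory instead of the Hadoop FileSystem API; the class name CounterFileReader and the method readFirstLine are hypothetical, and it assumes the ".deflate" file is a zlib stream that java.util.zip can read and that the counter output is a text file.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, not part of Nutch or Hadoop.
final class CounterFileReader {

    // Returns the first line of the first file in dir whose name starts with
    // "part-00000", whether or not a compression extension was appended.
    static String readFirstLine(Path dir) throws IOException {
        Path part = null;
        // The glob is matched against the file name only, so hidden checksum
        // files such as ".part-00000.crc" are not picked up.
        try (DirectoryStream<Path> matches = Files.newDirectoryStream(dir, "part-00000*")) {
            for (Path candidate : matches) {
                part = candidate;
                break;
            }
        }
        if (part == null) {
            throw new FileNotFoundException("No part-00000* file in " + dir);
        }

        InputStream raw = Files.newInputStream(part);
        // Assumption: a ".deflate" extension means a zlib-wrapped DEFLATE stream.
        // Other codecs (gzip, bzip2, ...) are not handled in this sketch.
        InputStream in = part.getFileName().toString().endsWith(".deflate")
                ? new InflaterInputStream(raw)
                : raw;
        // Closing the reader also closes the wrapped streams.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
    }
}
```

Inside a Hadoop job the same idea would more naturally go through Hadoop's own codec lookup, so that any configured codec is handled rather than DEFLATE alone.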



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
