[ 
https://issues.apache.org/jira/browse/NUTCH-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941290#comment-16941290
 ] 

ASF GitHub Bot commented on NUTCH-2279:
---------------------------------------

sebastian-nagel commented on issue #478: NUTCH-2279 LinkRank fails when using 
Hadoop MR output compression
URL: https://github.com/apache/nutch/pull/478#issuecomment-536723448
 
 
   @naegelejd added the stack trace in Jira. A test isn't really necessary; I 
just tried it in local mode with mapreduce.output.fileoutputformat.compress 
both enabled and disabled. I will also test later in pseudo-distributed mode 
when testing the release candidate.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> LinkRank fails when using Hadoop MR output compression
> ------------------------------------------------------
>
>                 Key: NUTCH-2279
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2279
>             Project: Nutch
>          Issue Type: Bug
>          Components: webgraph
>    Affects Versions: 1.12
>            Reporter: Joseph Naegele
>            Assignee: Sebastian Nagel
>            Priority: Major
>              Labels: patch-available
>             Fix For: 1.16
>
>
> When using MapReduce job output compression, i.e. 
> {{mapreduce.output.fileoutputformat.compress=true}}, LinkRank can't read the 
> results of its {{Counter}} MR job due to the additional, generated file 
> extension.
> For example, using the default compression codec (which appears to be 
> DEFLATE), the counter file is written to 
> {{crawl/webgraph/_num_nodes_/part-00000.deflate}}. Then, the LinkRank job 
> attempts to manually read this file to obtain the number of links using the 
> following code:
> {code}
> FSDataInputStream readLinks = fs.open(new Path(numLinksPath, "part-00000"));
> {code}
> which fails because the file {{part-00000}} doesn't exist:
> {code}
> LinkAnalysis: java.io.FileNotFoundException: File crawl/webgraph/_num_nodes_/part-00000 does not exist
>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:819)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:596)
>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>         at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:124)
>         at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:633)
>         at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:713)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:680)
> {code}
> To reproduce, add {{-D mapreduce.output.fileoutputformat.compress=true}} to 
> the properties for {{bin/nutch linkrank ...}}
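
The failure described above comes from hard-coding the name part-00000 while the compressed output carries an extra extension. A minimal sketch of one way to avoid that is to look the file up by prefix and pick a decompressor from the extension. This is not the change made in Nutch: it uses plain java.nio on a local directory instead of the Hadoop FileSystem API, the class and method names (CounterOutputReader, findPartFile, openDecompressed, readFirstLine) are hypothetical, and it assumes the .deflate file is a zlib-format stream. With Hadoop itself, the codec lookup would more likely go through CompressionCodecFactory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, for illustration only.
class CounterOutputReader {

  // Returns the first file in dir whose name starts with prefix,
  // e.g. "part-00000" or "part-00000.deflate".
  public static Path findPartFile(Path dir, String prefix) throws IOException {
    try (DirectoryStream<Path> matches = Files.newDirectoryStream(dir, prefix + "*")) {
      for (Path candidate : matches) {
        return candidate;
      }
    }
    throw new NoSuchFileException(dir.resolve(prefix).toString());
  }

  // Wraps the raw stream in a decompressor chosen by file extension.
  // Assumption: ".deflate" means zlib format, ".gz" means gzip.
  public static InputStream openDecompressed(Path file) throws IOException {
    InputStream raw = Files.newInputStream(file);
    String name = file.getFileName().toString();
    if (name.endsWith(".deflate")) {
      return new InflaterInputStream(raw);
    } else if (name.endsWith(".gz")) {
      return new GZIPInputStream(raw);
    }
    return raw; // no known compression extension: read as-is
  }

  // Reads the first line of the part file, compressed or not.
  public static String readFirstLine(Path dir, String prefix) throws IOException {
    Path part = findPartFile(dir, prefix);
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(openDecompressed(part), StandardCharsets.UTF_8))) {
      return reader.readLine();
    }
  }
}
```

Calling CounterOutputReader.readFirstLine(dir, "part-00000") then works whether the directory holds part-00000 or part-00000.deflate. Parsing the returned line is left out because the layout of the counter record is not shown in this issue.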



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
