Joseph Naegele created NUTCH-2279:
-------------------------------------
Summary: LinkRank fails when using Hadoop MR output compression
Key: NUTCH-2279
URL: https://issues.apache.org/jira/browse/NUTCH-2279
Project: Nutch
Issue Type: Bug
Affects Versions: 1.11
Reporter: Joseph Naegele
When MapReduce job output compression is enabled, i.e.
{{mapreduce.output.fileoutputformat.compress=true}}, LinkRank can't read the
results of its {{Counter}} MR job because the compression codec appends a file
extension to the output file name.
For example, using the default compression codec (which appears to be DEFLATE),
the counter file is written to
{{crawl/webgraph/_num_nodes_/part-00000.deflate}}. Then, the LinkRank job
attempts to manually read this file to obtain the number of links using the
following code:
{code}
FSDataInputStream readLinks = fs.open(new Path(numLinksPath, "part-00000"));
{code}
which fails because the file {{part-00000}} doesn't exist:
{code}
LinkAnalysis: java.io.FileNotFoundException: File crawl/webgraph/_num_nodes_/part-00000 does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:819)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:596)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:124)
at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:633)
at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:713)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:680)
{code}
To reproduce, pass {{-D mapreduce.output.fileoutputformat.compress=true}} on the
command line when running {{bin/nutch linkrank ...}}
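
One possible direction for a workaround is to stop hard-coding the file name and instead look for any file starting with {{part-00000}}, decompressing it when it carries a codec extension. The following is only a minimal sketch of that idea, not the LinkRank code: it uses plain {{java.nio}} and {{java.util.zip}} on the local filesystem rather than the Hadoop {{FileSystem}} API, the class and method names ({{PartFileReader}}, {{findPartFile}}, {{readFirstLine}}) are hypothetical, and it assumes a single reducer output and that {{.deflate}} files are zlib streams readable by {{InflaterInputStream}}.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.InflaterInputStream;

public class PartFileReader {

  // Hypothetical helper: locate "part-00000" with or without a codec
  // extension such as ".deflate". Checksum side files start with a dot,
  // so the glob "part-00000*" does not match them.
  static Path findPartFile(Path dir, String baseName) throws IOException {
    try (DirectoryStream<Path> matches =
        Files.newDirectoryStream(dir, baseName + "*")) {
      for (Path match : matches) {
        return match; // first match; a single reducer output is assumed
      }
    }
    throw new FileNotFoundException("No " + baseName + "* file in " + dir);
  }

  // Hypothetical helper: read the first line of the part file, inflating
  // it first when the name ends in ".deflate" (assumed to be zlib data).
  static String readFirstLine(Path file) throws IOException {
    InputStream in = Files.newInputStream(file);
    if (file.getFileName().toString().endsWith(".deflate")) {
      in = new InflaterInputStream(in);
    }
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      return reader.readLine();
    }
  }
}
```

An actual fix inside LinkRank would more likely go through Hadoop itself, for instance by resolving the codec from the file name with {{CompressionCodecFactory}}, so that codecs other than DEFLATE are handled as well. The sketch above only covers the default case described in this report.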