[ https://issues.apache.org/jira/browse/NUTCH-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel resolved NUTCH-3079.
------------------------------------
      Assignee: Hiran Chaudhuri
    Resolution: Fixed

Fixed in PR [#837|https://github.com/apache/nutch/pull/837].

Thanks, [~hiranchaudhuri]!

> Dumping a segment fails unless it has been fetched and parsed
> -------------------------------------------------------------
>
>                 Key: NUTCH-3079
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3079
>             Project: Nutch
>          Issue Type: Bug
>         Environment: Ubuntu 22 LTS
> $ $JAVA_HOME/bin/java -version
> openjdk version "21.0.4" 2024-07-16 LTS
> OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS)
> OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, sharing)
>            Reporter: Hiran Chaudhuri
>            Assignee: Hiran Chaudhuri
>            Priority: Major
>             Fix For: 1.21
>
>
> On some existing crawldb generate a new segment:
> {{./local/bin/nutch generate crawl/crawldb crawl/segments}}
> {{...}}
> {{2024-10-14 07:58:58,589 INFO org.apache.nutch.crawl.Generator [main] Generator: segment: crawl/segments/20241014075858}}
> {{2024-10-14 07:58:59,731 INFO org.apache.nutch.crawl.Generator [main] Generator: finished, elapsed: 3423 ms}}
> Then try to dump this new segment:
> {{./local/bin/nutch readseg -dump crawl/segments/20241014075858 crawl/log/dumpsegment}}
> This errors out with
> {{2024-10-14 08:01:10,448 INFO org.apache.nutch.segment.SegmentReader [main] SegmentReader: dump segment: crawl/segments/20241014075858}}
> {{2024-10-14 08:01:10,705 ERROR org.apache.nutch.segment.SegmentReader [main] org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_fetch}}
> {{Input path does not exist: file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_parse}}
> {{Input path does not exist: file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/content}}
> {{Input path does not exist: file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/parse_data}}
> {{Input path does not exist: file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/parse_text}}
> {{ at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:340)}}
> {{ at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:279)}}
> {{ at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)}}
> {{ at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:404)}}
> {{ at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)}}
> {{ at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)}}
> {{ at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)}}
> {{ at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1678)}}
> {{ at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1675)}}
> {{ at java.base/java.security.AccessController.doPrivileged(AccessController.java:714)}}
> {{ at java.base/javax.security.auth.Subject.doAs(Subject.java:525)}}
> {{ at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)}}
> {{ at org.apache.hadoop.mapreduce.Job.submit(Job.java:1675)}}
> {{ at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1696)}}
> {{ at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:238)}}
> {{ at org.apache.nutch.segment.SegmentReader.run(SegmentReader.java:677)}}
> {{ at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{ at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:765)}}
> {{Caused by: java.io.IOException: Input path does not exist: file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_fetch}}
> {{ at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:313)}}
> {{ ... 17 more}}
> {{Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_fetch}}
> {{Input path does not exist: file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_parse}}
> {{Input path does not exist: file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/content}}
> {{Input path does not exist: file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/parse_data}}
> {{Input path does not exist: file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/parse_text}}
> {{ at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:340)}}
> {{ at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:279)}}
> {{ at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)}}
> {{ at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:404)}}
> {{ at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)}}
> {{ at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)}}
> {{ at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)}}
> {{ at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1678)}}
> {{ at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1675)}}
> {{ at java.base/java.security.AccessController.doPrivileged(AccessController.java:714)}}
> {{ at java.base/javax.security.auth.Subject.doAs(Subject.java:525)}}
> {{ at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)}}
> {{ at org.apache.hadoop.mapreduce.Job.submit(Job.java:1675)}}
> {{ at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1696)}}
> {{ at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:238)}}
> {{ at org.apache.nutch.segment.SegmentReader.run(SegmentReader.java:677)}}
> {{ at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{ at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:765)}}
> {{Caused by: java.io.IOException: Input path does not exist: file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_fetch}}
> {{ at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:313)}}
> {{ ... 17 more}}
>
> I know there was no fetch and no parse step executed, but I would have expected to see a dump of the segment's data (the contained URLs presumably)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)