Hiran Chaudhuri created NUTCH-3079:
--------------------------------------

             Summary: Dumping a segment fails unless it has been fetched and parsed
                 Key: NUTCH-3079
                 URL: https://issues.apache.org/jira/browse/NUTCH-3079
             Project: Nutch
          Issue Type: Bug
         Environment: Ubuntu 22 LTS

$ $JAVA_HOME/bin/java -version
openjdk version "21.0.4" 2024-07-16 LTS
OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, 
sharing)
            Reporter: Hiran Chaudhuri


On an existing crawldb, generate a new segment:

{{./local/bin/nutch generate crawl/crawldb crawl/segments}}
{{...}}
{{2024-10-14 07:58:58,589 INFO org.apache.nutch.crawl.Generator [main] 
Generator: segment: crawl/segments/20241014075858}}
{{2024-10-14 07:58:59,731 INFO org.apache.nutch.crawl.Generator [main] 
Generator: finished, elapsed: 3423 ms}}

Then try to dump this new segment:

{{./local/bin/nutch readseg -dump crawl/segments/20241014075858 crawl/log/dumpsegment}}

This errors out with:

{{2024-10-14 08:01:10,448 INFO org.apache.nutch.segment.SegmentReader [main] 
SegmentReader: dump segment: crawl/segments/20241014075858}}
{{2024-10-14 08:01:10,705 ERROR org.apache.nutch.segment.SegmentReader [main] 
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
not exist: 
file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_fetch}}
{{Input path does not exist: 
file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_parse}}
{{Input path does not exist: 
file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/content}}
{{Input path does not exist: 
file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/parse_data}}
{{Input path does not exist: 
file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/parse_text}}
{{    at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:340)}}
{{    at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:279)}}
{{    at 
org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)}}
{{    at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:404)}}
{{    at 
org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)}}
{{    at 
org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)}}
{{    at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)}}
{{    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1678)}}
{{    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1675)}}
{{    at 
java.base/java.security.AccessController.doPrivileged(AccessController.java:714)}}
{{    at java.base/javax.security.auth.Subject.doAs(Subject.java:525)}}
{{    at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)}}
{{    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1675)}}
{{    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1696)}}
{{    at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:238)}}
{{    at org.apache.nutch.segment.SegmentReader.run(SegmentReader.java:677)}}
{{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
{{    at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:765)}}
{{Caused by: java.io.IOException: Input path does not exist: 
file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_fetch}}
{{    at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:313)}}
{{    ... 17 more}}
{{Exception in thread "main"
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does
not exist:
file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_fetch}}
{{Input path does not exist: 
file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_parse}}
{{Input path does not exist: 
file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/content}}
{{Input path does not exist: 
file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/parse_data}}
{{Input path does not exist: 
file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/parse_text}}
{{    at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:340)}}
{{    at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:279)}}
{{    at 
org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)}}
{{    at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:404)}}
{{    at 
org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)}}
{{    at 
org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)}}
{{    at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)}}
{{    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1678)}}
{{    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1675)}}
{{    at 
java.base/java.security.AccessController.doPrivileged(AccessController.java:714)}}
{{    at java.base/javax.security.auth.Subject.doAs(Subject.java:525)}}
{{    at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)}}
{{    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1675)}}
{{    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1696)}}
{{    at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:238)}}
{{    at org.apache.nutch.segment.SegmentReader.run(SegmentReader.java:677)}}
{{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
{{    at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:765)}}
{{Caused by: java.io.IOException: Input path does not exist: 
file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_fetch}}
{{    at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:313)}}
{{    ... 17 more}}

I know that no fetch or parse step has been executed, but I would have
expected to see a dump of the segment's data (presumably the URLs it contains).
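
A possible workaround, sketched here and not part of the original report: readseg accepts general options that skip individual segment parts (-nogenerate, -nofetch, -noparse, -nocontent, -noparsedata, -noparsetext). The helper below is hypothetical. It assumes that the mapping from these options to the segment subdirectories is as listed in the script for the Nutch version in use, and that skipping the missing parts is enough to let the dump job start. It only prints the options for the subdirectories that do not exist.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the readseg "-no..." option for every segment
# part that is missing on disk, so that only existing parts are read.
# Assumption: the directory-to-option mapping below matches your Nutch version.
readseg_skip_flags() {
  local segment="$1"
  local flags=""
  local pair dir flag
  for pair in \
    "crawl_generate:-nogenerate" \
    "crawl_fetch:-nofetch" \
    "crawl_parse:-noparse" \
    "content:-nocontent" \
    "parse_data:-noparsedata" \
    "parse_text:-noparsetext"; do
    dir="${pair%%:*}"    # part before the colon: subdirectory name
    flag="${pair#*:}"    # part after the colon: readseg option
    if [ ! -d "$segment/$dir" ]; then
      flags="$flags $flag"
    fi
  done
  # Strip the leading space before printing.
  echo "${flags# }"
}

# Intended use (not executed here):
#   seg=crawl/segments/20241014075858
#   ./local/bin/nutch readseg -dump "$seg" crawl/log/dumpsegment $(readseg_skip_flags "$seg")
```

For the freshly generated segment above, where only crawl_generate should exist, this would print the five options for the fetch and parse parts. Whether readseg then dumps the generated URLs as expected has not been verified here.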



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
