Marco Ebbinghaus created NUTCH-2517:
---------------------------------------
             Summary: mergesegs corrupts segment data
                 Key: NUTCH-2517
                 URL: https://issues.apache.org/jira/browse/NUTCH-2517
             Project: Nutch
          Issue Type: Bug
          Components: segment
    Affects Versions: 1.15
         Environment: xubuntu 17.10, docker container of apache/nutch LATEST
            Reporter: Marco Ebbinghaus
         Attachments: Screenshot_2018-03-03_18-09-28.png

The problem probably occurs since commit [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]

How to reproduce:
* create a container from the apache/nutch image (latest)
* open a terminal in that container
* set http.agent.name
* create the crawl directory and a urls file
* run bin/nutch inject (`bin/nutch inject mycrawl/crawldb urls/urls`)
* run bin/nutch generate (`bin/nutch generate mycrawl/crawldb mycrawl/segments 1`)
** this results in a segment (e.g. 20180304134215)
* run bin/nutch fetch (`bin/nutch fetch mycrawl/segments/20180304134215 -threads 2`)
* run bin/nutch parse (`bin/nutch parse mycrawl/segments/20180304134215 -threads 2`)
** `ls` in the segment folder now shows the folders: content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text
* run bin/nutch updatedb (`bin/nutch updatedb mycrawl/crawldb mycrawl/segments/20180304134215`)
* run bin/nutch mergesegs (`bin/nutch mergesegs mycrawl/MERGEDsegments mycrawl/segments/* -filter`)
** console output: `SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text`
** resulting segment: 20180304134535
* `ls` in mycrawl/MERGEDsegments/20180304134535 shows only one remaining folder: crawl_generate
* run bin/nutch invertlinks (`bin/nutch invertlinks mycrawl/linkdb -dir mycrawl/MERGEDsegments`), which fails as a consequence
** console output:

```
LinkDb: adding segment: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535
LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
	at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
	at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
	at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)
```

So it appears that the mergesegs command corrupts the segment folder: all subdirectories except crawl_generate are lost during the merge, and the subsequent invertlinks step then fails because parse_data is missing.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
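For convenience, the reproduction steps above can be consolidated into a single script. This is only a sketch: it assumes a configured Nutch 1.15 runtime in the current directory (with http.agent.name set), a urls/urls seed file, and a single segment, whose name is discovered rather than hard-coded.

```shell
#!/bin/sh
# Sketch of the reproduction steps from this report; assumes bin/nutch belongs
# to a configured Nutch 1.15 runtime and urls/urls contains at least one seed.
set -e

bin/nutch inject mycrawl/crawldb urls/urls
bin/nutch generate mycrawl/crawldb mycrawl/segments 1

# Pick up the segment directory that generate just created (e.g. 20180304134215).
SEGMENT=$(ls -d mycrawl/segments/* | tail -1)

bin/nutch fetch "$SEGMENT" -threads 2
bin/nutch parse "$SEGMENT" -threads 2
bin/nutch updatedb mycrawl/crawldb "$SEGMENT"

bin/nutch mergesegs mycrawl/MERGEDsegments mycrawl/segments/* -filter

# Expected subfolders: content, crawl_fetch, crawl_generate, crawl_parse,
# parse_data, parse_text. Observed: only crawl_generate survives the merge.
ls mycrawl/MERGEDsegments/*

# Fails with InvalidInputException because parse_data no longer exists.
bin/nutch invertlinks mycrawl/linkdb -dir mycrawl/MERGEDsegments
```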