[
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Marco Ebbinghaus updated NUTCH-2517:
------------------------------------
Description:
The problem has probably existed since commit
[https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
How to reproduce:
* create container from apache/nutch image (latest)
* open terminal in that container
* set http.agent.name
* create crawldir and urls file
* run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
* run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments
1)
** this results in a segment (e.g. 20180304134215)
* run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215
-threads 2)
* run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215
-threads 2)
** ls in the segment folder -> existing folders: content, crawl_fetch,
crawl_generate, crawl_parse, parse_data, parse_text
* run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb
mycrawl/segments/20180304134215)
* run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments
mycrawl/segments/* -filter)
** console output: `SegmentMerger: using segment data from: content
crawl_generate crawl_fetch crawl_parse parse_data parse_text`
** resulting segment: 20180304134535
* ls in mycrawl/MERGEDsegments/20180304134535 -> only existing folder:
crawl_generate
* run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir
mycrawl/MERGEDsegments), which then fails as a consequence
** console output: `LinkDb: adding segment: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535
LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
So it seems that MapReduce corrupts the segment folder during the mergesegs
command.
Note that this issue is not limited to merging a single segment as described
above. As the attached screenshot shows, the problem also appears when using
bin/nutch generate with a topN > 1, resulting in a segment count > 1.
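
The symptom in the steps above (a merged segment that contains only
crawl_generate) could be detected before invertlinks is run. The following is a
minimal sketch of such a check, assuming the six subdirectories listed above
are what a complete segment should contain. The function names missing_parts
and check_merged are hypothetical helpers for illustration and are not part of
Nutch.

```shell
#!/bin/sh
# Subdirectories a fetched and parsed segment is expected to contain,
# taken from the directory listing in the report above.
EXPECTED_PARTS="content crawl_fetch crawl_generate crawl_parse parse_data parse_text"

# Print each expected subdirectory that is missing from the given segment dir.
# (Hypothetical helper, not a Nutch command.)
missing_parts() {
  segment_dir=$1
  for part in $EXPECTED_PARTS; do
    [ -d "$segment_dir/$part" ] || printf '%s\n' "$part"
  done
}

# Check every segment below the given directory and report incomplete ones.
# Returns 0 if all segments are complete, 1 otherwise.
# (Hypothetical helper, not a Nutch command.)
check_merged() {
  merged_dir=$1
  status=0
  for seg in "$merged_dir"/*; do
    [ -d "$seg" ] || continue
    missing=$(missing_parts "$seg" | tr '\n' ' ')
    if [ -n "$missing" ]; then
      echo "incomplete segment $seg: missing $missing"
      status=1
    fi
  done
  return $status
}
```

With these functions loaded, `check_merged mycrawl/MERGEDsegments` would be
expected to report the merged segment from the steps above as incomplete, so
that invertlinks can be skipped instead of failing with InvalidInputException.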
> mergesegs corrupts segment data
> -------------------------------
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
> Issue Type: Bug
> Components: segment
> Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
> Reporter: Marco Ebbinghaus
> Priority: Major
> Labels: mapreduce, mergesegs
> Attachments: Screenshot_2018-03-03_18-09-28.png
>