[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388650#comment-16388650 ]
Lewis John McGibbney edited comment on NUTCH-2517 at 3/6/18 10:49 PM:
----------------------------------------------------------------------

I cannot reproduce this... see the tests below.

{code}
// inject
/usr/local/nutch(master) $ ./runtime/local/bin/nutch inject mycrawl/crawldb urls/seed.txt
Injector: starting at 2018-03-06 14:31:10
Injector: crawlDb: mycrawl/crawldb
Injector: urlDir: urls/seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 1
Injector: finished at 2018-03-06 14:31:12, elapsed: 00:00:01
{code}

{code}
// simple 'ls' to see what we have
/usr/local/nutch(master) $ ls mycrawl/crawldb/
current/ old/
{code}

{code}
// generate
/usr/local/nutch(master) $ ./runtime/local/bin/nutch generate mycrawl/crawldb mycrawl/segments 1
Generator: starting at 2018-03-06 14:31:37
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: running in local mode, generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: mycrawl/segments/20180306143139
Generator: finished at 2018-03-06 14:31:40, elapsed: 00:00:03
{code}

{code}
// fetch
/usr/local/nutch(master) $ ./runtime/local/bin/nutch fetch mycrawl/segments/20180306143139 -threads 2
Fetcher: starting at 2018-03-06 14:32:15
Fetcher: segment: mycrawl/segments/20180306143139
Fetcher: threads: 2
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records hit by time limit :0
FetcherThread 36 Using queue mode : byHost
FetcherThread 36 Using queue mode : byHost
FetcherThread 40 fetching http://nutch.apache.org:-1/ (queue crawl delay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
FetcherThread 41 has no more work available
FetcherThread 41 -finishing thread FetcherThread, activeThreads=1
robots.txt whitelist not configured.
FetcherThread 40 has no more work available
FetcherThread 40 -finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
-activeThreads=0
Fetcher: finished at 2018-03-06 14:32:18, elapsed: 00:00:02
{code}

{code}
// parse
/usr/local/nutch(master) $ ./runtime/local/bin/nutch parse mycrawl/segments/20180306143139 -threads 2
ParseSegment: starting at 2018-03-06 14:32:45
ParseSegment: segment: mycrawl/segments/20180306143139
Parsed (140ms):http://nutch.apache.org:-1/
ParseSegment: finished at 2018-03-06 14:32:46, elapsed: 00:00:01
{code}

{code}
// let's see what we have
/usr/local/nutch(master) $ ls mycrawl/
crawldb/ segments/
/usr/local/nutch(master) $ ls mycrawl/segments/20180306143139/
content/ crawl_fetch/ crawl_generate/ crawl_parse/ parse_data/ parse_text/
{code}

{code}
// updatedb
/usr/local/nutch(master) $ ./runtime/local/bin/nutch updatedb mycrawl/crawldb mycrawl/segments/20180306143139/
CrawlDb update: starting at 2018-03-06 14:33:40
CrawlDb update: db: mycrawl/crawldb
CrawlDb update: segments: [mycrawl/segments/20180306143139]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2018-03-06 14:33:41, elapsed: 00:00:01
{code}

{code}
// let's see what we have
/usr/local/nutch(master) $ ls mycrawl/
crawldb/ segments/
{code}

{code}
// mergesegs with -dir option
/usr/local/nutch(master) $ ./runtime/local/bin/nutch mergesegs mycrawl/MERGEDsegments -dir mycrawl/segments/ -filter
Merging 1 segments to mycrawl/MERGEDsegments/20180306143518
SegmentMerger: adding file:/usr/local/nutch/mycrawl/segments/20180306143139
SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
{code}

{code}
// let's see what we have
/usr/local/nutch(master) $ ls mycrawl/
MERGEDsegments/ crawldb/ segments/
/usr/local/nutch(master) $ ls mycrawl/MERGEDsegments/20180306143518/crawl_
crawl_generate/ crawl_parse/
{code}

{code}
// mergesegs with a single segment directory, without the -dir option
/usr/local/nutch(master) $ ./runtime/local/bin/nutch mergesegs mycrawl/MERGEDsegments2 mycrawl/segments/20180306143139/ -filter
Merging 1 segments to mycrawl/MERGEDsegments2/20180306143617
SegmentMerger: adding mycrawl/segments/20180306143139
SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
{code}

{code}
// mergesegs with an array of segment directories
lmcgibbn@LMC-056430 /usr/local/nutch(master) $ ./runtime/local/bin/nutch mergesegs mycrawl/MERGEDsegments3 mycrawl/segments/20180306143139/ mycrawl/MERGEDsegments/20180306143518/ mycrawl/MERGEDsegments2/20180306143617/ -filter
Merging 3 segments to mycrawl/MERGEDsegments3/20180306143709
SegmentMerger: adding mycrawl/segments/20180306143139
SegmentMerger: adding mycrawl/MERGEDsegments/20180306143518
mycrawl/MERGEDsegments/20180306143518 changed input dirs
SegmentMerger: adding mycrawl/MERGEDsegments2/20180306143617
SegmentMerger: using segment data from: crawl_generate crawl_parse
{code}

The one difference I can see is that, upon successful execution of the mergesegs command, I am locally left with two new directories, namely

{code}
ls mycrawl/MERGEDsegments/20180306143518/crawl_
crawl_generate/ crawl_parse/
{code}


> mergesegs corrupts segment data
> -------------------------------
>
>                 Key: NUTCH-2517
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2517
>             Project: Nutch
>          Issue Type: Bug
>          Components: segment
>    Affects Versions: 1.15
>         Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>            Reporter: Marco Ebbinghaus
>            Priority: Blocker
>              Labels: mapreduce, mergesegs
>             Fix For: 1.15
>
>         Attachments: Screenshot_2018-03-03_18-09-28.png
>
> The problem probably occurs since commit
> https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4
> How to reproduce:
> * create a container from the apache/nutch image (latest)
> * open a terminal in that container
> * set http.agent.name
> * create the crawl directory and the urls file
> * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
> * run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 1)
> ** this results in a segment (e.g. 20180304134215)
> * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 -threads 2)
> * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 -threads 2)
> ** ls in the segment folder -> existing folders: content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text
> * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb mycrawl/segments/20180304134215)
> * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments mycrawl/segments/* -filter)
> ** console output: `SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text`
> ** resulting segment: 20180304134535
> * ls in mycrawl/MERGEDsegments/20180304134535 -> only existing folder: crawl_generate
> * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir mycrawl/MERGEDsegments), which results in a consequential error
> ** console output: `LinkDb: adding segment: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535
> LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data
>         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>         at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>         at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>         at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
>         at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
> So it seems that mapreduce corrupts the segment folder during the mergesegs command.
>
> Note that this issue is not limited to merging a single segment as described above. As the attached screenshot shows, the problem also appears when executing multiple bin/nutch generate/fetch/parse/updatedb commands before running mergesegs, resulting in a segment count > 1.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
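For convenience, the reproduction steps above can be collected into one script. This is only a sketch, not the reporter's exact procedure: it assumes a local checkout with `runtime/local/bin/nutch` and a seed list at `urls/seed.txt`, and the segment timestamp is copied from the transcript in the comment (a real run would discover the freshly generated segment name instead). It prints each command rather than executing it; clear `DRY_RUN` to run against a real Nutch install.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the NUTCH-2517 reproduction sequence.
# Assumptions: bin/nutch under runtime/local, seeds in urls/seed.txt,
# and a hard-coded example segment name taken from the transcript.
set -euo pipefail

NUTCH=./runtime/local/bin/nutch
CRAWL=mycrawl
DRY_RUN=echo   # set to the empty string to actually execute

$DRY_RUN "$NUTCH" inject   "$CRAWL/crawldb" urls/seed.txt
$DRY_RUN "$NUTCH" generate "$CRAWL/crawldb" "$CRAWL/segments" 1

# In a real run, pick up the segment that generate just created,
# e.g. SEGMENT=$(ls -d "$CRAWL"/segments/* | tail -1).
SEGMENT="$CRAWL/segments/20180306143139"   # example value from the transcript

$DRY_RUN "$NUTCH" fetch    "$SEGMENT" -threads 2
$DRY_RUN "$NUTCH" parse    "$SEGMENT" -threads 2
$DRY_RUN "$NUTCH" updatedb "$CRAWL/crawldb" "$SEGMENT"

# The step under investigation: merge the segments, then inspect which
# subdirectories (content, crawl_fetch, parse_data, ...) survive the merge.
$DRY_RUN "$NUTCH" mergesegs "$CRAWL/MERGEDsegments" -dir "$CRAWL/segments" -filter
```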