[ 
https://issues.apache.org/jira/browse/NUTCH-972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriele Kahlout updated NUTCH-972:
-----------------------------------

    Attachment: check_empty.diff

> Mergedb doesn't merge with empty directory, as is the case with merge (for 
> indexes)
> -----------------------------------------------------------------------------------
>
>                 Key: NUTCH-972
>                 URL: https://issues.apache.org/jira/browse/NUTCH-972
>             Project: Nutch
>          Issue Type: Bug
>          Components: storage
>    Affects Versions: 1.2
>            Reporter: Gabriele Kahlout
>            Priority: Minor
>              Labels: patch
>             Fix For: 1.3
>
>         Attachments: check_empty.diff
>
>
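> This is an issue of unexpected behavior. The following series of commands 
> works with bin/nutch merge (for indexes) but fails with bin/nutch mergedb 
> (for crawldbs) when one of the input directories is empty or does not exist 
> yet.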
> it_crawldb="crawl/crawldb"   # inferred from the mergedb invocation in the log
> allcrawldb="crawl/allcrawldb"
> temp_crawldb="crawl/temp_crawldb"
> merge_dbs="$it_crawldb $allcrawldb"
>
> # Uncomment the following check and mergedb will work fine:
> # if [[ ! -d $allcrawldb ]]
> # then
> #     merge_dbs="$it_crawldb"
> # fi
> bin/nutch mergedb $temp_crawldb $merge_dbs
> rm -r $it_crawldb $allcrawldb crawl/segments crawl/linkdb
> mv $temp_crawldb $allcrawldb
> This is the exception that occurs:
> bin/nutch mergedb crawl/temp_crawldb crawl/crawldb crawl/allcrawldb
> CrawlDb merge: starting at 2011-03-27 10:13:06
> Adding crawl/crawldb
> Adding crawl/allcrawldb
> CrawlDb merge: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/simpatico/nutch-1.2/crawl/allcrawldb/current
>       at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>       at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>       at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>       at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>       at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>       at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>       at org.apache.nutch.crawl.CrawlDbMerger.merge(CrawlDbMerger.java:126)
>       at org.apache.nutch.crawl.CrawlDbMerger.run(CrawlDbMerger.java:187)
>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>       at org.apache.nutch.crawl.CrawlDbMerger.main(CrawlDbMerger.java:159)
> Besides the scripting workaround, I've attached a patch that skips adding 
> the empty folder to the collection of dbs to merge. I've also added a log 
> of which dbs actually get added, consistent with the merge interface.
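
The scripting workaround described above can be generalized to any number of input dbs. What follows is only a sketch and is not taken from the attached patch: filter_existing_dbs is a hypothetical helper name, and the test for a "current" subdirectory is an assumption based on the path named in the exception.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Nutch): print only the crawldb
# directories that contain a "current" subdirectory, which is the path
# the exception above complains about when it is missing.
filter_existing_dbs() {
  local db
  for db in "$@"; do
    if [ -d "$db/current" ]; then
      printf '%s\n' "$db"
    else
      # Report skipped dbs on stderr so stdout carries only valid paths.
      echo "Skipping $db (no $db/current)" >&2
    fi
  done
}

# Possible usage with the variables from the report (paths without spaces):
#   merge_dbs=$(filter_existing_dbs "$it_crawldb" "$allcrawldb")
#   bin/nutch mergedb "$temp_crawldb" $merge_dbs
```

Printing the skipped directories to stderr keeps the list of dbs that are actually merged visible, similar in spirit to the logging the patch adds.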

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
