[
https://issues.apache.org/jira/browse/NUTCH-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000979#comment-14000979
]
Hudson commented on NUTCH-1772:
-------------------------------
SUCCESS: Integrated in Nutch-trunk #2630 (See
[https://builds.apache.org/job/Nutch-trunk/2630/])
NUTCH-1772 Injector does not need merging if no pre-existing crawldb (jnioche:
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1595137)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java
> Injector does not need merging if no pre-existing crawldb
> ---------------------------------------------------------
>
> Key: NUTCH-1772
> URL: https://issues.apache.org/jira/browse/NUTCH-1772
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Affects Versions: 1.8
> Reporter: Julien Nioche
> Fix For: 1.9
>
> Attachments: NUTCH-1772-Logging&ErrorHandling.patch, NUTCH-1772.patch
>
>
> The injector currently works as follows:
> * MapReduce job 1 - Mapper : converts input lines into CrawlDatum objects
> with normalisation and filtering
> * MapReduce job 1 - Reducer : identity reducers. Can still have duplicates at
> this stage
> * MapReduce job 2 - Mapper : CrawlDbFilter on existing crawldb (if any) +
> output of previous job
> * MapReduce job 2 - Reducer : deduplication
> If there is no existing crawldb (which will often be the case at injection
> time), we don't really need the second MapReduce job and could simply
> take the output of MR job #1 as the CrawlDB, provided that we do the
> deduplication as part of its reduce step.
> If there is a crawldb, then the reduce step of MR job #1 is not really
> needed and we could run that job as map-only.
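The two cases described above can be sketched in plain Java, without any Hadoop dependency. This is only an illustration of the idea, not the code committed to Injector.java: the class InjectorPlanSketch and the methods mapSeeds, dedupReduce and plan are hypothetical names, the http/https check is a stand-in for Nutch's real normalisation and filtering, and keeping the first copy of a duplicate URL is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical sketch, not the actual Nutch Injector.
public class InjectorPlanSketch {

    // Stand-in for the job 1 mapper: trims each seed line and keeps only
    // http/https URLs (an assumed, simplified normalise + filter step).
    static List<String> mapSeeds(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String url = line.trim();
            if (url.startsWith("http://") || url.startsWith("https://")) {
                out.add(url);
            }
        }
        return out;
    }

    // Stand-in for a deduplicating reducer: one entry per URL, keeping the
    // first occurrence (assumption) and preserving input order.
    static List<String> dedupReduce(List<String> urls) {
        return new ArrayList<>(new LinkedHashSet<>(urls));
    }

    // Describes which steps are needed depending on whether a crawldb exists.
    static List<String> plan(boolean crawlDbExists) {
        if (crawlDbExists) {
            return Arrays.asList(
                "job 1: map only",
                "job 2: CrawlDbFilter map + dedup reduce");
        }
        return Arrays.asList(
            "job 1: map + dedup reduce (output becomes the crawldb)");
    }

    public static void main(String[] args) {
        List<String> seeds = Arrays.asList(
            " http://a.example/ ", "http://a.example/", "",
            "ftp://b.example/", "https://c.example/");
        System.out.println(plan(false));
        System.out.println(dedupReduce(mapSeeds(seeds)));
    }
}
```

With no existing crawldb, the deduplicated output of job 1 is already a usable crawldb, so the second job is skipped. With an existing crawldb, deduplication happens in job 2 anyway, so the reduce in job 1 adds nothing.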