[jira] [Commented] (NUTCH-1772) Injector does not need merging if no pre-existing crawldb

Sebastian Nagel (JIRA) Tue, 13 May 2014 15:42:06 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996822#comment-13996822 ]


Sebastian Nagel commented on NUTCH-1772:
----------------------------------------

+1 (works)
NUTCH-1712 would remove one of the two jobs even when a CrawlDb already 
exists, but some points are still open (mainly the missing 
MapFileOutputFormat in Hadoop 1.2.0 / new MapReduce API).

> Injector does not need merging if no pre-existing crawldb
> ---------------------------------------------------------
>
>                 Key: NUTCH-1772
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1772
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.8
>            Reporter: Julien Nioche
>         Attachments: NUTCH-1772.patch
>
>
> The injector currently works as follows: 
> * MapReduce job 1 - Mapper : converts input lines into CrawlDatum objects 
> with normalisation and filtering
> * MapReduce job 1 - Reducer : identity reducer. Can still have duplicates at 
> this stage
> * MapReduce job 2 - Mapper : CrawlDbFilter on existing crawldb (if any) + 
> output of previous job
> * MapReduce job 2 - Reducer : deduplication
> If there is no existing crawldb (which will often be the case at injection 
> time), we don't really need the second MapReduce job and could simply take 
> the output of MR job #1 as the CrawlDb, provided that we do the 
> deduplication as part of its reduce step.
> If there is a crawldb, then the reduce step of MR job #1 is not really 
> needed and we could run that job as map-only.
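
The two cases described in the issue can be sketched as a small in-memory model. This is only an illustration, not Nutch or Hadoop code: the function names (inject_map, dedup_reduce, inject), the toy normalisation and filter, and the rule that an existing CrawlDb entry wins over a freshly injected one are all assumptions made for the example.

```python
from itertools import groupby


def inject_map(lines):
    """Model of the job 1 mapper: seed lines -> (url, datum) pairs."""
    for line in lines:
        url = line.strip().lower()  # toy stand-in for URL normalisation (assumption)
        if url and not url.startswith("#"):  # toy stand-in for URL filtering (assumption)
            yield url, {"status": "injected"}


def dedup_reduce(pairs):
    """Model of a deduplicating reducer: keep one datum per URL.

    The sort is stable, so for each URL the first pair in input order wins.
    """
    by_url = lambda pair: pair[0]
    result = {}
    for url, group in groupby(sorted(pairs, key=by_url), key=by_url):
        result[url] = next(group)[1]
    return result


def inject(seed_lines, existing_crawldb=None):
    """Return the resulting CrawlDb as a dict of url -> datum."""
    injected = list(inject_map(seed_lines))
    if existing_crawldb is None:
        # No CrawlDb yet: a single job suffices if its reducer deduplicates;
        # its output is used directly as the new CrawlDb.
        return dedup_reduce(injected)
    # CrawlDb exists: job 1 can be map-only, and the merge job deduplicates.
    # Existing entries are listed first so that they win (assumption).
    merged = list(existing_crawldb.items()) + injected
    return dedup_reduce(merged)
```

Calling inject(seeds) with no second argument models the single-job path; passing an existing dict models the map-only job followed by the merge job.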



--
This message was sent by Atlassian JIRA
(v6.2#6252)
