[ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092924#comment-15092924
 ] 

Sebastian Nagel commented on NUTCH-1712:
----------------------------------------

The merge is done, together with minor improvements 
(https://github.com/apache/nutch/compare/trunk...sebastian-nagel:NUTCH-1712), 
but I still need to adapt the unit test (TestCrawlDbStates.java).


> Use MultipleInputs in Injector to make it a single mapreduce job
> ----------------------------------------------------------------
>
>                 Key: NUTCH-1712
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1712
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.7
>            Reporter: Tejas Patil
>            Assignee: Sebastian Nagel
>         Attachments: NUTCH-1712-trunk.v1.patch
>
>
> Currently, Injector creates two mapreduce jobs:
> 1. sort job: reads the URLs from the seed file and emits CrawlDatum objects.
> 2. merge job: reads CrawlDatum objects from both the crawldb and the output 
> of the sort job, then merges them and emits the final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from the crawldb and URLs 
> from the seed file simultaneously and perform the inject in a single mapreduce job.
> Also, here are the additional things covered by this jira:
> 1. Moved filtering and normalization ahead of metadata extraction so that 
> unwanted records are ruled out early.
> 2. Migrated to the new mapreduce API
> 3. Improved documentation
> 4. New JUnit tests with better coverage
> Relevant discussion over nutch-dev can be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E
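
For readers following the quoted description, here is a rough sketch of the
single-pass idea in plain Java. It does not use the Hadoop API or the code from
the patch: the class, the `Tagged` type, the `accept` filter and the status
strings are hypothetical stand-ins. It also assumes that an existing crawldb
entry takes precedence over a newly injected one. Records from both inputs are
tagged with their source (the role MultipleInputs plays by assigning a mapper
per input path), seed URLs are filtered before any metadata is parsed, and one
merge step produces the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SingleJobInjectSketch {

    /** Which input a record came from (stands in for per-path mappers). */
    enum Source { CRAWLDB, SEED }

    /** Hypothetical stand-in for a (url, CrawlDatum) pair tagged with its source. */
    static final class Tagged {
        final String url;
        final Source source;
        final String status;

        Tagged(String url, Source source, String status) {
            this.url = url;
            this.source = source;
            this.status = status;
        }
    }

    /** Hypothetical URL filter: accept only http(s) URLs. */
    static boolean accept(String url) {
        return url.startsWith("http://") || url.startsWith("https://");
    }

    /** "Mapper" for seed lines: filter first, and only then look at metadata. */
    static Tagged mapSeedLine(String line) {
        String[] fields = line.trim().split("\t");
        String url = fields[0];
        if (!accept(url)) {
            return null; // rejected before any metadata extraction
        }
        // Metadata (fields[1..], e.g. name=value pairs) would be parsed here.
        return new Tagged(url, Source.SEED, "injected");
    }

    /** Both inputs go through one map phase and one merge ("reduce") phase. */
    static Map<String, String> inject(Map<String, String> crawlDb, List<String> seedLines) {
        List<Tagged> mapped = new ArrayList<>();
        for (Map.Entry<String, String> e : crawlDb.entrySet()) {
            mapped.add(new Tagged(e.getKey(), Source.CRAWLDB, e.getValue()));
        }
        for (String line : seedLines) {
            Tagged t = mapSeedLine(line);
            if (t != null) {
                mapped.add(t);
            }
        }

        // Merge by URL. Assumption: an existing crawldb entry always wins.
        Map<String, String> merged = new TreeMap<>();
        for (Tagged t : mapped) {
            if (t.source == Source.CRAWLDB) {
                merged.put(t.url, t.status);
            } else {
                merged.putIfAbsent(t.url, "db_unfetched");
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> crawlDb = new TreeMap<>();
        crawlDb.put("http://a.example/", "db_fetched");
        List<String> seeds = Arrays.asList(
            "http://a.example/\tnutch.score=2.0", // already in the crawldb
            "http://b.example/",                  // new URL
            "ftp://c.example/");                  // rejected by the filter
        System.out.println(inject(crawlDb, seeds));
    }
}
```

The merge step is written so that its result does not depend on the order in
which crawldb and seed records arrive, which is what a reducer would need.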



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
