[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935823#comment-13935823 ]
Sebastian Nagel commented on NUTCH-1712:
----------------------------------------
Thanks. Looks good in general, +1 for the nicely improved command-line help.
Open points:
# the resulting CrawlDb is not readable by some tools, e.g. {{nutch readdb crawldb/ -url url}} fails.
The output should be a MapFile, not a SequenceFile.
Unfortunately, o.a.h.mapreduce.lib.output.MapFileOutputFormat does not seem to be
available in Hadoop 1.2.0 (later versions contain the class, see MAPREDUCE-375).
# the URL normalizer scope can now be changed via the new config property
"crawldb.url.normalizers.scope". Do we need it? If yes, we should add a
description to nutch-default.xml.
# if this property is not set, URLNormalizers.SCOPE_CRAWLDB is used by default
instead of URLNormalizers.SCOPE_INJECT. The default should still be SCOPE_INJECT,
right?
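
To make the last point concrete, here is a minimal sketch of the fallback I have in mind. It is not taken from the patch: java.util.Properties stands in for the Hadoop Configuration, the class and method names are hypothetical, and the string values "inject" and "crawldb" are assumed to match URLNormalizers.SCOPE_INJECT and URLNormalizers.SCOPE_CRAWLDB.

```java
import java.util.Properties;

public class NormalizerScopeSketch {
    // Property name as proposed in the patch under review.
    static final String SCOPE_PROPERTY = "crawldb.url.normalizers.scope";

    // Assumption: these mirror URLNormalizers.SCOPE_INJECT and
    // URLNormalizers.SCOPE_CRAWLDB; they are redefined here so that the
    // sketch compiles without Nutch on the classpath.
    static final String SCOPE_INJECT = "inject";
    static final String SCOPE_CRAWLDB = "crawldb";

    // Returns the configured scope, falling back to the inject scope
    // when the property is not set.
    static String resolveScope(Properties conf) {
        return conf.getProperty(SCOPE_PROPERTY, SCOPE_INJECT);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Property not set: the inject scope is used.
        System.out.println(resolveScope(conf));

        // Property set explicitly: the configured scope is used.
        conf.setProperty(SCOPE_PROPERTY, SCOPE_CRAWLDB);
        System.out.println(resolveScope(conf));
    }
}
```

With a real Configuration object the equivalent call would presumably be conf.get(name, defaultValue), so only the default argument would need to change.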
> Use MultipleInputs in Injector to make it a single mapreduce job
> ----------------------------------------------------------------
>
> Key: NUTCH-1712
> URL: https://issues.apache.org/jira/browse/NUTCH-1712
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Affects Versions: 1.7
> Reporter: Tejas Patil
> Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1712-trunk.v1.patch
>
>
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the urls from seeds file, emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both crawldb and output of sort
> job. Merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls
> from seeds file simultaneously and perform inject in a single map-reduce job.
> Also, here are additional things covered with this jira:
> 1. Pushed filtering and normalization above metadata extraction so that the
> unwanted records are ruled out quickly.
> 2. Migrated to new mapreduce API
> 3. Improved documentation
> 4. New junits with better coverage
> Relevant discussion over nutch-dev can be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E
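
For readers new to the issue, the single-job idea in the description quoted above can be sketched in plain Java. This is only an illustration under stated assumptions, not the patch: ordinary collections stand in for MultipleInputs and the shuffle, a status string stands in for CrawlDatum, URL filtering and normalization are left out, and it assumes that an existing CrawlDb entry takes precedence over a newly injected one. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SingleJobInjectSketch {
    // Stand-in for the status of a freshly injected CrawlDatum.
    static final String INJECTED = "injected";

    // Reads seed lines and existing CrawlDb entries in one pass, groups
    // them by URL, and picks one status per URL.
    static Map<String, String> inject(List<String> seedLines,
                                      Map<String, String> crawlDb) {
        // "Map" phase for both inputs; the TreeMap plays the role of the
        // shuffle by grouping values per URL in sorted key order.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : seedLines) {
            String url = line.trim();
            if (url.isEmpty()) {
                continue; // skip blank seed lines
            }
            grouped.computeIfAbsent(url, k -> new ArrayList<>()).add(INJECTED);
        }
        for (Map.Entry<String, String> e : crawlDb.entrySet()) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                   .add(e.getValue());
        }

        // "Reduce" phase: an existing entry wins over an injected one
        // (an assumption of this sketch).
        Map<String, String> merged = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            String chosen = INJECTED;
            for (String status : e.getValue()) {
                if (!INJECTED.equals(status)) {
                    chosen = status;
                }
            }
            merged.put(e.getKey(), chosen);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> crawlDb = new HashMap<>();
        crawlDb.put("http://a.example/", "fetched");
        List<String> seeds = Arrays.asList("http://a.example/", "", "http://b.example/");
        System.out.println(inject(seeds, crawlDb));
    }
}
```

Because the grouped output is sorted by URL, it also has the key ordering a MapFile needs, which is relevant to the first open point in the comment.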