[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176328#comment-15176328 ]

ASF GitHub Bot commented on NUTCH-2184:
---------------------------------------

Github user sebastian-nagel commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/95#discussion_r54779042
  
    --- Diff: src/java/org/apache/nutch/indexer/IndexerMapReduce.java ---
    @@ -166,235 +145,310 @@ private String filterUrl(String url) {
         return url;
       }
     
    -  public void map(Text key, Writable value,
    -      OutputCollector<Text, NutchWritable> output, Reporter reporter)
    -          throws IOException {
    +  /**
    +   * Implementation of {@link org.apache.hadoop.mapred.Mapper}
    +   * which optionally normalizes then filters a URL before simply
    +   * collecting key and values with the keys being URLs (manifested
    +   * as {@link org.apache.hadoop.io.Text}) and the 
    +   * values as {@link org.apache.nutch.crawl.NutchWritable} instances
    +   * of {@link org.apache.nutch.crawl.CrawlDatum}.
    +   */
    +  public static class IndexerMapReduceMapper implements Mapper<Text, 
Writable, Text, NutchWritable> {
    +
    +    @Override
    +    public void configure(JobConf job) {
    +    }
    +
    +    public void map(Text key, Writable value,
    +        OutputCollector<Text, NutchWritable> output, Reporter reporter)
    +            throws IOException {
    +
    +      String urlString = filterUrl(normalizeUrl(key.toString()));
    +      if (urlString == null) {
    +        return;
    +      } else {
    +        key.set(urlString);
    +      }
    +
    +      output.collect(key, new NutchWritable(value));
    +    }
     
    -    String urlString = filterUrl(normalizeUrl(key.toString()));
    -    if (urlString == null) {
    -      return;
    -    } else {
    -      key.set(urlString);
    +    @Override
    +    public void close() throws IOException {
         }
     
    -    output.collect(key, new NutchWritable(value));
       }
     
    -  public void reduce(Text key, Iterator<NutchWritable> values,
    -      OutputCollector<Text, NutchIndexAction> output, Reporter reporter)
    -          throws IOException {
    -    Inlinks inlinks = null;
    -    CrawlDatum dbDatum = null;
    -    CrawlDatum fetchDatum = null;
    -    Content content = null;
    -    ParseData parseData = null;
    -    ParseText parseText = null;
    -
    -    while (values.hasNext()) {
    -      final Writable value = values.next().get(); // unwrap
    -      if (value instanceof Inlinks) {
    -        inlinks = (Inlinks) value;
    -      } else if (value instanceof CrawlDatum) {
    -        final CrawlDatum datum = (CrawlDatum) value;
    -        if (CrawlDatum.hasDbStatus(datum)) {
    -          dbDatum = datum;
    -        } else if (CrawlDatum.hasFetchStatus(datum)) {
    -          // don't index unmodified (empty) pages
    -          if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
    -            fetchDatum = datum;
    +  /**
    +   * Implementation of {@link org.apache.hadoop.mapred.Reducer}
    +   * which generates {@link org.apache.nutch.indexer.NutchIndexAction}'s
    +   * from combinations of various Nutch data structures. Essentially 
    +   * teh result is a key representing a URL and a value representing a
    --- End diff --
    
    typo teh -> the


> Enable IndexingJob to function with no crawldb
> ----------------------------------------------
>
>                 Key: NUTCH-2184
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2184
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.12
>
>         Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 
> 'lose' data structures which are currently considered critical, e.g. 
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no 
> accompanying crawldb or linkdb. 
> Absence of the latter is OK, as the linkdb is optional; however, currently in 
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
>  the crawldb is mandatory. 
> This ticket should enhance the IndexerMapReduce code to support the use case 
> where you ONLY have segments and want to force an index for every record 
> present.
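
The gating change the ticket asks for can be sketched in plain Java (Hadoop types omitted for brevity; the method name `shouldIndex` and the `segmentsOnly` switch are illustrative, not the actual patch):

```java
/**
 * Minimal sketch of the indexing decision described in NUTCH-2184
 * (hypothetical names, not the actual patch).
 */
public class IndexGateSketch {

  /**
   * Decide whether a record should be emitted for indexing.
   *
   * @param hasDbDatum    true if a crawldb entry accompanies the record
   * @param hasFetchDatum true if the segment contains a fetch entry for it
   * @param segmentsOnly  hypothetical switch for the segments-only use case:
   *                      index every fetched record even without a crawldb
   */
  public static boolean shouldIndex(boolean hasDbDatum, boolean hasFetchDatum,
      boolean segmentsOnly) {
    if (!hasFetchDatum) {
      return false;       // nothing was fetched, so there is nothing to index
    }
    if (segmentsOnly) {
      return true;        // segments-only mode: force-index every fetched record
    }
    return hasDbDatum;    // current behaviour: a crawldb entry is mandatory
  }
}
```

With `segmentsOnly` off this matches today's requirement that the crawldb datum be present; switching it on reproduces the requested "index everything in the segments" behaviour.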



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
