[ http://issues.apache.org/jira/browse/NUTCH-371?page=comments#action_12440012 ]

Jim Kellerman commented on NUTCH-371:
-------------------------------------
Let me copy my comments from NUTCH-380 to here to explain why I linked it to this issue:

In Hadoop 0.6, JobConf.setInputKeyClass and JobConf.setInputValueClass are deprecated because the interface org.apache.hadoop.mapred.RecordReader has two new methods:

  /**
   * Create an object of the appropriate type to be used as a key.
   * @return a new key object
   */
  WritableComparable createKey();

  /**
   * Create an object of the appropriate type to be used as the value.
   * @return a new value object
   */
  Writable createValue();

This means that the key class and the value class need to be instantiable. Making IndexDoc instantiable is not a big deal because it is always the same. However, the key class is sometimes a Text and sometimes an MD5Hash in DeleteDuplicates2 (or, in DeleteDuplicates, sometimes a UTF8 and sometimes a HashScore). Since DeleteDuplicates(2).dedup knows what the key is for each phase, how about making two separate instantiable classes, one for each key class? If sharing the code is that important, they can delegate to the static class. (A rough sketch follows the quoted issue below.)

> DeleteDuplicates should remove documents with duplicate URLs
> ------------------------------------------------------------
>
>                 Key: NUTCH-371
>                 URL: http://issues.apache.org/jira/browse/NUTCH-371
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>            Reporter: Chris Schneider
>         Attachments: patch.txt
>
> DeleteDuplicates is supposed to delete documents with duplicate URLs (after deleting documents with identical MD5 hashes), but this part is apparently not yet implemented. Here's the comment from DeleteDuplicates.java:
>
>   // 2. map indexes -> <<url, fetchdate>, <index,doc>>
>   //    partition by url
>   //    reduce, deleting all but most recent.
>   //
>   // Part 2 is not yet implemented, but the Indexer currently only indexes one
>   // URL per page, so this is not a critical problem.
>
> It is also known that re-fetching the same URL (e.g., one month later) will result in more than one document with the same URL (this is alluded to in NUTCH-95), but the comment above suggests that the indexer will solve the problem before DeleteDuplicates runs, because it will only index one document per URL.
>
> This is not necessarily the case if the segments are to be divided among search servers, as each server will have its own index built from its own portion of the segments. If the URL in question was fetched in different segments, and those segments end up assigned to different search servers, then the indexer can't be relied on to eliminate the duplicates. It therefore seems that the second part of the DeleteDuplicates algorithm (i.e., deleting documents with duplicate URLs) needs to be implemented. I agree with Byron and Andrzej that the most recently fetched document (rather than the one with the highest score) should be preserved.
>
> Finally, it's also possible to get duplicate URLs in the segments without re-fetching an expired URL in the crawldb. This can happen if three different URLs all redirect to the same target URL. This is yet another consequence of handling redirections immediately, rather than adding the target URL to the crawldb for fetching in some subsequent segment (see NUTCH-273).
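For concreteness, here is a rough sketch of the delegation suggested above, written against the Hadoop 0.6-era mapred API. The class names (DedupRecordReaderBase, HashKeyReader, UrlKeyReader) are hypothetical, and IndexDoc (from the attached patch) is assumed to have gained a public no-arg constructor:

  import org.apache.hadoop.io.MD5Hash;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.io.WritableComparable;

  /**
   * Shared base for the phase-specific readers. The common record-reading
   * code can live here or be delegated to the existing static class; only
   * the key type varies between phases. The remaining RecordReader methods
   * (next, getPos, close) are omitted from this sketch.
   */
  public abstract class DedupRecordReaderBase {

    /** Each phase-specific subclass creates its own key type. */
    public abstract WritableComparable createKey();

    /**
     * The value class never varies, so one implementation suffices.
     * IndexDoc comes from the attached patch and is assumed to have
     * a public no-arg constructor.
     */
    public Writable createValue() {
      return new IndexDoc();
    }
  }

  /** Reader for the phase keyed by content hash. */
  class HashKeyReader extends DedupRecordReaderBase {
    public WritableComparable createKey() {
      return new MD5Hash();
    }
  }

  /** Reader for the phase keyed by URL. */
  class UrlKeyReader extends DedupRecordReaderBase {
    public WritableComparable createKey() {
      return new Text();
    }
  }

dedup would then hand each job (via its InputFormat) the reader matching its phase, and the shared record-reading logic stays in one place.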

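Similarly, a minimal sketch of what the unimplemented "part 2" reduce step from the quoted description could look like. This variant keys by URL alone and compares fetch dates inside the reducer, rather than using the <<url, fetchdate>> composite key from the code comment; the getFetchDate() accessor on IndexDoc is hypothetical:

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  /**
   * Values arrive grouped by URL. Every document collected here is an
   * older duplicate slated for deletion; the most recently fetched
   * document per URL is held back and survives.
   */
  public class UrlDedupReducer extends MapReduceBase implements Reducer {

    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter)
      throws IOException {
      IndexDoc latest = null;
      while (values.hasNext()) {
        IndexDoc doc = (IndexDoc) values.next();
        if (latest == null || doc.getFetchDate() > latest.getFetchDate()) {
          if (latest != null) {
            output.collect(key, latest); // displaced by a newer fetch: delete
          }
          latest = doc;                  // note: may need a copy if the
                                         // framework reuses value objects
        } else {
          output.collect(key, doc);      // older duplicate: delete it
        }
      }
      // 'latest' is deliberately not collected; it is the survivor.
    }
  }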