[ 
https://issues.apache.org/jira/browse/NUTCH-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819947#comment-13819947
 ] 

Markus Jelsma commented on NUTCH-1324:
--------------------------------------

Hi Julien, no, this is something else. The DupeDB is a <DupeDatum,Text> 
database where the DupeDatum is a compound type made up of the digest, the URL 
path section, and the domain. The Text is the host part of the URL. The DupeDB 
is generated by reading the CrawlDB. It is then ingested by NUTCH-1326, 
together with NUTCH-1325, to output rules for NUTCH-1319.
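
To make the record layout concrete, here is a minimal Python sketch of how one 
CrawlDB record (URL plus digest) could map to a DupeDB entry. It is not the 
actual Nutch code, which is Java. The names naive_domain and dupe_entry are 
hypothetical, and taking the last two host labels as the domain is an 
assumption that breaks for suffixes such as co.uk. Using only the URL path 
(without the query string) as the path section is also an assumption.

```python
from collections import namedtuple
from urllib.parse import urlsplit

# Compound key as described above: digest, URL path section, domain.
DupeDatum = namedtuple("DupeDatum", ["digest", "path", "domain"])


def naive_domain(host):
    # Assumption: the domain is the last two labels of the host.
    # A real implementation would need a public suffix list.
    return ".".join(host.split(".")[-2:])


def dupe_entry(url, digest):
    # Returns a (key, value) pair: (DupeDatum, host).
    parts = urlsplit(url)
    host = parts.hostname
    key = DupeDatum(digest, parts.path or "/", naive_domain(host))
    return key, host
```

Two URLs that differ only in host, for example with and without www, end up 
with the same key and different values.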

All of these tools exist to solve the duplicate host problem in the CrawlDB by 
using a HostNormalizer. We crawled the internet (without filtering rules) for 
over a year. We quickly saw the fetcher fetching the same pages from the same 
domains over and over. The most typical host duplication is a website 
accessible over both http://www.example.org/ and http://example.org/. This 
means twice as many unique URLs for many domains. You cannot use manual URL 
filters to solve the problem, nor can you manually edit the HostNormalizer on 
this scale.

These tools generate the normalization rules automatically.

Here's an example of two DupeDB entries for the common www problem. The first 
three columns make up the DupeDatum, which is the key in M/R; the rightmost 
column is the host:
a218daf4a39ed75b24d977bb90394a11        /grande-bretagne-c-248.html     annuaire-loisirs-seniors.fr     annuaire-loisirs-seniors.fr
a218daf4a39ed75b24d977bb90394a11        /grande-bretagne-c-248.html     annuaire-loisirs-seniors.fr     www.annuaire-loisirs-seniors.fr

Here's a more interesting problem:
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz       znacky.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz       siku-farmer.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz       impag.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz       koleje.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz       lifetime.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz       penove-dekorace.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz       grand.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz       maxi.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3        /grand/ katalog-hracek.cz       groovy-pets.katalog-hracek.cz
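
Since the DupeDatum is the M/R key, all hosts that serve the same digest on 
the same path within one domain arrive together. A rough Python sketch of that 
grouping step follows. It is an illustration only, not the NUTCH-1326 code; 
group_hosts is a hypothetical name and the entries are plain (key, host) 
tuples.

```python
from collections import defaultdict


def group_hosts(entries):
    # entries: iterable of ((digest, path, domain), host) pairs.
    groups = defaultdict(set)
    for key, host in entries:
        groups[key].add(host)
    # Keep only keys seen on more than one host: these are the
    # candidate duplicate hosts.
    return {key: sorted(hosts) for key, hosts in groups.items() if len(hosts) > 1}
```

For the www example above this yields one key with both 
annuaire-loisirs-seniors.fr and www.annuaire-loisirs-seniors.fr as hosts. 
How a group is turned into a HostNormalizer rule is left out here.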



> DupeDB for Nutch
> ----------------
>
>                 Key: NUTCH-1324
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1324
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.9
>
>
> A DupeDB for Nutch and associated tools to create and read a database 
> containing information on duplicates.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
