[ https://issues.apache.org/jira/browse/NUTCH-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819947#comment-13819947 ]
Markus Jelsma commented on NUTCH-1324:
--------------------------------------
Hi Julien, no, this is something else. The DupeDB is a <DupeDatum,Text>
database in which the DupeDatum is a compound type made up of the digest, the
URL path section and the domain. The Text is the host part of the URL. The
DupeDB is generated by reading the CrawlDB. It is then ingested by NUTCH-1326
together with NUTCH-1325 to output rules for NUTCH-1319.
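To make the record layout concrete, here is a minimal sketch (in Python rather
than Nutch's Java, and not the actual patch) of how one CrawlDB record could be
turned into a <DupeDatum,Text> pair. The names DupeDatum, naive_domain and
dupe_entry are hypothetical, and taking the last two host labels as the domain
is a simplifying assumption; a real implementation would need proper
domain-suffix handling.

```python
from collections import namedtuple
from urllib.parse import urlsplit

# Hypothetical stand-in for the compound key: digest, URL path section, domain.
DupeDatum = namedtuple("DupeDatum", ["digest", "path", "domain"])


def naive_domain(host):
    # Assumption: the domain is the last two labels of the host.
    # This is wrong for suffixes such as .co.uk and is only for illustration.
    return ".".join(host.split(".")[-2:])


def dupe_entry(url, digest):
    """Build a (key, value) pair: the DupeDatum key and the host as value."""
    parts = urlsplit(url)
    host = parts.hostname
    key = DupeDatum(digest, parts.path or "/", naive_domain(host))
    return key, host


digest = "a218daf4a39ed75b24d977bb90394a11"
key_www, host_www = dupe_entry(
    "http://www.annuaire-loisirs-seniors.fr/grande-bretagne-c-248.html", digest)
key_bare, host_bare = dupe_entry(
    "http://annuaire-loisirs-seniors.fr/grande-bretagne-c-248.html", digest)
```

Both URLs produce the same key and differ only in the host value, which is
what lets duplicate hosts end up next to each other.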
All these things are for solving the duplicate host problem in the CrawlDB by
using a HostNormalizer. We crawled the internet (without filtering rules) for
over a year. We quickly saw the fetcher fetching the same pages from the same
domains over and over. The most typical host duplication is a website
accessible over both http://www.example.org/ and http://example.org/. This
means twice as many unique URLs for many domains. You cannot use manual URL
filters to solve the problem, nor can you manually edit the HostNormalizer on
this scale.
These tools make it happen automatically.
Here's an example of two DupeDB entries for the common www-problem. The first
three columns make up the DupeDatum and the rightmost column is the host. The
DupeDatum is the key in M/R:
a218daf4a39ed75b24d977bb90394a11  /grande-bretagne-c-248.html  annuaire-loisirs-seniors.fr  annuaire-loisirs-seniors.fr
a218daf4a39ed75b24d977bb90394a11  /grande-bretagne-c-248.html  annuaire-loisirs-seniors.fr  www.annuaire-loisirs-seniors.fr
Here's a more interesting problem:
c3b15e9f207aaf48dde67aa8fa6a53a3  /grand/  katalog-hracek.cz  znacky.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3  /grand/  katalog-hracek.cz  siku-farmer.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3  /grand/  katalog-hracek.cz  impag.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3  /grand/  katalog-hracek.cz  koleje.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3  /grand/  katalog-hracek.cz  lifetime.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3  /grand/  katalog-hracek.cz  penove-dekorace.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3  /grand/  katalog-hracek.cz  grand.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3  /grand/  katalog-hracek.cz  maxi.katalog-hracek.cz
c3b15e9f207aaf48dde67aa8fa6a53a3  /grand/  katalog-hracek.cz  groovy-pets.katalog-hracek.cz
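Because the DupeDatum is the M/R key, all hosts serving the same content at the
same path within one domain arrive together. The following is only a rough
sketch of that grouping effect in plain Python, not the actual reducer; the
function name group_hosts is hypothetical, and how the rules are derived from
each group (NUTCH-1325/NUTCH-1326) is not shown.

```python
from collections import defaultdict


def group_hosts(entries):
    """Group host values by their (digest, path, domain) key.

    Only keys seen with more than one distinct host are returned,
    since a single host cannot be a duplicate of anything.
    """
    groups = defaultdict(set)
    for key, host in entries:
        groups[key].add(host)
    return {key: sorted(hosts) for key, hosts in groups.items()
            if len(hosts) > 1}


www_key = ("a218daf4a39ed75b24d977bb90394a11",
           "/grande-bretagne-c-248.html",
           "annuaire-loisirs-seniors.fr")
sub_key = ("c3b15e9f207aaf48dde67aa8fa6a53a3", "/grand/", "katalog-hracek.cz")

entries = [
    (www_key, "annuaire-loisirs-seniors.fr"),
    (www_key, "www.annuaire-loisirs-seniors.fr"),
    (sub_key, "znacky.katalog-hracek.cz"),
    (sub_key, "grand.katalog-hracek.cz"),
    (sub_key, "maxi.katalog-hracek.cz"),
]

duplicates = group_hosts(entries)
```

Each value in duplicates is then a candidate set of hosts that a
HostNormalizer rule could collapse into one.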
> DupeDB for Nutch
> ----------------
>
> Key: NUTCH-1324
> URL: https://issues.apache.org/jira/browse/NUTCH-1324
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.9
>
>
> A DupeDB for Nutch and associated tools to create and read a database
> containing information on duplicates.