ML mail wrote:
I am going to go ahead and commit this code. If we still see errors popping up, we will need to revisit it. Truthfully, though, I don't know how an error such as you are describing could happen, as all this filter does is match against domain suffix, domain name, and hostname. There is no regex matching involved.
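For reference, the filter file driving this would look something like the following (a sketch based on my reading of the patch; one entry per line, comments starting with #):

    # domain-urlfilter.txt
    # A URL is accepted if its domain suffix, domain name, or
    # hostname matches an entry; all other URLs are filtered out.

    # domain suffix: allow any host under .be
    be

    # domain name: allow adobe.com and its subdomains
    adobe.com

    # exact hostname
    www.example.org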

Could it be possible that redirected URLs are not caught by the URLFilter? For example, www.adobe.be is a redirect to www.adobe.com/be. That could explain why some .com and other TLDs still get indexed...

Yeah, that might be it. I think the URL filter runs before the fetch of the URL and doesn't interact with redirects. I will look into that more.
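One way to check would be to run a redirect target through the filter by hand and see whether it would have been rejected. A minimal sketch in Java (the DomainURLFilter package name and setup are my reading of the patch; adjust to your checkout):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.urlfilter.domain.DomainURLFilter;
    import org.apache.nutch.util.NutchConfiguration;

    public class RedirectFilterCheck {
      public static void main(String[] args) {
        // Loads nutch-default.xml plus nutch-site.xml from the classpath.
        Configuration conf = NutchConfiguration.create();
        DomainURLFilter filter = new DomainURLFilter();
        filter.setConf(conf);

        // www.adobe.be redirects to www.adobe.com/be; the target is what
        // actually gets fetched, so check the target against the filter.
        // filter() returns the URL if accepted, null if rejected.
        System.out.println(filter.filter("http://www.adobe.com/be"));
      }
    }

If this prints null but the page still shows up in the index, then redirect targets really are bypassing the filter.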

I know there are changes in the CrawlDb structure and metadata. I don't know why there would be that big of a difference unless you

1) set the max content fetched higher

I have "http.content.limit" set to 131072, as I had in Nutch 0.9 (both of my overrides are shown together in the nutch-site.xml snippet after point 3).

2) you were just fetching a set of pages that had more content

Unfortunately no, I took the same base of URLs as before, which is simply an extract of DMOZ for domains ending in .be.

Is it the exact same extract or is it an updated one?


3) you set redirects to > 0, so redirects were fetched and you were actually fetching a lot more pages.

"http.redirect.max" is set to 3 such as I had in nutch 0.9.

The only difference in my config setup for Nutch 1.0 is that I did all my modifications in conf/nutch-site.xml instead of changing the parameters in conf/nutch-default.xml. Could it be possible that Nutch didn't take nutch-site.xml into account?


AFAIK that shouldn't cause those types of changes.
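A quick way to verify that the nutch-site.xml overrides are actually being picked up is to print the effective values that Nutch sees; a minimal sketch (the class name is just for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class ShowEffectiveConf {
      public static void main(String[] args) {
        // NutchConfiguration.create() layers nutch-site.xml over
        // nutch-default.xml, so these are the values Nutch will use.
        Configuration conf = NutchConfiguration.create();
        System.out.println("http.content.limit = " + conf.get("http.content.limit"));
        System.out.println("http.redirect.max = " + conf.get("http.redirect.max"));
      }
    }

If this prints the stock defaults instead of 131072 and 3, the conf directory holding nutch-site.xml isn't on the classpath.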

Dennis
Thanks again for your input!
Best regards



--- On Sat, 12/13/08, Dennis Kubes <[email protected]> wrote:
From: Dennis Kubes <[email protected]>
Subject: Updated Domain URLFilter
To: [email protected]
Date: Saturday, December 13, 2008, 8:57 AM
An updated patch has been added for the domain urlfilter. This now includes matching against domain suffix, domain name, and hostname, in that order.

Dennis



