> I am going to go ahead and commit this code.  If we still
> see errors popping up we will need to revisit it. 
> Truthfully though I don't know how an error such as you
> are describing could happen as all this filter does is match
> suffix, domain name, and hostname.  There is no regex stuff
> happening.

Could it be possible that redirected URLs are not caught by the URLFilter? For 
example, www.adobe.be is a redirect to www.adobe.com/be. That could explain why 
some .com and other TLDs still get indexed...
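One quick way I could check this (just a sketch, assuming I am reading the 
Nutch 1.0 URLFilters API correctly) would be to run both the seed URL and its 
redirect target through the configured filter chain and see whether the target 
would even survive filtering:

    // Sketch only: assumes the Nutch 1.0 jars and the conf/ directory are on the classpath.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilters;
    import org.apache.nutch.util.NutchConfiguration;

    public class FilterCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();  // loads nutch-default.xml + nutch-site.xml
        URLFilters filters = new URLFilters(conf);          // the filter chain enabled via plugin.includes

        String[] urls = {
          "http://www.adobe.be/",     // seed URL, should pass a .be domain filter
          "http://www.adobe.com/be"   // redirect target, host no longer ends in .be
        };
        for (String url : urls) {
          // filter() returns null when any filter in the chain rejects the URL
          System.out.println(url + " -> "
              + (filters.filter(url) == null ? "rejected" : "accepted"));
        }
      }
    }

If the second URL comes back rejected, the filter itself would seem fine, and 
the real question becomes whether the fetcher applies the filters to redirect 
targets at all when http.redirect.max > 0.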

 
> I know there are changes in the CrawlDb structure and
> metadata.  Don't know why there would be that big of a
> difference unless you

> 1) set the max content fetched higher

I have "http.content.limit" set to 131072, the same value I had in Nutch 0.9.

> 2) you were just fetching a set of pages that had more
> content

Unfortunately no, I took the same base of URLs as before, which is simply an 
extract of DMOZ for domains ending in .be.

> 3) you set redirects to > 0 so redirects were fetched
> and you were actually fetching a lot more pages.

"http.redirect.max" is set to 3 such as I had in nutch 0.9.

The only difference in my config setup for Nutch 1.0 is that I made all my 
modifications in conf/nutch-site.xml instead of changing the parameters in 
conf/nutch-default.xml. Could it be possible that Nutch didn't take 
nutch-site.xml into account?
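For reference, the relevant part of my conf/nutch-site.xml looks roughly like 
this (paraphrased from memory, with the values mentioned above); as far as I 
understand, properties set here should override conf/nutch-default.xml as long 
as the conf directory is on the classpath:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>http.content.limit</name>
        <value>131072</value>
      </property>
      <property>
        <name>http.redirect.max</name>
        <value>3</value>
      </property>
    </configuration>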

Thanks again for your input!
Best regards



> > --- On Sat, 12/13/08, Dennis Kubes <[email protected]> wrote:
> >
> >> From: Dennis Kubes <[email protected]>
> >> Subject: Updated Domain URLFilter
> >> To: [email protected]
> >> Date: Saturday, December 13, 2008, 8:57 AM
> >>
> >> An updated patch has been added for the domain urlfilter. This now
> >> includes the matching against domain suffix, domain name, and
> >> hostname in that order.
> >>
> >> Dennis