> I am going to go ahead and commit this code. If we still
> see errors popping up we will need to revisit it.
> Truthfully though I don't know how an error such as you
> are describing could happen as all this filter does is match
> suffix, domain name, and hostname. There is no regex stuff
> happening.
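Just to make sure I understand what the filter is doing, I picture it as plain string
comparisons along these lines (this is only my own sketch with made-up names, not the
code from the patch, and it ignores multi-label suffixes like co.uk):

    import java.net.URL;
    import java.util.Set;

    // Sketch only: "entries" stands in for whatever the filter file contains
    // (suffixes like "be", domain names like "adobe.be", or full hostnames
    // like "www.adobe.be"), all lower-cased.
    public class DomainFilterSketch {

        private final Set<String> entries;

        public DomainFilterSketch(Set<String> entries) {
            this.entries = entries;
        }

        // Returns the URL if its suffix, domain name, or hostname matches an
        // entry; null means the URL is filtered out.
        public String filter(String urlString) {
            try {
                String host = new URL(urlString).getHost().toLowerCase();
                int lastDot = host.lastIndexOf('.');
                String suffix = lastDot < 0 ? host : host.substring(lastDot + 1);     // "be"
                int prevDot = lastDot < 0 ? -1 : host.lastIndexOf('.', lastDot - 1);
                String domain = prevDot < 0 ? host : host.substring(prevDot + 1);     // "adobe.be"
                boolean match = entries.contains(suffix)
                             || entries.contains(domain)
                             || entries.contains(host);
                return match ? urlString : null;
            } catch (Exception e) {
                return null;   // malformed URL: drop it
            }
        }
    }

If that is roughly right, then my question is really about which URLs the filter
actually gets applied to: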
Could it be possible that redirected URLs are not caught by the URLFilter? For example, www.adobe.be is a redirect to www.adobe.com/be. That could explain why some .com and other TLDs still get indexed...

> I know there are changes in the CrawlDb structure and
> metadata. Don't know why there would be that big of a
> difference unless you
> 1) set the max content fetched higher

I have "http.content.limit" set to 131072, as I had in Nutch 0.9.

> 2) you were just fetching a set of pages that had more
> content

Unfortunately no, I took the same base of URLs as before, which is simply an extract of dmoz for domains ending in .be.

> 3) you set redirects to > 0 so redirects were fetched
> and you were actually fetching a lot more pages.

"http.redirect.max" is set to 3, as I had in Nutch 0.9. The only difference in my config setup for Nutch 1.0 is that I made all my modifications in conf/nutch-site.xml instead of changing the parameters in conf/nutch-default.xml. Could it be possible that Nutch didn't take nutch-site.xml into account? (See the small check at the end of this mail for how I plan to verify that.)

Thanks again for your input!

Best regards

> > --- On Sat, 12/13/08, Dennis Kubes
> > <[email protected]> wrote:
> >
> >> From: Dennis Kubes <[email protected]>
> >> Subject: Updated Domain URLFilter
> >> To: [email protected]
> >> Date: Saturday, December 13, 2008, 8:57 AM
> >> An updated patch has been added for the domain
> >> urlfilter. This now includes the matching against domain
> >> suffix, domain name, and hostname in that order.
> >>
> >> Dennis
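P.S. To rule out the nutch-site.xml question above, I plan to run a quick check along
these lines (again just a sketch; it assumes the NutchConfiguration helper class that
ships with the Nutch 1.0 codebase, and CheckConf is a throwaway name of mine):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class CheckConf {
        public static void main(String[] args) {
            // Should load nutch-default.xml and then nutch-site.xml from the classpath.
            Configuration conf = NutchConfiguration.create();
            // -1 is only a fallback in case the properties are missing entirely.
            System.out.println("http.content.limit = " + conf.getInt("http.content.limit", -1));
            System.out.println("http.redirect.max  = " + conf.getInt("http.redirect.max", -1));
        }
    }

If that does not print 131072 and 3, then my overrides in conf/nutch-site.xml are not
being picked up.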
