Thanks for the quick new patch. I have now tried it out, but the patch fails on src/java/org/apache/nutch/util/URLUtil.java, presumably because that file doesn't exist in my tree. I am using the stable Nutch 0.9 release, so I guess your patch only works against an SVN checkout?
We would prefer to use a stable release, but if there is no other choice, let me know and I will install an SVN release of Nutch and apply this patch.

Thanks
Regards

--- On Thu, 12/4/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:

> From: Dennis Kubes <[EMAIL PROTECTED]>
> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> To: nutch-user@lucene.apache.org
> Date: Thursday, December 4, 2008, 7:15 AM
>
> Try it again with the latest patch. You will need to build with Java 6
> if you are not doing so already. My plugin.includes looks like
> urlfilter-(domain|suffix|prefix)|... I did a clean build from svn,
> applied the 2nd patch and did a multistep fetch of apache.org. Looks
> like it works well for me. Let me know if you continue to have problems.
>
> Dennis
>
> ML mail wrote:
> > Dear Dennis,
> >
> > I have now applied this patch to my Nutch 0.9 (stable) installation
> > using "gpatch -p0 < patchfile" and changed the plugin.includes
> > parameter to include "urlfilter-(domain|suffix)" in the
> > nutch-default.xml file.
> >
> > I gave it a try with a fresh new test crawl and index, but it somehow
> > still indexes other top-level domains. The domain-urlfilter.txt file,
> > which I have located in the conf dir of Nutch, contains only "be" in
> > order to index only the .be TLD.
> >
> > After checking the hadoop.log file and grepping for urlfilter-domain,
> > I noticed that the plugin doesn't get loaded, as it never appears in
> > the log file. So I guess my problem is that it doesn't even load the
> > plugin.
> >
> > Based on my description, did I miss something? Or do I need to do
> > something else to get it working?
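For reference, a plugin is only loaded if its id matches the plugin.includes regex, which would explain urlfilter-domain never appearing in hadoop.log. A minimal sketch of the relevant property (the value below is illustrative, not the stock default; adjust it to the plugins actually installed):

```
<!-- nutch-site.xml (overrides nutch-default.xml); illustrative sketch -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(domain|suffix)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regex matched against plugin ids; if urlfilter-domain is
  not matched here, the plugin is never loaded and never logs anything.</description>
</property>
```

Overrides are normally placed in nutch-site.xml rather than edited into nutch-default.xml, so that upgrades don't silently revert them.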
> >
> > Thanks
> > Regards
> >
> > --- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >
> >> From: Dennis Kubes <[EMAIL PROTECTED]>
> >> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >> To: nutch-user@lucene.apache.org
> >> Date: Tuesday, December 2, 2008, 5:47 PM
> >>
> >> Patch has been posted to JIRA for the DomainURLFilter plugin.
> >>
> >> https://issues.apache.org/jira/browse/NUTCH-668
> >>
> >> Dennis
> >>
> >> Dennis Kubes wrote:
> >>> Trying to get a patch posted tonight. Will probably be in the 1.0
> >>> release, yes.
> >>>
> >>> Dennis
> >>>
> >>> John Martyniak wrote:
> >>>> That sounds like a good feature. Will this be in the 1.0 release?
> >>>>
> >>>> -John
> >>>>
> >>>> On Dec 2, 2008, at 5:17 PM, Dennis Kubes wrote:
> >>>>
> >>>>> John Martyniak wrote:
> >>>>>> That will be awesome. Will there be a limit to the number of
> >>>>>> domains that can be included?
> >>>>>
> >>>>> Only on what can be stored in memory in a set. So technical limit
> >>>>> yes; practical limit, guessing a few million domains.
> >>>>>
> >>>>> Dennis
> >>>>>
> >>>>>> -John
> >>>>>> On Dec 2, 2008, at 3:27 PM, Dennis Kubes wrote:
> >>>>>>> I am in the process of writing a domain-urlfilter. It will
> >>>>>>> allow fetching only from a list of top level domains. Should
> >>>>>>> have a patch out shortly. Hopefully that will help you and
> >>>>>>> others who are wanting to verticalize Nutch.
> >>>>>>>
> >>>>>>> Dennis
> >>>>>>>
> >>>>>>> ML mail wrote:
> >>>>>>>> Dear Dennis
> >>>>>>>> Many thanks for your quick response. Now everything is clear
> >>>>>>>> and I understand why it didn't work...
> >>>>>>>> I will still use the urlfilter-regex plugin, as I would like
> >>>>>>>> to crawl only domains from a single top level domain, but as
> >>>>>>>> suggested I have added the urlfilter-suffix plugin to avoid
> >>>>>>>> indexing javascript pages. In the past I had already
> >>>>>>>> deactivated the parse-js plugin.
> >>>>>>>> So I am now looking forward to the next crawls being free of
> >>>>>>>> stupid file formats like js ;-)
> >>>>>>>> Greetings
> >>>>>>>>
> >>>>>>>> --- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>>>>>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>>>>>>> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >>>>>>>>> To: nutch-user@lucene.apache.org
> >>>>>>>>> Date: Tuesday, December 2, 2008, 8:50 AM
> >>>>>>>>>
> >>>>>>>>> ML mail wrote:
> >>>>>>>>>> Hello,
> >>>>>>>>>>
> >>>>>>>>>> I would definitely like not to index any javascript pages,
> >>>>>>>>>> meaning any pages ending with ".js". For this purpose I
> >>>>>>>>>> simply edited the crawl-urlfilter.txt file and changed the
> >>>>>>>>>> default list of suffixes not to be parsed, adding the .js
> >>>>>>>>>> extension so that it now looks like this:
> >>>>>>>>>>
> >>>>>>>>>> # skip image and other suffixes we can't yet parse
> >>>>>>>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
> >>>>>>>>>
> >>>>>>>>> The easiest way IMO is to use the prefix and suffix
> >>>>>>>>> urlfilters instead of the regex urlfilter. Change
> >>>>>>>>> plugin.includes and replace urlfilter-regex with
> >>>>>>>>> urlfilter-(prefix|suffix). Then in the suffix-urlfilter.txt
> >>>>>>>>> file add .js under .css in web formats. Also change
> >>>>>>>>> plugin.includes from parse-(text|html|js) to parse-(text|html).
> >>>>>>>>>
> >>>>>>>>>> Unfortunately I noticed that javascript pages are still
> >>>>>>>>>> getting indexed. So what does this mean exactly? Is
> >>>>>>>>>> crawl-urlfilter.txt not working? Did I miss something?
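Beyond the config-file mix-up Dennis explains below this quote, one common way a `$`-anchored suffix rule like the one above can miss a URL is a query string after the extension. A quick check of that regex behavior (Python here just to illustrate; the pattern is abbreviated to the relevant suffixes):

```python
import re

# Abbreviated form of the suffix rule from crawl-urlfilter.txt.
pattern = re.compile(r"\.(gif|css|js)$")

# Matches a plain .js URL (the rule would exclude it)...
print(bool(pattern.search("http://example.com/script.js")))

# ...but not one with a query string, so such a URL passes the filter.
print(bool(pattern.search("http://example.com/script.js?v=2")))
```

The stock filter files usually pair such suffix rules with a separate rule rejecting URLs containing `?`, which closes this gap.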
> >>>>>>>>>> I was also wondering, what is the difference between these
> >>>>>>>>>> two files:
> >>>>>>>>>> crawl-urlfilter.txt
> >>>>>>>>>> regex-urlfilter.txt
> >>>>>>>>>> ?
> >>>>>>>>>
> >>>>>>>>> The crawl-urlfilter.txt file is used by the crawl command.
> >>>>>>>>> The regex, suffix, prefix, and other urlfilter files and
> >>>>>>>>> plugins are used when calling commands manually in various
> >>>>>>>>> tools.
> >>>>>>>>>
> >>>>>>>>> Dennis
> >>>>>>>>>
> >>>>>>>>>> Many thanks
> >>>>>>>>>> Regards
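Since the crawl command reads crawl-urlfilter.txt, the single-TLD restriction discussed in this thread can also be expressed there with regex rules. A hypothetical fragment (rules are evaluated top to bottom and the first match wins; `+` accepts, `-` rejects):

```
# crawl-urlfilter.txt (hypothetical fragment): crawl only the .be TLD
# skip file suffixes we don't want indexed
-\.(gif|jpg|png|css|js)$
# accept only hosts whose name ends in .be
+^http://([a-z0-9-]+\.)*[a-z0-9-]+\.be/
# reject everything else
-.
```

The final catch-all `-.` is what makes the filter a whitelist; without it, URLs matching no rule would be accepted by default in most urlfilter setups.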