Wow, this is a great idea! Thank you, Dennis! Let us all know when it's available...
Greetings

--- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:

> From: Dennis Kubes <[EMAIL PROTECTED]>
> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> To: [email protected]
> Date: Tuesday, December 2, 2008, 12:27 PM
>
> I am in the process of writing a domain-urlfilter. It will allow
> fetching only from a list of top level domains. Should have a patch out
> shortly. Hopefully that will help you and others who are wanting to
> verticalize nutch.
>
> Dennis
>
> ML mail wrote:
> > Dear Dennis
> >
> > Many thanks for your quick response. Now everything is clear and I
> > understand why it didn't work...
> >
> > I will still use the urlfilter-regex plugin as I would like to crawl
> > only domains from a single top level domain but as suggested I have
> > added the urlfilter-suffix plugin to avoid indexing javascript pages.
> > In the past I already had deactivated the parse-js plugin.
> >
> > So I am now looking forward to the next crawls being freed of stupid
> > file formats like js ;-)
> >
> > Greetings
> >
> > --- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >
> >> From: Dennis Kubes <[EMAIL PROTECTED]>
> >> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >> To: [email protected]
> >> Date: Tuesday, December 2, 2008, 8:50 AM
> >>
> >> ML mail wrote:
> >>> Hello,
> >>>
> >>> I would definitely like not to index any javascript pages, this
> >>> means any pages ending with ".js". So for this purpose I simply
> >>> edited the crawl-urlfilter.txt file and changed the default suffix
> >>> list not to be parsed to add the .js extension so that it looks
> >>> like this now:
> >>>
> >>> # skip image and other suffixes we can't yet parse
> >>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
> >>
> >> The easiest way IMO is to use prefix and suffix urlfilters instead
> >> regex urlfilter. Change plugin.includes and replace urlfilter-regex
> >> with urlfilter-(prefix|suffix). Then in the suffix-urlfilter.txt
> >> file add .js under .css in web formats.
> >>
> >> Also change plugin.includes from parse-(text|html|js) to be
> >> parse-(text|html).
> >>
> >>> Unfortunately I noticed that javascript pages are still getting
> >>> indexed. So what does this exactly mean ? Is crawl-urlfilter.txt
> >>> not working ? Did I miss something maybe ?
> >>>
> >>> I was also wondering what is the difference between these two
> >>> files:
> >>>
> >>> crawl-urlfilter.txt
> >>> regex-urlfilter.txt
> >>
> >> crawl-urlfilter.txt file is used by the crawl command. The regex,
> >> suffix, prefix, and other urlfilter files and plugins are used when
> >> calling commands manually in various tools.
> >>
> >> Dennis
> >>
> >>> ?
> >>>
> >>> Many thanks
> >>> Regards
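
For anyone who wants to check the skip rule quoted above outside of Nutch, here is a minimal Python sketch. It assumes the pattern is applied as a partial match anywhere in the URL, which is what Python's re.search does; this is only an approximation for testing the expression, not the Nutch filter itself. The example.com URLs and the is_skipped helper are made up for illustration.

```python
import re

# Pattern copied from the quoted crawl-urlfilter.txt rule, without the
# leading "-" (which marks the rule as a reject rule).
SKIP_RULE = (
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|"
    r"gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$"
)


def is_skipped(url: str) -> bool:
    """Return True if the reject pattern matches somewhere in the URL.

    Hypothetical helper: re.search stands in for the filter's matching,
    which is an assumption of this sketch.
    """
    return re.search(SKIP_RULE, url) is not None


if __name__ == "__main__":
    # Hypothetical URLs, chosen only to exercise the pattern.
    for url in (
        "http://example.com/static/app.js",
        "http://example.com/index.html",
        "http://example.com/static/app.js?v=2",
    ):
        print(url, "->", "skipped" if is_skipped(url) else "kept")
```

Under that assumption the expression does match plain .js URLs, which fits Dennis's explanation that the issue was which filter file gets read, not the rule. Note that the trailing $ means a URL such as app.js?v=2 would not match, since it no longer ends in .js.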
