Re: How to effectively stop indexing javascript pages ending with .js

ML mail Tue, 02 Dec 2008 12:43:42 -0800

Wow, this is a great idea! Thank you, Dennis!

Let us all know when it's available...


Greetings


--- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:

> From: Dennis Kubes <[EMAIL PROTECTED]>
> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> To: [email protected]
> Date: Tuesday, December 2, 2008, 12:27 PM
> I am in the process of writing a domain-urlfilter.  It will allow
> fetching only from a list of top level domains.  Should have a patch
> out shortly.  Hopefully that will help you and others who are wanting
> to verticalize nutch.
> 
> Dennis
> 
> ML mail wrote:
> > Dear Dennis
> > 
> > Many thanks for your quick response. Now everything is clear and I
> > understand why it didn't work...
> > 
> > I will still use the urlfilter-regex plugin, as I would like to crawl
> > only domains from a single top-level domain, but as suggested I have
> > added the urlfilter-suffix plugin to avoid indexing JavaScript pages.
> > I had already deactivated the parse-js plugin in the past.
> > 
> > So I am now looking forward to the next crawls being free of stupid
> > file formats like js ;-)
> > 
> > Greetings 
> > 
> > 
> > --- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> > 
> >> From: Dennis Kubes <[EMAIL PROTECTED]>
> >> Subject: Re: How to effectively stop indexing javascript pages ending with .js
> >> To: [email protected]
> >> Date: Tuesday, December 2, 2008, 8:50 AM
> >> ML mail wrote:
> >>> Hello,
> >>>
> >>> I would like to avoid indexing any JavaScript pages, that is, any
> >>> pages ending with ".js". For this purpose I edited the
> >>> crawl-urlfilter.txt file and added the .js extension to the default
> >>> list of suffixes to skip, so that it now looks like this:
> >>>
> >>> # skip image and other suffixes we can't yet parse
> >>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
> >>
> >> The easiest way IMO is to use the prefix and suffix urlfilters
> >> instead of the regex urlfilter.  Change plugin.includes and replace
> >> urlfilter-regex with urlfilter-(prefix|suffix).  Then in the
> >> suffix-urlfilter.txt file add .js under .css in web formats.
> >>
> >> Also change plugin.includes from parse-(text|html|js) to be
> >> parse-(text|html).
> >>
> >>> Unfortunately I noticed that JavaScript pages are still getting
> >>> indexed. So what exactly does this mean? Is crawl-urlfilter.txt not
> >>> working? Did I miss something?
> >>>
> >>> I was also wondering what the difference is between these two files:
> >>> crawl-urlfilter.txt
> >>> regex-urlfilter.txt
> >> crawl-urlfilter.txt file is used by the crawl command.  The regex,
> >> suffix, prefix, and other urlfilter files and plugins are used when
> >> calling commands manually in various tools.
> >>
> >> Dennis
> >>> ?
> >>>
> >>> Many thanks
> >>> Regards
> >>>
> >>>
> >>>
> > 
> > 
> >
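
As a rough illustration of the skip rule quoted above, the sketch below applies the same pattern to a few made-up URLs. Nutch evaluates these rules with Java regular expressions; this sketch assumes the rule is applied as an unanchored search and that Python's `re` treats such a simple pattern the same way. The `is_skipped` helper and the example URLs are hypothetical, not part of Nutch.

```python
import re

# Pattern copied from the exclude rule quoted in the thread,
# minus the leading '-' that marks it as an exclusion.
SKIP_SUFFIXES = re.compile(
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz"
    r"|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$"
)


def is_skipped(url: str) -> bool:
    """Return True if the exclude pattern matches the URL (hypothetical helper)."""
    return SKIP_SUFFIXES.search(url) is not None


if __name__ == "__main__":
    # Made-up URLs, for illustration only.
    for url in (
        "http://example.com/app.js",
        "http://example.com/index.html",
        "http://example.com/app.js?v=2",
        "http://example.com/APP.JS",
    ):
        print(url, is_skipped(url))
```

Because the pattern is anchored with `$` and lists only the lowercase `js`, a URL such as `app.js?v=2` or `APP.JS` would not be caught by it. The rule does match plain `.js` URLs, which is consistent with the explanation in the thread that the problem was which filter file was being read, not the pattern.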

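The plugin.includes change suggested in the thread can be pictured with a small sketch. The plugin.includes value is a regular expression over plugin names; the sketch assumes a plugin is loaded only when its whole name matches. The `PLUGIN_INCLUDES` value is a hypothetical, shortened example rather than a complete Nutch configuration, and `is_included` is an illustrative helper.

```python
import re

# Hypothetical, abbreviated plugin.includes value after the suggested change:
# urlfilter-regex replaced by urlfilter-(prefix|suffix), parse-js dropped.
PLUGIN_INCLUDES = r"protocol-http|urlfilter-(prefix|suffix)|parse-(text|html)"


def is_included(plugin_id: str, includes: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin name matches the includes expression."""
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in ("urlfilter-suffix", "urlfilter-regex", "parse-html", "parse-js"):
        print(plugin_id, is_included(plugin_id))
```

With a value like this, urlfilter-regex and parse-js no longer match, so under the stated assumption they would not be loaded, while the prefix and suffix filters would be.
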