Re: How to effectively stop indexing javascript pages ending with .js

John Martyniak Tue, 02 Dec 2008 13:58:46 -0800

Thanks for the info, I will try it tonight.

Is there a practical limit on the size of the DB or the segments? I currently have 500K - 750K URLs, broken down into around 15 or 20 segments.


-John

On Dec 2, 2008, at 3:33 PM, Dennis Kubes wrote:

And you will need to use the filter option on both.

Dennis Kubes wrote:
John Martyniak wrote:
That is good information. Because I too have the same issue, I don't want the js files in the index.
But what if you already have a bunch of .js files in your segments and want to remove them from the index/segments. Is there any way to effectively do that as well?
I believe (but haven't tested) that if you change the urlfilters as discussed and then run mergedb and mergesegs commands giving only a single crawldb and segments as input, then those urls will be filtered out.
Dennis
-John
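The untested mergedb/mergesegs suggestion above could be sketched roughly as follows. This only builds and prints the two command lines; it does not run them. The paths are hypothetical, and the argument order is an assumption based on the tools' usage messages, so check those for your Nutch version before running anything.

```python
import shlex

# Hypothetical paths; adjust to your own crawl directory layout.
CRAWLDB = "crawl/crawldb"
SEGMENTS_DIR = "crawl/segments"


def merge_commands(out_db, out_segments):
    """Build the two merge commands, each with the filter option enabled."""
    return [
        # Assumed form: mergedb <output_crawldb> <crawldb> -filter
        ["bin/nutch", "mergedb", out_db, CRAWLDB, "-filter"],
        # Assumed form: mergesegs <output_dir> -dir <segments_dir> -filter
        ["bin/nutch", "mergesegs", out_segments, "-dir", SEGMENTS_DIR, "-filter"],
    ]


# Print the commands for review instead of executing them.
for cmd in merge_commands("crawl/crawldb_filtered", "crawl/segments_filtered"):
    print(shlex.join(cmd))
```

Writing to new output directories rather than over the existing crawldb and segments keeps the originals intact in case the filters remove more than intended.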

On Dec 2, 2008, at 12:56 PM, ML mail wrote:
Dear Dennis
Many thanks for your quick response. Now everything is clear and I understand why it didn't work...
I will still use the urlfilter-regex plugin as I would like to crawl only domains from a single top level domain but as suggested I have added the urlfilter-suffix plugin to avoid indexing javascript pages. In the past I already had deactivated the parse-js plugin.
So I am now looking forward to the next crawls being freed of stupid file formats like js ;-)
Greetings


--- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: How to effectively stop indexing javascript pages ending with .js
To: [email protected]
Date: Tuesday, December 2, 2008, 8:50 AM
ML mail wrote:
Hello,

I would definitely like not to index any javascript
pages, this means any pages ending with ".js". So
for this purpose I simply edited the crawl-urlfilter.txt
file and changed the default suffix list not to be parsed to
add the .js extension so that it looks like this now:
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
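As a quick sanity check of the rule quoted above, here is a minimal sketch that tests the same pattern against a few URLs. It uses Python's `re` as a stand-in for the Java regex engine Nutch uses, and it assumes the rule is applied as a search over the whole URL; the URLs are made up for illustration.

```python
import re

# The exclusion rule from crawl-urlfilter.txt, without its leading "-".
# Python's `re` stands in here for the Java regex engine used by Nutch.
SKIP_SUFFIXES = re.compile(
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz"
    r"|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$"
)


def is_skipped(url):
    """Return True if the exclusion rule matches somewhere in the URL."""
    return SKIP_SUFFIXES.search(url) is not None


# Hypothetical URLs, for illustration only.
for url in (
    "http://example.com/static/app.js",
    "http://example.com/static/app.js?v=2",
    "http://example.com/index.html",
):
    print(url, "->", "skipped" if is_skipped(url) else "kept")
```

Under these assumptions the pattern itself does match a URL ending in .js, which suggests the rule is not the problem but rather which filter file is actually in effect. Because of the trailing `$`, a .js URL followed by a query string is not matched by this rule.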
The easiest way IMO is to use the prefix and suffix urlfilters
instead of the regex urlfilter.  Change plugin.includes and replace
urlfilter-regex with urlfilter-(prefix|suffix).  Then in the
suffix-urlfilter.txt file add .js under .css in web formats.

Also change plugin.includes from parse-(text|html|js) to be
parse-(text|html).
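To make the two plugin.includes changes concrete, here is a small sketch that applies both substitutions to a sample value. The starting string is an assumed, shortened plugin.includes value, not the real default; the value in your own configuration will likely list more plugins.

```python
# Assumed, shortened plugin.includes value; the real one in your
# configuration may differ and will likely list more plugins.
before = "protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic"

after = (
    before
    # Swap the regex urlfilter for the prefix and suffix urlfilters.
    .replace("urlfilter-regex", "urlfilter-(prefix|suffix)")
    # Drop the JavaScript parser from the list of parse plugins.
    .replace("parse-(text|html|js)", "parse-(text|html)")
)

print(after)
```

If you still need regex-based rules, for example to restrict the crawl to certain domains, the regex filter can stay in the list alongside the suffix filter instead of being replaced.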
Unfortunately I noticed that javascript pages are
still getting indexed. So what exactly does this mean? Is
crawl-urlfilter.txt not working? Did I miss something maybe?
I was also wondering what is the difference between
these two files:
crawl-urlfilter.txt
regex-urlfilter.txt
crawl-urlfilter.txt file is used by the crawl command.  The
regex, suffix, prefix, and other urlfilter files and plugins
are used when calling commands manually in various tools.

Dennis
?

Many thanks
Regards
