Thanks for the info, I will try it tonight.
Is there a practical limit on the size of the DB or the segments? I
currently have 500K - 750K URLs, broken down into around 15 or 20
segments.
-John
On Dec 2, 2008, at 3:33 PM, Dennis Kubes wrote:
And you will need to use the filter option on both.
Dennis Kubes wrote:
John Martyniak wrote:
That is good information. Because I too have the same issue, I
don't want the js files in the index.
But what if you already have a bunch of .js files in your segments
and want to remove them from the index/segments? Is there any way
to effectively do that as well?
I believe (but haven't tested) that if you change the urlfilters as
discussed and then run mergedb and mergesegs commands giving only a
single crawldb and segments as input, then those urls will be
filtered out.
Dennis
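The merge step described above could be scripted roughly as follows. This is only a sketch in Python that assembles the two command lines without running them. The paths and the helper name build_merge_commands are hypothetical, and the argument order (output first, then input, then -filter) is an assumption based on the usage text of mergedb and mergesegs, so check it against your Nutch version before running anything.

```python
import shlex
from typing import List


def build_merge_commands(
    crawldb: str,
    segments_dir: str,
    out_crawldb: str,
    out_segments: str,
    nutch: str = "bin/nutch",
) -> List[List[str]]:
    """Return argv lists for re-filtering one crawldb and one segments dir.

    Assumption: both tools take the output location first and accept a
    -filter flag that applies the configured urlfilters while merging.
    """
    return [
        # Rewrite the single crawldb through the current urlfilters.
        [nutch, "mergedb", out_crawldb, crawldb, "-filter"],
        # Rewrite all segments under segments_dir the same way.
        [nutch, "mergesegs", out_segments, "-dir", segments_dir, "-filter"],
    ]


if __name__ == "__main__":
    # Hypothetical paths; nothing is executed, the commands are only printed.
    for argv in build_merge_commands(
        "crawl/crawldb",
        "crawl/segments",
        "crawl/crawldb_filtered",
        "crawl/segments_filtered",
    ):
        print(" ".join(shlex.quote(a) for a in argv))
```

Writing to new output directories keeps the original crawldb and segments untouched until the filtered result has been checked.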
-John
On Dec 2, 2008, at 12:56 PM, ML mail wrote:
Dear Dennis
Many thanks for your quick response. Now everything is clear and
I understand why it didn't work...
I will still use the urlfilter-regex plugin, as I would like to
crawl only domains under a single top-level domain, but as
suggested I have added the urlfilter-suffix plugin to avoid
indexing JavaScript pages. I had already deactivated the
parse-js plugin earlier.
So I am now looking forward to the next crawls being freed of
stupid file formats like js ;-)
Greetings
--- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: How to effectively stop indexing javascript pages
ending with .js
To: [email protected]
Date: Tuesday, December 2, 2008, 8:50 AM
ML mail wrote:
Hello,
I would like to avoid indexing any JavaScript
pages, meaning any pages ending with ".js".
For this purpose I edited the crawl-urlfilter.txt
file and added the .js extension to the default list of
suffixes to skip, so that it now looks like this:
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
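As a quick check of what a rule like this does and does not cover, here is a small Python sketch. It is not how Nutch evaluates the file (Nutch uses Java regular expressions and its own filter chain); it only applies the same pattern to a few made-up URLs, on the assumption that the pattern behaves the same way in Python for inputs like these.

```python
import re

# The rule from the message, joined onto one line and without the leading
# "-" (which in the filter file marks the pattern as an exclusion).
SKIP_SUFFIXES = re.compile(
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|"
    r"xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$"
)


def is_skipped(url: str) -> bool:
    """True if the exclusion pattern is found in the URL."""
    return SKIP_SUFFIXES.search(url) is not None


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    for url in (
        "http://example.com/static/app.js",
        "http://example.com/static/app.js?v=2",
        "http://example.com/index.html",
    ):
        print(url, "skipped" if is_skipped(url) else "kept")
```

Because the pattern is anchored with $, a URL that carries a query string after the .js suffix is not caught by this rule on its own.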
The easiest way IMO is to use the prefix and suffix urlfilters
instead of the regex urlfilter. Change plugin.includes and replace
urlfilter-regex with urlfilter-(prefix|suffix). Then in the
suffix-urlfilter.txt file add .js under .css in web formats.
Also change plugin.includes from parse-(text|html|js) to be
parse-(text|html).
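To illustrate the effect of the plugin.includes change, here is a minimal Python sketch. It assumes that plugin.includes is treated as a regular expression that a plugin id has to match in full; the value below is a shortened, hypothetical one rather than a real default, and is_enabled is a made-up helper name.

```python
import re

# Shortened, hypothetical plugin.includes value after the suggested change;
# a real configuration lists more plugins than this.
PLUGIN_INCLUDES = re.compile(
    r"protocol-http|urlfilter-(prefix|suffix)|parse-(text|html)|index-basic"
)


def is_enabled(plugin_id: str) -> bool:
    """True if the whole plugin id matches the include expression."""
    return PLUGIN_INCLUDES.fullmatch(plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in (
        "urlfilter-suffix",
        "urlfilter-prefix",
        "urlfilter-regex",
        "parse-html",
        "parse-js",
    ):
        print(plugin_id, "enabled" if is_enabled(plugin_id) else "disabled")
```

Under that assumption, urlfilter-regex and parse-js no longer match once the two alternations are edited as described, while the prefix and suffix filters do.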
Unfortunately I noticed that JavaScript pages are
still getting indexed. What exactly does this mean? Is
crawl-urlfilter.txt not working? Did I miss something?
Also, what is the difference between
these two files:
crawl-urlfilter.txt
regex-urlfilter.txt
The crawl-urlfilter.txt file is used by the crawl command. The
regex, suffix, prefix, and other urlfilter files and plugins
are used when you call the individual tools manually.
Dennis
?
Many thanks
Regards