From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: How to effectively stop indexing javascript pages ending with .js
To: nutch-user@lucene.apache.org
Date: Tuesday, December 2, 2008, 5:47 PM
Patch has been posted to JIRA for the DomainURLFilter plugin.
https://issues.apache.org/jira/browse/NUTCH-668
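In case it helps, usage should look roughly like this once the patch is applied (only a sketch; the plugin name and file name could still change before commit). Enable the plugin in plugin.includes and list the domains you want to keep, one per line:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-domain|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>

  # conf/domain-urlfilter.txt -- one host, domain, or suffix per line
  apache.org
  wikipedia.org
  net

URLs whose host does not match one of the entries get filtered out.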
Dennis
Dennis Kubes wrote:
Trying to get a patch posted tonight. Will probably be in the 1.0 release, yes.
Dennis
John Martyniak wrote:
That sounds like a good feature. Will this be in the 1.0 release?
-John
On Dec 2, 2008, at 5:17 PM, Dennis Kubes wrote:
John Martyniak wrote:
That will be awesome. Will there be a limit to the number of domains that can be included?
Only on what can be stored in memory in a set. So there is a technical limit, yes; the practical limit is probably a few million domains.
Dennis
-John
On Dec 2, 2008, at 3:27 PM, Dennis Kubes wrote:
I am in the process of writing a domain-urlfilter. It will allow fetching only from a list of top-level domains. Should have a patch out shortly. Hopefully that will help you and others who want to verticalize Nutch.
Dennis
ML mail wrote:
Dear Dennis,
Many thanks for your quick response. Now everything is clear and I understand why it didn't work... I will still use the urlfilter-regex plugin, as I would like to crawl only domains from a single top-level domain, but as suggested I have added the urlfilter-suffix plugin to avoid indexing javascript pages. In the past I had already deactivated the parse-js plugin. So I am now looking forward to the next crawls being free of stupid file formats like js ;-)
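(For reference, the rule I use in the regex urlfilter to keep a single top-level domain looks roughly like the following; .ch here is only a placeholder for whichever TLD you restrict to:)

  # accept only hosts under the chosen top-level domain (example: .ch)
  +^http://([a-z0-9-]+\.)*[a-z0-9-]+\.ch(/|$)
  # reject everything else
  -.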
Greetings
--- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: How to effectively stop indexing javascript pages ending with .js
To: nutch-user@lucene.apache.org
Date: Tuesday, December 2, 2008, 8:50 AM
ML mail wrote:
Hello,
I would definitely like not to index any javascript pages, i.e. any pages ending in ".js". For this purpose I simply edited the crawl-urlfilter.txt file and added the .js extension to the default list of suffixes that are not parsed, so that it now looks like this:
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
The easiest way IMO is to use the prefix and suffix urlfilters instead of the regex urlfilter. Change plugin.includes and replace urlfilter-regex with urlfilter-(prefix|suffix). Then, in the suffix-urlfilter.txt file, add .js under .css in the web formats section. Also change plugin.includes from parse-(text|html|js) to parse-(text|html) so javascript is no longer parsed.
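Roughly, in nutch-site.xml it would look something like this (only a sketch; keep whatever other plugins you already have in the value):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(prefix|suffix)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>

and in conf/suffix-urlfilter.txt, under the web formats section:

  # web formats
  .css
  .js

With the suffix filter in its default deny mode, any URL ending in one of the listed suffixes is rejected before it is fetched.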
Unfortunately I noticed that javascript pages are still getting indexed. So what exactly does this mean? Is crawl-urlfilter.txt not working? Did I maybe miss something?
I was also wondering what the difference is between these two files:
crawl-urlfilter.txt
regex-urlfilter.txt
The crawl-urlfilter.txt file is used by the crawl command. The regex, suffix, prefix, and other urlfilter files and plugins are used when calling the various tools manually.
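For example, the two workflows look roughly like this (a sketch; exact options depend on your Nutch version):

  # one-shot crawl -- uses conf/crawl-urlfilter.txt
  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

  # step-by-step crawl -- uses the regex/prefix/suffix urlfilter files
  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments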
Dennis
Many thanks
Regards