I think there is a missing file. I was trying to get it to build on a clean install and it was erroring out during the build. I was relying on a URLUtil method that I have in another patch but hadn't posted yet. I will work through it and post it in a bit.

ML mail wrote:
Dear Dennis,

I have now applied this patch to my Nutch 0.9 (stable) installation using "gpatch -p0 < patchfile" and changed the plugin.includes parameter to include "urlfilter-(domain|suffix)" in the nutch-default.xml file.

I gave it a try using a fresh new test crawl and index, but it somehow still indexes other top level domains. The domain-urlfilter.txt file, which is located in the conf dir of Nutch, contains only "be" in order to index only URLs from the .be TLD.

It isn't a regex, so just "be" won't work. It needs to be a full domain such as "domain.be", one per line. When I said it can handle top level domains I meant that a line of "apache.org" would handle URLs with the hostnames lucene.apache.org, www.apache.org, etc. You can also have it filter by hostname as described in the JIRA and javadoc.
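For example, a domain-urlfilter.txt along these lines (entries hypothetical) limits the crawl to the listed domains and their subdomains:

# one entry per line; a domain entry also covers its subdomains
example.be
apache.org

With that file, lucene.apache.org and www.apache.org pass the filter, while hosts under any unlisted domain are filtered out.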

Dennis

After checking the hadoop.log file and grepping for urlfilter-domain I noticed that the plugin doesn't get loaded, as it never appears in the logfile. So I guess my problem is that it doesn't even load the plugin.
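(The check was along the lines of:

grep urlfilter-domain logs/hadoop.log

which returned nothing; the log path may differ per setup.)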

Based on my description, did I miss something? Or do I need to do something else to get it working?

Thanks
Regards



--- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:

From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: How to effectively stop indexing javascript pages ending with .js
To: nutch-user@lucene.apache.org
Date: Tuesday, December 2, 2008, 5:47 PM
Patch has been posted to JIRA for the DomainURLFilter plugin.

https://issues.apache.org/jira/browse/NUTCH-668

Dennis

Dennis Kubes wrote:
Trying to get a patch posted tonight. Will probably be in the 1.0 release, yes.
Dennis

John Martyniak wrote:
That sounds like a good feature. Will this be in the 1.0 release?
-John

On Dec 2, 2008, at 5:17 PM, Dennis Kubes wrote:


John Martyniak wrote:
That will be awesome. Will there be a limit to the number of domains that can be included?
Only on what can be stored in memory in a set. So technically there is a limit; practically, I'd guess a few million domains.
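To give a feel for that, a minimal sketch of the idea in Java (not the actual plugin source; names are made up): the configured domains sit in an in-memory set, and a host passes if it or any parent domain of it is listed.

import java.util.HashSet;
import java.util.Set;

// Sketch only, not the plugin's real code: domains live in a set, and a
// host is accepted if it or any of its parent domains is in the set.
public class DomainSetSketch {
  private final Set<String> domains = new HashSet<String>();

  public void add(String domain) {
    domains.add(domain.toLowerCase());
  }

  // "lucene.apache.org" matches an "apache.org" entry because we walk up
  // the dot-separated suffixes of the host until one hits or none is left.
  public boolean accepts(String host) {
    String candidate = host.toLowerCase();
    while (true) {
      if (domains.contains(candidate)) {
        return true;
      }
      int dot = candidate.indexOf('.');
      if (dot < 0) {
        return false;
      }
      candidate = candidate.substring(dot + 1);
    }
  }
}

Each entry costs roughly on the order of a hundred bytes (string plus set overhead), which is where the few-million-domains estimate comes from.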
Dennis

-John
On Dec 2, 2008, at 3:27 PM, Dennis Kubes wrote:
I am in the process of writing a domain-urlfilter. It will allow fetching only from a list of top level domains. Should have a patch out shortly. Hopefully that will help you and others who are wanting to verticalize Nutch.
Dennis

ML mail wrote:
Dear Dennis,
Many thanks for your quick response. Now everything is clear and I understand why it didn't work...
I will still use the urlfilter-regex plugin as I would like to crawl only domains from a single top level domain, but as suggested I have added the urlfilter-suffix plugin to avoid indexing javascript pages. In the past I had already deactivated the parse-js plugin. So I am now looking forward to the next crawls being free of stupid file formats like js ;-)
Greetings

--- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: How to effectively stop indexing javascript pages ending with .js
To: nutch-user@lucene.apache.org
Date: Tuesday, December 2, 2008, 8:50 AM
ML mail wrote:
Hello,

I would definitely like not to index any javascript pages, meaning any pages ending with ".js". For this purpose I simply edited the crawl-urlfilter.txt file and added the .js extension to the default list of suffixes to skip, so that it looks like this now:
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$

The easiest way IMO is to use the prefix and suffix urlfilters instead of the regex urlfilter. Change plugin.includes and replace urlfilter-regex with urlfilter-(prefix|suffix). Then in the suffix-urlfilter.txt file add .js under .css in web formats. Also change plugin.includes from parse-(text|html|js) to parse-(text|html).
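Concretely, that amounts to something like this (a sketch against a stock config; the exact plugin list in your install may differ). In conf/nutch-site.xml (or nutch-default.xml):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(prefix|suffix)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>

And in conf/suffix-urlfilter.txt, in the web formats section:

# web formats
.css
.js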

Unfortunately I noticed that javascript pages are still getting indexed. So what exactly does this mean? Is crawl-urlfilter.txt not working? Did I maybe miss something?
I was also wondering what the difference is between these two files:
crawl-urlfilter.txt
regex-urlfilter.txt
The crawl-urlfilter.txt file is used by the crawl command. The regex, suffix, prefix, and other urlfilter files and plugins are used when calling commands manually in various tools.
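For example (arguments illustrative):

# one-shot crawl: reads conf/crawl-urlfilter.txt
bin/nutch crawl urls -dir crawl -depth 3

# step-by-step tools: read the regex/prefix/suffix urlfilter files instead
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch <segment>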
Dennis

Many thanks
Regards




