I think there is a missing file. I was trying to get it to build on a clean install and it was erroring out during the build. I was relying on a URLUtil method that I have in another patch but hadn't posted yet. I will work through it and post it in a bit.

ML mail wrote:
Dear Dennis,

I have now applied this patch to my Nutch 0.9 (stable) installation using "gpatch -p0 < patchfile" and changed the plugin.includes parameter to include "urlfilter-(domain|suffix)" in the nutch-default.xml file.

I gave it a try using a fresh new test crawl and index, but it somehow still indexes other top level domains. The domain-urlfilter.txt file, which is located in the conf dir of Nutch, contains only "be" in order to index only URLs from the .be TLD.

It isn't a regex, so just "be" won't work. It needs to be a full domain such as "domain.be", one per line. When I said it can handle top level domains I meant that a line of "apache.org" would handle URLs with the hostnames lucene.apache.org, www.apache.org, etc. You can also have it filter by hostname as described in the JIRA and javadoc.
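For example, a domain-urlfilter.txt along these lines (entries hypothetical) limits the crawl to the listed domains and their subdomains:

# one entry per line; a domain entry also covers its subdomains
example.be
apache.org

With that file, lucene.apache.org and www.apache.org pass the filter, while hosts under any unlisted domain are filtered out.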

Dennis

After checking the hadoop.log file and grepping for urlfilter-domain I noticed that the plugin doesn't get loaded, as it never appears in the logfile. So I guess my problem is that it doesn't even load the plugin.
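(The check was along the lines of:

grep urlfilter-domain logs/hadoop.log

which returned nothing; the log path may differ per setup.)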

Based on my description, did I miss something? Or do I need to do something else to get it working?

Thanks
Regards



--- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:

From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: How to effectively stop indexing javascript pages ending with .js
To: nutch-user@lucene.apache.org
Date: Tuesday, December 2, 2008, 5:47 PM
Patch has been posted to JIRA for the DomainURLFilter plugin.

https://issues.apache.org/jira/browse/NUTCH-668

Dennis

Dennis Kubes wrote:
Trying to get a patch posted tonight. Will probably be in the 1.0 release, yes.
Dennis

John Martyniak wrote:
That sounds like a good feature. Will this be in the 1.0 release?
-John

On Dec 2, 2008, at 5:17 PM, Dennis Kubes wrote:


John Martyniak wrote:
That will be awesome. Will there be a limit to the number of domains that can be included?
Only on what can be stored in memory in a set. So technically there is a limit; practically, I'd guess a few million domains.
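To give a feel for that, a minimal sketch of the idea in Java (not the actual plugin source; names are made up): the configured domains sit in an in-memory set, and a host passes if it or any parent domain of it is listed.

import java.util.HashSet;
import java.util.Set;

// Sketch only, not the plugin's real code: domains live in a set, and a
// host is accepted if it or any of its parent domains is in the set.
public class DomainSetSketch {
  private final Set<String> domains = new HashSet<String>();

  public void add(String domain) {
    domains.add(domain.toLowerCase());
  }

  // "lucene.apache.org" matches an "apache.org" entry because we walk up
  // the dot-separated suffixes of the host until one hits or none is left.
  public boolean accepts(String host) {
    String candidate = host.toLowerCase();
    while (true) {
      if (domains.contains(candidate)) {
        return true;
      }
      int dot = candidate.indexOf('.');
      if (dot < 0) {
        return false;
      }
      candidate = candidate.substring(dot + 1);
    }
  }
}

Each entry costs roughly on the order of a hundred bytes (string plus set overhead), which is where the few-million-domains estimate comes from.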
Dennis

-John
On Dec 2, 2008, at 3:27 PM, Dennis Kubes wrote:
I am in the process of writing a domain-urlfilter. It will allow fetching only from a list of top level domains. Should have a patch out shortly. Hopefully that will help you and others who are wanting to verticalize Nutch.
Dennis

ML mail wrote:
Dear Dennis,
Many thanks for your quick response. Now everything is clear and I understand why it didn't work...
I will still use the urlfilter-regex plugin as I would like to crawl only domains from a single top level domain, but as suggested I have added the urlfilter-suffix plugin to avoid indexing javascript pages. In the past I had already deactivated the parse-js plugin. So I am now looking forward to the next crawls being free of stupid file formats like js ;-)
Greetings

--- On Tue, 12/2/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: How to effectively stop indexing javascript pages ending with .js
To: nutch-user@lucene.apache.org
Date: Tuesday, December 2, 2008, 8:50 AM
ML mail wrote:
Hello,

I would definitely like not to index any javascript pages, meaning any pages ending with ".js". For this purpose I simply edited the crawl-urlfilter.txt file and added the .js extension to the default list of suffixes to skip, so that it looks like this now:
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$

The easiest way IMO is to use the prefix and suffix urlfilters instead of the regex urlfilter. Change plugin.includes and replace urlfilter-regex with urlfilter-(prefix|suffix). Then in the suffix-urlfilter.txt file add .js under .css in web formats. Also change plugin.includes from parse-(text|html|js) to parse-(text|html).
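Concretely, that amounts to something like this (a sketch against a stock config; the exact plugin list in your install may differ). In conf/nutch-site.xml (or nutch-default.xml):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(prefix|suffix)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>

And in conf/suffix-urlfilter.txt, in the web formats section:

# web formats
.css
.js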

Unfortunately I noticed that javascript pages are still getting indexed. So what exactly does this mean? Is crawl-urlfilter.txt not working? Did I maybe miss something?
I was also wondering what the difference is between these two files:
crawl-urlfilter.txt
regex-urlfilter.txt
The crawl-urlfilter.txt file is used by the crawl command. The regex, suffix, prefix, and other urlfilter files and plugins are used when calling commands manually in various tools.
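For example (arguments illustrative):

# one-shot crawl: reads conf/crawl-urlfilter.txt
bin/nutch crawl urls -dir crawl -depth 3

# step-by-step tools: read the regex/prefix/suffix urlfilter files instead
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch <segment>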
Dennis

Many thanks
Regards




