Thanks for the lead!

Okay, I tried a test with just the nutch.org site
(so I'm following exactly what is in the tutorial).

In my conf/crawl-urlfilter.txt I have tried:

+^http://([a-z0-9]*\.)*nutch.org/

+^http://*.nutch.org/

+^http://www.nutch.org/

All of these produce the same result.

My urls file contains:

http://www.nutch.org

Then I tried just

www.nutch.org

no luck!
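
A quick way to check rules like the three above against the seed URLs is to run them through a regex engine directly. The sketch below uses Python's re.search, on the assumption that the filter applies each pattern as an unanchored search on the URL exactly as it is written in the urls file. Nutch may normalize URLs before filtering, so treat the result as a hint, not as a description of what RegexURLFilter does. The helper name matches() is made up for illustration.

```python
import re

# The three rules tried in conf/crawl-urlfilter.txt, without the leading "+".
RULES = [
    r"^http://([a-z0-9]*\.)*nutch.org/",
    r"^http://*.nutch.org/",
    r"^http://www.nutch.org/",
]

# The two seed entries from the urls file, plus a variant with a trailing slash.
SEEDS = ["http://www.nutch.org", "www.nutch.org", "http://www.nutch.org/"]


def matches(rule, url):
    """Return True if the regex is found anywhere in the URL (hypothetical helper)."""
    return re.search(rule, url) is not None


for seed in SEEDS:
    hits = [rule for rule in RULES if matches(rule, seed)]
    print(seed, "->", hits if hits else "no rule matches")
```

Under these assumptions neither seed as written matches any rule: every rule ends in "/", and the seeds have no trailing slash (and www.nutch.org has no scheme). The second rule also fails on the trailing-slash variant, because "*." is glob syntax; as a regex, "/*." means "zero or more slashes, then any one character", which cannot consume "www.".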

At this point it must be something really simple,
only I can't seem to find it!

Thanks to all for any ideas,
Michael.

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]] On Behalf Of Michael Nebel
Sent: 18 November 2004 19:29
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-general] nutch crawl gets no pages and gives no errors


Hi,

I had a similar problem: I had made a mistake in my
crawl-urlfilter.txt. Looking at your output:

...
 > 041118 122750 Starting URL processing
 > 041118 122750 Using URL filter: net.nutch.net.RegexURLFilter
 > 041118 122751 found resource crawl-urlfilter.txt at
 > file:/root/install/nutch-nightly/conf/crawl-urlfilter.txt
 > .041118 122751 Added 0 pages
...

None of the sites you tried to crawl made it through your filter...
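
An "Added 0 pages" line is what a first-match filter produces when no accept rule fires for any seed. Here is a minimal sketch of that behaviour, assuming the rules file holds one rule per line, "+" to accept and "-" to reject, "#" for comments, the first matching rule wins, and a URL that matches nothing is dropped. load_rules() and decide() are hypothetical names for illustration, not Nutch APIs.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def decide(rules, url):
    """Return True if the first matching rule accepts the URL, else False."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is ignored


EXAMPLE = r"""
# accept hosts in nutch.org
+^http://([a-z0-9]*\.)*nutch.org/
# skip everything else
-.
"""

rules = load_rules(EXAMPLE)
for url in ["http://www.nutch.org", "http://www.nutch.org/"]:
    print(url, "accepted" if decide(rules, url) else "filtered out")
```

Running candidate seed URLs through a check like this before starting a crawl shows which rule, if any, lets each one through.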

Regards

        Michael


_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general
