Thanks Matthias,

works perfectly now!

I have one more question: does the URL in the 'urls' file
need to match conf/crawl-urlfilter.txt exactly?

It is just that I want to start my search from an .asp
page (including a query string).

Would I be able to have just the server domain in 
conf/crawl-urlfilter.txt?

Thanks so much,
Michael.

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]] On Behalf Of Matthias Jaekle
Sent: 19 November 2004 10:07
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-general] nutch crawl gets no pages and gives no
errors


Hi,

try:

> In my conf/crawl-urlfilter.txt I have tried:
> +^http://([a-z0-9]*\.)*nutch.org/
+^http://([a-z0-9]*\.)*nutch.org

> +^http://*.nutch.org/
This would never work.

A star does not mean "any character". It is a quantifier that repeats
the character or group in front of it, zero or more times.
A dot matches any single character.
\. matches a literal dot.
Please google for "regex" or "Perl regular expressions".
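
A rough way to see the difference between the two patterns is to try them
outside Nutch. The sketch below uses Python's re module as a stand-in; Nutch
runs on Java, so this only assumes that both engines agree on these simple
constructs. The leading '+' of the filter line is dropped because it marks an
include rule and is not part of the expression. The helper name matches() and
the URL http://xnutchYorg/ are made up for illustration.

```python
import re

# Pattern suggested in this thread: zero or more "label." groups, then the domain.
GOOD = re.compile(r"^http://([a-z0-9]*\.)*nutch.org/")

# The pattern that "would never work": '*' repeats the '/' in front of it,
# and the '.' after it stands for exactly one arbitrary character.
BAD = re.compile(r"^http://*.nutch.org/")


def matches(pattern, url):
    """Return True if the pattern is found in url (both patterns are anchored with ^)."""
    return pattern.search(url) is not None


print(matches(GOOD, "http://www.nutch.org/"))  # True: "www." is one repetition of the group
print(matches(GOOD, "http://nutch.org/"))      # True: the group may repeat zero times
print(matches(BAD, "http://www.nutch.org/"))   # False: only one character may precede "nutch"
print(matches(BAD, "http://xnutchYorg/"))      # True: each '.' accepted an arbitrary character
```

The last line shows why an unescaped dot is looser than it looks: it accepts
any character in that position, not only a literal dot.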


> my urls file contains:
> http://www.nutch.org
If you ask nutch to check against a pattern that ends with a slash, your
URL should end with a slash too.
Try: http://www.nutch.org/
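
The effect of the trailing slash can be checked with a small sketch like the
one below. It uses Python's re module as an approximation of the Java regex
handling in Nutch, which is an assumption, and the names WITH_SLASH,
WITHOUT_SLASH and accepted() are invented for this example. Only the pattern
text and the two URLs come from this thread.

```python
import re

# The two filter expressions from this thread, without the leading '+'.
WITH_SLASH = re.compile(r"^http://([a-z0-9]*\.)*nutch.org/")
WITHOUT_SLASH = re.compile(r"^http://([a-z0-9]*\.)*nutch.org")


def accepted(pattern, url):
    """Return True if the pattern is found in url."""
    return pattern.search(url) is not None


for url in ("http://www.nutch.org", "http://www.nutch.org/"):
    # Print the URL followed by the result for each pattern.
    print(url, accepted(WITH_SLASH, url), accepted(WITHOUT_SLASH, url))
```

In this sketch the pattern ending in a slash rejects http://www.nutch.org and
accepts http://www.nutch.org/, while the pattern without the slash accepts
both. So either the pattern or the seed URL can be changed to make them agree.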

Bye

Matthias


-------------------------------------------------------
This SF.Net email is sponsored by: InterSystems CACHE
FREE OODBMS DOWNLOAD - A multidimensional database that combines
robust object and relational technologies, making it a perfect match
for Java, C++,COM, XML, ODBC and JDBC. www.intersystems.com/match8
_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general





