Thanks Matthias, works perfectly now!
I have one more question: does the URL in the 'urls' file need to match the pattern in conf/crawl-urlfilter.txt exactly? I want to start my crawl from an .asp page (including a query string). Would it be enough to have just the server domain in conf/crawl-urlfilter.txt?

Thanks so much,
Michael.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Matthias Jaekle
Sent: 19 November 2004 10:07
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-general] nutch crawl gets no pages and gives no errors

Hi,

try:

> In my conf/crawl-urlfilter.txt I have tried:
> +^http://([a-z0-9]*\.)*nutch.org/

+^http://([a-z0-9]*\.)*nutch.org

> +^http://*.nutch.org/

This would never work. A star does not mean "any character". It is a multiplier for whatever comes directly in front of it. A dot means any character, and \. means a literal dot. Please google for "regex" or "perl regular expressions".

> my urls file contains:
> http://www.nutch.org

If you ask Nutch to check against a pattern with a slash at the end, your URL should have one too. Try:
http://www.nutch.org/

Bye
Matthias

_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general
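To make the matching behaviour discussed in this thread concrete, here is a minimal Python sketch of how a seed URL might be checked against filter rules. It is not Nutch's own code. The helper name accepts, the example .asp path, and the rule semantics (patterns are searched rather than matched in full, the first matching rule decides, and a URL that matches no rule is rejected) are assumptions made for illustration. The -[?*!@=] rule is likewise only an assumed example of a query-character exclusion, so check your own conf/crawl-urlfilter.txt for something similar.

```python
import re


def accepts(url, rules):
    """Return True if the first rule whose pattern is found in url is a '+' rule.

    Assumed semantics (not taken from Nutch source): rules are tried in order,
    the first one that matches decides, and no match means the URL is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


# A domain-only rule with a trailing slash, with the dot in "nutch.org" escaped.
domain_only = [r"+^http://([a-z0-9]*\.)*nutch\.org/"]

# The seed needs the trailing slash that the pattern asks for.
assert accepts("http://www.nutch.org/", domain_only)
assert not accepts("http://www.nutch.org", domain_only)

# The pattern is only a prefix, so a deeper page with a query string passes
# (the path below is hypothetical).
assert accepts("http://www.nutch.org/search.asp?id=1", domain_only)

# If an exclusion for query characters comes first, the same seed is rejected.
with_query_skip = [r"-[?*!@=]"] + domain_only
assert not accepts("http://www.nutch.org/search.asp?id=1", with_query_skip)

# The star repeats the preceding "/", it is not a wildcard, so this rule
# does not match the intended host.
assert not accepts("http://www.nutch.org/", [r"+^http://*.nutch.org/"])
```

Under these assumptions a domain-only pattern is enough for a seed that points at an .asp page, provided no earlier rule excludes the query string characters.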
