Hi,

I am trying to crawl a news website that has archives. However, the href
values in its pages sometimes contain a lot of whitespace and newlines. When
that happens, Nutch fails to find any further content to crawl and stops.

For example, the links look like this:

/archive/
                            Janurary-2001/123123.cms January 
/archive/
                            Februray-2001/2342342.cms Februrary 

and so on.

Is there a way to make the Nutch crawl strip the whitespace and newlines
from these links and form a complete URL for fetching?
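
To make clear what kind of cleanup I mean, here is a minimal sketch in
Python. It is not Nutch code, only an illustration of the behaviour I am
after. The function name clean_href and the base URL on www.example.com are
made up for the example, and it assumes the real URLs never contain
legitimate whitespace.

```python
import re
from urllib.parse import urljoin


def clean_href(base_url, href):
    """Strip all whitespace from an href and resolve it against base_url.

    Hypothetical helper for illustration only; assumes no valid URL on the
    site contains spaces, tabs, or newlines.
    """
    # Remove every run of spaces, tabs, and newlines inside the href.
    compact = re.sub(r"\s+", "", href)
    # Resolve the cleaned (possibly relative) link against the page URL.
    return urljoin(base_url, compact)


if __name__ == "__main__":
    # href value as it appears in the page source, line break included.
    raw = "/archive/\n                            Janurary-2001/123123.cms"
    print(clean_href("http://www.example.com/news/index.html", raw))
```

For the first link above, I would expect the result to be
http://www.example.com/archive/Janurary-2001/123123.cms, which is the URL I
would like Nutch to fetch.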

thanks,
NikhilDx.


