Hi, I am trying to crawl a news website with archives. However, the links in its pages sometimes contain extra whitespace and newlines. When that happens, Nutch fails to find any further content to crawl and stops.
For example, the links look like this:

/archive/ January-2001/123123.cms January
/archive/ February-2001/2342342.cms February
etc.

Is there a way to direct the Nutch crawl to remove the whitespace and newlines and form a complete URL for fetching?

Thanks,
NikhilDx
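
P.S. To make the request concrete, here is a rough sketch in plain Java of the cleanup I have in mind. It is not Nutch code and does not use any Nutch API. The class name HrefCleaner, the method cleanAndResolve, and the base URL http://example.com/news/index.html are all made up for illustration. It assumes that no valid link on the site contains a literal space, so stripping every whitespace character is safe.

```java
import java.net.URI;

final class HrefCleaner {

    // Strip all whitespace (spaces, tabs, newlines) from the raw href value,
    // then resolve the result against the URL of the page it was found on.
    static String cleanAndResolve(String baseUrl, String rawHref) {
        String cleaned = rawHref.replaceAll("\\s+", "");
        return URI.create(baseUrl).resolve(cleaned).toString();
    }

    public static void main(String[] args) {
        // Hypothetical page URL; the real site is different.
        String baseUrl = "http://example.com/news/index.html";
        // An href as it appears in the page source, with a newline and spaces inside.
        String rawHref = "/archive/\n   January-2001/123123.cms";
        System.out.println(cleanAndResolve(baseUrl, rawHref));
    }
}
```

What I am looking for is a way to get Nutch to apply this kind of cleanup to each extracted link before it is queued for fetching.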