Hi,

I am trying to crawl a news website that has archives. However, the links in
its pages sometimes contain a lot of whitespace and newlines inside the href.
When Nutch hits such links it fails to find any further content to crawl and
stops.

For example, links like these (href followed by the link text):

/archive/
                            Janurary-2001/123123.cms January 
/archive/
                            Februray-2001/2342342.cms Februrary 

etc.

Is there a way to make the Nutch crawl strip the whitespace and newlines
from these hrefs and form a complete URL for fetching?
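
To make the desired behaviour concrete, here is a minimal sketch in Python.
It is not Nutch code: the function name clean_href and the base URL
http://www.example.com/ are made up for illustration. It only shows the
normalization I would like the crawler to apply before fetching.

```python
import re
from urllib.parse import urljoin


def clean_href(base_url, raw_href):
    """Drop all whitespace from an href, then resolve it against the page URL.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    # Collapse every run of spaces, tabs and newlines to nothing.
    compact = re.sub(r"\s+", "", raw_href)
    # Resolve the cleaned (possibly relative) link against the page it came from.
    return urljoin(base_url, compact)


# The base URL is an assumption; the href is the first example from above.
raw = "/archive/\n                            Janurary-2001/123123.cms"
print(clean_href("http://www.example.com/index.html", raw))
```

Removing all whitespace is probably too aggressive in general, since it
would also mangle links that really do contain an unescaped space, but for
the links on this site it would give the URL I expect.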

thanks,
NikhilDx.

-- 
View this message in context: 
http://www.nabble.com/Whitespace---new-lines-in-href-links-tf4135561.html#a11761736
Sent from the Nutch - User mailing list archive at Nabble.com.
