Hello fellow Nutch users,

In a few days we'll start crawling a long list of Thai websites. In our previous crawls we noticed A LOT of poorly formatted HTML pages, and the crawler would sometimes fetch links with stray HTML code in them (e.g. http://www.website.com/news/index.php</ul> ). How can we write a regex for those URLs so that the trailing HTML code (</ul>) is stripped off? Would we use the regex-normalizer.xml file? If so, what would the rule look like?
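
Something like the rule below is what we had in mind for conf/regex-normalize.xml (this is just our best guess at the format, assuming the stray tag always shows up at the end of the URL, and the pattern hasn't been tested against our crawl data):

    <?xml version="1.0"?>
    <regex-normalize>
      <!-- Guess: strip a trailing HTML tag (e.g. </ul>) and anything
           after it from the URL. The angle brackets have to be written
           as &lt; and &gt; so the XML stays well-formed. -->
      <regex>
        <pattern>&lt;/?[a-zA-Z][^&gt;]*&gt;.*$</pattern>
        <substitution></substitution>
      </regex>
    </regex-normalize>

Does that pattern/substitution format look right, or is there a better place to handle this?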

Thanks in advance,
Kirk Gillock
Isara Charity Foundation
Nong Khai, Thailand
http://www.isara.org
