Links contain html

Kirk Gillock Sun, 06 Dec 2009 13:24:13 -0800

Hello fellow Nutch users,

In a few days we'll start crawling a long list of Thai websites. With previous crawls we noticed there were A LOT of poorly formatted HTML pages, and the crawler would sometimes fetch links that contain HTML code (ex: http://www.website.com/news/index.php</ul> ). How can we use a regex on those URLs so that the HTML code is removed? Would we use the regex-normalizer.xml file? What would the code look like?
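
To make the goal concrete, here is a minimal sketch in Python of the kind of cleanup we have in mind. The pattern and the function name strip_html_from_url are our own assumptions for illustration only; nothing here comes from Nutch, and we have not verified how such a pattern would need to be written in regex-normalizer.xml.

```python
import re

# Hypothetical pattern (our assumption): match anything that looks like an
# opening or closing HTML tag, e.g. "</ul>" or "<br>".
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")


def strip_html_from_url(url: str) -> str:
    """Return the URL with any tag-like fragments removed."""
    return HTML_TAG.sub("", url).strip()


if __name__ == "__main__":
    print(strip_html_from_url("http://www.website.com/news/index.php</ul>"))
```

This only handles literal angle brackets; tags that arrive percent-encoded (for example %3C/ul%3E) would need a separate pattern.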


Thanks in advance,
Kirk Gillock
Isara Charity Foundation
Nong Khai, Thailand
http://www.isara.org
