Hello fellow Nutch users,
In a few days we'll start crawling a long list of Thai websites. In
previous crawls we noticed there were A LOT of poorly formatted HTML
pages, and the crawler would sometimes fetch links with stray HTML code
appended (e.g. http://www.website.com/news/index.php</ul> ). How can we
use a regex to strip the HTML from those URLs? Would we use the
regex-normalize.xml file? What would the rule look like?
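
If it is the regex normalizer, we're guessing a rule in
conf/regex-normalize.xml might look something like this (untested; the
pattern is our guess, following the pattern/substitution format the
other rules in that file use):

```xml
<!-- Sketch only, not tested: strip a trailing HTML tag such as </ul>
     (and anything after it) from a fetched URL. -->
<regex>
  <pattern><![CDATA[</?[a-zA-Z][^>]*>.*$]]></pattern>
  <substitution></substitution>
</regex>
```

Would that be the right approach, or is there a better place to filter
these URLs before they reach the fetch list?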
Thanks in advance,
Kirk Gillock
Isara Charity Foundation
Nong Khai, Thailand
http://www.isara.org