Hello fellow Nutch users,

In a few days we'll start crawling a long list of Thai websites. In our previous crawls we noticed A LOT of poorly formatted HTML pages, and the crawler would sometimes fetch links with stray HTML code in them (e.g. http://www.website.com/news/index.php</ul> ). How can we write a regex for those URLs so that the trailing HTML code (</ul>) is stripped off? Would we use the regex-normalizer.xml file? If so, what would the rule look like?
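
Something like the rule below is what we had in mind for conf/regex-normalize.xml (this is just our best guess at the format, assuming the stray tag always shows up at the end of the URL, and the pattern hasn't been tested against our crawl data):

    <?xml version="1.0"?>
    <regex-normalize>
      <!-- Guess: strip a trailing HTML tag (e.g. </ul>) and anything
           after it from the URL. The angle brackets have to be written
           as &lt; and &gt; so the XML stays well-formed. -->
      <regex>
        <pattern>&lt;/?[a-zA-Z][^&gt;]*&gt;.*$</pattern>
        <substitution></substitution>
      </regex>
    </regex-normalize>

Does that pattern/substitution format look right, or is there a better place to handle this?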

Thanks in advance,
Kirk Gillock
Isara Charity Foundation
Nong Khai, Thailand
http://www.isara.org
