Hi,

I have a simple question about applying regular expressions while parsing
HTML in the Nutch parser.

For example, while crawling and parsing a large number of web pages, I'd
like to write a Nutch plugin that matches a specific pattern, annotates it,
and stores it. As far as I know, Nutch passes an HTMLMetaTags argument to
the method HtmlParseFilter.filter().

My question is: can we also access other HTML tags, such as span? If so,
which packages/classes should I look into?


Thanks

-- Khang
