On Fri, Aug 21, 2009 at 4:20 AM, Julien Nioche<[email protected]> wrote:
> You'll need to write a custom parser implementing HtmlParseFilter and get it
> to store the keywords found in the Metadata, then write a custom Indexer.
>
> By default the HTML parser does not do anything about meta tags.

That's unfortunate, because org.apache.nutch.parse.html.HtmlParser actually
extracts all the meta tags, and then takes a few and throws the rest away.
It's mildly annoying that I'm going to have to re-implement all of HtmlParser
just to add two lines to take that data out of "metaTags" and put it in
"content.getMetaData()".

-- 
http://www.linkedin.com/in/paultomblin
