For the record, the solution was to override the implementations of BasicIndexingFilter and HtmlParser so that the meta tags are returned. The returned tags can then be parsed for Dublin Core content and the rest discarded.
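
In case it helps anyone searching the archive, here is a rough sketch of the "keep Dublin Core, discard the rest" step on its own. It is not the actual Nutch plugin code and does not touch BasicIndexingFilter or HtmlParser. The class name MetaTagSketch, both method names, and the regex approach are only illustrative assumptions, and it assumes the Dublin Core tags use the usual "DC." name prefix (e.g. DC.title). A real override would work from the parser's DOM rather than a regex.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: pulls name/content pairs out of
// raw HTML and keeps only the ones that look like Dublin Core.
public class MetaTagSketch {

    // Assumption: name comes before content and both values are double-quoted.
    // Tags that do not fit this shape are simply skipped.
    private static final Pattern META = Pattern.compile(
            "<meta\\s+name=\"([^\"]*)\"\\s+content=\"([^\"]*)\"\\s*/?>",
            Pattern.CASE_INSENSITIVE);

    // Returns every matching meta tag, keyed by lower-cased name.
    public static Map<String, String> extract(String html) {
        Map<String, String> tags = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            tags.put(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
        }
        return tags;
    }

    // Keeps only entries whose name starts with "dc." and drops the rest.
    public static Map<String, String> dublinCoreOnly(Map<String, String> tags) {
        Map<String, String> dc = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : tags.entrySet()) {
            if (e.getKey().startsWith("dc.")) {
                dc.put(e.getKey(), e.getValue());
            }
        }
        return dc;
    }
}
```

Calling MetaTagSketch.dublinCoreOnly(MetaTagSketch.extract(html)) on a page's HTML gives a map of just the DC.* tags, which is the part that would then be added to the index.
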
-----Original Message-----
From: Paul Williams
Sent: 28 September 2005 10:51
To: [email protected]
Subject: Parsing HTML meta tags

Hi,

I'm trying to parse an external site that contains meta tags encoded in the HTML, such as:

<title>BBC - GCSE Bitesize - Homepage</title>
<meta name="description" content="Index for GCSE Bitesize />
<meta name="keywords" content="BBC, bbc, GCSE, Revision, Revise, Bitesize" />
<meta name="created" content="20041101">

Nutch is able to see the pages but I'm not getting any of the meta tags indexed. Is there a way to do this?

Thanks,
Paul.
