RE: Parsing HTML meta tags

Paul Williams Fri, 30 Sep 2005 01:12:41 -0700

For the record -

The solution is to override the implementations of BasicIndexingFilter
and HtmlParser so that the metadata tags are returned.  You can then
parse them for Dublin Core content and discard the rest.
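
To illustrate the "keep the Dublin Core tags, discard the rest" step,
here is a minimal standalone sketch in Python.  It is not Nutch code and
does not use the Nutch API.  The names MetaTagCollector, extract_meta and
dublin_core_only are made up for this example, the DC.title tag in the
sample is hypothetical, and treating a "DC." or "DCTERMS." name prefix as
the marker for Dublin Core is an assumption about how the pages are
tagged.

```python
from html.parser import HTMLParser

# Assumed convention: Dublin Core meta names start with one of these prefixes.
DC_PREFIXES = ("dc.", "dcterms.")


class MetaTagCollector(HTMLParser):
    """Collects name/content pairs from <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name] = content


def extract_meta(html):
    """Return every named meta tag in the page as a dict."""
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.meta


def dublin_core_only(meta):
    """Keep only the entries whose name looks like a Dublin Core element."""
    return {
        name: content
        for name, content in meta.items()
        if name.lower().startswith(DC_PREFIXES)
    }


SAMPLE = """
<title>BBC - GCSE Bitesize - Homepage</title>
<meta name="description" content="Index for GCSE Bitesize" />
<meta name="DC.title" content="GCSE Bitesize" />
<meta name="created" content="20041101">
"""

all_meta = extract_meta(SAMPLE)
dc_meta = dublin_core_only(all_meta)
```

In Nutch itself the same filtering would sit in the overridden parser and
indexing filter, with the surviving name/content pairs added to the
document as fields.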

-----Original Message-----
From: Paul Williams 
Sent: 28 September 2005 10:51
To: [email protected]
Subject: Parsing HTML meta tags

Hi,

I'm trying to parse an external site that contains meta tags encoded in
the HTML, such as:

<title>BBC - GCSE Bitesize - Homepage</title>
<meta name="description" content="Index for GCSE Bitesize" />
<meta name="keywords" content="BBC, bbc, GCSE, Revision, Revise,
Bitesize" />
<meta name="created" content="20041101">

Nutch is able to see the pages but I'm not getting any of the meta tags
indexed.  Is there a way to do this?

Thanks,
Paul.
