For the record:

Override the implementations of BasicIndexingFilter and HtmlParser so
that the metadata tags are returned. You can then parse them for Dublin
Core content and discard the rest.
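
A rough sketch of the "parse for Dublin Core and discard the rest" step
follows. It is plain JDK code, not the Nutch plugin API: the class and
method names (MetaTagSketch, extractMetaTags, dublinCoreOnly) are made up
for illustration, and it assumes Dublin Core tags use the "DC." name
prefix and that name comes before content in each tag.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone illustration only: hypothetical names, no Nutch classes used.
final class MetaTagSketch {

    // Matches <meta name="..." content="..."> (name first, double quotes).
    // A real HTML parser is more robust; this regex is an assumption
    // made to keep the sketch short.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name\\s*=\\s*\"([^\"]*)\"\\s+content\\s*=\\s*\"([^\"]*)\"\\s*/?>",
        Pattern.CASE_INSENSITIVE);

    // Collects every name/content pair, in document order.
    static Map<String, String> extractMetaTags(String html) {
        Map<String, String> tags = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            tags.put(m.group(1), m.group(2));
        }
        return tags;
    }

    // Keeps only tags whose name starts with "DC." (case-insensitive)
    // and discards the rest.
    static Map<String, String> dublinCoreOnly(Map<String, String> tags) {
        Map<String, String> dc = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : tags.entrySet()) {
            if (e.getKey().toLowerCase(Locale.ROOT).startsWith("dc.")) {
                dc.put(e.getKey(), e.getValue());
            }
        }
        return dc;
    }

    public static void main(String[] args) {
        String html = "<title>Example</title>\n"
            + "<meta name=\"DC.title\" content=\"GCSE Bitesize\" />\n"
            + "<meta name=\"keywords\" content=\"BBC, GCSE\" />\n"
            + "<meta name=\"DC.date\" content=\"20041101\">";
        System.out.println(dublinCoreOnly(extractMetaTags(html)));
    }
}
```

In an overridden parser and indexing filter, the same filtering would
presumably run on whatever metadata the parser hands back, rather than on
raw HTML as here.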

-----Original Message-----
From: Paul Williams 
Sent: 28 September 2005 10:51
To: [email protected]
Subject: Parsing HTML meta tags

Hi,

I'm trying to parse an external site that contains meta tags encoded in
the HTML, such as:

<title>BBC - GCSE Bitesize - Homepage</title>
<meta name="description" content="Index for GCSE Bitesize" />
<meta name="keywords" content="BBC, bbc, GCSE, Revision, Revise, Bitesize" />
<meta name="created" content="20041101">
 
Nutch is able to see the pages but I'm not getting any of the meta tags
indexed.  Is there a way to do this?
 
Thanks,
Paul.