Hi Melroy,
On Aug 24, 2009, at 12:20pm, melroyr wrote:
I have written a program to download HTML pages from harristeeter.
However, when I run my program, I get the following:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
"http://www.w3.org/TR/html4/frameset.dtd">
<html>
<head>
<title>Your Personal Shopping List</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
[snip]
</frameset>
<frame src="actions.jsp" name="bottomFrame" scrolling="YES" noresize>
</frameset>
<noframes><body>
This application requires the use of frames, which your browser does not
support.
</body></noframes>
</html>
The URL I am using to download the pages is
http://flyer.harristeeter.com/HT_eVIC/ThisWeek/ReviewAllSpecials.jsp
Please advise if there is some setting that I need to set in HttpClient.
I have read about HtmlCleaner and stuff but I do not think they will
help.
Well, first it would help to know what you think is the problem. The
above page seems OK to me.
If I had to guess, the issue is that you want the content of the frame
(i.e. the page that the <frame src="xxx"> link points to).
If so, then HttpClient can't automagically help you here: it returns the
frameset page exactly as the server sent it and does not follow frame
links. The easiest approach would be to use a regex to extract the
src="xxx" links, convert them from relative to absolute, and fetch each
one again, similar to what a real web crawler might do.
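
Here is a rough sketch of the extract-and-resolve step, using only the
JDK. The class and method names (FrameLinkExtractor, extractFrameUrls)
are made up for illustration, and the regex assumes the src value is
quoted and contains no spaces, so treat it as a starting point rather
than a robust HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient.
class FrameLinkExtractor {

    // Matches <frame ... src="..."> and <iframe ... src="...">, case-insensitively.
    // Assumes the src attribute value is wrapped in single or double quotes.
    private static final Pattern FRAME_SRC = Pattern.compile(
            "<i?frame\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the absolute URL of every frame found in html,
    // resolving relative src values against the URL of the page itself.
    static List<String> extractFrameUrls(String pageUrl, String html) {
        URI base = URI.create(pageUrl);
        List<String> urls = new ArrayList<>();
        Matcher m = FRAME_SRC.matcher(html);
        while (m.find()) {
            // URI.resolve handles both relative and already-absolute links.
            urls.add(base.resolve(m.group(1).trim()).toString());
        }
        return urls;
    }

    public static void main(String[] args) {
        String pageUrl =
                "http://flyer.harristeeter.com/HT_eVIC/ThisWeek/ReviewAllSpecials.jsp";
        String html =
                "<frame src=\"actions.jsp\" name=\"bottomFrame\" scrolling=\"YES\" noresize>";
        for (String url : extractFrameUrls(pageUrl, html)) {
            System.out.println(url);
        }
    }
}
```

For the frame in your output, this should resolve actions.jsp to
http://flyer.harristeeter.com/HT_eVIC/ThisWeek/actions.jsp. Each URL in
the returned list can then be fetched with the same HttpClient code you
already use for the first page.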
-- Ken