Given that you mention the .FileName property, I assume you are using TXMLDocument. Forget that and use MSXML directly. It's much better, and you can load a URL directly without first downloading it to a file. Import the MSXML 6.0 type library to create MSXML2_TLB. You will probably find that most web sites have XHTML tags but are still not valid XML. Try extracting everything from the opening html tag down to the closing html tag and processing only that piece as XML. For the website used in the sample below, if you download it to a file and strip the header lines before the html tag, it will load properly. You might be able to find a way around this; I haven't looked any further.
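A rough sketch of the "extract from the html tag down to the closing tag" idea, assuming the page text is already in a string. ExtractHtmlElement is a made-up helper name (it is not part of MSXML), the sample page text is invented for illustration, and the simple Pos search assumes the page contains one html element and no stray "<html" text in comments or scripts.

```pascal
program ExtractHtmlSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils, ActiveX, MSXML2_TLB; // MSXML2_TLB comes from importing the MSXML 6.0 type library

// Hypothetical helper: returns the text from the opening <html tag through
// the first closing </html> tag, or '' if either tag is missing.
function ExtractHtmlElement(const aSource: string): string;
const
  cCloseTag = '</html>';
var
  sLower: string;
  iStart, iEnd: Integer;
begin
  Result := '';
  sLower := LowerCase(aSource); // same length as aSource, so positions line up
  iStart := Pos('<html', sLower);
  iEnd := Pos(cCloseTag, sLower);
  if (iStart > 0) and (iEnd > iStart) then
    Result := Copy(aSource, iStart, iEnd - iStart + Length(cCloseTag));
end;

var
  oXMLDoc: DOMDocument60;
  sPage: string;
begin
  CoInitialize(nil);
  try
    // Invented sample page: a DOCTYPE header followed by a small XHTML body.
    sPage := '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" ' +
      '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">' +
      '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>' +
      '<body><p>Hello</p></body></html>';
    oXMLDoc := CoDOMDocument60.Create;
    try
      oXMLDoc.async := False;
      // loadXML parses XML held in a string rather than a file or URL.
      if oXMLDoc.loadXML(ExtractHtmlElement(sPage)) then
        Writeln('Root element: ', oXMLDoc.documentElement.nodeName)
      else
        Writeln('Load failed: ', oXMLDoc.parseError.reason);
    finally
      oXMLDoc := nil; // release the COM object before CoUninitialize
    end;
  finally
    CoUninitialize;
  end;
end.
```

This only strips whatever sits outside the html element. A page whose markup is not well formed inside that element will still fail to load, so the parseError check is still needed.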
unit Unit5;

interface

uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,
  Dialogs, StdCtrls, MSXML2_TLB;

type
  TForm5 = class(TForm)
    Button1: TButton;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

  EValidateXMLError = class(Exception)
  private
    FErrorCode: Integer;
    FReason: string;
  public
    constructor Create(aErrorCode: Integer; const aReason: string;
      const aLine, aChar, aFilePos: Integer;
      const aSrcText, aURL, aXPath: string);
    property ErrorCode: Integer read FErrorCode;
    property Reason: string read FReason;
  end;

var
  Form5: TForm5;

implementation

{$R *.dfm}

resourcestring
  RsValidateError = 'XML Validation Error (%.8x) Reason: %s XPath: %s Line: %d Char: %d File Pos: %d URL: %s Src Text: %s';

constructor EValidateXMLError.Create(aErrorCode: Integer; const aReason: string;
  const aLine, aChar, aFilePos: Integer;
  const aSrcText, aURL, aXPath: string);
begin
  inherited CreateResFmt(@RsValidateError, [aErrorCode, aReason, aXPath,
    aLine, aChar, aFilePos, aURL, aSrcText]);
  FErrorCode := aErrorCode;
  FReason := aReason;
end;

procedure TForm5.Button1Click(Sender: TObject);
var
  oXMLDoc: DOMDocument60;
  oError: IXMLDOMParseError2;
begin
  oXMLDoc := CoDOMDocument60.Create;
  oXMLDoc.async := FALSE;
  oXMLDoc.setProperty('ProhibitDTD', TRUE);
  oXMLDoc.resolveExternals := FALSE;
  oXMLDoc.validateOnParse := FALSE;
  // oXMLDoc.load() also loads file paths. Use oXMLDoc.loadXML to load XML held in a string.
  oXMLDoc.load('http://w3future.com/weblog/gems/xhtml2.xml');
  // Validation is off above, but you should still check for load errors.
  // This is different to validation, though. Check out schemacache if you
  // want to validate against an xsd.
  if oXMLDoc.parseError.errorCode <> S_OK then
  begin
    oError := oXMLDoc.parseError as IXMLDOMParseError2;
    raise EValidateXMLError.Create(oError.errorCode, oError.reason,
      oError.line, oError.linepos, oError.filepos, oError.srcText,
      oError.url, oError.errorXPath);
  end;
  ShowMessage(oXMLDoc.xml);
end;

end.

cameron

From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On Behalf Of Alister Christie
Sent: Friday, 29 January 2010 2:40 p.m.
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] web scraping using IHTMLDocument2

Thanks Cameron,

It does indeed have that header, how do I make this work?

  XMLDocument1.FileName := 'c:\temp\test.htm';
  XMLDocument1.Active := True;

Gives me various errors. I suspect that the file is not valid XML, or is there some other way of parsing it?

Alister Christie
Computers for People
Ph: 04 471 1849 Fax: 04 471 1266
http://www.salespartner.co.nz
PO Box 13085
Johnsonville
Wellington

Cameron Hart wrote:

Do you know if the websites are xhtml - do they have anything like the below at the start of the page?

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

If they are, it would be easier to load them into XML documents and process them that way using msxml DOMDocument60.

cameron

From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On Behalf Of Alister Christie
Sent: Friday, 29 January 2010 12:22 p.m.
To: NZ Borland Developers Group - Delphi List
Subject: [DUG] web scraping using IHTMLDocument2

I'm trying to do some web page scraping using IHTMLDocument2, which is working fairly well, and I can grab the second paragraph on a web page by doing something like:

  p := iDoc.all.tags('P');
  if p.Length >= 2 then
    result := p.Item(1).InnerText;

where iDoc is an instance of IHTMLDocument2. However, say there is an HTML element like

<div class="propertyInfo">Price: <span>Negotiation</span></div>

How would I be able to find the divs where class="propertyInfo"? (if anyone has much experience with IHTMLDocument2)

--
Alister Christie
Computers for People
Ph: 04 471 1849 Fax: 04 471 1266
http://www.salespartner.co.nz
PO Box 13085
Johnsonville
Wellington

_______________________________________________
NZ Borland Developers Group - Delphi mailing list
Post: delphi@delphi.org.nz
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to delphi-requ...@delphi.org.nz with Subject: unsubscribe