Hi all: I'm reviving a thread from long ago now that I've gotten a few minutes to look at this question again: How is XML data best parsed using a SAX parser in Pharo Smalltalk?
I tried to look at the GenomeTools project referenced in Miguel's message below, but it seems that the class mentioned there (GTNCBIBlastParser) is no longer in it. Perhaps there's a newer, better example of how to drive the SAX parser somewhere?

> Message: 4
> Date: Tue, 20 Jul 2010 12:25:29 -0500
> From: Miguel Enrique Cobá Martínez <[email protected]>
> Subject: Re: [Pharo-project] Markup Builder in Smalltalk (XMLWriter)
> To: [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="UTF-8"
>
> This good summary should go directly to the collaborative book.
>
>
> On Tue, 20-07-2010 at 14:11 -0300, Hernán Morales Durand wrote:
>> An XML parser just creates a representation of an XML document according
>> to a parsing model. Ideally you should choose an XML parser
>> specifically for your needs. You have different parsing models:
>>
>> -Tree Parser: This is what you will read everywhere as the "DOM parser"
>> -Event Parser: This is denoted by S*X and could be
>> --SAX Parser: Known as the "Push parser"
>> --StAX Parser: Also known as the "Pull parser"
>> -VTD Parser: This is known as "Virtual Token Descriptor"
>>
>> Now there are several classifications depending on the parser
>> characteristics and what you want to do or how. You may be interested
>> in:
>>
>> Making modifications or just processing?
>> -For modifications: The parser creates long-lived representations from
>> the XML document (necessary for modifications): You should choose DOM
>> or VTD
>> --Do you *need* to query or modify the objects (parser creates nodes): DOM
>> --You do not need the objects (parser creates integers and location
>> caches): VTD
>> -For processing: The parser doesn't create long-lived objects: SAX or StAX.
>>
>> Type of Access
>> -Back-and-forth: Access the data after the parsing is complete: DOM or VTD
>> --Massive or very frequent access: Choose DOM
>> --Rare or simple access: Choose VTD
>> -Sequential: Access the data while you're processing the document: SAX or
>> StAX
>> --Processing all tokens: SAX
>> --Processing only the tokens of interest (allows skipping forward): StAX
>>
>> Briefly
>> -Streaming applications (very large documents): SAX or StAX
>> -Database applications: DOM or VTD
>> -Hardware acceleration?: VTD
>>
>> For the S*X parsers you need to know the XML token types because, for
>> example in the case of XMLParser in Pharo/Squeak, you probably would
>> subclass SAXHandler and override one or several methods in the content
>> category to do your own processing. See GTNCBIBlastParser in
>> http://www.squeaksource.com/GenomeTools.html for an example of a SAX
>> Parser.
>>
>> XML token types:
>> Start element: <Hit>....
>> End element: ...</Hit>
>> Text: <...>Text value</...>
>> etc.
>>
>> For DOM usage examples you may see
>> http://community.ofset.org/index.php/Les_bases_de_XML_dans_Squeak (it
>> is in French but is a good document)
>>
>> What we have in Pharo/Squeak
>>
>> Parsers:
>> 1) XMLParser : Supports SAX and DOM.
>> http://www.squeaksource.com/XMLSupport.html
>> 2) VWXML Parser : Supports SAX and DOM (AFAIK)
>> http://www.squeaksource.com/VWXML.html
>> 3) XMLPullParser : Supports StAX.
>> http://www.squeaksource.com/XMLPullParser.html
>>
>> XML Query tools
>> 1) Pastell : Supports XPath-like queries. Requires XMLParser.
>> http://www.squeaksource.com/Pastell.html
>> 2) XPath library : Supports XPath partially. Requires XMLParser.
>> http://www.squeaksource.com/XPath.html
>>
>> There are several additional tools in SqueakSource but I haven't reviewed
>> them yet.
>> A VTD parser would be ideal for Smalltalk because it uses integer
>> arrays, reducing the object allocation overhead in memory.
>> I haven't found implementations of an XML VTD parser in Smalltalk as of today.
>> Cheers,
>>

Thanks,
--
Larry Gadallah, VE6VQ/W7
lgadallah AT gmail DOT com
PGP Sig: 917E DDB7 C911 9EC1 0CD9 C06B 06C4 835F 0BB8 7336
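
The push-parser workflow in the quoted summary (subclass a handler, then override the start-element, end-element, and text callbacks) can be sketched roughly as follows. Note that this uses Python's standard xml.sax module, not Pharo's XMLParser, so it only illustrates the general shape of a SAX handler and says nothing about the selectors SAXHandler actually offers. The names HitCollector and collect_hits and the sample document are made up for illustration; only the <Hit> element name comes from the quoted message.

```python
import xml.sax


class HitCollector(xml.sax.ContentHandler):
    """Hypothetical handler: collects the text of every <Hit> element."""

    def __init__(self):
        super().__init__()
        self.hits = []       # completed <Hit> text values
        self._buffer = None  # None while we are outside a <Hit>

    def startElement(self, name, attrs):
        # Start-element token: begin buffering when a <Hit> opens.
        if name == "Hit":
            self._buffer = []

    def characters(self, content):
        # Text token: may be delivered in several chunks, so accumulate.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        # End-element token: only now is the element's text complete.
        if name == "Hit" and self._buffer is not None:
            self.hits.append("".join(self._buffer).strip())
            self._buffer = None


def collect_hits(xml_text):
    """Parse xml_text with the push parser and return the <Hit> texts."""
    handler = HitCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.hits


if __name__ == "__main__":
    sample = "<Hits><Hit>first</Hit><Other>skip</Other><Hit>second</Hit></Hits>"
    print(collect_hits(sample))
```

The handler keeps only a small buffer and a result list, which is why the summary recommends SAX for very large documents: no tree of the whole document is ever built. The cost is that the handler has to track its own state (here, whether it is inside a <Hit>) between callbacks.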
