On Mon, Jun 8, 2009 at 20:39, John Millikin<[email protected]> wrote: > I'm trying to convert an XML document, incrementally, into a sequence > of XML events. A simple example XML document: > > <doc xmlns="org:myproject:mainns" xmlns:x="org:myproject:otherns"> > <title>Doc title</title> > <x:ref>abc1234</x:ref> > <html xmlns="http://www.w3.org/1999/xhtml"><body>Hello world!</body></html> > </doc> > > The document can be very large, and arrives in chunks over a socket, > so I need to be able to "feed" the text data into a parser and receive > a list of XML events per chunk. Chunks can be separated in time by > intervals of several minutes to an hour, so pausing processing for the > arrival of the entire document is not an option. The type signatures > would be something like: > > type Namespace = String > type LocalName = String > > data Attribute = Attribute Namespace LocalName String > > data XMLEvent = > EventElementBegin Namespace LocalName [Attribute] | > EventElementEnd Namespace LocalName | > EventContent String | > EventError String > > parse :: Parser -> String -> (Parser, [XMLEvent]) > > I've looked at HaXml, HXT, and hexpat, and unless I'm missing > something, none of them can achieve this: > > + HaXml and hexpat seem to disregard namespaces entirely -- that is, > the root element is parsed to "doc" instead of > ("org:myproject:mainns", "doc"), and the second child is "x:ref" > instead of ("org:myproject:otherns", "ref"). Obviously, this makes > parsing mixed-namespace documents effectively impossible. I found an > email from 2004[1] that mentions a "filter" for namespace support in > HaXml, but no further information and no working code.
I would recommend hexpat to do the job. Contrary to what you are saying, hexpat does offer namespace handling: http://hackage.haskell.org/packages/archive/hexpat/0.8/doc/html/Text-XML-Expat-Namespaced.html Perhaps you need more than that? Personally I found hexpat to be fast, space efficient and easy to use. Here is the representation I got for your example. Please note the namespaces in right places. * > (toNamespaced ( toQualified t')) Element {eName = NName {nnNamespace = Just "org:myproject:mainns", nnLocalPart = "doc"}, eAttrs = [(NName {nnNamespace = Just "http://www.w3.org/2000/xmlns/", nnLocalPart = "x"},"org:myproject:otherns"),(NName {nnNamespace = Just "org:myproject:mainns", nnLocalPart = "xmlns"},"org:myproject:mainns")], eChildren = [Text "\n",Text " ",Element {eName = NName {nnNamespace = Just "org:myproject:mainns", nnLocalPart = "title"}, eAttrs = [], eChildren = [Text "Doc title"]},Text "\n",Text " ",Element {eName = NName {nnNamespace = Just "org:myproject:otherns", nnLocalPart = "ref"}, eAttrs = [], eChildren = [Text "abc1234"]},Text "\n",Text " ",Element {eName = NName {nnNamespace = Just "http://www.w3.org/1999/xhtml", nnLocalPart = "html"}, eAttrs = [(NName {nnNamespace = Just "http://www.w3.org/1999/xhtml", nnLocalPart = "xmlns"},"http://www.w3.org/1999/xhtml")], eChildren = [Element {eName = NName {nnNamespace = Just "http://www.w3.org/1999/xhtml", nnLocalPart = "body"}, eAttrs = [], eChildren = [Text "Hello world!"]}]},Text "\n"]} Best regards Krzysztof Skrzętnicki _______________________________________________ Haskell-Cafe mailing list [email protected] http://www.haskell.org/mailman/listinfo/haskell-cafe
