Hi,
you may also want to look at:
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/xml
It knows about namespaces and, also, its parser is lazy.
-Iavor
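
For illustration, here is a minimal sketch of reading namespace-qualified element names with that package. It assumes the package's Text.XML.Light module and its parseXML, onlyElems, elChildren, elName, qURI and qName exports; the helper qualifiedNames is a made-up name for this example only. It does not show how lazy parsing behaves with chunks arriving over a socket, so treat it as a starting point rather than a tested answer to the incremental case.

```haskell
import Text.XML.Light
  ( Element, elChildren, elName, onlyElems, parseXML, qName, qURI )

-- Hypothetical helper (not part of the xml package): walk an element
-- tree depth-first and pair each element's namespace URI, if it has
-- one, with its local name.
qualifiedNames :: Element -> [(Maybe String, String)]
qualifiedNames e =
  (qURI name, qName name) : concatMap qualifiedNames (elChildren e)
  where
    name = elName e

main :: IO ()
main = do
  -- getContents reads stdin lazily; parseXML accepts a String source.
  input <- getContents
  mapM_ print (concatMap qualifiedNames (onlyElems (parseXML input)))
```

With the example document from the quoted message, the root element should come out paired with "org:myproject:mainns" rather than as a bare "doc".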
On Mon, Jun 8, 2009 at 11:39 AM, John Millikin<[email protected]> wrote:
> I'm trying to convert an XML document, incrementally, into a sequence
> of XML events. A simple example XML document:
>
> <doc xmlns="org:myproject:mainns" xmlns:x="org:myproject:otherns">
>     <title>Doc title</title>
>     <x:ref>abc1234</x:ref>
>     <html xmlns="http://www.w3.org/1999/xhtml"><body>Hello world!</body></html>
> </doc>
>
> The document can be very large, and arrives in chunks over a socket,
> so I need to be able to "feed" the text data into a parser and receive
> a list of XML events per chunk. Chunks can be separated in time by
> intervals of several minutes to an hour, so pausing processing for the
> arrival of the entire document is not an option. The type signatures
> would be something like:
>
> type Namespace = String
> type LocalName = String
>
> data Attribute = Attribute Namespace LocalName String
>
> data XMLEvent =
>     EventElementBegin Namespace LocalName [Attribute] |
>     EventElementEnd Namespace LocalName |
>     EventContent String |
>     EventError String
>
> parse :: Parser -> String -> (Parser, [XMLEvent])
>
> I've looked at HaXml, HXT, and hexpat, and unless I'm missing
> something, none of them can achieve this:
>
> + HaXml and hexpat seem to disregard namespaces entirely -- that is,
> the root element is parsed to "doc" instead of
> ("org:myproject:mainns", "doc"), and the second child is "x:ref"
> instead of ("org:myproject:otherns", "ref"). Obviously, this makes
> parsing mixed-namespace documents effectively impossible. I found an
> email from 2004[1] that mentions a "filter" for namespace support in
> HaXml, but no further information and no working code.
>
> + HXT looks promising, because I see explicit mention in the
> documentation of recording and propagating namespaces. However, I
> can't figure out if there's an incremental mode. A page on the wiki[2]
> suggests that SAX is supported in the "html tag soup" parser, but I
> want incremental parsing of *valid* documents. If incremental parsing
> is supported by the standard "arrow" interface, I don't see any
> obvious way to pull events out into a list -- I'm a Haskell newbie,
> and still haven't quite figured out monads yet, let alone Arrows.
>
> Are there any libraries that support namespace-aware incremental parsing?
>
> [1] http://www.haskell.org/pipermail/haskell-cafe/2004-June/006252.html
> [2] http://www.haskell.org/haskellwiki/HXT/Conversion_of_Haskell_data_from/to_XML
> _______________________________________________
> Haskell-Cafe mailing list
> [email protected]
> http://www.haskell.org/mailman/listinfo/haskell-cafe
