Re: [go-nuts] XML Parsing Nested Elements

Konstantin Khomoutov Tue, 07 Nov 2017 08:07:02 -0800

On Tue, Nov 07, 2017 at 03:35:45AM -0800, lesm...@gmail.com wrote:
> I am really struggling to access nested elements of an XML string and 
> suspect it is down to the namespaces.  This string is obtained from a 
> larger document and is the "innerXML" of some elements.  A simplified 
> version is at...
> 
> I could probably do this with multiple structs but want to have this in a 
> single struct.
> 
> https://play.golang.org/p/Een-guMNP9
> 
> I can seem to read things at the root but cannot get them using the ">" 
> syntax at all.  What am I doing wrong?  Can I "insert" a namespace element 
> to assist it at all?
> 
> I have manually removed the namespaces from this example to show what I 
> think should happen!?
> https://play.golang.org/p/eCzbzgBYMq


The chief problem with your approach is the lack of error checking.
The encoding/xml.Unmarshal() function returns an error value.
Had you checked it for being set (not nil), it would have given you an
immediate idea of what was wrong with your approach.

Regarding namespaces, your hunch is correct: since your XML document is
a fragment extracted from another document by a seemingly "textual"
method, all those "XML namespace prefixes" — parts in the names of the
elements which come before the ':' characters — have no meaning to the
XML parser since they are not defined in the document itself.

Unfortunately, currently there's no way to somehow explicitly define
them anywhere (say, in an instance of encoding/xml.Decoder) before
decoding, so you basically have three options:

- Somehow textually stick their definitions on the top element of your
  XML document fragment, so, say, it reads something like

    <fdm:trackInformation xmlns:fdm="urn:whatever:ns1"
         xmlns:nxcm="http://example.com/another/namespace/uri/"
         ...>

  …and then parse the resulting document into a value of a struct
  type the tags on whose fields contain full namespaces in the names
  of the XML elements they're supposed to decode.

- Use an iterative approach by creating an instance of encoding/xml.Decoder
  and calling its Token() method.

  When it returns a token of type StartElement or EndElement, the
  token's Name field can be examined to see what its "Space" and
  "Local" fields are.

- Ignore the XML namespace prefixes completely.

  In your case this appears to be the simplest solution as the
  names of the elements appear to be unique anyway.

The variant which checks for errors, ignores the XML namespace prefixes
and also defines the field named "XMLName" on the type to check the
name of the element it's supposed to unmarshal can be implemented
as follows:

--------------------------------8<--------------------------------
    package main
    
    import (
        "encoding/xml"
        "log"
    )
    
    type TrackInformation struct {
        XMLName struct{} `xml:"trackInformation"`
    
        TimeAtPosition string `xml:"timeAtPosition"`
        Speed          int    `xml:"speed"`
    
        DepApt string `xml:"qualifiedAircraftId>departurePoint>airport"`
        ArrApt string `xml:"qualifiedAircraftId>arrivalPoint>airport"`
        Gufi   string `xml:"qualifiedAircraftId>gufi"`
    }
    
    func main() {
    
        xmlToParse := `
    <fdm:trackInformation>
        <nxcm:qualifiedAircraftId>
                <nxce:aircraftId>TEST</nxce:aircraftId>
                <nxce:gufi>KR32642300</nxce:gufi>
                <nxce:departurePoint>
                        <nxce:airport>KJFK</nxce:airport>
                </nxce:departurePoint>
                <nxce:arrivalPoint>
                        <nxce:airport>KJFK</nxce:airport>
                </nxce:arrivalPoint>
        </nxcm:qualifiedAircraftId>
        <nxcm:speed>245</nxcm:speed>
        <nxcm:timeAtPosition>2017-11-07T11:20:43Z</nxcm:timeAtPosition>
    </fdm:trackInformation>`
    
        var trackInfo TrackInformation
        err := xml.Unmarshal([]byte(xmlToParse), &trackInfo)
        if err != nil {
            log.Fatal(err)
        }
        log.Println(trackInfo)
    }
--------------------------------8<--------------------------------

Playground [1].


A couple more notes.

- You can't use namespaces when defining the names of the nested
  elements.  The wording of the documentation is a bit vague but it does
  explicitly state this: «If the XML element contains a sub-element
  whose name matches the prefix of a tag formatted as "a" or "a>b>c"…» —
  notice that "the prefix of a tag" bit which actually means "the local
  name of an element".

  So when you need to match on full names of the elements, you'd have to
  use nested structs so that each field stands for an element without
  nesting, and the nesting is defined via your types rather than
  tags on their fields.

- The XML decoder implements a "strict" mode, which is "on" by default.

  What's interesting about it is that even when it's on, it turns a
  blind eye to undefined XML namespace prefixes: «Strict mode does not
  enforce the requirements of the XML name spaces TR. In particular it
  does not reject name space tags using undefined prefixes. Such tags
  are recorded with the unknown prefix as the name space URL.»

  This means that you can use your undefined namespace prefixes "as is"
  when decoding. [2] demonstrates this approach applied to the top-level
  XML elements.  You can't do this for that "a>b>c" notation in the tags
  but you still can apply it when implementing parsing using the nested
  types.

- Another trick up the sleeve of the XML decoder is support for custom
  unmarshaling functions for your custom types.

  Any of your types (such as TrackInformation) can implement a method

    UnmarshalXML(d *xml.Decoder, start xml.StartElement) error

  to make that type implement the encoding/xml.Unmarshaler interface.

  When the decoder sees that a type implements this interface, it calls
  the UnmarshalXML method instead of dealing with the element's contents
  itself.

  It follows that you can have a hierarchy of low-level unexported
  types and a top-level "facade" type defining UnmarshalXML, which
  internally first unmarshals the element using that hierarchy of types
  and then populates your "facade" type with the information that ended
  up in that hierarchy of values.


Hope this helps.

1. https://play.golang.org/p/KJvvWg9apu
2. https://play.golang.org/p/AR5vDTKX0Q
