[lxml] Re: Why does it fail cleaning GPX file?

Holger.Joukl Fri, 22 Jul 2022 01:02:35 -0700

Hi,

> I run this script to remove unneeded elements.
>
> For some reason, the input file is left as-is when I try to get rid of the
> <metadata> block; it works as expected when I ignore that element.
>
> Any idea why?


The reason is that lxml (sensibly) uses fully qualified tag names in Clark
notation (see http://www.jclark.com/xml/xmlns.htm).


>
> ========== INPUT.GPX
>
> <?xml version="1.0" encoding="UTF-8"?>
> <gpx version="1.1" creator="GPSBabel - http://www.acme.com"
> xmlns="http://www.topografix.com/GPX/1/1"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
>    <metadata>
>      <time>2022-07-21T20:29:48.309Z</time>
>      <bounds minlat="44.456597300" minlon="3.007453400"
> maxlat="45.803699000" maxlon="5.251047400"/>
>    </metadata>
>    <wpt lat="45.042569200" lon="5.040802200">
>      <name>way/4749044</name>
>      <cmt>landuse=cemetery</cmt>
>      <desc>landuse=cemetery</desc>
>      <link href="http://osm.org/browse/way/4749044"/>
>    </wpt>
> </gpx>


Here, your actual XML elements are e.g. "{http://www.topografix.com/GPX/1/1}cmt"
(in Clark notation), not plain "cmt": the default namespace declaration
xmlns="http://www.topografix.com/GPX/1/1" puts every element without a
namespace prefix into that namespace.
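
To see the effect in isolation, here is a minimal sketch. The trimmed-down
document and the GPX_NS name are my own assumptions for illustration, not part
of your script:

```python
import lxml.etree as et

# Hypothetical constant holding the GPX default namespace URI
GPX_NS = "http://www.topografix.com/GPX/1/1"

# Trimmed-down stand-in for the real GPX file
doc = (
    '<gpx xmlns="http://www.topografix.com/GPX/1/1">'
    "<wpt><cmt>landuse=cemetery</cmt></wpt>"
    "</gpx>"
)
root = et.fromstring(doc)

# A plain name only matches elements that are in no namespace at all
plain = list(root.iter("cmt"))
# The Clark-notation name matches the element in the GPX namespace
qualified = list(root.iter("{%s}cmt" % GPX_NS))

print(len(plain), len(qualified))  # 0 1

# et.QName splits a qualified tag back into namespace and local name
qname = et.QName(qualified[0])
print(qname.namespace, qname.localname)
```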

So you must iter() over the proper fully qualified names.
Here's a slightly adapted version of your example code that hopefully shows
what's going on and how to do that:

# modified element cleaning sample
xmldata = """<?xml version="1.0"?>
<gpx version="1.1" creator="GPSBabel - http://www.acme.com"
xmlns="http://www.topografix.com/GPX/1/1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <metadata>
     <time>2022-07-21T20:29:48.309Z</time>
     <bounds minlat="44.456597300" minlon="3.007453400"
maxlat="45.803699000" maxlon="5.251047400"/>
   </metadata>
   <wpt lat="45.042569200" lon="5.040802200">
     <name>way/4749044</name>
     <cmt>landuse=cemetery</cmt>
     <desc>landuse=cemetery</desc>
     <link href="http://osm.org/browse/way/4749044"/>
   </wpt>
</gpx>
"""

import lxml.etree as et
parser = et.XMLParser(remove_blank_text=True, strip_cdata=False)
root = et.fromstring(xmldata, parser)

# Show the fully qualified tag names (Clark-Notation)
for el in root.iter():
    print(el.tag)

# Remove by fully qualified Clark tag name or using a wildcard, if appropriate
# (unqualified 'link' will not get removed here since no such element exists;
# left in for demo purposes)
for el in root.iter('{http://www.topografix.com/GPX/1/1}cmt', '{*}desc', 'link'):
    parent = el.getparent()
    parent.remove(el)

print(et.tostring(root, pretty_print=True, encoding='unicode'))


(I skipped the file reading/writing for my convenience; you might also want to
look at the pathlib standard library module for your file/path name
constructions, which is sometimes nice for handling such stuff.)
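
For completeness, a rough sketch of what the file-based variant could look like
with pathlib. The clean_gpx helper, the default tag list and the file names are
hypothetical choices of mine, so adapt them to your script:

```python
from pathlib import Path

import lxml.etree as et


def clean_gpx(src, dst, tags=("{*}metadata", "{*}cmt", "{*}desc")):
    """Parse src, drop every element matching tags, write the result to dst.

    Hypothetical helper; the default tags use the {*} namespace wildcard.
    """
    parser = et.XMLParser(remove_blank_text=True)
    tree = et.parse(str(src), parser)
    # Collect the matches first so the tree is not modified while iterating
    for el in list(tree.getroot().iter(*tags)):
        parent = el.getparent()
        if parent is not None:  # the root element itself has no parent
            parent.remove(el)
    tree.write(str(dst), pretty_print=True, xml_declaration=True,
               encoding="UTF-8")


src = Path("input.gpx")                                 # assumed input name
dst = src.with_name(src.stem + ".clean" + src.suffix)   # input.clean.gpx
if src.exists():
    clean_gpx(src, dst)
```

Writing to a separate output file keeps the original around in case the tag
list turns out to be too greedy.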

HTH, Holger





