Hi,

> I run this script to remove unneeded elements.
>
> For some reason, the input file is left as-is when I try to get rid of the
> <metadata> block; it works as expected when I ignore that element.
>
> Any idea why?

The reason is that lxml (sensibly) uses fully qualified tag names in Clark notation (see http://www.jclark.com/xml/xmlns.htm).

> ========== INPUT.GPX
>
> <?xml version="1.0" encoding="UTF-8"?>
> <gpx version="1.1" creator="GPSBabel - http://www.acme.com"
> xmlns="http://www.topografix.com/GPX/1/1"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
> <metadata>
> <time>2022-07-21T20:29:48.309Z</time>
> <bounds minlat="44.456597300" minlon="3.007453400"
> maxlat="45.803699000" maxlon="5.251047400"/>
> </metadata>
> <wpt lat="45.042569200" lon="5.040802200">
> <name>way/4749044</name>
> <cmt>landuse=cemetery</cmt>
> <desc>landuse=cemetery</desc>
> <link href="http://osm.org/browse/way/4749044"/>
> </wpt>
> </gpx>

Here, your actual XML elements are e.g. "{http://www.topografix.com/GPX/1/1}cmt" (in Clark notation), not plain "cmt". Because of the default namespace declaration

xmlns="http://www.topografix.com/GPX/1/1"

all elements without a namespace prefix belong to that namespace. So you must iter() over the proper fully qualified names.

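To make the point about qualified names concrete, here is a small sketch. It is an illustration only, not part of the original script: the GPX_NS constant and the trimmed-down document are made up for the demo, and the try/except import falls back to the standard library's xml.etree.ElementTree (which reports tags in the same Clark notation) purely so the snippet also runs where lxml is not installed.

```python
try:
    import lxml.etree as et  # the library discussed in this thread
except ImportError:
    import xml.etree.ElementTree as et  # stdlib fallback; same Clark notation

# Assumption: a constant name chosen for this demo; the URI is the one
# declared as the default namespace in the sample GPX file.
GPX_NS = "http://www.topografix.com/GPX/1/1"

# Trimmed-down stand-in for INPUT.GPX (bytes, no XML declaration needed).
xmldata = b"""<gpx version="1.1" xmlns="http://www.topografix.com/GPX/1/1">
  <metadata><time>2022-07-21T20:29:48.309Z</time></metadata>
  <wpt lat="45.042569200" lon="5.040802200"><cmt>landuse=cemetery</cmt></wpt>
</gpx>"""

root = et.fromstring(xmldata)

# The plain name only matches elements that are in no namespace at all.
plain_matches = list(root.iter("metadata"))

# QName builds the Clark-notation string "{namespace}localname".
qualified_name = et.QName(GPX_NS, "metadata").text
qualified_matches = list(root.iter(qualified_name))

print(len(plain_matches))      # number of hits for the unqualified name
print(qualified_name)          # the tag name the parser actually stores
print(len(qualified_matches))  # number of hits for the qualified name
```

The unqualified lookup comes back empty, which is why the original script appeared to do nothing for <metadata>.
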
Here's a slightly adapted version of your example code that hopefully shows what's going on and how to do that:

# modified element cleaning sample
xmldata = """<?xml version="1.0"?>
<gpx version="1.1" creator="GPSBabel - http://www.acme.com" xmlns="http://www.topografix.com/GPX/1/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<metadata>
  <time>2022-07-21T20:29:48.309Z</time>
  <bounds minlat="44.456597300" minlon="3.007453400" maxlat="45.803699000" maxlon="5.251047400"/>
</metadata>
<wpt lat="45.042569200" lon="5.040802200">
  <name>way/4749044</name>
  <cmt>landuse=cemetery</cmt>
  <desc>landuse=cemetery</desc>
  <link href="http://osm.org/browse/way/4749044"/>
</wpt>
</gpx>
"""

import lxml.etree as et

parser = et.XMLParser(remove_blank_text=True, strip_cdata=False)
root = et.fromstring(xmldata, parser)

# Show the fully qualified tag names (Clark notation)
for el in root.iter():
    print(el.tag)

# Remove by fully qualified Clark tag name, or using a wildcard if appropriate.
# (The unqualified 'link' matches nothing here, since no element with that
# un-namespaced tag exists; it is left in for demo purposes.)
for el in root.iter('{http://www.topografix.com/GPX/1/1}cmt', '{*}desc', 'link'):
    parent = el.getparent()
    parent.remove(el)

print(et.tostring(root, pretty_print=True, encoding='unicode'))

(I skipped the file reading/writing for my convenience; you might also want to look at the pathlib standard library module for your file/path name constructions, which is sometimes nice for handling such stuff.)

HTH,
Holger
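
As a footnote to the pathlib remark above, the skipped file reading/writing could look roughly like the sketch below. Everything named here is hypothetical and not from the original script: the helper names strip_elements and _prune, the GPX_NS constant, and the choice to pass local tag names plus one namespace. The removal is done by walking children instead of calling getparent(), only so that the same code also runs with the standard library's ElementTree if lxml is missing.

```python
from pathlib import Path

try:
    import lxml.etree as et  # the library discussed in this thread
except ImportError:
    import xml.etree.ElementTree as et  # stdlib fallback so the sketch still runs

# Assumption: constant name chosen for this sketch; URI taken from the sample file.
GPX_NS = "http://www.topografix.com/GPX/1/1"


def _prune(parent, targets):
    """Recursively detach children whose Clark-notation tag is in targets."""
    removed = 0
    for child in list(parent):  # iterate over a copy, since parent is modified
        if child.tag in targets:
            parent.remove(child)
            removed += 1
        else:
            removed += _prune(child, targets)
    return removed


def strip_elements(src, dst, local_names, namespace=GPX_NS):
    """Read src, drop the named elements, write the result to dst.

    Returns the number of elements that were removed.
    """
    src, dst = Path(src), Path(dst)
    # Build the fully qualified names, e.g. "{namespace}metadata".
    targets = {"{%s}%s" % (namespace, name) for name in local_names}
    tree = et.parse(str(src))
    removed = _prune(tree.getroot(), targets)
    tree.write(str(dst), encoding="UTF-8", xml_declaration=True)
    return removed
```

A call might then be strip_elements("INPUT.GPX", "OUTPUT.GPX", ["metadata", "cmt", "desc"]), where the output file name is made up for the example. Note that with the stdlib fallback the written file uses generated namespace prefixes instead of the default namespace; the element names are the same once parsed.
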