[lxml] Re: Turn three-line block into single?

Adrian Bool Mon, 08 Aug 2022 15:56:45 -0700

Hi Gilles,

I guess you intend to use 'sort -u' on your data?  An alternative would
be to de-duplicate the data as XML rather than as text.


Here is something to play with...

For the input file:

<data>
    <entries>
         <wpt lat="46.98520" lon="6.8831">
            <name>London</name>
        </wpt>
         <wpt lat="46.98520" lon="2.8831">
            <name>Paris</name>
        </wpt>
         <wpt lat="46.98520" lon="-4.8831">
            <name>Manhattan</name>
        </wpt>
        <wpt lat="46.98520" lon="6.8831">
            <name>London 2</name>
        </wpt>
         <wpt lat="46.98520" lon="-4.8831">
            <name>New York</name>
        </wpt>
    </entries>
</data>


We can process it with the following code, using Python's set() object to remove
duplicates:

#!/usr/bin/env python3

from lxml import etree

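# Create a custom class that knows which attributes of wpt
# we care about when deciding whether two elements are duplicates.
#
# Note that both __eq__() and __hash__() need to be defined. I was
# originally expecting that just __hash__() would have been sufficient
# for set() to cull duplicates, but set() uses the hash only to pick a
# bucket and then calls __eq__() to confirm that two items really match.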
class WPT(etree.ElementBase):
    def __eq__(self, b):
        return (self.attrib['lat'] == b.attrib['lat']
                and self.attrib['lon'] == b.attrib['lon'])
    def __hash__(self):
        return hash( (self.attrib['lat'], self.attrib['lon']) )

# Create a parser that returns WPT objects in place of _Elements
# but only for elements with a name of 'wpt'
def get_wpt_parser():
    lookup = etree.ElementNamespaceClassLookup()
    parser = etree.XMLParser()
    parser.set_element_class_lookup(lookup)
    namespace = lookup.get_namespace('')
    namespace['wpt'] = WPT
    return parser

# Load the XML data and find the parent of the data we're interested in.
# etree.parse() returns an ElementTree; its find() searches from the
# root <data> element.
wpt_parser = get_wpt_parser()
root = etree.parse('input.xml', wpt_parser)
entries = root.find('entries')

# Some sanity checking: Print out the Python type of the entries
# element (should be a traditional _Element) and each of the children,
# which should be of type WPT.
print(f"type(entries) = {type(entries)}")
print(f"type(entries.children) = {','.join(str(type(c)) for c in entries)}")

# Read the child elements of the parent into a set, which will cause
# duplicated entries to be removed, with set() leveraging the __eq__ and
# __hash__ functions of the WPT class above. Note that a set is
# unordered, so the surviving elements may come out in a different order.
children = set(entries.iterchildren())

# Replace the original children with the unique children
entries[:] = children

# Write out the resultant XML
with open('output.xml', 'wb') as output_file:
    output_file.write(etree.tostring(root))



This results in the following output:

<data>
    <entries>
         <wpt lat="46.98520" lon="6.8831">
            <name>London</name>
        </wpt>
         <wpt lat="46.98520" lon="-4.8831">
            <name>Manhattan</name>
        </wpt>
        <wpt lat="46.98520" lon="2.8831">
            <name>Paris</name>
        </wpt>
         </entries>
</data>

Which may well be what you're after...  If the contents of the <name> elements
should also be part of the "is equal" test, then the WPT class can be updated to
include this data too in the __eq__ and __hash__ functions.
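
As a rough sketch of that idea, here is a variant that includes the <name>
text in the comparison. It uses a plain key function rather than the custom
element class, and it keeps the first occurrence of each waypoint in
document order. wpt_key() and dedupe() are hypothetical names of mine, and
the sample data is made up for illustration.

```python
from lxml import etree

def wpt_key(wpt):
    # Everything that should count towards "is equal": lat, lon and name
    return (wpt.get("lat"), wpt.get("lon"), wpt.findtext("name"))

def dedupe(parent, key):
    """Remove later children whose key matches an earlier child."""
    seen = set()
    # Iterate over a copy, since children are removed along the way
    for child in list(parent):
        k = key(child)
        if k in seen:
            parent.remove(child)
        else:
            seen.add(k)

xml = """<data><entries>
  <wpt lat="46.98520" lon="6.8831"><name>London</name></wpt>
  <wpt lat="46.98520" lon="2.8831"><name>Paris</name></wpt>
  <wpt lat="46.98520" lon="6.8831"><name>London</name></wpt>
  <wpt lat="46.98520" lon="6.8831"><name>London 2</name></wpt>
</entries></data>"""

root = etree.fromstring(xml)
entries = root.find("entries")
dedupe(entries, wpt_key)
names = [wpt.findtext("name") for wpt in entries]
print(names)
```

Here only the second "London" is dropped; "London 2" survives because its
name differs even though the coordinates match.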

Cheers,

aid



> On 8 Aug 2022, at 20:32, Gilles <codecompl...@free.fr> wrote:
> 
> Hello,
> 
> Before I  resort to a regex, I figured I should ask here.
> 
> To find and remove possible duplicates, I need to turn each block into a 
> single line:
> 
> FROM
> 
>   <wpt lat="46.98520" lon="6.8831">
>     <name>blah</name>
>   </wpt>
> 
> TO
> 
>   <wpt lat="46.98520" lon="6.8831"><name>blah</name></wpt>
> 
> Do you know of a way to do this in lxml?
> 
> Thank you.
> 
> _______________________________________________
> lxml - The Python XML Toolkit mailing list -- lxml@python.org
> To unsubscribe send an email to lxml-le...@python.org
> https://mail.python.org/mailman3/lists/lxml.python.org/
> Member address: a...@logic.org.uk
