[lxml] html diff changes the structure of the html (and a mystery about debugging)

Steve Mon, 05 Sep 2022 12:01:30 -0700

Hi,

I'm using lxml's HTML diff functionality in a project and it has been working
well so far. Some time ago, though, I noticed that the generated diff changes
the structure of the HTML in a way that is less than ideal for our use case.


An example:

>>> from lxml.html import diff
>>> a = "<div id='first'>some old text</div><div id='last'>more old text</div>"
>>> b = "<div id='first'>some old text</div><div id='middle'>and new text</div><div id='last'>more old text</div>"
>>> diff.htmldiff(a, b)
('<div id="middle"> <div id="first"><ins>some old text</ins></div><ins>and new</ins> <del>some old</del> text</div><div id="last">more old '
 'text</div>')
>>>

As you can see, the div with id=middle has been inserted at the beginning of
the document, where it encloses the div with id=first. I believe this happens
because lxml unconditionally inserts 'unbalanced tags' at the beginning of a
set of 'chunks' when it surrounds those chunks with <ins> tags:

https://github.com/lxml/lxml/blob/master/src/lxml/html/diff.py#L241

Could we be a bit smarter about this and insert the unbalanced tags as we
encounter them instead? For instance, by closing the currently open `<ins>`
tag, inserting the unbalanced tag, and then opening a new `<ins>` tag.
Something like:
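
A rough, self-contained sketch of that idea. This is not lxml's code:
`find_unbalanced` and `wrap_ins_in_place` are hypothetical helpers, and the
tag matching is deliberately simplified (it assumes each tag is its own chunk
and does not special-case void tags such as <br>).

```python
import re

# Matches a chunk that is a single opening or closing tag, e.g. '<div id="x">'.
_TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>\s*$")


def find_unbalanced(chunks):
    """Return the indices of tag chunks whose partner is not in `chunks`."""
    unbalanced = set()
    stack = []  # (index, tag name) of opening tags not yet closed
    for i, chunk in enumerate(chunks):
        m = _TAG.match(chunk)
        if not m:
            continue  # a plain word, not a tag
        closing, name = m.group(1), m.group(2).lower()
        if not closing:
            stack.append((i, name))
        elif stack and stack[-1][1] == name:
            stack.pop()  # balanced pair inside the inserted chunks
        else:
            unbalanced.add(i)  # closing tag opened outside these chunks
    unbalanced.update(i for i, _ in stack)  # opening tags never closed here
    return unbalanced


def wrap_ins_in_place(chunks):
    """Wrap runs of balanced chunks in <ins>, leaving unbalanced tags in place."""
    unbalanced = find_unbalanced(chunks)
    out = []
    in_ins = False
    for i, chunk in enumerate(chunks):
        if i in unbalanced:
            if in_ins:  # close the current <ins> before the structural tag
                out.append("</ins>")
                in_ins = False
            out.append(chunk)
        else:
            if not in_ins:  # (re)open <ins> for the next run of content
                out.append("<ins>")
                in_ins = True
            out.append(chunk)
    if in_ins:
        out.append("</ins>")
    return out


inserted = ['</div>', '<div id="middle">', 'and ', 'new ', 'text']
result = wrap_ins_in_place(inserted)
```

With these inputs the two unbalanced tags stay where they were encountered
and only the words end up inside <ins>.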


Secondly, it would be great if someone could help me understand (or point me
to an explanation of) why the compiled version of the code returns something
different from what I get when I execute the exact same set of functions from
the REPL.

(assuming the same context in the REPL as above)

>>> tokens_a = diff.tokenize(a)
>>> tokens_b = diff.tokenize(b)
>>> diff.htmldiff_tokens(tokens_a, tokens_b)
['<div id="middle"> ', '<ins>', '<div id="first">', 'some ', 'old ', 'text',
 '</div>', 'and ', 'new', '</ins> ', '<del>', 'some ', 'old', '</del> ',
 'text', '</div>', '<div id="last">', 'more ', 'old ', 'text', '</div>']
>>> s = diff.InsensitiveSequenceMatcher(tokens_a, tokens_b)
>>> commands = s.get_opcodes()
>>> list(commands)
[('delete', 0, 9, 0, 0)]

htmldiff_tokens() is evidently getting two opcodes, either an insert or a
replace followed by a delete, but when I instantiate
InsensitiveSequenceMatcher() from the REPL it generates only the delete! This
is driving me nuts. Am I doing something wrong?
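
One possible explanation, sketched with the standard library only.
InsensitiveSequenceMatcher subclasses difflib.SequenceMatcher, whose first
positional parameter is `isjunk` rather than `a`. Passing both token lists
positionally would therefore compare tokens_b against an empty sequence, which
yields a single delete. The token lists below are hand-written stand-ins (an
assumption), not the output of diff.tokenize().

```python
import difflib

# Hand-written stand-ins for the word tokens of `a` and `b` (assumption:
# the real lxml tokens are str subclasses that compare like these strings).
tokens_a = ["some", "old", "text", "more", "old", "text"]
tokens_b = ["some", "old", "text", "and", "new", "text",
            "more", "old", "text"]

# Positional call: tokens_a is bound to `isjunk`, tokens_b to `a`,
# and `b` keeps its default of '' (an empty sequence).
positional = difflib.SequenceMatcher(tokens_a, tokens_b)
positional_ops = positional.get_opcodes()
print(positional_ops)  # [('delete', 0, 9, 0, 0)]

# Keyword call: both sequences are actually compared with each other.
keyword = difflib.SequenceMatcher(a=tokens_a, b=tokens_b)
keyword_ops = keyword.get_opcodes()
print(keyword_ops)
```

If that is the cause, calling diff.InsensitiveSequenceMatcher(a=tokens_a,
b=tokens_b) from the REPL should line up with what htmldiff_tokens() sees.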

cheers,
Steve

