Hi,

Just following up on this. I noticed that I hadn't provided the actual
alternate code in my previous mail, although I referenced it. Sorry about
that.

In any case, I've submitted a PR with a potential fix for the issue as 
described: https://github.com/lxml/lxml/pull/350

I'm unsure why the diff generation adjusts the leading/trailing spaces
around the tags, but I've retained that behaviour. Note that the
implementation passes all existing tests and also includes an additional
test with the example from my previous mail.
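
To illustrate the general idea (close the open <ins>, emit the unbalanced
tag, then open a new <ins>), here is a minimal standalone sketch. It is not
the code from the PR and it does not use lxml: the helper names is_tag,
tag_name, unbalanced_indices and merge_insert_in_place are hypothetical,
chunks are assumed to be plain strings, and both the whitespace handling
around tags and void/self-closing tags are left out.

```python
def is_tag(chunk):
    """Treat any chunk that looks like '<...>' as a tag (simplifying assumption)."""
    return chunk.startswith("<") and chunk.rstrip().endswith(">")


def tag_name(chunk):
    """'<div id="x">' -> 'div', '</div>' -> 'div'."""
    return chunk.strip("</> \t\n").split()[0].lower()


def unbalanced_indices(chunks):
    """Return the indices of tags that have no partner inside `chunks`."""
    open_stack = []      # indices of start tags still waiting for an end tag
    unbalanced = set()
    for i, chunk in enumerate(chunks):
        if not is_tag(chunk):
            continue
        if chunk.startswith("</"):
            if open_stack and tag_name(chunks[open_stack[-1]]) == tag_name(chunk):
                open_stack.pop()        # matched pair, both tags are balanced
            else:
                unbalanced.add(i)       # end tag without a start tag in this run
        else:
            open_stack.append(i)
    unbalanced.update(open_stack)       # start tags that were never closed
    return unbalanced


def merge_insert_in_place(ins_chunks):
    """Wrap inserted chunks in <ins>, leaving unbalanced tags where they occur."""
    unbalanced = unbalanced_indices(ins_chunks)
    out = []
    ins_open = False
    for i, chunk in enumerate(ins_chunks):
        if i in unbalanced:
            if ins_open:                # close <ins> before the unbalanced tag
                out.append("</ins>")
                ins_open = False
            out.append(chunk)
        else:
            if not ins_open:            # (re)open <ins> for ordinary content
                out.append("<ins>")
                ins_open = True
            out.append(chunk)
    if ins_open:
        out.append("</ins>")
    return out


example = ["some ", "text", "</div>", '<div id="middle">', "and ", "new"]
merged = merge_insert_in_place(example)
print(merged)
```

With this approach the unbalanced </div> and <div id="middle"> stay at the
position where they were encountered, instead of being moved to the front
of the inserted run.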

I'd be happy to receive any kind of feedback about the issue.

As an aside, the mystery of the differing results when debugging from the
REPL continues to confound me!

cheers,
Steve


On Mon, 2022-09-05 at 19:13 +0100, Steve wrote:
> Hi,
> 
> I'm using lxml's HTML diff functionality in a project and it has been
> working well so far. Some time back, though, I noticed that the generated
> diff changes the structure of the HTML in a manner that's less than ideal
> for our use case.
> 
> An example:
> 
> >>> from lxml.html import diff
> >>> a = "<div id='first'>some old text</div><div id='last'>more old text</div>"
> >>> b = "<div id='first'>some old text</div><div id='middle'>and new text</div><div id='last'>more old text</div>"
> >>> diff.htmldiff(a, b)
> ('<div id="middle"> <div id="first"><ins>some old text</ins></div><ins>and new</ins> <del>some old</del> text</div><div id="last">more old '
>  'text</div>')
> >>>
> 
> As you can see, the div with id=middle has been inserted at the beginning
> of the document and it encloses the div with id=first. I believe this
> happens because lxml unconditionally inserts 'unbalanced tags' at the
> beginning of a set of 'chunks' when it surrounds the chunks with the
> <ins> tags:
> 
> https://github.com/lxml/lxml/blob/master/src/lxml/html/diff.py#L241
> 
> Could we potentially be a bit smarter about this and insert the
> unbalanced tags as we encounter them instead? For instance, by closing
> out the opened `<ins>` tag, inserting the unbalanced tag and opening a
> new `<ins>` tag. Something like:
> 
> 
> Secondly, it would be great if someone could help me understand (or give
> me pointers to) why I'm seeing a difference in behaviour between what the
> compiled version of the code returns and what I get when executing the
> exact same set of functions from the REPL!?
> 
> (assuming the same context in the REPL as above)
> 
> >>> tokens_a = diff.tokenize(a)
> >>> tokens_b = diff.tokenize(b)
> >>> diff.htmldiff_tokens(tokens_a, tokens_b)
> ['<div id="middle"> ', '<ins>', '<div id="first">', 'some ', 'old ', 'text', '</div>', 'and ', 'new', '</ins> ',
> '<del>', 'some ', 'old', '</del> ', 'text', '</div>', '<div id="last">', 'more ', 'old ', 'text', '</div>']
> >>> s = diff.InsensitiveSequenceMatcher(tokens_a, tokens_b)
> >>> commands = s.get_opcodes()
> >>> list(commands)
> [('delete', 0, 9, 0, 0)]
> 
> htmldiff_tokens() is obviously getting 2 opcodes, either an insert or a
> replace, followed by a delete, but if I instantiate
> InsensitiveSequenceMatcher() from the REPL, it generates only the delete!
> This is driving me nuts! Am I doing something wrong?
> 
> cheers,
> Steve
> 
> 
> _______________________________________________
> lxml - The Python XML Toolkit mailing list -- lxml@python.org
> To unsubscribe send an email to lxml-le...@python.org
> https://mail.python.org/mailman3/lists/lxml.python.org/
> Member address: st...@lonetwin.net
