Bug#553490: wdiff: Does not handle UTF-8 properly (fwd)

2011-10-20 Thread Santiago Vila
Hello.

I received this from the Debian bug system.
I've checked and the current version (1.0.1) still shows the bug.
[ Please keep the Cc: lines when replying, thanks ].

[ Apologies to the submitter for taking so long to process this ]

-- Forwarded message --
From: Josh Triplett j...@joshtriplett.org
To: Debian Bug Tracking System sub...@bugs.debian.org
Date: Sat, 31 Oct 2009 11:39:08 -0700
Subject: wdiff: Does not handle UTF-8 properly

Package: wdiff
Version: 0.5-19
Severity: normal

wdiff -a uses backspace and overstrike to provide emphasis; thus, it
will emphasize 'x' by printing 'x^Hx'.  When it encounters a UTF-8
character, it does this for each byte, rather than for each character;
thus, emphasis of '\xE2\x80\x99' (U+2019 RIGHT SINGLE QUOTATION MARK)
looks like '\xE2^H\xE2\x80^H\x80\x99^H\x99', when it should look
like '\xE2\x80\x99^H\xE2\x80\x99'.
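
To make the difference concrete, here is a minimal sketch in Python (not wdiff's actual C code; the function names overstrike_bytes and overstrike_chars are made up for illustration). The first function overstrikes every byte, which reproduces the broken output; the second overstrikes every decoded code point, which gives the expected output for U+2019.

```python
BACKSPACE = b"\b"  # the ^H byte, 0x08


def overstrike_bytes(data: bytes) -> bytes:
    """Overstrike byte by byte: the behaviour described in this report."""
    out = bytearray()
    for value in data:
        single = bytes([value])
        out += single + BACKSPACE + single
    return bytes(out)


def overstrike_chars(text: str, encoding: str = "utf-8") -> bytes:
    """Overstrike code point by code point: the behaviour asked for."""
    return "".join(ch + "\b" + ch for ch in text).encode(encoding)


if __name__ == "__main__":
    quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK, UTF-8 bytes E2 80 99
    print(overstrike_bytes(quote.encode("utf-8")).hex(" "))
    print(overstrike_chars(quote).hex(" "))
```

The per-byte variant splits the three-byte sequence with backspaces, so a pager never sees a complete character to emphasize.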

- Josh Triplett

[...]






Bug#553490: [wdiff-bugs] Bug#553490: wdiff: Does not handle UTF-8 properly (fwd)

2011-10-20 Thread Martin von Gagern
Dear Santiago, Dear Josh,

I've already noticed that bug in your bug tracker, and added it to the
wdiff bug tracker at Savannah: https://savannah.gnu.org/bugs/?34224

Right now, I'm not sure how best to handle this case. Unicode support is
a big problem for the current wdiff implementation, in many ways. For
example, I guess that the most sensible way to really simulate
overstrike printing would be detecting grapheme clusters, i.e. even
treating sequences of multiple code points as a single entity if some of
the code points are combining.
http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries has the
details on this, but I don't think I'll implement this in wdiff myself.
I've been toying with the idea of writing wdiff up from scratch with
stuff like this in mind, using ICU break iterators or similar. Won't
happen too soon, though.
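
As a rough illustration of the idea, the following Python sketch groups each base character with the combining marks that follow it and overstrikes the group as one unit. This is only an approximation under stated assumptions: it relies on the canonical combining class from the standard unicodedata module and does not implement the full boundary rules of the report linked above (no Hangul, ZWJ or emoji handling). The names approx_clusters and overstrike_clusters are hypothetical, and whether a given pager accepts such output is exactly the open question.

```python
import unicodedata


def approx_clusters(text: str) -> list[str]:
    """Attach combining marks (combining class != 0) to the preceding character.

    Simplified stand-in for real grapheme cluster segmentation.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


def overstrike_clusters(text: str) -> str:
    """Emit each approximate cluster, one backspace, then the cluster again."""
    return "".join(c + "\b" + c for c in approx_clusters(text))


if __name__ == "__main__":
    word = "e\u0301t\u00e9"  # decomposed e-acute, t, precomposed e-acute
    print(approx_clusters(word))
    print(repr(overstrike_clusters(word)))
```

A real implementation would more likely use a proper break iterator, such as the ICU ones mentioned above, rather than this shortcut.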

I'm also not sure which versions of less behave in which ways. For
one, I doubt that all of them will know about grapheme clusters when
reading their input, so they might fail to turn it back into character
attributes as expected. I also think that most less implementations
these days will handle terminal control codes just fine, particularly if
called as less -R. So that overstriking thing might be obsolete in any
case.

Therefore I hope to roll a release soon which will pass terminal control
sequences to less, thus avoiding that overstrike stuff. I'll have to
give a bit more thought to the best combination of configure switches,
environment variables and command line options, though.
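
For illustration only, here is a small Python sketch of what emphasis via terminal control sequences could look like. It assumes standard ANSI SGR codes (bold, underline, reset); the choice of codes and the helper name emphasize are assumptions for this example, not what wdiff emits or will emit.

```python
# Standard ANSI SGR (Select Graphic Rendition) sequences.
SGR_BOLD = "\x1b[1m"
SGR_UNDERLINE = "\x1b[4m"
SGR_RESET = "\x1b[0m"


def emphasize(text: str, start: str = SGR_BOLD) -> str:
    """Wrap text in an SGR start sequence and a reset.

    The text itself is passed through untouched, so multi-byte UTF-8
    characters are never split the way per-byte overstrike splits them.
    """
    return f"{start}{text}{SGR_RESET}"


if __name__ == "__main__":
    print("deleted:", emphasize("old\u2019s", SGR_UNDERLINE))
    print("inserted:", emphasize("new\u2019s", SGR_BOLD))
```

Piping such output through less -R should display the attributes instead of the raw escape characters.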

Greetings,
 Martin von Gagern


