Bruno Haible wrote on 2000-09-14 14:42 UTC:
> Karlsson Kent writes:
> > Markus Kuhn wrote:
> > > For the soft hyphen (SHY, 173=0xAD), the discussion might be a bit more
> > > tricky (see <http://www.hut.fi/~jkorpela/shy.html> for a good
> > > discussion), but I would also classify that one as printable as well,
> > > and so does Unicode.
> > 
> > It's visible in rendering when an auto-linewrap follows it.  It should not
> > be shown if there is no (auto)line-wrap immediately after it.
> 
> That's only one of the positions that various standards take on this
> point. The other position is that the SHY should be always visible.
> Therefore you are not getting into trouble if you create text with SHY
> at the end of a line, but you *will* get trouble if you create text with
> SHY in the middle.
> 
> > And the character itself should NOT be removed when a paragraph is
> > reflowed.

If you intend to display your text on a VT100-style terminal emulator,
then you had better remove every SHY and the whitespace immediately
following it each time before you reflow paragraphs, because terminals
have always treated SHY just like a normal graphical character.
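
As a rough illustration, the clean-up step before reflowing might look
like this in Python. This is only a sketch: the function names are made
up for this example, and it assumes that every SHY in the input was put
there by an earlier formatting pass at the end of a line.

```python
import re
import textwrap

SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def unhyphenate(text: str) -> str:
    """Delete each SHY together with the whitespace that follows it."""
    return re.sub(SHY + r"\s+", "", text)


def reflow(text: str, width: int = 72) -> str:
    """Rejoin hyphenated words, then wrap the paragraph again."""
    joined = unhyphenate(text)
    # Collapse the old line breaks and runs of spaces before wrapping.
    normalized = " ".join(joined.split())
    return textwrap.fill(normalized, width=width)
```

An ordinary HYPHEN-MINUS at the end of a line (as in "well-" followed
by "known") is left alone, which is the whole point of marking the
formatter's own hyphens with SHY.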

> I'd on the contrary recommend that paragraph reformatters convert
> U+00AD "SOFT HYPHEN" to U+2027 "HYPHENATION POINT" when removing the
> line break following it.

No! U+2027 "HYPHENATION POINT" is a completely normal graphical
character for use in dictionaries, where you write things like
hy‧phe‧na‧tion. Unicode does at the moment *not* have a control
character ZERO-WIDTH HYPHENATION POINT that indicates (like \- in TeX)
to a paragraph reformatter that this position is a suitable point for
hyphenating a word. Some Unicode folks argue that SHY should be used for
this. Others point out that 

  - the original ISO 8859-1 meaning of SHY was clearly to represent
    the graphical character HYPHEN when it was automatically inserted
    by a paragraph reformatting algorithm at the end of the line (such
    that it can be removed when the paragraph is formatted again later),
  - SHY should never appear inside a line (see
    <http://www.hut.fi/~jkorpela/shy.html>),
  - any attempt by the Unicode Consortium to redefine SHY into a
    potentially invisible control character is unwise, misguided
    and blasphemous.

I concur with this latter view.

If users want to have a ZERO-WIDTH HYPHENATION POINT in Unicode (which
becomes visible as a hyphen during display only if a line break occurs
right after it), then they should add this character and not abuse a
well-established existing one like SHY for that purpose. SHY was added
for a very different processing model (more that of
WordStar/Emacs/vi/etc. than that of Word/HTML/etc.)!
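
To make the intended display behaviour concrete, here is a minimal
Python sketch. Since no such character exists in Unicode, it uses the
private-use code point U+E000 purely as a stand-in, and the function
name break_word is invented for this example.

```python
from typing import Tuple

# Stand-in only: Unicode has no ZERO-WIDTH HYPHENATION POINT, so a
# private-use code point plays its role in this sketch.
ZWHP = "\ue000"


def break_word(word: str, room: int) -> Tuple[str, str]:
    """Split a marked word so that the head fits into `room` columns.

    Returns (head, tail). The head ends in a visible hyphen. If no
    marked point fits, the head is empty and the tail is the whole
    word. The markers themselves never appear in the output.
    """
    parts = word.split(ZWHP)
    # Try the rightmost marked point first, then move left.
    for i in range(len(parts) - 1, 0, -1):
        head = "".join(parts[:i]) + "-"
        if len(head) <= room:
            return head, "".join(parts[i:])
    return "", "".join(parts)
```

With the word "Stau" + ZWHP + "becken" and six columns of room, this
yields "Stau-" and "becken"; with only four columns it yields the
unbroken word "Staubecken" for the next line, with no hyphen shown.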

In general, adding markers for potential hyphenation points in the
middle of a text is mostly a bad idea. The Correct Thing[TM] to do is to
add a hyphenation exceptions dictionary (like TeX's \hyphenation{...})
to a word processing file if the user wants to manually override
decisions made by a hyphenation algorithm. This way, the override is
applied consistently throughout the entire text. The only reason for
inserting explicit hyphenation points directly into the text is the few
exotic occasions where the same word (usually a curious combination of
joined nouns) has to be hyphenated differently depending on its
meaning. For example in German: Staub-ecken (dust corners) versus
Stau-becken (water reservoir).
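
A per-file exceptions dictionary could be sketched in Python as
follows. All names here are hypothetical, and crude_algorithm is a
deliberately silly placeholder for a real hyphenation algorithm, not a
description of how TeX or any word processor does it.

```python
from typing import Callable, Dict, List


def points_from_pattern(pattern: str) -> List[int]:
    """Turn 'Stau-becken' into [4]: offsets into the unhyphenated word."""
    points, offset = [], 0
    for ch in pattern:
        if ch == "-":
            points.append(offset)
        else:
            offset += 1
    return points


def crude_algorithm(word: str) -> List[int]:
    """Placeholder: allow a break after every third character."""
    return [i for i in range(3, len(word) - 1, 3)]


def break_points(word: str,
                 exceptions: Dict[str, str],
                 algorithm: Callable[[str], List[int]]) -> List[int]:
    """Use the exceptions entry if there is one, else ask the algorithm."""
    pattern = exceptions.get(word.lower())
    if pattern is not None:
        return points_from_pattern(pattern)
    return algorithm(word)
```

Because the table is keyed by the word itself, it can hold only one of
"stau-becken" and "staub-ecken" at a time, which is exactly the rare
case where a marker in the text itself is still needed.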

I suggest adding a ZERO-WIDTH HYPHENATION POINT to Unicode in order to
distinguish the function of a potential, usually invisible hyphenation
point from that of SHY. Unicode is free to discourage the use of SHY in
word processing applications, just as it already discourages the use of
LINE FEED. TTY/VT100-style plain text files, which store the formatting
result of a paragraph in the form of LINE FEEDs, will continue to use
SHY to mark those hyphens that have to be removed before reformatting a
plain text paragraph.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
