Re: [Groff] Having a problem with parsing output to html...

justin Fri, 25 Mar 2011 11:20:41 -0700


Hello


Yes, Keith was right: all the mumbo jumbo I wrote, with the exception of
the two sentences swapping places, was indeed related to the hyphenation.

I would have caught this if I had been looking at the parsed html text rather than at the actual html file.

I have now installed groff 1.21 to see if it would make a difference. The problem with the two sentences swapping places is now resolved.


The problem with hyphens, apostrophes, and dashes still remains.

I'm including a sample of the results.



On Fri, 25 Mar 2011, Keith Marshall wrote:

On 25 March 2011 04:38, Werner LEMBERG wrote:


Justin,

a simple example says more than a thousand words...  So please give us
an example we can examine.


Hear!  Hear!

At a first glance, it seems you have an encoding problem (but this
doesn't explain the strange things you see).  The default encoding of
groff is latin1, and your input file is probably UTF8.  Starting with
version 1.20, groff can handle UTF8 by using a new preprocessor.

The HTML output driver is still experimental (and basically
unmaintained currently due to lack of time and interest); it is easily
possible that you've found a bug.


Equally -- perhaps more -- likely, Justin has encountered a hyphenation
issue.  This:

On the 11th line in my groff file, an "â" character is found after 64
characters have been printed, within the word hamburger, the text gets
parsed and printed as "hamâburger". If I change hamburger to donations
I have the "â" character show up at the 60th character on the line,
with donations being "donaâtions".


is reminiscent of an issue I myself observed, earlier this week.  I had
run some informally structured ASCII text through a sed filter, and then
through nroff, (v1.20.1), to produce an alternative layout.  Although I
had suppressed hyphenation (.hy 0), I did have several explicit ASCII
hyphen characters in the input stream; each of these was replaced, in
the output stream, by the three byte octal sequence 342 200 220, (which
I guess represents u2010 -- the Unicode hyphen which groff_char(7)
documents as the output form for hyphen).
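
The guess about the three bytes can be checked with a minimal Python sketch. The variable names are illustrative only; the octal values are the ones quoted above, and the sketch assumes the bytes are meant to be read as UTF-8.

```python
import unicodedata

# The three octal byte values reported in the nroff output.
octal_values = ["342", "200", "220"]
raw = bytes(int(value, 8) for value in octal_values)

# Assumption: the output stream is UTF-8, so decode the bytes as such.
decoded = raw.decode("utf-8")
code_point = f"U+{ord(decoded):04X}"

# Print the hex form of the bytes, the code point, and its Unicode name.
print(raw.hex(), code_point, unicodedata.name(decoded))
```

Under that assumption the three bytes decode to a single character, the Unicode hyphen, rather than to three separate characters.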

Viewing this output with "less", on my UTF-8 aware console, it looked
absolutely fine, but after uploading as a package description file on my
SourceForge downloads page, each hyphen was rendered, by Firefox, with
unwanted whitespace surrounding it; rendered by Internet Explorer, each
hyphen was replaced by three characters of garbage, amongst it being the
"â" observed by Justin, IIRC.
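
A rough Python sketch of how such garbage can arise, assuming the viewer reads the UTF-8 bytes as Latin-1 (the actual encoding a given browser falls back to may differ). The word "hamburger" is borrowed from Justin's report; everything else is illustrative.

```python
# A word with a Unicode hyphen (U+2010) at a hyphenation point.
text = "ham\u2010burger"

# The bytes as they would be written to a UTF-8 encoded file.
utf8_bytes = text.encode("utf-8")

# Assumption: the viewer misreads those bytes as Latin-1, one character per byte.
misread = utf8_bytes.decode("latin-1")

# 0xE2 becomes "â"; 0x80 and 0x90 become C1 control characters, which
# many viewers do not display. Drop the unprintable ones to mimic that.
visible = "".join(ch for ch in misread if ch.isprintable())

print(len(text), len(misread), visible)
```

In this sketch the single hyphen turns into three characters, of which only the "â" is printable, which would be consistent with the "hamâburger" that Justin describes.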

So yes, I guess what you actually see is dependent on encoding, (and how
the viewer interprets the u2010 sequence, however it is encoded).  In my
case, I wanted real ASCII hyphens in my output stream; adding "-Tascii"
to my nroff command gave me that.

--
Regards,
Keith.

bill_hicks.tr
Description: groff file
