Hi Bruce,
On Mon, 14 Oct 2024 16:31:11 -0400
Bruce Momjian <[email protected]> wrote:
> On Mon, Oct 14, 2024 at 03:05:35PM -0400, Bruce Momjian wrote:
> > I did some more research and was able to clarify our behavior in
> > release.sgml:
>
> I have specified some more details in my patched version:
>
> We can only use Latin1 characters, not all UTF8 characters,
> because some rendering engines do not support non-Latin1 UTF8
> characters. Specifically, the HTML rendering engine can display
> all UTF8 characters, but the PDF rendering engine can only display
> Latin1 characters. In PDF files, non-Latin1 UTF8 characters are
> displayed as "###".
>
> In the SGML files we encode non-ASCII Latin1 characters as HTML
> entities, e.g., &Aacute;lvaro. Oddly, it is possible to safely
> represent Latin1 characters in SGML files as UTF8 for HTML and
> PDF output, but we currently disallow this via the Makefile
> "check-non-ascii" rule.
>
I agree with encoding non-ASCII Latin1 characters as entities and
disallowing non-ASCII characters in the source files entirely.
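
To illustrate the encoding rule only (this is not part of the attached
patch), here is a minimal Python sketch. The function name
to_sgml_entities is hypothetical, and it assumes that every printable
non-ASCII Latin1 character has a named HTML entity, which is what
Python's html.entities table provides. It does not escape markup
characters such as & or <.

```python
from html.entities import codepoint2name


def to_sgml_entities(text: str) -> str:
    """Replace non-ASCII Latin1 characters with named HTML entities.

    ASCII passes through unchanged.  Anything without a Latin1 entity
    (for example non-Latin1 UTF8 characters) is rejected, since such
    characters are reported to show up as "###" in PDF output.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # Plain ASCII is always allowed as-is.
            out.append(ch)
        elif cp <= 0xFF and cp in codepoint2name:
            # Non-ASCII Latin1: emit the named entity, e.g. &Aacute;
            out.append(f"&{codepoint2name[cp]};")
        else:
            raise ValueError(f"U+{cp:04X} has no Latin1 entity")
    return "".join(out)
```

For example, to_sgml_entities("Álvaro Herrera") returns
"&Aacute;lvaro Herrera", while a character such as U+20AC raises
ValueError.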
I found that your patch includes fixes in *.svg files, so how about also
checking them with check-non-ascii? Also, I think it is better to use perl
instead of grep, because non-GNU grep doesn't support hex escape sequences.
I've attached an updated patch for the Makefile. The changes to release.sgml
quoted above are not applied yet, though.
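
For readers who do not read Perl one-liners easily, the logic of the
check-non-ascii rule can be sketched in Python as below. This is only
an analogue for explanation, not something the patch adds. The names
find_non_ascii and main are hypothetical. Like the Perl version, it
matches any byte outside 0x00-0x7f, prints "file:line" for each
offending line, and reports failure if there was at least one hit.

```python
import re
import sys

# Any byte outside the 7-bit ASCII range, same class as the Perl regex.
NON_ASCII = re.compile(rb"[^\x00-\x7f]")


def find_non_ascii(paths):
    """Return (path, line) pairs for lines containing a non-ASCII byte."""
    hits = []
    for path in paths:
        # Read as bytes so the check does not depend on file encoding.
        with open(path, "rb") as f:
            for line in f:
                if NON_ASCII.search(line):
                    hits.append((path, line))
    return hits


def main(paths):
    """Print offending lines and return 1 if any were found, else 0."""
    hits = find_non_ascii(paths)
    for path, line in hits:
        sys.stdout.write(f"{path}:{line.decode('utf-8', 'replace')}")
    return 1 if hits else 0
```

A caller would pass the return value of main(sys.argv[1:]) to
sys.exit() to get the same pass/fail behavior as the Makefile rule.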
Regards,
Yugo Nagata
--
Yugo NAGATA <[email protected]>
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 65ed32cd0a..3d992ebd84 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -143,7 +143,7 @@ postgres.txt: postgres.html
## Print
##
-postgres.pdf:
+postgres.pdf pdf:
$(error Invalid target; use postgres-A4.pdf or postgres-US.pdf as targets)
XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
@@ -194,7 +194,7 @@ MAKEINFO = makeinfo
##
# Quick syntax check without style processing
-check: postgres.sgml $(ALLSGML) check-tabs check-nbsp
+check: postgres.sgml $(ALLSGML) check-tabs check-non-ascii
$(XMLLINT) $(XMLINCLUDE) --noout --valid $<
@@ -257,15 +257,16 @@ endif # sqlmansectnum != 7
# tabs are harmless, but it is best to avoid them in SGML files
check-tabs:
- @( ! grep ' ' $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || \
+ @( ! grep ' ' $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl $(srcdir)/images/*.svg) ) || \
(echo "Tabs appear in SGML/XML files" 1>&2; exit 1)
-# Non-breaking spaces are harmless, but it is best to avoid them in SGML files.
-# Use perl command because non-GNU grep or sed could not have hex escape sequence.
-check-nbsp:
- @ ( $(PERL) -ne '/\xC2\xA0/ and print("$$ARGV:$$_"),$$n++; END {exit($$n>0)}' \
- $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || \
- (echo "Non-breaking spaces appear in SGML/XML files" 1>&2; exit 1)
+# Disallow non-ASCII characters because some rendering engines do not
+# support non-Latin1 UTF8 characters. Use perl because non-GNU grep
+# or sed might not support hex escape sequences.
+check-non-ascii:
+ @ ( $(PERL) -ne '/[^\x00-\x7f]/ and print("$$ARGV:$$_"),$$n++; END {exit($$n>0)}' \
+ $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl $(srcdir)/images/*.svg) ) || \
+ (echo "Non-ASCII characters appear in SGML/XML files; use HTML entities for Latin1 characters" 1>&2; exit 1)
##
## Clean
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 1ef5322b91..f5e115e8d6 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1225,7 +1225,7 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
<programlisting>
-- ignore differences in accents and case
CREATE COLLATION ignore_accent_case (provider = icu, deterministic = false, locale = 'und-u-ks-level1');
-SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
+SELECT '&Aring;' = 'A' COLLATE ignore_accent_case; -- true
SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true
-- upper case letters sort before lower case.
@@ -1282,7 +1282,7 @@ SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
<entry><literal>'ab' = U&'a\2063b'</literal></entry>
<entry><literal>'x-y' = 'x_y'</literal></entry>
<entry><literal>'g' = 'G'</literal></entry>
- <entry><literal>'n' = 'ñ'</literal></entry>
+ <entry><literal>'n' = '&ntilde;'</literal></entry>
<entry><literal>'y' = 'z'</literal></entry>
</row>
</thead>
@@ -1346,7 +1346,7 @@ SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
<para>
At every level, even with full normalization off, basic normalization is
- performed. For example, <literal>'á'</literal> may be composed of the
+ performed. For example, <literal>'&aacute;'</literal> may be composed of the
code points <literal>U&'\0061\0301'</literal> or the single code
point <literal>U&'\00E1'</literal>, and those sequences will be
considered equal even at the <literal>identic</literal> level. To treat
@@ -1430,8 +1430,8 @@ SELECT 'x-y' = 'x_y' COLLATE level4; -- false
<entry><literal>false</literal></entry>
<entry>
Backwards comparison for the level 2 differences. For example,
- locale <literal>und-u-kb</literal> sorts <literal>'àe'</literal>
- before <literal>'aé'</literal>.
+ locale <literal>und-u-kb</literal> sorts <literal>'&agrave;e'</literal>
+ before <literal>'a&eacute;'</literal>.
</entry>
</row>
diff --git a/doc/src/sgml/images/genetic-algorithm.svg b/doc/src/sgml/images/genetic-algorithm.svg
index fb9fdd1ba7..2ce5f1b271 100644
--- a/doc/src/sgml/images/genetic-algorithm.svg
+++ b/doc/src/sgml/images/genetic-algorithm.svg
@@ -72,7 +72,7 @@
<title>a4->end</title>
<path fill="none" stroke="#000000" d="M259,-312.5834C259,-312.5834 259,-54.659 259,-54.659"/>
<polygon fill="#000000" stroke="#000000" points="262.5001,-54.659 259,-44.659 255.5001,-54.6591 262.5001,-54.659"/>
-<text text-anchor="middle" x="246" y="-186.6212" font-family="sans-serif" font-size="10.00" fill="#000000">true  </text>
+<text text-anchor="middle" x="246" y="-186.6212" font-family="sans-serif" font-size="10.00" fill="#000000">true</text>
</g>
<!-- a5 -->
<g id="node7" class="node">
@@ -85,7 +85,7 @@
<title>a4->a5</title>
<path fill="none" stroke="#000000" d="M144,-298.269C144,-298.269 144,-286.5248 144,-286.5248"/>
<polygon fill="#000000" stroke="#000000" points="147.5001,-286.5248 144,-276.5248 140.5001,-286.5249 147.5001,-286.5248"/>
-<text text-anchor="middle" x="127" y="-284.3969" font-family="sans-serif" font-size="10.00" fill="#000000">false   </text>
+<text text-anchor="middle" x="127" y="-284.3969" font-family="sans-serif" font-size="10.00" fill="#000000">false</text>
</g>
<!-- a6 -->
<g id="node8" class="node">
diff --git a/doc/src/sgml/release.sgml b/doc/src/sgml/release.sgml
index 8433690dea..65c86f54c0 100644
--- a/doc/src/sgml/release.sgml
+++ b/doc/src/sgml/release.sgml
@@ -26,13 +26,15 @@ non-ASCII characters find using grep -P '[\x80-\xFF]' or
http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
- We cannot use UTF8 because rendering engines have to
- support the referenced characters.
-
- Do not use numeric _UTF_ numeric character escapes (&#nnn;),
- we can only use Latin1.
-
- Example: Alvaro Herrera is &Aacute;lvaro Herrera
+ We can only use Latin1 characters, not all UTF8 characters,
+ because rendering engines must support the referenced characters,
+ and they currently only support Latin1. In the SGML files we
+ encode non-ASCII Latin1 characters as HTML entities, e.g.,
+ &Aacute;lvaro Herrera. Oddly, it is possible to add Latin1
+ characters as UTF8, but we currently prevent this via the
+ Makefile "check-non-ascii" check.
+
+ Do not use numeric _UTF_ numeric character escapes (&#nnn;).
wrap long lines
diff --git a/doc/src/sgml/stylesheet-man.xsl b/doc/src/sgml/stylesheet-man.xsl
index fcb485c293..2e2564da68 100644
--- a/doc/src/sgml/stylesheet-man.xsl
+++ b/doc/src/sgml/stylesheet-man.xsl
@@ -213,12 +213,12 @@
<!-- Slight rephrasing to indicate that missing sections are found
in the documentation. -->
<l:context name="xref-number-and-title">
- <l:template name="chapter" text="Chapter %n, %t, in the documentation"/>
- <l:template name="sect1" text="Section %n, “%t”, in the documentation"/>
- <l:template name="sect2" text="Section %n, “%t”, in the documentation"/>
- <l:template name="sect3" text="Section %n, “%t”, in the documentation"/>
- <l:template name="sect4" text="Section %n, “%t”, in the documentation"/>
- <l:template name="sect5" text="Section %n, “%t”, in the documentation"/>
+ <l:template name="chapter" text="Chapter %n, &quot;%t&quot;, in the documentation"/>
+ <l:template name="sect1" text="Section %n, &quot;%t&quot;, in the documentation"/>
+ <l:template name="sect2" text="Section %n, &quot;%t&quot;, in the documentation"/>
+ <l:template name="sect3" text="Section %n, &quot;%t&quot;, in the documentation"/>
+ <l:template name="sect4" text="Section %n, &quot;%t&quot;, in the documentation"/>
+ <l:template name="sect5" text="Section %n, &quot;%t&quot;, in the documentation"/>
</l:context>
</l:l10n>
</l:i18n>