On Wed, Oct 9, 2024 at 11:49:29AM +0900, Tatsuo Ishii wrote:
> >> On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote:
> >> > > On 30 Sep 2024, at 11:03, Tatsuo Ishii <is...@postgresql.org> wrote:
> >> > >
> >> > >>>> I think there's an unnecessary underscore in config.sgml.
> >> > >
> >> > > I was wrong. The particular byte sequences just looked an underscore
> >> > > on my editor but the byte sequence is actually 0xc2a0, which must be a
> >> > > "non breaking space" encoded in UTF-8. I guess someone mistakenly
> >> > > insert a non breaking space while editing config.sgml.
> >> >
> >> > I wonder if it would be worth to add a check for this like we have to
> >> > tabs?
> >> > The attached adds a rule to "make -C doc/src/sgml check" for trapping
> >> > nbsp
> >> > (doing so made me realize we don't have an equivalent meson target).
> >>
> >> Can we check for any character outside the support range of SGML?
> >
> > What we can define the range of allowed characters range in SGML?
> >
> > We can detect non-ASCII characters by using regexp /\P{ascii}/ or
> > /[^\x00-\x7f]/,
> > but they are used in some places in charset.sgml and some names in
> > release-*.sgml.
>
> I failed to find any standard regarding what characters are allowed in
> SGML/XML. Assuming that any valid Unicode characters are allowed in
> our *sgml files, I am afraid the best we can do is grepping non-ASCII
> characters against the files and checking the results by a visual
> inspection. Besides nbsp, there are tons of confusing Unicode
> characters out there. For example there are many "hyphen like
> characters".
>
> https://www.compart.com/en/unicode/category/Pd
>
> If one of them is used in the sgml files, it may be possible that it
> was accidentally inserted.
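
For illustration only, the grep-and-inspect idea quoted above could be
sketched roughly like this. It is not the rule from the attached patch; the
function name find_non_ascii and the report format are made up for this
example. It uses the same character class as the quoted regexp
/[^\x00-\x7f]/ and adds the Unicode name of each hit, so that an nbsp or a
"hyphen like character" is easier to spot during the visual inspection.

```python
import re
import unicodedata

# Same character class as the regexp quoted above: anything outside 7-bit ASCII.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def find_non_ascii(text):
    """Return (line, column, code point, Unicode name) for each non-ASCII character.

    Hypothetical helper for this sketch; line and column are 1-based.
    """
    hits = []
    # Split on "\n" only: str.splitlines() would also break on U+0085 and
    # U+2028/U+2029 and silently drop them, which are characters we want reported.
    for lineno, line in enumerate(text.split("\n"), start=1):
        for match in NON_ASCII.finditer(line):
            ch = match.group()
            name = unicodedata.name(ch, "UNNAMED")
            hits.append((lineno, match.start() + 1, f"U+{ord(ch):04X}", name))
    return hits
```

Reading a file as UTF-8 and passing its contents to find_non_ascii() gives
one tuple per suspicious character. Legitimate uses, such as the ones in
charset.sgml and the names in release-*.sgml mentioned above, would still
have to be sorted out by eye.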

Can we use Unicode in the SGML files?

-- 
  Bruce Momjian  <br...@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means
  "Am I going to die soon?"