Joel de Guzman wrote:
'''
normalized names with name="section_id.normalized_header_text"
(i.e. valid characters are a-z, A-Z, 0-9 and _. All non-valid
characters are converted to underscore and all upper-case are
converted to lower-case. For example: Heading 1 in section
Section 2 will be normalized to section_2.heading_1)
'''
This might be too conservative (only C identifiers are allowed),
and the reason is simplification. I did not want to bother with
having to find the "universal" URL, link, etc, that would work
for all. I figured C identifiers are acceptable almost everywhere.
It's quite easy to change the behavior if it can be shown to be
safe for all output formats including HTML, PDF etc. The relevant
code is just:
    char
    filter_identifier_char(char ch)
    {
        if (!std::isalnum(static_cast<unsigned char>(ch)))
            ch = '_';
        return static_cast<char>(
            std::tolower(static_cast<unsigned char>(ch)));
    }
Comments?
To analyze this issue, we should start by considering the Boost
documentation toolchain, of which QuickBook is a part. Basically, the
toolchain can be represented by
    QuickBook --> BoostBook --> DocBook
      Doxygen --^

    DocBook --> distinct processors and/or stylesheets are used to
                produce the final output, e.g., PDF, HTML, PS, LaTeX...
Here, I consider BoostBook and DocBook to be the primary targets for
QuickBook, and these should be our first concern, although we should
also keep an eye on the "secondary" targets, which are the final
user-readable docs.
Looking at DocBook's docs (http://docbook.org/tdg/en/html/docbook.html),
we find that an ID must be unique within the document and must begin
with a letter. Furthermore, it is not stated explicitly, but it can be
inferred from the definition of IDREFS (a whitespace-separated list of
IDs) that IDs shouldn't contain whitespace characters.
Because DocBook is ultimately an XML format, I also looked into the XML
constraints on IDs (xml:id, http://www.w3.org/TR/xml-id/, which links
to 'Namespaces in XML', versions 1.0 and 1.1; these, in turn, reference
the corresponding XML specifications). Here IDs are taken to be XML
names excluding the ':' character. Quoting from the XML 1.1 standard,
« The first character of a Name MUST be a NameStartChar, and any
other characters MUST be NameChars; this mechanism is used to
prevent names from beginning with European (ASCII) digits or with
basic combining characters. Almost all characters are permitted in
names, except those which either are or reasonably could be used as
delimiters. The intention is to be inclusive rather than exclusive,
so that writing systems not yet encoded in Unicode can be used in
XML names. »
And also,
« Document authors are encouraged to use names which are
meaningful words or combinations of words in natural languages, and
to avoid symbolic or white space characters in names. Note that
COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and
MIDDLE DOT are explicitly permitted. »
Further below, the XML spec explicitly defines the ranges of characters
that can be used in names, but I won't bother you with those.
In an ideal world, I think QuickBook should aim for the same inclusive
ranges as XML and let DocBook and the output-specific processors deal
with the oddities of individual formats. However, we must also consider
what I called the secondary targets, because those have the highest
impact on users. Specifically, I looked at PDF and HTML, since I believe
those are the most widely used.
From what I gather from the PDF reference (version 1.3), the equivalent
of an ID would be a 'Name Object'. In practical terms, PDF Name objects
are binary in nature, allowing any 8-bit character to be in a name
(through the use of escape codes, for some characters), so the only
restriction is that names be unique within a document.
As for HTML (http://www.w3.org/TR/html4), it turns out to be the most
restrictive of all. In HTML, the id and name attributes share the same
namespace and can be used (almost) interchangeably as IDs.
« ID and NAME tokens must begin with a letter ([A-Za-z]) and may
be followed by any number of letters, digits ([0-9]), hyphens ("-"),
underscores ("_"), colons (":"), and periods ("."). »
Furthermore, «names that differ only in case may not appear in the same
document».
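To make the HTML 4 rule quoted above concrete, here is a minimal C++
sketch of a validity check. The names is_html4_id, is_ascii_letter and
is_ascii_digit are made up for illustration and are not part of
QuickBook; the check is ASCII-only on purpose, since the rule is stated
in terms of [A-Za-z] and [0-9].

```cpp
#include <string>

// ASCII-only tests; std::isalpha is avoided because it depends on the locale.
inline bool is_ascii_letter(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

inline bool is_ascii_digit(char c)
{
    return c >= '0' && c <= '9';
}

// Hypothetical helper: true if `id` satisfies the HTML 4 ID/NAME token rule
// (a letter, followed by letters, digits, '-', '_', ':' or '.').
bool is_html4_id(const std::string& id)
{
    if (id.empty() || !is_ascii_letter(id[0]))
        return false;
    for (std::string::size_type i = 1; i < id.size(); ++i)
    {
        char c = id[i];
        if (!(is_ascii_letter(c) || is_ascii_digit(c)
              || c == '-' || c == '_' || c == ':' || c == '.'))
            return false;
    }
    return true;
}
```

Note that this only covers the shape of a single token; the rule about
names differing only in case needs document-wide bookkeeping.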
So what do we make of all this?
Well, without further investigating DocBook's processing of IDs when
generating the different output formats, I think the current conservative
approach that QuickBook follows is appropriate in general. Normalizing
the case of identifiers also seems sensible, because HTML does not allow
names that differ only in case.
However, I'd like to propose the following specific changes in the
handling of identifiers:
- identifiers should be verified to begin with a letter, possibly
allowing an underscore as well, although that goes against the HTML
rules.
- the hyphen should be allowed inside identifiers, since it is
permitted by XML and HTML alike.
- QuickBook should keep track of the identifiers it generates to
avoid reusing identifiers when it sanitizes input. This may be
particularly important for languages that use characters outside
the ASCII character set (such as Portuguese, i.e. Português ;-) where
overlapping IDs could appear too easily.
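A rough sketch of how the three changes could fit together, in C++.
Everything here is an assumption for illustration: id_registry,
make_id and filter_identifier_char_with_hyphen are invented names, the
"id_" prefix and the "_2", "_3" suffixes are arbitrary choices, and the
code assumes the default "C" locale, where bytes outside ASCII are not
alphanumeric.

```cpp
#include <cctype>
#include <set>
#include <sstream>
#include <string>

// Same idea as the existing filter, but '-' is let through unchanged.
char filter_identifier_char_with_hyphen(char ch)
{
    if (ch == '-')
        return ch;
    if (!std::isalnum(static_cast<unsigned char>(ch)))
        ch = '_';
    return static_cast<char>(
        std::tolower(static_cast<unsigned char>(ch)));
}

// Hypothetical helper, not part of QuickBook: remembers every identifier
// it has handed out so that no two calls return the same one.
class id_registry
{
public:
    std::string make_id(const std::string& text)
    {
        std::string id;
        for (std::string::size_type i = 0; i < text.size(); ++i)
            id += filter_identifier_char_with_hyphen(text[i]);

        // Identifiers must begin with a letter; prefix otherwise.
        if (id.empty() || !std::isalpha(static_cast<unsigned char>(id[0])))
            id = "id_" + id;

        // On a clash, append _2, _3, ... until the identifier is unused.
        std::string candidate = id;
        for (int n = 2; !used_.insert(candidate).second; ++n)
        {
            std::ostringstream os;
            os << id << '_' << n;
            candidate = os.str();
        }
        return candidate;
    }

private:
    std::set<std::string> used_;
};
```

Since everything is lower-cased before it is stored, two identifiers
that differ only in case are treated as a clash too, which also takes
care of the HTML restriction.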
Even though these changes would fix Andy's issue, I think it is still
important to consider the general case: it is cumbersome to refer to a
section through its sanitized identifier. Maybe QuickBook could provide
the means to generate the same sanitized reference on the spot. For
instance, the mark-up
[link [A long winded section title]]
could be used to generate,
<link linkend="a_long_winded_section_title">A long winded section
title</link>
For nested sections, perhaps
[link [Section 2][Heading 1] Heading 1 of Section 2]
?
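To show what I have in mind, here is a small C++ sketch of how the
linkend could be derived from the titles given in the mark-up. It
reuses the filter quoted at the top of this message; normalize_title
and make_linkend are made-up names, and the [link [...]] syntax itself
is only the proposal above, not something QuickBook supports.

```cpp
#include <cctype>
#include <string>
#include <vector>

// The existing filter, as quoted at the top of this message.
char filter_identifier_char(char ch)
{
    if (!std::isalnum(static_cast<unsigned char>(ch)))
        ch = '_';
    return static_cast<char>(
        std::tolower(static_cast<unsigned char>(ch)));
}

// Hypothetical helper: applies the filter to every character of a title.
std::string normalize_title(const std::string& title)
{
    std::string id;
    for (std::string::size_type i = 0; i < title.size(); ++i)
        id += filter_identifier_char(title[i]);
    return id;
}

// Hypothetical helper: joins the normalized titles with '.',
// outermost section first.
std::string make_linkend(const std::vector<std::string>& titles)
{
    std::string linkend;
    for (std::vector<std::string>::size_type i = 0; i < titles.size(); ++i)
    {
        if (i != 0)
            linkend += '.';
        linkend += normalize_title(titles[i]);
    }
    return linkend;
}
```

With this, the titles "Section 2" and "Heading 1" give
section_2.heading_1, which matches the normalization rule quoted at the
top, so the author never has to spell out the sanitized form by hand.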
Thoughts?
João