Joel de Guzman wrote:
'''
normalized names with name="section_id.normalized_header_text"
(i.e. valid characters are a-z, A-Z, 0-9 and _. All non-valid
characters are converted to underscore and all upper-case are
converted to lower-case. For example: Heading 1 in section
Section 2 will be normalized to section_2.heading_1)
'''

This might be too conservative (only C identifiers are allowed),
and the reason is simplification. I did not want to bother with
having to find the "universal" URL, link, etc, that would work
for all. I figured C identifiers are acceptable almost everywhere.
It's quite easy to change the behavior if it can be shown to be
safe for all output formats including HTML, PDF etc. The relevant
code is just:

    char
    filter_identifier_char(char ch)
    {
        if (!std::isalnum(static_cast<unsigned char>(ch)))
            ch = '_';
        return static_cast<char>(
            std::tolower(static_cast<unsigned char>(ch)));
    }

Comments?

To analyze this issue, we should start by considering the "Boost documentation toolchain", of which QuickBook is one part. Roughly, the toolchain can be represented as

    QuickBook --> BoostBook --> DocBook
            Doxygen --^

    DocBook --> distinct processors and/or stylesheets are used to
                produce the final output, e.g., PDF, HTML, PS, Latex...

Here, I consider BoostBook and DocBook to be the primary targets for QuickBook, and these should be our first concern, although we should also keep an eye on the "secondary" targets, which are the final user-readable docs.

Looking at DocBook's docs (http://docbook.org/tdg/en/html/docbook.html), we find that an ID must be unique within the document and must begin with a letter. Furthermore, although it is not stated explicitly, it can be inferred from the definition of IDREFS (a whitespace-separated list of IDs) that IDs must not contain whitespace characters.

Because DocBook is ultimately an XML format, I also looked into the XML constraints on IDs (xml:id, http://www.w3.org/TR/xml-id/, which links to both versions 1.0 and 1.1 of 'Namespaces in XML'; these, in turn, reference the corresponding XML specifications). There, IDs are taken to be XML names excluding the ':' character. Quoting from the XML 1.1 standard,

      « The first character of a Name MUST be a NameStartChar, and any
    other characters MUST be NameChars; this mechanism is used to
    prevent names from beginning with European (ASCII) digits or with
    basic combining characters. Almost all characters are permitted in
    names, except those which either are or reasonably could be used as
    delimiters. The intention is to be inclusive rather than exclusive,
    so that writing systems not yet encoded in Unicode can be used in
    XML names. »

And also,

      « Document authors are encouraged to use names which are
    meaningful words or combinations of words in natural languages, and
    to avoid symbolic or white space characters in names. Note that
    COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and
    MIDDLE DOT are explicitly permitted. »

Further below, the XML spec explicitly defines the ranges of characters that can be used in names, but I won't bother you with those.

In an ideal world, I think QuickBook should aim for the same inclusive ranges as XML and let DocBook and output-specific processors deal with the oddities of individual formats. However, we must also consider what I called the secondary targets, because those have the highest impact on users. Specifically, I looked at PDF and HTML, believing those are the most used.

From what I gather of the PDF reference (version 1.3), the equivalent to an id would be a 'Name Object'. In practical terms, PDF Name objects are binary in nature, allowing any 8-bit character to be in a name (through the use of escape codes, for some characters), thus imposing no restriction other than unique names in a document.
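To illustrate the escape codes mentioned above, here is a minimal sketch of how an identifier could be written as a PDF name. It assumes the `#xx` hexadecimal escape that, as far as I can tell, the PDF reference introduced in version 1.2; the function name `pdf_name` is made up for this example and is not part of any toolchain component.

```cpp
#include <string>

// Hypothetical helper: writes `name` as a PDF name object ("/...").
// Assumption: characters outside '!'..'~', the PDF delimiter characters
// and '#' itself are written as '#' followed by two hexadecimal digits.
// The caller must not pass a NUL character, which PDF names cannot hold.
std::string pdf_name(const std::string& name)
{
    static const char hex[] = "0123456789ABCDEF";
    const std::string special = "()<>[]{}/%#";

    std::string out = "/";
    for (unsigned char ch : name) {
        const bool regular =
            ch >= 0x21 && ch <= 0x7E &&
            special.find(static_cast<char>(ch)) == std::string::npos;
        if (regular) {
            out += static_cast<char>(ch);
        } else {
            out += '#';
            out += hex[ch >> 4];
            out += hex[ch & 0x0F];
        }
    }
    return out;
}
```

Under these assumptions, `pdf_name("A B")` yields `/A#20B`, so even a space in an ID would not be a problem for PDF itself.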

As for HTML (http://www.w3.org/TR/html4), it turns out to be the most restrictive of all. In HTML, the id and name attributes share the same namespace and can be used (almost) interchangeably as IDs.

      « ID and NAME tokens must begin with a letter ([A-Za-z]) and may
    be followed by any number of letters, digits ([0-9]), hyphens ("-"),
    underscores ("_"), colons (":"), and periods ("."). »

Furthermore, «names that differ only in case may not appear in the same document».
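The HTML 4 token rule quoted above is simple enough to write down as a check. The following is only a sketch: `is_html4_id` is a hypothetical name, and the function does not cover the additional requirement that names differing only in case must not share a document.

```cpp
#include <string>

// Hypothetical helper: returns true if `id` matches the HTML 4 ID/NAME
// token rule: a letter [A-Za-z] followed by any number of letters,
// digits, '-', '_', ':' or '.'.
// Explicit ASCII range checks are used instead of std::isalpha so that
// the result does not depend on the current locale.
bool is_html4_id(const std::string& id)
{
    auto is_letter = [](char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    };
    auto is_digit = [](char c) { return c >= '0' && c <= '9'; };

    if (id.empty() || !is_letter(id[0]))
        return false;

    for (char c : id) {
        if (!(is_letter(c) || is_digit(c) ||
              c == '-' || c == '_' || c == ':' || c == '.'))
            return false;
    }
    return true;
}
```

With this check, `section_2.heading_1` is accepted, while an identifier starting with an underscore or a digit is rejected.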

So what do we make of all this?

Well, without further investigating how DocBook processes IDs when generating the different output formats, I think the conservative approach that QuickBook currently follows is appropriate in general. Normalizing the case of identifiers also seems sensible, given HTML's rule about names that differ only in case.

However, I'd like to propose the following specific changes in the handling of identifiers:

    - identifiers should be verified to begin with a letter, possibly
      allowing an underscore as well, although that goes against html
      rules.
    - the hyphen should be allowed inside identifiers, since it seems to
      be generally allowed.
    - QuickBook should keep track of the identifiers it generates to
      avoid reusing identifiers when it sanitizes input. This may be
      particularly important for languages that use characters outside
      the ASCII character set (like Portuguese ;-) where overlapping
      IDs could appear too easily.
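To make the three proposals concrete, here is a rough sketch of what such a sanitizer might look like. Everything in it is an assumption for illustration: the class name `id_registry`, the `id_` prefix used when the result would not begin with a letter, and the `_2`, `_3`, ... suffixes used to resolve clashes are not taken from QuickBook.

```cpp
#include <set>
#include <string>

// Hypothetical sketch, not QuickBook's actual code.
class id_registry
{
public:
    // Sanitizes `text` and returns an identifier that this registry
    // has not handed out before.
    std::string make_id(const std::string& text)
    {
        std::string id;
        for (unsigned char ch : text) {
            if (ch >= 'A' && ch <= 'Z')
                id += static_cast<char>(ch - 'A' + 'a');   // lower-case
            else if ((ch >= 'a' && ch <= 'z') ||
                     (ch >= '0' && ch <= '9') || ch == '-')
                id += static_cast<char>(ch);               // hyphen allowed
            else
                id += '_';                                 // everything else
        }

        // Proposal 1: the identifier must begin with a letter.
        // The "id_" prefix is an arbitrary choice for this sketch.
        if (id.empty() || !(id[0] >= 'a' && id[0] <= 'z'))
            id.insert(0, "id_");

        // Proposal 3: never return the same identifier twice.
        std::string candidate = id;
        for (int n = 2; !used_.insert(candidate).second; ++n)
            candidate = id + "_" + std::to_string(n);
        return candidate;
    }

private:
    std::set<std::string> used_;   // identifiers generated so far
};
```

With a single registry per document, "São" and "Séo" (in UTF-8) both sanitize to `s__o`, and the second one would come out as `s__o_2` instead of silently clashing with the first.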

Even though these changes would fix Andy's issue, I think it is still important to consider the general case: it is cumbersome to refer to sanitized identifiers, because the author has to work out the sanitized form by hand. Maybe QuickBook could provide the means to generate the same sanitized reference on the spot. For instance, the mark-up

    [link [A long winded section title]]

could be used to generate

    <link linkend="a_long_winded_section_title">A long winded section
        title</link>

For nested sections, perhaps

    [link [Section 2][Heading 1] Heading 1 of Section 2]

?
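As a sketch of how the link target could be computed on the spot, the following reuses Joel's filter from the top of this message and joins the normalized titles with '.', following the "section_id.normalized_header_text" scheme. The helpers `normalize` and `make_linkend` are hypothetical names, and the bracketed link syntax above is only a proposal.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Joel's filter, as quoted at the top of this message.
char filter_identifier_char(char ch)
{
    if (!std::isalnum(static_cast<unsigned char>(ch)))
        ch = '_';
    return static_cast<char>(
        std::tolower(static_cast<unsigned char>(ch)));
}

// Hypothetical helper: applies the filter to every character of a title.
std::string normalize(const std::string& title)
{
    std::string out;
    out.reserve(title.size());
    for (char ch : title)
        out += filter_identifier_char(ch);
    return out;
}

// Hypothetical helper: builds a linkend from a path of titles,
// outermost section first, joined with '.'.
std::string make_linkend(const std::vector<std::string>& titles)
{
    std::string linkend;
    bool first = true;
    for (const std::string& title : titles) {
        if (!first)
            linkend += '.';
        linkend += normalize(title);
        first = false;
    }
    return linkend;
}
```

In the default "C" locale, `make_linkend({"Section 2", "Heading 1"})` gives `section_2.heading_1`, which matches the example in Joel's description.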

Thoughts?


João


_______________________________________________
Boost-docs mailing list
