[Boost-docs] Re: Any reason that hyphens are not allowed in anchors

João Abecasis Wed, 25 Jan 2006 06:49:59 -0800

Joel de Guzman wrote:

'''
normalized names with name="section_id.normalized_header_text"
(i.e. valid characters are a-z, A-Z, 0-9 and _. All non-valid
characters are converted to underscore and all upper-case are
converted to lower-case. For example: Heading 1 in section
Section 2 will be normalized to section_2.heading_1)
'''


This might be too conservative (only C identifiers are allowed),
and the reason is simplification. I did not want to bother with
having to find the "universal" URL, link, etc, that would work
for all. I figured C identifiers are acceptable almost everywhere.
It's quite easy to change the behavior if it can be shown to be
safe for all output formats including HTML, PDF etc. The relevant
code is just:

    char
    filter_identifier_char(char ch)
    {
        if (!std::isalnum(static_cast<unsigned char>(ch)))
            ch = '_';
        return static_cast<char>(
            std::tolower(static_cast<unsigned char>(ch)));
    }

Comments?

To analyze this issue, we should start by considering the "Boost documentation toolchain", which QuickBook integrates. Basically, the toolchain can be represented by


    QuickBook --> BoostBook --> DocBook
            Doxygen --^

    DocBook --> distinct processors and/or stylesheets are used to
                produce the final output, e.g., PDF, HTML, PS, Latex...

Here, I consider that BoostBook and DocBook are the primary targets for QuickBook and these should be our first concerns, although we should also keep an eye on the "secondary" targets, which are the final user-readable docs.

Looking at DocBook's docs (http://docbook.org/tdg/en/html/docbook.html), we find that an ID must be unique within the document and must begin with a letter. Furthermore, it is not stated, but can be inferred from IDREFS' definition, that IDs shouldn't contain whitespace characters.

Because DocBook is ultimately an XML format, I also looked into XML constraints on id's (xml:id, http://www.w3.org/TR/xml-id/, which links into 'Namespaces in XML' both versions 1.0 and 1.1, and these, in turn, reference the corresponding XML specifications). Here id's are taken to be XML names excluding the ':' character. Quoting from the XML 1.1 standard,


      « The first character of a Name MUST be a NameStartChar, and any
    other characters MUST be NameChars; this mechanism is used to
    prevent names from beginning with European (ASCII) digits or with
    basic combining characters. Almost all characters are permitted in
    names, except those which either are or reasonably could be used as
    delimiters. The intention is to be inclusive rather than exclusive,
    so that writing systems not yet encoded in Unicode can be used in
    XML names. »

And also,

      « Document authors are encouraged to use names which are
    meaningful words or combinations of words in natural languages, and
    to avoid symbolic or white space characters in names. Note that
    COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and
    MIDDLE DOT are explicitly permitted. »

Further below, the XML spec explicitly defines the ranges of characters that can be used in names, but I won't bother you with those.

In an ideal world, I think QuickBook should aim for the same inclusive ranges as XML and let DocBook and output-specific processors deal with the oddities of individual formats. However, we must also consider what I called the secondary targets, because those have the highest impact on users. Specifically, I looked at PDF and HTML, believing those are the most used.

From what I gather of the PDF reference (version 1.3), the equivalent to an id would be a 'Name Object'. In practical terms, PDF Name objects are binary in nature, allowing any 8-bit character to be in a name (through the use of escape codes, for some characters), thus imposing no restriction other than unique names in a document.

As for HTML (http://www.w3.org/TR/html4), it turns out to be the most restrictive of all. In HTML, the id and name attributes share the same namespace and can be used (almost) interchangeably as IDs.


      « ID and NAME tokens must begin with a letter ([A-Za-z]) and may
    be followed by any number of letters, digits ([0-9]), hyphens ("-"),
    underscores ("_"), colons (":"), and periods ("."). »

Furthermore, «names that differ only in case may not appear in the same document».
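
To make the HTML rule concrete, here is a minimal sketch of a checker for the token syntax quoted above. The names is_ascii_letter, is_html4_id and check_html4_id_examples are hypothetical and not part of QuickBook. It only checks the syntax of a single id; the document-wide rule about names that differ only in case is not covered.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of QuickBook.
inline bool is_ascii_letter(char ch)
{
    return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
}

// Checks a string against the HTML 4 rule for ID and NAME tokens.
inline bool is_html4_id(std::string const& id)
{
    // Must begin with a letter ([A-Za-z]); an empty string has none.
    if (id.empty() || !is_ascii_letter(id[0]))
        return false;

    // May be followed by letters, digits, '-', '_', ':' and '.'.
    for (std::string::size_type i = 1; i < id.size(); ++i)
    {
        char const ch = id[i];
        bool const ok = is_ascii_letter(ch) || (ch >= '0' && ch <= '9')
            || ch == '-' || ch == '_' || ch == ':' || ch == '.';
        if (!ok)
            return false;
    }
    return true;
}

// A few checks, including the id from Joel's example.
inline void check_html4_id_examples()
{
    assert(is_html4_id("section_2.heading_1"));
    assert(is_html4_id("my-anchor"));     // hyphens are accepted
    assert(!is_html4_id("_private"));     // leading underscore is rejected
    assert(!is_html4_id("2nd_section"));  // leading digit is rejected
}
```

Note that the ids QuickBook currently produces pass this check whenever they begin with a letter, and that a hyphen would pass as well.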


So what do we make of all this?

Well, without further investigating DocBook's processing of IDs when generating the different output formats, I think the current conservative approach that QuickBook follows is appropriate in general. Normalizing the case of identifiers also seems sensible because of HTML's specificities.

However, I'd like to propose the following specific changes in the handling of identifiers,


    - identifiers should be verified to begin with a letter, possibly
      allowing an underscore as well, although that goes against html
      rules.
    - the hyphen should be allowed inside identifiers, since it seems to
      be generally allowed.
    - QuickBook should keep track of the identifiers it generates to
      avoid reusing identifiers when it sanitizes input. This may be
      particularly important for languages that use characters outside
      the ASCII character set (como o Português ;-) where overlapping
      IDs could appear too easily.
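
A rough sketch of how the three changes could fit together follows. Everything here is an assumption made for illustration and not QuickBook's actual code: the names filter_identifier_char_with_hyphen, sanitize_identifier and id_registry are hypothetical, and so are the "id_" prefix used when an identifier does not begin with a letter and the "_1", "_2", ... suffixes used to resolve clashes. A real implementation might prefer to report an error in either case.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>

// Like the filter quoted earlier, but lets '-' through unchanged.
inline char filter_identifier_char_with_hyphen(char ch)
{
    if (ch == '-')
        return ch;
    if (!std::isalnum(static_cast<unsigned char>(ch)))
        ch = '_';
    return static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
}

// Filters every character, then makes sure the result begins with a letter.
inline std::string sanitize_identifier(std::string const& text)
{
    std::string id;
    for (std::string::size_type i = 0; i < text.size(); ++i)
        id += filter_identifier_char_with_hyphen(text[i]);

    // Assumption: prepend "id_" when the first character is not a-z.
    if (id.empty() || id[0] < 'a' || id[0] > 'z')
        id = "id_" + id;
    return id;
}

// Remembers every identifier handed out so that none is reused.
class id_registry
{
public:
    std::string make_unique(std::string const& text)
    {
        std::string const base = sanitize_identifier(text);
        std::string candidate = base;
        // Assumption: on a clash, try base_1, base_2, ... until one is free.
        for (unsigned n = 1; !used_.insert(candidate).second; ++n)
            candidate = base + "_" + std::to_string(n);
        return candidate;
    }

private:
    std::set<std::string> used_;
};

// Two titles that sanitize to the same text get distinct identifiers.
inline void check_id_registry_examples()
{
    id_registry ids;
    assert(ids.make_unique("Heading 1") == "heading_1");
    assert(ids.make_unique("Heading:1") == "heading_1_1");
    assert(ids.make_unique("my-anchor") == "my-anchor");
    assert(ids.make_unique("2nd part") == "id_2nd_part");
}
```

Titles with non-ASCII characters collide in the same way as "Heading 1" and "Heading:1" do here, because every byte outside the accepted set becomes an underscore (assuming the default "C" locale for std::isalnum).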

Even though these changes would fix Andy's issue, I think it still is important to consider the general case: it is cumbersome to refer to sanitized references. Maybe QuickBook could provide the means to generate the same sanitized reference on the spot. For instance, the mark-up


    [link [A long winded section title]]

could be used to generate,

    <link linkend="a_long_winded_section_title">A long winded section
        title</link>

For nested sections, perhaps

    [link [Section 2][Heading 1] Heading 1 of Section 2]

?
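
As a sketch of what resolving such links could involve, the code below applies the filter Joel quoted to each title and joins nested titles with '.', which matches the section_id.normalized_header_text form described at the top. The names normalize_title and make_linkend are hypothetical, and the [link [...]] syntax itself is only the proposal above, not something QuickBook is known to support.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// The filter quoted earlier in this thread, repeated so the sketch
// compiles on its own.
inline char filter_identifier_char(char ch)
{
    if (!std::isalnum(static_cast<unsigned char>(ch)))
        ch = '_';
    return static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
}

// Hypothetical helper: normalizes one title character by character.
inline std::string normalize_title(std::string text)
{
    std::transform(text.begin(), text.end(), text.begin(),
                   filter_identifier_char);
    return text;
}

// Hypothetical helper: joins normalized titles with '.', outermost first.
inline std::string make_linkend(std::vector<std::string> const& titles)
{
    std::string id;
    for (std::size_t i = 0; i < titles.size(); ++i)
    {
        if (i != 0)
            id += '.';
        id += normalize_title(titles[i]);
    }
    return id;
}

// The two examples from the text above.
inline void check_linkend_examples()
{
    std::vector<std::string> flat;
    flat.push_back("A long winded section title");
    assert(make_linkend(flat) == "a_long_winded_section_title");

    std::vector<std::string> nested;
    nested.push_back("Section 2");
    nested.push_back("Heading 1");
    assert(make_linkend(nested) == "section_2.heading_1");
}
```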

Thoughts?


João

