Re: [netmod] [Gen-art] Gen-ART IETF Last Call review of draft-ietf-netmod-rfc6020bis-12 (part 2)

Dale R. Worley Mon, 06 Jun 2016 15:29:13 -0700

(This is the second part of my response.)

> > > > - section 6.1
> > > > 
> > > >    This section details the rules for recognizing tokens from an input
> > > >    stream.
> > > > 
> > > > Generally, language definitions intersperse the narrative text with
> > > > the relevant grammar definitions.  Yang's statement grammar is simple
> > > > enough that one doesn't need to see the context-free part of the
> > > > grammar to understand the narrative for statements.  But when reading
> > > > about tokenization, not having the grammar presented at the same time
> > > > is quite a burden.  I'd recommend duplicating the relevant productions
> > > > from section 14 into the subsections of section 6.
> > > > 
> > > > There is some sort of exposition problem.  The result of
> > > > "tokenization" is that the sequence of characters of the source is
> > > > converted into a sequence of "tokens".  Then some subset of the tokens
> > > > is discarded as being non-significant (e.g., whitespace and comments),
> > > > and the remainder is parsed with a context-free grammar.  Here I can't
> > > > figure out what the set of tokens is.  Looking at the grammar in
> > > > section 14, it seems to be a context-free grammar on characters.  But
> > > > that implies that there is no separate tokenization phase.
> > > > 
> > > > An example that shows the problems:
> > > > 
> > > >    mod:ext
> > > > 
> > > > Is this one token, which is also an extension keyword, or is it a
> > > > sequence of three tokens?
> > > 
> > > The text says:
> > > 
> > >   A token in YANG is either a keyword, a string, a semicolon (";"), or
> > >   braces ("{" or "}").
> > > 
> > > and:
> > > 
> > >   A keyword is [...] or a prefix identifier, followed by a colon
> > >   (":"), followed by a language extension keyword.
> > > 
> > > So "mod:ext" is one token.
> > 
> > Certainly it can be one token.  My question is how do we verify that it
> > is not a string?  I think that the origin of my confusion here is
> > that I haven't spotted a clear syntax for unquoted string.  In most
> > programming languages, mod:ext would be parsed as an identifier, a
> > colon, and an identifier.  In YANG, identifiers are usually tokenized as
> > strings, so I ask whether YANG tokenizes it as a string, a colon, and a
> > string.
> > 
> > Looking at the beginning of 6.1.3, it doesn't appear that an unquoted
> > string is forbidden from containing a colon.
> > 
> > I think that the underlying problem is that I'm not clear on what gets
> > tokenized as an unquoted string.
> 
> Note that this is legal YANG:
> 
>    leaf type {
>      type string;
>    }


So keywords aren't reserved; they can also be used as identifiers.

> I think there are two ways to look at this.  Either we describe the
> tokenizer as being context-dependent, or we describe the "argument" in
> a "statement" to be a "string or keyword".
> 
> In the latter case maybe we can do:
> 
> OLD:
> 
>   If a string contains any space, tab, or newline characters, a single
>   or double quote character, a semicolon (";"), braces ("{" or "}"),
>   or comment sequences ("//", "/*", or "*/"), then it MUST be enclosed
>   within double or single quotes.
> 
> NEW:
> 
>   An unquoted string is any sequence of characters that does not start
>   with a double or single quote character, is not a keyword, and does
>   not contain any space, tab, or newline characters, a single or
>   double quote character, a semicolon (";"), braces ("{" or "}"), or
>   comment sequences ("//", "/*", or "*/").

That's a lot clearer.  Though you can shorten it to:

   An unquoted string is any sequence of characters that is not a
   keyword, and does not contain any space, tab, or newline
   characters, a single or double quote character, a semicolon (";"),
   braces ("{" or "}"), or comment sequences ("//", "/*", or "*/").
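
As a rough illustration of that shortened wording, here is a Python
sketch of the test.  The KEYWORDS set is a hypothetical, deliberately
partial stand-in for the real keyword list, the function name is mine,
and rejecting the empty sequence is my assumption rather than something
the text says.

```python
import re

# Hypothetical, partial keyword list; a real check would need the full set.
KEYWORDS = {"module", "leaf", "type", "description"}

# Characters and sequences the proposed text forbids in an unquoted string:
# space, tab, newline, either quote, ";", "{", "}", and "//", "/*", "*/".
FORBIDDEN = re.compile(r"""[ \t\n'";{}]|//|/\*|\*/""")


def is_unquoted_string(text):
    """Return True if text fits the shortened definition of an unquoted string."""
    if not text:                      # assumption: an empty sequence is not a token
        return False
    if text in KEYWORDS:              # "is not a keyword"
        return False
    return FORBIDDEN.search(text) is None


for candidate in ["mod:ext", "abc", "type", "a b", "a//b", "a/b"]:
    print(repr(candidate), is_unquoted_string(candidate))
```

Note that under this reading "mod:ext" passes the test, since nothing
in the wording excludes a colon.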

> In section 6.3 we must also do:
> 
> OLD:
> 
>    The argument is a string, as defined in Section 6.1.2.
> 
> NEW:
> 
>    The argument is a string or a keyword, as defined in Section 6.1.2.

If I understand correctly, the tokens of Yang (as the term is usually
used in programming languages) are:

    whitespace (which is ignored)
    comments (which are ignored)
    single-quoted strings
    double-quoted strings
    unquoted strings (including keywords)
    ;
    {
    }

From the point of view of the tokenizer, these tokens fall into the
obvious classes:

        type    unquoted string
        "type"  double-quoted string
        abc     unquoted string
        "abc"   double-quoted string
        '---'   single-quoted string
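
To make my reading of the token list concrete, here is a small Python
sketch of a context-free tokenizer for those classes.  It is only my
understanding, not the draft's definition: the group names, the
tokenize() function, and the treatment of backslash escapes inside
double quotes are assumptions made for the illustration.

```python
import re

# One alternative per token class.  Order matters: comments must be tried
# before unquoted strings so that "//" is not swallowed as a string.
TOKEN_RE = re.compile(r"""
    (?P<whitespace>[ \t\r\n]+)
  | (?P<comment>//[^\n]*|/\*.*?\*/)
  | (?P<double_quoted>"(?:[^"\\]|\\.)*")
  | (?P<single_quoted>'[^']*')
  | (?P<semicolon>;)
  | (?P<lbrace>\{)
  | (?P<rbrace>\})
  | (?P<unquoted>(?:(?!//|/\*|\*/)[^\s;{}"'])+)
""", re.VERBOSE | re.DOTALL)


def tokenize(source):
    """Return the significant tokens of source as (class, text) pairs."""
    pos = 0
    tokens = []
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"cannot tokenize at offset {pos}")
        # Whitespace and comments are recognized but then discarded.
        if m.lastgroup not in ("whitespace", "comment"):
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize("leaf type { type string; }  // a comment"))
print(tokenize("mod:ext"))
```

With this tokenizer "type" and "mod:ext" each come out as a single
unquoted string; deciding whether one of them is a keyword is left
entirely to the parser.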

I'm not quite sure how they are classified from the parser's point of
view, though.

                                type    "type"  abc     "abc"   '---'

Is a string?                    ?       Y       ?       Y       Y
(Can it appear as the
argument of "description"?)

Is a keyword?                   Y       ?       N       N       N
(Can it appear as the first
token of some statement?)

Is an identifier?               Y       ?       Y       ?       N
(Can it appear as the second
token of a type statement?)

Usually programming languages use the particular syntax of different
types of tokens to determine where they can be used in the
context-free grammar.  Yang seems to be more relaxed, but I'm not sure
whether it is so relaxed that any of the types of string tokens can be
used anywhere.
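
For comparison, here is a Python sketch of the most relaxed possible
parser, one that accepts any string token in both the keyword and the
argument position.  It assumes the general statement shape of a
keyword, an optional argument, and then either ";" or a braced block of
substatements.  The (kind, text) token representation and the function
name are invented for the example, and whether Yang really is this
permissive is exactly what I am unsure about.

```python
def parse_statement(tokens, pos=0):
    """Parse one statement at tokens[pos]; return (node, next_pos).

    tokens is a list of (kind, text) pairs where kind is "string", ";",
    "{" or "}".  A node is a (keyword, argument, substatements) tuple.
    Truncated input raises IndexError in this sketch.
    """
    kind, keyword = tokens[pos]
    if kind != "string":
        raise SyntaxError(f"expected a keyword at token {pos}")
    pos += 1

    # The argument is optional and may be any string token at all.
    argument = None
    if tokens[pos][0] == "string":
        argument = tokens[pos][1]
        pos += 1

    if tokens[pos][0] == ";":
        return (keyword, argument, []), pos + 1
    if tokens[pos][0] != "{":
        raise SyntaxError(f"expected ';' or '{{' at token {pos}")
    pos += 1

    children = []
    while tokens[pos][0] != "}":
        child, pos = parse_statement(tokens, pos)
        children.append(child)
    return (keyword, argument, children), pos + 1


# leaf type { type string; }
example = [("string", "leaf"), ("string", "type"), ("{", "{"),
           ("string", "type"), ("string", "string"), (";", ";"),
           ("}", "}")]
print(parse_statement(example)[0])
```

Here "type" is accepted as an argument in the first statement and as a
keyword in the second, purely by position.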

> > > > -- it must be an unquoted string.
> > > > 
> > > >    If a double-quoted string contains a line break followed by space or
> > > >    tab characters that are used to indent the text according to the
> > > >    layout in the YANG file, this leading whitespace is stripped from the
> > > >    string, up to and including the column of the double quote character,
> > > >    or to the first non-whitespace character, whichever occurs first.  In
> > > >    this process, a tab character is treated as 8 space characters.
> > > > 
> > > > This description isn't quite careful enough.  Better:
> > > > 
> > > >    If a double-quoted string contains a line break followed by space or
> > > >    tab characters, an initial part of this whitespace is removed from the
> > > >    string.  The amount removed is the longest prefix whose width is no
> > > >    larger than the width of the prefix of the Yang source line containing
> > > >    the opening double quote character of the string to and including the
> > > >    opening double quote character.  For this purpose, the width of a
> > > >    tab character is 8 and the width of any other character is 1.
> > > > 
> > > > This does assume that all tabs are considered to have width 8, that
> > > > is, tabs do not have the usual semantics of "advance to the next
> > > > column that is divisible by 8".  That will sometimes cause unexpected
> > > > results, e.g., if some source lines start with SPC TAB.  (Consider
> > > > that whitespace before a line break is removed, which suggests the
> > > > intention is that the value of the string should depend only on its
> > > > visual appearance.)
> > > > 
> > > > Also, we're using the convention that "whitespace" does NOT include CR
> > > > or LF, which is not always how the term is used.  Perhaps a definition
> > > > of "whitespace" should be put in section 3.
> > > > 
> > > > There is also the special case:
> > > > 
> > > >    SPC " LF
> > > >    TAB x "
> > > > 
> > > > Is the initial TAB of the second line to be removed or not?  There is
> > > > no whitespace removal in the second line that will exactly reach the
> > > > opening double quote.  As I've written it, the TAB is not removed.
> > 
> > Don't forget this ugly special case.
> 
> So, let's follow the rules.  We need to trim to the column of the
> double quote character (2).  The second line starts with "space or
> tab" so we do whitespace trimming, while treating the tab as 8
> spaces.  So from 8 spaces we subtract 2, and get the resulting string
> of 6 characters:
> 
>   LF SPC SPC SPC SPC SPC SPC x

OK, but that process wasn't clear to me.  I take it that any tab that
appears before the starting double-quote counts as 8 spaces, and any
tab that needs to be examined for deletion is turned into 8 spaces --
but any other tabs in the string are unconverted.

I think it would be clearer to insert "starting" where I've indicated
it, and replace the final sentence:

   If a double-quoted string contains a line break followed by space or
   tab characters that are used to indent the text according to the
   layout in the YANG file, this leading whitespace is stripped from the
   string, up to and including the column of the >starting< double quote
   character, or to the first non-whitespace character, whichever occurs
   first.  In this process, any tab character before the starting double
   quote character is treated as 8 spaces.  Any tab character in a
   succeeding line that must be examined for stripping is first converted
   into 8 spaces.
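
Here is a Python sketch of the stripping process as I now understand
it.  The function names and the split of the input into "prefix" (the
source text before the opening quote on its line) and "content" (the
raw text between the quotes) are my own.  The separate rule about
removing whitespace before a line break is left out.

```python
TAB_WIDTH = 8


def strip_line(line, quote_col):
    """Strip leading whitespace from one continuation line.

    At most quote_col columns are removed, stopping early at the first
    non-whitespace character.  A tab that has to be examined counts as 8
    spaces; tabs that are never examined are left unconverted.
    """
    remaining = quote_col   # columns still to be stripped
    leftover = 0            # spaces surviving from a partly stripped tab
    i = 0
    while remaining > 0 and i < len(line) and line[i] in " \t":
        width = TAB_WIDTH if line[i] == "\t" else 1
        if width > remaining:
            leftover = width - remaining
            remaining = 0
        else:
            remaining -= width
        i += 1
    return " " * leftover + line[i:]


def strip_indentation(prefix, content):
    """Apply the stripping rule to the content of a double-quoted string."""
    # Column of the starting double quote, counting each tab before it as 8.
    quote_col = sum(TAB_WIDTH if ch == "\t" else 1 for ch in prefix) + 1
    first, *rest = content.split("\n")
    return "\n".join([first] + [strip_line(line, quote_col) for line in rest])


# The special case above: SPC " LF TAB x "
print(repr(strip_indentation(" ", "\n\tx")))
```

For the special case this gives LF, six spaces, and x, which matches
the result worked out above.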

> > Actually, there is a somewhat subtle problem:  If I say "the system can
> > sort them any way it wants", I am asserting that *there is a sorting
> > order*.  Which means that if value A is put before value B at one time,
> > then if values A and B are in the list at some other time, A will
> > precede B.
> 
> The next sentence says:
> 
>   An implementation SHOULD use the same order for the same data,
>   regardless of how the data were created.  Using a deterministic
>   order will make comparisons possible using simple tools like "diff".

OK, I'm willing to go with that.  I mis-read the application of those
sentences through an even more arcane ambiguity in the term "the same
data".  But I'm willing to ignore that.

> > > > - section 7.21.4
> > > > 
> > > >    The "reference" statement takes as an argument a string ...
> > > > 
> > > > Perhaps s/a string/a human-readable string/.
> > > 
> > > "string" refers to the YANG token "string".  The same wording is used
> > > across the document for all arguments.
> > 
> > I was thinking that it is a string, but in this particular case, it is
> > supposed to be human-readable, whereas strings in other contexts aren't
> > expected to be.
> 
> Ok.  Maybe:
> 
> OLD:
> 
>   The "reference" statement takes as an argument a string that is used
>   to specify a textual cross-reference to an external document,
> 
> NEW:
> 
>   The "reference" statement takes as an argument a string that is used
>   to specify a human-readable cross-reference to an external document,

Or even "is a human-readable cross-reference ...", but either is OK
with me.

> > > > - section 7.21.5
> > > > 
> > > > Note that if a data definition has both an "if-feature" and a "when",
> > > > then the "if-feature" is tested first.
> > > > 
> > > >    If the XPath expression references any node that also has associated
> > > >    "when" statements, these "when" expressions MUST be evaluated first.
> > > >    There MUST NOT be any circular dependencies in these "when"
> > > >    expressions.
> > > > 
> > > > I think this could be better phrased:
> > > > 
> > > >    If the XPath expression references any node that also has
> > > >    associated "when" statements, then the "when" expressions of the
> > > >    referenced nodes MUST be evaluated first.  There MUST NOT be any
> > > >    circular dependencies among "when" expressions.
> > > 
> > > Ok to the last sentence.  Do you think that the word "these" in the
> > > first sentence is ambiguous?
> > 
> > I must have thought it was unclear when I read it, otherwise I would not
> > have suggested changing it.  But reading it again, I think that there is
> > no ambiguity.  Perhaps it would be a little clearer to use 'those "when"
> > expressions' rather than 'these "when" expressions'.  (I can't explain
> > clearly why "those" seems less ambiguous than "these".)
> 
> Ok, as a non-native English speaker I trust you that "those" is better.

I can't tell that you're non-native.  Perhaps leave it as is and let
the RFC Editor review it.

> > By implication, the leafref's value is considered to be a pointer to a
> > particular leaf instance, the one with the matching value.  But that
> > idea is not embedded in the Yang semantics of leafref types in any way
> > (other than the output of the deref function), so the fact that there
> > might be more than one matching leaf instance does not matter.
> > 
> > As stated in 9.9.4 and 9.9.5, the lexical representations of its values
> > are the same as those of the referenced nodes.
> > 
> > How is the leafref's value compared to the values of the referenced
> > nodes?  I can see that question getting ugly for the more complex types
> > (e.g., anyxml)
> 
> You can't have a leafref to an anyxml node; just to a leaf or
> leaf-list.
> 
> > which do not have canonical forms.  I suspect the
> > intention is that values are equal if they have the same canonical form
> 
> No, the idea is that they are equal if their *value* is equal,
> regardless of the lexical representation.

There are two types that don't have a canonical form, identityref and
instance-identifier.  It seems that comparisons in XPath expressions
are inexact if the type doesn't have a canonical form (section 6.4).
But if I understand you correctly, the implicit comparisons in leafref
are done based on the abstract values involved, not the lexical
representation.

> > The current ABNF doesn't allow for "+" for joining quoted strings.
> > Also, it doesn't show that \" can be included in a double quoted string
> > to include a literal ", and allows the string contents to continue --
> > the current ABNF "DQUOTE string DQUOTE" matches "abcd\", even though
> > the latter is not a proper double-quoted string.
> 
> Note that the prose text (within <...>) says "a string that
> matches...".  That string can be any YANG token string, for example
> one of:
> 
>    "hello"
>    "he" + "llo"

If I haven't gotten confused, you're referring to

   string              = < an unquoted string as returned by >
                         < the scanner, that matches the rule >
                         < yang-string >

   yang-string        = *yang-char

   ;; any Unicode or ISO/IEC 10646 character including tab, carriage
   ;; return, and line feed, but excluding the other C0 control
   ;; characters, the surrogate blocks, and the noncharacters.
   yang-char = %x09 / %x0A / %x0D / %x20-D7FF /
                               ; exclude surrogate blocks %xD800-DFFF
              %xE000-FDCF /    ; exclude noncharacters %xFDD0-FDEF
              %xFDF0-FFFD /    ; exclude noncharacters %xFFFE-FFFF
              %x10000-1FFFD /  ; exclude noncharacters %x1FFFE-1FFFF
              %x20000-2FFFD /  ; exclude noncharacters %x2FFFE-2FFFF
              %x30000-3FFFD /  ; exclude noncharacters %x3FFFE-3FFFF
              %x40000-4FFFD /  ; exclude noncharacters %x4FFFE-4FFFF
              %x50000-5FFFD /  ; exclude noncharacters %x5FFFE-5FFFF
              %x60000-6FFFD /  ; exclude noncharacters %x6FFFE-6FFFF
              %x70000-7FFFD /  ; exclude noncharacters %x7FFFE-7FFFF
              %x80000-8FFFD /  ; exclude noncharacters %x8FFFE-8FFFF
              %x90000-9FFFD /  ; exclude noncharacters %x9FFFE-9FFFF
              %xA0000-AFFFD /  ; exclude noncharacters %xAFFFE-AFFFF
              %xB0000-BFFFD /  ; exclude noncharacters %xBFFFE-BFFFF
              %xC0000-CFFFD /  ; exclude noncharacters %xCFFFE-CFFFF
              %xD0000-DFFFD /  ; exclude noncharacters %xDFFFE-DFFFF
              %xE0000-EFFFD /  ; exclude noncharacters %xEFFFE-EFFFF
              %xF0000-FFFFD /  ; exclude noncharacters %xFFFFE-FFFFF
              %x100000-10FFFD  ; exclude noncharacters %x10FFFE-10FFFF

But if that's taken at face value, you can lex as single "string"s not only

    "hello"

and

    "he" + "llo"

but also

    "The MTU of the interface."; myext:c-define "MY_MTU"

Doing that would allow the incorrect lexing of

         leaf mtu {
           type uint32;
           description "The MTU of the interface."; myext:c-define "MY_MTU";
         }

as having a long description (starting with 'The MTU' and ending with
'MY_MTU') and no myext:c-define statement.

What we need is a production that matches "strings possibly combined
with +" and nothing else.  That is, including '"he" + "llo"' but not the
last example.
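
As a sketch of what such a production would have to accept, here it is
as a Python regular expression.  This is an illustration, not proposed
ABNF: the names are mine, only whitespace (not comments) is allowed
around "+", and I am assuming that the escapes permitted inside double
quotes are \n, \t, \" and \\.

```python
import re

# A double-quoted string: any characters except an unescaped quote or a
# lone backslash, with \n, \t, \" and \\ accepted as escapes (assumption).
DQ = r'"(?:[^"\\]|\\[nt"\\])*"'
# A single-quoted string: anything up to the next single quote.
SQ = r"'[^']*'"
QUOTED = f"(?:{DQ}|{SQ})"

# One quoted string, optionally followed by "+"-joined quoted strings.
CONCAT = re.compile(rf"{QUOTED}(?:\s*\+\s*{QUOTED})*")


def is_quoted_string_argument(text):
    """True if text is quoted strings joined by '+' and nothing else."""
    return CONCAT.fullmatch(text) is not None


for sample in ['"hello"',
               '"he" + "llo"',
               '"The MTU of the interface."; myext:c-define "MY_MTU"',
               '"abcd\\"']:
    print(sample, is_quoted_string_argument(sample))
```

The first two samples are accepted, while the description-plus-extension
text and "abcd\" are both rejected.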

Dale
