Re: [rust-dev] A plea for removing context-free ambiguity / context-required parsing

Nathan Thu, 16 Aug 2012 18:11:11 -0700

I caught up a little on the state of this topic in IRC, and now I see
it's an old topic.  I don't want to rub old calluses, but I do hope
my use cases are motivating.



On Thu, Aug 16, 2012 at 2:31 PM, Niko Matsakis <[email protected]> wrote:
> We have discussed this point for some time, though I don't know if anyone
> raised the security review angle specifically before.
>
> I think there is a good case to be made that lint warnings adequately
> address this problem.
>

I almost feel like a lint warning which can be disabled is worse.
Instead of a "soft" convention, we have a "checked" convention, so
violations of the convention are even more surprising.

One of my proposed use cases was a reviewer on a desert island with a
paper printout.  That was shorthand; it's intended to cover book
publications, diff hunks, github source listings, presentation slides,
mailing list patch posts, etc...

My argument was specifically about an auditor who does not have access
to an automated tool.  With an unambiguous grammar, this is no real
disadvantage, at least for knowing how to interpret a snippet.  But if
the solution is that you must run an automated static tool, you lose
the benefit of analysis in all those peripheral use cases.
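
To make the snippet problem concrete, here is a minimal sketch in
present-day Rust syntax.  The `Color` enum and both function names are
hypothetical, invented only for illustration, and the lint names in the
`allow` attributes are the ones I believe current compilers use.  The two
`match` expressions are textually identical; only the `use` line above
the first one changes what `red` means.

```rust
// Hypothetical enum for illustration. The variant names are lowercase on
// purpose, to mimic the `bar` example; current Rust style uses CamelCase.
#[allow(non_camel_case_types, dead_code)]
enum Color {
    red,
    green,
}

// The variant `red` is imported here, so the pattern `red` compares `c`
// against that variant.
fn variant_in_scope(c: Color) -> &'static str {
    use Color::red;
    match c {
        red => "is the variant red",
        _ => "is some other variant",
    }
}

// Identical `match` text, but no variant named `red` is in scope, so `red`
// is a fresh binding that matches every value. Recent compilers flag this
// through lints, which are silenced here to expose the underlying semantics.
#[allow(bindings_with_variant_name, unused_variables, unreachable_patterns)]
fn variant_not_in_scope(c: Color) -> &'static str {
    match c {
        red => "is the variant red",
        _ => "is some other variant",
    }
}

fn main() {
    println!("{}", variant_in_scope(Color::green));
    println!("{}", variant_not_in_scope(Color::green));
}
```

A reader holding only the body of `variant_not_in_scope` on paper has no
way to tell that its first arm swallows `Color::green` as well.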

I'd argue that a default-on lint check for a convention makes
violations of that convention even more surprising.  Newbies learning
the language have the violation negatively reinforced as "not
probable", but it's much easier to have one big "not possible" mental
stamp instead of a long list of fine grained, rarely used "not
probable except when rare thing X happens" facts.

In the world with the default-on lint check, I suspect it'd be easier
to submit a patch that contains a backdoor in plain sight, because
people's plain sight has been trained with blinders.


> You can set the mode to cause an error if
> case-distinctions are not observed in a pattern binding. Then a reviewer can
> be confident that, as long as the code compiles, nothing surprising is going
> on.
>

A lint check that's on by default is a good 80% solution so long as we
accept that the "eyes" in the maxim "with enough eyes, all bugs are
shallow" really means "automated tools".

That's practical, because any sane patch review process will compile
and run some tests before merging.  I still say only 80%, because of
all the "human without tools" scenarios, such as everyone on the
mailing list who sees the patch but doesn't apply it and run lint
themselves.  It's hard to say how valuable those other cases are.
They might capture a long tail of valuable contributions to open
source or maybe they have little impact.
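
For comparison, here is a rough sketch of what the "checked convention"
might look like in present-day Rust.  The `Shape` enum and `describe`
function are hypothetical; the three lint names exist in current rustc
as far as I know, but promoting them with `deny` on a single function is
only one possible setup, not necessarily the mode Niko describes.

```rust
// Hypothetical enum for illustration, using the CamelCase variant names
// that current Rust style expects.
enum Shape {
    Circle,
    Square,
}

// Promote the relevant lints to hard errors for this function, so the code
// only compiles if the case convention is respected in its patterns.
#[deny(non_snake_case, bindings_with_variant_name, unreachable_patterns)]
fn describe(s: Shape) -> &'static str {
    match s {
        // A path-qualified pattern can only name a variant.
        Shape::Circle => "circle",
        // A lowercase bare identifier is a binding that matches anything.
        // Writing a bare `Circle` here without importing the variant would
        // create a binding named like a variant, which these lints reject.
        _rest => "not a circle",
    }
}

fn main() {
    println!("{}", describe(Shape::Circle));
    println!("{}", describe(Shape::Square));
}
```

This only helps a reviewer who knows the lints were enabled when the
code was compiled, which is exactly the "human without tools" gap.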

The backdoor case is compelling to me, but I'm probably overly
paranoid.  I hope I am, at least.  ;-)

I'm happy with an 80% solution that the community easily accepts
versus a 100% solution that is rejected.


>
>
> Niko
>
>


Nathan


> On 8/16/12 1:26 PM, Nathan wrote:
>>
>> Hello,
>>
>> I'm brand new to rust.  I've read the tutorial once and not the
>> manual.  My first attempt at running "make" failed with a compiler
>> error (I'll email or irc separately).  I apologize if I'm re-covering
>> ground or missing something, but I felt this general point was
>> important to make as early as possible in a language's design.  I hope
>> I'm not too late.
>>
>> In another thread about naming conventions, it sounds as if there is a
>> grammatical ambiguity which is resolved by scoping rules at compile
>> time:
>>
>> match myfoo {
>>    bar => /* ... stuff ... */
>> }
>>
>> What does the left hand side of the match rule represent?  IIUC, it is
>> impossible to tell in rust without understanding what "bar" signifies
>> in the surrounding scope.  If there is an enum discriminator named
>> "bar" and myfoo is of that type, then it means "match if myfoo is a
>> bar value".  On the other hand if this is not the case, it means
>> "create a new binding called bar and bind it to the value in
>> question".
>>
>> Is this true?
>>
>> If so, I suggest this is a serious problem which naming conventions
>> will not solve.  If it is not the case, please ignore this email and
>> tell me to rtfm.  ;-)
>>
>> To see why it is a problem, consider two use cases: a person learning
>> the language, and a person auditing code for bugs.  (For bonus
>> material, see the post-script.)
>>
>>
>> A person learning the language may learn only one semantic
>> interpretation first, or may learn both but forget about one.  They
>> write code depending on the one semantic interpretation they are
>> familiar with.  It works.  Then one day, there's a strange compiler
>> error.  Hopefully it's a very clear compiler error.  That could save
>> some time.  Either way they probably need to revisit two different
>> parts of language reference documentation to get a full understanding
>> of the issue.  Even if the "official documentation" has a single point
>> that contrasts these two semantic possibilities, any 3rd party books,
>> blogs, tutorials, etc... will reinforce the misunderstanding.
>>
>>
>> Case two: A code auditor is looking for bugs, possibly subtle bugs or
>> security flaws.  They don't have a compiler.  They're looking at a
>> printout in an underground lair with no electronics allowed, or
>> equivalently it's an interview question.
>>
>> Now, they see a pattern match rule where the left hand side is "bar".
>> Even though they know the language perfectly, they cannot know the
>> semantics here without understanding the scope of enum discriminators.
>>   Does this require looking at more than one file?  If so, the problem
>> complexity branches out indefinitely.  They must do this for *every*
>> such matching rule, even though many rules may be simple binding
>> patterns rather than enum discriminators.
>>
>> See my last C example to illustrate the compound problems of grammar
>> ambiguity *and* importing definitions from other files.  If the imported
>> definitions are not explicitly named in relation to where they are
>> imported from, then an auditor must now read *every* file imported,
>> and they must do this recursively.  (Let's hope they have IDE
>> support.)  If the imported names are explicitly associated with which
>> source they are imported from, the auditor must recurse but at least
>> it's linear instead of exponential.
>>
>>
>> The solution I'm proposing is to alter the grammar so that it's
>> possible to tell, by looking at only the pattern matching text and
>> without knowing any other context, whether it is a discriminator
>> match or a new binding.  There are at least two ways to do this:
>>
>> One is to ensure that it's always possible to tell, when looking at
>> an identifier in *any* context, whether it is a discriminator or a
>> binding/reference.  Haskell does this elegantly, IMO, by forcing
>> discriminators to start with upper case and bindings/references to
>> start with lower case.  Any other rule that prevents the identifiers
>> from overlapping is sufficient.  I prefer this approach because it
>> solves the ambiguity problem for *every* grammar production which
>> involves either a reference/binding *or* an enum discriminator.
>>
>> Another is to change the specific match syntax so you say something like:
>>
>> match myfoo {
>>    discriminator bar => /* yes, this is a klunky new keyword, so I
>> don't recommend this in practice, but it makes the point. */
>>    bar => /* bare identifiers are always bindings. */
>> }
>>
>> -or-
>>
>> match myfoo {
>>    'bar => /* This is just the same as the last, except we use a sigil
>> instead of a keyword.  It's compact.  This could be considered an
>> identifier disambiguation approach if all discriminator identifiers
>> always begin with ' or some other sigil. */
>>    bar => /* bare identifiers are always bindings. */
>> }
>>
>> -or-
>>
>> match myfoo {
>>    MyEnum.bar => /* Always require the type for discriminators, at
>> least in this context.  Klunky if other contexts do not require the
>> type.  Klunky since the type of myfoo is already specified. */
>>    bar => /* bare identifiers are always bindings. */
>> }
>>
>> -or-
>>
>> match myfoo {
>>    bar => /* bare identifiers are always discriminators. */
>>    let bar => /* bindings always use let (because it is similar to a
>> let binding). kind of klunky and maybe confusing placement of the
>> keyword.  Plus nested patterns get klunky: */
>>    [bar, let bar] => /* match a list/sequence/array thingy with a bar
>> discriminator value and any other value which is bound to bar.
>> Contrived but shows the grammar distinction in compound matches. */
>> }
>>
>>
>>
>> Anyway, please understand that those proposed syntaxes are just
>> "ballpark" since I don't understand the grammar well, nor the
>> style/community/taste.  The main point is that grammars which are
>> ambiguous without compile/run-time context are fraught with peril.
>>
>>
>> Regards,
>> Nathan Wilcox
>>
>>
>> PS:
>>
>> Maybe a simpler way to state my desire is: Make it so that it's very
>> hard to compete in an "underhanded backdoor" competition for rust and
>> very easy to audit code for bugs.
>>
>> See for example this competition where entries look like correct C
>> code to tally votes, but they surreptitiously skew the results in the
>> favor of the author:
>> http://graphics.stanford.edu/~danielrh/vote/scores.html
>>
>> When I am emperor, all language designers will be forced to audit all
>> entries for all "underhanded backdoor" competitions for all other
>> languages before they are allowed to design their language.  ;-)  (You
>> may surmise that I was a security auditor in the past...)
>>
>> One of my favorites is here:
>> http://graphics.stanford.edu/~danielrh/vote/mzalewski.c
>>
>> That entry is a case where examining a bit of text does not tell you
>> its semantics because it may either be a variable reference *or* a
>> macro instance and the only way to know is to have a mental model of
>> the macros and variables in scope.  If instead, all macro expansions
>> required a $ prefix or whatever, there would be no ambiguity and the
>> bug would be much easier to track down.
>>
>>
>> PPS:
>>
>> Some other simple ambiguities in languages I kind of know off the top
>> of my head which have led to real world bugs I wrote or had to fix:
>>
>> javascript:
>> x = 5; // ambiguity: Either reassign a declared binding in a
>> containing scope, or create a new global binding.
>>
>> erlang:
>> X = 5, % ambiguity: Either create a new binding called "X" referring
>> % to 5 *or* try to match the existing binding "X" against the value 5.
>>
>> C:
>> #include "define_foo.h"
>>
>> static int bar = 42;
>>
>> int main(int argc, char** argv) {
>>    foo(bar); // Ambiguous, even without macros, depending on the
>> contents of define_foo.h
>>    bar = 7; // Possibly invalid, depending on the contents of define_foo.h
>>    return bar;
>> }
>>
>> Here are at least two possible contents for define_foo.h:
>> int foo(int bar);
>>
>> -or-
>>
>> typedef char foo;
>> _______________________________________________
>> Rust-dev mailing list
>> [email protected]
>> https://mail.mozilla.org/listinfo/rust-dev
>
>