On Aug 28, 2019, at 00:40, Chris Angelico <ros...@gmail.com> wrote:
>
> On Wed, Aug 28, 2019 at 2:40 PM Andrew Barnert <abarn...@yahoo.com> wrote:
>>> People can be trusted with powerful features that can introduce
>>> complexity. There's just not a lot of point introducing a low-value
>>> feature that adds a lot of complexity.
>>
>> But it really doesn’t add a lot of complexity.
>>
>> If you’re not convinced that really-raw string processing is doable, drop
>> that.
>>
>> Since the OP hasn’t given a detailed version of his grammar, just take mine:
>> a literal token immediately followed by one or more identifier characters
>> (that couldn’t have been munched by the literal) is a user-suffix literal.
>> This is compiled into code that looks up the suffix in a central registry
>> and calls it with the token’s text. That’s all there is to it.
>>
>
> What is a "literal token", what is an "identifier character",
Literals and identifier characters are already defined today, so I don’t need
new definitions for them.
The existing tokens are already implemented in the tokenizer and in the
tokenize module, which is why I was able to slap together multiple variations
on a proof of concept 4 years ago in a few minutes as a token-stream-processing
import hook.
My import hook version is a hack, of course, but it serves as a counterexample
to your argument that no simple thing could work: it is a dead simple thing
that does work. And there's no reason to believe a real version wouldn't be at
least as simple.
> and how
> does this apply to your example of having digits, a decimal point, and
> then a suffix
We add a `suffixedfloatnumber` production defined as `floatnumber identifier`.
So, the `2.34` parses as a `floatnumber` the same as always. That `d` can't be
part of a `floatnumber`, but it can be the start of an `identifier`, and those
two nodes together can make up a `suffixedfloatnumber`. No need for any new
lookahead or other context. And for the concrete implementation in CPython, it
should be obvious that the suffix can be pushed down into the tokenizer, at
which point the parse becomes trivial.
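In the reference manual's grammar notation, the whole addition is something
like this (the production name is mine, of course):

    suffixedfloatnumber ::= floatnumber identifier

so `2.34d` parses as the `floatnumber` `2.34` followed by the `identifier` `d`.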
If you’re asking how my hacky version works, you could just read the code,
which is simpler than an explanation, but here goes (from memory, because I’m
on my phone): To the existing tokenizer, `d` isn’t a delimiter character, so it
tries to match the whole `2.34d`. That doesn’t match anything. But `2.34` does
match something, etc., so ultimately it emits two tokens, `floatnumber('2.34'),
error('d')`. My import hook reads the stream of tokens. When it sees a
`floatnumber` followed by an `error`, it checks whether the error body could be
an identifier token. If so, it replaces those two tokens in the stream with… I
forget, but probably I just hand-parsed the lookup-and-call and emitted the
tokens for that.
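Here's a minimal sketch of that token-rewriting idea, from memory and not my
actual hook (the registry name `__user_suffixes__` is from my proposal, the
adjacency check is just illustrative, and it assumes a tokenize module that
emits NUMBER followed by NAME for `2.34d` rather than erroring out):

    import io
    import tokenize

    def rewrite(source):
        # Turn a NUMBER token immediately followed by a NAME token into a
        # registry lookup and call: 2.34d -> __user_suffixes__['d']('2.34')
        toks = list(tokenize.generate_tokens(io.StringIO(source).readline))
        out, i = [], 0
        while i < len(toks):
            tok = toks[i]
            nxt = toks[i + 1] if i + 1 < len(toks) else None
            if (tok.type == tokenize.NUMBER and nxt is not None
                    and nxt.type == tokenize.NAME and nxt.start == tok.end):
                out.extend([
                    (tokenize.NAME, '__user_suffixes__'),
                    (tokenize.OP, '['),
                    (tokenize.STRING, repr(nxt.string)),
                    (tokenize.OP, ']'),
                    (tokenize.OP, '('),
                    (tokenize.STRING, repr(tok.string)),
                    (tokenize.OP, ')'),
                ])
                i += 2
            else:
                out.append((tok.type, tok.string))
                i += 1
        return tokenize.untokenize(out)

An import hook just runs every module's source through something like that
before handing it to compile.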
I can’t _guarantee_ that the real version would be simpler until I try it. And
I don’t want to hijack the OP’s thread and replace his proposal (which does
give me what I want) with mine (which doesn’t give him what he wants), unless
he abandons the idea of attempting to implement his version. But I’m pretty
confident it would be as simple as it sounds, which is even simpler than the
hacky version (which, again, is dead simple and works today).
And most variations on the idea you could design would be just as simple. Maybe
the OP will perversely design one that isn’t. If so, it’s his job to show that
it can be implemented. And if he gives up, then I’ll argue for something that I
can implement simply. But I don’t think that’s even going to come up.
> What if you want to have a string, and what if you want
> to have that string contain backslashes or quotes? If you want to say
> that this doesn't add complexity, give us some SIMPLE rules that
> explain this.
Well, that works exactly the same way a string does today (including the
optional r prefix). The closing quote can now be followed by a string of
identifier characters, but everything up to there is exactly the same as today.
So, it doesn’t add any complexity, because it uses the same rules as today.
I did suggest, as a throwaway addon to the OP’s proposal, that you could
instead do raw strings or even really-raw (the string ends at the first
matching quote; backslashes mean nothing). I don’t know if he wants either of
those, but if he does, raw string literals are already defined in the grammar
and implemented in the tokenizer, and really-raw is an even simpler grammar
(identical to the existing grammar except that instead of `longstringchar |
stringescapeseq` there’s a `<any source character except the quote>` node, and
the same for `shortstringitem`).
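Sketched in the reference manual's notation (the rr names are mine, and I'm
assuming short strings keep the no-newline restriction):

    rrshortstring     ::= "'" rrshortstringitem* "'" | '"' rrshortstringitem* '"'
    rrshortstringitem ::= <any source character except a newline or the quote>
    rrlongstring      ::= "'''" rrlongstringitem* "'''" | '"""' rrlongstringitem* '"""'
    rrlongstringitem  ::= <any source character except the matching triple quote>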
> And make absolutely sure that the rules are identical for EVERY
> possible custom prefix/suffix,
Well, in my version, since the rule for `suffixedstringliteral` is just
`stringliteral identifier`, of course it's the same for every possible suffix;
there’s no conceivable way it could be different.
If the OP wants to propose something more complicated that provides some way of
selecting different rules, he could, but I don’t think he has, and if he
doesn’t, then the issue will be equally nonexistent. I don’t know whether he
wants to interact with the existing string prefixes (or, if so, how that
works), or always do normal strings, or always do really-raw strings, or what,
but there are multiple plausible designs, most of which are not impossible or
even complicated, so the fact that you can imagine that there might be a design
that would be impossible really isn’t relevant.
Just to show how easy it is to come up with something (but which, again, may
not be what the OP actually wants here): a `stringliteral` is now a
`stringprefix` followed by `shortstring` or `longstring` (as today), or an
`identifier` followed by `rrshortstring` or `rrlongstring`. The rr tokens are
defined as I described above: they end at the first matching quote, no
backslashing.
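That is, something like this, reusing the rr productions sketched earlier:

    stringliteral ::= [stringprefix] (shortstring | longstring)
                    | identifier (rrshortstring | rrlongstring)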
This option would have some limitations: people can't use `\"` to escape quotes
in prefixed strings, there's no way to get prefixed bytes, you probably can't
call a prefix “bub”… does that make some of the OP's desired use cases or some
of the 2013 use cases no longer viable? I don’t know. If so, the OP presumably
won’t use this option and will use a different one. Any option will have some
limitations, and I don’t know which one he wants, but there are a huge number
of simple, and nonmagical, options that he could pick.
>> Compare that adding Decimal (and Fraction, as you said last time) literals
>> when the types aren’t even builtin. That’s more complexity, for less
>> benefit. So why is it better?
>
> Actually no, it's a lot less complexity, because it's all baked into
> the language.
Making the language definition and the interpreter and the compiler more
complicated doesn’t eliminate the complexity, it just moves it somewhere else.
> You don't have to have the affix registry to figure out
> how to parse a script into AST.
You don’t need the registry to parse to an AST for my proposal either; it’s
only used at runtime.
And, while the OP didn’t give us a grammar, he did give us proposed bytecode
output of (one version of) his idea, and it’s pretty obvious that the registry
isn’t getting involved until the interpreter eval loop processes the new
registry-lookup opcode, so it clearly isn’t involved in parsing.
And why would it get involved in parsing? It’s not like someone is proposing
Rust or Dylan macros here.
> The definition of a "literal" is given
> by the tokenizer, and for instance, "-1+2j" is not a literal. How is
> this going to impact your registry?
Not at all. Why would it?
> The distinction doesn't matter to
> Decimal or Fraction, because you can perform operations on them at
> compile time and retain the results, so "-1.23d" would syntactically
> be unary negation on the literal Decimal("1.23"), and -4/5f would be
> unary negation on the integer 4 and division between that and
> Fraction(5). But does that work with your proposed registry?
Yes, of course it does. That should be obvious from the fact that I said that
`1/2F` would end up equivalent to `1/Fraction(2)`.
Concretely, it ends up as something like `1/sys.__user_suffixes__['F']('2')`,
except probably with nicer error handling, so you don't get a KeyError for an
unknown suffix. Notice that the only thing looked up in the registry is the
function to process the text, and this doesn’t need to happen until runtime,
long after the code has been not just parsed, but compiled.
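As a sketch of just that runtime half (the `sys.__user_suffixes__` and
`register_suffix` names are from my proposal, nothing that exists today, and
the error handling is one plausible choice):

    import sys
    from fractions import Fraction

    sys.__user_suffixes__ = {}  # the hypothetical central registry

    def register_suffix(suffix, func):
        sys.__user_suffixes__[suffix] = func

    register_suffix('F', Fraction)

    def apply_suffix(text, suffix):
        # What compiled code for a suffixed literal would do at runtime.
        try:
            func = sys.__user_suffixes__[suffix]
        except KeyError:
            # nicer than letting the bare KeyError escape
            raise NameError('unknown literal suffix %r' % suffix) from None
        return func(text)

    # 1/2F compiles to the moral equivalent of:
    assert 1 / apply_suffix('2', 'F') == Fraction(1, 2)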
Of course the OP’s version will be a little different. He wants to handle both
prefixes and suffixes by looking up the prefix and passing the suffix as a
second argument. And I'm not sure what exactly he wants as the main argument.
But I still don’t see any reason it would need to look in the registry at
tokenizer or parse or compile time. And again, his proposed bytecode
translation implies that it doesn’t do so. So why imagine that it has to when
there’s no visible reason for it?
> What is a
> "literal token", and would it need to include these kinds of things?
How could this not be obvious? I deliberately chose the phrase “literal token”,
and you clearly understand what this means because you invoked that meaning
just one paragraph above. I also provided a link to a hacky implementation that
blatantly relies on the tokenizer’s current processing of literals. And I gave
examples that make it clear that `2` is a literal token and `1/2` is not. So
why do you even need to ask whether `-4/5` is one? How could `-4/5` possibly be
a literal token if `1/2` is not, when it isn't a single token at all?
> What if some registered types need to include them and some don't?
They can’t.
The simple rule works for every numeric example everyone has come up with so
far, even Steven’s facetious quaternion example that he proposed as too
ridiculous for anyone to actually want.
Is it a flaw that there might be some examples nobody has thought of that would
work with a much more complicated feature but won't work with this one? Of
course not. That's true of every feature ever.
There’s no reason to ignore the obvious simple design and try to imagine more
complicated designs that may or may not solve additional problems that nobody’s
even imagined just so you can dismiss the idea as too complicated.