[Python-ideas] Improve handling of Unicode quotes and hyphens

Steve Barnes Sun, 10 May 2020 00:10:08 -0700

Hi All,

Apologies if this has already been discussed to death.


Python 3 allows Unicode characters in strings and identifiers but the actual 
quotation marks are only accepted in plain ASCII, i.e. the following all 
successfully initialise strings:

```
S1 = "Double Quoted" # Opened and closed with chr(34), i.e. 0x22
S2 = 'Single Quoted' # Opened and closed with chr(39), i.e. 0x27
```
But the following all result in an error - "SyntaxError: invalid character in 
identifier":

```
S1 = “Double Quoted” # Opened with \u201c and closed with \u201d
S2 = ‘Single Quoted’ # Opened with \u2018 and closed with \u2019
```
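
For anyone who wants to reproduce this without fighting their mail client, the source can be built with escapes and handed to `compile`. This is only a minimal sketch: the helper name `raises_syntax_error` is made up for illustration, and the exact wording of the message differs between Python versions, so the sketch only checks that a SyntaxError is raised.

```python
def raises_syntax_error(source: str) -> bool:
    """Return True if compiling `source` fails with SyntaxError."""
    try:
        compile(source, "<pasted>", "exec")
    except SyntaxError:
        return True
    return False


ascii_src = 'S1 = "Double Quoted"'               # plain chr(34) quotes
curly_src = "S1 = \u201cDouble Quoted\u201d"      # typographic quotes

print(raises_syntax_error(ascii_src))
print(raises_syntax_error(curly_src))
```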
To the experienced eye, and depending on the font used, it is "obvious" what 
the problem is: the wrong quotation marks were used. The big problem, 
especially for beginners, is that the same keys were typed, just in the 
"wrong" editor, or even in the wrong editor mode or context. I have found that 
Outlook leaves the quotes alone if the font is FixSys or I am replying to a 
plain text email, but otherwise it is "helpful". Unfortunately, especially on 
Windows, "wrong" editors abound and include, but are not limited to, 
MS-Outlook, MS-Word and some online editing environments such as Quora.

On top of that is the helpful substitution of an en dash (\u2013) for the 
minus sign, which happens when you press space a word later, so:

A = 3 – 2 # With a following space: syntax error due to \u2013
A = 3 - 2 # No space or CR typed afterwards, so it stays 0x2d and is OK
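
Since the two lines can look identical in many fonts, one way to see the difference is to ask the standard `unicodedata` module what each character is. A small sketch; the helper name `describe` is hypothetical:

```python
import unicodedata


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"


print(describe("-"))        # U+002D HYPHEN-MINUS
print(describe("\u2013"))   # U+2013 EN DASH
```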

Use cases that catch people out:

  1.  Sending a code snippet by email using Outlook
  2.  User manuals written in MS-Word (many people's work environment)
  3.  Articles published on Quora - people expect to be able to copy and paste 
the code for some reason.

I am sure that many of us have encountered these issues or similar.

What can be done?

  1.  Persuade Microsoft, and others, to stop being so helpful by default - 
good luck with that!
  2.  Tell all users that they need to use a "proper" editor or IDE - This 
seems like adding an additional barrier to new & casual users.
  3.  Better yet tell them to use a "proper" OS like .... - At the very least 
many of us have to use Windows at work.
  4.  Start accepting hyphens as minus & Unicode quotation marks - this would 
be the ideal answer for pasted code but has a lot of possible things to iron 
out, such as whether we require that the quotes match and are in the 
typographically correct order. It is also quite a big & complex change to the 
Python interpreter.
  5.  Normalise the input to the Python interpreter (at least for these 
characters and possibly a few others) so that entering or reading from a file 
S1 = “Double Quoted” becomes S1 = "Double Quoted", etc. - this should be an 
easier change to the interpreter but, from a purist point of view, could be 
said to make us as bad as the others because we are not honouring what the user 
entered.
  6.  Change the error message "SyntaxError: invalid character in identifier" 
to include which character and its Unicode value so that it becomes 
"SyntaxError: invalid character 0x201c “ in identifier" - this is almost 
certainly the easiest change and fits well with "explicit is better than 
implicit" but still leaves it to the user to correct the erroneous input (which 
could be argued is both good and bad).
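
To make options 5 and 6 a little more concrete, here is a rough sketch of both done outside the interpreter, as a pre-processing step on pasted source. Everything in it is an assumption for illustration: the names `LOOKALIKES`, `normalise_source` and `explain_non_ascii` are made up, the set of characters is just the ones mentioned above, and none of this is how the interpreter itself would have to do it.

```python
import unicodedata
from typing import List

# Hypothetical mapping for illustration; which characters belong here is an
# open question, not something settled by the proposal.
LOOKALIKES = str.maketrans({
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u2013": "-",  # EN DASH
})


def normalise_source(source: str) -> str:
    """Option 5: blindly replace look-alike characters with ASCII."""
    return source.translate(LOOKALIKES)


def explain_non_ascii(source: str) -> List[str]:
    """Option 6: report each non-ASCII character with position and value."""
    notes = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                name = unicodedata.name(ch, "UNNAMED")
                notes.append(
                    f"line {lineno}, column {col}: 0x{ord(ch):04x} {ch} ({name})"
                )
    return notes


pasted = "S1 = \u201cDouble Quoted\u201d\nA = 3 \u2013 2\n"

# Option 6: tell the user exactly what was found and where.
for note in explain_non_ascii(pasted):
    print(note)

# Option 5: silently fix it up, then compile and run as usual.
namespace = {}
exec(compile(normalise_source(pasted), "<pasted>", "exec"), namespace)
print(namespace["S1"], namespace["A"])
```

Note that the blanket `translate` also rewrites these characters inside string literals and comments, where they may be intended, which is exactly the "not honouring what the user entered" objection. The reporting function has the mirror-image weakness: it flags every non-ASCII character, including perfectly legitimate ones, so it is only useful as a hint alongside a real SyntaxError.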

I would like to suggest that an incremental approach might be best: clarify 
the existing error message first, since that should not break anything, and 
then either substitute for the problem characters or process them "properly" 
as a later enhancement.

Steve Barnes

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/ILMNJ46EAL4ENYK7LLDLGIMYQKZAMMWU/
Code of Conduct: http://python.org/psf/codeofconduct/
