[issue12486] tokenize module should have a unicode API

Martin Panter Sun, 04 Oct 2015 20:28:16 -0700

Martin Panter added the comment:

I agree it would be very useful to be able to tokenize arbitrary text without 
worrying about encoding tokens. I left some suggestions for the documentation 
changes. Some test cases for it would also be good.


However I wonder if a separate function would be better for the text mode 
tokenization. It would make it clearer when an ENCODING token is expected and 
when it isn’t, and would avoid any confusion about what happens when readline() 
returns a byte string one time and a text string another time. Also, having 
untokenize() change its output type depending on the ENCODING token seems like 
bad design to me.
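
To illustrate the untokenize() concern, here is a minimal sketch using only 
the standard library. The sample source line and the variable names are made 
up for illustration; the behaviour shown is that of the existing bytes-based 
tokenize.tokenize() and untokenize().

```python
import io
import tokenize

# Hypothetical one-line source, supplied as bytes to the bytes-based API.
source = b"x = 1\n"
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

# The bytes-based tokenizer always emits an ENCODING token first.
first = tokens[0]

# With the ENCODING token present, untokenize() returns bytes ...
with_encoding = tokenize.untokenize(tokens)

# ... and with that one token dropped, the same call returns str.
without_encoding = tokenize.untokenize(tokens[1:])

print(tokenize.tok_name[first.type], first.string)
print(type(with_encoding).__name__, type(without_encoding).__name__)
```

The return type of the same call depends on whether one particular token is 
in the stream, which is the behaviour questioned above.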

Why not just bless the existing generate_tokens() function as a public API, 
perhaps renaming it to something clearer like tokenize_text() or 
tokenize_text_lines() at the same time?
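
As a rough sketch of what that text-mode entry point already does, assuming 
generate_tokens() is called with a readline that returns str (the sample text 
and variable names here are invented; tokenize_text() itself is only a 
proposed name and is not defined anywhere):

```python
import io
import tokenize

# Hypothetical text input; no encoding step is involved.
text = "total = price * 2\n"

# generate_tokens() takes a readline callable that returns str lines.
tokens = list(tokenize.generate_tokens(io.StringIO(text).readline))

types = [tok.type for tok in tokens]
strings = [tok.string for tok in tokens]

# No ENCODING token appears in the text-mode stream.
has_encoding_token = tokenize.ENCODING in types

for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string))
```

Because the input is already text, the stream starts directly with the first 
real token and callers never need to handle ENCODING.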

----------
nosy: +martin.panter
stage:  -> patch review

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12486>
_______________________________________