On 05Apr2019 0912, Victor Stinner wrote:
About PyPreConfig and encodings.
[...]
* ``PyInitError Py_PreInitialize(const PyPreConfig *config)``
* ``PyInitError Py_PreInitializeFromArgs(const PyPreConfig *config,
  int argc, char **argv)``
* ``PyInitError Py_PreInitializeFromWideArgs(const PyPreConfig *config,
  int argc, wchar_t **argv)``

I hope to one day be able to support multiple runtimes per process - can
we have an opaque PyRuntime object exposed publicly now and passed into
these functions?

I hesitated to include a "_PyRuntimeState*" parameter somewhere, but I
chose to not do so.

Currently, there is a single global variable _PyRuntime which has the type
_PyRuntimeState. The _PyRuntime_Initialize() API is designed around this
global variable. For example, _PyRuntimeState contains the registry of
interpreters: you don't want to have multiple registries :-)

I understood that we should only have a single instance of
_PyRuntimeState. So IMHO it's fine to keep it private at this point.
There is no need to expose it in the API.

So I didn't want to expose that particular object right now, but just some sort of "void*" parameter in the new APIs (and require either NULL or a known value be passed). That gives us the freedom to enable multiple runtimes in the future without having to change the API shape.
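To make the "NULL or a known value" idea concrete, here is a minimal sketch. Everything prefixed with demo_/Demo/DEMO_ is hypothetical and only stands in for the real types (PyPreConfig, PyInitError); it is not an existing or proposed CPython signature.

```c
#include <stddef.h>

/* Hypothetical stand-ins for PyPreConfig and PyInitError. */
typedef struct { int utf8_mode; } DemoPreConfig;
typedef struct { int failed; const char *msg; } DemoInitError;

/* The only non-NULL handle accepted for now: the single global runtime. */
static int demo_default_runtime;
#define DEMO_RUNTIME_DEFAULT ((void *)&demo_default_runtime)

/* Accept NULL or the known handle and reject anything else, so the
   parameter can gain a real meaning later without changing the API shape. */
DemoInitError demo_pre_initialize(void *runtime, const DemoPreConfig *config)
{
    DemoInitError ok = {0, NULL};
    DemoInitError bad = {1, "unknown runtime handle"};

    (void)config;  /* configuration handling is out of scope for this sketch */
    if (runtime != NULL && runtime != DEMO_RUNTIME_DEFAULT) {
        return bad;
    }
    return ok;
}
```

Callers that pass NULL today would keep working if multiple runtimes were added later, while any other value fails loudly instead of being silently ignored.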

FYI I tried to design an internal API with a "context" to pass
_PyRuntimeState, PyPreConfig, _PyConfig, the current interpreter, etc.
[...]
There are 2 possible implementations:

* Modify *all* functions to add a new "context" parameter and modify *all*
   functions to pass this parameter to sub-functions.
* Store the current "context" as a thread local variable or something like
   that.
[...]
For the second option: well, there is no API change needed!
It can be done later.
Moreover, we already have such API! PyThreadState_Get() gets the Python
thread state of the current thread: the current interpreter can be
accessed from there.

Yes, this is what I had in mind as a transition. I think eventually it would be best to have the context parameter, as thread-local variables have overhead and add significant complexity (particularly when debugging crashes), but making that change is huge.
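The two options discussed above can be sketched side by side. This is only an illustration: DemoContext and the demo_ functions are hypothetical, and the thread-local storage uses C11 _Thread_local rather than whatever mechanism CPython would actually pick.

```c
#include <stddef.h>

/* Hypothetical context; a real one would carry the runtime state,
   the configuration, the current interpreter, and so on. */
typedef struct { int interp_id; } DemoContext;

/* Option 2: the current context is stored per thread. */
static _Thread_local DemoContext *demo_current = NULL;

void demo_set_current(DemoContext *ctx) { demo_current = ctx; }

/* Option 1: every function receives the context explicitly. */
int demo_interp_id_explicit(const DemoContext *ctx)
{
    return ctx->interp_id;
}

/* Option 2: the signature stays as it is today and the context is
   looked up implicitly; returns -1 when no context has been set. */
int demo_interp_id_implicit(void)
{
    return demo_current != NULL ? demo_current->interp_id : -1;
}
```

The implicit variant needs no API change, which is why it works as a transition, but the hidden lookup is the part that makes crashes harder to debug.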

``PyPreConfig`` fields:

* ``coerce_c_locale_warn``: if non-zero, emit a warning if the C locale
   is coerced.
* ``coerce_c_locale``: if equal to 2, coerce the C locale; if equal to
   1, read the LC_CTYPE locale to decide whether it should be coerced.

Can we use another value for coerce_c_locale to determine whether to
warn or not? Save a field.

coerce_c_locale is already complex: it can have 4 values: -1, 0, 1 and 2.
I prefer to keep a separate field.

Moreover, I understood that you might want to coerce the C locale *and*
get the warning, or get the warning but *not* coerce the locale.

If we define meaningful constants, then it doesn't matter how many values it has. We could have PY_COERCE_LOCALE_AND_WARN, PY_COERCE_LOCALE_SILENTLY, PY_WARN_WITHOUT_COERCE etc. to represent the states. These actually make things simpler than trying to reason about how two similar parameters interact.
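A rough sketch of what such constants could look like follows. The three PY_ names come from the paragraph above and are only proposals, not an existing CPython API; the enum type and the demo_ helpers are hypothetical.

```c
/* Constant names as suggested above; none of them exist in CPython. */
typedef enum {
    PY_COERCE_LOCALE_SILENTLY,
    PY_COERCE_LOCALE_AND_WARN,
    PY_WARN_WITHOUT_COERCE
} DemoLocaleCoercion;

/* Non-zero if the C locale should be coerced in this mode. */
int demo_should_coerce(DemoLocaleCoercion mode)
{
    return mode == PY_COERCE_LOCALE_SILENTLY
        || mode == PY_COERCE_LOCALE_AND_WARN;
}

/* Non-zero if a warning should be emitted in this mode. */
int demo_should_warn(DemoLocaleCoercion mode)
{
    return mode == PY_COERCE_LOCALE_AND_WARN
        || mode == PY_WARN_WITHOUT_COERCE;
}
```

Each state is named once, so a reader does not have to work out how two separate integer fields combine.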

* ``legacy_windows_fs_encoding`` (Windows only): if non-zero, set the
   Python filesystem encoding to ``"mbcs"``.
* ``utf8_mode``: if non-zero, enable the UTF-8 mode

Why not just set the encodings here?

For different technical reasons, you simply cannot specify an encoding
name. You can also pass options to tell Python that you have some
preferences (PyPreConfig and PyConfig fields).

Python doesn't support arbitrary combinations of encoding and error
handler. In practice, it only supports a narrow set of choices. The main
implementations are the Py_EncodeLocale() and Py_DecodeLocale() functions,
which use the C codec of the current locale encoding to implement the
filesystem encoding before the codec implemented in Python can be used.

Basically, only the current locale encoding or UTF-8 are supported.
If you want UTF-8, enable the UTF-8 Mode.

If we already had a trivial way to specify the default encodings as a string before any initialization has occurred, I think we would have enabled UTF-8 mode by setting them to "utf-8" rather than adding a brand new flag.

Again, we either have a huge set of flags to infer certain values at certain times, or we can just make them directly settable. If we make them settable, it's much easier for users to reason about what is going to happen.

To load the Python codec, you need importlib. importlib needs to access
the filesystem which requires a codec to encode/decode file names
(PyConfig.module_search_paths uses Unicode wchar_t* strings, but the C API
only supports bytes char* strings).

Right, and in the few places where we need an encoding *before* we can load arbitrary ones, we can easily compare the strings and fail if someone's trying to do something unusual (or if the platform can do the lookup itself, it could succeed). If we say "passing NULL means use the default" then we have that handled, and the actual encoding just gets set to the real default once we figure out what that is.

Py_PreInitialize() doesn't set the filesystem encoding. It initializes the
LC_CTYPE locale and Python global configuration variables (Py_UTF8Mode and
Py_LegacyWindowsFSEncodingFlag).

Right, I'm proposing a simplification here where it *does* set the filesystem encoding (even though it doesn't get used until Py_Initialize() is called). That way we can use the filesystem encoding to access the filesystem during initialization, provided it's one of the built-in supported ones (e.g. NULL, which means the C locale, or "utf-8" which means UTF-8) rather than relying on the tables in the standard library.
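The string comparison described here could look roughly like the sketch below. The enum, the function name, and the exact-match rule on "utf-8" are all assumptions made for illustration; real code would likely normalize the spelling of the name first.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    DEMO_FS_LOCALE,      /* NULL: use the current C locale encoding */
    DEMO_FS_UTF8,        /* "utf-8": handled by a built-in C codec */
    DEMO_FS_UNSUPPORTED  /* anything else would need the encodings package */
} DemoFsCodec;

/* Decide, before any Python-level codec can be imported, whether the
   requested filesystem encoding is one of the built-in supported ones. */
DemoFsCodec demo_resolve_fs_encoding(const char *name)
{
    if (name == NULL) {
        return DEMO_FS_LOCALE;
    }
    if (strcmp(name, "utf-8") == 0) {
        return DEMO_FS_UTF8;
    }
    return DEMO_FS_UNSUPPORTED;
}
```

An embedder asking for anything outside the built-in set gets DEMO_FS_UNSUPPORTED, which the caller can turn into an initialization error instead of guessing.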

Oh look, I said all this in my original email:

Obviously we are not ready to import most encodings after pre
initialization, but I think that's okay. Embedders who set something
outside the range of what can be used without importing encodings will
get an error to that effect if we try.

You need a C implementation of the Python filesystem encoding very early
in Python initialization. You cannot start with one encoding and "later"
switch the encoding. I tried multiple times over the last 10 years and I
always failed to do that. All attempts failed with mojibake at different
levels.

Again, this is for embedders. Regular Python users will only ever request "NULL" or "utf-8", depending on the UTF-8 mode flag. And embedders have to make sure they get what they ask for and also can't change it later.

The problems you've hit in the past have always been to do with trying to infer or guess the actual encoding, rather than simply letting someone tell you what it is (via config) and letting them deal with the failure.

In fact, I'd be totally okay with letting embedders specify their own
function pointer here to do encoding/decoding between Unicode and the OS
preferred encoding.
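As an illustration of the function-pointer idea, a host-provided codec might be passed as a pair of callbacks. The types, names, and the return convention (number of elements written, or (size_t)-1 on failure) are hypothetical, and the toy codec handles ASCII only.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical callback signatures; lengths and capacities are counted
   in elements and no terminator is written. */
typedef size_t (*demo_fs_encode_fn)(const wchar_t *src, size_t len,
                                    char *dst, size_t cap);
typedef size_t (*demo_fs_decode_fn)(const char *src, size_t len,
                                    wchar_t *dst, size_t cap);

typedef struct {
    demo_fs_encode_fn encode;
    demo_fs_decode_fn decode;
} DemoFsCodecHooks;

/* Toy host codec: fails on anything outside ASCII. */
static size_t ascii_encode(const wchar_t *src, size_t len,
                           char *dst, size_t cap)
{
    if (len > cap) return (size_t)-1;
    for (size_t i = 0; i < len; i++) {
        if ((unsigned long)src[i] > 0x7f) return (size_t)-1;
        dst[i] = (char)src[i];
    }
    return len;
}

static size_t ascii_decode(const char *src, size_t len,
                           wchar_t *dst, size_t cap)
{
    if (len > cap) return (size_t)-1;
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)src[i] > 0x7f) return (size_t)-1;
        dst[i] = (wchar_t)src[i];
    }
    return len;
}

/* What an embedder would hand over at pre-initialization time. */
const DemoFsCodecHooks demo_ascii_hooks = { ascii_encode, ascii_decode };
```

With hooks like these, a failed conversion is reported by the embedder's own codec rather than inferred by the runtime.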

In my experience, when someone wants to get a specific encoding: they
only want UTF-8. There is now the UTF-8 Mode which ignores the locale
and forces the usage of UTF-8.

Your experience here sounds like it's limited to POSIX systems. I've wanted UTF-16 before, and would have been able to provide it if Python had allowed me to supply a callback to encode/decode.

And again, all this is about "why do we need to define a boolean that determines what the encoding is when we can just let people tell us what encoding they want". There's a good chance that an embedded Python isn't going to touch the real filesystem anyway.

I'm not sure that there is a need to have a custom codec. Moreover, if
there is an API to pass a codec in C, you will need to expose it somehow
at the Python level for os.fsencode() and os.fsdecode().

We need to expose those operations anyway, and os.fsencode/fsdecode have their own issues (particularly since there *are* ways to change filesystem encoding while running). Turning them into actual native functions that might call out to a host-provided callback would not be difficult.

Currently, Python ensures during an early stage of startup that
codecs.lookup(sys.getfilesystemencoding()) works: there is an existing
Python codec for the requested filesystem encoding.

Right, it's a validation step. But we can also make codecs.lookup("whatever the file system encoding is") return something based on os.fsencode() and os.fsdecode(). We're not actually beholden to the current implementations here - we are allowed to change them! ;)

