Re: [Python-Dev] New Python Initialization API

Steve Dower Tue, 09 Apr 2019 13:43:08 -0700

On 05Apr2019 0912, Victor Stinner wrote:

About PyPreConfig and encodings.
[...]

* ``PyInitError Py_PreInitialize(const PyPreConfig *config)``
* ``PyInitError Py_PreInitializeFromArgs(const PyPreConfig *config,
  int argc, char **argv)``
* ``PyInitError Py_PreInitializeFromWideArgs(const PyPreConfig *config,
  int argc, wchar_t **argv)``


I hope to one day be able to support multiple runtimes per process - can
we have an opaque PyRuntime object exposed publicly now and passed into
these functions?


I hesitated to include a "_PyRuntimeState*" parameter somewhere, but I
chose to not do so.

Currently, there is a single global variable _PyRuntime which has the type
_PyRuntimeState. The _PyRuntime_Initialize() API is designed around this
global variable. For example, _PyRuntimeState contains the registry of
interpreters: you don't want to have multiple registries :-)

I understood that we should only have a single instance of
_PyRuntimeState. So IMHO it's fine to keep it private at this point.
There is no need to expose it in the API.

So I didn't want to expose that particular object right now, but just
some sort of "void*" parameter in the new APIs (and require either NULL
or a known value be passed). That gives us the freedom to enable
multiple runtimes in the future without having to change the API shape.
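
A minimal sketch of that "void*" idea, in plain C so it stands alone.
Every name here (demo_preinitialize, demo_preconfig,
DEMO_DEFAULT_RUNTIME, the status codes) is hypothetical and not part of
the proposed API; the real functions return PyInitError and take a
PyPreConfig.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes; the real proposal returns PyInitError. */
typedef enum { DEMO_OK = 0, DEMO_ERR_BAD_RUNTIME = 1 } demo_status;

/* Hypothetical stand-in for PyPreConfig. */
typedef struct { int utf8_mode; } demo_preconfig;

/* The single "known value" accepted today: the address of this token. */
static int demo_default_runtime_token;
#define DEMO_DEFAULT_RUNTIME ((void *)&demo_default_runtime_token)

/* Accept NULL or the known token; reject anything else so that other
   values stay free to mean "a specific runtime" in the future. */
demo_status demo_preinitialize(void *runtime, const demo_preconfig *config)
{
    assert(config != NULL);
    if (runtime != NULL && runtime != DEMO_DEFAULT_RUNTIME) {
        return DEMO_ERR_BAD_RUNTIME;
    }
    /* ... pre-initialization of the single global runtime goes here ... */
    return DEMO_OK;
}
```

Callers that pass NULL today keep working unchanged if the parameter
later starts accepting real runtime handles.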

FYI I tried to design an internal API with a "context" to pass
_PyRuntimeState, PyPreConfig, _PyConfig, the current interpreter, etc.
[...]
There are 2 possible implementations:

* Modify *all* functions to add a new "context" parameter and modify *all*
   functions to pass this parameter to sub-functions.
* Store the current "context" as a thread local variable or something like
   that.
[...]
For the second option: well, there is no API change needed!
It can be done later.
Moreover, we already have such API! PyThreadState_Get() gets the Python
thread state of the current thread: the current interpreter can be
accessed from there.

Yes, this is what I had in mind as a transition. I think eventually it
would be best to have the context parameter, as thread-local variables
have overhead and add significant complexity (particularly when
debugging crashes), but making that change is huge.
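
To make the two options concrete, here is a rough sketch, assuming a
C11 compiler for _Thread_local. The demo_context type and all function
names are made up for illustration; CPython's actual thread-state
machinery is more involved than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context object; a real one would carry the runtime
   state, the configs, the current interpreter, and so on. */
typedef struct { int interp_id; } demo_context;

/* Option 1: explicit parameter. Every caller must thread ctx through. */
int demo_interp_id_explicit(const demo_context *ctx)
{
    assert(ctx != NULL);
    return ctx->interp_id;
}

/* Option 2: thread-local storage. No signature changes, but the
   dependency on the current context is hidden from the caller. */
static _Thread_local const demo_context *demo_current_ctx = NULL;

void demo_set_current(const demo_context *ctx)
{
    demo_current_ctx = ctx;
}

int demo_interp_id_implicit(void)
{
    assert(demo_current_ctx != NULL);
    return demo_current_ctx->interp_id;
}
```

The second form is what allows deferring the API change: the public
signatures stay the same while the lookup moves behind them.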

``PyPreConfig`` fields:

* ``coerce_c_locale_warn``: if non-zero, emit a warning if the C locale
   is coerced.
* ``coerce_c_locale``: if equal to 2, coerce the C locale; if equal to
   1, read the LC_CTYPE to decide if it should be coerced.


Can we use another value for coerce_c_locale to determine whether to
warn or not? Save a field.


coerce_c_locale is already complex: it can have 4 values (-1, 0, 1 and 2).
I prefer to keep a separate field.

Moreover, I understood that you might want to coerce the C locale *and*
get the warning, or get the warning but *not* coerce the locale.

If we define meaningful constants, then it doesn't matter how many
values it has. We could have PY_COERCE_LOCALE_AND_WARN,
PY_COERCE_LOCALE_SILENTLY, PY_WARN_WITHOUT_COERCE etc. to represent the
states. These actually make things simpler than trying to reason about
how two similar parameters interact.
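
Roughly what that could look like, as a sketch only. The DEMO_ names
mirror the constants suggested above but are hypothetical, the
"neither" state is an assumption added for completeness, and the
existing "read LC_CTYPE to decide" behaviour is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* One field, one named state per behaviour (hypothetical names). */
typedef enum {
    DEMO_NO_COERCE_NO_WARN = 0,   /* assumption: not named in the thread */
    DEMO_COERCE_LOCALE_SILENTLY,
    DEMO_WARN_WITHOUT_COERCE,
    DEMO_COERCE_LOCALE_AND_WARN
} demo_locale_policy;

/* Should the C locale be coerced under this policy? */
bool demo_should_coerce(demo_locale_policy policy)
{
    assert(policy >= DEMO_NO_COERCE_NO_WARN &&
           policy <= DEMO_COERCE_LOCALE_AND_WARN);
    return policy == DEMO_COERCE_LOCALE_SILENTLY ||
           policy == DEMO_COERCE_LOCALE_AND_WARN;
}

/* Should a warning be emitted under this policy? */
bool demo_should_warn(demo_locale_policy policy)
{
    assert(policy >= DEMO_NO_COERCE_NO_WARN &&
           policy <= DEMO_COERCE_LOCALE_AND_WARN);
    return policy == DEMO_WARN_WITHOUT_COERCE ||
           policy == DEMO_COERCE_LOCALE_AND_WARN;
}
```

Each combination of "coerce" and "warn" gets exactly one name, so
there is no pair of fields whose interaction has to be documented.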

* ``legacy_windows_fs_encoding`` (Windows only): if non-zero, set the
   Python filesystem encoding to ``"mbcs"``.
* ``utf8_mode``: if non-zero, enable the UTF-8 mode


Why not just set the encodings here?


For various technical reasons, you simply cannot specify an encoding
name. You can also pass options to tell Python that you have some
preferences (PyPreConfig and PyConfig fields).

Python doesn't support arbitrary combinations of encodings and error
handlers. In practice, it only supports a narrow set of choices. The main
implementations are the Py_EncodeLocale() and Py_DecodeLocale() functions,
which use the C codec of the current locale encoding to implement the
filesystem encoding before the codec implemented in Python can be used.

Basically, only the current locale encoding or UTF-8 are supported.
If you want UTF-8, enable the UTF-8 Mode.

If we already had a trivial way to specify the default encodings as a
string before any initialization has occurred, I think we would have
made UTF-8 mode enabled by setting them to "utf-8" rather than a brand
new flag.

Again, we either have a huge set of flags to infer certain values at
certain times, or we can just make them directly settable. If we make
them settable, it's much easier for users to reason about what is going
to happen.

To load the Python codec, you need importlib. importlib needs to access
the filesystem which requires a codec to encode/decode file names
(PyConfig.module_search_paths uses Unicode wchar_t* strings, but the C API
only supports bytes char* strings).

Right, and the few places where we need an encoding *before* we can load
any arbitrary ones we can easily compare the strings and fail if
someone's trying to do something unusual (or if the platform can do the
lookup itself, it could succeed). If we say "passing NULL means use the
default" then we have that handled, and the actual encoding just gets
set to the real default once we figure out what that is.

Py_PreInitialize() doesn't set the filesystem encoding. It initializes the
LC_CTYPE locale and Python global configuration variables (Py_UTF8Mode and
Py_LegacyWindowsFSEncodingFlag).

Right, I'm proposing a simplification here where it *does* set the
filesystem encoding (even though it doesn't get used until
Py_Initialize() is called). That way we can use the filesystem encoding
to access the filesystem during initialization, provided it's one of the
built-in supported ones (e.g. NULL, which means the C locale, or "utf-8"
which means UTF-8) rather than relying on the tables in the standard
library.
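
As a sketch of the string comparison described above (all names are
hypothetical, and a real version would presumably also normalise
spellings such as "UTF8"):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encodings usable before the codec machinery can be imported. */
typedef enum { DEMO_ENC_LOCALE, DEMO_ENC_UTF8 } demo_builtin_encoding;

/* Resolve a requested filesystem encoding name early in startup.
   NULL means "use the default" (the C locale encoding); "utf-8" means
   UTF-8. Returns 0 on success, or -1 for anything that would need the
   codecs from the standard library, so the embedder gets an error
   up front instead of mojibake later. */
int demo_resolve_fs_encoding(const char *name, demo_builtin_encoding *out)
{
    assert(out != NULL);
    if (name == NULL) {
        *out = DEMO_ENC_LOCALE;
        return 0;
    }
    if (strcmp(name, "utf-8") == 0) {
        *out = DEMO_ENC_UTF8;
        return 0;
    }
    return -1;
}
```

Nothing is inferred here: the embedder says what they want, and the
only decision made during pre-initialization is "supported or not".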


Oh look, I said all this in my original email:

Obviously we are not ready to import most encodings after pre
initialization, but I think that's okay. Embedders who set something
outside the range of what can be used without importing encodings will
get an error to that effect if we try.


You need a C implementation of the Python filesystem encoding very early
in Python initialization. You cannot start with one encoding and "later"
switch the encoding. I tried multiple times over the last 10 years and
always failed. All attempts ended with mojibake at different levels.

Again, this is for embedders. Regular Python users will only ever
request "NULL" or "utf-8", depending on the UTF-8 mode flag. And
embedders have to make sure they get what they ask for and also can't
change it later.

The problems you've hit in the past have always been to do with trying
to infer or guess the actual encoding, rather than simply letting
someone tell you what it is (via config) and letting them deal with the
failure.

In fact, I'd be totally okay with letting embedders specify their own
function pointer here to do encoding/decoding between Unicode and the OS
preferred encoding.
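
A sketch of what such a hook could look like. The callback type, the
demo_config struct and the ASCII-only host encoder are all hypothetical
and only illustrate the shape; nothing like this exists in the proposed
API.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical host callback: encode src into dst (capacity dst_size
   bytes, including the terminating NUL). Returns the number of bytes
   written excluding the NUL, or (size_t)-1 on failure. */
typedef size_t (*demo_fs_encode_fn)(const wchar_t *src, char *dst,
                                    size_t dst_size);

/* Hypothetical config field holding the embedder's encoder. */
typedef struct { demo_fs_encode_fn fs_encode; } demo_config;

/* Example host-provided encoder that only accepts ASCII. */
size_t demo_ascii_encode(const wchar_t *src, char *dst, size_t dst_size)
{
    size_t n = wcslen(src);
    if (n + 1 > dst_size) {
        return (size_t)-1;              /* output buffer too small */
    }
    for (size_t i = 0; i < n; i++) {
        if ((unsigned long)src[i] > 0x7f) {
            return (size_t)-1;          /* not encodable as ASCII */
        }
        dst[i] = (char)src[i];
    }
    dst[n] = '\0';
    return n;
}

/* The runtime side: delegate to whatever the embedder configured. */
size_t demo_fsencode(const demo_config *config, const wchar_t *src,
                     char *dst, size_t dst_size)
{
    assert(config != NULL && config->fs_encode != NULL);
    return config->fs_encode(src, dst, dst_size);
}
```

A decode callback would mirror this in the other direction.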


In my experience, when someone wants a specific encoding, they only
want UTF-8. There is now the UTF-8 Mode, which ignores the locale and
forces the usage of UTF-8.

Your experience here sounds like it's limited to POSIX systems. I've
wanted UTF-16 before, and would have been able to provide it (if Python
had allowed me to provide a callback to encode/decode).

And again, all this is about "why do we need to define a boolean that
determines what the encoding is when we can just let people tell us what
encoding they want". There's a good chance that an embedded Python isn't
going to touch the real filesystem anyway.

I'm not sure that there is a need to have a custom codec. Moreover, if
there is an API to pass a codec in C, you will need to expose it somehow
at the Python level for os.fsencode() and os.fsdecode().

We need to expose those operations anyway, and os.fsencode/fsdecode have
their own issues (particularly since there *are* ways to change
filesystem encoding while running). Turning them into actual native
functions that might call out to a host-provided callback would not be
difficult.

Currently, Python ensures during an early stage of startup that
codecs.lookup(sys.getfilesystemencoding()) works: there is an existing
Python codec for the requested filesystem encoding.

Right, it's a validation step. But we can also make
codecs.lookup("whatever the file system encoding is") return something
based on os.fsencode() and os.fsdecode(). We're not actually beholden to
the current implementations here - we are allowed to change them! ;)


