My quick reaction was somewhat different: it would be a great idea, but
it's entirely possible to implement this outside the stdlib as a
third-party module. So the fact that no one has yet done so suggests
there's less general interest than the OP implies.

And in my experience, the reason is that zipimport is almost always
sufficient; it's what tools like PyInstaller use, for example.
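
A minimal sketch of why it usually suffices (assuming a hypothetical
bundle.zip that contains mypkg/__init__.py):

    import sys

    sys.path.insert(0, "bundle.zip")  # zipimport's path hook takes it from here
    import mypkg                      # loaded straight from the archive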

Paul

On Fri, 3 Sep 2021 at 06:25, Guido van Rossum <gu...@python.org> wrote:

> Quick reaction: This feels like a bait and switch to me. Also, there are
> many advantages to using a standard format like zip (many formats are
> really zip with some conventions). Finally, the bytecode format you are
> using is “marshal”, and is fully portable — as is zip.
>
> On Thu, Sep 2, 2021 at 21:44 Gregory Szorc <gregory.sz...@gmail.com>
> wrote:
>
>> Over in https://bugs.python.org/issue45020 there is some exciting work
>> around expanding the use of the frozen importer to speed up Python
>> interpreter startup. I wholeheartedly support the effort and don't want to
>> discourage progress in this area.
>>
>> Simultaneously, I've been down this path before with PyOxidizer and feel
>> like I have some insight to share.
>>
>> I don't think I'll be offending anyone by saying the existing CPython
>> frozen importer is quite primitive in terms of functionality: it does the
>> minimum it needs to do to support importing module bytecode embedded in the
>> interpreter binary [for purposes of bootstrapping the Python-based
>> importlib modules]. The C struct representing frozen modules is literally
>> just the module name and a pointer to a sized buffer containing bytecode.
>>
>> In issue45020 there is talk of enhancing the functionality of the frozen
>> importer to support its potential broader use. For example, setting
>> __file__ or exposing .__loader__.get_source(). I support the overall
>> initiative.
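>>
>> To make the current limitations concrete, here is a minimal sketch
>> (assuming the stock FrozenImporter in CPython 3.10; __hello__ is a tiny
>> test module frozen into every interpreter build):
>>
>>     import importlib.util
>>     from importlib.machinery import FrozenImporter
>>
>>     spec = importlib.util.find_spec("__hello__")
>>     print(spec.loader is FrozenImporter)    # True
>>     print(spec.origin, spec.has_location)   # 'frozen' False (no __file__)
>>     print(FrozenImporter.get_source("__hello__"))  # None: source isn't kept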
>>
>> However, introducing enhanced functionality in the frozen importer will,
>> at the C level, require either:
>>
>> a) backwards-incompatible changes to the C API to support additional
>> metadata on frozen modules (or at the very least a supplementary API that
>> fragments what a "frozen" module is); or
>> b) CPython-only hacks to support additional functionality for "freezing"
>> the standard library for purposes of speeding up startup.
>>
>> I'm not a CPython core developer, but neither "a" nor "b" seems ideal to
>> me. "a" is backwards incompatible. "b" seems like a stop-gap solution until
>> a more generic version is available outside the CPython standard library.
>>
>> From my experience with PyOxidizer and software in general, here is what
>> I think is going to happen:
>>
>> 1. CPython enhances the frozen importer to be usable in more situations.
>> 2. Python programmers realize this solution has performance and
>> ease-of-distribution wins and want to use it more.
>> 3. Limitations in the frozen importer are found. Bugs are reported.
>> Feature requests are made.
>> 4. The frozen importer keeps getting incrementally extended or Python
>> developers grow frustrated that its enhancements are only available to the
>> standard library. You end up slowly reimplementing the importing mechanism
>> in C (remember Python 2?) or disappointing users.
>>
>> Rather than extending the frozen importer, I would suggest considering an
>> alternative solution that is far more useful to the long-term success of
>> Python: building a fully-featured, generic importer that is capable of
>> importing modules and resource data from a well-defined and portable
>> serialization format / data structure that isn't defined by C structs and
>> APIs.
>>
>> Instead of defining module bytecode (and possibly additional minimal
>> metadata) in C structs in a frozen modules array (or an equivalent C API),
>> what if we instead defined a serialization format for representing the
>> contents of loadable Python data (module source, module bytecode, resource
>> files, extension module library data, etc)? We could then point the Python
>> interpreter at instances of this data structure (in memory or in files) so
>> it could import/load the resources within using a meta path importer.
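>>
>> As a rough illustration of the shape of such a meta path importer, here
>> is a minimal sketch (the PACKED dict stands in for the real serialized
>> format, and every name in it is made up):
>>
>>     import sys
>>     import importlib.util
>>     from importlib.abc import Loader, MetaPathFinder
>>
>>     # Stand-in for a parsed "packed resources" blob: name -> (is_pkg, code)
>>     PACKED = {
>>         "demo_pkg": (True, compile("x = 1", "<packed>", "exec")),
>>         "demo_pkg.mod": (False, compile("y = 2", "<packed>", "exec")),
>>     }
>>
>>     class PackedFinder(MetaPathFinder, Loader):
>>         def find_spec(self, fullname, path=None, target=None):
>>             if fullname not in PACKED:
>>                 return None
>>             is_pkg, _ = PACKED[fullname]
>>             return importlib.util.spec_from_loader(
>>                 fullname, self, is_package=is_pkg)
>>
>>         def exec_module(self, module):
>>             # Run the stored code object in the new module's namespace.
>>             exec(PACKED[module.__name__][1], module.__dict__)
>>
>>     sys.meta_path.insert(0, PackedFinder())
>>     import demo_pkg.mod
>>     print(demo_pkg.x, demo_pkg.mod.y)  # 1 2
>>
>> A real implementation would parse the binary format lazily and answer
>> get_source()/importlib.resources-style queries from the same index.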
>>
>> What if this serialization format were designed so that it was extremely
>> efficient to parse and imports could be serviced with the same trivially
>> minimal overhead that the frozen importer currently has? We could embed
>> these data structures in produced binaries and achieve the same desirable
>> results we'll be getting in issue45020 all while delivering a more generic
>> solution.
>>
>> What if this serialization format were portable across machines? The
>> entire Python ecosystem could leverage it as a container format for
>> distributing Python resources. Rather than splatting dozens or hundreds of
>> files on the filesystem, you could write a single file with all of a
>> package's resources. Bugs around filesystem implementation details such as
>> case (in)sensitivity and Unicode normalization go away. Package installs
>> are quicker. Run-time performance is better due to faster imports.
>>
>> (OK, maybe that last point brings back bad memories of eggs and you
>> instinctively reject the idea. Or you have concerns about development
>> ergonomics when module source code isn't in standalone editable files.
>> These are fair points!)
>>
>> What if the Python interpreter gains an "app mode" where it is capable of
>> being paired with a single "resources file" and running the application
>> within? Think running zip applications today, but a bit faster, more
>> tailored to Python, and more fully featured.
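>>
>> (For reference, the closest analogue today is a zipapp archive from the
>> stdlib; a minimal sketch, where the "myapp" directory and myapp.cli:main
>> are purely hypothetical:
>>
>>     import zipapp
>>
>>     # Bundle the myapp/ directory into one file runnable as
>>     # "python myapp.pyz"; zipimport then serves the imports at run time.
>>     zipapp.create_archive("myapp", target="myapp.pyz",
>>                           main="myapp.cli:main")
>>
>> The "app mode" described above would play the same role, but backed by
>> the richer, faster-to-parse resources format.)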
>>
>> What if an efficient binary serialization format could be leveraged as a
>> cache to speed up subsequent interpreter startups?
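>>
>> Something along these lines is already possible with marshal alone (a
>> minimal sketch; app.py and app.cache are hypothetical names, and a real
>> cache would be invalidated on source changes, the way .pyc files are):
>>
>>     import marshal
>>
>>     with open("app.py") as f:
>>         code = compile(f.read(), "app.py", "exec")
>>     with open("app.cache", "wb") as f:
>>         marshal.dump(code, f)      # persist the compiled code object
>>
>>     # A later startup can skip parsing and compiling entirely:
>>     with open("app.cache", "rb") as f:
>>         exec(marshal.load(f))
>>
>> The point of a dedicated format would be to do this for many modules at
>> once, behind a single index, rather than one file per module.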
>>
>> These were all considerations on my mind in the early days of PyOxidizer
>> when I realized that the frozen importer and zip importers were lacking the
>> features I desired and I would need to find an alternative solution.
>>
>> One thing led to another and I have incrementally developed the "Python
>> packed resources" data format (
>> https://pyoxidizer.readthedocs.io/en/stable/pyoxidizer_packed_resources.html).
>> This is a binary format for representing Python source code, bytecode,
>> resource files, extension modules, even shared libraries that extension
>> modules rely on!
>>
>> Coupled with this format is the oxidized_importer meta path finder (
>> https://pypi.org/project/oxidized-importer/ and
>> https://pyoxidizer.readthedocs.io/en/latest/oxidized_importer.html)
>> capable of servicing imports and resource loading from these "Python packed
>> resources" data structures.
>>
>> From a super high level, PyOxidizer assembles an instance of "Python
>> packed resources" containing the CPython standard library and any
>> additional Python packages you point it at. It then produces an
>> executable whose main() starts a Python interpreter, configures
>> oxidized_importer.OxidizedFinder to read from the configured packed
>> resources data structure (which may be embedded in the binary or loaded
>> from a mmap()d file), and invokes some Python code inside to run your
>> application.
>>
>> oxidized_importer has an API for reading and writing "Python packed
>> resources" data structures. You can even use it to build your own
>> PyOxidizer-like utilities (
>> https://pyoxidizer.readthedocs.io/en/latest/oxidized_importer_freezing_applications.html
>> ).
>>
>> I bring this work up because I believe that if you set yourself on a path
>> to build a performant and fully featured importer/finder, you will
>> inevitably build something with properties very similar to what I have
>> built. To be uncompromising on performance, you'll want to roll your own
>> data format that is in tune with Python's specific needs and avoids I/O and
>> overhead when possible. To fully support the long tail of features in
>> Python's importing mechanism, you need the ability to richly - and
>> efficiently - express metadata like whether a module is a package. It is
>> possible to shoehorn this [meta]data into formats like tar and zip. But it
>> won't be as efficient as rolling your own data structure. And when it comes
>> to interpreter startup overhead, performance does matter.
>>
>> Am I suggesting CPython use oxidized_importer? No. It is implemented in
>> Rust and CPython can't take a Rust dependency.
>>
>> Am I suggesting CPython support the "Python packed resources" data format
>> as-is? No. The exact format today isn't suitable for CPython: I didn't
>> design it with consideration for use beyond PyOxidizer's use case and there
>> are still a ton of missing features.
>>
>> What I am suggesting is that Python developers think about the idea of
>> standardizing a Python-centric container format for holding "Python
>> resources" and a built-in/stdlib meta path finder for using it. Think of
>> this as "frozen/zip importer 2.0" but with a more strongly defined and
>> portable data format that is detached from C struct definitions. This could
>> potentially solve a lot of problems around startup/import performance. And
>> if you wanted to extend it to packaging/distribution, I think it could
>> solve a lot of problems there too. (If you designed the format properly, I
>> think it would be possible to converge with the use case of wheels.) (But I
>> understand the skepticism about making the leap to packaging: that is an
>> absurdly complex problem space!)
>>
>> If this idea sounds radical to you, I get the skepticism. I didn't want
>> to incur this work/complexity when writing PyOxidizer either. But a long
>> series of investigations and ruling out alternatives led me down this
>> path. With the benefit of hindsight, I believe this type of solution is
>> sound and that it is inevitable Python gains something like this in the
>> standard library, or at least sees something like this in wide use in the
>> wild. I say that because multiple Python app distribution tools have
>> reinvented solutions to the general problem of "package multiple
>> modules/resources in a single, efficient-to-load file/binary" in
>> different ways: the solutions in the standard library (the frozen and zip
>> importers) and in package distribution (wheels or eggs) just aren't
>> sufficient, because they each lack critical features. oxidized_importer
>> _might_ be the most robust of these solutions that is also available as a
>> standalone package on PyPI.
>>
>> I would encourage you to play around with oxidized_importer outside the
>> context of PyOxidizer. I think you'll be pleasantly surprised by its
>> performance and ability to emulate most of the common parts of the
>> importlib APIs. The API for working with "Python packed resources" data
>> structures isn't great, but only because I haven't spent much effort on
>> making it so.
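>>
>> If you want to poke at it from a stock CPython, the basic wiring looks
>> roughly like the sketch below (see the linked docs for the exact,
>> current API for loading packed resources data):
>>
>>     import sys
>>     import oxidized_importer
>>
>>     finder = oxidized_importer.OxidizedFinder()
>>     sys.meta_path.insert(0, finder)
>>     # imports are now serviced by the finder for whatever packed
>>     # resources it has been given (see the docs for loading resources)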
>>
>> I believe there's a path to adding a meta path importer to the stdlib
>> that - like oxidized_importer - reads resource data from a well-defined
>> data structure while retaining the performance of the frozen importer and
>> the full feature set of PathFinder. I would suggest this as a better
>> longer-term solution than trying to incrementally evolve the frozen or zip
>> importers to fit this use case. You could probably implement most of it in
>> Python and freeze the bytecode into the interpreter like we do with
>> PathFinder, leaving only the performance-sensitive parser to be implemented
>> in C.
>>
>> All that being said, what I advocate for is obviously a lot of scope
>> bloat versus doing some quick work to enable use of the frozen importer on
>> a few dozen stdlib modules to speed up interpreter startup as is being
>> discussed in issue45020. The practical engineer in me supports doing the
>> quick and dirty solution now for the quick win. But I do encourage thinking
>> bigger towards longer-term solutions, especially if you find yourself
>> tempted to incrementally add features to the frozen importer. I believe there
>> is a market need for a stdlib meta path importer that reads a highly
>> optimized and portable format similar to the solutions I've devised for
>> PyOxidizer. Let me know how I can help incorporate one in the standard
>> library.
>>
>> Gregory
> --
> --Guido (mobile)
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/ZLMBY3BVVN2VRBN57U64D2J6FVZR4XHC/
Code of Conduct: http://python.org/psf/codeofconduct/
