[Python-Dev] Re: A better way to freeze modules

Guido van Rossum Thu, 02 Sep 2021 22:25:40 -0700

Quick reaction: This feels like a bait and switch to me. Also, there are
many advantages to using a standard format like zip (many formats are
really zip with some conventions). Finally, the bytecode format you are
using is “marshal”, and is fully portable — as is zip.
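
As a rough illustration of the two standard pieces mentioned here, the sketch below round-trips a code object through marshal and imports a module from a zip archive via sys.path. The module name zipped_demo and the file name bundle.zip are made up for the example. Note that marshal output is independent of machine architecture, but the Python docs say the format may change between Python versions.

```python
import importlib
import marshal
import sys
import tempfile
import zipfile
from pathlib import Path

# 1) marshal: serialize a compiled code object to bytes and load it back.
source = "ANSWER = 6 * 7\n"
code = compile(source, "<demo>", "exec")
blob = marshal.dumps(code)
namespace = {}
exec(marshal.loads(blob), namespace)

# 2) zip: store a module in an archive and import it through sys.path.
workdir = Path(tempfile.mkdtemp())
archive = workdir / "bundle.zip"  # hypothetical file name
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("zipped_demo.py", "GREETING = 'hello from zip'\n")

sys.path.insert(0, str(archive))
importlib.invalidate_caches()
zipped_demo = importlib.import_module("zipped_demo")

print(namespace["ANSWER"], zipped_demo.GREETING)
```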


On Thu, Sep 2, 2021 at 21:44 Gregory Szorc <gregory.sz...@gmail.com> wrote:

> Over in https://bugs.python.org/issue45020 there is some exciting work
> around expanding the use of the frozen importer to speed up Python
> interpreter startup. I wholeheartedly support the effort and don't want to
> discourage progress in this area.
>
> Simultaneously, I've been down this path before with PyOxidizer and feel
> like I have some insight to share.
>
> I don't think I'll be offending anyone by saying the existing CPython
> frozen importer is quite primitive in terms of functionality: it does the
> minimum it needs to do to support importing module bytecode embedded in the
> interpreter binary [for purposes of bootstrapping the Python-based
> importlib modules]. The C struct representing frozen modules is literally
> just the module name and a pointer to a sized buffer containing bytecode.
>
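For readers who want to see the frozen importer from the Python side, here is a small sketch using importlib.machinery.FrozenImporter. It only inspects what the running interpreter reports; which modules are frozen varies by CPython version, so the list it builds is not assumed to have any particular contents.

```python
import importlib.machinery
import sys

# _frozen_importlib is the bootstrap copy of importlib embedded in the binary.
spec = importlib.machinery.FrozenImporter.find_spec("_frozen_importlib")
print(spec.name, spec.origin)

# Collect already-imported modules whose spec says they came from the
# frozen importer. The result depends on the interpreter version.
frozen_loaded = sorted(
    name
    for name, module in list(sys.modules.items())
    if getattr(getattr(module, "__spec__", None), "origin", None) == "frozen"
)
print(frozen_loaded)
```
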
> In issue45020 there is talk of enhancing the functionality of the frozen
> importer to support its potential broader use. For example, setting
> __file__ or exposing .__loader__.get_source(). I support the overall
> initiative.
>
> However, introducing enhanced functionality of the frozen importer will at
> the C level require either:
>
> a) backwards incompatible changes to the C API to support additional
> metadata on frozen modules (or at the very least a supplementary API that
> fragments what a "frozen" module is).
> b) CPython-only hacks to support additional functionality for "freezing"
> the standard library for purposes of speeding up startup.
>
> I'm not a CPython core developer, but neither "a" nor "b" seems ideal to
> me. "a" is backwards incompatible. "b" seems like a stop-gap solution until
> a more generic version is available outside the CPython standard library.
>
> From my experience with PyOxidizer and software in general, here is what I
> think is going to happen:
>
> 1. CPython enhances the frozen importer to be usable in more situations.
> 2. Python programmers realize this solution has performance and
> ease-of-distribution wins and want to use it more.
> 3. Limitations in the frozen importer are found. Bugs are reported.
> Feature requests are made.
> 4. The frozen importer keeps getting incrementally extended or Python
> developers grow frustrated that its enhancements are only available to the
> standard library. You end up slowly reimplementing the importing mechanism
> in C (remember Python 2?) or disappoint users.
>
> Rather than extending the frozen importer, I would suggest considering an
> alternative solution that is far more useful to the long-term success of
> Python: I would consider building a fully-featured, generic importer that
> is capable of importing modules and resource data from a well-defined and
> portable serialization format / data structure that isn't defined by C
> structs and APIs.
>
> Instead of defining module bytecode (and possible additional minimal
> metadata) in C structs in a frozen modules array (or an equivalent C API),
> what if we instead defined a serialization format for representing the
> contents of loadable Python data (module source, module bytecode, resource
> files, extension module library data, etc)? We could then point the Python
> interpreter at instances of this data structure (in memory or in files) so
> it could import/load the resources within using a meta path importer.
>
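A minimal sketch of the general idea, assuming the "resources" are just an in-memory dict of module name to source text. The names RESOURCES, InMemoryFinder and packed_demo are hypothetical. This is neither oxidized_importer nor a proposed CPython API, only an illustration of a meta path finder serving imports from a data structure instead of the filesystem.

```python
import importlib.abc
import importlib.util
import sys

# Hypothetical resources table: module name -> (is_package, source text).
RESOURCES = {
    "packed_demo": (True, "NAME = 'packed_demo'\n"),
    "packed_demo.util": (False, "def double(x):\n    return 2 * x\n"),
}


class InMemoryFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Finder and loader in one object, backed by a plain dict."""

    def __init__(self, resources):
        self._resources = resources

    def find_spec(self, fullname, path=None, target=None):
        entry = self._resources.get(fullname)
        if entry is None:
            return None  # let the next finder on sys.meta_path try
        is_package, _source = entry
        return importlib.util.spec_from_loader(
            fullname, self, origin="in-memory", is_package=is_package
        )

    def create_module(self, spec):
        return None  # use the default module creation

    def exec_module(self, module):
        _is_package, source = self._resources[module.__name__]
        code = compile(source, f"<in-memory {module.__name__}>", "exec")
        exec(code, module.__dict__)


sys.meta_path.insert(0, InMemoryFinder(RESOURCES))

import packed_demo
from packed_demo.util import double

print(double(21))
```

A real implementation would store bytecode and resource files as well, and would read the table from a serialized blob rather than a literal dict.
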
> What if this serialization format were designed so that it was extremely
> efficient to parse and imports could be serviced with the same trivially
> minimal overhead that the frozen importer currently has? We could embed
> these data structures in produced binaries and achieve the same desirable
> results we'll be getting in issue45020 all while delivering a more generic
> solution.
>
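To show what "efficient to parse" can mean in practice, here is a toy binary layout: a fixed header, a compact index, then the payloads. Everything about it (the DEMO magic, the field widths, the single package flag) is an assumption made for this sketch. It is not the "Python packed resources" format. Parsing walks the index once and hands out memoryview slices, so payloads are not copied.

```python
import struct

MAGIC = b"DEMO"                 # hypothetical magic value
HEADER = struct.Struct("<4sI")  # magic, number of entries
ENTRY = struct.Struct("<HBII")  # name length, flags, payload offset, payload length
FLAG_PACKAGE = 0x01


def pack(resources):
    """Serialize {name: (is_package, payload_bytes)} into one blob."""
    index = bytearray()
    payloads = bytearray()
    for name, (is_package, payload) in resources.items():
        encoded = name.encode("utf-8")
        flags = FLAG_PACKAGE if is_package else 0
        index += ENTRY.pack(len(encoded), flags, len(payloads), len(payload))
        index += encoded
        payloads += payload
    return HEADER.pack(MAGIC, len(resources)) + bytes(index) + bytes(payloads)


def parse(buffer):
    """Return {name: (is_package, memoryview_of_payload)} without copying payloads."""
    view = memoryview(buffer)
    magic, count = HEADER.unpack_from(view, 0)
    if magic != MAGIC:
        raise ValueError("not a demo resources blob")
    pos = HEADER.size
    entries = []
    for _ in range(count):
        name_len, flags, offset, length = ENTRY.unpack_from(view, pos)
        pos += ENTRY.size
        name = bytes(view[pos:pos + name_len]).decode("utf-8")
        pos += name_len
        entries.append((name, flags, offset, length))
    data_start = pos  # payload offsets are relative to the end of the index
    return {
        name: (bool(flags & FLAG_PACKAGE),
               view[data_start + offset:data_start + offset + length])
        for name, flags, offset, length in entries
    }


blob = pack({"pkg": (True, b"abc"), "pkg.mod": (False, b"hello")})
table = parse(blob)
print(table["pkg"][0], bytes(table["pkg.mod"][1]))
```
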
> What if this serialization format were portable across machines? The
> entire Python ecosystem could leverage it as a container format for
> distributing Python resources. Rather than splatting dozens or hundreds of
> files on the filesystem, you could write a single file with all of a
> package's resources. Bugs around filesystem implementation details such as
> case (in)sensitivity and Unicode normalization go away. Package installs
> are quicker. Run-time performance is better due to faster imports.
>
> (OK, maybe that last point brings back bad memories of eggs and you
> instinctively reject the idea. Or you have concerns about development
> ergonomics when module source code isn't in standalone editable files.
> These are fair points!)
>
> What if the Python interpreter gains an "app mode" where it is capable of
> being paired with a single "resources file" and running the application
> within? Think running zip applications today, but a bit faster, more
> tailored to Python, and more fully featured.
>
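For comparison, this is roughly what the existing zip application workflow looks like with the stdlib zipapp module. The directory and file names are made up for the example.

```python
import subprocess
import sys
import tempfile
import zipapp
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
app_src = workdir / "app"  # hypothetical application directory
app_src.mkdir()
(app_src / "__main__.py").write_text("print('hello from a zip application')\n")

# Bundle the directory into a single runnable archive.
target = workdir / "app.pyz"
zipapp.create_archive(app_src, target)

# The interpreter runs the archive directly, using __main__.py as the entry point.
result = subprocess.run(
    [sys.executable, str(target)], capture_output=True, text=True, check=True
)
print(result.stdout.strip())
```
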
> What if an efficient binary serialization format could be leveraged as a
> cache to speed up subsequent interpreter startups?
>
> These were all considerations on my mind in the early days of PyOxidizer
> when I realized that the frozen importer and zip importers were lacking the
> features I desired and I would need to find an alternative solution.
>
> One thing led to another and I have incrementally developed the "Python
> packed resources" data format (
> https://pyoxidizer.readthedocs.io/en/stable/pyoxidizer_packed_resources.html).
> This is a binary format for representing Python source code, bytecode,
> resource files, extension modules, even shared libraries that extension
> modules rely on!
>
> Coupled with this format is the oxidized_importer meta path finder (
> https://pypi.org/project/oxidized-importer/ and
> https://pyoxidizer.readthedocs.io/en/latest/oxidized_importer.html)
> capable of servicing imports and resource loading from these "Python packed
> resources" data structures.
>
> From a super high level, PyOxidizer assembles an instance of "Python
> packed resources" containing the CPython standard library and any
> additional Python packages you point it at and produces an executable with
> a main() that starts a Python interpreter, configures
> oxidized_importer.OxidizedFinder to read from the configured packed
> resources data structure (which may be embedded in the binary or loaded
> from a mmap()d file), and invokes some Python code inside to run your
> application.
>
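The "loaded from a mmap()d file" part can be sketched with the stdlib mmap module. The file name and the 6-byte header are placeholders for this illustration. The point is only that the file is mapped rather than read up front, and that slices are taken from the mapping on demand.

```python
import mmap
import tempfile
from pathlib import Path

# Write a stand-in resources file: a 6-byte header followed by a payload.
path = Path(tempfile.mkdtemp()) / "resources.bin"  # hypothetical file name
path.write_bytes(b"HEADER" + b"payload-bytes")

with open(path, "rb") as fh:
    with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        with memoryview(mapped) as view:
            header = bytes(view[:6])
            payload = bytes(view[6:])  # copy only the slice that is needed

print(header, payload)
```
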
> oxidized_importer has an API for reading and writing "Python packed
> resources" data structures. You can even use it to build your own
> PyOxidizer-like utilities (
> https://pyoxidizer.readthedocs.io/en/latest/oxidized_importer_freezing_applications.html
> ).
>
> I bring this work up because I believe that if you set yourself on a path
> to build a performant and fully featured importer/finder, you will
> inevitably build something with properties very similar to what I have
> built. To be uncompromising on performance, you'll want to roll your own
> data format that is in tune with Python's specific needs and avoids I/O and
> overhead when possible. To fully support the long-tail of features in
> Python's importing mechanism, you need the ability to richly - and
> efficiently - express metadata like whether a module is a package. It is
> possible to shoehorn this [meta]data into formats like tar and zip. But it
> won't be as efficient as rolling your own data structure. And when it comes
> to interpreter startup overhead, performance does matter.
>
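One way to picture the metadata point: with a zip archive, an importer typically has to infer whether a name is a package by probing for an __init__ entry, whereas a purpose-built index can simply record it. The sketch below contrasts the two. The helper is_package_by_probing and the EXPLICIT_INDEX dict are hypothetical, and the probing logic is a simplification of what a zip-based importer does.

```python
import io
import zipfile

# Build a small in-memory archive; the entry names are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("pkg/__init__.py", "")
    zf.writestr("pkg/mod.py", "VALUE = 1\n")

with zipfile.ZipFile(buffer) as zf:
    names = set(zf.namelist())


def is_package_by_probing(names, fullname):
    """Infer package-ness from the presence of an __init__ entry."""
    base = fullname.replace(".", "/")
    return f"{base}/__init__.py" in names or f"{base}/__init__.pyc" in names


# A purpose-built index can store the answer directly.
EXPLICIT_INDEX = {"pkg": True, "pkg.mod": False}

for name in ("pkg", "pkg.mod"):
    print(name, is_package_by_probing(names, name), EXPLICIT_INDEX[name])
```
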
> Am I suggesting CPython use oxidized_importer? No. It is implemented in
> Rust and CPython can't take a Rust dependency.
>
> Am I suggesting CPython support the "Python packed resources" data format
> as-is? No. The exact format today isn't suitable for CPython: I didn't
> design it with consideration for use beyond PyOxidizer's use case and there
> are still a ton of missing features.
>
> What I am suggesting is that Python developers think about the idea of
> standardizing a Python-centric container format for holding "Python
> resources" and a built-in/stdlib meta path finder for using it. Think of
> this as "frozen/zip importer 2.0" but with a more strongly defined and
> portable data format that is detached from C struct definitions. This could
> potentially solve a lot of problems around startup/import performance. And
> if you wanted to extend it to packaging/distribution, I think it could
> solve a lot of problems there too. (If you designed the format properly, I
> think it would be possible to converge with the use case of wheels.) (But I
> understand the skepticism about making the leap to packaging: that is an
> absurdly complex problem space!)
>
> If this idea sounds radical to you, I get the skepticism. I didn't want to
> incur this work/complexity when writing PyOxidizer either. But a long
> series of investigations and ruling out alternatives led me down this
> path. With the benefit of hindsight I believe the type of solution is sound
> and it is inevitable Python gains something like this in the standard
> library or at least sees something like this in wide use in the wild. I say
> that because multiple Python app distribution tools have, each in its own
> way, reinvented solutions to the general problem of "package multiple
> modules/resources in a single, efficient-to-load file/binary". They did so
> because the solutions in the standard library (frozen and zip importers) and
> in package distribution (wheels or eggs) each lack critical features.
> oxidized_importer _might_ be the most robust of these solutions to also be
> available as a standalone package on PyPI.
>
> I would encourage you to play around with oxidized_importer outside the
> context of PyOxidizer. I think you'll be pleasantly surprised by its
> performance and ability to emulate most of the common parts of the
> importlib APIs. The API for working with "Python packed resources" data
> structures isn't great. But only because I haven't spent much effort in
> making it so.
>
> I believe there's a path to adding a meta path importer to the stdlib that
> - like oxidized_importer - reads resource data from a well-defined data
> structure while retaining the performance of the frozen importer with the
> full feature set of PathFinder. I would suggest this as a better longer
> term solution than trying to incrementally evolve the frozen or zip
> importers to fit this use case. You could probably implement most of it in
> Python and freeze the bytecode into the interpreter like we do with
> PathFinder, leaving only the performance-sensitive parser to be implemented
> in C.
>
> All that being said, what I advocate for is obviously a lot of scope bloat
> versus doing some quick work to enable use of the frozen importer on a few
> dozen stdlib modules to speed up interpreter startup as is being discussed
> in issue45020. The practical engineer in me supports doing the quick and
> dirty solution now for the quick win. But I do encourage thinking bigger
> towards longer-term solutions, especially if you find yourself tempted to
> incrementally add features to the frozen importer. I believe there is a market
> need for a stdlib meta path importer that reads a highly optimized and
> portable format similar to the solutions I've devised for PyOxidizer. Let
> me know how I can help incorporate one in the standard library.
>
> Gregory
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-le...@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/XRJTN37WYVIPLFGXFGHAJZ6FSQC4NETD/
> Code of Conduct: http://python.org/psf/codeofconduct/
>
-- 
--Guido (mobile)

_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/3Y2A5ZNYINH7IKJAT76ERMAKXXNKMILB/
Code of Conduct: http://python.org/psf/codeofconduct/
