Quick reaction: This feels like a bait and switch to me. Also, there are many advantages to using a standard format like zip (many formats are really zip with some conventions). Finally, the bytecode format you are using is “marshal”, and is fully portable — as is zip.
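To make the zip and marshal point concrete, here is a minimal sketch using only the standard library. The module name `demo_zip_mod` and the archive name `bundle.zip` are made up for illustration. One caveat worth keeping in mind: marshal output is independent of machine architecture, but the marshal documentation does not promise that the format stays the same across Python versions.

```python
import marshal
import os
import sys
import tempfile
import zipfile

# Hypothetical module source, used only for this illustration.
SOURCE = "def greet():\n    return 'hello from the archive'\n"

# marshal is the serialization used for code objects stored in .pyc files.
code = compile(SOURCE, "demo_zip_mod.py", "exec")
blob = marshal.dumps(code)
namespace = {}
exec(marshal.loads(blob), namespace)
roundtrip_result = namespace["greet"]()

# A zip archive placed on sys.path is importable via the stdlib zipimport hook.
archive = os.path.join(tempfile.mkdtemp(), "bundle.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("demo_zip_mod.py", SOURCE)
sys.path.insert(0, archive)

import demo_zip_mod  # served from bundle.zip, not from a loose file

zip_result = demo_zip_mod.greet()
loader_name = type(demo_zip_mod.__loader__).__name__
```

The same archive can be opened with any zip tool, which is part of the appeal of building conventions on top of a standard container.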
On Thu, Sep 2, 2021 at 21:44 Gregory Szorc <gregory.sz...@gmail.com> wrote:
> Over in https://bugs.python.org/issue45020 there is some exciting work
> around expanding the use of the frozen importer to speed up Python
> interpreter startup. I wholeheartedly support the effort and don't want to
> discourage progress in this area.
>
> Simultaneously, I've been down this path before with PyOxidizer and feel
> like I have some insight to share.
>
> I don't think I'll be offending anyone by saying the existing CPython
> frozen importer is quite primitive in terms of functionality: it does the
> minimum it needs to do to support importing module bytecode embedded in the
> interpreter binary [for purposes of bootstrapping the Python-based
> importlib modules]. The C struct representing frozen modules is literally
> just the module name and a pointer to a sized buffer containing bytecode.
>
> In issue45020 there is talk of enhancing the functionality of the frozen
> importer to support its potential broader use. For example, setting
> __file__ or exposing .__loader__.get_source(). I support the overall
> initiative.
>
> However, introducing enhanced functionality of the frozen importer will at
> the C level require either:
>
> a) backwards incompatible changes to the C API to support additional
> metadata on frozen modules (or at the very least a supplementary API that
> fragments what a "frozen" module is).
> b) CPython-only hacks to support additional functionality for "freezing"
> the standard library for purposes of speeding up startup.
>
> I'm not a CPython core developer, but neither "a" nor "b" seems ideal to
> me. "a" is backwards incompatible. "b" seems like a stop-gap solution until
> a more generic version is available outside the CPython standard library.
>
> From my experience with PyOxidizer and software in general, here is what I
> think is going to happen:
>
> 1. CPython enhances the frozen importer to be usable in more situations.
> 2. Python programmers realize this solution has performance and
> ease-of-distribution wins and want to use it more.
> 3. Limitations in the frozen importer are found. Bugs are reported.
> Feature requests are made.
> 4. The frozen importer keeps getting incrementally extended or Python
> developers grow frustrated that its enhancements are only available to the
> standard library. You end up slowly reimplementing the importing mechanism
> in C (remember Python 2?) or disappoint users.
>
> Rather than extending the frozen importer, I would suggest considering an
> alternative solution that is far more useful to the long-term success of
> Python: I would consider building a fully-featured, generic importer that
> is capable of importing modules and resource data from a well-defined and
> portable serialization format / data structure that isn't defined by C
> structs and APIs.
>
> Instead of defining module bytecode (and possible additional minimal
> metadata) in C structs in a frozen modules array (or an equivalent C API),
> what if we instead defined a serialization format for representing the
> contents of loadable Python data (module source, module bytecode, resource
> files, extension module library data, etc)? We could then point the Python
> interpreter at instances of this data structure (in memory or in files) so
> it could import/load the resources within using a meta path importer.
>
> What if this serialization format were designed so that it was extremely
> efficient to parse and imports could be serviced with the same trivially
> minimal overhead that the frozen importer currently has? We could embed
> these data structures in produced binaries and achieve the same desirable
> results we'll be getting in issue45020 all while delivering a more generic
> solution.
>
> What if this serialization format were portable across machines? The
> entire Python ecosystem could leverage it as a container format for
> distributing Python resources. Rather than splatting dozens or hundreds of
> files on the filesystem, you could write a single file with all of a
> package's resources. Bugs around filesystem implementation details such as
> case (in)sensitivity and Unicode normalization go away. Package installs
> are quicker. Run-time performance is better due to faster imports.
>
> (OK, maybe that last point brings back bad memories of eggs and you
> instinctively reject the idea. Or you have concerns about development
> ergonomics when module source code isn't in standalone editable files.
> These are fair points!)
>
> What if the Python interpreter gains an "app mode" where it is capable of
> being paired with a single "resources file" and running the application
> within? Think running zip applications today, but a bit faster, more
> tailored to Python, and more fully featured.
>
> What if an efficient binary serialization format could be leveraged as a
> cache to speed up subsequent interpreter startups?
>
> These were all considerations on my mind in the early days of PyOxidizer
> when I realized that the frozen importer and zip importers were lacking the
> features I desired and I would need to find an alternative solution.
>
> One thing led to another and I have incrementally developed the "Python
> packed resources" data format (
> https://pyoxidizer.readthedocs.io/en/stable/pyoxidizer_packed_resources.html).
> This is a binary format for representing Python source code, bytecode,
> resource files, extension modules, even shared libraries that extension
> modules rely on!
>
> Coupled with this format is the oxidized_importer meta path finder (
> https://pypi.org/project/oxidized-importer/ and
> https://pyoxidizer.readthedocs.io/en/latest/oxidized_importer.html)
> capable of servicing imports and resource loading from these "Python packed
> resources" data structures.
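For readers less familiar with meta path finders, the following is a rough sketch of the general idea of servicing imports from an in-memory data structure. It is not how oxidized_importer is implemented; the class `InMemoryFinder`, the `RESOURCES` mapping, and the `demo_pkg` names are all hypothetical, and only stdlib importlib calls are used.

```python
import importlib.abc
import importlib.util
import sys


class InMemoryFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Hypothetical finder/loader backed by a dict: name -> (source, is_package)."""

    def __init__(self, resources):
        self._resources = resources

    def find_spec(self, fullname, path=None, target=None):
        entry = self._resources.get(fullname)
        if entry is None:
            return None  # defer to the next finder on sys.meta_path
        _source, is_package = entry
        return importlib.util.spec_from_loader(
            fullname, self, origin="in-memory", is_package=is_package
        )

    def create_module(self, spec):
        return None  # use the default module creation

    def exec_module(self, module):
        source, _is_package = self._resources[module.__name__]
        code = compile(source, f"<in-memory {module.__name__}>", "exec")
        exec(code, module.__dict__)

    def get_source(self, fullname):
        return self._resources[fullname][0]


# Illustrative resources: one package and one submodule, no files on disk.
RESOURCES = {
    "demo_pkg": ("from .core import answer\n", True),
    "demo_pkg.core": ("def answer():\n    return 42\n", False),
}
sys.meta_path.insert(0, InMemoryFinder(RESOURCES))

import demo_pkg

result = demo_pkg.answer()
```

A real implementation would store bytecode rather than source and would need to cover resource loading and the rest of the importlib surface, which is where most of the work lies.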
>
> From a super high level, PyOxidizer assembles an instance of "Python
> packed resources" containing the CPython standard library and any
> additional Python packages you point it at and produces an executable with
> a main() that starts a Python interpreter, configures
> oxidized_importer.OxidizedFinder to read from the configured packed
> resources data structure (which may be embedded in the binary or loaded
> from a mmap()d file), and invokes some Python code inside to run your
> application.
>
> oxidized_importer has an API for reading and writing "Python packed
> resources" data structures. You can even use it to build your own
> PyOxidizer-like utilities (
> https://pyoxidizer.readthedocs.io/en/latest/oxidized_importer_freezing_applications.html
> ).
>
> I bring this work up because I believe that if you set yourself on a path
> to build a performant and fully featured importer/finder, you will
> inevitably build something with properties very similar to what I have
> built. To be uncompromising on performance, you'll want to roll your own
> data format that is in tune with Python's specific needs and avoids I/O and
> overhead when possible. To fully support the long-tail of features in
> Python's importing mechanism, you need the ability to richly - and
> efficiently - express metadata like whether a module is a package. It is
> possible to shoehorn this [meta]data into formats like tar and zip. But it
> won't be as efficient as rolling your own data structure. And when it comes
> to interpreter startup overhead, performance does matter.
>
> Am I suggesting CPython use oxidized_importer? No. It is implemented in
> Rust and CPython can't take a Rust dependency.
>
> Am I suggesting CPython support the "Python packed resources" data format
> as-is? No. The exact format today isn't suitable for CPython: I didn't
> design it with consideration for use beyond PyOxidizer's use case and there
> are still a ton of missing features.
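As a toy illustration of what "express metadata like whether a module is a package" might look like in a purpose-built binary layout, here is a small sketch. The layout, the `DEMOPACK` magic, and the `pack`/`load_index` helpers are invented for this example and are unrelated to the actual "Python packed resources" format; the sketch only shows a fixed-size index with a flags byte and zero-copy payload slices.

```python
import struct

MAGIC = b"DEMOPACK"               # hypothetical magic value
HEADER = struct.Struct("<8sII")   # magic, entry count, index size in bytes
ENTRY = struct.Struct("<HBII")    # name length, flags, payload offset, payload length
FLAG_PACKAGE = 0x01


def pack(resources):
    """Serialize a dict of name -> (payload bytes, is_package)."""
    index = bytearray()
    payloads = bytearray()
    for name, (data, is_package) in resources.items():
        encoded = name.encode("utf-8")
        flags = FLAG_PACKAGE if is_package else 0
        # Offsets are relative to the start of the payload region.
        index += ENTRY.pack(len(encoded), flags, len(payloads), len(data))
        index += encoded
        payloads += data
    header = HEADER.pack(MAGIC, len(resources), len(index))
    return header + bytes(index) + bytes(payloads)


def load_index(buf):
    """Build name -> (is_package, memoryview of payload) without copying payloads."""
    view = memoryview(buf)
    magic, count, index_size = HEADER.unpack_from(view, 0)
    if magic != MAGIC:
        raise ValueError("not a demo pack")
    base = HEADER.size + index_size
    pos = HEADER.size
    table = {}
    for _ in range(count):
        name_len, flags, offset, length = ENTRY.unpack_from(view, pos)
        pos += ENTRY.size
        name = bytes(view[pos:pos + name_len]).decode("utf-8")
        pos += name_len
        start = base + offset
        table[name] = (bool(flags & FLAG_PACKAGE), view[start:start + length])
    return table


blob = pack({"pkg": (b"", True), "pkg.mod": (b"x = 1\n", False)})
table = load_index(blob)
is_package, payload = table["pkg.mod"]
```

Because the payloads are exposed as memoryview slices, the same parser would work over an mmap()ed file or a buffer embedded in a binary without copying the module data.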
>
> What I am suggesting is that Python developers think about the idea of
> standardizing a Python-centric container format for holding "Python
> resources" and a built-in/stdlib meta path finder for using it. Think of
> this as "frozen/zip importer 2.0" but with a more strongly defined and
> portable data format that is detached from C struct definitions. This could
> potentially solve a lot of problems around startup/import performance. And
> if you wanted to extend it to packaging/distribution, I think it could
> solve a lot of problems there too. (If you designed the format properly, I
> think it would be possible to converge with the use case of wheels.) (But I
> understand the skepticism about making the leap to packaging: that is an
> absurdly complex problem space!)
>
> If this idea sounds radical to you, I get the skepticism. I didn't want to
> incur this work/complexity when writing PyOxidizer either. But a long
> series of investigations and ruling out alternatives led me down this
> path. With the benefit of hindsight I believe the type of solution is sound
> and it is inevitable Python gains something like this in the standard
> library or at least sees something like this in wide use in the wild. I say
> that because multiple Python app distribution tools have reinvented
> solutions to the general problem of "package multiple modules/resources in
> a single, efficient-to-load file/binary" in different ways because the
> solutions in the standard library (frozen and zip importers) or package
> distribution (wheels or eggs) just aren't sufficient because they each lack
> critical features. oxidized_importer _might_ be the most robust of these
> solutions to also be available as a standalone package on PyPI.
>
> I would encourage you to play around with oxidized_importer outside the
> context of PyOxidizer. I think you'll be pleasantly surprised by its
> performance and ability to emulate most of the common parts of the
> importlib APIs. The API for working with "Python packed resources" data
> structures isn't great. But only because I haven't spent much effort in
> making it so.
>
> I believe there's a path to adding a meta path importer to the stdlib that
> - like oxidized_importer - reads resource data from a well-defined data
> structure while retaining the performance of the frozen importer with the
> full feature set of PathFinder. I would suggest this as a better longer
> term solution than trying to incrementally evolve the frozen or zip
> importers to fit this use case. You could probably implement most of it in
> Python and freeze the bytecode into the interpreter like we do with
> PathFinder, leaving only the performance-sensitive parser to be implemented
> in C.
>
> All that being said, what I advocate for is obviously a lot of scope bloat
> versus doing some quick work to enable use of the frozen importer on a few
> dozen stdlib modules to speed up interpreter startup as is being discussed
> in issue45020. The practical engineer in me supports doing the quick and
> dirty solution now for the quick win. But I do encourage thinking bigger
> towards longer-term solutions, especially if you find yourself tempted to
> incrementally add features to frozen importer. I believe there is a market
> need for a stdlib meta path importer that reads a highly optimized and
> portable format similar to the solutions I've devised for PyOxidizer. Let
> me know how I can help incorporate one in the standard library.
>
> Gregory
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-le...@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/XRJTN37WYVIPLFGXFGHAJZ6FSQC4NETD/
> Code of Conduct: http://python.org/psf/codeofconduct/
>
--
--Guido (mobile)