Hi Nick,

Sorry for the late reply, and thanks for the feedback!

We’ve been working on publishing the package, and the first version is now 
available at https://github.com/alibaba/code-data-share-for-python/, with a 
user guide and some statistics (TL;DR: ~15% speedup in startup).
We welcome code reviews, comments, and questions.

> I assume the files wouldn't be portable across architectures

That’s true; the file is essentially a snapshot of part of the CPython heap 
that can be shared between processes.
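As a rough illustration of the sharing side (the file contents below are a hypothetical stand-in, not the actual pycds archive layout): any processes that map the same file read-only share the underlying pages, so the snapshot is paged in from disk at most once system-wide.

```python
import mmap
import os
import tempfile

# Hypothetical stand-in for the archive (NOT the real pycds layout):
# processes that mmap the same file read-only share its physical pages.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"serialized code objects would live here")

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header = bytes(mm[:10])  # read directly from the mapping
    mm.close()
os.remove(path)
print(header)  # b'serialized'
```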

> so does the cache file naming scheme take that into account?

Currently, no. The file is intended to be generated on demand (rather than as 
one huge archive built from every third-party package installed), so the file 
itself and its name are meant to be managed by the user.
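For comparison, CPython’s own pyc naming embeds the interpreter tag but not the CPU architecture, since marshalled pycs are portable; a user-managed snapshot name could fold in platform details. (The `archive_name` convention below is purely illustrative, not part of the package.)

```python
import importlib.util
import sys

# The pyc cache tag identifies the interpreter version, not the CPU arch,
# because marshalled .pyc files are architecture-independent.
print(sys.implementation.cache_tag)                 # e.g. 'cpython-311'
print(importlib.util.cache_from_source("spam.py"))  # e.g. __pycache__/spam.cpython-311.pyc

# A user-chosen, arch-aware naming convention for a snapshot might look like:
archive_name = f"app.{sys.implementation.cache_tag}-{sys.platform}.img"
```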

> (The idea is interesting regardless of whether it produces arch-specific 
> files - kind of a middle ground between portable serialisation based pycs and 
> fully frozen modules)

I think our package could serve as a substitute for the frozen-module 
mechanism for third-party packages: while builtin modules can be compiled to 
C code, code-data-share can automatically create a similar artifact that 
requires no compilation or deserialization.
We actually have a POC integrated with CPython that can speed up importing 
builtin modules, but after making it a third-party package there’s not much we 
can do about the builtins, so freeze and deep-freeze are quite exciting to us.
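As a rough sketch of the deserialization cost being avoided, using only the stdlib `marshal` module (which is what .pyc loading uses under the hood):

```python
import marshal

# A .pyc import pays this round-trip on every cold load: the code object
# is rebuilt from a marshalled blob. A heap snapshot maps it in directly.
code = compile("x = 1 + 1", "<demo>", "exec")
blob = marshal.dumps(code)     # pyc-style serialization
rebuilt = marshal.loads(blob)  # per-import deserialization cost
ns = {}
exec(rebuilt, ns)
print(ns["x"])  # 2
```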

Best,
Yichen

> On Mar 20, 2022, at 23:26, Nick Coghlan <ncogh...@gmail.com> wrote:
> 
> (belated follow-up as I noticed there hadn't been a reply on list yet, just 
> the previous feedback on the faster-cpython ticket)
> 
> On Mon, 21 Feb 2022, 6:53 pm Yichen Yan via Python-Dev, 
> <python-dev@python.org> wrote:
>> 
>> Hi folks, as illustrated in faster-cpython#150 [1], we have implemented a 
>> mechanism that supports data persistence of a subset of Python data types 
>> with mmap, and can therefore reduce package import time by caching code 
>> objects. This could be seen as a more eager pyc format, as they serve the 
>> same purpose, but our approach tries to avoid [de]serialization. As a 
>> result, we get a speedup in overall Python startup of ~15%.
> 
> 
> This certainly sounds interesting! 
> 
>> 
>> Currently, we’ve made it a third-party library and have been working on 
>> open-sourcing.
>> 
>> Our implementation (whose non-official name is “pycds”) mainly contains two 
>> parts:
>> - importlib hooks: a mechanism to dump code objects to an archive, and a 
>>   `Finder` that supports loading code objects from mapped memory.
>> - Dumping and loading a subset of Python types with mmap. Here we deal with 
>>   1) ASLR by patching `ob_type` fields; 2) hash seed randomization by 
>>   supporting only basic types that don’t have a hash-based layout (i.e. 
>>   dict is not supported); 3) interned strings by re-interning them while 
>>   loading the mmap archive; and so on.
> 
> I assume the files wouldn't be portable across architectures, so does the 
> cache file naming scheme take that into account?
> 
> (The idea is interesting regardless of whether it produces arch-specific 
> files - kind of a middle ground between portable serialisation based pycs and 
> fully frozen modules)
> 
> Cheers,
> Nick.
> 
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/OPJV5HF4MUB2YHGZZQZXMTBNF6ZAJML5/
Code of Conduct: http://python.org/psf/codeofconduct/
