Hi Scott,

First, let me say "thanks" for your research. The abysmal startup
performance has been annoying me for quite some time, but I just don't
have much time to devote to PAR these days, so I couldn't do much about
it. That was the reason for hacking up Archive::Unzip::Burst, which just
speeds up initial extraction of the executable by doing it in C
(infozip, really) instead of the dog-slow Archive::Zip.

A::U::B so far only runs on Linux, with some hacking, apparently also
under AIX. I understand that you did all your measurements on Windows?
If you did them on Linux, could you run them with A::U::Burst installed
for comparison? I have not managed to build A::U::B on Windows yet.

Scott Stanton wrote:
> I am trying to find a way to get self-contained par apps to approach the
> speed of PerlApp binaries on Windows.  I ran a few tests on a sample
> application to compare performance on my system.  Here are my results:

[...]

> First, I made a standard perlapp self-contained executable.  Perlapp
> unpacks just the shared libraries to a single user-specific directory
> ($TEMP/pdk-<user>).  It does not remove the files when done.  The files
> are named using an md5 checksum so they are unlikely to collide if there
> are changes.  The one exception I noticed was perl58.dll which was
> placed in a checksum named subdirectory under its original name.
> Presumably something similar must be done for any bundled run-time
> linker dependency.

That's actually a good idea: avoiding dll file name clashes by giving
each external (non-Perl-module) dll its own directory. Currently, we
extract the external dlls under their original names, IIRC. However, I
don't know how you can make those directories visible to the dll loader
without appending *each* directory to $PATH. With a long base path
($TEMP is long on Windows, isn't it?), that might quickly hit the
length limit on environment variables on Windows.
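A minimal sketch of the checksum-named cache scheme, assuming the
PerlApp-style layout described above: each dll lands in a directory
named after an MD5 of its content, the write is skipped when the file
is already cached, and the $PATH growth I'm worried about is checked
explicitly. All names here (cache_dll, the demo content) are
illustrative, not actual PAR or PerlApp API.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Path  qw(make_path);
use File::Spec;

# Extract a dll into a subdirectory named by the MD5 of its content,
# so different versions never collide.  Skip the write if it's cached.
sub cache_dll {
    my ($cache_root, $name, $content) = @_;
    my $dir = File::Spec->catdir($cache_root, md5_hex($content));
    make_path($dir) unless -d $dir;
    my $path = File::Spec->catfile($dir, $name);
    unless (-e $path) {                 # deferred write: only if missing
        open my $fh, '>', $path or die "can't write $path: $!";
        binmode $fh;
        print {$fh} $content;
        close $fh;
    }
    return $dir;
}

my $cache_root = File::Spec->catdir(File::Spec->tmpdir, "par-demo-$$");
my @dirs = map { cache_dll($cache_root, "lib$_.dll", "fake dll $_") } 1 .. 3;

# Appending *each* such directory to $PATH is where the env-var length
# limit bites (roughly 32 KB per variable on Windows).
my $sep      = $^O eq 'MSWin32' ? ';' : ':';
my $new_path = join $sep, ($ENV{PATH} // ''), @dirs;
warn "PATH would exceed the ~32 KB env-var limit\n"
    if length($new_path) > 32_000;
print scalar(@dirs), " cache dirs\n";
```

With a handful of dlls this is harmless, but an application bundling
dozens of external libraries would add dozens of long directories to
$PATH, which is why a single shared directory per dll version may be
the better trade-off.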

> Next I tried making a par binary that unpacks to an application specific
> temp directory ($TEMP/par-<user>/cache-<sum>) and does not clean up
> afterwards.  In this mode, par unpacks the whole zip file as well as all
> of the bootstrap files.  This results in:
> 
>             First call:   8.5 s
>             Second call:  0.5 s

This is the default behaviour of PAR::Packer, of course.

> The next test I ran was to create a self contained executable made with
> par and -clean.  This case unpacks some of the files each time and
> removes the temporary directory when done:
> 
>             First call:   2.3 s
>             Second call:  2.3 s (no reuse)
> 
> These tests indicate that there is a severe penalty for unpacking the
> whole file, but once that is done the second use is pretty fast.  Taking
> a cue from the perlapp approach, I tried running a few additional
> experiments.  To get a baseline for the cost of using infozip to expand
> the zip file, I tried expanding just the dlls, then everything:
> 
>             .dlls only:   0.5 s
>             Everything:   4.9 s

This seems *very* slow for infozip. Perhaps it's due to the hardware
(an old hard disk?), the filesystem, or a really huge binary. Infozip
was blazingly fast at extracting even zips containing tens of megabytes
of tiny files on my reasonably fast (2005) Linux machine using ext3.

> That's pretty interesting.  We might be able to speed things up a lot by
> only unpacking the dlls.  So in my next experiment I modified par a bit
> to disable the calls to _extract_inc in PAR.pm so it doesn't unpack
> everything when running with a cache directory.  This improved things a
> lot:
> 
>             First call:   1.7 s
>             Second call:  0.5 s

> This is pretty good, but we're still unpacking a lot of .pm files that
> could be loaded directly from memory.  It turns out there are two places
> that files are unpacked.  One is the bootstrap files embedded in the
> binary, the rest are inside the .par file.  The .par files can be loaded
> directly from memory by using PerlIO::scalar to create streams from
> buffers.  Some of the bootstrap files must be saved to disk in order to
> get far enough along to get the dynloader and PerlIO::scalar modules
> loaded, but the rest can be loaded from memory.  The resulting cache
> directory only contains a small subset of the .pm files plus all of the
> .dlls.  Also, par was extracting files before testing whether they
> already exist in the cache.  Deferring the read avoids unnecessary I/O
> (it also avoids the permission denied problems that sometimes crop up
> when the same binary is called multiple times).  With these changes in
> place, I got the following numbers:
> 
>             First call:   1.2 s
>             Second call:  0.5 s

That is certainly the most elegant way to speed up PAR startup. One of
the longest-standing criticisms of PAR is that we don't do it that way.
Nicholas Clark's ex::lib::zip does this loading-from-memory using XS,
if I remember correctly. I don't remember whether or how it deals with
shared object files; it probably does.
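The in-memory loading Scott describes can be sketched with a plain @INC
code hook, no XS needed: PerlIO::scalar (core since 5.8) lets open()
treat a scalar as a file, and require happily reads a module from the
returned handle. The module name and source below are made up for
illustration; this is not the actual PAR bootstrap code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory module source, standing in for a zip member
# that was read into a buffer instead of being extracted to disk.
my %pm_in_memory = (
    'Hello/World.pm' => <<'EOT',
package Hello::World;
sub greet { return "hello from memory" }
1;
EOT
);

# @INC code hook: when perl searches for a module, return a filehandle
# opened on the in-memory scalar via PerlIO::scalar.
unshift @INC, sub {
    my (undef, $file) = @_;
    return unless exists $pm_in_memory{$file};
    open my $fh, '<', \$pm_in_memory{$file}
        or die "can't open in-memory $file: $!";
    return $fh;
};

require Hello::World;
print Hello::World::greet(), "\n";   # prints "hello from memory"
```

The catch Scott already identified: PerlIO::scalar and the dynamic
loader themselves must be available before a hook like this can work,
which is why a small bootstrap subset still has to reach the disk.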

> Rerunning the -clean case gives:
> 
>             First call:   1.6 s
>             Second call:  1.6 s
> 
> This is a significant improvement over the original cached and uncached
> cases and probably approaches the limit of what can be achieved with the
> current zip implementation and application structure.  So what's
> missing?

I'm sure there are a few more improvements to be had, but they probably
mean moving away from Archive::Zip as much as possible. This is already
impressive.

> There are probably ways to streamline the bootstrapping to avoid the
> need to unpack any .pm files to disk.  Further tweaking of the startup
> code might allow the PerlIO::scalar and dynloader modules to be loaded
> explicitly.

Probably. It would require quite a bit of refactoring, though.

> My current deltas are a bit of a hack and don't preserve the original
> _extract_inc behavior.  Ideally, this would be under the control of
> a switch.  I'd like to get feedback from the list about the best way to
> integrate this.  Also, if you can think of problems with this approach
> that I haven't considered, that would be helpful too.  I've verified
> that the patch works on both Linux (AS Perl 5.8.7) and Windows XP (AS
> Perl 5.8.6).

I would love to see this done in PAR(::Packer). As you correctly point
out, it's a pretty drastic change that certainly comes with some change
in behaviour, but in my opinion, it's the way forward.

Perhaps we should create a branch of PAR::Packer for these changes,
which will eventually lead to PAR::Packer >= 2.0. At the same time,
there have been several severe bug reports against PAR::Packer, PAR, and
Module::ScanDeps. Once or if we can fix those, I'd like to release a
PAR(::Packer) 1.0 to the world to get away from those tiny 0.001 version
bumps. Scott, you have commit access to the repository, so please feel
free to branch PAR::Packer and experiment! I'd be very much willing to
do a developer release with your changes.

Best regards,
Steffen
