Re: [Distutils] [proposal] shared distribution installations

2017-10-31 Thread Nick Coghlan
On 31 October 2017 at 22:13, Leonardo Rochael Almeida wrote:

>
> Those are issues that buildout solved long before pip was even around,
> but they rely on sys.path expansion that Ronny found objectionable due to
> performance issues.
>

The combination of network drives and lots of sys.path entries could lead
to *awful* startup times with the old stat-based import model (which Python
2.7 still uses by default).

The import system in Python 3.3+ relies on cached os.listdir() results
instead, and after we switched to that, we received at least one report
from an HPC operator of batch jobs importing modules from NFS whose
startup times dropped from 100+ seconds to hundreds of milliseconds -
most of the time was previously being lost to network round trips for
failed stat calls that just reported that the file didn't exist. Even on
spinning disks, the new import system gained back most of the speed that
was lost in the switch from low level C to more maintainable and portable
Python code.
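
To make the difference concrete, here's a rough, illustrative sketch of
the two lookup strategies (this is not CPython's actual code; the helper
names are invented for the example):

    import os

    def find_stat_based(name, path_entries):
        # Py2-style: probe candidate filenames with stat()-like calls in
        # every directory on the path - each failed probe is a full
        # network round trip on NFS.
        candidates = (name + ".py", os.path.join(name, "__init__.py"))
        for entry in path_entries:
            for candidate in candidates:
                full = os.path.join(entry, candidate)
                if os.path.exists(full):
                    return full
        return None

    _listing_cache = {}

    def find_listdir_based(name, path_entries):
        # Py3-style: list each directory once, cache the result, and
        # answer subsequent lookups from the in-memory cache.
        for entry in path_entries:
            if entry not in _listing_cache:
                try:
                    _listing_cache[entry] = set(os.listdir(entry))
                except OSError:
                    _listing_cache[entry] = set()
            contents = _listing_cache[entry]
            if name in contents or name + ".py" in contents:
                return os.path.join(entry, name)
        return None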

An org that runs large rendering farms also reported significantly
improving their batch job startup times in 2.7 by switching to importlib2
(which backports the Py3 import implementation).


> I don't think the performance issues are that problematic (and wasn't
> there some work on Python 3 that made import faster even with long
> sys.paths?).
>

As soon as you combined the old import model with network drives, your
startup times could quickly become intolerable, even with a relatively
short sys.path - failing imports, and imports that only get satisfied
late in the path, simply end up taking too long.

I wouldn't call it a *completely* solved problem in Py3 (there are still
some application startup related activities that scale linearly with the
length of sys.path), but the worst offender (X stat calls times Y sys.path
entries, at Z milliseconds per call) is gone.


> On 31 October 2017 at 05:22, Nick Coghlan wrote:
>
>
>> [...]
>>
>> However, there's another approach that specifically tackles the content
>> duplication problem, which would require a new installation layout as you
>> suggest, but could still rely on *.pth files to make it implicitly
>> compatible with existing packages and applications and existing Python
>> runtime versions.
>>
>> That approach is to create an install tree somewhere that looks like this:
>>
>> _shared-packages/
>>     <distribution-name>/
>>         <installed-version>/
>>             <distribution-name>-<installed-version>.dist-info/
>>             <installed files>
>>
>> Instead of installing full packages directly into a venv the way pip
>> does, an installer that worked this way would instead manage a
>> <distribution-name>.pth file that indicated
>> "_shared-packages/<distribution-name>/<installed-version>" should be
>> added to sys.path.
>>
>
> This solution is nice, but preserves the long sys.path that Ronny wanted
> to avoid in the first place.
>
> Another detail that needs mentioning is that, for .pth based sys.path
> manipulation to work, the <installed files> would need to be all the files
> from purelib and platlib directories from wheels mashed together instead of
> a simple unpacking of the wheel (though I guess the .pth file could add
> both purelib and platlib subfolders to sys.path...)
>

Virtual environments already tend to mash those file types together anyway
- it's mainly Linux system packages that separate them out.


> Another possibility that avoids the issue of a long sys.path is to use this
> layout but with symlink farms instead of either sys.path manipulation or
> conda-like hard-linking.
>
> Symlinks would preserve the filesystem size visibility that Ronny
> wanted, while allowing the layout above to contain wheels that were simply
> unzipped.
>

Yeah, one thing I really like about that install layout is that it
separates the question of "the installed package layout" from how that
package gets linked into a virtual environment. If you're only doing exact
version matches, then you can use symlinks quite happily, since you don't
need to cope with the name of the "dist-info" directory changing. However,
if you're going to allow for transparent maintenance updates (and hence
version number changes in the dist-info directory name), then you need a
*.pth file.


> In Windows, where symlinks require admin privileges (though this is
> changing), an option could be provided for using hard links instead
> (which never require elevated privileges).
>

Huh, interesting - I never knew that Windows offered unprivileged hard link
support. I wonder if the venv module could be updated to offer that as an
alternative to copying when symlinks aren't available.
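
If venv did grow such an option, the linking step might look roughly like
this (a hedged sketch; link_tree and its behaviour are invented for the
example, not an existing venv API):

    import os
    from pathlib import Path

    def link_tree(src: Path, dest: Path) -> None:
        # Mirror a shared, unpacked wheel at `src` into the venv
        # directory `dest` as a symlink farm, falling back to hard
        # links (which never need elevation) when symlinks fail.
        for root, _dirs, files in os.walk(src):
            rel = Path(root).relative_to(src)
            (dest / rel).mkdir(parents=True, exist_ok=True)
            for name in files:
                source = Path(root) / name
                target = dest / rel / name
                try:
                    target.symlink_to(source)
                except OSError:
                    os.link(source, target)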

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Distutils-SIG maillist  -  Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig


Re: [Distutils] [proposal] shared distribution installations

2017-10-31 Thread Leonardo Rochael Almeida
Hi,

On 31 October 2017 at 05:22, Nick Coghlan wrote:

> On 31 October 2017 at 05:16, RonnyPfannschmidt <
> opensou...@ronnypfannschmidt.de> wrote:
>
>> Hi everyone,
>>
>> since a while now various details of installing python packages in
>> virtualenvs caused me grief
>>
>> a) typically each tox folder in a project is massive, and has a lot of
>> duplicate files, recreating them, managing and iterating them takes
>> quite a while
>> b) for nicely separated deployments, each virtualenv for an application
>> takes a few hundred megabytes - that quickly can saturate disk space
>> even if a reasonable amount was reserved
>> c) installation and recreation of virtualenvs with the same set of
>> packages takes quite a while (even with pip caches this is slow, and
>> there is no good reason to avoid making it completely instantaneous)
>>
>
Those are issues that buildout solved long before pip was even around,
but they rely on sys.path expansion that Ronny found objectionable due to
performance issues.

I don't think the performance issues are that problematic (and wasn't there
some work on Python 3 that made import faster even with long sys.paths?).


> [...]
>
> However, there's another approach that specifically tackles the content
> duplication problem, which would require a new installation layout as you
> suggest, but could still rely on *.pth files to make it implicitly
> compatible with existing packages and applications and existing Python
> runtime versions.
>
> That approach is to create an install tree somewhere that looks like this:
>
> _shared-packages/
>     <distribution-name>/
>         <installed-version>/
>             <distribution-name>-<installed-version>.dist-info/
>             <installed files>
>
> Instead of installing full packages directly into a venv the way pip does,
> an installer that worked this way would instead manage a
> <distribution-name>.pth file that indicated
> "_shared-packages/<distribution-name>/<installed-version>" should be
> added to sys.path.
>

This solution is nice, but preserves the long sys.path that Ronny wanted to
avoid in the first place.

Another detail that needs mentioning is that, for .pth based sys.path
manipulation to work, the <installed files> would need to be all the files
from purelib and platlib directories from wheels mashed together instead of
a simple unpacking of the wheel (though I guess the .pth file could add
both purelib and platlib subfolders to sys.path...)
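
For illustration, a .pth file holds one sys.path entry per line, so the
two-subfolder variant might contain something like this (the purelib and
platlib subdirectory names are assumptions layered onto the tree above):

    _shared-packages/<distribution-name>/<installed-version>/purelib
    _shared-packages/<distribution-name>/<installed-version>/platlib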

Another possibility that avoids the issue of a long sys.path is to use this
layout but with symlink farms instead of either sys.path manipulation or
conda-like hard-linking.

Symlinks would preserve the filesystem size visibility that Ronny wanted,
while allowing the layout above to contain wheels that were simply unzipped.

In Windows, where symlinks require admin privileges (though this is
changing), an option could be provided for using hard links instead
(which never require elevated privileges).

Using symlinks into the above layout preserves all advantages and drawbacks
Nick mentioned other than the sys.path expansion.

Regards,

Leo
___
Distutils-SIG maillist  -  Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig


Re: [Distutils] [proposal] shared distribution installations

2017-10-31 Thread Nick Coghlan
On 31 October 2017 at 05:16, RonnyPfannschmidt <
opensou...@ronnypfannschmidt.de> wrote:

> Hi everyone,
>
> for a while now, various details of installing python packages in
> virtualenvs have caused me grief
>
> a) typically each tox folder in a project is massive and has a lot of
> duplicate files; recreating, managing and iterating them takes
> quite a while
> b) for nicely separated deployments, each virtualenv for an application
> takes a few hundred megabytes - that can quickly saturate disk space
> even if a reasonable amount was reserved
> c) installation and recreation of virtualenvs with the same set of
> packages takes quite a while (even with pip caches this is slow, and
> there is no good reason not to make it essentially instantaneous)
>
> in order to alleviate those issues I would like to propose a new
> installation layout,
> where instead of storing each distribution in every python environment,
> all distributions would share a storage, and each individual environment
> would only have references to the packages that were
> "installed/activated" for them
>

I've spent a fair bit of time pondering this problem (since distros care
about it in relation to ease of security updates), and the combination of
Python's import semantics with the PEP 376 installation database semantics
makes it fairly tricky to improve. Fortunately, the pth-file mechanism
provides an escape hatch that makes it possible to transparently experiment
with different approaches.

At the venv management layer, pew already supports a model similar to that
offered by the Flatpak application container format [1]: instead of
attempting to share everything, pew permits a limited form of "virtual
environment inheritance", via "pew add $(pew dir <other-venv>)" (which
injects a *.pth file that appends the other venv's site-packages directory
to sys.path). Those inherited runtimes then become the equivalent of the
runtime layer in Flatpak: applications will automatically pick up new
versions of the runtime, so the runtime maintainers are expected to
strictly preserve backwards compatibility, and when that isn't possible,
provide a new parallel-installable version, so apps using both the old and
the new runtime can happily run side by side.
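
The injected file is effectively a one-line *.pth entry; a minimal sketch
of the idea (the file name here is an assumption, not pew's actual
implementation):

    import sysconfig
    from pathlib import Path

    def inherit_site_packages(other_venv_site_packages: str) -> None:
        # Append another venv's site-packages to this venv's sys.path
        # by dropping a *.pth file into our own site-packages.
        site_packages = Path(sysconfig.get_paths()["purelib"])
        pth = site_packages / "_inherited.pth"  # hypothetical name
        with pth.open("a") as f:
            f.write(other_venv_site_packages + "\n")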

The idea behind that approach is to trade off a bit of inflexibility in the
exact versions of some of your dependencies for the benefit of a reduction
in data duplication on systems running multiple applications or
environments: instead of specifying your full dependency set, you'd instead
only specify that you depended on a particular common computational
environment being available, plus whatever you needed that isn't part of
the assumed platform.

As semi-isolated-applications-with-a-shared-runtime mechanisms like Flatpak
gain popularity (vs fully isolated application & service silos), I'd expect
this model to start making more of an appearance in the Linux distro world,
as it's a natural way of mapping per-application venvs to the shared
runtime model, and it doesn't require any changes to installers or
applications to support it.

However, there's another approach that specifically tackles the content
duplication problem, which would require a new installation layout as you
suggest, but could still rely on *.pth files to make it implicitly
compatible with existing packages and applications and existing Python
runtime versions.

That approach is to create an install tree somewhere that looks like this:

    _shared-packages/
        <distribution-name>/
            <installed-version>/
                <distribution-name>-<installed-version>.dist-info/
                <installed files>

Instead of installing full packages directly into a venv the way pip does,
an installer that worked this way would instead manage a
<distribution-name>.pth file that indicated
"_shared-packages/<distribution-name>/<installed-version>" should be
added to sys.path. Each shared package directory could include references
back to all of the venvs where it has been installed, allowing it to be
removed when either all of those have been updated to a new version, or
else removed entirely. This is actually a *lot* like the way
pkg_resources.requires() and self-contained egg directories work, but with
the version selection shifted to the venv's site-packages directory, rather
than happening implicitly in Python code on application startup.
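
As a rough illustration of what "managing" such a file could involve,
here's a hedged sketch of an activation step under that layout
(SHARED_ROOT, activate, and the record-keeping step are all assumptions
for the example, not an existing installer API):

    import sysconfig
    from pathlib import Path

    SHARED_ROOT = Path("/opt/_shared-packages")  # assumed location

    def activate(dist_name: str, version: str) -> None:
        # Point this venv at the one shared, unpacked copy of the
        # distribution by (re)writing a single *.pth file.
        target = SHARED_ROOT / dist_name / version
        site_packages = Path(sysconfig.get_paths()["purelib"])
        pth = site_packages / (dist_name + ".pth")
        pth.write_text(str(target) + "\n")
        # A real installer would also record this venv in the shared
        # directory, so the shared copy can be removed once no venv
        # references it.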

An interesting point about this layout is that it would be amenable to a
future enhancement that allowed for more relaxed MAJOR and MAJOR.MINOR
qualifiers on the install directory references, permitting transparently
shared maintenance and security updates.

The big downside of this layout is that it means you lose the ability to
just bundle up an entire directory and unpack it on a different machine to
get a probably-mostly-working environment. This means that while it's
likely better for managing lots of environments on a single workstation
(due to the reduced file duplication), it's likely to be worse for folks
that work on only a handful of different projects at any given point in
time (and I say that as someone with ~140 different local 

Re: [Distutils] [proposal] shared distribution installations

2017-10-30 Thread Ronny Pfannschmidt
I would like to explicitly avoid hardlink farms,
because those still have "logical" duplication -
I'd like to bind in the new paths without having it look like each
virtualenv is 400-1000 MB of distinct data

-- Ronny

On Monday, 30.10.2017, at 22:23 +0000, Thomas Kluyver wrote:
> On Mon, Oct 30, 2017, at 07:16 PM, RonnyPfannschmidt wrote:
> > in order to alleviate those issues I would like to propose a new
> > installation layout,
> > where instead of storing each distribution in every python
> > environment, all distributions would share a storage, and each
> > individual environment would only have references to the packages
> > that were "installed/activated" for them
> 
> This is also essentially what conda does - the references being in the
> form of hard links. The mechanism has some drawbacks of its own - like
> if a file somehow gets modified, it's harder to fix it, because removing
> the environment no longer removes the files.
> 
> Thomas
> ___
> Distutils-SIG maillist  -  Distutils-SIG@python.org
> https://mail.python.org/mailman/listinfo/distutils-sig
___
Distutils-SIG maillist  -  Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig


Re: [Distutils] [proposal] shared distribution installations

2017-10-30 Thread Chris Barker - NOAA Federal
> As far as impurities go, the behaviour I aim for would be mostly like
> virtualenv but without the file duplication.


For what it’s worth, conda environments use hard links where possible, so
limiting duplication...

Maybe conda would solve your problem...

-CHB
___
Distutils-SIG maillist  -  Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig


Re: [Distutils] [proposal] shared distribution installations

2017-10-30 Thread Thomas Kluyver
On Mon, Oct 30, 2017, at 07:16 PM, RonnyPfannschmidt wrote:
> in order to alleviate those issues I would like to propose a new
> installation layout,
> where instead of storing each distribution in every python environment,
> all distributions would share a storage, and each individual environment
> would only have references to the packages that were
> "installed/activated" for them

This is also essentially what conda does - the references being in the
form of hard links. The mechanism has some drawbacks of its own - like
if a file somehow gets modified, it's harder to fix it, because removing
the environment no longer removes the files.
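
A quick self-contained illustration of that drawback (purely
demonstrative): both names share one inode, so an in-place edit through
the environment's path also changes the shared copy:

    import os
    import tempfile

    d = tempfile.mkdtemp()
    shared = os.path.join(d, "shared_module.py")
    env_copy = os.path.join(d, "env_module.py")
    with open(shared, "w") as f:
        f.write("x = 1\n")
    os.link(shared, env_copy)      # conda-style hard link into an "env"
    with open(env_copy, "a") as f:
        f.write("x = 2\n")         # "fixing" the environment's copy...
    with open(shared) as f:
        print(f.read())            # ...shows the shared original changed too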

Thomas
___
Distutils-SIG maillist  -  Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig


Re: [Distutils] [proposal] shared distribution installations

2017-10-30 Thread Ronny Pfannschmidt
Hi Freddy,

I'm well aware what nix currently does for python packages, and have
suffered my fair share from it.

What I want to do is simply store the wheels that pip would first
generate and unpack into each environment in a shared location, so that
each environment shares the unpacked files more directly.

I'm not going to expand upon my perceived shortcomings of nix as I know
it, since that's irrelevant to this discussion and not something I have
the time and motivation to fix.

As far as impurities go, the behaviour I aim for would be mostly like
virtualenv but without the file duplication.

I believe nix could also benefit from parts of such a mechanism.

-- Ronny


On Monday, 30.10.2017, at 20:35 +0100, Freddy Rietdijk wrote:
> Hi Ronny,
> 
> What you describe here is, as you know, basically what the Nix
> package manager does [1]. You could create something similar specifically
> for Python, like e.g. `ied` is for Node [2], or Spack [3], which is
> written in Python. But then how are you going to deal with other
> system libraries, and impurities? And you will have to deal with
> them, because depending on how you configure Python packages that
> depend on them (say `numpy`), their output will be different. Or
> would you choose to ignore this?
> 
> Freddy
> 
> [1] https://nixos.org/nix/
> [2] https://github.com/alexanderGugel/ied
> [3] https://spack.io/
> 
> On Mon, Oct 30, 2017 at 8:16 PM, RonnyPfannschmidt <opensou...@ronnypfannschmidt.de> wrote:
> > Hi everyone,
> > 
> > for a while now, various details of installing python packages in
> > virtualenvs have caused me grief
> > 
> > a) typically each tox folder in a project is massive and has a lot
> > of duplicate files; recreating, managing and iterating them takes
> > quite a while
> > b) for nicely separated deployments, each virtualenv for an
> > application takes a few hundred megabytes - that can quickly
> > saturate disk space even if a reasonable amount was reserved
> > c) installation and recreation of virtualenvs with the same set of
> > packages takes quite a while (even with pip caches this is slow,
> > and there is no good reason not to make it essentially
> > instantaneous)
> > 
> > in order to alleviate those issues I would like to propose a new
> > installation layout,
> > where instead of storing each distribution in every python
> > environment, all distributions would share a storage, and each
> > individual environment would only have references to the packages
> > that were "installed/activated" for them
> > 
> > this would massively reduce the time required to create the
> > contents of the environments, and also the space required
> > 
> > since blindly expanding sys.path would lead to similar performance
> > issues as were seen with setuptools/buildout multi-version
> > installs, this mechanism would also need an element on
> > sys.meta_path that handles inexpensive dispatch to the toplevels
> > and metadata files of each package (offhand I would assume linear
> > walking of hundreds of entries simply isn't that effective)
> > 
> > however, some experimentation would be needed to see what tradeoff
> > is sensible there
> > 
> > I hope this mail will spark enough discussion to enable the
> > creation of a PEP and a prototype.
> > 
> > 
> > Best, Ronny
> > 
> > 
> > 
> > ___
> > Distutils-SIG maillist  -  Distutils-SIG@python.org
> > https://mail.python.org/mailman/listinfo/distutils-sig
> 
> 
___
Distutils-SIG maillist  -  Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig


Re: [Distutils] [proposal] shared distribution installations

2017-10-30 Thread Freddy Rietdijk
Hi Ronny,

What you describe here is, as you know, basically what the Nix package
manager does [1]. You could create something similar specifically for Python,
like e.g. `ied` is for Node [2], or Spack [3], which is written in Python. But
then how are you going to deal with other system libraries, and impurities?
And you will have to deal with them, because depending on how you configure
Python packages that depend on them (say `numpy`), their output will be
different. Or would you choose to ignore this?

Freddy

[1] https://nixos.org/nix/
[2] https://github.com/alexanderGugel/ied
[3] https://spack.io/

On Mon, Oct 30, 2017 at 8:16 PM, RonnyPfannschmidt <
opensou...@ronnypfannschmidt.de> wrote:

> Hi everyone,
>
> for a while now, various details of installing python packages in
> virtualenvs have caused me grief
>
> a) typically each tox folder in a project is massive and has a lot of
> duplicate files; recreating, managing and iterating them takes
> quite a while
> b) for nicely separated deployments, each virtualenv for an application
> takes a few hundred megabytes - that can quickly saturate disk space
> even if a reasonable amount was reserved
> c) installation and recreation of virtualenvs with the same set of
> packages takes quite a while (even with pip caches this is slow, and
> there is no good reason not to make it essentially instantaneous)
>
> in order to alleviate those issues I would like to propose a new
> installation layout,
> where instead of storing each distribution in every python environment,
> all distributions would share a storage, and each individual environment
> would only have references to the packages that were
> "installed/activated" for them
>
> this would massively reduce the time required to create the contents of
> the environments, and also the space required
>
> since blindly expanding sys.path would lead to similar performance
> issues as were seen with setuptools/buildout multi-version installs,
> this mechanism would also need an element on sys.meta_path that handles
> inexpensive dispatch to the toplevels and metadata files of each
> package (offhand I would assume linear walking of hundreds of entries
> simply isn't that effective)
>
> however, some experimentation would be needed to see what tradeoff is
> sensible there
>
> I hope this mail will spark enough discussion to enable the creation of
> a PEP and a prototype.
>
>
> Best, Ronny
>
>
>
> ___
> Distutils-SIG maillist  -  Distutils-SIG@python.org
> https://mail.python.org/mailman/listinfo/distutils-sig
>
___
Distutils-SIG maillist  -  Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig


[Distutils] [proposal] shared distribution installations

2017-10-30 Thread RonnyPfannschmidt
Hi everyone,

for a while now, various details of installing python packages in
virtualenvs have caused me grief

a) typically each tox folder in a project is massive and has a lot of
duplicate files; recreating, managing and iterating them takes
quite a while
b) for nicely separated deployments, each virtualenv for an application
takes a few hundred megabytes - that can quickly saturate disk space
even if a reasonable amount was reserved
c) installation and recreation of virtualenvs with the same set of
packages takes quite a while (even with pip caches this is slow, and
there is no good reason not to make it essentially instantaneous)

in order to alleviate those issues I would like to propose a new
installation layout,
where instead of storing each distribution in every python environment,
all distributions would share a storage, and each individual environment
would only have references to the packages that were
"installed/activated" for them

this would massively reduce the time required to create the contents of
the environments, and also the space required

since blindly expanding sys.path would lead to similar performance
issues as were seen with setuptools/buildout multi-version installs,
this mechanism would also need an element on sys.meta_path that handles
inexpensive dispatch to the toplevels and metadata files of each
package (offhand I would assume linear walking of hundreds of entries
simply isn't that effective)
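
For concreteness, a minimal, hedged sketch of what such a sys.meta_path
element might look like (all names here are invented for illustration; it
only handles import dispatch, not the metadata-file side, and delegates
the actual file lookup to importlib's standard path machinery, restricted
to the one directory that can satisfy the import):

    import sys
    from importlib.machinery import PathFinder

    class SharedDistFinder:
        """Dispatch imports by top-level name via one dict lookup,
        instead of linearly scanning hundreds of sys.path entries."""

        def __init__(self, toplevel_map):
            # e.g. {"requests": "/opt/_shared-packages/requests/2.18.4"}
            self.toplevel_map = toplevel_map

        def find_spec(self, fullname, path=None, target=None):
            if "." in fullname:
                # submodules resolve through the parent package's
                # __path__, so only top-level names are dispatched here
                return None
            location = self.toplevel_map.get(fullname)
            if location is None:
                return None  # defer to the rest of sys.meta_path
            return PathFinder.find_spec(fullname, [location])

    sys.meta_path.insert(0, SharedDistFinder({}))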

however, some experimentation would be needed to see what tradeoff is
sensible there

I hope this mail will spark enough discussion to enable the creation of
a PEP and a prototype.


Best, Ronny



___
Distutils-SIG maillist  -  Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig