[racket-dev] package-system update

Matthew Flatt Sat, 13 Jul 2013 11:59:24 -0700

Here's a big-picture update of where we are in the new package system
and the conversion of the Racket distribution to use packages.


This message covers

 - how I see things working after the package system and
   reorganization is done, and a report on what pieces are still
   missing to reach that vision;

 - a look at how we got to our current design/reorganization choices
   and whether we're choosing the right place; and

 - speculation on why the package changes have been so difficult to
   implement.

All of that makes it a long message (sorry!), but I hope this message
is useful to bring us more in sync.


A Package-Based Racket
----------------------

Let's take a look at how you'll do various things in the new
package-based Racket world.

(There's no new information here, and parts marked with "[guess]" are
especially speculative.  Still, some details may be clearer than in
earlier accounts, now that much of it is implemented, and I think a
comprehensive review may be useful.)

** Downloading release installers from PLT

The "www.racket-lang.org" site's big blue button will provide the same
installers that it does now, at least by default. That is, the content
provided by the installer --- DrRacket, teaching languages, etc. ---
will be pretty much the same as now.

The blue button might also provide the option of "Minimal Racket"
installers, which give you something that's as small as we can make it
and still provides command-line `raco pkg'.

** Downloading installers from other distributors

There are all sorts of reasons that the "main distribution" from PLT
might not fit the needs of some group. Maybe the release cycle is too
long or at the wrong time. Maybe it includes much too much, much too
little, or almost the right amount but missing a crucial
package. Maybe the group wants something almost minimal, but still
with a graphical package manager. Maybe some group uses a platform for
which PLT does not provide an installer.

For many of those groups, using a "Minimal Racket" installer plus
selective package installations will do the trick. For others,
creating a special set of installers might be worthwhile, but there
are too many reasons and too many permutations for PLT to provide
installers that cover all of them.

Fortunately, anyone can build a set of installers and put them on a
web page, and we make it as easy as possible to build a set of
installers that start with a given set of packages. PLT could host a
web page or wiki that points to other distributors. PLT might even be
able to provide an automated service that generates a set of
installers for a basic set of platforms.

** Compiling a release from source

In addition to installers, a download site can provide a source-code
option (not specific to any platform, unlike the current source
packages), which would mainly be used for building Racket on
additional platforms.

This option is mostly a snapshot of the source-code repository for the
core, but it includes a pre-built "collects" tree (see "technical
detail", below) and a default configuration that points back to the
distributor's site for pre-built packages.

** Adding or upgrading supported packages

In much the same way that you can easily install a set of supported
packages on your current OS, you'll be able to easily install a set of
packages that are supported by your distributor. Those packages are
pre-built, so they install quickly, along with any included
documentation.

Depending on the distributor and installer, packages might be
downloaded and installed in "binary" form, which means that tests and
source code (for libraries and documentation) are omitted from the
package. PLT seems unlikely to provide such installers in the near
future.

The default package scope configured by a distribution tends to be
"user", which means that packages are installed in a user-specific
location.

Package updates can be made available by distributors for whatever
reason and on whatever timetable they see fit.

If your distribution is from PLT, then the supported packages are
called "ring-0" packages. Ring-0 packages include contributions from
third parties (i.e., not just packages implemented by PLT) that are
vetted and regularly tested by PLT.

[Guess:] The "Racket" and "Minimal Racket" distributions might point
to different pre-built package catalogs. Possibly, the "Racket"
catalog never updates packages that were included in the installer (on
the grounds that the user may not have write permission to the
install), while the "Minimal Racket" catalog includes more frequent
updates for bug fixes (on the grounds that the user can update any
installed package).

A distributor doesn't necessarily have to provide its own package
catalog. It can instead supply an installer that works with packages
as served by some other distributor's catalog, such as PLT's
catalog. (See "technical detail" below.)

A user can also redirect `raco pkg' to a different catalog server,
instead of using the configuration that was supplied by the
installer. Binary, pre-built, and source variants of a package can be
"updated" to each other in any direction.

** Adding or upgrading other packages

An installer-provided configuration will normally point to a catalog
of packages that are not specifically supported by the distributor but
are still readily available --- probably mostly in source form and
directly pulled from a git repository. In particular,
"pkg.racket-lang.org" provides packages in source form.

** Reading documentation

A distribution site provides online documentation (including all
supported packages) alongside installers and packages.

Many installers and packages include documentation to be installed on
a user's machine, but there are some packages that provide libraries
without documentation. For example, "gui-lib" provides GUI libraries
without local documentation, while "gui" combines the "gui-lib"
libraries with local documentation.

Sometimes, documentation that is installed locally will still refer to
documentation that is not downloaded. Such links are directed back to
the distributor's site. That situation won't happen often for
pre-built packages, because links that go to other packages will tend
to go to packages that are dependencies. It will happen more for
binary packages, because the dependency can be build-time only.
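
As a rough illustration of that fallback, here is a small Python
sketch: a link goes to the local copy when the target documentation is
installed, and back to the distributor's site otherwise. The function
name, the "index.html" layout, and both root URLs are assumptions made
up for this example; they are not how the documentation system is
actually implemented.

```python
def resolve_doc_link(doc_name, installed_docs, local_root, remote_root):
    """Return a URL for doc_name, preferring a locally installed copy."""
    # Fall back to the distributor's site when the doc is not installed.
    root = local_root if doc_name in installed_docs else remote_root
    return root.rstrip("/") + "/" + doc_name + "/index.html"


installed = {"gui"}                          # docs present on this machine
local = "file:///opt/racket/doc"             # hypothetical local doc root
remote = "https://distributor.example/doc"   # hypothetical distributor site

print(resolve_doc_link("gui", installed, local, remote))
print(resolve_doc_link("scribble", installed, local, remote))
```

The first link stays local; the second points at the (hypothetical)
distributor site because "scribble" is not in the installed set.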

** Creating new packages

A minimal package is a directory. So, let's suppose that you have some
modules in a directory that you want to turn into a package. Suppose
that your directory is called "potato", and it has a module file
"eat.rkt".

Turn your directory into a locally installed package with

   raco pkg install --link potato

Then, you can use "eat.rkt" with

   (require potato/eat)

To give your package to someone else, you could zip up the "potato"
directory as "potato.zip", and the other person would install with

   raco pkg install potato.zip

Note that you can use any zip archiving tool, or you can use

   raco pkg create --from-install potato

to create the ".zip" file, which has the advantage that directories
like "compiled" and ".git" are omitted.
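
If you go the generic-zip route instead, you can get a similar effect
by skipping those directories yourself. The Python sketch below shows
one way to do it. The function name `zip_package' and the exact skip
list are assumptions for illustration, and it assumes that the archive
should hold the package content at its root; it is not a description
of what `raco pkg create' does internally.

```python
import os
import zipfile

# Directory names to leave out of the archive. This list is an
# assumption based on the two examples mentioned in the text.
SKIP_DIRS = {"compiled", ".git"}


def zip_package(pkg_dir, zip_path):
    """Archive pkg_dir into zip_path, skipping SKIP_DIRS at any depth.

    Returns the list of archived paths, relative to pkg_dir.
    zip_path should be outside pkg_dir so the archive is not included
    in itself.
    """
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(pkg_dir):
            # Prune in place so os.walk never descends into skipped dirs.
            dirs[:] = sorted(d for d in dirs if d not in SKIP_DIRS)
            for name in sorted(files):
                full = os.path.join(root, name)
                # Store paths relative to the package directory itself.
                rel = os.path.relpath(full, pkg_dir).replace(os.sep, "/")
                zf.write(full, rel)
                names.append(rel)
    return names

# Example use (paths are hypothetical):
#   zip_package("potato", "potato.zip")
```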

Even better, maybe your directory is already on GitHub at
"http://github.com/idaho/potato". Then, others can install your
package with

   raco pkg install github://github.com/idaho/potato/master
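
That source string packs several pieces of information into one
URL-like form. As a rough illustration, the Python sketch below splits
it into host, user, repository, and branch. The helper name
`parse_github_source' and the field names are hypothetical, and the
only format assumed is the github://<host>/<user>/<repo>/<branch>
shape shown in the example above; the package manager may accept other
shapes that this sketch rejects.

```python
from collections import namedtuple

# Field names are illustrative, not taken from the package manager.
GitHubSource = namedtuple("GitHubSource", "host user repo branch")


def parse_github_source(source):
    """Split 'github://<host>/<user>/<repo>/<branch>' into its parts."""
    prefix = "github://"
    if not source.startswith(prefix):
        raise ValueError("not a github:// source: " + source)
    parts = source[len(prefix):].split("/")
    # Only the four-part shape from the example is handled here.
    if len(parts) != 4 or not all(parts):
        raise ValueError("expected github://<host>/<user>/<repo>/<branch>")
    return GitHubSource(*parts)


print(parse_github_source("github://github.com/idaho/potato/master"))
```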

If you push changes to your GitHub repository, others can get them
with

   raco pkg update potato

If you're ready for the world to use your package, then go to
"pkg.racket-lang.org" and point the package name "potato" at your
GitHub repository. Then, not only will others know about your package,
they'll be able to install it with

   raco pkg install potato

Finally, if you'd like PLT to include your package as a pre-built
package with each snapshot and release, then go back to
"pkg.racket-lang.org" and request ring-0 status for the package.
Ring-0 status may require a few bureaucratic improvements to your
package, such as including an "info.rkt" file if you don't have one
already, because those details are needed to keep your package in
working order.

** Using the cutting edge

PLT provides one or more snapshot sites that work the same as the
release site, except that each snapshot's catalog expires after a few
days. When that catalog goes away, you can continue to use the
snapshot, but you'll have to get packages and updates via source.

** Using the bleeding edge

A user who wants to work with the minute-by-minute latest can start by
cloning the core Racket git repository and running `configure', `make',
and `make install' to get a Minimal Racket build. Then, start
installing packages with `raco pkg'.

The default package catalog in built-from-source Racket is
"pkg.racket-lang.org", which means that you get all packages in source
form from various git repositories, including for PLT-maintained
packages. The default package scope is "installation".

If you run `raco pkg update -a', then you likely get updates and
trigger many compiles. Eventually, an update will fail, because your
core Racket version is too old, and you'll need to `git pull',
`configure', `make', and `make install' --- if you haven't been doing
that, anyway. Since packages were added with installation-wide scope,
`make install' rebuilds your previously installed packages, too.

** Using the bleeding edge as a PLT developer

As a convenience to PLT developers, who tend to work on a particular
set of packages, there is an alternate way of working on the bleeding
edge (which anyone can use, if they prefer).

[Guess #1:] Instead of cloning the core Racket repo, clone a "main
distribution" repo that has the core Racket repo as a submodule, plus
git submodules for each of the packages that are dependencies of
"main-distribution". In other words, you get something that looks like
the current Racket repo, but that uses git submodules.

[Guess #2:] Instead of cloning the core Racket repo from GitHub, you
clone from the "main distribution" repository, just like now. In
addition to being mirrored to GitHub directly, individual parts of the
"main distribution" repo are mirrored as GitHub repositories, and
the mirrors are the ones that "pkg.racket-lang.org" references.

GitHub repositories that correspond to packages (submodules in guess
#1, mirrored subtrees in guess #2) are registered with
"pkg.racket-lang.org", which is how users on the bleeding-edge might
normally get the packages.

** Becoming a distributor

If you want to create installers like PLT's, then it's simplest to
clone the git repo like a PLT developer, and then use `make
installers'.

Alternatively, you can use `make installers-from-catalog' to create a
set of installers based on packages pulled from a specified catalog.

Either way, if you want to piggy-back on some other installer's set of
pre-built packages, then there will be configuration options and/or
makefile targets to do that. (This is more sketchy; see below.)

** Taking your own snapshot of Racket and packages

Sometimes, you don't need to build installers, but you'd still like a
snapshot of the current Racket core and packages. You might want to
edit the snapshot to upgrade some packages while keeping others the
same.

The `raco pkg catalog-copy' command is one of many tools to manipulate
catalog servers. For packages that are mapped to GitHub repositories,
merely copying a catalog doesn't archive the code, but it archives a
particular commit id. It's always possible to grab a copy of a package
repository and reference the copy from a catalog.
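
To make the "archives a commit id, not the code" point concrete, here
is a Python sketch that models a catalog as a plain dictionary from
package name to a source and a commit id. The dictionary layout, the
"source" and "checksum" field names, the commit ids, and the "tomato"
package are all assumptions for illustration; this does not call or
imitate the real `raco pkg catalog-copy' implementation.

```python
import copy


def snapshot_catalog(catalog):
    """Copy a catalog so later changes to the original do not leak in."""
    return copy.deepcopy(catalog)


def upgrade(snapshot, live, names):
    """Return a catalog that takes `names` from live and the rest from
    the snapshot, leaving both inputs untouched."""
    result = copy.deepcopy(snapshot)
    for name in names:
        result[name] = copy.deepcopy(live[name])
    return result


# Hypothetical catalog: each entry pins a repository to one commit id.
live = {
    "potato": {"source": "github://github.com/idaho/potato/master",
               "checksum": "aaa111"},
    "tomato": {"source": "github://github.com/idaho/tomato/master",
               "checksum": "bbb222"},
}
snap = snapshot_catalog(live)

# Both repositories move on after the snapshot is taken.
live["potato"]["checksum"] = "ccc333"
live["tomato"]["checksum"] = "ddd444"

# Upgrade only "potato"; "tomato" stays at the snapshot's commit.
mixed = upgrade(snap, live, ["potato"])
print(mixed["potato"]["checksum"], mixed["tomato"]["checksum"])
```

The snapshot keeps pointing at the old commits, but the code for those
commits still lives only in the original repositories, which is why
grabbing a copy of the repository is a separate step.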


A Technical Detail
------------------

Starting from scratch twice with the same Racket sources does not lead
to compatible pre-built packages, unfortunately, because bytecode
files are not generated deterministically. Maybe we'll be able to fix
that, one day.

Meanwhile, pre-built packages depend on a particular build of the
libraries in "collects", as well as a particular build of any
dependencies. So, if a distributor wants to enable other distributors
that use the same catalog of pre-built packages, the distributor must
serve a "collects" tarball, too. Providing the "collects" will be
built into the snapshot support.


From Here to There
------------------

The snapshot site

   http://www.cs.utah.edu/plt/snapshots/

demonstrates how a lot is working.

Here are the remaining implementation issues:

 * Generated distribution sites do not yet include a source code
   option or "collects.tgz" for piggy-backing distributors, and the
   makefile or configuration file lacks support for piggy-backing.

   These seem straightforward to add.

 * The PLT-maintained packages are not yet reflected on
   "pkg.racket-lang.org".

   Because all of those packages are currently in one big git
   repository, it's not clear how to register the packages. Guesses #1
   and #2 in "Using the bleeding edge as a PLT developer" above are
   two possible routes. Another is that we set up a process to pull
   from git and bundle package sources into individual zip archives
   that are registered on "pkg.racket-lang.org".

 * The `make installers' support needs to be less tied to
   "main-distribution".

   You can configure the set of packages that are built and included
   in installers by `make installers', but that set currently must be
   a subset of the packages in the "pkgs" directory of the Racket
   repository. It's easy in principle to pull the packages from a
   catalog server, but there will be some issues to sort out in the
   bootstrapping process and in ensuring a consistent snapshot.

 * No support yet for generated distribution sites with binary
   packages.

   Probably not too difficult. I forget what went wrong last time I
   tried this, but a lot has been fixed since then. In any case, the
   idea of binary packages does not seem to have gained much traction.

 * Package-dependency checking for tests.

   Maybe it's just a matter of compiling tests and sorting them into
   suitable packages, like everything else, which is a direction that
   we've already started.

 * The "main-distribution" package needs to be cleaned up.

   The "main-distribution" package currently includes tests, and it
   includes packages like "honu" that are not in the current release.
   This clean-up task is related to sorting out tests.

 * Different build modes are not yet configured with different
   default package scopes.

   Should be easy.

I also have a long-ish list of minor repairs and usability
improvements to tackle.


From Back There to Here
-----------------------

I think the big-picture plans are probably uncontroversial.

When it comes to the details of exactly how things work and how things
are named, I'm hearing less confidence or less agreement. Some of us
are steeped in the issues and have different opinions. Others seem
overwhelmed by the details, unsure of how it will all work out, and
disconcerted by conflicting messages from others who seem to
understand the issues. For people who are in that last group or close
to it, it may seem overall that we're moving into a new package system
too quickly.

The decision to split Racket into packages has stressed our
development process, because now we're tackling two hard problems
instead of one: developing a package system and using it on a big pile
of code. I think a good case could be made that the package system is
too new to trust with a big shift. At the same time, my sense is that
waiting until the package system is good enough isn't how software
works; a piece of software becomes good enough for its job only when
you make it do its job.

From what I hear, the issues that make people uncomfortable fit into
three categories:

 * Package-system design
   
 * Repository organization

 * Concerns that a more distributed ecosystem means a less unified one

Let's take them one at a time.

** Package-system design

We all appreciate the work that Jay did to design the package
system. I hear lingering concern about the design, including its
limited support for versioning (just dependency checks), the fact that
the package system is outside the module system (no built-in
auto-download of packages, although a tool like DrRacket can suggest
package installs in response to missing-library exceptions), its
stance on conflicts (simply disallowed), and its flat namespace (which
could make conflicts more frequent).
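
To illustrate what "simply disallowed" means in a flat namespace, here
is a Python sketch that reports module paths supplied by more than one
package. Treating a conflict as "two packages supply the same module
path" is the assumption here, and the "spud" package and its modules
are made up for the example; the real check in the package manager may
differ in detail.

```python
def find_conflicts(packages):
    """Map each module path supplied by two or more packages to the
    sorted list of packages that supply it.

    packages: dict from package name to an iterable of module paths.
    """
    providers = {}
    for pkg, modules in packages.items():
        for mod in modules:
            providers.setdefault(mod, set()).add(pkg)
    return {mod: sorted(pkgs)
            for mod, pkgs in providers.items()
            if len(pkgs) > 1}


# Hypothetical packages: both claim the module path "potato/eat".
packages = {
    "potato": ["potato/eat"],
    "spud": ["potato/eat", "spud/mash"],
}
print(find_conflicts(packages))
```

Under this model, installing both packages would be refused rather
than resolved by some priority rule.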

On some of the points, I think reasonable people will disagree. We've
had a years-long discussion, and we've been paying attention to
precedents. We've explored some nearby alternatives to the current
design (I'm thinking of single-collection versus multi-collection
packages). I think we've gotten as close to consensus as possible.

** Repository organization

As we try to split the Racket repository into packages, the questions
concern how finely to split the repository and how to eventually
allocate packages to source-code repositories.

I think the initial split of the Racket repository went more smoothly
than anyone expected. It was fairly easy, for example, to extract a
relatively small core to run `raco pkg', or to draw a line between
DrRacket and the teaching languages. I chalk that up to general
competence among the Racket implementors: big systems must be
developed in layers, whether the layers are declared or not.

In fact, it has worked out so well that the splitting of Racket into
packages has taken a more aggressive form than I expected. At this
point, we've split the Racket repository into 137(!) packages, and
that number is still growing. Two of us tried to make a coarser split,
and it didn't feel right. Others have since started shuffling packages
and continue to split things further. We seem to really like declaring
dependencies and reducing unrequested functionality.

Given that packages are going to be split finely, the question of
allocating packages to repositories is less straightforward. We've
concluded that "scribble-lib" and "scribble-doc" are good to have as
separate packages, but we don't want Scribble's implementation and
its documentation to end up in separate source-code repositories. At
the same time, putting everything in one big repository is
intractable, at least at the point where we want
packages downloaded directly from a repository. (A package can be a
subdirectory of a repository, but the package manager has to download
a tarball of the entire tree to extract the subdirectory.) So, under
"pkgs", we have an extra layer in the directory hierarchy to reflect
an intended organization into repositories. Using a layer of
directories is consistent with git submodules, if we choose to go that
way.

The fact that many of us have tried and arrived at the same conclusion
on granularity gives me confidence that it's a reasonable conclusion,
but the current Racket repository organization really does feel
complex. For example, the core of `raco setup' is

   racket/lib/collects/setup/setup-unit.rkt

while the Scribble part of `raco setup' is in

   pkgs/racket-pkgs/racket-index/setup/scribble.rkt

Those paths reflect that `raco setup' is mostly core functionality,
but you don't get documentation setup until you install the
"racket-index" package, which is currently grouped with other
almost-core packages.

This example also illustrates how the current organization relies on
collection splicing in a big way. In the long run, not many
collections are going to be spliced so much as, say, "racket" and
"data", but splicing two or three times to separate modules,
documentation, and tests may turn out to be common.
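
Here is a Python sketch of the splicing idea, using the two "setup"
paths above: one collection name is looked up across several directory
roots, and the first root that has the requested module wins. The
function name, the first-match rule, and the use of a plain set in
place of the filesystem are assumptions for illustration, not the
actual module-lookup algorithm.

```python
def resolve_module(collection_path, roots, exists):
    """Return the first '<root>/<collection_path>.rkt' that exists,
    or None when no root supplies the module."""
    for root in roots:
        candidate = root + "/" + collection_path + ".rkt"
        if exists(candidate):
            return candidate
    return None


# The two roots that each hold part of the "setup" collection.
roots = ["racket/lib/collects", "pkgs/racket-pkgs/racket-index"]

# A set of paths stands in for the filesystem in this sketch.
files = {
    "racket/lib/collects/setup/setup-unit.rkt",
    "pkgs/racket-pkgs/racket-index/setup/scribble.rkt",
}


def exists(path):
    return path in files


print(resolve_module("setup/setup-unit", roots, exists))
print(resolve_module("setup/scribble", roots, exists))
```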

And then there's

   pkgs/drracket-pkgs/drracket/drracket/drracket.rkt
             ^          ^         ^          ^
            repo      package  collection  module

Every layer before a "/" has multiple descendants, so the layers are
not trivially collapsed. If you just look at the path, it seems
crazy. But if you're expecting <repo>/<package>/<collection>/<module>,
then hopefully it seems reasonable.
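
If it helps to see the expectation spelled out, this Python sketch
splits such a path into its four layers. The function and field names
are made up for the example, and it assumes the path starts with the
"pkgs" directory as in the layout described above.

```python
from collections import namedtuple

# Field names follow the <repo>/<package>/<collection>/<module> reading.
PkgPath = namedtuple("PkgPath", "repo package collection module")


def split_pkg_path(path, base="pkgs"):
    """Split 'pkgs/<repo>/<package>/<collection>/<module>' into layers.

    Anything after the collection is kept together as the module part,
    so modules nested in subdirectories still parse.
    """
    parts = path.split("/")
    if len(parts) < 5 or parts[0] != base:
        raise ValueError("expected %s/<repo>/<package>/<collection>/<module>"
                         % base)
    repo, package, collection = parts[1:4]
    module = "/".join(parts[4:])
    return PkgPath(repo, package, collection, module)


print(split_pkg_path("pkgs/drracket-pkgs/drracket/drracket/drracket.rkt"))
print(split_pkg_path("pkgs/racket-pkgs/racket-index/setup/scribble.rkt"))
```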

In short, the current layout is driven by three factors: a bias toward
fine-grained packages, a sense that it's good to reflect layers and
dependencies via separate filesystem directories, and some constraints
on how directories relate to git repositories. Unless we change those
driving factors, I don't see us arriving at a simpler organization.

** Distributed versus unified ecosystem

While less prominent than the other categories, I'm also hearing some
concern that splitting up the Racket repository and reorganizing
various pieces of infrastructure will lead to a less unified system
--- or even a less unified community.

Moving our products and infrastructure into a more distributed form is
one of my main goals, but I don't think that "distributed" has to mean
"fragmented". It seems to me that the more distributed we are able to
make our world (the Internet, git, etc.), the more closely we are able
to work together. The math behind that effect eludes me, but I believe
in it, anyway.

At the same time, the sudden emphasis on reorganizing the Racket
repository could also give the impression that the new package system
is primarily about distributing Racket, and not about "third-party"
libraries and packages. I think we're trying to have as much of our
code as possible treated as "third-party", and thus ensure that all
parties are well supported.


Why Aren't We There, Yet?
-------------------------

We're hardly the first to design a package system or apply it to a big
system, and I can't shake the sense most of the time that we're just
reinventing the wheel. Along those lines, implementing the mechanics
of the package system has been suspiciously difficult.

I hope that part of the reason is our commitment to documentation ---
that it exists, that it builds reliably, that it's richly formatted,
and that it is pervasively cross-referenced and hyperlinked. I don't
think that any package system delivers documentation that's anything
like ours.

Could it also be an unusual commitment to relative paths, especially
when distributing pre-built items? A lot of problems go away if you
know that the library is going to be in "/usr/local/lib".

Surely part of it is trying to make `raco setup' fast for installing
packages. It's complex and fragile to perform an incremental
computation based on changes inferred from filesystem state.

Bootstrapping, at least, is known to be tricky. The Racket compiler
isn't written in Racket, yet, but the installer-creator installs
Racket packages to create a local installation that is used to set up
packages on a remote installation that runs a Racket script to build
an installer. It took many days to make that work and make it
configurable.

On the plus side, `raco setup' can usefully check package dependencies
and sort them into "build-time" and "run-time" dependencies, even for
documentation links, and that checking was relatively easy to
implement. Since module collection references can be synthesized at
run time, there's no way to completely check dependencies statically,
but I think we may end up with something that's more reliable and
complete than checking in other package systems. If so, maybe that
helps explain why it was hard.
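
As a rough picture of that sorting, the Python sketch below takes the
references that a package makes to other packages, tagged as "run" or
"build", and reports which ones are not covered by its declared
dependencies. The data model, the function names, and the rule that a
run-time declaration also covers build-time use are all assumptions
for illustration; this is not the algorithm that `raco setup' uses.

```python
def classify_dependencies(references, own_package):
    """Split referenced packages into run-time and build-time-only sets.

    references: iterable of (target_package, phase) pairs, where phase
    is "run" or "build" (documentation links count as "build" here).
    """
    run, build = set(), set()
    for target, phase in references:
        if target == own_package:
            continue  # a package does not depend on itself
        (run if phase == "run" else build).add(target)
    # Anything needed at run time is not "build-time only".
    return run, build - run


def missing_dependencies(references, own_package,
                         declared_run, declared_build):
    """Report referenced packages that no declaration covers."""
    run, build = classify_dependencies(references, own_package)
    declared_run = set(declared_run)
    declared_build = set(declared_build)
    return {
        "run": sorted(run - declared_run),
        # Assumed rule: a run-time declaration also covers build use.
        "build": sorted(build - declared_run - declared_build),
    }


# Hypothetical references made by the "potato" package.
refs = [
    ("gui-lib", "run"),         # a library module requires it
    ("scribble-lib", "build"),  # only the documentation needs it
    ("gui-lib", "build"),
    ("potato", "run"),          # self-reference, ignored
]
print(missing_dependencies(refs, "potato", ["gui-lib"], []))
```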

_________________________
  Racket Developers list:
  http://lists.racket-lang.org/dev
