On Tue, Sep 16, 2008 at 08:29:24PM +0530, Moinak Ghosh wrote:
> On Mon, Sep 15, 2008 at 10:21 PM, Stephen Hahn <[EMAIL PROTECTED]> wrote:
> > * Moinak Ghosh <[EMAIL PROTECTED]> [2008-09-14 19:31]:
> >> 3. Another fundamental restriction is that an IPS repo cannot be
> >> rsync-ed. IPS maintains an index in a huge sparse file, rendering
> >> rsync impossible. In addition, a running server is continuously
> >> accessing/updating metadata, making it unsafe for rsync. Rsync is a
> >> tried, proven, and highly optimized algorithm for mirroring, used by
> >> virtually every mirroring service on the planet, and distros need to
> >> support it.
> >
> > Dan pointed out that the index implementation changed some time ago.
> > I am uncertain why you believe that there is continuous change in the
> > metadata; such a belief is incorrect, and the discrete changes at
> > package publication time can be isolated from any rsync service.
> >
>
> This is fine and removes one big problem of the sparse file. However
> rsync is still not straightforward. When rsync-ing from server_a to
> server_b, the depotd on server_b will have to be stopped for the
> duration of the rsync. Alternatively one has to maintain a duplicate
> directory structure on server_b, rsync to that, and then cpio it to
> the actual depot to reduce downtime. In any case this involves some
> amount of round-about activity and does not fit the straightforward,
> zero-complexity distribution of content used all over the place today.
You know that existing files other than indexes will be unmodified, with
only additions and deletions. The indexes can be re-created locally.
So, what value does rsync add that couldn't be provided by a tool like
pkgrecv?
Consider too that by using the depot's stable interfaces (HTTP) you
avoid having to stop the depot during the mirror operation, whereas
with rsync, as you point out, you must stop the depot.
rsync is very useful, but it's not necessarily _the_ tool to use for any
project where mirroring is required.
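To sketch what an incremental HTTP-based mirror pass could look like: a
mirror only needs the difference between the remote and local catalogs,
because published files are never modified in place. The "/catalog/0"
path and the FMRI strings below are assumptions for illustration, not
verified against the depot's actual interface:

```python
# Sketch of an incremental mirror pass over the depot's HTTP interface.
# The catalog endpoint and FMRI strings are assumptions, for
# illustration only.
import urllib.request

def fetch_catalog(depot_url):
    """Fetch the package catalog over the depot's HTTP interface.

    The "/catalog/0" path is hypothetical; check the depot's real
    interface before relying on it.
    """
    with urllib.request.urlopen(depot_url + "/catalog/0") as resp:
        return resp.read().decode().splitlines()

def new_entries(remote_catalog, local_catalog):
    """Return entries present upstream but missing locally.

    Because existing files are only added or deleted, never modified,
    a mirror pass only has to fetch this difference -- no depot
    shutdown required.
    """
    return sorted(set(remote_catalog) - set(local_catalog))

# Example with made-up FMRIs:
print(new_entries(["pkg:/a@1.0", "pkg:/b@1.0"], ["pkg:/a@1.0"]))
```

This is essentially the comparison a tool like pkgrecv can perform
against a live depot, with no downtime on either side.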
> >> 8. IPS metadata is extremely opaque, making it impossible for
> >> anyone to understand, and the cost of corruption is high both on the
> >> installed system and on the repository server. With other solutions,
> >> repairing a corrupt repo can be as simple as an rsync from a mirror.
> >> We believe that simple human-readable metadata that adequately
> >> serves the purpose is enough and is in fact vital.
> >
> > I'm sure I'm too close to this, so you'll need to explain "extremely
> > opaque" and "impossible for anyone". What specific improvements would
> > lead to simple human-readability?
> >
>
> I will admit here that my original comment goes a little overboard. I have
> been compiling this list based on wide feedback and did not digest this
> one.
>
> However the approach of naming files in the repo by their hashes
> instead of their actual filenames is confusing. One cannot figure out
> what is what without cross-checking against the manifest.
I don't think this is an issue. There are bound to be multiple versions
of a package, and, therefore, file, in the depot, so the file naming
conventions that you hint at would have to be pretty complicated and
confusing too.
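For what it's worth, the hash-to-path mapping is recoverable from the
manifest's "file" actions with a few lines of Python. The sample
manifest text below is invented for illustration, but follows the
key=value action syntax:

```python
# Sketch: build a {hash: path} map from the "file" actions in a
# pkg(5)-style manifest. The sample manifest is made up.

def hash_to_path(manifest_text):
    mapping = {}
    for line in manifest_text.splitlines():
        parts = line.split()
        if not parts or parts[0] != "file":
            continue
        file_hash = parts[1]
        # Remaining tokens are key=value attributes; pick out path=...
        for attr in parts[2:]:
            if attr.startswith("path="):
                mapping[file_hash] = attr[len("path="):]
    return mapping

sample = """\
set name=pkg.fmri value=pkg:/example@1.0
file 1d5eac1aab628317f9c088d21e4afda9c754bb76 group=bin mode=0444 owner=root path=usr/bin/example
dir group=sys mode=0755 owner=root path=usr/bin
"""

print(hash_to_path(sample))
```

So "what is this hash?" is a one-pass scan over the manifest, not a
manual hunt.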
Personally I'd prefer that IPS use embedded MySQL or SQLite3 as its DB,
since that would make it easy to run very raw queries while safely
bypassing IPS (queries such as mapping those file hashes to {pkgname,
version, filepath} tuples, for example). It would also make hacking on
IPS easier, and it would allow mirroring the indexes without stopping
the depot or rebuilding them locally.
I suspect that will happen eventually.
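To illustrate, with a hypothetical SQLite schema (the table and column
names here are invented -- IPS does not ship such a schema today) the
hash-to-package lookup becomes a trivial query:

```python
# Sketch of raw queries against a hypothetical SQLite-backed depot
# index. Schema and sample row are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE files (
        hash     TEXT,
        pkgname  TEXT,
        version  TEXT,
        filepath TEXT
    )
""")
conn.execute(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    ("1d5eac1a", "SUNWcs", "0.5.11-0.96", "usr/bin/ls"),
)

def lookup(file_hash):
    # Map a content hash back to (pkgname, version, filepath) tuples,
    # without going through IPS at all.
    cur = conn.execute(
        "SELECT pkgname, version, filepath FROM files WHERE hash = ?",
        (file_hash,),
    )
    return cur.fetchall()

print(lookup("1d5eac1a"))
```

Any SQL-capable tool could then inspect or repair the index safely,
since SQLite serializes access at the database level.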
> >> 10. IPS performance seems to be on the low side. I have seen an
> >> image-update on a machine in the US taking 3 minutes to compute the
> >> update plan. My gut feeling is that the abstractions used are not
> >> playing to Python's strengths. Far too much complexity.
> >
> > We regularly run performance tests to see what operations are
> > expensive. I believe the bulk of the cost you are seeing is actually
> > directory scanning in the image, but the next performance checks will
> > confirm that.
> >
>
> It seems to be directory scanning from a little DTracing I did today but
> further digging is warranted.
readdir(3C) is synchronous, as are open(2), close(2), ... Very painful,
that. Of course, one can write multi-threaded code to traverse
filesystems; that's probably the solution to this problem.
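A minimal sketch of that idea, in Python since that's what IPS is
written in: treat each directory as a unit of work, so the synchronous
scan calls overlap across threads instead of serializing one directory
at a time. This is a sketch of the technique, not IPS code:

```python
# Sketch: multi-threaded filesystem traversal. Each directory scan is
# submitted as its own task, so blocking readdir()/stat()-style calls
# overlap across worker threads.
import os
from concurrent.futures import ThreadPoolExecutor

def _scan(path):
    """Scan one directory, returning (subdirectories, files)."""
    subdirs, files = [], []
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            subdirs.append(entry.path)
        else:
            files.append(entry.path)
    return subdirs, files

def walk_parallel(root, max_workers=8):
    """Collect all file paths under root using a thread pool."""
    files = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = [pool.submit(_scan, root)]
        while pending:
            subdirs, found = pending.pop().result()
            files.extend(found)
            # Queue newly discovered directories as further tasks.
            pending.extend(pool.submit(_scan, d) for d in subdirs)
    return files
```

Threads help here despite the GIL because the time is spent blocked in
system calls, not executing Python bytecode.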
> >> 11. IPS operations are somewhat opaque from the observability point of
> >> view. It is rather difficult for developers.
> >
> > Vague; please expand.
> >
>
> I will point you to an example:
> http://www.thewrittenword.com/www/projects/pkgutils/pkgadd/
Surely this level of verbosity can be added, no?
> > At what points would the download cache contents be useful in an
> > emergency, in a way beyond that envisioned by the fix subcommand to
> > pkg(1)?
> >
>
> Ability to see filenames makes it clear what is there in the cache. One
I would prefer to treat the cache as a black box.
> *) The feature of tagging within a package and filtering is not yet
> being used, and the potential to misuse this is already being
> exploited. Consider the monolithic 450M OpenOffice package. There is
> no way to install, say, a single component like Writer or a selected
> subset. One has to install the whole hog.
I don't see how your example relates to the problem definition you give.
Clearly there may be ways to split up packages so that sets of
components smaller than "all" can be installed. But that's a problem
for the folks doing the packaging, not a problem for IPS. And clearly
there may be a point at which a package cannot be split further.
Package tagging, however, can be very useful, I agree. But "sub-package
tagging" -- I'm not sure what that is.
> *) One final point from my observation: enterprises today have
> heterogeneous environments with Windows, Linux, Solaris and possibly
> other legacy OSes like, say, AIX. Leaving aside Windows and legacy
> systems, there are significant frameworks set up for controlled
> delivery of software to hundreds and thousands of boxes, typically
> involving a package repository. The management of all these can become
> a hell of a lot easier if it is possible to use a uniform repository
> across platforms. So the repository needs to be modular and extensible
> to different native packaging systems. Unfortunately IPS tightly
> couples packaging and the network repository, making this use-case
> impossible. If IPS had defined an independent stable on-disk format
> and had worked with an existing community repository project rather
> than re-doing everything from scratch, it would have made possible a
> common repository deployment for both Linux and Solaris, reduced
> administrative and maintenance cost, and reduced one small barrier to
> entry for OpenSolaris.
Is this about the current tight coupling between repository server and
publication service? If so, I agree, the two need to be separated, and
I already made this comment in response to the IPS ARC pre-inception
case.
Nico
--
_______________________________________________
pkg-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pkg-discuss