Danek Duvall wrote:
On Wed, Jul 15, 2009 at 02:57:26PM -0500, Shawn Walker wrote:

    Catalogs are not separated by locale since manifests are not,

Hm.  Do we really want to bloat a catalog with dozens of languages?  I
assume that almost all clients will want just one language, and however
fast the serialization and deserialization is, multiplying the data in
there is going to slow things down.

I don't see this being an issue for "base" or "depend", but definitely for
"summary".

The issue comes in with signing. I'm willing to rework this to split the catalogs by locale again, but when I asked Bart about it, he mentioned that there are currently no plans to split manifests by locale.

The reasoning, by extension, was that the same would have to be applied to catalogs.

    and all data is assumed to be encodable using UTF-8.

We'll want an escape for this, I'm pretty sure.  My plan for the manifests
was going to be a set action, at or near the beginning of the manifests,
that provided the encoding name, at which point the engine could switch
over to that encoding.  For manifests which may change encoding from action
to action (descriptions in multiple languages, for instance), the action
itself could provide the encoding, and if that encoding is in the action
string prior to any values that use it, we can switch over at that point in
preparation.
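
A rough sketch of the switch-over described above, assuming the manifest arrives as raw byte lines and that the set action uses a hypothetical attribute name "encoding". Neither that name nor the decode_manifest() helper comes from the proposal; they are placeholders for illustration only.

```python
import codecs


def decode_manifest(raw_lines, default="utf-8"):
    """Decode byte lines, switching codec when an encoding action is seen."""
    encoding = default
    decoded = []
    for raw in raw_lines:
        text = raw.decode(encoding)
        decoded.append(text)
        parts = text.split()
        # Hypothetical action: "set name=encoding value=<codec name>"
        if parts[:2] == ["set", "name=encoding"]:
            for part in parts[2:]:
                if part.startswith("value="):
                    name = part[len("value="):]
                    codecs.lookup(name)  # raises LookupError if unknown
                    encoding = name      # applies to the lines that follow
    return decoded
```

The action that names the encoding has to be plain ASCII itself, since it is decoded before the switch takes effect.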

I imagine that this could be done here, but I'm not sure whether JSON
provides for multiple encodings in-stream.

And then there are any issues that the unicode() type might have with
certain character sets, if such issues exist.

This may or may not put a damper on using JSON, but JSON only allows Unicode:

"The character encoding of JSON text is always Unicode. UTF-8 is the only encoding that makes sense on the wire, but UTF-16 and UTF-32 are also permitted." [1]

If we must support non-Unicode encoding, then I will have to abandon JSON in favour of the manifest-style format.
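
To make the constraint concrete, here is a small sketch using Python's standard json module (not necessarily the simplejson build pkg uses). The summary text is sample data, not taken from any real package.

```python
import json

# Sample non-ASCII summary text (illustrative only).
entry = {"summary": "Outil de gravure de DVD (fran\u00e7ais)"}

# Default output escapes non-ASCII characters, so the text is pure ASCII.
escaped = json.dumps(entry)

# With ensure_ascii=False the characters are kept and the text is then
# encoded as UTF-8 for the wire.
wire = json.dumps(entry, ensure_ascii=False).encode("utf-8")

# Both forms decode back to the same Unicode data.
assert json.loads(escaped) == json.loads(wire.decode("utf-8")) == entry

# Data in a legacy encoding has to be transcoded to Unicode first.
legacy = "fran\u00e7ais".encode("iso-8859-1")
as_unicode = legacy.decode("iso-8859-1")
```

Either way, the data has to be representable as Unicode before it reaches the encoder; there is no way to mark part of a JSON stream as being in another encoding.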

        created:
            The value is an ISO-8601 formatted date in UTC time
            indicating when the catalog was created.

        last-modified:
            The value is an ISO-8601 formatted date in UTC time
            indicating when the catalog was last updated.

        Example:

        {
            'created': '2005-06-14T08:00:00.686485',
            'last-modified': '2009-05-08T16:10:25.686485',

These should be in UTC, right?  Also, is there a reason you chose the
extended format, rather than the basic format used in the FMRIs?

Yes, those should have a 'Z' on the end. And yes, I used the extended ISO-8601 format instead of the basic format used in FMRIs, since it is a standard format that should be easily parsable and requires no explanation or special logic (ignoring the fact that Python 2.4 can't parse microseconds in ISO-8601 dates for datetime objects, even though the standard allows them).
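
For comparison, a minimal sketch of both formats using the standard datetime module. It assumes Python 2.6 or later, where the %f (microseconds) directive is available; the helper names are mine, and naive datetime objects are taken to be in UTC.

```python
from datetime import datetime

EXTENDED = "%Y-%m-%dT%H:%M:%S.%fZ"  # e.g. 2009-05-08T16:10:25.686485Z
BASIC = "%Y%m%dT%H%M%SZ"            # e.g. 20090218T042840Z (FMRI style)


def to_catalog_ts(dt):
    """Format a naive datetime (assumed UTC) for the catalog."""
    return dt.strftime(EXTENDED)


def from_catalog_ts(value):
    """Parse a catalog timestamp back into a naive UTC datetime."""
    return datetime.strptime(value, EXTENDED)


modified = from_catalog_ts("2009-05-08T16:10:25.686485Z")
fmri_ts = datetime.strptime("20090218T042840Z", BASIC)
```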

    - catalog.<name>

        Catalog files will contain a python dict structure serialized
        in JSON (JavaScript Object Notation) format.  Version entries
        for each package stem are kept in ascending version order to
        allow fast lookups by the client and avoid sort overhead on
        load.  The structure can be described as follows:

        {
            <publisher-prefix>: {
                <FMRI package stem>: [
                    {
                        "op-time": <ISO-8601 Date and Time>

What's "op-time" doing in the catalog?

copy-paste bug.  That was intended for the updatelog only.
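
With "op-time" left out, a catalog file can be sketched as below. The publisher prefix and the older version entry are made up for illustration, and newest() is a hypothetical helper, not part of the client api.

```python
import json

catalog = {
    "example.org": {              # hypothetical publisher prefix
        "SUNWdvdrw": [            # FMRI package stem
            {
                # hypothetical older entry, listed first (ascending order)
                "version": "5.21.4.10.8,5.11-0.107:20090204T031500Z",
                "actions": [],
            },
            {
                "version": "5.21.4.10.8,5.11-0.108:20090218T042840Z",
                "actions": [
                    "set name=variant.arch value=sparc value=i386",
                ],
            },
        ],
    },
}

# Round-trip through JSON; lists keep their order.
loaded = json.loads(json.dumps(catalog))


def newest(cat, publisher, stem):
    """Return the newest entry, relying on the stored ascending order."""
    return cat[publisher][stem][-1]
```

Because the order is fixed when the catalog is written, the client can take the last list element instead of sorting versions on load.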

            "SUNWdvdrw":[
              {
                "version":"5.21.4.10.8,5.11-0.108:20090218T042840Z",
                "actions":[
                  "set name=variant.zone value=global value=nonglobal",
                  "set name=variant.arch value=sparc value=i386",
                  "depend [email protected] type=require",
                  "depend [email protected] type=require",
                  "depend [email protected] type=require"
                ]
              }
            ],

I wonder if we couldn't push the actions fully into JSON, too:

    "actions": [
        {
            "": "set",
            "name": "variant.zone",
            "value": ["global", "nonglobal"]
        },
        {
            "": "depend",
            "fmri": "[email protected]",
            "type": "require"
        }
    ]

except that this might add sufficient extra depth to the catalog that
serialization times on SPARC kick up again.

There are three reasons I didn't go that route:

1) serialization overhead (even on x86)

2) file-size difference (much larger)

3) Seemed pointless, since the client api consumes actions, and the fastest way to transform the data back into an action object is through the action string. As Bart pointed out to me the other day, the Catalog is a glorified action pipeline.
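
A quick way to see the size difference, as a sketch only: action_to_dict() below is a naive, hypothetical converter (it does not handle quoted values containing spaces) and is not how pkg parses actions.

```python
import json


def action_to_dict(action):
    """Naively turn an action string into the fully-JSON form."""
    name, *pairs = action.split()
    result = {"": name}
    for pair in pairs:
        key, value = pair.split("=", 1)
        if key in result:
            # Repeated keys become a list of values.
            prev = result[key]
            result[key] = prev + [value] if isinstance(prev, list) else [prev, value]
        else:
            result[key] = value
    return result


actions = [
    "set name=variant.zone value=global value=nonglobal",
    "set name=variant.arch value=sparc value=i386",
]

compact = {"separators": (",", ":")}
as_strings = json.dumps(actions, **compact)
as_dicts = json.dumps([action_to_dict(a) for a in actions], **compact)

print(len(as_strings), len(as_dicts))
```

Even with compact separators, the dictionary form is longer for these two actions, and it adds one more level of nesting per action for the encoder to walk.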

We've been talking about providing the install folks with a quick way of
determining the size of packages to be installed; it would make sense for
this information to go here as well.  There would be multiple entries, for
all the combinations of variants and facets.  It would also be synthetic,
not pulled directly from the manifest.

I would also expect package rename and obsoletion set actions to go here as
well.

I'll assume the above are FYIs only and not a request for a change to the proposed format or entries.

The structure also brings up the issue of stability order of entries, which
will be necessary for catalog signing.  Signing and verification could
always happen on a transform of the catalog, where dictionaries are turned
into lists of key/value pairs sorted by key, but given that we'll want to
verify the signature every time we read the catalog (I assume), then this
seems a bit expensive.  Perhaps you or Bart have given some more thought to
this?

I haven't, but Bart's comment about only doing signature validation when the catalog is updated or first retrieved seems right to me. Attempting to write the catalog in a pre-sorted order beyond very simple rules would require a custom writer (very likely) or a change to the data structure that naturally enforced order (such as lists). I've already enforced some order by ensuring version entries are always in ascending version order.
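
For reference, the "transform" approach can be approximated without a custom writer by asking the encoder to sort keys. This is only a sketch: the function names are mine, SHA-256 is an arbitrary choice, and the digest stands in for whatever would actually be signed.

```python
import hashlib
import json


def canonical_bytes(catalog):
    """Serialize with sorted keys and fixed separators for a stable byte form."""
    text = json.dumps(catalog, sort_keys=True, separators=(",", ":"))
    return text.encode("ascii")  # default ensure_ascii=True keeps output ASCII


def catalog_digest(catalog):
    """Digest of the canonical form; a signature would cover these bytes."""
    return hashlib.sha256(canonical_bytes(catalog)).hexdigest()
```

Dictionary key order no longer affects the result, while list order (such as the ascending version entries) still does. The sort adds cost to every serialization, which is one more reason to verify only when the catalog is first retrieved or updated.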

2.2  Server Changes

    To enable clients to retrieve the new catalog files and incremental
    updates to them, the following changes will be made:

    - The new catalog files will be stored in the /var/pkg/catalog

Don't you mean <repo>/catalog?  (See also the next comment.)

copy/paste bug :(

<repo>/catalog is correct.

2.3.1  Image Changes

    - The image object, upon initialization, will remove the
      /var/pkg/catalog directory and its contents if possible.
      If this cannot be done (due to permissions), the client
      will continue on.  If it can be removed, a new directory
      named /var/pkg/publisher will be created, and publisher objects
      will be told to store and retrieve their metadata from it.

Interesting.  This, plus your proposal to put the server-side catalog in
/var/pkg/catalog suggests that you're intending to be able to use /var/pkg
as the root of a repo on which you can just run a depot.  If so, I'm not
sure that's exactly the way you want to do it.  The files and manifests,
IMHO, really ought to be in a dataset of their own, outside of the ROOT
dataset, since they're completely shareable, and freeing up disk space
shouldn't be prevented simply because you're trying to remove old files
held on disk by an old snapshot.

Of course, sharing manifests and files (and even the catalog) between
client and server should be doable, but I'd like you to take the above into
account.

And my assumption could be completely wrong.  In which case I don't
understand the use of /var/pkg/publisher here instead of /var/pkg/catalog.

The point of /var/pkg/publisher is that metadata, other than catalog data, will eventually be stored on a per-publisher basis. So, instead of having /var/pkg/catalog, /var/pkg/<metatype1>, and /var/pkg/<metatype2>, I'm just organising this a bit more.

To be clear, the intent is really:

/var/pkg/publisher/<prefix>/

But, it probably should be:

/var/pkg/publisher/<prefix>/catalogs/
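
In code, the layout amounts to something like the following; publisher_catalog_dir() and its img_root parameter are hypothetical names used only to illustrate the path scheme.

```python
import posixpath


def publisher_catalog_dir(prefix, img_root="/var/pkg"):
    """Return the per-publisher catalog directory for an image."""
    return posixpath.join(img_root, "publisher", prefix, "catalogs")
```

Other per-publisher metadata types would then sit next to "catalogs" under the same <prefix> directory.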

    - For performance reasons, the client api will also store
      versions of each of the catalogs proposed that only
      contain entries for installed FMRIs to accelerate common
      client functions such as info, list, uninstall, etc.

Does/could this eliminate the need for /var/pkg/state/install, as well?
And the "installed" files.

It does. Especially since (long-term) the __PRE__ wackiness will be going away once we allow ranking of all publishers and no longer have a single, preferred publisher.
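
As a sketch of what the installed-only catalogs would hold, assuming the catalog structure proposed earlier and a hypothetical set of (publisher, stem, version) tuples describing what is installed:

```python
def installed_subset(catalog, installed):
    """Keep only the catalog entries whose FMRI is in the installed set."""
    subset = {}
    for pub, stems in catalog.items():
        for stem, entries in stems.items():
            # Filtering preserves the ascending version order of the source.
            kept = [e for e in entries if (pub, stem, e["version"]) in installed]
            if kept:
                subset.setdefault(pub, {})[stem] = kept
    return subset
```

The result has the same shape as a full catalog, so the same loading and lookup code can serve both.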

    It was discovered that the likely reason for poor serialization on
    some SPARC systems is that simplejson uses a recursive function-
    based iterative encoder that does not perform well on SPARC systems
    (due to register windows?).

We have a workaround now, but we should probably file a bug on this and
possibly work on an implementation that isn't recursive.

Where should that bug be filed?
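
For what it's worth, a non-recursive encoder does not have to be complicated. The sketch below is not simplejson's code and makes no claim about performance on SPARC; it only shows the explicit-stack technique, and it assumes string dictionary keys as in the catalog structure.

```python
import json


class _Raw(str):
    """Marks text that is already valid JSON output."""


def encode_iterative(obj):
    """Encode dicts, lists and scalars to JSON using an explicit stack."""
    out = []
    stack = [obj]
    while stack:
        item = stack.pop()
        if isinstance(item, _Raw):
            out.append(item)
        elif isinstance(item, dict):
            pieces = [_Raw("{")]
            for i, (key, value) in enumerate(item.items()):
                if i:
                    pieces.append(_Raw(","))
                # Assumes string keys, as in the catalog structure.
                pieces.append(_Raw(json.dumps(key) + ":"))
                pieces.append(value)
            pieces.append(_Raw("}"))
            # Push in reverse so pieces pop off in forward order.
            stack.extend(reversed(pieces))
        elif isinstance(item, (list, tuple)):
            pieces = [_Raw("[")]
            for i, value in enumerate(item):
                if i:
                    pieces.append(_Raw(","))
                pieces.append(value)
            pieces.append(_Raw("]"))
            stack.extend(reversed(pieces))
        else:
            # Scalars: str, int, float, bool, None.
            out.append(json.dumps(item))
    return "".join(out)
```

Nesting depth only grows the stack list here, not the interpreter's call stack.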

Cheers,
--
Shawn Walker

[1] http://www.json.org/fatfree.html