Re: [Pulp-dev] Integer IDs in Pulp 3

Jeff Ortel Thu, 19 Jul 2018 07:20:17 -0700

The PK for a task record in the db does not need to be the same as the job ID in rq/redis. Consistency is good. Let's make Task.id an int (like the rest of the tables) and add a job_id to correlate with rq/redis.
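
A minimal sketch of that layout, assuming nothing about Pulp's actual Django models: it uses the standard-library sqlite3 module as a stand-in, and the table name, the column names (job_id, state) and the helper create_task are all hypothetical.

```python
import sqlite3
import uuid


def create_task(conn, state="waiting"):
    """Insert a task row and return (integer pk, job_id)."""
    # The UUID is what would be handed to rq/redis as the job ID (assumption).
    job_id = str(uuid.uuid4())
    cur = conn.execute(
        "INSERT INTO task (job_id, state) VALUES (?, ?)", (job_id, state)
    )
    return cur.lastrowid, job_id


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE task (
        id INTEGER PRIMARY KEY AUTOINCREMENT,  -- integer PK like other tables
        job_id TEXT NOT NULL UNIQUE,           -- correlates the row with the job
        state TEXT NOT NULL
    )
    """
)

task_pk, job_id = create_task(conn)
# Look the task up from the job side using the correlation column.
row = conn.execute("SELECT id FROM task WHERE job_id = ?", (job_id,)).fetchone()
print(task_pk, job_id, row[0])
```

The unique constraint on job_id keeps the lookup from the job back to the task unambiguous, while foreign keys elsewhere can point at the small integer id.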


On 07/11/2018 03:20 PM, David Davis wrote:

I actually started working on converting IDs from UUIDs to integer IDs. It was pretty easy with one exception. Jobs in rq/redis are created using task id[0] and this job id needs to be a uuid. I see two possible solutions:


1. We leave task id as a UUID but every other id is an integer
2. We add a job uuid field on task

With the hard numbers that show that integer IDs are significantly faster, I think we should proceed unless anyone has a major objection.


Great work on this btw.

[0] https://github.com/pulp/pulp/blob/9bfc50d90a24c9d0ac4a93f5718187515b947058/pulpcore/pulpcore/tasking/tasks.py#L187


David

On Wed, Jul 11, 2018 at 3:56 PM Daniel Alley <[email protected]> wrote:


    w/ creating 400,000 units, the non-uuid PK is 30% faster at 42.22
    seconds vs. 55.98 seconds.

    w/ searching through the same 400,000 units, performance is still
    about 30% faster.  Doing a filter for file content units that have
    a relative_path__startswith={some random letter} (I put UUIDs in
    all the fields) takes about 0.44 seconds if the model has a UUID
    pk and about 0.33 seconds if the model has a default Django
    auto-incrementing PK.
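
As a quick check on what "about 30% faster" means for the timings quoted above, the sketch below expresses each pair both as time saved and as extra throughput. The function names are illustrative only; the four numbers are the ones reported in this message.

```python
def time_saved(fast, slow):
    """Fraction of the slower run's wall-clock time that the faster run saves."""
    return (slow - fast) / slow


def speedup(fast, slow):
    """How many times more work per second the faster run gets done."""
    return slow / fast


# (integer-PK seconds, UUID-PK seconds) as reported above
create = (42.22, 55.98)
search = (0.33, 0.44)

for label, (fast, slow) in (("create", create), ("search", search)):
    print(
        f"{label}: {time_saved(fast, slow):.1%} less time, "
        f"{speedup(fast, slow) - 1:.1%} more throughput"
    )
```

Both measurements come out at roughly a quarter less wall-clock time, or equivalently about a third more throughput, so "about 30%" sits between the two ways of stating the same result.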

    On Wed, Jul 11, 2018 at 11:03 AM, Daniel Alley <[email protected]> wrote:

        So, since I've already been working on some Pulp 3
        benchmarking I decided to go ahead and benchmark this to get
        some actual data.

        Disclaimer: The following data uses bulk_create() with a
        modified, flat, non-inheriting content model, not the
        multi-table inherited content model we're currently using.
        Pulp 3 also doesn't use bulk_create() yet, but likely will
        end up using it eventually.

        Using normal IDs instead of UUIDs was between 13% and 25%
        faster with 15,000 units.  15,000 units isn't really a
        sufficient value to actually test index performance, so I'm
        rerunning it with a few hundred thousand units, but that will
        take a substantial amount of time to run.  I'll follow up later.
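
For readers who want to try this kind of comparison themselves, here is a rough sketch of the shape of such a benchmark. It is not the code used for the numbers in this thread: it swaps in the standard-library sqlite3 module and executemany() for Postgres and Django's bulk_create(), and the table name, column names and function name are hypothetical, so its timings are not comparable to the ones reported here.

```python
import sqlite3
import time
import uuid


def time_bulk_insert(pk_kind, n_rows):
    """Bulk-insert n_rows rows keyed by 'int' or 'uuid'; return (seconds, rows)."""
    conn = sqlite3.connect(":memory:")
    if pk_kind == "int":
        conn.execute("CREATE TABLE unit (id INTEGER PRIMARY KEY, relative_path TEXT)")
        # None lets SQLite assign the next integer key itself.
        rows = [(None, str(uuid.uuid4())) for _ in range(n_rows)]
    else:
        conn.execute("CREATE TABLE unit (id TEXT PRIMARY KEY, relative_path TEXT)")
        rows = [(str(uuid.uuid4()), str(uuid.uuid4())) for _ in range(n_rows)]

    start = time.perf_counter()
    with conn:  # commit everything as a single transaction
        conn.executemany("INSERT INTO unit VALUES (?, ?)", rows)
    elapsed = time.perf_counter() - start

    count = conn.execute("SELECT COUNT(*) FROM unit").fetchone()[0]
    conn.close()
    return elapsed, count


for kind in ("int", "uuid"):
    seconds, count = time_bulk_insert(kind, 15_000)
    print(f"{kind} pk: {count} rows in {seconds:.3f}s")
```

Row generation happens before the timer starts so that only the insert and index maintenance are measured.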

        As far as search/update performance goes, that probably has
        better margins than just insert performance, but I'll need to
        write new code to benchmark that properly.

        On Thu, May 24, 2018 at 11:52 AM, David Davis <[email protected]> wrote:

            Agreed on performance. Some more Googling turns up mixed
            opinions on whether UUID performance is worse or not. If
            this is a significant reason to switch, I agree we should
            test out the performance.

            Regarding the disk size, I think the cost of using UUIDs is
            cumulative. Larger PKs mean bigger index sizes, bigger
            FKs, etc. I agree that it’s probably not a major concern
            but I wouldn’t say it’s trivial.

            David

            On Thu, May 24, 2018 at 11:27 AM, Sean Myers <[email protected]> wrote:

                Responses inline.

                On 05/23/2018 02:26 PM, David Davis wrote:
                > Before the release of Pulp 3.0 GA, I think it’s worth just checking in to
                > make sure we want to use UUIDs over integer based IDs. Changing from UUIDs
                > to ints would be a very easy change at this point (1-2 lines of code) but
                > after GA ships, it would be hard if not impossible to switch.
                >
                > I think there are a number of reasons why we might want to consider integer
                > IDs:
                >
                > - Better performance all around for inserts[0], searches, indexing, etc

                I don't really care either way, but it's worth pointing out that UUIDs are
                integers (in the sense that the entire internet can be reduced to a single
                integer since it's all just bits). To the best of my knowledge they are equally
                performant to integers and stored in similar ways in Postgres.
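
The "UUIDs are integers" point can be seen directly with Python's standard uuid module. This is only an illustration of the representation and says nothing about how any particular database performs:

```python
import uuid

u = uuid.uuid4()
as_int = u.int      # the same value viewed as a plain Python integer
as_bytes = u.bytes  # the 16-byte (128-bit) binary form

print(u, as_int, len(as_bytes))

# The integer fits in 128 bits and round-trips back to the same UUID.
round_tripped = uuid.UUID(int=as_int)
```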

                You linked a MySQL experiment, done using a version of MySQL that is nearly 10
                years old. If there are concerns about the performance of UUID PKs vs. int PKs
                in Pulp, we should compare apples to apples and profile Pulp using UUID PKs,
                profile Pulp using integer PKs, and then compare the two.

                In my small-scale testing (100,000 randomly generated content rows of a
                proto-RPM content model, 1000 repositories randomly related to each, no db funny
                business beyond enforced uniqueness constraints), there was either no
                difference, or what difference there was fell into the margin of error.

                > - Less storage required (4 bytes for int vs 16 bytes for UUIDs)

                Well, okay...UUIDs are *huge* integers. But it's the length of an IPv6 address
                vs. the length of an IPv4 address. While it's true that 4 < 16, both are still
                pretty small. Trivially so, I think.

                Without taking relations into account, a table with a million rows should be a
                little less than twelve mega(mebi)bytes larger. Even at scale, the size
                difference is negligible, especially when compared to the size on disk of the
                actual content you'd need to be storing that those million rows represent.
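
The "a little less than twelve mebibytes" figure follows from the 4-byte and 16-byte key sizes quoted above. A small worked calculation, counting only the key column itself and ignoring indexes, foreign keys and row alignment (the constant names are illustrative):

```python
INT_PK_BYTES = 4    # 32-bit integer key
UUID_PK_BYTES = 16  # 128-bit UUID key
ROWS = 1_000_000

extra_bytes = (UUID_PK_BYTES - INT_PK_BYTES) * ROWS
extra_mb = extra_bytes / 10**6   # megabytes (powers of ten)
extra_mib = extra_bytes / 2**20  # mebibytes (powers of two)

print(f"{extra_bytes} bytes = {extra_mb:.2f} MB = {extra_mib:.2f} MiB")
```

That is exactly 12 MB, or about 11.44 MiB, per million rows for the primary key column alone.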

                > - Hrefs would be shorter (e.g.
                /pulp/api/v3/repositories/1/)
                > - In line with other apps like Katello

                I think these two are definitely worth considering, though.

                > There are some downsides to consider though:
                >
                > - Integer ids expose info like how many records there are

                This was the main intent, if I recall correctly. UUID PKs are not:
                - monotonically increasing
                - variably sized (string length, not bit length)

                So an object's PK doesn't give you any indication of how many other objects may
                be in the same collection, and while the Hrefs are long, for any given resource
                they will always be a predictable size.
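
To illustrate the "predictable size" point, the sketch below builds hrefs in the /pulp/api/v3/repositories/ form mentioned earlier in the thread, once with a few arbitrary integer PKs and once with random UUIDs. The sample integer values and the helper name are made up for the example.

```python
import uuid


def repository_href(pk):
    """Build an href of the form used as an example in this thread."""
    return f"/pulp/api/v3/repositories/{pk}/"


int_pks = [1, 42, 1000, 400000]  # arbitrary sample values
uuid_pks = [uuid.uuid4() for _ in int_pks]

int_href_lengths = {len(repository_href(pk)) for pk in int_pks}
uuid_href_lengths = {len(repository_href(pk)) for pk in uuid_pks}

# Integer hrefs grow with the number of digits; UUID hrefs all share one length.
print(sorted(int_href_lengths), sorted(uuid_href_lengths))
```

The canonical string form of a UUID is always 36 characters, which is why every UUID href has the same length, while an integer href also hints at roughly how many rows have been created.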

                The major downside is really that they're a pain in the butt to type out when
                compared to int PKs, so if users are in a situation where they do have to type
                these things out, I think something has gone wrong.

                If users typing in PKs can't be avoided, UUIDs probably should be avoided. I
                recognize that this is effectively a restatement of "Hrefs would be shorter" in
                the context of how that impacts the user.

                > - Can’t support sharding or multiple dbs (are we ever going to need this?)

                A very good question. To the best of my recollection this was never stated as a
                hard requirement; it was only ever mentioned like it is here, as a potential
                positive side-effect of UUID keys. If collision-avoidance is not desired, and
                will certainly never be desired, then a normal integer field would likely be a
                less astonishing[0] user experience, and therefore a better user experience.

                [0]:
                https://en.wikipedia.org/wiki/Principle_of_least_astonishment
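
The collision-avoidance property discussed above can be sketched with two independent in-memory SQLite databases standing in for two shards. This is an assumption-laden illustration, not a description of any Pulp deployment, and the table and helper names are hypothetical.

```python
import sqlite3
import uuid


def new_db():
    """Create an independent database with an auto-incrementing integer PK."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE repository (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
    )
    return conn


def insert_repo(conn, name):
    """Insert a row and return the integer key the database assigned."""
    cur = conn.execute("INSERT INTO repository (name) VALUES (?)", (name,))
    return cur.lastrowid


shard_a, shard_b = new_db(), new_db()
id_a = insert_repo(shard_a, "foo")
id_b = insert_repo(shard_b, "bar")

# Each database counts on its own, so the two rows receive the same key.
print("integer keys collide:", id_a == id_b)

# Keys generated as random UUIDs need no coordination between databases.
uuid_a, uuid_b = uuid.uuid4(), uuid.uuid4()
print("uuid keys collide:", uuid_a == uuid_b)
```

If rows from both databases ever had to be merged, the integer keys would need remapping, which is the situation UUID keys sidestep.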


                _______________________________________________
                Pulp-dev mailing list
                [email protected]
                https://www.redhat.com/mailman/listinfo/pulp-dev



