Re: [Pulp-dev] Integer IDs in Pulp 3

Dennis Kliban Wed, 11 Jul 2018 13:54:27 -0700

Thanks David. I am in favor of this change.

On Wed, Jul 11, 2018 at 4:39 PM, David Davis <[email protected]> wrote:


> There is now:
>
> https://pulp.plan.io/issues/3848
>
> David
>
>
> On Wed, Jul 11, 2018 at 4:23 PM Brian Bouterse <[email protected]>
> wrote:
>
>> A 30% improvement I think is a good case for integers over uuids.
>>
>> Is there a ticket tracking that change?
>>
>> On Wed, Jul 11, 2018 at 3:55 PM, Daniel Alley <[email protected]> wrote:
>>
>>> w/ creating 400,000 units, the non-uuid PK is 30% faster at 42.22
>>> seconds vs. 55.98 seconds.
>>>
>>> w/ searching through the same 400,000 units, performance is still about
>>> 30% faster.  Doing a filter for file content units that have a
>>> relative_path__startswith={some random letter} (I put UUIDs in all the
>>> fields) takes about 0.44 seconds if the model has a UUID pk and about 0.33
>>> seconds if the model has a default Django auto-incrementing PK.
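For anyone who wants to try a comparison of this shape without a Pulp checkout, here is a rough stand-in sketch. It is an assumption-laden substitute, not the benchmark described above: it uses Python's built-in sqlite3 instead of Django and Postgres, the table names and the helper functions (build_table, time_prefix_search, run) are made up for illustration, and the timings it prints will not match the numbers in this thread. It only shows the structure of the test: bulk-insert rows with an integer PK vs. a UUID PK, then time a prefix filter on relative_path.

```python
import sqlite3
import time
import uuid


def build_table(conn, name, use_uuid_pk, rows):
    """Create a table and bulk-insert `rows` rows; return insert time in seconds.

    `name` is an internal constant here, never user input, so formatting it
    into the SQL string is acceptable for this sketch.
    """
    pk_type = "TEXT PRIMARY KEY" if use_uuid_pk else "INTEGER PRIMARY KEY"
    conn.execute(f"CREATE TABLE {name} (id {pk_type}, relative_path TEXT)")

    # Build the data up front so only the insert itself is timed.
    if use_uuid_pk:
        data = [(str(uuid.uuid4()), uuid.uuid4().hex) for _ in range(rows)]
        sql = f"INSERT INTO {name} (id, relative_path) VALUES (?, ?)"
    else:
        # Let SQLite assign the auto-incrementing integer key.
        data = [(uuid.uuid4().hex,) for _ in range(rows)]
        sql = f"INSERT INTO {name} (relative_path) VALUES (?)"

    start = time.perf_counter()
    conn.executemany(sql, data)
    conn.commit()
    return time.perf_counter() - start


def time_prefix_search(conn, name, prefix):
    """Count rows whose relative_path starts with `prefix`; return (count, seconds)."""
    start = time.perf_counter()
    cursor = conn.execute(
        f"SELECT COUNT(*) FROM {name} WHERE relative_path LIKE ?",
        (prefix + "%",),
    )
    count = cursor.fetchone()[0]
    return count, time.perf_counter() - start


def run(rows=15_000):
    """Run the insert and search timings for both key types."""
    conn = sqlite3.connect(":memory:")
    results = {}
    for name, use_uuid_pk in (("int_pk", False), ("uuid_pk", True)):
        insert_s = build_table(conn, name, use_uuid_pk, rows)
        matches, search_s = time_prefix_search(conn, name, "a")
        results[name] = {
            "rows": rows,
            "insert_s": insert_s,
            "matches": matches,
            "search_s": search_s,
        }
    conn.close()
    return results


if __name__ == "__main__":
    for name, r in run().items():
        print(f"{name}: insert {r['insert_s']:.3f}s, search {r['search_s']:.4f}s")
```

An in-memory SQLite database stores and indexes keys very differently from Postgres, so treat any difference it shows as illustrative only; a real answer needs the Pulp models on the real database.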
>>>
>>> On Wed, Jul 11, 2018 at 11:03 AM, Daniel Alley <[email protected]>
>>> wrote:
>>>
>>>> So, since I've already been working on some Pulp 3 benchmarking I
>>>> decided to go ahead and benchmark this to get some actual data.
>>>>
>>>> Disclaimer:  The following data is using bulk_create() with a modified,
>>>> flat, non-inheriting content model, not the multi-table inherited content
>>>> model we're currently using.  bulk_create() itself is also something we
>>>> are not currently using in Pulp 3, but likely will end up using
>>>> eventually.
>>>>
>>>> Using normal IDs instead of UUIDs was between 13% and 25% faster with
>>>> 15,000 units.  15,000 units isn't really a sufficient value to actually
>>>> test index performance, so I'm rerunning it with a few hundred thousand
>>>> units, but that will take a substantial amount of time to run.  I'll follow
>>>> up later.
>>>>
>>>> As far as search/update performance goes, that probably has better
>>>> margins than just insert performance, but I'll need to write new code to
>>>> benchmark that properly.
>>>>
>>>> On Thu, May 24, 2018 at 11:52 AM, David Davis <[email protected]>
>>>> wrote:
>>>>
>>>>> Agreed on performance. Some more Googling turns up mixed opinions on
>>>>> whether UUID performance is worse or not. If this is a significant
>>>>> reason to switch, I agree we should test out the performance.
>>>>>
>>>>> Regarding the disk size, I think the cost of using UUIDs is cumulative.
>>>>> Larger PKs mean bigger index sizes, bigger FKs, etc. I agree that it’s
>>>>> probably not a major concern but I wouldn’t say it’s trivial.
>>>>>
>>>>> David
>>>>>
>>>>> On Thu, May 24, 2018 at 11:27 AM, Sean Myers <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Responses inline.
>>>>>>
>>>>>> On 05/23/2018 02:26 PM, David Davis wrote:
>>>>>> > Before the release of Pulp 3.0 GA, I think it’s worth just checking in
>>>>>> > to make sure we want to use UUIDs over integer-based IDs. Changing
>>>>>> > from UUIDs to ints would be a very easy change at this point (1-2
>>>>>> > lines of code) but after GA ships, it would be hard if not impossible
>>>>>> > to switch.
>>>>>> >
>>>>>> > I think there are a number of reasons why we might want to consider
>>>>>> > integer IDs:
>>>>>> >
>>>>>> > - Better performance all around for inserts[0], searches, indexing, etc
>>>>>>
>>>>>> I don't really care either way, but it's worth pointing out that UUIDs
>>>>>> are integers (in the sense that the entire internet can be reduced to a
>>>>>> single integer since it's all just bits). To the best of my knowledge
>>>>>> they are equally performant to integers and stored in similar ways in
>>>>>> Postgres.
>>>>>>
>>>>>> You linked a MySQL experiment, done using a version of MySQL that is
>>>>>> nearly 10 years old. If there are concerns about the performance of UUID
>>>>>> PKs vs. int PKs in Pulp, we should compare apples to apples and profile
>>>>>> Pulp using UUID PKs, profile Pulp using integer PKs, and then compare
>>>>>> the two.
>>>>>>
>>>>>> In my small-scale testing (100,000 randomly generated content rows of a
>>>>>> proto-RPM content model, 1000 repositories randomly related to each, no
>>>>>> db funny business beyond enforced uniqueness constraints), there was
>>>>>> either no difference, or what difference there was fell into the margin
>>>>>> of error.
>>>>>>
>>>>>> > - Less storage required (4 bytes for int vs 16 bytes for UUIDs)
>>>>>>
>>>>>> Well, okay...UUIDs are *huge* integers. But it's the length of an IPv6
>>>>>> address vs. the length of an IPv4 address. While it's true that 4 < 16,
>>>>>> both are still pretty small. Trivially so, I think.
>>>>>>
>>>>>> Without taking relations into account, a table with a million rows
>>>>>> should be a little less than twelve mega(mebi)bytes larger. Even at
>>>>>> scale, the size difference is negligible, especially when compared to
>>>>>> the size on disk of the actual content you'd need to be storing that
>>>>>> those million rows represent.
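The "little less than twelve mebibytes" figure can be checked with a few lines of arithmetic. This is only a sketch of the raw key-size difference, using the 4-byte and 16-byte sizes quoted in the thread; the function names are made up for illustration, and it ignores indexes, foreign keys, and row overhead, so it is a lower bound rather than a measurement.

```python
INT_PK_BYTES = 4    # size of an int PK, as quoted in the thread
UUID_PK_BYTES = 16  # size of a UUID PK, as quoted in the thread


def extra_pk_bytes(rows, int_bytes=INT_PK_BYTES, uuid_bytes=UUID_PK_BYTES):
    """Extra bytes used by UUID keys over int keys, counting the PK column only."""
    return rows * (uuid_bytes - int_bytes)


def to_mebibytes(num_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    extra = extra_pk_bytes(1_000_000)
    print(f"{extra} bytes = {to_mebibytes(extra):.2f} MiB")
    # prints "12000000 bytes = 11.44 MiB"
```

Each foreign key or index that repeats the PK adds roughly the same 12 bytes per entry again, which is the cumulative effect raised elsewhere in the thread.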
>>>>>>
>>>>>> > - Hrefs would be shorter (e.g. /pulp/api/v3/repositories/1/)
>>>>>> > - In line with other apps like Katello
>>>>>>
>>>>>> I think these two are definitely worth considering, though.
>>>>>>
>>>>>> > There are some downsides to consider though:
>>>>>> >
>>>>>> > - Integer ids expose info like how many records there are
>>>>>>
>>>>>> This was the main intent, if I recall correctly. UUID PKs are not:
>>>>>> - monotonically increasing
>>>>>> - variably sized (string length, not bit length)
>>>>>>
>>>>>> So an object's PK doesn't give you any indication of how many other
>>>>>> objects may be in the same collection, and while the Hrefs are long, for
>>>>>> any given resource they will always be a predictable size.
>>>>>>
>>>>>> The major downside is really that they're a pain in the butt to type out
>>>>>> when compared to int PKs, so if users are in a situation where they do
>>>>>> have to type these things out, I think something has gone wrong.
>>>>>>
>>>>>> If users typing in PKs can't be avoided, UUIDs probably should be
>>>>>> avoided. I recognize that this is effectively a restatement of "Hrefs
>>>>>> would be shorter" in the context of how that impacts the user.
>>>>>>
>>>>>> > - Can’t support sharding or multiple dbs (are we ever going to need
>>>>>> > this?)
>>>>>>
>>>>>> A very good question. To the best of my recollection this was never
>>>>>> stated as a hard requirement; it was only ever mentioned like it is
>>>>>> here, as a potential positive side-effect of UUID keys. If
>>>>>> collision-avoidance is not desired, and will certainly never be desired,
>>>>>> then a normal integer field would likely be a less astonishing[0] user
>>>>>> experience, and therefore a better user experience.
>>>>>>
>>>>>> [0]: https://en.wikipedia.org/wiki/Principle_of_least_astonishment
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Pulp-dev mailing list
>>>>>> [email protected]
>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>
>>>>>>

