Re: [Pulp-dev] Integer IDs in Pulp 3

David Davis Wed, 11 Jul 2018 13:21:58 -0700

I actually started working on converting IDs from UUIDs to integer IDs. It
was pretty easy with one exception. Jobs in rq/redis are created using task
id[0] and this job id needs to be a uuid. I see two possible solutions:


1. We leave task id as a UUID but every other id is an integer
2. We add a job uuid field on task
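
A minimal sketch of what option 2 could look like, using plain Python rather than Django so it runs on its own. The `Task` class, the `job_id` field name, the `enqueue` helper, and the counter standing in for a database sequence are all hypothetical illustrations, not Pulp's actual model or tasking code:

```python
import itertools
import uuid
from dataclasses import dataclass, field

# Hypothetical stand-in for a database sequence that hands out integer PKs.
_next_pk = itertools.count(1)


@dataclass
class Task:
    """Illustrative task record; not Pulp's actual Task model."""

    # Integer primary key, as an auto-incrementing column would assign it.
    pk: int = field(default_factory=lambda: next(_next_pk))
    # Separate UUID kept only so the rq job can be given a UUID id (option 2).
    job_id: uuid.UUID = field(default_factory=uuid.uuid4)


def enqueue(task: Task) -> str:
    """Return the string that would be handed to the queue as the job id."""
    return str(task.job_id)


first, second = Task(), Task()
print(first.pk, second.pk)  # integer PKs, used in hrefs and foreign keys
print(enqueue(first))       # UUID string, used only for the queue job
```

The trade-off is one extra column on the task table in exchange for keeping every primary key the same type.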

Given the hard numbers below showing that integer IDs are significantly
faster, I think we should proceed unless anyone has a major objection.
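
For reference, "30% faster" can be read two ways. The sketch below only redoes the arithmetic on the timings quoted further down (42.22 s vs. 55.98 s for creation, 0.33 s vs. 0.44 s for the filter); the function names are made up for illustration:

```python
def percent_less_time(fast: float, slow: float) -> float:
    """How much less wall-clock time the faster run took, in percent."""
    return (slow - fast) / slow * 100


def percent_more_throughput(fast: float, slow: float) -> float:
    """How many more units per second the faster run handled, in percent."""
    return (slow / fast - 1) * 100


# Timings in seconds, copied from the quoted benchmark: (integer PK, UUID PK).
timings = {
    "create 400,000 units": (42.22, 55.98),
    "filter 400,000 units": (0.33, 0.44),
}

for label, (int_pk, uuid_pk) in timings.items():
    print(
        f"{label}: {percent_less_time(int_pk, uuid_pk):.1f}% less time, "
        f"{percent_more_throughput(int_pk, uuid_pk):.1f}% more throughput"
    )
```

Either way the integer PK comes out ahead by roughly a quarter to a third on both operations.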

Great work on this btw.

[0]
https://github.com/pulp/pulp/blob/9bfc50d90a24c9d0ac4a93f5718187515b947058/pulpcore/pulpcore/tasking/tasks.py#L187

David


On Wed, Jul 11, 2018 at 3:56 PM Daniel Alley <[email protected]> wrote:

> When creating 400,000 units, the non-UUID PK is about 30% faster, at 42.22
> seconds vs. 55.98 seconds.
>
> When searching through the same 400,000 units, performance is still about
> 30% faster. Filtering for file content units that have a
> relative_path__startswith={some random letter} (I put UUIDs in all the
> fields) takes about 0.44 seconds if the model has a UUID PK and about 0.33
> seconds if the model has a default Django auto-incrementing PK.
>
> On Wed, Jul 11, 2018 at 11:03 AM, Daniel Alley <[email protected]> wrote:
>
>> So, since I've already been working on some Pulp 3 benchmarking I decided
>> to go ahead and benchmark this to get some actual data.
>>
>> Disclaimer: The following data comes from bulk_create() on a modified,
>> flat, non-inheriting content model, not the multi-table inherited content
>> model we currently use. Pulp 3 also doesn't use bulk_create() yet, though
>> it likely will eventually.
>>
>> Using normal IDs instead of UUIDs was between 13% and 25% faster with
>> 15,000 units. 15,000 units isn't really enough to test index performance,
>> so I'm rerunning it with a few hundred thousand units, but that will take
>> a substantial amount of time to run. I'll follow up later.
>>
>> As far as search/update performance goes, that probably shows a bigger
>> margin than insert performance alone, but I'll need to write new code to
>> benchmark that properly.
>>
>> On Thu, May 24, 2018 at 11:52 AM, David Davis <[email protected]>
>> wrote:
>>
>>> Agreed on performance. Some more Googling turns up mixed opinions on
>>> whether UUID performance is worse or not. If this is a significant
>>> reason to switch, I agree we should test out the performance.
>>>
>>> Regarding the disk size, I think the cost of using UUIDs is cumulative.
>>> Larger PKs mean bigger index sizes, bigger FKs, etc. I agree that it’s
>>> probably not a major concern but I wouldn’t say it’s trivial.
>>>
>>> David
>>>
>>> On Thu, May 24, 2018 at 11:27 AM, Sean Myers <[email protected]>
>>> wrote:
>>>
>>>> Responses inline.
>>>>
>>>> On 05/23/2018 02:26 PM, David Davis wrote:
>>>> > Before the release of Pulp 3.0 GA, I think it’s worth just checking in
>>>> > to make sure we want to use UUIDs over integer based IDs. Changing
>>>> > from UUIDs to ints would be a very easy change at this point (1-2
>>>> > lines of code) but after GA ships, it would be hard if not impossible
>>>> > to switch.
>>>> >
>>>> > I think there are a number of reasons why we might want to consider
>>>> > integer IDs:
>>>> >
>>>> > - Better performance all around for inserts[0], searches, indexing, etc
>>>>
>>>> I don't really care either way, but it's worth pointing out that UUIDs
>>>> are integers (in the sense that the entire internet can be reduced to a
>>>> single integer since it's all just bits). To the best of my knowledge
>>>> they are equally performant to integers and stored in similar ways in
>>>> Postgres.
>>>>
>>>> You linked a MySQL experiment, done using a version of MySQL that is
>>>> nearly 10 years old. If there are concerns about the performance of UUID
>>>> PKs vs. int PKs in Pulp, we should compare apples to apples and profile
>>>> Pulp using UUID PKs, profile Pulp using integer PKs, and then compare
>>>> the two.
>>>>
>>>> In my small-scale testing (100,000 randomly generated content rows of a
>>>> proto-RPM content model, 1000 repositories randomly related to each, no
>>>> db funny business beyond enforced uniqueness constraints), there was
>>>> either no difference, or what difference there was fell into the margin
>>>> of error.
>>>>
>>>> > - Less storage required (4 bytes for int vs 16 bytes for UUIDs)
>>>>
>>>> Well, okay...UUIDs are *huge* integers. But it's the length of an IPv6
>>>> address vs. the length of an IPv4 address. While it's true that 4 < 16,
>>>> both are still pretty small. Trivially so, I think.
>>>>
>>>> Without taking relations into account, a table with a million rows
>>>> should be a little less than twelve mega(mebi)bytes larger. Even at
>>>> scale, the size difference is negligible, especially when compared to
>>>> the size on disk of the actual content you'd need to be storing that
>>>> those million rows represent.
>>>>
>>>> > - Hrefs would be shorter (e.g. /pulp/api/v3/repositories/1/)
>>>> > - In line with other apps like Katello
>>>>
>>>> I think these two are definitely worth considering, though.
>>>>
>>>> > There are some downsides to consider though:
>>>> >
>>>> > - Integer ids expose info like how many records there are
>>>>
>>>> This was the main intent, if I recall correctly. UUID PKs are not:
>>>> - monotonically increasing
>>>> - variably sized (string length, not bit length)
>>>>
>>>> So an object's PK doesn't give you any indication of how many other
>>>> objects may be in the same collection, and while the Hrefs are long, for
>>>> any given resource they will always be a predictable size.
>>>>
>>>> The major downside is really that they're a pain in the butt to type out
>>>> when compared to int PKs, so if users are in a situation where they do
>>>> have to type these things out, I think something has gone wrong.
>>>>
>>>> If users typing in PKs can't be avoided, UUIDs probably should be
>>>> avoided. I recognize that this is effectively a restatement of "Hrefs
>>>> would be shorter" in the context of how that impacts the user.
>>>>
>>>> > - Can’t support sharding or multiple dbs (are we ever going to need
>>>> > this?)
>>>>
>>>> A very good question. To the best of my recollection this was never
>>>> stated as a hard requirement; it was only ever mentioned like it is
>>>> here, as a potential positive side-effect of UUID keys. If
>>>> collision-avoidance is not desired, and will certainly never be desired,
>>>> then a normal integer field would likely be a less astonishing[0] user
>>>> experience, and therefore a better user experience.
>>>>
>>>> [0]: https://en.wikipedia.org/wiki/Principle_of_least_astonishment
>>>>
>>>>
>>>> _______________________________________________
>>>> Pulp-dev mailing list
>>>> [email protected]
>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>
>>>>
>>>
>>
>
