Thanks David. I am in favor of this change.

On Wed, Jul 11, 2018 at 4:39 PM, David Davis <[email protected]> wrote:
> There is now:
>
> https://pulp.plan.io/issues/3848
>
> David
>
>
> On Wed, Jul 11, 2018 at 4:23 PM Brian Bouterse <[email protected]> wrote:
>
>> A 30% improvement I think is a good case for integers over uuids.
>>
>> Is there a ticket tracking that change?
>>
>> On Wed, Jul 11, 2018 at 3:55 PM, Daniel Alley <[email protected]> wrote:
>>
>>> w/ creating 400,000 units, the non-uuid PK is 30% faster at 42.22
>>> seconds vs. 55.98 seconds.
>>>
>>> w/ searching through the same 400,000 units, performance is still about
>>> 30% faster. Doing a filter for file content units that have a
>>> relative_path__startswith={some random letter} (I put UUIDs in all the
>>> fields) takes about 0.44 seconds if the model has a UUID pk and about 0.33
>>> seconds if the model has a default Django auto-incrementing PK.
>>>
>>> On Wed, Jul 11, 2018 at 11:03 AM, Daniel Alley <[email protected]> wrote:
>>>
>>>> So, since I've already been working on some Pulp 3 benchmarking I
>>>> decided to go ahead and benchmark this to get some actual data.
>>>>
>>>> Disclaimer: The following data is using bulk_create() with a modified,
>>>> flat, non-inheriting content model, not the current multi-table inherited
>>>> content model we're currently using. It's also using bulk_create() which
>>>> we are not currently using in Pulp 3, but likely will end up using
>>>> eventually.
>>>>
>>>> Using normal IDs instead of UUIDs was between 13% and 25% faster with
>>>> 15,000 units. 15,000 units isn't really a sufficient value to actually
>>>> test index performance, so I'm rerunning it with a few hundred thousand
>>>> units, but that will take a substantial amount of time to run. I'll follow
>>>> up later.
>>>>
>>>> As far as search/update performance goes, that probably has better
>>>> margins than just insert performance, but I'll need to write new code to
>>>> benchmark that properly.
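
A quick check of the arithmetic behind the "30% faster" figure, using the 400,000-unit timings quoted above. The number depends on which run is taken as the baseline, so this sketch prints both readings. The helper names are made up for illustration and are not part of Pulp.

```python
def time_saved(fast: float, slow: float) -> float:
    """Fraction of the slower run's wall time that the faster run saves."""
    return (slow - fast) / slow


def speedup(fast: float, slow: float) -> float:
    """Extra throughput of the faster run relative to the slower one."""
    return slow / fast - 1.0


# Timings in seconds as quoted in the thread: integer PK vs. UUID PK.
create_int, create_uuid = 42.22, 55.98
search_int, search_uuid = 0.33, 0.44

for label, fast, slow in [
    ("create", create_int, create_uuid),
    ("search", search_int, search_uuid),
]:
    print(f"{label}: {time_saved(fast, slow):.1%} less time, "
          f"{speedup(fast, slow):.1%} more throughput")
```

Read either way, "about 30%" is a fair summary: roughly a quarter less wall time, or roughly a third more throughput.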
>>>>
>>>> On Thu, May 24, 2018 at 11:52 AM, David Davis <[email protected]> wrote:
>>>>
>>>>> Agreed on performance. Some more Googling turns up mixed opinions on
>>>>> whether UUID performance is worse or not. If this is a significant
>>>>> reason to switch, I agree we should test out the performance.
>>>>>
>>>>> Regarding the disk size, I think the cost of using UUIDs is cumulative.
>>>>> Larger PKs mean bigger index sizes, bigger FKs, etc. I agree that it’s
>>>>> probably not a major concern but I wouldn’t say it’s trivial.
>>>>>
>>>>> David
>>>>>
>>>>> On Thu, May 24, 2018 at 11:27 AM, Sean Myers <[email protected]> wrote:
>>>>>
>>>>>> Responses inline.
>>>>>>
>>>>>> On 05/23/2018 02:26 PM, David Davis wrote:
>>>>>> > Before the release of Pulp 3.0 GA, I think it’s worth just checking in to
>>>>>> > make sure we want to use UUIDs over integer based IDs. Changing from UUIDs
>>>>>> > to ints would be a very easy change at this point (1-2 lines of code) but
>>>>>> > after GA ships, it would be hard if not impossible to switch.
>>>>>> >
>>>>>> > I think there are a number of reasons why we might want to consider integer
>>>>>> > IDs:
>>>>>> >
>>>>>> > - Better performance all around for inserts[0], searches, indexing, etc
>>>>>>
>>>>>> I don't really care either way, but it's worth pointing out that UUIDs are
>>>>>> integers (in the sense that the entire internet can be reduced to a single
>>>>>> integer since it's all just bits). To the best of my knowledge they are
>>>>>> equally performant to integers and stored in similar ways in Postgres.
>>>>>>
>>>>>> You linked a MySQL experiment, done using a version of MySQL that is nearly
>>>>>> 10 years old. If there are concerns about the performance of UUID PKs vs.
>>>>>> int PKs in Pulp, we should compare apples to apples and profile Pulp using
>>>>>> UUID PKs, profile Pulp using integer PKs, and then compare the two.
>>>>>>
>>>>>> In my small-scale testing (100,000 randomly generated content rows of a
>>>>>> proto-RPM content model, 1000 repositories randomly related to each, no db
>>>>>> funny business beyond enforced uniqueness constraints), there was either no
>>>>>> difference, or what difference there was fell into the margin of error.
>>>>>>
>>>>>> > - Less storage required (4 bytes for int vs 16 bytes for UUIDs)
>>>>>>
>>>>>> Well, okay...UUIDs are *huge* integers. But it's the length of an IPv6
>>>>>> address vs. the length of an IPv4 address. While it's true that 4 < 16,
>>>>>> both are still pretty small. Trivially so, I think.
>>>>>>
>>>>>> Without taking relations into account, a table with a million rows should
>>>>>> be a little less than twelve mega(mebi)bytes larger. Even at scale, the
>>>>>> size difference is negligible, especially when compared to the size on
>>>>>> disk of the actual content you'd need to be storing that those million
>>>>>> rows represent.
>>>>>>
>>>>>> > - Hrefs would be shorter (e.g. /pulp/api/v3/repositories/1/)
>>>>>> > - In line with other apps like Katello
>>>>>>
>>>>>> I think these two are definitely worth considering, though.
>>>>>>
>>>>>> > There are some downsides to consider though:
>>>>>> >
>>>>>> > - Integer ids expose info like how many records there are
>>>>>>
>>>>>> This was the main intent, if I recall correctly. UUID PKs are not:
>>>>>> - monotonically increasing
>>>>>> - variably sized (string length, not bit length)
>>>>>>
>>>>>> So an object's PK doesn't give you any indication of how many other objects
>>>>>> may be in the same collection, and while the Hrefs are long, for any given
>>>>>> resource they will always be a predictable size.
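
The "a little less than twelve mebibytes" estimate quoted above can be checked directly. This is a minimal sketch that assumes a 4-byte integer key and a 16-byte UUID key, and, like the estimate, ignores relations, index overhead, and row alignment. The constant and function names are illustrative only.

```python
INT_PK_BYTES = 4    # 32-bit integer primary key
UUID_PK_BYTES = 16  # 128-bit UUID primary key
MIB = 1024 ** 2


def extra_pk_bytes(rows: int) -> int:
    """Extra bytes spent on the key column alone when using UUIDs."""
    return rows * (UUID_PK_BYTES - INT_PK_BYTES)


extra = extra_pk_bytes(1_000_000)
print(f"{extra:,} bytes = {extra / MIB:.2f} MiB")
# prints "12,000,000 bytes = 11.44 MiB"
```

Each foreign key or index that stores the key would add a comparable amount again, which is the cumulative effect mentioned earlier in the thread.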
>>>>>>
>>>>>> The major downside is really that they're a pain in the butt to type out
>>>>>> when compared to int PKs, so if users are in a situation where they do have
>>>>>> to type these things out, I think something has gone wrong.
>>>>>>
>>>>>> If users typing in PKs can't be avoided, UUIDs probably should be avoided.
>>>>>> I recognize that this is effectively a restatement of "Hrefs would be
>>>>>> shorter" in the context of how that impacts the user.
>>>>>>
>>>>>> > - Can’t support sharding or multiple dbs (are we ever going to need this?)
>>>>>>
>>>>>> A very good question. To the best of my recollection this was never stated
>>>>>> as a hard requirement; it was only ever mentioned like it is here, as a
>>>>>> potential positive side-effect of UUID keys. If collision-avoidance is not
>>>>>> desired, and will certainly never be desired, then a normal integer field
>>>>>> would likely be a less astonishing[0] user experience, and therefore a
>>>>>> better user experience.
>>>>>>
>>>>>> [0]: https://en.wikipedia.org/wiki/Principle_of_least_astonishment
>>>>>>
>>>>>> _______________________________________________
>>>>>> Pulp-dev mailing list
>>>>>> [email protected]
>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
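
For anyone who wants to see the shape of an integer-vs-UUID insert comparison without a Pulp checkout, here is a rough sketch. It is only a stand-in: it uses Python's built-in sqlite3 with an in-memory database rather than Pulp's models on Postgres, so its timings say nothing about Pulp itself. The table name, column names, and function name are hypothetical.

```python
import sqlite3
import time
import uuid


def time_inserts(pk_kind: str, rows: int = 20_000) -> float:
    """Time a bulk insert into an in-memory table keyed by pk_kind.

    SQLite is only a stand-in here; this is not Pulp's schema or database.
    """
    conn = sqlite3.connect(":memory:")
    if pk_kind == "int":
        conn.execute(
            "CREATE TABLE unit (id INTEGER PRIMARY KEY, relative_path TEXT)"
        )
        # A NULL id lets SQLite assign the next integer key itself.
        data = [(None, uuid.uuid4().hex) for _ in range(rows)]
    elif pk_kind == "uuid":
        conn.execute(
            "CREATE TABLE unit (id BLOB PRIMARY KEY, relative_path TEXT)"
        )
        # 16-byte random UUID keys, generated before the timer starts.
        data = [(uuid.uuid4().bytes, uuid.uuid4().hex) for _ in range(rows)]
    else:
        raise ValueError(pk_kind)

    start = time.perf_counter()
    with conn:  # commit once at the end of the batch
        conn.executemany("INSERT INTO unit VALUES (?, ?)", data)
    elapsed = time.perf_counter() - start

    count = conn.execute("SELECT COUNT(*) FROM unit").fetchone()[0]
    conn.close()
    if count != rows:
        raise RuntimeError(f"expected {rows} rows, found {count}")
    return elapsed


if __name__ == "__main__":
    for kind in ("int", "uuid"):
        print(f"{kind}: {time_inserts(kind):.3f}s")
```

An apples-to-apples answer still requires profiling Pulp with each key type on the real database, as suggested earlier in the thread.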
