I actually started working on converting IDs from UUIDs to integer
IDs. It was pretty easy with one exception. Jobs in rq/redis are
created using task id[0] and this job id needs to be a uuid. I see two
possible solutions:
1. We leave task id as a UUID but every other id is an integer
2. We add a job uuid field on task
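To make option 2 concrete, here is a minimal sketch in plain Python rather than a real Django model. The names Task, job_id, and enqueue_kwargs are hypothetical and are not taken from the Pulp code; the only point is that the integer primary key and the UUID handed to rq can live side by side.

```python
import itertools
import uuid
from dataclasses import dataclass, field

# Stand-in for a database sequence; a real model would get its
# integer pk from the database instead.
_next_id = itertools.count(1)


@dataclass
class Task:
    # Integer primary key (hypothetical stand-in for an auto-incrementing pk).
    id: int = field(default_factory=lambda: next(_next_id))
    # Separate UUID used only as the rq/redis job id (hypothetical field name).
    job_id: uuid.UUID = field(default_factory=uuid.uuid4)


def enqueue_kwargs(task):
    """Build the keyword arguments one might pass when enqueuing the job."""
    return {"job_id": str(task.job_id)}


task = Task()
print(task.id, enqueue_kwargs(task))
```

With this shape, hrefs and foreign keys would use the integer id, and only the tasking code would need to know about job_id.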
With the hard numbers that show that integer IDs are significantly
faster, I think we should proceed unless anyone has a major objection.
Great work on this btw.
[0]
https://github.com/pulp/pulp/blob/9bfc50d90a24c9d0ac4a93f5718187515b947058/pulpcore/pulpcore/tasking/tasks.py#L187
David
On Wed, Jul 11, 2018 at 3:56 PM Daniel Alley <dal...@redhat.com
<mailto:dal...@redhat.com>> wrote:
w/ creating 400,000 units, the non-uuid PK is 30% faster at 42.22
seconds vs. 55.98 seconds.
w/ searching through the same 400,000 units, performance is still
about 30% faster. Doing a filter for file content units that have
a relative_path__startswith={some random letter} (I put UUIDs in
all the fields) takes about 0.44 seconds if the model has a UUID
pk and about 0.33 seconds if the model has a default Django
auto-incrementing PK.
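For anyone who wants to try a comparison of the same shape locally, below is a rough sketch. It is not the benchmark used for the numbers above: it assumes an in-memory SQLite database instead of Postgres and Django, a made-up two-column content table, and a much smaller row count, so its timings are not comparable to the figures in this thread.

```python
import sqlite3
import time
import uuid


def benchmark(use_uuid_pk, n_rows=20000):
    """Time bulk inserts and a prefix search for one primary key style.

    Returns (insert_seconds, search_seconds, total_rows, matched_rows).
    """
    conn = sqlite3.connect(":memory:")
    pk_type = "TEXT PRIMARY KEY" if use_uuid_pk else "INTEGER PRIMARY KEY"
    conn.execute(f"CREATE TABLE content (id {pk_type}, relative_path TEXT)")

    # Build the rows up front so only database work is timed.
    rows = []
    for i in range(n_rows):
        pk = str(uuid.uuid4()) if use_uuid_pk else i + 1
        rows.append((pk, str(uuid.uuid4())))

    start = time.perf_counter()
    conn.executemany("INSERT INTO content VALUES (?, ?)", rows)
    conn.commit()
    insert_seconds = time.perf_counter() - start

    # Rough equivalent of relative_path__startswith="a".
    start = time.perf_counter()
    matched = conn.execute(
        "SELECT COUNT(*) FROM content WHERE relative_path LIKE 'a%'"
    ).fetchone()[0]
    search_seconds = time.perf_counter() - start

    total = conn.execute("SELECT COUNT(*) FROM content").fetchone()[0]
    conn.close()
    return insert_seconds, search_seconds, total, matched


for use_uuid in (False, True):
    ins, search, total, matched = benchmark(use_uuid)
    label = "uuid pk" if use_uuid else "int pk"
    print(f"{label}: insert {ins:.3f}s, search {search:.3f}s, "
          f"{matched}/{total} rows matched")
```

Results will vary by machine and database; the real comparison still needs to be run against Pulp's actual models.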
On Wed, Jul 11, 2018 at 11:03 AM, Daniel Alley <dal...@redhat.com
<mailto:dal...@redhat.com>> wrote:
So, since I've already been working on some Pulp 3
benchmarking I decided to go ahead and benchmark this to get
some actual data.
Disclaimer: The following data comes from bulk_create() with a
modified, flat, non-inheriting content model, not the
multi-table inherited content model we currently use.
Pulp 3 also does not use bulk_create() today, though it
likely will end up using it eventually.
Using normal IDs instead of UUIDs was between 13% and 25%
faster with 15,000 units. 15,000 units isn't really a
sufficient value to actually test index performance, so I'm
rerunning it with a few hundred thousand units, but that will
take a substantial amount of time to run. I'll follow up later.
As far as search/update performance goes, that probably has
better margins than just insert performance, but I'll need to
write new code to benchmark that properly.
On Thu, May 24, 2018 at 11:52 AM, David Davis
<davidda...@redhat.com <mailto:davidda...@redhat.com>> wrote:
Agreed on performance. Some more Googling turns up
mixed opinions on whether UUID performance is worse
or not. If this is a significant reason to switch, I agree
we should test out the performance.
Regarding the disk size, I think the cost of using UUIDs is
cumulative. Larger PKs mean bigger index sizes, bigger
FKs, etc. I agree that it’s probably not a major concern
but I wouldn’t say it’s trivial.
David
On Thu, May 24, 2018 at 11:27 AM, Sean Myers
<sean.my...@redhat.com <mailto:sean.my...@redhat.com>> wrote:
Responses inline.
On 05/23/2018 02:26 PM, David Davis wrote:
> Before the release of Pulp 3.0 GA, I think it’s worth just checking in to
> make sure we want to use UUIDs over integer based IDs. Changing from UUIDs
> to ints would be a very easy change at this point (1-2 lines of code) but
> after GA ships, it would be hard if not impossible to switch.
>
> I think there are a number of reasons why we might want to consider integer
> IDs:
>
> - Better performance all around for inserts[0], searches, indexing, etc
I don't really care either way, but it's worth
pointing out that UUIDs are
integers (in the sense that the entire internet can be
reduced to a single
integer since it's all just bits). To the best of my
knowledge they are equally
performant to integers and stored in similar ways in
Postgres.
You linked a MySQL experiment, done using a version of
MySQL that is nearly 10
years old. If there are concerns about the performance
of UUID PKs vs. int PKs
in Pulp, we should compare apples to apples and
profile Pulp using UUID PKs,
profile Pulp using integer PKs, and then compare the two.
In my small-scale testing (100,000 randomly generated
content rows of a
proto-RPM content model, 1000 repositories randomly
related to each, no db funny
business beyond enforced uniqueness constraints),
there was either no
difference, or what difference there was fell into the
margin of error.
> - Less storage required (4 bytes for int vs 16 bytes for UUIDs)
Well, okay...UUIDs are *huge* integers. But it's the
length of an IPv6 address
vs. the length of an IPv4 address. While it's true
that 4 < 16, both are still
pretty small. Trivially so, I think.
Without taking relations into account, a table with a
million rows should be a
little less than twelve mega(mebi)bytes larger. Even
at scale, the size
difference is negligible, especially when compared to
the size on disk of the
actual content you'd need to be storing that those
million rows represent.
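The "a little less than twelve" figure can be checked with a few lines of arithmetic. This sketch only counts the raw primary key column, using the 4-byte and 16-byte sizes quoted above; it deliberately ignores indexes, foreign keys, and per-row overhead.

```python
ROWS = 1_000_000
INT_PK_BYTES = 4    # size quoted above for an int PK
UUID_PK_BYTES = 16  # size quoted above for a UUID PK

# Extra bytes needed for the PK column alone when switching int -> UUID.
extra_bytes = ROWS * (UUID_PK_BYTES - INT_PK_BYTES)
extra_mb = extra_bytes / 10**6    # megabytes (decimal)
extra_mib = extra_bytes / 2**20   # mebibytes (binary)

print(f"{extra_bytes} bytes = {extra_mb:.2f} MB = {extra_mib:.2f} MiB")
# prints "12000000 bytes = 12.00 MB = 11.44 MiB"
```

So the difference is exactly 12 MB, or about 11.44 MiB, per million rows for the PK column by itself.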
> - Hrefs would be shorter (e.g. /pulp/api/v3/repositories/1/)
> - In line with other apps like Katello
I think these two are definitely worth considering,
though.
> There are some downsides to consider though:
>
> - Integer ids expose info like how many records there are
This was the main intent, if I recall correctly. UUID
PKs are not:
- monotonically increasing
- variably sized (string length, not bit length)
So an object's PK doesn't give you any indication of
how many other objects may
be in the same collection, and while the Hrefs are
long, for any given resource
they will always be a predictable size.
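A small illustration of the size point, using only Python's standard uuid module. The href helper is hypothetical; its path is borrowed from the example href earlier in this thread.

```python
import uuid


def href(pk):
    """Build an example repository href for a given primary key (hypothetical helper)."""
    return f"/pulp/api/v3/repositories/{pk}/"


# UUID PKs: the canonical string form is always 36 characters,
# and the value itself is an integer of at most 128 bits.
uuid_pk = uuid.uuid4()
print(len(str(uuid_pk)))              # 36
print(uuid_pk.int.bit_length() <= 128)  # True

# Integer PKs: the string length grows with the value.
print([len(str(n)) for n in (1, 42, 400000)])  # [1, 2, 6]

print(href(1))
print(href(uuid_pk))
```

Every UUID-based href for a given resource type has the same length, while integer-based hrefs are shorter but vary with the number of digits.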
The major downside is really that they're a pain in
the butt to type out when
compared to int PKs, so if users are in a situation
where they do have to type
these things out, I think something has gone wrong.
If users typing in PKs can't be avoided, UUIDs
probably should be avoided. I
recognize that this is effectively a restatement of
"Hrefs would be shorter" in
the context of how that impacts the user.
> - Can’t support sharding or multiple dbs (are we ever going to need this?)
A very good question. To the best of my recollection
this was never stated as a
hard requirement; it was only ever mentioned like it
is here, as a potential
positive side-effect of UUID keys. If
collision-avoidance is not desired, and
will certainly never be desired, then a normal integer
field would likely be a
less astonishing[0] user experience, and therefore a
better user experience.
[0]:
https://en.wikipedia.org/wiki/Principle_of_least_astonishment
_______________________________________________
Pulp-dev mailing list
Pulp-dev@redhat.com <mailto:Pulp-dev@redhat.com>
https://www.redhat.com/mailman/listinfo/pulp-dev