Hi folks,
I've been pondering the best approach to modelling objects in my GAE
datastore that correspond to a subset of objects in an external DB whose
primary keys are UUIDs. As I expect most of my records in GAE to be quite
small, I feel it's worth avoiding the storage overhead of using the
36-character UUID string as the key_name, and I've come up with the
following to generate uint63 datastore IDs from UUIDs (Python, but the
questions are more general GAE efficiency questions):
    import uuid

    from google.appengine.api import datastore
    from google.appengine.ext import db
    from google.appengine.ext.db import BadValueError, Key, Model

    MASK_64 = 2**64 - 1
    MASK_63 = 2**63 - 1

    class UUID(uuid.UUID):
        def get_id(self):
            # XOR-fold the 128-bit UUID into 63 bits; the datastore rejects
            # negative int64 IDs, so mask off the sign bit rather than rely
            # on abs(), which never fires on a non-negative Python int.
            return ((self.int >> 64) ^ (self.int & MASK_64)) & MASK_63
        id = property(get_id)

    class UUIDModel(Model):
        @classmethod
        def get_by_uuid(cls, uuids, **kwds):
            uuids, multiple = datastore.NormalizeAndTypeCheck(uuids,
                                                              (UUID, str))
            def normalize(u):
                return UUID(u) if isinstance(u, str) else u
            uuids = [normalize(u) for u in uuids]
            ids = [u.id for u in uuids]
            entities = cls.get_by_id(ids, **kwds)
            for index, entity in enumerate(entities):
                if entity is not None and entity.uuid != uuids[index]:
                    raise BadValueError('UUID hash collision detected!')
            if multiple:
                return entities
            else:
                return entities[0]

        @classmethod
        def get_or_insert_by_uuid(cls, uuid, **kwds):
            if isinstance(uuid, str):
                uuid = UUID(uuid)
            id = uuid.id
            def txn():
                entity = cls.get_by_id(id, parent=kwds.get('parent'))
                if entity is None:
                    entity = cls(key=Key.from_path(cls.kind(), id,
                                                   parent=kwds.get('parent')),
                                 uuid=uuid,
                                 **kwds)
                    entity.put()
                elif entity.uuid != uuid:
                    raise BadValueError('UUID hash collision detected!')
                return entity
            return db.run_in_transaction(txn)

        uuid = UUIDProperty('UUID')  # custom property class, defined elsewhere
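For anyone wanting to poke at the collision-detection path without the SDK, here's a minimal in-memory sketch of the check in get_or_insert_by_uuid, using a plain dict as a stand-in for the datastore (everything here is illustrative, not GAE API). The crafted second UUID flips one bit in each 64-bit half, so it XOR-folds to the same ID:

```python
import uuid

MASK_64 = 2**64 - 1
MASK_63 = 2**63 - 1

def uuid_to_id(u):
    # Same XOR-fold as UUID.get_id() above.
    return ((u.int >> 64) ^ (u.int & MASK_64)) & MASK_63

store = {}  # stand-in for the datastore: id -> uuid

def get_or_insert(u):
    id_ = uuid_to_id(u)
    existing = store.get(id_)
    if existing is None:
        store[id_] = u
        return u
    if existing != u:
        raise ValueError('UUID hash collision detected!')
    return existing

u1 = uuid.uuid4()
# Flipping bit 0 in both the high and the low 64-bit half leaves the
# fold unchanged, so u2 deliberately collides with u1.
u2 = uuid.UUID(int=u1.int ^ (1 << 64) ^ 1)
assert uuid_to_id(u1) == uuid_to_id(u2)
```

Inserting u1 and then u1 again is fine; inserting u2 afterwards raises, which is exactly what the entity.uuid comparison guards against.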
I won't be using GAE's auto-assigned IDs for the model classes whose IDs
come from the external DB's UUIDs, so I'm not terribly worried about the
probability of ID collisions: 2**63 is still a very large number space
compared to the number of records I expect to have. My reason for using a
custom hash of the UUID into a uint63, rather than Python's built-in
hash(), is that hash() isn't guaranteed to stay consistent across Python
versions. The reason for uint63 rather than uint64 is that the datastore
classes throw an exception when a negative int64 is used as an ID. Had the
datastore supported int128 or uint127 IDs, I would have just used the
UUIDs directly. Deriving the key from the UUID lets me make direct
get_by_id() calls when I already know the UUID, rather than having to run
a filtered query on it.
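As a sanity check on the reasoning above, a quick snippet (hypothetical values, pure stdlib) shows the fold is stable across runs, stays within the positive int64 range, and that a back-of-the-envelope birthday bound keeps the collision risk tiny for record counts like mine:

```python
import uuid

MASK_64 = 2**64 - 1
MASK_63 = 2**63 - 1

def uuid_to_id(u):
    # Deterministic, unlike hash(), whose output may change between
    # Python versions (and is salted per-process in later Pythons).
    return ((u.int >> 64) ^ (u.int & MASK_64)) & MASK_63

u = uuid.UUID('12345678-1234-5678-9abc-def012345678')  # arbitrary example
assert 0 <= uuid_to_id(u) <= MASK_63                   # fits a positive int64
assert uuid_to_id(u) == uuid_to_id(uuid.UUID(str(u)))  # stable round-trip

# Birthday bound: P(any collision) ~= n * (n - 1) / (2 * 2**63).
# For a million records that's roughly 5e-8.
n = 10**6
p = n * (n - 1) / (2.0 * 2**63)
```

So even at a million entities the odds of a single fold collision are well under one in ten million, and the entity.uuid check above catches the case if it ever does happen.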
So, on to the questions. The above seems to work just fine for me in the
early prototype stages of development, but I'm wondering if there's a
downside to this technique. Will I hit any performance, space, or general
efficiency penalties with the datastore by using IDs which are essentially
randomly distributed throughout the entire 63-bit ID space? Does anything
about this strike people as a terrible idea that would justify a major
rethink of my approach? What techniques are others using when they have
externally assigned UUIDs as primary keys for some of their model classes?
--
You received this message because you are subscribed to the Google Groups
"Google App Engine" group.
To view this discussion on the web visit
https://groups.google.com/d/msg/google-appengine/-/IqEKH0ZY5BkJ.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to
[email protected].
For more options, visit this group at
http://groups.google.com/group/google-appengine?hl=en.