Hi folks,

I've been pondering the best approach to modelling objects in my GAE 
datastore that correspond to a subset of objects in an external DB whose 
primary keys are UUIDs.  Since I expect most of my GAE records to be quite 
small, I'd like to avoid the storage overhead of using the 36-character 
UUID string as the key_name, and have come up with the following to derive 
uint63 datastore IDs from UUIDs (Python, but the questions are more 
general GAE efficiency questions):

# Imports needed by the snippet below (the google.appengine ones are part
# of the GAE SDK); UUIDProperty is my own custom property, defined elsewhere.
import uuid

from google.appengine.api import datastore
from google.appengine.ext import db
from google.appengine.ext.db import BadValueError, Key, Model

MASK_63 = 2**63 - 1

class UUID(uuid.UUID):
    def get_id(self):
        # Fold the 128-bit UUID into 63 bits: XOR the two 64-bit halves,
        # then mask off the top bit so the result is always a valid
        # positive int64 ID.  (abs() can't guarantee that -- the XOR is
        # already non-negative and may still exceed 2**63 - 1.)
        return ((self.int >> 64) ^ self.int) & MASK_63

    id = property(get_id)

class UUIDModel(Model):
    @classmethod
    def get_by_uuid(cls, uuids, **kwds):
        uuids, multiple = datastore.NormalizeAndTypeCheck(
            uuids, (UUID, basestring))
        # Accept raw UUID strings (str or unicode) as well as UUID instances.
        uuids = [UUID(u) if isinstance(u, basestring) else u for u in uuids]
        ids = [uuid.id for uuid in uuids]
        entities = cls.get_by_id(ids, **kwds)
        for index, entity in enumerate(entities):
            if entity is not None and entity.uuid != uuids[index]:
                raise BadValueError('UUID hash collision detected!')
        if multiple:
            return entities
        else:
            return entities[0]

    @classmethod
    def get_or_insert_by_uuid(cls, uuid, **kwds):
        if isinstance(uuid, basestring):
            uuid = UUID(uuid)
        id = uuid.id
        def txn():
            entity = cls.get_by_id(id, parent=kwds.get('parent'))
            if entity is None:
                entity = cls(key=Key.from_path(cls.kind(), id,
                                               parent=kwds.get('parent')),
                             uuid=uuid,
                             **kwds)
                entity.put()
            elif entity.uuid != uuid:
                raise BadValueError('UUID hash collision detected!')
            return entity
        return db.run_in_transaction(txn)

    uuid = UUIDProperty('UUID')
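
As a standalone sanity check of the folding scheme (plain Python, no GAE 
SDK needed; uuid_to_id is just an illustrative name), the fold XORs the 
two 64-bit halves of the UUID and masks the result into the positive 
int64 range:

```python
import uuid

MASK_63 = 2**63 - 1  # largest positive int64, i.e. the valid datastore ID range

def uuid_to_id(u):
    """Fold a 128-bit UUID into a positive 63-bit datastore ID."""
    # Masking (rather than abs()) is what actually guarantees the result
    # fits in 63 bits, since the XOR of the halves is already non-negative.
    return ((u.int >> 64) ^ u.int) & MASK_63

# Every generated ID stays inside the datastore's positive int64 range.
for _ in range(1000):
    assert 0 <= uuid_to_id(uuid.uuid4()) <= MASK_63

# The mapping is deterministic: the same UUID always maps to the same key ID.
u = uuid.UUID('12345678-1234-5678-1234-567812345678')
print(uuid_to_id(u) == uuid_to_id(uuid.UUID(str(u))))  # True
```

(An ID of 0 would technically be an invalid datastore key ID, but for 
random UUIDs that's a 1-in-2**63 event.)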

I won't be using GAE's auto-assigned IDs for the model classes whose IDs 
come from the external DB's UUIDs, so I'm not terribly worried about the 
probability of ID collision: 2**63 is still a very large number space 
compared to the number of records I expect to have.  My reason for using 
a custom hash of the UUID rather than Python's built-in hash() is that 
hash() isn't guaranteed to remain consistent across Python versions.  The 
reason for uint63 rather than uint64 is that the datastore classes throw 
an exception on negative int64s used as IDs.  Had the datastore supported 
int128 or uint127 IDs, I would have used the UUIDs directly.  Deriving 
the key from the UUID lets me make direct get_by_id() calls when I 
already know the UUID, rather than having to run a filtered query on it.
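
To put a number on that intuition, the birthday approximation gives a 
collision probability of roughly n*(n-1)/2**64 for n records hashed 
uniformly into a 63-bit space.  A quick back-of-the-envelope check (plain 
Python, no GAE dependencies; collision_probability is just an 
illustrative name):

```python
import math

def collision_probability(n, bits=63):
    """Birthday approximation: P(collision) ~= 1 - exp(-n*(n-1) / 2**(bits+1))."""
    return 1.0 - math.exp(-n * (n - 1) / 2.0 ** (bits + 1))

# Even at ten million records the odds are roughly 1 in 180,000.
print('%.2e' % collision_probability(10 ** 7))  # ~5.42e-06
```

So at the record counts I'm expecting, the size of the hash space really 
shouldn't be the weak point.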

So, on to the questions.  The above seems to work just fine for me in the 
early prototype stages of development, but is there a downside to this 
technique?  Will I hit any performance, space, or general efficiency 
penalties in the datastore by using IDs that are essentially randomly 
scattered across the entire 63-bit ID space?  Does anything about this 
strike people as a terrible idea that would justify a major rethink of my 
approach?  What techniques are others using when they have externally 
assigned UUIDs as primary keys for some of their model classes?

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/google-appengine/-/IqEKH0ZY5BkJ.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.
