Hello, I'm trying to do the following periodically (let's say once a week):
- download a couple of public datasets
- merge them together, resulting in a dictionary (I'm using Python) of ~2.5m entries
- upload the result to Cloud Datastore so that I have it as "reference data" for other things running in the project

I've put together a Python script using google-cloud-datastore, but the performance is abysmal: it takes around 10 hours (!) to do this. What I'm doing:

- iterate over the entries from the datastore
- look them up in my dictionary and decide if they need an update or a delete (if no longer present in the dictionary)
- write them back / delete them as needed
- insert any new elements from the dictionary

I already batch the requests (using .put_multi, .delete_multi, etc.).

Some things I considered:

- Use Dataflow. The problem is that each task would have to load the dataset (my "dictionary") into memory, which is time- and memory-consuming.
- Use the managed import / export. The problem is that it produces / consumes some undocumented binary format (I would guess entities serialized as protocol buffers?).
- Use multiple threads locally to mitigate the latency. The problem is that the google-cloud-datastore library has limited support for cursors (it doesn't have an "advance cursor by X" method, for example), so I don't have a way to efficiently divide up the entities from the Datastore into chunks that could be processed by different threads.

Any suggestions on how I could improve the performance?

Attila
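P.S. To make the "what I'm doing" steps concrete, here is a minimal sketch of the comparison and batching logic, without any Datastore calls. The names plan_sync and batched are hypothetical helpers written for this illustration, not part of google-cloud-datastore, and the batch size of 500 is an assumption based on the per-commit entity limit as I understand it. Values are compared with plain equality, which may not match how the real entities should be compared.

```python
from itertools import islice

# Assumption: 500 entities per commit; adjust if the real limit differs.
BATCH_SIZE = 500


def plan_sync(existing, reference):
    """Decide what to write and what to delete.

    existing:  iterable of (key, value) pairs as read from the datastore
    reference: dict built from the merged public datasets
    Returns (to_put, to_delete): a dict of key -> value to write,
    and a list of keys to delete.
    """
    to_put = {}
    to_delete = []
    seen = set()
    for key, stored in existing:
        seen.add(key)
        if key not in reference:
            # No longer present in the dictionary: delete it.
            to_delete.append(key)
        elif reference[key] != stored:
            # Present but changed: write the new value back.
            to_put[key] = reference[key]
    # Anything in the dictionary that was never seen is a new element.
    for key, value in reference.items():
        if key not in seen:
            to_put[key] = value
    return to_put, to_delete


def batched(items, size=BATCH_SIZE):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

In the real script, each chunk from batched(...) would be turned into entities or keys and passed to .put_multi or .delete_multi respectively.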
