Hello,

I'm trying to do the following periodically (let's say once a week):

- download a couple of public datasets
- merge them together, resulting in a dictionary (I'm using Python) of 
~2.5 million entries
- upload the result to Cloud Datastore so that I have it as "reference 
data" for other things running in the project

I've put together a Python script using google-cloud-datastore, but the 
performance is abysmal: it takes around 10 hours (!) to do this. What I'm 
doing:

- iterate over the entries from the datastore
- look them up in my dictionary and decide if they need an update or a 
delete (if no longer present in the dictionary)
- write them back / delete them as needed
- insert any new elements from the dictionary
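
The comparison in the steps above can be sketched roughly as follows. This is only an illustration that uses plain dictionaries and makes no Datastore calls; the function name plan_sync and its arguments are hypothetical, and it assumes the stored entries can be compared with the new values using ordinary equality.

```python
def plan_sync(existing, desired):
    """Compare stored entries with the freshly merged dictionary.

    existing: dict mapping key -> properties currently stored
    desired:  dict mapping key -> properties from the merged datasets
    Returns (to_put, to_delete): entries to write and keys to remove.
    """
    to_put = {}
    to_delete = []
    for key, stored in existing.items():
        if key not in desired:
            # No longer present in the dictionary: delete it.
            to_delete.append(key)
        elif desired[key] != stored:
            # Present but changed: write the new value back.
            to_put[key] = desired[key]
    for key, value in desired.items():
        if key not in existing:
            # New element: insert it.
            to_put[key] = value
    return to_put, to_delete
```

Unchanged entries end up in neither result, so only the real differences would have to be sent over the network.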

I already batch the requests (use .put_multi, .delete_multi, etc).
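
For reference, the batching is roughly along these lines. The helper name chunked and the constant MAX_BATCH are hypothetical, and the value 500 is an assumption about the per-call entity limit that should be checked against the current Datastore limits.

```python
from itertools import islice

# Assumed per-call entity limit; verify against the current Datastore limits.
MAX_BATCH = 500


def chunked(items, size=MAX_BATCH):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Each batch is then passed to one .put_multi or .delete_multi call, for example `for batch in chunked(entities): client.put_multi(batch)`, where `client` and `entities` stand for objects created elsewhere in the script.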

Some things I considered:

- Use Dataflow. The problem is that each task would have to load the 
dataset (my "dictionary") into memory, which is time- and memory-consuming
- Use the managed import / export. Problem is that it produces / consumes 
some undocumented binary format (I would guess entities serialized as 
protocol buffers?)
- Use multiple threads locally to mitigate the latency. The problem is that 
the google-cloud-datastore library has limited support for cursors (it 
doesn't have an "advance cursor by X" method, for example), so I don't have 
a way to efficiently divide up the entities from the Datastore into chunks 
which could be processed by different threads
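
One idea for the multi-threaded option, sketched below without any Datastore calls: derive the chunk boundaries from the sorted keys of the local dictionary instead of from cursors. This assumes the entity keys are the same values as the dictionary keys and that each range could be fetched with a query bounded by key, which I have not verified. The names split_ranges, run_in_threads and process_range are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(sorted_keys, parts):
    """Return up to `parts` half-open (start, end) ranges covering all keys.

    sorted_keys must be a sorted list. None means "unbounded" on that side,
    so stored keys that are missing from the dictionary still fall into
    exactly one range.
    """
    size = max(1, -(-len(sorted_keys) // parts))  # ceiling division
    bounds = sorted_keys[size::size]  # first key of every range but the first
    return list(zip([None] + bounds, bounds + [None]))


def run_in_threads(ranges, process_range, workers=8):
    """Call process_range(start, end) for each range on a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: process_range(*r), ranges))
```

Here process_range would be the function that reads the entities with start <= key < end, compares them with the dictionary, and writes the changes back.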

Any suggestions on how I could improve the performance?

Attila
