I do believe Dataflow would be the best option here if configured with many 
workers (the workload can be split along the same lines as your current 
batched requests). I'm not sure what type of datasets your 'dictionary' is 
built from but, correct me if I'm wrong, my understanding of your current 
script is that you check one Datastore entity at a time against the ~2.5M 
entries in your dictionary and repeat the process until every Datastore 
entity has been checked. Are you creating the Datastore keys from your 
dictionary entries, or are you letting Datastore generate them? If it's the 
former, there's a possibility that you are experiencing a 'hotspot' issue 
due to a narrow key range, as explained here 
<https://cloud.google.com/datastore/docs/best-practices#high_readwrite_rates_to_a_narrow_key_range>.
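
For illustration, here is a minimal sketch of the two key-creation 
approaches using the google-cloud-datastore client. The kind name 
"ReferenceData" and the key names are placeholders I made up, not something 
taken from your script:

from google.cloud import datastore

client = datastore.Client()

# Named key built from a dictionary entry ("ReferenceData" is a placeholder
# kind). If these names are sequential or share a common prefix, bulk writes
# land in a narrow key range and can hotspot.
named_key = client.key("ReferenceData", "entry-0000001")

# Letting Datastore allocate IDs scatters the keys across the key space,
# which avoids the narrow-range problem for large bulk writes.
incomplete_key = client.key("ReferenceData")
allocated_keys = client.allocate_ids(incomplete_key, 10)

entity = datastore.Entity(key=allocated_keys[0])
entity.update({"value": "example"})
client.put(entity)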

Regarding the memory, writing the dictionary out to a file (yes, it will be 
big) and uploading it to Cloud Storage should be the ideal way to create the 
PCollection 
<https://beam.apache.org/documentation/programming-guide/#creating-a-pcollection> 
to be used. You can delete the file once the whole process is finished to 
minimize the costs. Depending on the source of these public datasets, you 
could also skip building the in-memory dictionary entirely and simply store 
the data in a file. However, I'm not sure whether the dictionary is being 
used somewhere else, like for the "reference data" you mentioned.
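
To make the PCollection part concrete, here is a minimal Beam sketch, 
assuming the dictionary is exported as newline-delimited JSON; the bucket, 
project and region values are placeholders:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder pipeline options; fill in your own project, bucket and region.
options = PipelineOptions(
    runner="DataflowRunner",
    project="your-project-id",
    temp_location="gs://your-bucket/tmp",
    region="us-central1",
)

with beam.Pipeline(options=options) as p:
    dictionary = (
        p
        | "ReadFromGCS" >> beam.io.ReadFromText("gs://your-bucket/dictionary.json")
        | "ParseJSON" >> beam.Map(json.loads)
    )
    # Downstream transforms would compare this PCollection against the
    # existing Datastore entities and issue the writes / deletes.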

I'm not sure I fully understand the undocumented binary format you 
mentioned; would you be able to provide an example? 


On Wednesday, July 11, 2018 at 2:18:32 PM UTC-4, Attila-Mihaly Balazs wrote:
>
> Hello,
>
> I'm trying to do the following periodically (let's say once a week):
>
> - download a couple of public datasets
> - merge them together, resulting in a dictionary (I'm using Python) of 
> ~2.5m entries
> - upload the result to Cloud Datastore so that I have it as "reference 
> data" for other things running in the project
>
> I've put together a python script using google-cloud-datastore however the 
> performance is abysmal - it takes around 10 hours (!) to do this. What I'm 
> doing:
>
> - iterate over the entries from the datastore
> - look them up in my dictionary and decide if they need to be updated / 
> deleted (if no longer present in the dictionary)
> - write them back / delete them as needed
> - insert any new elements from the dictionary
>
> I already batch the requests (use .put_multi, .delete_multi, etc).
>
> Some things I considered:
>
> - Use DataFlow. The problem is that each task would have to load the 
> dataset (my "dictionary") into memory which is time and memory consuming
> - Use the managed import / export. Problem is that it produces / consumes 
> some undocumented binary format (I would guess entities serialized as 
> protocol buffers?)
> - Use multiple threads locally to mitigate the latency. Problem is the 
> google-cloud-datastore library has limited support for cursors (it doesn't 
> have an "advance cursor by X" method for example) so I don't have a way to 
> efficiently divide up the entities from the DataStore into chunks which 
> could be processed by different threads
>
> Any suggestions on how I could improve the performance?
>
> Attila
>
