I do believe Dataflow would be the best option here if configured with many 
workers (the work can be split based on your current batch requests). I'm not 
sure what type of datasets your 'dictionary' is built from but, correct me if 
I'm wrong, my understanding of your current script is that you are checking 
Datastore entities one batch at a time against the ~2.5M entries in your 
dictionary, and you keep repeating the process until all Datastore entities 
have been checked against the dictionary. Are you creating the Datastore keys 
based on your dictionary entries, or are you letting Datastore generate the 
keys? If it's the former, there's a possibility that you are experiencing a 
'hotspot' issue due to a narrow key range, as explained here 
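If you are naming keys from your dictionary entries and those names are mostly 
sequential, one common workaround is to prepend a short deterministic hash so 
writes land across the whole key range instead of one narrow band. A minimal 
sketch of the idea (the function name and prefix length are just illustrative, 
not from your script):

```python
import hashlib

def spread_key_name(natural_id: str, prefix_len: int = 4) -> str:
    """Prepend a short, deterministic hash so sequential IDs no longer
    fall into a single, monotonically increasing key range."""
    digest = hashlib.md5(natural_id.encode("utf-8")).hexdigest()[:prefix_len]
    return f"{digest}_{natural_id}"

# Sequential natural IDs now map to scattered key names, but the mapping
# stays deterministic, so you can still compute the key for any entry.
```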

Regarding the memory, writing the dictionary out to a file (yes, it will be 
big) and uploading it to Cloud Storage should be the ideal way to create the 
file to be used. You can delete the file once the whole process is finished to 
minimize the costs. Depending on the source of these public datasets, you 
could also skip the dictionary creation entirely and simply store the data in 
a file. However, I'm not sure whether the dictionary is being used somewhere 
else, like the "reference data".
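Either way, the compare-and-sync step you describe is plain set logic that is 
independent of where the dictionary lives (in memory or in a Cloud Storage 
file). A rough sketch of that diff, plus chunking the results so each batch 
stays under Datastore's 500-mutation commit limit for .put_multi / 
.delete_multi (all names here are illustrative, not from your script):

```python
def diff_reference_data(existing: dict, desired: dict):
    """Compare what is already stored (existing) with the freshly merged
    dictionary (desired); return what to write and which keys to remove."""
    to_delete = [key for key in existing if key not in desired]
    to_put = {key: value for key, value in desired.items()
              if key not in existing or existing[key] != value}
    return to_put, to_delete

def chunks(items: list, size: int = 500):
    """Yield fixed-size batches, e.g. one per put_multi/delete_multi call."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

Computing the diff up front means unchanged entries are never rewritten, 
which should cut the number of Datastore operations considerably.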

I'm not fully clear on the undocumented binary format you mentioned; would 
you be able to provide an example? 

On Wednesday, July 11, 2018 at 2:18:32 PM UTC-4, Attila-Mihaly Balazs wrote:
> Hello,
> I'm trying to do the following periodically (let's say once a week):
> - download a couple of public datasets
> - merge them together, resulting in a dictionary (I'm using Python) of 
> ~2.5m entries
> - upload the result to Cloud Datastore so that I have it as "reference 
> data" for other things running in the project
> I've put together a python script using google-cloud-datastore however the 
> performance is abysmal - it takes around 10 hours (!) to do this. What I'm 
> doing:
> - iterate over the entries from the datastore
> - look them up in my dictionary and decide if they need update / delete (if 
> no longer present in the dictionary)
> - write them back / delete them as needed
> - insert any new elements from the dictionary
> I already batch the requests (use .put_multi, .delete_multi, etc).
> Some things I considered:
> - Use DataFlow. The problem is that each task would have to load the 
> dataset (my "dictionary") into memory which is time and memory consuming
> - Use the managed import / export. Problem is that it produces / consumes 
> some undocumented binary format (I would guess entities serialized as 
> protocol buffers?)
> - Use multiple threads locally to mitigate the latency. Problem is the 
> google-cloud-datastore library has limited support for cursors (it doesn't 
> have an "advance cursor by X" method for example) so I don't have a way to 
> efficiently divide up the entities from the DataStore into chunks which 
> could be processed by different threads
> Any suggestions on how I could improve the performance?
> Attila

You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.