Hi,

The upload process is problematic for large data sets, but splitting the input into smaller chunks makes it reasonably bearable.
I'm attaching an email I sent internally (in raw format) after playing with bulk upload a bit on Ubuntu 10.04 64-bit. Note: I used Python 2.6. This is possibly the root cause of the problems, as the Python SDK is designed to work with Python 2.5.

---------- Forwarded message ----------
From: Maxim Veksler <[email protected]>
Date: Wed, Sep 1, 2010 at 5:00 PM
Subject: Loading MaxMind GeoIP Country database
To: ...

Below is a summary of the steps that were required to upload the content of the MaxMind *Country* database to the App Engine datastore. Provided for future reference.

*Download DB in CSV format from MaxMind*

Download the CSV from http://www.maxmind.com/app/geoip_country

*Split the CSV file*

Then we need to split this file into chunks, so we do something like this:

    $ mkdir /home/maxim/Downloads/GeoIPCountryWhois_SPLIT
    $ mkdir /home/maxim/Downloads/GeoIPCountryWhois_SPLIT/processed
    $ cd /home/maxim/Downloads/GeoIPCountryWhois_SPLIT
    $ wc -l /home/maxim/Downloads/GeoIPCountryWhois.csv
    131578 /home/maxim/Downloads/GeoIPCountryWhois.csv
    $ cat /home/maxim/Downloads/GeoIPCountryWhois.csv | chunksplit.sh /home/maxim/Downloads/GeoIPCountryWhois_SPLIT/SPLIT
    WRITING: /home/maxim/Downloads/GeoIPCountryWhois_SPLIT/SPLIT_000000000000001-000000000001000.csv
    WRITING: /home/maxim/Downloads/GeoIPCountryWhois_SPLIT/SPLIT_000000000001001-000000000002000.csv
    WRITING: /home/maxim/Downloads/GeoIPCountryWhois_SPLIT/SPLIT_000000000002001-000000000003000.csv
    ...
    ...
    ...
    WRITING: /home/maxim/Downloads/GeoIPCountryWhois_SPLIT/SPLIT_000000000130001-000000000131000.csv
    WRITING: /home/maxim/Downloads/GeoIPCountryWhois_SPLIT/SPLIT_000000000131001-000000000132000.csv

chunksplit is this little utility script:

    $ cat /home/maxim/bin/chunksplit.sh
    #!/bin/bash
    # Split stdin into files of BLOCKSIZE lines each, named
    # <prefix>_<first line>-<last line>.csv
    DST_TEMPLATE="$1"
    __line_counter=0
    BLOCKSIZE=1000

    # IFS= and -r keep leading whitespace and backslashes intact
    while IFS= read -r line; do
        if (( __line_counter % BLOCKSIZE == 0 )); then
            TARGET=$(printf "%s_%015d-%015d.csv" "$DST_TEMPLATE" "$((__line_counter + 1))" "$((__line_counter + BLOCKSIZE))")
            echo "WRITING: $TARGET"
            # TARGET="${DST_TEMPLATE}_$((__line_counter + 1))-$((__line_counter + BLOCKSIZE)).csv"
        fi
        # >> appends, so clear old chunks before re-running the split
        printf '%s\n' "$line" >> "$TARGET"
        __line_counter=$((__line_counter + 1))
    done

*Upload the file to DataStore*

Next we upload the files to our application's datastore.

Note: This is not a stable process. Google will sometimes return JavaException errors, the script can hang, or your request may time out. Therefore it's important to keep rerunning this command until all files have been moved from /home/maxim/Downloads/GeoIPCountryWhois_SPLIT to /home/maxim/Downloads/GeoIPCountryWhois_SPLIT/processed.

The actual command to execute is:

    for split in $(find /home/maxim/Downloads/GeoIPCountryWhois_SPLIT/ -maxdepth 1 -name 'SPLIT*' -type f); do
        echo "WORKING ON $split"
        appcfg.py upload_data \
            --num_threads=30 --batch_size=100 --bandwidth_limit=5000000 --rps_limit=500000 --http_limit=1000 \
            --config_file=/home/maxim/workspace/FooBar-Python/bulkloader.yaml \
            --kind=GeoIPCountryZone \
            --filename="$split" --url=http://FooBar-staging.appspot.com/remote_api && mv "$split" /home/maxim/Downloads/GeoIPCountryWhois_SPLIT/processed
    done

The YAML we are using for the bulk uploader looks like this:

    $ cat /home/maxim/workspace/FooBar-Python/bulkloader.yaml
    python_preamble:
    - import: google.appengine.ext.bulkload.transform
    - import: google.appengine.ext.db
    - import: re
    - import: base64

    transformers:
    - kind: GeoIPCountryZone
      connector: csv
      connector_options:
        column_list: [ip_range_start, ip_range_end, ip_range_n_start, ip_range_n_end, country_code, country_name]
      property_map:
        - property: __key__
          import_template: "%(ip_range_start)s-%(ip_range_end)s"
        - property: ip_range_start
          external_name: ip_range_start
        - property: ip_range_end
          external_name: ip_range_end
        - property: ip_range_n_start
          external_name: ip_range_n_start
          import_transform: long
        - property: ip_range_n_end
          external_name: ip_range_n_end
          import_transform: long
        - property: country_code
          external_name: country_code
        - property: country_name
          external_name: country_name

---------- Forwarded message ----------
From: Maxim Veksler <[email protected]>
Date: Wed, Sep 1, 2010 at 5:46 PM
Subject: Re: Loading MaxMind GeoIP Country database

Also, I forgot to mention a little hack to keep the SDK from asking you for a username and password on each call to appcfg:

    ma...@maxim-desktop:/tmp/appengine-python-sdk-1.3.7$ diff -Naur -x '*.pyc' google_appengine/ /home/maxim/Desktop/sdk/appengine-python-sdk-1.3.7/
    diff -Naur -x '*.pyc' google_appengine/google/appengine/tools/bulkloader.py /home/maxim/Desktop/sdk/appengine-python-sdk-1.3.7/google/appengine/tools/bulkloader.py
    --- google_appengine/google/appengine/tools/bulkloader.py  2010-08-25 20:04:27.000000000 +0300
    +++ /home/maxim/Desktop/sdk/appengine-python-sdk-1.3.7/google/appengine/tools/bulkloader.py  2010-09-01 14:07:20.355917791 +0300
    @@ -1191,6 +1191,8 @@
         Returns:
           A pair of the username and password.
         """
    +    return ('[email protected]', 'YOUR-EMAIL-PASSWORD')
    +
         if self.email:
           email = self.email
         else:

Maxim.

HTH.
Maxim.

On Tue, Sep 14, 2010 at 9:06 PM, Rahul Ravikumar <[email protected]> wrote:
> Recently, i have come across an error (quite frequently) with the
> RemoteApiServlet as well as the remote_api handler.
>
> While bulk loading large amounts of data using the Bulk Loader, i
> start seeing random HTTP 500 errors, with the following details (in
> the log file):
>
> Request was aborted after waiting too long to attempt to service your
> request. This may happen sporadically when the App Engine serving
> cluster is under unexpectedly high or uneven load. If you see this
> message frequently, please contact the App Engine team.
>
> Can someone explain what i might be doing wrong? This errors prevents
> the Bulk Loader from uploading any data further, and I am having to
> start all over again.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine for Java" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to [email protected].
> For more options, visit this group at
> http://groups.google.com/group/google-appengine-java?hl=en.
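
The "keep rerunning until everything has moved to processed/" step from the forwarded email can be automated. Below is a minimal sketch in Python, not something I have run against App Engine. The names upload_all, pending_chunks, appcfg_command, max_rounds, pause_seconds and timeout are hypothetical. The appcfg.py flags are copied from the shell loop in the email. The script only shells out to appcfg.py, so it assumes appcfg.py is executable and on the PATH, and that it is run as a standalone helper under Python 3 rather than inside the SDK.

```python
import shutil
import subprocess
import time
from pathlib import Path


def pending_chunks(split_dir):
    """Return the SPLIT* files that have not been moved to processed/ yet."""
    return sorted(p for p in Path(split_dir).glob("SPLIT*") if p.is_file())


def appcfg_command(chunk):
    """Build the upload command for one chunk.

    Flags are copied from the shell loop in the email; the config path and
    URL are the ones used there and need adjusting for another application.
    """
    return [
        "appcfg.py", "upload_data",
        "--num_threads=30", "--batch_size=100",
        "--bandwidth_limit=5000000", "--rps_limit=500000", "--http_limit=1000",
        "--config_file=/home/maxim/workspace/FooBar-Python/bulkloader.yaml",
        "--kind=GeoIPCountryZone",
        "--filename=%s" % chunk,
        "--url=http://FooBar-staging.appspot.com/remote_api",
    ]


def upload_all(split_dir, build_command, max_rounds=10, pause_seconds=5.0,
               timeout=None):
    """Run one command per pending chunk, for at most max_rounds passes.

    A chunk is moved to processed/ only when its command exits with status 0.
    A command that exceeds `timeout` seconds is killed and counted as failed.
    Returns the chunks that are still pending at the end.
    """
    split_dir = Path(split_dir)
    processed = split_dir / "processed"
    processed.mkdir(exist_ok=True)

    for _ in range(max_rounds):
        chunks = pending_chunks(split_dir)
        if not chunks:
            break
        for chunk in chunks:
            print("WORKING ON", chunk)
            try:
                ok = subprocess.run(build_command(chunk),
                                    timeout=timeout).returncode == 0
            except subprocess.TimeoutExpired:
                ok = False  # hung upload: leave the chunk for the next pass
            if ok:
                shutil.move(str(chunk), str(processed / chunk.name))
        if pending_chunks(split_dir):
            time.sleep(pause_seconds)  # brief pause before retrying failures

    return pending_chunks(split_dir)
```

Usage would look something like this (the timeout value is a guess, pick one that fits your chunk size):

    left = upload_all("/home/maxim/Downloads/GeoIPCountryWhois_SPLIT", appcfg_command, timeout=600)
    print("still pending:", left)

Because a chunk only moves after a successful exit, a failed or hung chunk is uploaded again in full on the next pass, same as with the shell loop.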
