Hi,

The upload process is problematic for large data sets, but splitting the
input into smaller chunks makes it reasonably bearable.

I'm attaching an email I sent internally (in raw format) after playing with
bulkupload a bit on Ubuntu 10.04 64-bit.
Note: I used Python 2.6 (it's possible that this is the root cause of the
problems, as the Python SDK is designed to work with Python 2.5).

---------- Forwarded message ----------
From: Maxim Veksler <[email protected]>
Date: Wed, Sep 1, 2010 at 5:00 PM
Subject: Loading MaxMind GeoIP Country database
To:  ...


Below is a summary of the steps that were required to upload the contents
of the MaxMind *Country* database to the App Engine datastore.

Provided for future reference.

*Download DB in CSV format from MaxMind*
Download CSV from http://www.maxmind.com/app/geoip_country

*Split the CSV file*
Then we need to split this file into chunks, so we do something like this:

$ mkdir /home/maxim/Downloads/GeoIPCountryWhois_SPLIT
$ mkdir /home/maxim/Downloads/GeoIPCountryWhois_SPLIT/processed
$ cd /home/maxim/Downloads/GeoIPCountryWhois_SPLIT

$ wc -l /home/maxim/Downloads/GeoIPCountryWhois.csv
131578 /home/maxim/Downloads/GeoIPCountryWhois.csv

$ cat /home/maxim/Downloads/GeoIPCountryWhois.csv | chunksplit.sh /home/maxim/Downloads/GeoIPCountryWhois_SPLIT/SPLIT
WRITING:
/home/maxim/Downloads/GeoIPCountryWhois_SPLIT/SPLIT_000000000000001-000000000001000.csv
WRITING:
/home/maxim/Downloads/GeoIPCountryWhois_SPLIT/SPLIT_000000000001001-000000000002000.csv
WRITING:
/home/maxim/Downloads/GeoIPCountryWhois_SPLIT/SPLIT_000000000002001-000000000003000.csv
...
...
...
WRITING:
/home/maxim/Downloads/GeoIPCountryWhois_SPLIT/SPLIT_000000000130001-000000000131000.csv
WRITING:
/home/maxim/Downloads/GeoIPCountryWhois_SPLIT/SPLIT_000000000131001-000000000132000.csv


chunksplit.sh is this little utility script:

$ cat /home/maxim/bin/chunksplit.sh
#!/bin/bash

DST_TEMPLATE="$1"

__line_counter=0
BLOCKSIZE=1000

# IFS= and -r keep whitespace and backslashes in each line intact;
# quoting $line and $TARGET avoids word splitting and globbing.
while IFS= read -r line; do
        if (( __line_counter % BLOCKSIZE == 0 )); then
                TARGET=$(printf "%s_%015d-%015d.csv" "$DST_TEMPLATE" "$((__line_counter + 1))" "$((__line_counter + BLOCKSIZE))")
                echo "WRITING: $TARGET"
                # Unpadded alternative:
                # TARGET="${DST_TEMPLATE}_$((__line_counter + 1))-$((__line_counter + BLOCKSIZE)).csv"
        fi

        echo "$line" >> "$TARGET"
        __line_counter=$((__line_counter + 1))
done
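
For the record, GNU coreutils' split(1) can do the same chunking in one
line; the naming differs from the script's scheme, but the upload loop
below only globs on the SPLIT prefix anyway. A demo on synthetic data
(the real path is swapped for a temp dir so the commands are easy to try):

```shell
# Same chunking via coreutils split: 1000-line pieces, numeric suffixes.
# Synthetic input stands in for GeoIPCountryWhois.csv here.
workdir=$(mktemp -d)
seq 1 2500 > "$workdir/input.csv"
split -l 1000 -d -a 4 "$workdir/input.csv" "$workdir/SPLIT_"
ls "$workdir" | grep -c '^SPLIT_'     # 3 chunks (1000 + 1000 + 500 lines)
cat "$workdir"/SPLIT_* | wc -l        # 2500 -- no lines lost
```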

*Upload the file to DataStore*

Next we upload the file to our application's data store.
Note: this is not a stable process. Google will sometimes return
JavaException errors, the script can hang, or your request may time
out. Therefore it's important to keep rerunning this command until all
files have been moved from /home/maxim/Downloads/GeoIPCountryWhois_SPLIT
to /home/maxim/Downloads/GeoIPCountryWhois_SPLIT/processed.

The actual command to execute is:

for split in $(find /home/maxim/Downloads/GeoIPCountryWhois_SPLIT/ -maxdepth 1 -name 'SPLIT*' -type f); do
  echo "WORKING ON $split"
  appcfg.py upload_data \
    --num_threads=30 --batch_size=100 --bandwidth_limit=5000000 \
    --rps_limit=500000 --http_limit=1000 \
    --config_file=/home/maxim/workspace/FooBar-Python/bulkloader.yaml \
    --kind=GeoIPCountryZone \
    --filename="$split" --url=http://FooBar-staging.appspot.com/remote_api \
    && mv "$split" /home/maxim/Downloads/GeoIPCountryWhois_SPLIT/processed
done
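
Since the loop has to be rerun by hand until the SPLIT directory is
empty, it can also be wrapped in an outer retry loop. A sketch, under the
assumption that `upload_one` stands in for the appcfg.py invocation above:

```shell
# Retry until every chunk has been moved into processed/ (automating the
# manual rerunning described above). upload_one is a placeholder.
SRC=${SRC:-/tmp/GeoIPCountryWhois_SPLIT}
mkdir -p "$SRC/processed"
upload_one() { true; }   # replace body with the real appcfg.py call
while [ -n "$(find "$SRC" -maxdepth 1 -name 'SPLIT*' -type f)" ]; do
  for split in "$SRC"/SPLIT*; do
    [ -f "$split" ] || continue
    echo "WORKING ON $split"
    upload_one "$split" && mv "$split" "$SRC/processed/"
  done
done
```

A failed upload leaves the chunk in place, so the next pass of the outer
loop simply picks it up again.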

The YAML we are using for the bulk uploader looks like this:

$ cat /home/maxim/workspace/FooBar-Python/bulkloader.yaml
python_preamble:
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.db
- import: re
- import: base64

transformers:
- kind: GeoIPCountryZone
  connector: csv
  connector_options:
    column_list: [ip_range_start, ip_range_end, ip_range_n_start, ip_range_n_end, country_code, country_name]
  property_map:
    - property: __key__
      import_template: "%(ip_range_start)s-%(ip_range_end)s"
    - property: ip_range_start
      external_name: ip_range_start
    - property: ip_range_end
      external_name: ip_range_end
    - property: ip_range_n_start
      external_name: ip_range_n_start
      import_transform: long
    - property: ip_range_n_end
      external_name: ip_range_n_end
      import_transform: long
    - property: country_code
      external_name: country_code
    - property: country_name
      external_name: country_name
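
For orientation, the six columns declared in column_list line up with the
fields of the legacy MaxMind country CSV; the row below is illustrative
(made-up values), not copied from the real database:

```shell
# Illustrative record; fields map 1:1 onto column_list above:
# ip_range_start, ip_range_end, ip_range_n_start, ip_range_n_end,
# country_code, country_name
echo '"1.0.0.0","1.0.0.255","16777216","16777471","AU","Australia"'
```

The `__key__` template then names this entity "1.0.0.0-1.0.0.255", and the
two `long` transforms turn the numeric range columns into integers.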


---------- Forwarded message ----------
From: Maxim Veksler <[email protected]>
Date: Wed, Sep 1, 2010 at 5:46 PM
Subject: Re: Loading MaxMind GeoIP Country database


Also, I forgot to mention:

A little hack to keep the SDK from asking you for username & password
on each call to appcfg:

ma...@maxim-desktop:/tmp/appengine-python-sdk-1.3.7$ diff -Naur -x '*.pyc' google_appengine/ /home/maxim/Desktop/sdk/appengine-python-sdk-1.3.7/
diff -Naur -x '*.pyc' google_appengine/google/appengine/tools/bulkloader.py /home/maxim/Desktop/sdk/appengine-python-sdk-1.3.7/google/appengine/tools/bulkloader.py
--- google_appengine/google/appengine/tools/bulkloader.py       2010-08-25 20:04:27.000000000 +0300
+++ /home/maxim/Desktop/sdk/appengine-python-sdk-1.3.7/google/appengine/tools/bulkloader.py     2010-09-01 14:07:20.355917791 +0300
@@ -1191,6 +1191,8 @@
     Returns:
       A pair of the username and password.
     """
+    return ('[email protected]', 'YOUR-EMAIL-PASSWORD')
+
     if self.email:
       email = self.email
     else:
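
An alternative that avoids patching the SDK at all, assuming your
appcfg.py build supports them: the --email and --passin flags, where
--passin reads the password from stdin. A sketch, guarded so it degrades
gracefully where the SDK isn't installed:

```shell
# Alternative to the bulkloader.py patch: --email + --passin read the
# password from stdin (assumption: supported by your appcfg.py version).
if command -v appcfg.py >/dev/null 2>&1; then
  printf '%s\n' "$APPCFG_PASSWORD" | appcfg.py --email='[email protected]' --passin \
    upload_data --config_file=bulkloader.yaml --kind=GeoIPCountryZone \
    --filename="$split" --url=http://FooBar-staging.appspot.com/remote_api
else
  echo "appcfg.py not on PATH; showing the flags only"
fi
```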


Maxim.


HTH.

Maxim.

On Tue, Sep 14, 2010 at 9:06 PM, Rahul Ravikumar <[email protected]>wrote:

> Recently, I have come across an error (quite frequently) with the
> RemoteApiServlet as well as the remote_api handler.
>
> While bulk loading large amounts of data using the Bulk Loader, I
> start seeing random HTTP 500 errors, with the following details (in
> the log file):
>
> Request was aborted after waiting too long to attempt to service your
> request.
> This may happen sporadically when the App Engine serving cluster is
> under
> unexpectedly high or uneven load. If you see this message frequently,
> please
> contact the App Engine team.
>
> Can someone explain what I might be doing wrong? This error prevents
> the Bulk Loader from uploading any further data, and I am having to
> start all over again.

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine for Java" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/google-appengine-java?hl=en.
