ArielGlenn has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/280102 )
Change subject: archiveloader full pylint and pep8 ...................................................................... archiveloader full pylint and pep8 A lot of code cleanup, breaking up the too-long module into reasonably sized files, removing redundant code, compacting long sequences of if statements, etc. Change-Id: I67526d9b154d2b7a5aa9e38fb8561916aaefb454 --- M tools/archive.org/README A tools/archive.org/archivelib/__init__.py A tools/archive.org/archivelib/config.py A tools/archive.org/archivelib/curlargs.py A tools/archive.org/archivelib/error.py A tools/archive.org/archivelib/html_utils.py A tools/archive.org/archivelib/sitematrix.py A tools/archive.org/archivelib/uploader.py A tools/archive.org/archivelib/urls.py A tools/archive.org/archivelib/xml_utils.py M tools/archive.org/archiveuploader.py 11 files changed, 1,204 insertions(+), 1,025 deletions(-) Approvals: ArielGlenn: Looks good to me, approved jenkins-bot: Verified diff --git a/tools/archive.org/README b/tools/archive.org/README index dcd0ffa..054377b 100644 --- a/tools/archive.org/README +++ b/tools/archive.org/README @@ -1,30 +1,30 @@ This is the archiveuploader script which we use to upload dumps to archive.org -via their S3-style api. +via their S3-style api. -Notes: +Notes: This is used only for dumps. It does things like try to determine the language -of the project dumped by polling the en wikipedia SiteMatrix. +of the project dumped by polling the en wikipedia SiteMatrix. Setup: Create a config file. It should contain the access and secret key needed for -access to the archive.org api, the url to the SiteMatrix for your projects, +access to the archive.org api, the url to the SiteMatrix for your projects, and the full path to the name of the file where the SiteMatrix information will be cached. See archiveuploader.conf.sample for an example. 
-If you don't need multiple config files, put it in the default place: +If you don't need multiple config files, put it in the default place: "archiveuploader.conf.sample" in the working directory of the script. Make sure curl is installed on your system and a pointer to its location is also in the config file. -If you are not uploading to WMF items (buckets)... i.e. you are -an individual user uploading to some other bucket, you'll need to +If you are not uploading to WMF items (buckets)... i.e. you are +an individual user uploading to some other bucket, you'll need to add to the configuration file a pointer to the license covering the content of your files, the creator of the dumps, and the download location of the dumps. -By default the dbname of your project is assumed to be the same +By default the dbname of your project is assumed to be the same as the itemname that will appear in S3 urls. If you don't want this to be the case, you can specify a format string in the config file, putting %%s in the string in the spot where the dbname would go. @@ -37,7 +37,7 @@ Files to be uploaded: -We use this uploading a full directory of dumps tarred up into a +We use this uploading a full directory of dumps tarred up into a single file, with a name like elwiktionary-20060703.tar Obviously this only works for smaller projects. @@ -52,12 +52,12 @@ Set up a tarball of the dumped tables and xml files of a project for a given date. Run the script without arguments to see a detailed -help message about its invocation. +help message about its invocation. -If you have put all of your auth information in the config file, -you should be able to create the item with the command - python archiveuploader.py --createitem dbnamehere -to create the initial item. 
The dbname should be the actual dbname of the +If you have put all of your auth information in the config file, +you should be able to create the item with the command + python archiveuploader.py --createitem dbnamehere +to create the initial item. The dbname should be the actual dbname of the project with the dump you'll be uploading. *The item name is created from the dbname using the itemnameformat entry in the config file.* @@ -65,26 +65,26 @@ wind up in a todo queue which you can check here (must log in via web interface, no xml or json output available either): http://www.archive.org/catalog.php?justme=1 -Completed jobs are listed here: +Completed jobs are listed here: http://www.archive.org/catalog.php?history=1&justme=1&history=1 -If the metadata loks wrong, you can try to tweak it by altering +If the metadata looks wrong, you can try to tweak it by altering the config file settings, and then update the item by - python archiveuploader.py --updateitem dbnamehere + python archiveuploader.py --action updateitem --itemname dbnamehere Now you're ready to add objects (files) to the item (bucket). 
-You can see what the script thinks it should do: - python archiveuploader.py --uploadobject dbnamehere --objectname objectnamehere --filename pathtofiletoupload --dryrun --verbose +You can see what the script thinks it should do: + python archiveuploader.py --action uploadobject --itemname dbnamehere --objectname objectnamehere --filename pathtofiletoupload --dryrun --verbose If that all looks sensible, you can try uploading a file: - python archiveuploader.py --uploadobject dbnamehere --objectname objectnamehere --filename pathtofiletoupload + python archiveuploader.py --action uploadobject --itemname dbnamehere --objectname objectnamehere --filename pathtofiletoupload As before you can check on the progress of the upload here: http://www.archive.org/catalog.php?justme=1 You can list your objects (files) in an item: - python archiveuploader.py --listobjects dbnamehere + python archiveuploader.py --action listobjects --itemname dbnamehere You can also list all your items: - python archiveuploader.py --listitems + python archiveuploader.py --action listitems diff --git a/tools/archive.org/archivelib/__init__.py b/tools/archive.org/archivelib/__init__.py new file mode 100644 index 0000000..4ded676 --- /dev/null +++ b/tools/archive.org/archivelib/__init__.py @@ -0,0 +1,4 @@ +# -*- coding: utf-8 -*- +''' +library utils for archive uploader +''' diff --git a/tools/archive.org/archivelib/config.py b/tools/archive.org/archivelib/config.py new file mode 100644 index 0000000..806e296 --- /dev/null +++ b/tools/archive.org/archivelib/config.py @@ -0,0 +1,70 @@ +import os +import sys +import ConfigParser + + +class ArchiveUploaderConfig(object): + """Read contents of config file, if any. + If no filename is provided, the default name 'archiveuploader.conf' will + be checked. If it is not present, the files /etc/archiveuploader.conf and + .archiveuploader.conf will be checked, in that order.""" + + def __init__(self, config_file=None): + """Constructor. 
Args: + config_file -- path to configuration file. If not passed, + the default 'archiveuploader.conf' will be checked.""" + + self.project_name = False + self.access_key = None + self.secret_key = None + + home = os.path.dirname(sys.argv[0]) + if config_file is None: + config_file = "archiveuploader.conf" + self.files = [ + os.path.join(home, config_file), + "/etc/archiveuploader.conf", + os.path.join(os.getenv("HOME"), ".archiveuploader.conf")] + + self.conf = ConfigParser.SafeConfigParser(self.get_config_defaults()) + self.conf.read(self.files) + self.settings = self.parse_conf_file() + + @staticmethod + def get_config_defaults(): + """return a dict of configuration setting defaults usable + by the Config module""" + defaults = { + # "auth": { + "accesskey": "", + "secretkey": "", + "username": "", + "password": "", + # "output": { + "sitematrixfile": "", + # "web": { + "apiurl": "http://en.wikipedia.org/w/api.php", + "curl": "/usr/bin/curl", + "itemnameformat": "%%s", + "licenseurl": "http://wikimediafoundation.org/wiki/Terms_of_Use", + "creator": "the Wikimedia Foundation", + "downloadurl": "http://dumps.wikimedia.org" + } + return defaults + + def parse_conf_file(self): + """Get contents of config file, using new values to overwrite + corresponding defaults.""" + settings = {} + settings['access_key'] = self.conf.get("auth", "accesskey") + settings['secret_key'] = self.conf.get("auth", "secretkey") + settings['username'] = self.conf.get("auth", "username") + settings['password'] = self.conf.get("auth", "password") + settings['site_matrix_file'] = self.conf.get("output", "sitematrixfile") + settings['api_url'] = self.conf.get("web", "apiurl") + settings['curl'] = self.conf.get("web", "curl") + settings['item_name_format'] = self.conf.get("web", "itemnameformat") + settings['license_url'] = self.conf.get("web", "licenseurl") + settings['creator'] = self.conf.get("web", "creator") + settings['downloadurl'] = self.conf.get("web", "downloadurl") + return settings 
diff --git a/tools/archive.org/archivelib/curlargs.py b/tools/archive.org/archivelib/curlargs.py new file mode 100644 index 0000000..50b9e17 --- /dev/null +++ b/tools/archive.org/archivelib/curlargs.py @@ -0,0 +1,54 @@ +def get_location_curl_arg(): + """Returns the argument that causes curl to follow all redirects""" + return ["--location"] + + +def get_rest_of_login_curl_args(): + """Returns some bizarre arguments needed for archive.org login + partly cause we want to get the cookies without all the html, partly + cause login fails without this 'test cookie', how is that possible? >_<""" + return ['-s', '-c', '-', '-b', 'test-cookie=1', '-o', '/dev/null'] + + +def get_quiet_curl_arg(): + return ["-s"] + + +def get_no_derive_curl_arg(): + """This tag tells archive.org not to try to derive a bunch of other + formats from this file (which it would do for videos, for example). + We've been requested to add this since our files have no derivative + formats.""" + return ["--header", "x-archive-queue-derive:0"] + + +def get_head_req_curl_args(): + """Returns the curl arguments needed to do head request and write + out just the http return code""" + args = get_quiet_curl_arg() + args.extend(["--write-out", "%{http_code}", "-X", "HEAD"]) + return args + + +def get_head_with_output_curl_args(): + """Returns the curl arguments needed to do head request and write + out everything""" + return ["--head"] + + +def get_show_headers_curl_arg(): + """Returns the curl argument needed to do a normal (post or + get) request and show the headers along with the output""" + return ["--include"] + + +def get_item_creation_curl_args(): + """Returns the curl arguments needed to put an empty file; + this is used for updating or creating an item (bucket).""" + return ["-X", "PUT", "--header", "Content-Length: 0"] + + +def get_ign_exist_bucket_curl_arg(): + """Return the curl argument required for overwriting the metadata + of an existing item (bucket).""" + return ["--header", 
"x-archive-ignore-preexisting-bucket:1"] diff --git a/tools/archive.org/archivelib/error.py b/tools/archive.org/archivelib/error.py new file mode 100644 index 0000000..f2387dd --- /dev/null +++ b/tools/archive.org/archivelib/error.py @@ -0,0 +1,4 @@ +class ArchiveUploaderError(Exception): + """Exception class for the Archive Uploader and all of + its related classes. Doesn't do much :-P""" + pass diff --git a/tools/archive.org/archivelib/html_utils.py b/tools/archive.org/archivelib/html_utils.py new file mode 100644 index 0000000..5a64e07 --- /dev/null +++ b/tools/archive.org/archivelib/html_utils.py @@ -0,0 +1,90 @@ +import re +from archivelib.error import ArchiveUploaderError + + +def get_login_cookies(text): + """Get cookie out of the text returned from the + archive.org login form. gahhhh""" + # format: .archive.org^ITRUE^I/^IFALSE^I1361562342^Ilogged-in-sig^Isomehugenumberhere + # .archive.org^ITRUE^I/^IFALSE^I1361562342^Ilogged-in-user^Ijohndoe%40wikimedia.org$ + # plus others which we will ignore. + cookies = [] + lines = text.split('\n') + for line in lines: + if (not len(line)) or (line[0] == '#'): + continue + parts = line.split('\t') + print parts + if parts[5] == 'logged-in-user': + cookies.append("%s=%s" % (parts[5], parts[6])) + elif parts[5] == 'logged-in-sig': + cookies.append("%s=%s" % (parts[5], parts[6])) + if len(cookies): + return '; '.join(cookies) + else: + return None + + +def strip_hidden(cell): + start = cell.find('<span class="catHidden">') + if start != -1: + index = start + 1 + open_tags = 1 + span_open_or_close_expr = re.compile('(<span[^>]+>|</span>)') + # find index of first occurrence and what was matched, + # so we can see if it was open or close tag + while open_tags: + span_match = span_open_or_close_expr.search(cell[index:]) + if span_match: + tag_found = span_match.group(1) + index = span_match.start(1) + index + if tag_found == '</span>': + open_tags = open_tags - 1 + else: + open_tags = open_tags + 1 + else: + # bad html. 
just toss the rest of the cell + open_tags = 0 + index = -1 + # now we have the index where we found the close tag for us. + # toss everything up to that, we'll lose the actual close tag when we + # toss the rest of the html + cell = cell[:start] + cell[index:] + return cell + + +def show_item_status_from_html(text): + """Wade through the html output to find information + about each job we have requested and its status. + THIS IS UGLY and guaranteed to break in the future + but hey, there's no json output available, nor xml.""" + html_tag_expr = re.compile('(<[^>]+>)+') + # get the headers for the table of tasks + start = text.find('<tr><th><b><a href="/catalog.php?history=1&identifier=') + if start >= 0: + end = text.find('<!--task_ids: -->', start) + content = text[start:end] + print "content is", content + lines = content.split('</th>') + lines = [re.sub(html_tag_expr, '', line).strip() + for line in lines if + line.find('<th><b><a href="/catalog.php?history=1&identifier=') != -1] + print ' | '.join(filter(None, lines)) + + # get the tasks themselves + start = text.find('<!--task_ids: -->') + if start < 0: + raise ArchiveUploaderError( + "Can't locate the beginning of the item status information" + " in the html output.") + end = text.find('</table>', start) + content = text[start:end] + lines = content.split('</tr>') + + for line in lines: + line = line.replace('\n', '') + cells = line.split('</td>') + cells_to_print = [re.sub(html_tag_expr, '', + strip_hidden(cell)).strip() + for cell in cells] + print ' | '.join(filter(None, cells_to_print)) diff --git a/tools/archive.org/archivelib/sitematrix.py b/tools/archive.org/archivelib/sitematrix.py new file mode 100644 index 0000000..c630514 --- /dev/null +++ b/tools/archive.org/archivelib/sitematrix.py @@ -0,0 +1,160 @@ +import os +import json +import codecs +from subprocess import Popen, PIPE +from archivelib.error import ArchiveUploaderError + + +def set_site_entries(site_info, lang_info): + """ + given pointers 
to project info, extract the + project type and language, and return + a little dict with this information + """ + site_entries = {} + site_entries['project'] = site_info['code'] + # special hack + if site_entries['project'] == 'wiki': + site_entries['project'] = 'wikipedia' + if lang_info is None: + site_entries['locallangname'] = None + site_entries['lang'] = None + else: + site_entries['locallangname'] = lang_info['localname'] + site_entries['lang'] = lang_info['code'] + return site_entries + + +def api_matrix_json_to_dict(json_string): + """ + Convert the sitematrix json string to a dict for our use, + keeping only the information we want: dbname, project name, lang code. + + Sample input: + { u'localname': u'Aromanian', + u'code': u'roa-rup', + u'name': u'Arm\xe3neashce', + u'site': + [ + {u'url': u'http://roa-rup.wikipedia.org', \ + u'code': u'wiki', u'dbname': u'roa_rupwiki'}, + {u'url': u'http://roa-rup.wiktionary.org', \ + u'code': u'wiktionary', u'dbname': u'roa_rupwiktionary'} + ] + } + """ + matrix_json = json.loads(json_string) + matrix = {} + + for key in matrix_json['sitematrix'].keys(): + if key == 'count': + continue + if key == 'specials': + for site in range(0, len((matrix_json['sitematrix'][key]))): + sitename = matrix_json['sitematrix'][key][site]['dbname'] + matrix[sitename] = set_site_entries(matrix_json['sitematrix'][key][site], None) + else: + for site in range(0, len((matrix_json['sitematrix'][key]['site']))): + sitename = matrix_json['sitematrix'][key]['site'][site]['dbname'] + matrix[sitename] = set_site_entries(matrix_json['sitematrix'][key]['site'][site], + matrix_json['sitematrix'][key]) + return matrix + + +class SiteMatrix(object): + """ + Get and/or update the SiteMatrix (list of MediaWiki sites + with project name, database name and language name) via the api, + saving it to a cache file if requested. + If no filename is supplied in the config we will use only the api + to load and update. 
+ If a filename is supplied in the config we will load from it + initially and save to it after every update from the api. + """ + + def __init__(self, config, debugging): + """ + Constructor. Arguments: + config -- populated ArchiveUploaderConfig object + source_url -- url to the api.php script. For example:" + http://en.wikipedia.org/w/api.php + file_name -- full path to a cache file for the site matrix information + debugging -- list of debugging flags + * dont_save_file -- load from cache file but never update it + (used primarily for doing a dry run) + * verbose -- print out various progress messages + all other entries are ignored + """ + self.config = config + self.source_url = self.config.settings['api_url'] + "?action=sitematrix&format=json" + self.file_name = self.config.settings['site_matrix_file'] + self.debugging = debugging + self.matrix_json = None + if self.file_name and os.path.exists(self.file_name): + try: + self.matrix_json = self.load_matrix_json_from_file() + self.matrix = json.loads(self.matrix_json) + except Exception: + self.matrix_json = None + if self.matrix_json is None: + self.matrix_json = self.load_matrix_json_from_api() + self.matrix = json.loads(self.matrix_json) + if 'dont_save_file' not in self.debugging: + self.save_matrix_json_to_file() + + def update_matrix(self): + """Update the copy of the sitematrix in memory via the MW api. + Write the results to a cache file if requested/enabled.""" + new_matrix_json = self.load_matrix_json_from_api() + new_matrix = json.loads(new_matrix_json) + # We may wind up with wikis that have been renamed, or removed, so that + # the old name is no longer valid; it will take up space but otherwise + # is harmless, so ignore this case. 
+ self.matrix.update(new_matrix) + self.matrix_json = json.dumps(self.matrix, ensure_ascii=False) + if 'dont_save_file' not in self.debugging: + self.save_matrix_json_to_file() + + def load_matrix_json_from_api(self): + """ + Fetch the sitematrix information via the MW api. Get rid + of the extra columns and convert the rest to a dict for our use. + """ + api_matrix_json = self.load_api_matrix_json_from_api() + matrix = api_matrix_json_to_dict(api_matrix_json) + matrix_json = json.dumps(matrix, ensure_ascii=False) + return matrix_json + + def load_api_matrix_json_from_api(self): + """Fetch the sitematrix information via the MW api.""" + command = [self.config.settings['curl'], "--location", self.source_url] + + if 'verbose' in self.debugging: + command_string = " ".join(command) + print "about to run " + command_string + + proc = Popen(command, stdout=PIPE, stderr=PIPE) + output, error = proc.communicate() + if proc.returncode: + command_string = " ".join(command) + raise ArchiveUploaderError( + "command '" + command_string + + ("' failed with return code %s " % proc.returncode) + + " and error '" + error + "'") + return output + + def save_matrix_json_to_file(self): + """Write the site matrix information to a cache file + in json format.""" + if self.file_name: + outfile = codecs.open(self.file_name, "w", "UTF-8") + json.dump(self.matrix, outfile, ensure_ascii=False) + outfile.close() + + def load_matrix_json_from_file(self): + """Load the json-formatted site matrix information from a + cache file, returning it as a json string for the caller to parse.""" + if self.file_name and os.path.exists(self.file_name): + with codecs.open(self.file_name, "r", "UTF-8") as infile: + # hand back the raw json; __init__ runs json.loads on it + return infile.read() diff --git a/tools/archive.org/archivelib/uploader.py b/tools/archive.org/archivelib/uploader.py new file mode 100644 index 0000000..ca0dec1 --- /dev/null +++ b/tools/archive.org/archivelib/uploader.py @@ -0,0 
+1,451 @@ +import sys +import json +import xml.sax +import 
traceback +import re +import hashlib +from subprocess import Popen, PIPE +from archivelib.curlargs import get_location_curl_arg, get_rest_of_login_curl_args +from archivelib.curlargs import get_quiet_curl_arg, get_no_derive_curl_arg, get_head_req_curl_args +from archivelib.curlargs import get_head_with_output_curl_args +from archivelib.curlargs import get_item_creation_curl_args, get_ign_exist_bucket_curl_arg +from archivelib.urls import get_archive_base_s3_url, get_login_form_url +from archivelib.urls import get_archive_item_details_url, get_archive_item_url +from archivelib.urls import get_object_url, get_archive_item_status_url +from archivelib.html_utils import get_login_cookies, show_item_status_from_html +from archivelib.error import ArchiveUploaderError +from archivelib.sitematrix import SiteMatrix +from archivelib.xml_utils import ListObjectsCH, ListAllItemsCH + + +def show_command(command): + """Print the supplied command (a list consisting of a command + and any arguments) to stdout.""" + command_string = " ".join(command) + print "would run: command " + command_string + + +def get_etag_value(text): + # format: ETag: "8ea7c3551a74098b49fbfea49b1ee9e1" + lines = text.split('\n') + etag_expr = re.compile(r'^ETag:\s+"([abcdef0-9]+)"') + for line in lines: + etag_match = etag_expr.match(line) + if etag_match: + return etag_match.group(1) + return None + + +def get_md5sum_of_file(file_name): + summer = hashlib.md5() + infile = file(file_name, "rb") + # really? could this be bigger?? consider 20GB files. + bufsize = 4192 * 32 + inbuffer = infile.read(bufsize) + while inbuffer: + summer.update(inbuffer) + inbuffer = infile.read(bufsize) + infile.close() + return summer.hexdigest() + + +class ArchiveUploader(object): + """Use the archive.org s3 api to create and update items (buckets) + and to upload files (objects) into a bucket. Relies on curl.""" + + def __init__(self, config, archive_key, item_name, debugging): + """Constructor. 
Args: + config -- populated ArchiveUploaderConfig object + archive_key -- populated ArchiveKey object (contains access and secret keys) + item_name -- name of item to be created, updated, or uploaded into + debugging -- if 'verbose' is in this list, produce extra output; default False + if 'dryrun' is in this list don't actually do update/creation/upload, + show what would be run; default False + """ + self.config = config + self.archive_key = archive_key + self.item_name = item_name + self.debugging = debugging + if 'dryrun' in self.debugging: + self.debugging.append('dont_save_file') + self.matrix = None + self.db_name = self.item_name + if self.config.settings['item_name_format']: + self.item_name = self.config.settings['item_name_format'] % self.db_name + self.session_cookies = None + + def get_login_form_curl_args(self): + """Returns the arguments needed for auth to the archive.org S3 api""" + return ["--data-urlencode", "username=%s" % self.config.settings['username'], + "--data-urlencode", "password=%s" % self.config.settings['password'], + '--data-urlencode', 'referer=https://archive.org/', + '--data-urlencode', 'action=login', + '--data-urlencode', 'remember=CHECKED', + '--data-urlencode', 'submit=Log in'] + + def get_object_upload_curl_args(self, object_name, file_name): + """Returns the curl arguments needed for upload of a file S3-style: + the authentication header with accesskey and secret key, and + the url to the object (file) as an S3 url.""" + args = self.archive_key.get_s3_auth_curl_args() + args.extend(['--upload-file', file_name, get_object_url(object_name, self.item_name)]) + return args + + def get_cookie_curl_args(self): + return ["-b", self.session_cookies] + + def get_item_meta_header_args(self): + """ + Get the curl arguments needed to generate all the headers containing + metadata for objects (files) on archive.org. 
+ Sample headers for el wiktionary: + --header 'x-archive-meta-title:Wikimedia database dumps of el.wiktionary' + --header 'x-archive-meta-mediatype:web' + --header 'x-archive-meta-language:el (Modern Greek)' + --header 'x-archive-meta-description:Dumps of el.wiktionary \ + created by the Wikimedia Foundation and downloadable \ + from http://dumps.wikimedia.org' + --header 'x-archive-meta-format:xml and sql' + --header 'x-archive-meta-licenseurl:http://wikimediafoundation.org/wiki/Terms_of_Use' + --header 'x-archive-meta-subject:xml;dump;wikimedia;el;wiktionary'""" + headers = [ + '--header', + 'x-archive-meta-title:Wikimedia database dumps of %s' % self.db_name, + '--header', + 'x-archive-meta-mediatype:web', + '--header', + ('x-archive-meta-description:Dumps of %s created by %s and downloadable from %s' + % (self.db_name, self.config.settings['creator'], + self.config.settings['downloadurl'])), + '--header', + 'x-archive-meta-format:xml and sql', + '--header', + 'x-archive-meta-licenseurl:%s' % self.config.settings['license_url']] + + lang = self.get_lang() + if lang: + headers.extend([ + '--header', + 'x-archive-meta-language:%s (%s)' % (lang, self.get_local_lang_name()), + '--header', + ('x-archive-meta-subject:xml;dump;wikimedia;%s;%s' + % (lang, self.get_project()))]) + else: + headers.extend([ + '--header', + ('x-archive-meta-subject:xml,dump,wikimedia,%s' + % (self.get_project()))]) + return headers + + def do_curl_command(self, curl_command, get_output=False): + """Given a list containing the curl command with all its args, run it. + If get_output is True, return any output. 
+ Raises ArchiveUploaderError on error from curl.""" + if 'verbose' in self.debugging: + command_string = " ".join(curl_command) + print "about to run " + command_string + + try: + proc = Popen(curl_command, stdout=PIPE, stderr=PIPE) + except: + command_string = " ".join(curl_command) + exc_type, exc_value, exc_traceback = sys.exc_info() + print repr(traceback.format_exception(exc_type, exc_value, exc_traceback)) + raise ArchiveUploaderError("curl_command '" + command_string + "' failed'") + + output, error = proc.communicate() + if proc.returncode: + # curl has this annoying idea that when you specifically do a HEAD + # request it should return an error code anyways indicating a + # partial file transfer + if not (proc.returncode == 18 and 'HEAD' in curl_command): + command_string = " ".join(curl_command) + raise ArchiveUploaderError( + "curl_command '" + command_string + + ("' failed with return code %s " % proc.returncode) + + " and error '" + error + "'") + if 'verbose' in self.debugging: + print "Command successful." + if get_output: + if output: + print output + else: + print "No output returned." + + if get_output: + return output + + def upload_object(self, object_name, file_name): + """Upload an object (file) to the bucket (item). Args: + object_name -- name of the object as it will appear in the S3-style url + file_name -- path to file to be uploaded""" + + # note that someone could remove the item in between the + # time we check for one upload and the time we check for another + # upload, in the case of multiple uploads via this script. 
+ # we're not expecting to beat race conditions, just to warn + # the user if they try uploading to a bucket they never set up + exists = self.check_if_item_exists() + if exists != "200": + raise ArchiveUploaderError( + "No such item " + self.item_name + + " exists, http error code " + exists + ", giving up.") + curl_command = [self.config.settings['curl']] + curl_command.extend(get_location_curl_arg()) + curl_command.extend(get_no_derive_curl_arg()) + curl_command.extend(self.get_object_upload_curl_args(object_name, file_name)) + if 'dryrun' in self.debugging: + show_command(curl_command) + else: + self.do_curl_command(curl_command) + + def verify_object(self, object_name, file_name): + """Verify an object (file) in a given bucket (item) by checking etag + from server and md5sum of local file. Args: + object_name -- name of the object as it appears in the S3-style url + file_name -- path to corresponding local file""" + + exists = self.check_if_item_exists() + if exists != "200": + raise ArchiveUploaderError( + "No such item " + self.item_name + + " exists, http error code " + exists + ", giving up.") + curl_command = [self.config.settings['curl']] + curl_command.extend(get_location_curl_arg()) + curl_command.extend(get_head_with_output_curl_args()) + curl_command.append(get_object_url(object_name, self.item_name)) + if 'dryrun' in self.debugging: + show_command(curl_command) + else: + result = self.do_curl_command(curl_command, True) + md5sum_from_etag = get_etag_value(result) + if not md5sum_from_etag: + print "no Etag in server output, received:" + print result + sys.exit(1) + md5sum_from_local_file = get_md5sum_of_file(file_name) + if 'verbose' in self.debugging: + print "Etag: ", md5sum_from_etag, "md5 of local file: ", md5sum_from_local_file + if md5sum_from_etag == md5sum_from_local_file: + if 'verbose' in self.debugging: + print "File verified ok." 
+ else: + raise ArchiveUploaderError("File verification FAILED.") + + def check_if_item_exists(self): + """Check if the item (bucket) exists, returning the http status code + of a HEAD request for it as a string ("200" if the item exists).""" + curl_command = [self.config.settings['curl']] + curl_command.extend(get_location_curl_arg()) + curl_command.extend(get_head_req_curl_args()) + curl_command.append(get_archive_item_url(self.item_name)) + result = self.do_curl_command(curl_command, get_output=True) + return result + + # FIXME we should really check once to see if the project name + # is valid and then refuse to work on it otherwise, instead + # of scattering the retry throughout all these functions + def get_lang(self): + """Get the language code corresponding to the dbname + of the dump we are creating/uploading""" + if not self.matrix: + self.matrix = SiteMatrix(self.config, self.debugging) + + if self.db_name in self.matrix.matrix.keys(): + return self.matrix.matrix[self.db_name]['lang'] + self.matrix.update_matrix() + # one more try + if self.db_name in self.matrix.matrix.keys(): + return self.matrix.matrix[self.db_name]['lang'] + else: + return None + + def get_local_lang_name(self): + """From the dbname, get the translation of the name of the language + for the lang code of the dump we are creating/uploading. 
The translation + is into the content language of the site from which we retrieve + the sitematrix information; typically this should be English, since we are + uploading to archive.org and the description keywords used there + are generally English.""" + if not self.matrix: + self.matrix = SiteMatrix(self.config, self.debugging) + + if self.db_name in self.matrix.matrix.keys(): + return self.matrix.matrix[self.db_name]['locallangname'] + self.matrix.update_matrix() + # one more try + if self.db_name in self.matrix.matrix.keys(): + return self.matrix.matrix[self.db_name]['locallangname'] + else: + return None + + def get_project(self): + """From the dbname, get the project name of the dump we are + creating/uploading.""" + if not self.matrix: + self.matrix = SiteMatrix(self.config, self.debugging) + + if self.db_name in self.matrix.matrix.keys(): + return self.matrix.matrix[self.db_name]['project'] + self.matrix.update_matrix() + # one more try + if self.db_name in self.matrix.matrix.keys(): + return self.matrix.matrix[self.db_name]['project'] + else: + return None + + def update_item(self): + """Update an item (bucket); this entails a full update of the metadata. The + objects (files) it contains are not touched in any way.""" + self.create_item(True) + + def create_item(self, rewrite=False): + """Create an item (bucket) S3-style. 
Args:
+        rewrite -- if true, we are updating the metadata of an item that
+                   already exists; default false"""
+        curl_command = [self.config.settings['curl']]
+        curl_command.extend(get_location_curl_arg())
+        curl_command.extend(self.archive_key.get_s3_auth_curl_args())
+        if rewrite:
+            curl_command.extend(get_ign_exist_bucket_curl_arg())
+        else:
+            exists = self.check_if_item_exists()
+            if exists == "200":
+                raise ArchiveUploaderError("Item " + self.item_name + " already exists, giving up.")
+        curl_command.extend(self.get_item_meta_header_args())
+        curl_command.extend(get_item_creation_curl_args())
+        curl_command.append(get_archive_item_url(self.item_name))
+        if 'dryrun' in self.debugging:
+            show_command(curl_command)
+        else:
+            self.do_curl_command(curl_command)
+
+    def list_all_items(self):
+        """List all items for the user associated with the accesskey/secretkey."""
+        curl_command = [self.config.settings['curl']]
+        curl_command.extend(get_location_curl_arg())
+        curl_command.extend(self.archive_key.get_s3_auth_curl_args())
+        if 'verbose' not in self.debugging:
+            curl_command.extend(get_quiet_curl_arg())
+        curl_command.append(get_archive_base_s3_url())
+        if 'dryrun' in self.debugging:
+            show_command(curl_command)
+        else:
+            output = self.do_curl_command(curl_command, True)
+            if 'verbose' in self.debugging:
+                print "About to parse output (list items)"
+            xml.sax.parseString(output, ListAllItemsCH())
+
+    def list_objects(self):
+        """List all objects (files) contained in a specific item (bucket)."""
+        curl_command = [self.config.settings['curl']]
+        curl_command.extend(get_location_curl_arg())
+        curl_command.extend(self.archive_key.get_s3_auth_curl_args())
+        if 'verbose' not in self.debugging:
+            curl_command.extend(get_quiet_curl_arg())
+        curl_command.append(get_archive_item_url(self.item_name))
+        if 'dryrun' in self.debugging:
+            show_command(curl_command)
+        else:
+            output = self.do_curl_command(curl_command, True)
+            if 'verbose' in self.debugging:
+                print "About to parse
output (list objects)"
+            xml.sax.parseString(output, ListObjectsCH())
+
+    def show_item(self):
+        """Show metadata associated with a particular item (bucket)."""
+        curl_command = [self.config.settings['curl']]
+        curl_command.extend(get_location_curl_arg())
+        if 'verbose' not in self.debugging:
+            curl_command.extend(get_quiet_curl_arg())
+        curl_command.append(get_archive_item_details_url(self.item_name))
+        if 'dryrun' in self.debugging:
+            show_command(curl_command)
+        else:
+            output = self.do_curl_command(curl_command, True)
+            if 'verbose' in self.debugging:
+                print "About to parse output (show item)"
+            self.show_item_metadata_from_json(output)
+
+    def show_item_metadata_from_json(self, json_string):
+        """
+        Grab the metadata for an item from the json for the item details
+        (contains lost of other cruft) and display to stdout
+        Sample output:
+
+        "metadata":{
+            "identifier":["elwiktionary-dumps"],
+            "description":["Dumps of el.wiktionary created by the \
+            Wikimedia Foundation and downloadable from \
+            http://dumps.wikimedia.org/elwiktionary/"],
+            "language":["el (Modern Greek)"],
+            "licenseurl":["http://wikimediafoundation.org/wiki/Terms_of_Use"],
+            "mediatype":["web"],
+            "subject":["xml,dump,wikimedia,el,wiktionary"],
+            "title":["Wikimedia database dumps of el.wiktionary, format:xml and sql"],
+            "publicdate":["2012-02-17 11:03:45"],
+            "collection":["opensource"],
+            "addeddate":["2012-02-17 11:03:45"]
+        }
+        """
+        details = json.loads(json_string)
+        if 'metadata' in details.keys():
+            print "Item metadata for", self.item_name
+            for key in details['metadata'].keys():
+                print "%s:" % key,
+                print " | ".join(details['metadata'][key])
+        else:
+            print "No metadata for", self.item_name, "is available."
+
+    def log_in(self):
+        if not self.session_cookies:
+            curl_command = [self.config.settings['curl']]
+            curl_command.extend(self.get_login_form_curl_args())
+            curl_command.extend(get_rest_of_login_curl_args())
+            curl_command.append(get_login_form_url())
+            if 'dryrun' in self.debugging:
+                show_command(curl_command)
+            else:
+                output = self.do_curl_command(curl_command, True)
+                if 'verbose' in self.debugging:
+                    print "About to dig cookie out of login response:"
+                    print output
+                self.session_cookies = get_login_cookies(output)
+                if not self.session_cookies:
+                    raise ArchiveUploaderError("Login failed.")
+
+    def show_item_status(self):
+        """Show the status of an item (bucket): which objects (files) are waiting
+        on further action from archive.org."""
+        self.log_in()
+        curl_command = [self.config.settings['curl']]
+        curl_command.extend(get_location_curl_arg())
+        if 'verbose' not in self.debugging:
+            curl_command.extend(get_quiet_curl_arg())
+        curl_command.extend(self.get_cookie_curl_args())
+        curl_command.append(get_archive_item_status_url(self.item_name))
+        if 'dryrun' in self.debugging:
+            show_command(curl_command)
+        else:
+            output = self.do_curl_command(curl_command, True)
+            show_item_status_from_html(output)
+
+
+class ArchiveKey(object):
+    """Authentication to the archive.org api, S3-style."""
+
+    def __init__(self, config):
+        """Constructor.
Args:
+        config -- a populated ArchiveUploaderConfig object."""
+        self.config = config
+        self.access_key = self.config.settings['access_key']
+        self.secret_key = self.config.settings['secret_key']
+
+    def get_auth_header(self):
+        """Returns the http header needed for authentication to the archive.org
+        api."""
+        return "authorization: LOW %s:%s" % (self.access_key, self.secret_key)
+
+    def get_s3_auth_curl_args(self):
+        """Returns the arguments needed for auth to the archive.org S3 api"""
+        return ["--header", self.get_auth_header()]
diff --git a/tools/archive.org/archivelib/urls.py b/tools/archive.org/archivelib/urls.py
new file mode 100644
index 0000000..0438c76
--- /dev/null
+++ b/tools/archive.org/archivelib/urls.py
@@ -0,0 +1,36 @@
+def get_archive_base_s3_url():
+    """Returns location of the base url for archive.org S3 requests"""
+    return "http://s3.us.archive.org/"
+
+
+def get_archive_base_url():
+    """Returns location of the base url for regular archive.org requests"""
+    return "https://archive.org/"
+
+
+def get_login_form_url():
+    """Returns the url of the login form for archive.org"""
+    return "%saccount/login.php" % get_archive_base_url()
+
+
+def get_archive_item_details_url(item_name):
+    """Returns location of item details, sadly as a regular url, but
+    happily with json output."""
+    return "%sdetails/%s?output=json" % (get_archive_base_url(), item_name)
+
+
+def get_archive_item_url(item_name):
+    """Returns location of the item as an S3-style url"""
+    return "%s%s" % (get_archive_base_s3_url(), item_name)
+
+
+def get_object_url(object_name, item_name):
+    """Returns the curl arguments needed for the url of an object (file) S3-style"""
+    return "%s/%s" % (get_archive_item_url(item_name), object_name)
+
+
+def get_archive_item_status_url(item_name):
+    """Returns the url of the status of an item (whether there are
+    any related things in the job queue) for archive.org"""
+    return ('%scatalog.php?history=1&identifier=%s'
+            % (get_archive_base_url(),
item_name))
diff --git a/tools/archive.org/archivelib/xml_utils.py b/tools/archive.org/archivelib/xml_utils.py
new file mode 100644
index 0000000..d99549b
--- /dev/null
+++ b/tools/archive.org/archivelib/xml_utils.py
@@ -0,0 +1,154 @@
+import xml.sax
+
+
+class ListObjectsCH(xml.sax.ContentHandler):
+    """
+    Read contents from a request to list all objects (files)
+    in a given item (bucket)
+
+    Sample output:
+    <?xml version='1.0' encoding='UTF-8'?>
+    <ListBucketResult>
+      <Name>elwiktionary-dumps</Name>
+      <Contents>
+        <Key>elwiktionary-20060703.tar</Key>
+        <LastModified>2012-02-17T11:22:21.000Z</LastModified>
+        <ETag>2012-02-17T11:22:21.000Z</ETag>
+        <Size>10076160</Size>
+        <StorageClass>STANDARD</StorageClass>
+        <Owner>
+          <ID>OpaqueIDStringGoesHere</ID>
+          <DisplayName>Readable ID Goes Here</DisplayName>
+        </Owner>
+      </Contents>
+    </ListBucketResult>
+    """
+
+    NONE = 0x0
+    CONTENTS = 0x1
+    KEY = 0x2
+    LASTMODIFIED = 0x3
+    SIZE = 0x4
+
+    def __init__(self):
+        xml.sax.ContentHandler.__init__(self)
+        self.key = ""
+        self.last_modified = ""
+        self.size = ""
+        self.state = ListObjectsCH.NONE
+        self.item_name = ""
+        self.item_creation_date = ""
+
+    def startElement(self, name, attrs):
+        if name == "Contents":
+            self.state = ListObjectsCH.CONTENTS
+        elif name == "Key" and self.state == ListObjectsCH.CONTENTS:
+            self.state = ListObjectsCH.KEY
+        elif name == "LastModified" and self.state == ListObjectsCH.CONTENTS:
+            self.state = ListObjectsCH.LASTMODIFIED
+        elif name == "Size" and self.state == ListObjectsCH.CONTENTS:
+            self.state = ListObjectsCH.SIZE
+
+    def endElement(self, name):
+        if name == "Contents":
+            self.state = ListObjectsCH.NONE
+            # FIXME really, a print? Do better.
+            print ("Object: %s, last modified: %s, size: %s"
+                   % (self.key, self.last_modified, self.size))
+            self.item_name = ""
+            self.item_creation_date = ""
+
+        elif name == "Key" and self.state == ListObjectsCH.KEY:
+            self.state = ListObjectsCH.CONTENTS
+        elif name == "LastModified" and self.state == ListObjectsCH.LASTMODIFIED:
+            self.state = ListObjectsCH.CONTENTS
+        elif name == "Size" and self.state == ListObjectsCH.SIZE:
+            self.state = ListObjectsCH.CONTENTS
+
+    def characters(self, content):
+        if self.state == ListObjectsCH.KEY:
+            self.key = content
+        elif self.state == ListObjectsCH.LASTMODIFIED:
+            self.last_modified = content
+        elif self.state == ListObjectsCH.SIZE:
+            self.size = content
+
+
+class ListAllItemsCH(xml.sax.ContentHandler):
+    """
+    Read contents from a request to list all items (buckets)
+
+    Sample output:
+
+    <?xml version='1.0' encoding='UTF-8'?>
+    <ListAllMyBucketsResult>
+      <Owner>
+        <ID>OpaqueIDStringGoesHere</ID>
+        <DisplayName>atglenn</DisplayName>
+      </Owner>
+      <Buckets>
+        <Bucket>
+          <Name>elwiktionary-dumps</Name>
+          <CreationDate>1970-01-01T00:00:00.000Z</CreationDate>
+        </Bucket>
+      </Buckets>
+    </ListAllMyBucketsResult>
+    """
+
+    NONE = 0x0
+    BUCKET = 0x1
+    NAME = 0x2
+    CREATIONDATE = 0x3
+
+    def __init__(self):
+        xml.sax.ContentHandler.__init__(self)
+        self.item_name = ""
+        self.item_creation_date = ""
+        self.state = ListAllItemsCH.NONE
+
+    def startElement(self, name, attrs):
+        if name == "Bucket":
+            self.state = ListAllItemsCH.BUCKET
+        elif name == "Name" and self.state == ListAllItemsCH.BUCKET:
+            self.state = ListAllItemsCH.NAME
+        elif name == "CreationDate" and self.state == ListAllItemsCH.BUCKET:
+            self.state = ListAllItemsCH.CREATIONDATE
+
+    def endElement(self, name):
+        if name == "Bucket":
+            self.state = ListAllItemsCH.NONE
+            # FIXME really, a print? Do better.
+            print "Item: %s, created: %s" % (self.item_name, self.item_creation_date)
+            self.item_name = ""
+            self.item_creation_date = ""
+
+        elif name == "Name" and self.state == ListAllItemsCH.NAME:
+            self.state = ListAllItemsCH.BUCKET
+        elif name == "CreationDate" and self.state == ListAllItemsCH.CREATIONDATE:
+            self.state = ListAllItemsCH.BUCKET
+
+    def characters(self, content):
+        if self.state == ListAllItemsCH.NAME:
+            self.item_name = content
+        elif self.state == ListAllItemsCH.CREATIONDATE:
+            self.item_creation_date = content
+
+
+class ArchiveKey(object):
+    """Authentication to the archive.org api, S3-style."""
+
+    def __init__(self, config):
+        """Constructor. Args:
+        config -- a populated ArchiveUploaderConfig object."""
+        self.config = config
+        self.access_key = self.config.settings['access_key']
+        self.secret_key = self.config.settings['secret_key']
+
+    def get_auth_header(self):
+        """Returns the http header needed for authentication to the archive.org
+        api."""
+        return "authorization: LOW %s:%s" % (self.access_key, self.secret_key)
+
+    def get_s3_auth_curl_args(self):
+        """Returns the arguments needed for auth to the archive.org S3 api"""
+        return ["--header", self.get_auth_header()]
diff --git a/tools/archive.org/archiveuploader.py b/tools/archive.org/archiveuploader.py
index f3d7c72..681df18 100644
--- a/tools/archive.org/archiveuploader.py
+++ b/tools/archive.org/archiveuploader.py
@@ -1,17 +1,11 @@
 import getopt
 import sys
-import json
-import xml.sax
-import os
-import codecs
 import traceback
-import re
-import hashlib
-import subprocess
-from subprocess import Popen, PIPE
-import ConfigParser
+from archivelib.config import ArchiveUploaderConfig
+from archivelib.uploader import ArchiveUploader, ArchiveKey

-# todo: 
+
+# todo:
 # progress bar for large file uploads, or other way the user can figure
 # out how much of the file upload has been done.
 # support multipart uploads for the really huge files
@@ -20,1019 +14,181 @@
 # than log in via icky old web interface and screen scraping
 # md5sum or sha1 of uploaded object??

-class ArchiveUploaderConfig(object):
-    """Read contents of config file, if any.
-    If no filename is provided, the default name 'archiveuploader.conf' will
-    be checked. If it is not present, the files /etc/archiveuploader.conf and
-    .archiveuploader.conf will be checked, in that order."""
-    def __init__(self, configFile=False):
-        """Constructor. Args:
-        configFile -- path to configuration file. If not passed,
-                      the default 'archiveuploader.conf' will be checked."""
-
-        self.projectName = False
-
-        home = os.path.dirname(sys.argv[0])
-        if (not configFile):
-            configFile = "archiveuploader.conf"
-        self.files = [
-            os.path.join(home,configFile),
-            "/etc/archiveuploader.conf",
-            os.path.join(os.getenv("HOME"), ".archiveuploader.conf")]
-        defaults = {
-            #"auth": {
-            "accesskey": "",
-            "secretkey": "",
-            "username": "",
-            "password": "",
-            #"output": {
-            "sitematrixfile": "",
-            #"web": {
-            "apiurl": "http://en.wikipedia.org/w/api.php",
-            "curl" : "/usr/bin/curl",
-            "itemnameformat" : "%s",
-            "licenseurl" : "http://wikimediafoundation.org/wiki/Terms_of_Use",
-            "creator" : "the Wikimedia Foundation",
-            "downloadurl" : "http://dumps.wikimedia.org"
-            }
-
-        self.conf = ConfigParser.SafeConfigParser(defaults)
-        self.conf.read(self.files)
-        self.parseConfFile()
-
-    def parseConfFile(self):
-        """Get contents of config file, using new values to overwrite
-        corresponding defaults."""
-        self.accessKey = self.conf.get("auth", "accesskey")
-        self.secretKey = self.conf.get("auth", "secretkey")
-        self.username = self.conf.get("auth", "username")
-        self.password = self.conf.get("auth", "password")
-        self.siteMatrixFile = self.conf.get("output", "sitematrixfile")
-        self.apiUrl = self.conf.get("web", "apiurl")
-        self.curl = self.conf.get("web", "curl")
-        self.itemNameFormat = self.conf.get("web", "itemnameformat")
-
self.licenseUrl = self.conf.get("web", "licenseurl")
-        self.creator = self.conf.get("web", "creator")
-        self.downloadurl = self.conf.get("web", "downloadurl")
-
-class SiteMatrix(object):
-    """Get and/or update the SiteMatrix (list of MediaWiki sites
-    with projct name, database name and language name) via the api,
-    saving it to a cache file if requested.
-    If no filename is supplied in the config we will use only the api
-    to load and update.
-    If a filename is supplied in the config we will load from it
-    initially and save to it after every update from the api."""
-
-    def __init__(self, config, dontSaveFile = False, verbose = False):
-        """Constructor. Arguments:
-        config -- populated ArchiveUploaderConfig object
-        sourceUrl -- url to the api.php script. For example:"
-                     http://en.wikipedia.org/w/api.php
-        fileName -- full path to a cache file for the site matrix information
-        dontSaveFile -- load form cache file but never update it (used primarily for
-                        doing a dry run)"""
-        self.config = config
-        self.sourceUrl = self.config.apiUrl + "?action=sitematrix&format=json"
-        self.fileName = self.config.siteMatrixFile
-        self.dontSaveFile = dontSaveFile
-        self.verbose = verbose
-        self.matrixJson = None
-        if self.fileName and os.path.exists(self.fileName):
-            try:
-                self.matrixJson = self.loadMatrixJsonFromFile()
-                self.matrix = json.loads(self.matrixJson)
-            except:
-                self.matrixJson = None
-        if self.matrixJson == None:
-            self.matrixJson = self.loadMatrixJsonFromApi()
-            self.matrix = json.loads(self.matrixJson)
-            if not self.dontSaveFile:
-                self.saveMatrixJsonToFile()
-
-    def updateMatrix(self):
-        """Update the copy of the sitematrix in memory via the MW api.
-        Write the results to a cache file if requested/enabled."""
-        newMatrixJson = self.loadMatrixJsonFromApi()
-        newMatrix = json.loads(newMatrixJson)
-        # We may wind up with wikis that have been renamed, or removed, so that
-        # the old name is no longer valid; it will take up space but otherwise
-        # is harmless, so ignore this case.
-        self.matrix = self.matrix.update(newMatrix)
-        self.matrixJson = json.dumps(self.matrix, ensure_ascii = False)
-        if not self.dontSaveFile:
-            self.saveMatrixJsonToFile()
-
-    def loadMatrixJsonFromApi(self):
-        """Fetch the sitematrix information via the MW api. Get rid
-        of the extra columns and convert the rest to a dict for our use."""
-        apiMatrixJson = self.loadApiMatrixJsonFromApi()
-        matrix = self.apiMatrixJsonToDict(apiMatrixJson)
-        matrixJson = json.dumps(matrix, ensure_ascii = False)
-        return matrixJson
-
-    def loadApiMatrixJsonFromApi(self):
-        """Fetch the sitematrix information via the MW api."""
-        command = [ self.config.curl, "--location", self.sourceUrl ]
-
-        if self.verbose:
-            commandString = " ".join(command)
-            print "about to run " + commandString
-
-        proc = Popen(command, stdout = PIPE, stderr = PIPE)
-        output, error = proc.communicate()
-        if proc.returncode:
-            commandString = " ".join(command)
-            raise ArchiveUploaderError("command '" + commandString + ( "' failed with return code %s " % proc.returncode ) + " and error '" + error + "'")
-        return output
-
-    def apiMatrixJsonToDict(self, jsonString):
-        """Convert the sitematrix json string to a dict for our use,
-        keeping only the information we want: dbname, project name, lang code."""
-        matrixJson = json.loads(jsonString)
-        matrix = {}
-        #{ u'localname': u'Aromanian',
-        # u'code': u'roa-rup',
-        # u'name': u'Arm\xe3neashce',
-        # u'site':
-        # [ {u'url': u'http://roa-rup.wikipedia.org', u'code': u'wiki', u'dbname': u'roa_rupwiki'},
-        # {u'url': u'http://roa-rup.wiktionary.org', u'code': u'wiktionary', u'dbname': u'roa_rupwiktionary'}
-        # ] }
-        for k in
matrixJson['sitematrix'].keys():
-            if k == 'count':
-                continue
-            if k == 'specials':
-                for s in range(0,len((matrixJson['sitematrix'][k]))):
-                    sitename = matrixJson['sitematrix'][k][s]['dbname']
-                    matrix[sitename] = {}
-                    matrix[sitename]['project'] = matrixJson['sitematrix'][k][s]['code']
-                    # special hack
-                    if matrix[sitename]['project'] == 'wiki':
-                        matrix[sitename]['project'] = 'wikipedia'
-                    matrix[sitename]['locallangname'] = None
-                    matrix[sitename]['lang'] = None
-            else:
-                for s in range(0,len((matrixJson['sitematrix'][k]['site']))):
-                    sitename = matrixJson['sitematrix'][k]['site'][s]['dbname']
-                    matrix[sitename] = {}
-                    matrix[sitename]['project'] = matrixJson['sitematrix'][k]['site'][s]['code']
-                    # special hack
-                    if matrix[sitename]['project'] == 'wiki':
-                        matrix[sitename]['project'] = 'wikipedia'
-                    matrix[sitename]['locallangname'] = matrixJson['sitematrix'][k]['localname']
-                    matrix[sitename]['lang'] = matrixJson['sitematrix'][k]['code']
-        return matrix
-
-    def saveMatrixJsonToFile(self):
-        """Write the site matrix information to a cache file
-        in json format."""
-        if self.fileName:
-            outfile = codecs.open(self.fileName,"w","UTF-8")
-            json.dump(self.matrix, outfile, ensure_ascii = False)
-            outfile.close()
-
-    def loadMatrixJsonFromFile(self):
-        """Load the json-formatted site matrix information from a
-        cache file, converting it to a dict for our use."""
-        if self.fileName and os.path.exists(self.fileName):
-            infile = open(self.fileName,"r")
-            self.matrixJson= json.load(infile)
-            infile.close()
-
-class ArchiveUploaderError(Exception):
-    """Exception class for the Archive Uploader and all of
-    its related classes. Doesn't do much :-P"""
-    pass
-
-class ArchiveUploader(object):
-    """Use the archive.org s3 api to create and update items (buckets)
-    and to upload files (objects) into a bucket. Relies on curl."""
-
-    def __init__(self, config, archiveKey, itemName, verbose = False, dryrun = False, getmatrix = False):
-        """Constructor.
Args:
-        config -- populated ArchiveUploadedConfig object
-        archiveKey -- populated ArchiveKey object (contains access and secret keys)
-        itemName -- name of item tp be created, updated, or uploaded into
-        verbose -- if set, produce extra output; default False
-        dryrun -- if set, don't actually do update/creation/upload, show what
-                  would be run; default False
-        """
-        self.config = config
-        self.archiveKey = archiveKey
-        self.itemName = itemName
-        self.verbose = verbose
-        self.dryrun = dryrun
-        self.existence = False # will hold return code of an existence check via curl, on demand
-        if self.dryrun:
-            self.dontSaveFile = True
-        else:
-            self.dontSaveFile = False
-        self.matrix = None
-        self.dbName = self.itemName
-        if self.config.itemNameFormat:
-            self.itemName = self.config.itemNameFormat % self.dbName
-        self.sessionCookies = None
-
-    def getArchiveBaseS3Url(self):
-        """Returns location of the base url for archive.org S3 requests"""
-        return "http://s3.us.archive.org/"
-
-    def getArchiveBaseUrl(self):
-        """Returns location of the base url for regular archive.org requests"""
-        return "http://www.archive.org/"
-
-    def getArchiveItemUrl(self):
-        """Returns location of the item as an S3-style url"""
-        return "%s%s" % (self.getArchiveBaseS3Url(), self.itemName)
-
-    def getArchiveItemDetailsUrl(self):
-        """Returns location of item details, sadly as a regular url, but
-        happily with json output."""
-        return "%sdetails/%s?output=json" % (self.getArchiveBaseUrl(), self.itemName)
-
-    def getObjectUrl(self, objectName, fileName):
-        """Returns the curl arguments needed for the url of an object (file) S3-style"""
-        return "%s/%s" % ( self.getArchiveItemUrl(), objectName )
-
-    def getLoginFormUrl(self):
-        """Returns the url of the login form for archive.org"""
-        return "%saccount/login.php"% self.getArchiveBaseUrl()
-
-    def getArchiveItemStatusUrl(self):
-        """Returns the url of the status of an item (whether there are
-        any related things in the job queue) for archive.org"""
-
return '%scatalog.php?history=1&identifier=%s' % ( self.getArchiveBaseUrl(), self.itemName )
-
-    def getLocationCurlArg(self):
-        """Returns the argument that causes curl to follow all redirects"""
-        return [ "--location" ]
-
-    def getS3AuthCurlArgs(self):
-        """Returns the arguments needed for auth to the archive.org S3 api"""
-        return [ "--header", self.archiveKey.getAuthHeader() ]
-
-    def getLoginFormCurlArgs(self):
-        """Returns the arguments needed for auth to the archive.org S3 api"""
-        return [ "--data-urlencode", "username=%s" % self.config.username, "--data-urlencode",
-                 "password=%s" % self.config.password,
-                 '--data-urlencode', 'remember=CHECKED', '--data-urlencode', 'submit=Log in' ]
-
-    def getRestOfIckyLoginCurlArgs(self):
-        """Returns some bizarre arguments needed for archive.org login
-        partly cause we want to get thecookies without all the html, partly
-        cause login failes without this 'test cookie', how is that possible? >_<"""
-        return [ '-s', '-c', '-', '-b', 'test-cookie=1', '-o', '/dev/null' ]
-
-    def getObjectUploadCurlArgs(self, objectName, fileName):
-        """Returns the curl arguments needed for upload of a file S3-style:
-        the authentication header with accesskey and secret key, and
-        the url to the object (file) as an S3 url."""
-        args = self.getS3AuthCurlArgs()
-        args.extend( ['--upload-file', fileName, self.getObjectUrl() ] )
-        return args
-
-    def getQuietCurlArg(self):
-        return [ "-s" ]
-
-    def getCookieCurlArgs(self):
-        return [ "-b" , self.sessionCookies ]
-
-    def getNoDeriveCurlArg(self):
-        """This tag tells archive.org not to try to derive a bunch of other
-        formats from this file (which it would do for videos, for example).
-        We've been requested to add this since our files have no derivative
-        formats."""
-        return [ "--header", "x-archive-queue-derive:0" ]
-
-    def getHeadReqCurlArgs(self):
-        """Returns the curl arguments needed to do head request and write
-        out just the http return code"""
-        args = self.getQuietCurlArg()
-        args.extend([ "--write-out", "%{http_code}", "-X", "HEAD" ] )
-        return args
-
-    def getHeadWithOutputCurlArgs(self):
-        """Returns the curl arguments needed to do head request and write
-        out everything"""
-        return [ "--head" ]
-
-    def getShowHeadersCurlArg(self):
-        """Returns the curl argument needed to do a normal (post or
-        get) request and show the headers along with the output"""
-        return [ "--include" ]
-
-    def getItemCreationCurlArgs(self):
-        """Returns the curl arguments needed to put an empty file;
-        this is used for updating or creating an item (bucket)."""
-        return [ "-X", "PUT", "--header", "Content-Length: 0" ]
-
-    def getItemMetaHeaderArgs(self):
-        """Get the curl arguments needed to generate all the headers containing
-        metadata for objects (files) on archive.org.
-        Sample headers for el wiktionary:
-        --header 'x-archive-meta-title:Wikimedia database dumps of el.wiktionary'
-        --header 'x-archive-meta-mediatype:web'
-        --header 'x-archive-meta-language:el (Modern Greek)'
-        --header 'x-archive-meta-description:Dumps of el.wiktionary created by the Wikimedia Foundation and downloadable from http://dumps.wikimedia.org'
-        --header 'x-archive-meta-format:xml and sql'
-        --header 'x-archive-meta-licenseurl:http://wikimediafoundation.org/wiki/Terms_of_Use'
-        --header 'x-archive-meta-subject:xml,dump,wikimedia,el,wiktionary'"""
-        headers = [ '--header', 'x-archive-meta-title:Wikimedia database dumps of %s' % self.dbName,
-                    '--header', 'x-archive-meta-mediatype:web',
-                    '--header', 'x-archive-meta-description:Dumps of %s created by %s and downloadable from %s' % (self.dbName, self.config.creator, self.config.downloadurl),
-                    '--header', 'x-archive-meta-format:xml and sql',
-                    '--header', 'x-archive-meta-licenseurl:%s' % self.config.licenseUrl ]
-
-        lang = self.getLang()
-        if lang:
-            headers.extend([ '--header', 'x-archive-meta-language:%s (%s)' % ( lang, self.getLocalLangName() ),
-                             '--header', 'x-archive-meta-subject:xml,dump,wikimedia,%s,%s' %( lang, self.getProject() ) ])
-        else:
-            headers.extend([ '--header', 'x-archive-meta-subject:xml,dump,wikimedia,%s' %( self.getProject() ) ])
-        return headers
-
-    def getIgnoreExistingBucketCurlArg(self):
-        """Return the curl argument required for overwriting the metadata
-        of an existing item (bucket)."""
-        return [ "--header", "x-archive-ignore-preexisting-bucket:1" ]
-
-    def doCurlCommand(self, curlCommand, getOutput=False):
-        """Given a list containing curl command with all the args and run it.
-        If getOutput is True, return any output.
-        Raises ArchiveUploaderError on error fron curl."""
-        if self.verbose:
-            commandString = " ".join(curlCommand)
-            print "about to run " + commandString
-
-        try:
-            proc = Popen(curlCommand, stdout = PIPE, stderr = PIPE)
-        except:
-            commandString = " ".join(curlCommand)
-            exc_type, exc_value, exc_traceback = sys.exc_info()
-            print repr(traceback.format_exception(exc_type, exc_value, exc_traceback))
-            raise ArchiveUploaderError("curlCommand '" + commandString + "' failed'" )
-
-        output, error = proc.communicate()
-        if proc.returncode:
-            # curl has this annoying idea that when you specifically do a HEAD
-            # request it should return an error code anyways indicating a
-            # partial file transfer
-            if not (proc.returncode == 18 and 'HEAD' in curlCommand):
-                commandString = " ".join(curlCommand)
-                raise ArchiveUploaderError("curlCommand '" + commandString + ( "' failed with return code %s " % proc.returncode ) + " and error '" + error + "'")
-        if verbose:
-            print "Command successful."
-            if getOutput:
-                if output:
-                    print output
-                else:
-                    print "No output returned."
-
-        if getOutput:
-            return output
-
-    def showCommand(self,command):
-        """Print the supplied command (a list consisting of a command
-        and any arguments) to stdout."""
-        commandString = " ".join(command)
-        print "would run: command " + commandString
-
-    def uploadObject(self, objectName, fileName):
-        """Upload an object (file) to the bucket (item). Args:
-        objectName -- name of the object as it will appear in the S3-style url
-        fileName -- path to file to be uploaded"""
-
-        # note that someone could remove the item in between the
-        # time we check for one upload and the time we check for another
-        # upload, in the case of multiple uploads via this script.
-        # we're not expecting to beat race conditions, just to warn
-        # the user if they try uploading to a bucket they never set up
-        self.checkIfItemExists()
-        if self.existence != "200":
-            raise ArchiveUploaderError("No such item " + self.itemName + " exists, http error code " + self.existence + ", giving up.")
-        curlCommand = [ self.config.curl ];
-        curlCommand.extend(self.getLocationCurlArg())
-        curlCommand.extend(self.getNoDeriveCurlArg())
-        curlCommand.extend(self.getObjectUploadCurlArgs(objectName, fileName))
-        if (self.dryrun):
-            self.showCommand(curlCommand)
-        else:
-            self.doCurlCommand(curlCommand)
-
-    def verifyObject(self, objectName, fileName):
-        """Verify an object (file) in a given bucket (item) by checking etag
-        from server and md5sum of local file. Args:
-        objectName -- name of the object as it appears in the S3-style url
-        fileName -- path to corresponding local file"""
-
-        self.checkIfItemExists()
-        if self.existence != "200":
-            raise ArchiveUploaderError("No such item " + self.itemName + " exists, http error code " + self.existence + ", giving up.")
-        curlCommand = [ self.config.curl ];
-        curlCommand.extend(self.getLocationCurlArg())
-        curlCommand.extend(self.getHeadWithOutputCurlArgs())
-        curlCommand.append(self.getObjectUrl(objectName, fileName))
-        if (self.dryrun):
-            self.showCommand(curlCommand)
-        else:
-            result = self.doCurlCommand(curlCommand, True)
-            md5sumFromEtag = self.getEtagValue(result)
-            if not md5sumFromEtag:
-                print "no Etag in server output, received:"
-                print result
-                sys.exit(1)
-            md5sumFromLocalFile = self.getMd5sumOfFile(fileName)
-            if verbose:
-                print "Etag: ", md5sumFromEtag, "md5 of local file: ", md5sumFromLocalFile
-            if md5sumFromEtag == md5sumFromLocalFile:
-                if verbose:
-                    print "File verified ok."
-            else:
-                raise ArchiveUploaderError("File verification FAILED.")
-
-    def getEtagValue(self, text):
-        # format: ETag: "8ea7c3551a74098b49fbfea49b1ee9e1"
-        lines = text.split('\n')
-        etagExpr = re.compile('^ETag:\s+"([abcdef0-9]+)"')
-        for l in lines:
-            etagMatch = etagExpr.match(l)
-            if etagMatch:
-                return etagMatch.group(1)
-        return None
-
-    def getMd5sumOfFile(self, fileName):
-        summer = hashlib.md5()
-        infile = file(fileName, "rb")
-        # really? could this be bigger?? consider 20GB files.
-        bufsize = 4192 * 32
-        buffer = infile.read(bufsize)
-        while buffer:
-            summer.update(buffer)
-            buffer = infile.read(bufsize)
-        infile.close()
-        return summer.hexdigest()
-
-    def checkIfItemExists(self):
-        """Check it the item (bucket) exists, returning True if it exists
-        and False otherwise."""
-        if not self.existence:
-            curlCommand = [ self.config.curl ];
-            curlCommand.extend(self.getLocationCurlArg())
-            curlCommand.extend(self.getHeadReqCurlArgs())
-            curlCommand.append(self.getArchiveItemUrl())
-            result = self.doCurlCommand(curlCommand,getOutput=True)
-            self.existence = result
-
-    # FIXME we should really check once to see if the project name
-    # is valid and then refuse to work on it otherwise, instead
-    # of scattering the retry throughout all these functions
-    def getLang(self):
-        """Get the language code corresponding to the dbname
-        of the dump we are creating/uploading"""
-        if not self.matrix:
-            self.matrix = SiteMatrix(self.config, self.dontSaveFile, self.verbose)
-
-        if self.dbName in self.matrix.matrix.keys():
-            return self.matrix.matrix[self.dbName]['lang']
-        self.matrix.updateMatrix()
-        # one more try
-        if self.dbName in self.matrix.matrix.keys():
-            return self.matrix.matrix[self.dbName]['lang']
-        else:
-            return None
-
-    def getLocalLangName(self):
-        """From the dbname, get the translation of the name of the language
-        for the lang code of the dump we are creating/uploading.
The translation - is into the content language of the site from which we retrieve - the sitematrix information; typically this should be English, since we are - uploading to archive.org and the description keywords used there - are generally English.""" - if not self.matrix: - self.matrix = SiteMatrix(self.config, self.dontSaveFile, self.verbose) - - if self.dbName in self.matrix.matrix.keys(): - return self.matrix.matrix[self.dbName]['locallangname'] - self.matrix.updateMatrix() - # one more try - if self.dbName in self.matrix.matrix.keys(): - return self.matrix.matrix[self.dbName]['locallangname'] - else: - return None - - def getProject(self): - """From the dbname, get the project name of the dump we are - creating/uploading.""" - if not self.matrix: - self.matrix = SiteMatrix(self.config, self.dontSaveFile, self.verbose) - - if self.dbName in self.matrix.matrix.keys(): - return self.matrix.matrix[self.dbName]['project'] - self.matrix.updateMatrix() - # one more try - if self.dbName in self.matrix.matrix.keys(): - return self.matrix.matrix[self.dbName]['project'] - else: - return None - - def updateItem(self): - """Update an item (bucket); this entails a full update of the metadata. The - objects (files) it contains are not touched in any way.""" - self.createItem(True) - - def createItem(self, rewrite = False): - """Create an item (bucket) S3-style. 
Args: - rewrite -- if true, we are updating the metadata of an item that - already exists; default false""" - curlCommand = [ self.config.curl ]; - curlCommand.extend(self.getLocationCurlArg()) - curlCommand.extend(self.getS3AuthCurlArgs()) - if (rewrite): - curlCommand.extend(self.getIgnoreExistingBucketCurlArg()) - else: - self.checkIfItemExists() - if self.existence == "200": - raise ArchiveUploaderError("Item " + self.itemName + " already exists, giving up.") - curlCommand.extend(self.getItemMetaHeaderArgs()) - curlCommand.extend(self.getItemCreationCurlArgs()) - curlCommand.append(self.getArchiveItemUrl()) - if (self.dryrun): - self.showCommand(curlCommand) - else: - self.doCurlCommand(curlCommand) - - def listAllItems(self): - """List all items for the user associated with the accesskey/secretkey.""" - curlCommand = [ self.config.curl ]; - curlCommand.extend(self.getLocationCurlArg()) - curlCommand.extend(self.getS3AuthCurlArgs()) - if (not self.verbose): - curlCommand.extend(self.getQuietCurlArg()) - curlCommand.append(self.getArchiveBaseS3Url()) - if (self.dryrun): - self.showCommand(curlCommand) - else: - output = self.doCurlCommand(curlCommand, True) - if (self.verbose): - print "About to parse output" - xml.sax.parseString(output, ListAllItemsCH()) - - def listObjects(self): - """List all objects (files) contained in a specific item (bucket).""" - curlCommand = [ self.config.curl ]; - curlCommand.extend(self.getLocationCurlArg()) - curlCommand.extend(self.getS3AuthCurlArgs()) - if (not self.verbose): - curlCommand.extend(self.getQuietCurlArg()) - curlCommand.append(self.getArchiveItemUrl()) - if (self.dryrun): - self.showCommand(curlCommand) - else: - output = self.doCurlCommand(curlCommand, True) - if (self.verbose): - print "About to parse output" - xml.sax.parseString(output, ListObjectsCH()) - - def showItem(self): - """Show metadata associated with a particular item (bucket).""" - curlCommand = [ self.config.curl ]; - 
curlCommand.extend(self.getLocationCurlArg()) - if (not self.verbose): - curlCommand.extend(self.getQuietCurlArg()) - curlCommand.append(self.getArchiveItemDetailsUrl()) - if (self.dryrun): - self.showCommand(curlCommand) - else: - output = self.doCurlCommand(curlCommand, True) - if (self.verbose): - print "About to parse output" - self.showItemMetadataFromJson(output) - - def showItemMetadataFromJson(self,jsonString): - """Grab the metadata for an item from the json for the item details - (contains lost of other cruft) and display to stdout""" - # sample output: - #"metadata":{ - # "identifier":["elwiktionary-dumps"], - # "description":["Dumps of el.wiktionary created by the Wikimedia Foundation and downloadable from http:\/\/dumps.wikimedia.org\/elwiktionary\/"], - # "language":["el (Modern Greek)"], - # "licenseurl":["http:\/\/wikimediafoundation.org\/wiki\/Terms_of_Use"], - # "mediatype":["web"], - # "subject":["xml,dump,wikimedia,el,wiktionary"], - # "title":["Wikimedia database dumps of el.wiktionary, format:xml and sql"], - # "publicdate":["2012-02-17 11:03:45"], - # "collection":["opensource"], - # "addeddate":["2012-02-17 11:03:45"]}, - details = json.loads(jsonString) - if 'metadata' in details.keys(): - metadata = details['metadata'] - print "Item metadata for", self.itemName - for k in details['metadata'].keys(): - print "%s:" % k, - print " | ".join(details['metadata'][k]) - else: - print "No metadata for", self.itemName, "is available." 
- - def logIn(self): - if not self.sessionCookies: - curlCommand = [ self.config.curl ]; - curlCommand.extend(self.getLoginFormCurlArgs()) - curlCommand.extend(self.getRestOfIckyLoginCurlArgs()) - curlCommand.append(self.getLoginFormUrl()) - if (self.dryrun): - self.showCommand(curlCommand) - else: - output = self.doCurlCommand(curlCommand, True) - if (self.verbose): - print "About to dig cookie out of login response:" - print output - self.sessionCookies = self.getLoginCookies(output) - if not self.sessionCookies: - raise ArchiveUploaderError("Login failed.") - - def getLoginCookies(self, text): - """Get cookie out of the text returned from the - archive.org login form. gahhhh""" - # format: .archive.org^ITRUE^I/^IFALSE^I1361562342^Ilogged-in-sig^Isomehugenumberhere - # .archive.org^ITRUE^I/^IFALSE^I1361562342^Ilogged-in-user^Ijohndoe%40wikimedia.org$ - # plus others which we will ignore. - cookies = [] - lines = text.split('\n') - for l in lines: - if (not len(l)) or (l[0] == '#'): - continue - parts = l.split('\t') - print parts - if parts[5] == 'logged-in-user': - cookies.append( "%s=%s" % (parts[5], parts[6]) ) - elif parts[5] == 'logged-in-sig': - cookies.append( "%s=%s" % (parts[5], parts[6]) ) - if (len(cookies)): - return '; '.join(cookies) - else: - return None - - def showItemStatus(self): - """Show the status of an item (bucket): which objects (files) are waiting - on further action from archive.org.""" - self.logIn() - curlCommand = [ self.config.curl ]; - curlCommand.extend(self.getLocationCurlArg()) - if (not self.verbose): - curlCommand.extend(self.getQuietCurlArg()) - curlCommand.extend(self.getCookieCurlArgs()) - curlCommand.append(self.getArchiveItemStatusUrl()) - if (self.dryrun): - self.showCommand(curlCommand) - else: - output = self.doCurlCommand(curlCommand, True) - self.showItemStatusFromHtml(output) - - def stripHidden(self, cell): - start = cell.find('<span class="catHidden">') - if start != -1: - index = start + 1 - openTags = 1 - 
spanOpenOrCloseExpr = re.compile('(<span[^>]+>|</span>)') - # find index of first occurrence and what was matched, so we can see if it was open or close tag - while openTags: - spanMatch = spanOpenOrCloseExpr.search(cell[index:]) - if spanMatch: - tagFound = spanMatch.group(1) - index = spanMatch.start(1) + index - if tagFound == '</span>': - openTags = openTags -1 - else: - openTags = openTags + 1 - else: - # bad html. just toss the rest of the cell - openTags = 0 - index = -1 - # now we have the index where we found the close tag for us. - # toss everything up to that, we'll lose the actual close tag when we - # toss the rest of the html - cell = cell[:start] + cell[index:] - return cell - - def showItemStatusFromHtml(self, text): - """Wade through the html output to find information - about each job we have requested and its status. - THIS IS UGLY and guaranteed to break in the future - but hey, there's no json output available, nor xml.""" - htmlTagExpr = re.compile('(<[^>]+>)+') - # get the headers for the table of tasks - start = text.find('<tr><th><b><a href="/catalog.php?history=1&identifier=') - if start >= 0: - end = text.find('<!--task_ids: -->',start) - content = text[start:end] - print "content is", content - lines = content.split('</th>') - lines = [ re.sub(htmlTagExpr,'',line).strip() for line in lines if line.find('<th><b><a href="/catalog.php?history=1&identifier=') != -1 ] - print ' | '.join(filter(None, lines)) - - # get the tasks themselves - start = text.find('<!--task_ids: -->') - if start < 0: - raise ArchiveUploaderError("Can't locate the beginning of the item status information in the html output.") - end = text.find('</table>',start) - content = text[start:end] - lines = content.split('</tr>') - - for line in lines: - line = line.replace('\n','') - cells = line.split('</td>') - cellsToPrint = [ re.sub(htmlTagExpr,'',self.stripHidden(cell)).strip() for cell in cells ] - print ' | '.join(filter(None,cellsToPrint)) - -class 
ListObjectsCH(xml.sax.ContentHandler): - """Read contents from a request to list all objects (files) - in a given item (bucket)""" - NONE = 0x0 - CONTENTS = 0x1 - KEY = 0x2 - LASTMODIFIED = 0x3 - SIZE = 0x4 - - # sample output: - #<?xml version='1.0' encoding='UTF-8'?> - #<ListBucketResult> - # <Name>elwiktionary-dumps</Name> - # <Contents> - # <Key>elwiktionary-20060703.tar</Key> - # <LastModified>2012-02-17T11:22:21.000Z</LastModified> - # <ETag>2012-02-17T11:22:21.000Z</ETag> - # <Size>10076160</Size> - # <StorageClass>STANDARD</StorageClass> - # <Owner> - # <ID>OpaqueIDStringGoesHere</ID> - # <DisplayName>Readable ID Goes Here</DisplayName> - # </Owner> - # </Contents> - #</ListBucketResult> - - def __init__(self): - xml.sax.ContentHandler.__init__(self) - self.Key = "" - self.LastModified = "" - self.Size = "" - self.state = ListObjectsCH.NONE - - def startElement(self, name, attrs): - if name == "Contents": - self.state = ListObjectsCH.CONTENTS - elif name == "Key" and self.state == ListObjectsCH.CONTENTS: - self.state = ListObjectsCH.KEY - elif name == "LastModified" and self.state == ListObjectsCH.CONTENTS: - self.state = ListObjectsCH.LASTMODIFIED - elif name == "Size" and self.state == ListObjectsCH.CONTENTS: - self.state = ListObjectsCH.SIZE - - def endElement(self, name): - if name == "Contents": - self.state = ListObjectsCH.NONE - # FIXME really, a print? Do better. 
- print "Object: %s, last modified: %s, size: %s" % (self.key, self.lastModified, self.size) - self.itemName = "" - self.itemCreationDate = "" - - elif name == "Key" and self.state == ListObjectsCH.KEY: - self.state = ListObjectsCH.CONTENTS - elif name == "LastModified" and self.state == ListObjectsCH.LASTMODIFIED: - self.state = ListObjectsCH.CONTENTS - elif name == "Size" and self.state == ListObjectsCH.SIZE: - self.state = ListObjectsCH.CONTENTS - - def characters(self, content): - if self.state == ListObjectsCH.KEY: - self.key = content - elif self.state == ListObjectsCH.LASTMODIFIED: - self.lastModified = content - elif self.state == ListObjectsCH.SIZE: - self.size = content - -class ListAllItemsCH(xml.sax.ContentHandler): - """Read contents from a request to list all items (buckets)""" - NONE = 0x0 - BUCKET = 0x1 - NAME = 0x2 - CREATIONDATE = 0x3 - - # sample output: - - #<?xml version='1.0' encoding='UTF-8'?> - #<ListAllMyBucketsResult> - # <Owner> - # <ID>OpaqueIDStringGoesHere</ID> - # <DisplayName>atglenn</DisplayName> - # </Owner> - # <Buckets> - # <Bucket> - # <Name>elwiktionary-dumps</Name> - # <CreationDate>1970-01-01T00:00:00.000Z</CreationDate> - # </Bucket> - # </Buckets> - #</ListAllMyBucketsResult> - - def __init__(self): - xml.sax.ContentHandler.__init__(self) - self.itemName = "" - self.itemCreationDate = "" - self.state = ListAllItemsCH.NONE - - def startElement(self, name, attrs): - if name == "Bucket": - self.state = ListAllItemsCH.BUCKET - elif name == "Name" and self.state == ListAllItemsCH.BUCKET: - self.state = ListAllItemsCH.NAME - elif name == "CreationDate" and self.state == ListAllItemsCH.BUCKET: - self.state = ListAllItemsCH.CREATIONDATE - - def endElement(self, name): - if name == "Bucket": - self.state = ListAllItemsCH.NONE - # FIXME really, a print? Do better. 
- print "Item: %s, created: %s" % (self.itemName, self.itemCreationDate) - self.itemName = "" - self.itemCreationDate = "" - - elif name == "Name" and self.state == ListAllItemsCH.NAME: - self.state = ListAllItemsCH.BUCKET - elif name == "CreationDate" and self.state == ListAllItemsCH.CREATIONDATE: - self.state = ListAllItemsCH.BUCKET - - def characters(self, content): - if self.state == ListAllItemsCH.NAME: - self.itemName = content - elif self.state == ListAllItemsCH.CREATIONDATE: - self.itemCreationDate = content - - -class ArchiveKey(object): - """Authentication to the archive.org api, S3-style.""" - - def __init__(self, config): - """Constructor. Args: - config -- a populated ArchiveUploaderConfig object.""" - self.config = config - self.accessKey = self.config.accessKey - self.secretKey = self.config.secretKey - - def getAuthHeader(self): - """Returns the http header needed for authentication to the archive.org - api.""" - return "authorization: LOW %s:%s" % ( self.accessKey, self.secretKey ) - -def usage(message = None): +def usage(message=None): """Print comprehensive help information to stdout, including a specified message if any, and then error exit.""" if message: - print message - print + sys.stderr.write(message + "\n") - print "Usage: python archiveuploader.py [options]" - print "Mandatory options: --accesskey, --secretkey" - print "--accesskey <key>: The access key from archive.org used to create items and upload objects." - print "--secretkey <key>: The secret key corresponding to the access key described above." - print "Action options (choose one):" - print "--createitem <item>: The item specified will be created. Fails if item already exists." - print "--updateitem <item>: The metadata for the specified item will be updated." - print "--uploadobject <item>: An object will be created by uploading to the specified item the file" - print " given by --filename. Requires the --objectName option." 
- print "--verifyobject <item>: An object in an item will be verified by checking its md5sum locally and on" - print " the server. Requires the --objectName and the --filename options." - print "--listitems: List all items belonging to the account identified by the --accesskey" - print " and --secretkey options." - print "--showitem <item>: Show metadata about the specified item." - print "--showitemstatus <item>: Show pending tasks related to the specified item." - print "--listobjects <item>: List all objects in the specified item." - print "Other options:" - print "--configfile <file>: Name of optional configuration file with access keys, etc." - print "--dryrun: Don't create or update items or objects but show the commands that would" - print " be run. This option also means that updates to the sitematrix cache file" - print " will not be done, although it will be read from if it exists, and the" - print " MediaWiki instance will be queried via the api as well, if needed." - print "--filename <file>: The full path to the file to upload, when --uploadobject is specified." - print "--objectname <object>: The name of an object as it is to appear in a aurl." - print "--verbose: Display progress bars and other output." + usage_message = """ +Usage: python archiveuploader.py [options] + +Mandatory options: --action --accesskey, --secretkey + --accesskey <key>: The access key from archive.org used to + create items and upload objects. + --secretkey <key>: The secret key corresponding to the access + key described above. + --action <action>: See below for the list of actions +Action options (choose one): + create_item: The item specified will be created. Fails + if item already exists. + update_item: The metadata for the specified item will + be updated. + upload_object: An object will be created by uploading to + the specified item the file given by + --filename. Requires the --objectname option. 
+ verify_object: An object in an item will be verified by + checking its md5sum locally and on the server. + Requires the --objectname and the --filename + options. + show_item: Show metadata about the specified item. + show_item_status: Show pending tasks related to specified item. + list_objects: List all objects in the specified item. + +The above actions all require the --itemname argument be specified. + + list_items: List all items belonging to the account + identified by the --accesskey and --secretkey + options. + +Other options: + --configfile <file>: Name of optional configuration file with + access keys, etc. + --dryrun: Don't create or update items or objects but + show the commands that would be run. This + option also means that updates to the + sitematrix cache file will not be done, + although it will be read from if it exists, + and the MediaWiki instance will be queried + via the api as well, if needed. + --filename <file>: The full path to the file to upload, when + the upload_object action is specified. + --objectname <object>: The name of an object as it is to appear in + a url. + --verbose: Display progress bars and other output. 
+""" + sys.stderr.write(usage_message) sys.exit(1) -if __name__ == "__main__": - verbose = False - accessKey = None - secretKey = None - itemName = None - objectName = None - createItem = False - updateItem = False - uploadObject = False - verifyObject = False - configFile = None - fileName = None - listItems = False - listObjects = False - showItem = False - showItemStatus = False - dryrun = False - try: - (options, remainder) = getopt.gnu_getopt(sys.argv[1:], "", - ['accesskey=', 'secretkey=', 'createitem=', 'updateitem=', 'uploadobject=', 'objectname=', 'filename=', 'configfile=', 'listitems', 'showitem=', 'showitemstatus=', 'listobjects=', 'verifyobject=', 'dryrun', 'verbose' ]) - except: - exc_type, exc_value, exc_traceback = sys.exc_info() - print repr(traceback.format_exception(exc_type, exc_value, exc_traceback)) - usage("Unknown option or other error encountered") +def get_opt_vals(options): + """ + get and return values of options, with + appropriate defaults + """ + opt_vals = { + 'action': None, + 'verbose': False, + 'access_key': None, + 'secret_key': None, + 'item_name': None, + 'object_name': None, + 'file_name': None, + 'config_file': None, + 'dryrun': False + } for (opt, val) in options: if opt == "--accesskey": - accessKey = val + opt_vals['access_key'] = val elif opt == "--secretkey": - secretKey = val - elif opt == '--uploadobject': - itemName = val - uploadObject = True - elif opt == '--verifyobject': - itemName = val - verifyObject = True - elif opt == '--objectname': - objectName = val - elif opt == '--createitem': - itemName = val - createItem = True - elif opt == '--updateitem': - itemName = val - updateItem = True - elif opt == '--listitems': - listItems = True - elif opt == '--listobjects': - itemName = val - listObjects = True - elif opt == '--showitem': - itemName = val - showItem = True - elif opt == '--showitemstatus': - itemName = val - showItemStatus = True + opt_vals['secret_key'] = val + elif opt == '--action': + 
opt_vals['action'] = val elif opt == "--dryrun": - dryrun = True + opt_vals['dryrun'] = True + elif opt == "--itemname": + opt_vals['item_name'] = val + elif opt == "--objectname": + opt_vals['object_name'] = val elif opt == "--filename": - fileName = val + opt_vals['file_name'] = val elif opt == "--configfile": - configFile = val + opt_vals['config_file'] = val elif opt == "--verbose": - verbose = True + opt_vals['verbose'] = True + return opt_vals - if len(remainder): - usage("Error: unknown option specified.") - if (uploadObject or verifyObject) and not fileName: - usage("Error: a filename for upload or verification must be specified with --uploadobject/--verifyobject.") - - if (uploadObject or verifyObject) and not objectName: - usage("Error: the option --objectname must be specified with --uploadobject/--verifyobject.") - - actionOptsCount = len(filter(None, [ createItem, updateItem, uploadObject, verifyObject, listItems, listObjects, showItem, showItemStatus ])) - - if actionOptsCount > 1: - usage("Error: conflicting action options specified.") - elif actionOptsCount < 1: - usage("Error: no action option specified.") - - config = ArchiveUploaderConfig(configFile) - - if not config.accessKey: - config.accessKey = accessKey - if not config.secretKey: - config.secretKey = secretKey - - if (not config.accessKey or not config.secretKey): - usage("Error: one of the mandatory options was not specified.") - - archiveKey = ArchiveKey(config) - archiveUploader = ArchiveUploader(config, archiveKey, itemName, verbose, dryrun) - - if uploadObject: - result = archiveUploader.uploadObject(objectName, fileName) - elif verifyObject: - result = archiveUploader.verifyObject(objectName, fileName) - elif createItem: - result = archiveUploader.createItem() - elif updateItem: - result = archiveUploader.updateItem() - elif listItems: - result = False - archiveUploader.listAllItems() - elif listObjects: - result = False - archiveUploader.listObjects() - elif showItem: - result = 
False - archiveUploader.showItem() - elif showItemStatus: - result = False - archiveUploader.showItemStatus() +def do_action(archive_uploader, action, object_name, file_name): + """ + do the specified action and display success or failure + """ + result = False + if action == 'upload_object': + result = archive_uploader.upload_object(object_name, file_name) + elif action == 'verify_object': + result = archive_uploader.verify_object(object_name, file_name) + elif action == 'create_item': + result = archive_uploader.create_item() + elif action == 'update_item': + result = archive_uploader.update_item() + elif action == 'list_items': + archive_uploader.list_all_items() + elif action == 'list_objects': + archive_uploader.list_objects() + elif action == 'show_item': + archive_uploader.show_item() + elif action == 'show_item_status': + archive_uploader.show_item_status() if result: print "Failed." else: print "Successful." + +def do_main(): + try: + (options, remainder) = getopt.gnu_getopt( + sys.argv[1:], "", + ['accesskey=', 'secretkey=', 'action=', 'objectname=', 'filename=', + 'itemname=', 'configfile=', 'dryrun', 'verbose']) + except Exception: + exc_type, exc_value, exc_traceback = sys.exc_info() + print repr(traceback.format_exception(exc_type, exc_value, exc_traceback)) + usage("Unknown option or other error encountered") + + opt_vals = get_opt_vals(options) + + if len(remainder): + usage("Error: unknown option specified.") + + if opt_vals['action'] is None: + usage("Error: no action option specified.") + + if opt_vals['action'] not in ['upload_object', 'verify_object', 'create_item', 'update_item', + 'list_items', 'list_objects', 'show_item', 'show_item_status']: + usage("Error: unknown action " + opt_vals['action']) + + if (opt_vals['action'] in ['upload_object', 'verify_object']): + if not opt_vals['file_name']: + usage("Error: a filename for upload or verification must " + "be specified with uploadobject/verifyobject action.") + if not 
opt_vals['object_name']: + usage("Error: the option --objectname must be specified " + "with uploadobject/verifyobject action.") + + config = ArchiveUploaderConfig(opt_vals['config_file']) + + if not config.settings['access_key']: + config.settings['access_key'] = opt_vals['access_key'] + if not config.settings['secret_key']: + config.settings['secret_key'] = opt_vals['secret_key'] + + if not config.settings['access_key'] or not config.settings['secret_key']: + usage("Error: one of the mandatory options was not specified.") + + archive_key = ArchiveKey(config) + debugging = [] + if opt_vals['verbose']: + debugging.append('verbose') + if opt_vals['dryrun']: + debugging.append('dryrun') + archive_uploader = ArchiveUploader(config, archive_key, opt_vals['item_name'], debugging) + + do_action(archive_uploader, opt_vals['action'], opt_vals['object_name'], opt_vals['file_name']) + + +if __name__ == "__main__": + do_main() -- To view, visit https://gerrit.wikimedia.org/r/280102 Gerrit-MessageType: merged Gerrit-Change-Id: I67526d9b154d2b7a5aa9e38fb8561916aaefb454 Gerrit-PatchSet: 6 Gerrit-Project: operations/dumps Gerrit-Branch: ariel Gerrit-Owner: ArielGlenn Gerrit-Reviewer: ArielGlenn Gerrit-Reviewer: jenkins-bot
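The change subject mentions compacting long sequences of if statements. The if/elif chain in do_action could be shortened further with a lookup table from action name to method name. What follows is only a sketch of that idea and is not part of the merged change: FakeUploader, OBJECT_ACTIONS, ITEM_ACTIONS and dispatch are hypothetical names, only the uploader method names are taken from the diff, and the code is written to run on Python 3 even though the diff itself is Python 2.

```python
# Hypothetical stand-in for ArchiveUploader. Only the method names come
# from the diff; the bodies are placeholders that record each call.
class FakeUploader(object):
    def __init__(self):
        self.calls = []

    def upload_object(self, object_name, file_name):
        self.calls.append(('upload_object', object_name, file_name))
        return False

    def list_all_items(self):
        self.calls.append(('list_all_items',))


# Actions whose methods take (object_name, file_name).
OBJECT_ACTIONS = {
    'upload_object': 'upload_object',
    'verify_object': 'verify_object',
}

# Actions whose methods take no arguments.
ITEM_ACTIONS = {
    'create_item': 'create_item',
    'update_item': 'update_item',
    'list_items': 'list_all_items',
    'list_objects': 'list_objects',
    'show_item': 'show_item',
    'show_item_status': 'show_item_status',
}


def dispatch(uploader, action, object_name=None, file_name=None):
    """Look up the method for an action name and call it."""
    if action in OBJECT_ACTIONS:
        method = getattr(uploader, OBJECT_ACTIONS[action])
        return method(object_name, file_name)
    if action in ITEM_ACTIONS:
        method = getattr(uploader, ITEM_ACTIONS[action])
        return method()
    raise ValueError("unknown action " + action)
```

With a table like this, the list of valid actions checked in do_main could be derived from the dictionary keys instead of being repeated as a literal list.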

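The verifyObject, getEtagValue and getMd5sumOfFile methods removed from archiveuploader.py compare the ETag header returned by the server with the md5sum of the local file. A minimal, self-contained sketch of that check is below. It assumes, as the removed code does, that the ETag value is the hex MD5 digest of the object. The names etag_from_headers and md5_of_file and the 1 MiB buffer size are illustrative choices, not taken from the change, and the code targets Python 3.

```python
import hashlib
import re

# Header format, per the comment in the removed code:
#   ETag: "8ea7c3551a74098b49fbfea49b1ee9e1"
ETAG_EXPR = re.compile(r'^ETag:\s+"([a-f0-9]+)"', re.MULTILINE)


def etag_from_headers(text):
    """Return the quoted hex value of the first ETag header, or None."""
    match = ETAG_EXPR.search(text)
    return match.group(1) if match else None


def md5_of_file(path, bufsize=1024 * 1024):
    """Compute the hex md5 of a file, reading it in fixed-size chunks
    so that very large tarballs never have to fit in memory."""
    summer = hashlib.md5()
    with open(path, "rb") as infile:
        for chunk in iter(lambda: infile.read(bufsize), b""):
            summer.update(chunk)
    return summer.hexdigest()
```

A caller would compare etag_from_headers(curl_output) with md5_of_file(file_name) and treat a missing ETag or a mismatch as a verification failure.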