EBernhardson has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/335158 )

Change subject: Script to drop mediawiki log partitions in HDFS
......................................................................

Script to drop mediawiki log partitions in HDFS

Log data in the mediawiki -> kafka -> camus -> hdfs pipeline contains
various PII that must be deleted after no more than 90 days. This
script is adapted from the existing refinery drop scripts to drop
the old partitions.

Change-Id: Ia6811a7447ff63c2b0b78ed4a90d801e700a9379
---
A bin/refinery-drop-mediawiki-partitions
1 file changed, 116 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/analytics/refinery refs/changes/58/335158/1

diff --git a/bin/refinery-drop-mediawiki-partitions b/bin/refinery-drop-mediawiki-partitions
new file mode 100755
index 0000000..1a97016
--- /dev/null
+++ b/bin/refinery-drop-mediawiki-partitions
@@ -0,0 +1,116 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Note: You should make sure to put refinery/python on your PYTHONPATH.
+#   export PYTHONPATH=$PYTHONPATH:/path/to/refinery/python
+
+"""
+Automatically deletes hourly time-bucketed mediawiki log directories
+from HDFS once they are older than the configured threshold.
+
+Usage: refinery-drop-mediawiki-partitions [options]
+
+Options:
+    -h --help                           Show this help message and exit.
+    -d --older-than-days=<days>         Drop data older than this number of days.  [default: 60]
+    -l --location=<location>            Base HDFS location path of the mediawiki log data.
+    -v --verbose                        Turn on verbose debug logging.
+    -n --dry-run                        Don't actually delete any data. Print the HDFS directory paths
+                                        that would be deleted.
+"""
+__author__ = 'Madhumitha Viswanathan <madhuvi...@wikimedia.org>'
+
+import datetime
+from   docopt   import docopt
+import logging
+import re
+import os
+import sys
+from refinery.util import HiveUtils, HdfsUtils
+
+
+if __name__ == '__main__':
+    # parse arguments
+    arguments = docopt(__doc__)
+    days            = int(arguments['--older-than-days'])
+    location        = arguments['--location']
+    verbose         = arguments['--verbose']
+    dry_run         = arguments['--dry-run']
+
+    log_level = logging.INFO
+    if verbose:
+        log_level = logging.DEBUG
+
+    logging.basicConfig(level=log_level,
+                        format='%(asctime)s %(levelname)-6s %(message)s',
+                        datefmt='%Y-%m-%dT%H:%M:%S')
+
+
+    if not HdfsUtils.validate_path(location):
+        logging.error('Location \'{0}\' is not a valid HDFS path.  Path must start with \'/\' or \'hdfs://\'.  Aborting.'
+            .format(location))
+        sys.exit(1)
+
+
+    # This glob will be used to list out all partition paths in HDFS.
+    partition_glob = os.path.join(location, 'hourly', '*', '*', '*', '*')
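+    # For example, if location were /wmf/data/raw/mediawiki (hypothetical),
+    # this glob would match one directory per year/month/day/hour bucket,
+    # e.g. /wmf/data/raw/mediawiki/hourly/2017/01/30/00.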
+
+    # This regex tells HiveUtils partition_datetime_from_path
+    # how to extract just the date portion from a partition path.
+    # The first match group will be passed to datetime.datetime.strptime
+    # using one of the below date_formats.
+    date_regex = re.compile(r'.*/hourly/(.+)$')
+
+    # This strptime format will be used to extract a datetime object from
+    # the string matched by date_regex in HiveUtils partition_datetime_from_path
+    date_format = '%Y/%m/%d/%H'
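+    # For a path ending in hourly/2017/01/30/00, date_regex captures
+    # '2017/01/30/00', which strptime parses into datetime(2017, 1, 30, 0).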
+
+    # Delete partitions older than this.
+    old_partition_datetime_threshold = datetime.datetime.now() - datetime.timedelta(days=days)
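+    # e.g. with the default of 60 days, anything bucketed before
+    # (now - 60 days) becomes eligible for deletion.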
+
+    partition_paths_to_delete = []
+
+    # Loop through all the partition directory paths for this table
+    # and check if any of them are old enough for deletion.
+    for partition_path in HdfsUtils.ls(partition_glob, include_children=False):
+        try:
+            partition_datetime = HiveUtils.partition_datetime_from_path(
+                partition_path,
+                date_regex,
+                date_format
+            )
+        except ValueError as e:
+            logging.error(
+                'HiveUtils.partition_datetime_from_path could not parse date found in {0} using pattern {1}. Skipping. ({2})'
+                .format(partition_path, date_regex.pattern, e)
+            )
+            continue
+
+        if partition_datetime and partition_datetime < old_partition_datetime_threshold:
+            partition_paths_to_delete.append(partition_path)
+
+
+    # Delete any old HDFS data
+    if partition_paths_to_delete:
+        if dry_run:
+            print('hdfs dfs -rm -R ' + ' '.join(partition_paths_to_delete))
+        else:
+            logging.info('Removing {0} mediawiki log partition directories from {1}.'
+                .format(len(partition_paths_to_delete), location)
+            )
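+            # All paths are joined into one space-separated string, so this is
+            # a single HdfsUtils.rm call (cf. the dry-run command printed above).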
+            HdfsUtils.rm(' '.join(partition_paths_to_delete))
+    else:
+        logging.info('No mediawiki log partition directories need to be removed.')

-- 
To view, visit https://gerrit.wikimedia.org/r/335158
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ia6811a7447ff63c2b0b78ed4a90d801e700a9379
Gerrit-PatchSet: 1
Gerrit-Project: analytics/refinery
Gerrit-Branch: master
Gerrit-Owner: EBernhardson <ebernhard...@wikimedia.org>
