Milimetric has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/365517 )

Change subject: Create Oozie job for interlanguage nav table
......................................................................


Create Oozie job for interlanguage nav table

We are aggregating pageviews from webrequest that are referred from one
language to another inside the same project family.  For example, if
someone navigates from en.wikipedia to de.wikipedia, we count that as
(wikipedia, en, de, 1).  The language team has been doing this manually
so far, and we are oozifying it so the job can be monitored, made
reliable, and automated.
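The counting logic can be sketched in Python (illustrative only; the actual job runs as a Hive query over wmf.webrequest, and the tuple layout here is an assumption for clarity):

```python
from collections import Counter

def count_interlanguage_navigation(requests):
    """Count navigations between languages of the same project family.

    Each request is a (referer_family, referer_language, current_family,
    current_language) tuple; the field names are illustrative.
    """
    counts = Counter()
    for ref_family, ref_lang, cur_family, cur_lang in requests:
        # Only count navigation within the same family, across languages.
        if ref_family == cur_family and ref_lang != cur_lang:
            counts[(cur_family, ref_lang, cur_lang)] += 1
    return counts

# e.g. navigating en.wikipedia -> de.wikipedia counts as (wikipedia, en, de, 1)
```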

Bug: T170764
Change-Id: I35bc02e046931735f89c0e7bcda676da172a81aa
---
M .gitignore
A hive/interlanguage/create_interlanguage_navigation_table.hql
A oozie/interlanguage/daily/README.md
A oozie/interlanguage/daily/coordinator.properties
A oozie/interlanguage/daily/coordinator.xml
A oozie/interlanguage/daily/interlanguage_navigation.hql
A oozie/interlanguage/daily/workflow.xml
A oozie/interlanguage/datasets.xml
8 files changed, 473 insertions(+), 0 deletions(-)

Approvals:
  Milimetric: Verified; Looks good to me, approved
  Amire80: Looks good to me, but someone else must approve



diff --git a/.gitignore b/.gitignore
index c9b568f..1559f2e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,2 +1,3 @@
 *.pyc
 *.swp
+*.swo
diff --git a/hive/interlanguage/create_interlanguage_navigation_table.hql b/hive/interlanguage/create_interlanguage_navigation_table.hql
new file mode 100644
index 0000000..939f8fa
--- /dev/null
+++ b/hive/interlanguage/create_interlanguage_navigation_table.hql
@@ -0,0 +1,22 @@
+-- Create table statement for interlanguage_navigation table.
+--
+-- Parameters:
+--     <none>
+--
+-- Usage
+--     hive -f create_interlanguage_navigation_table.hql \
+--         --database wmf
+--
+
+CREATE EXTERNAL TABLE IF NOT EXISTS `wmf.interlanguage_navigation` (
+    `project_family`    string  COMMENT 'The project family to aggregate on',
+    `previous_project`  string  COMMENT 'The project (language) found in the referers of this group of requests',
+    `current_project`   string  COMMENT 'The project (language) of this group of requests',
+    `navigation_count`  bigint  COMMENT 'The number of times a user navigated from the previous to the current project'
+)
+PARTITIONED BY (
+    `date`              string  COMMENT 'Date in YYYY-MM-DD format'
+)
+STORED AS PARQUET
+LOCATION '/wmf/data/wmf/interlanguage/navigation'
+;
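A daily partition of this table lands under the table's LOCATION in Hive's default `key=value` directory layout; a minimal sketch (the helper name is hypothetical):

```python
def partition_path(location: str, date: str) -> str:
    """Hypothetical helper: HDFS directory for one daily partition.

    Assumes Hive's default key=value partition directory layout.
    """
    return f"{location}/date={date}"

# partition_path('/wmf/data/wmf/interlanguage/navigation', '2017-10-03')
# -> '/wmf/data/wmf/interlanguage/navigation/date=2017-10-03'
```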
diff --git a/oozie/interlanguage/daily/README.md b/oozie/interlanguage/daily/README.md
new file mode 100644
index 0000000..5cc51cf
--- /dev/null
+++ b/oozie/interlanguage/daily/README.md
@@ -0,0 +1,7 @@
+# Aggregate cross-wiki navigation counts using HTTP referrer
+
+This job aggregates webrequest records into counts of navigation by users
+from one wiki project to another.  Only desktop site browsing is counted.
+
+Output is appended into (yyyy-mm-dd) daily partitions
+in /wmf/data/wmf/interlanguage/navigation
diff --git a/oozie/interlanguage/daily/coordinator.properties b/oozie/interlanguage/daily/coordinator.properties
new file mode 100644
index 0000000..822b592
--- /dev/null
+++ b/oozie/interlanguage/daily/coordinator.properties
@@ -0,0 +1,72 @@
+# Configures a coordinator to manage automatically aggregating interlanguage navigation from
+# the refined webrequest table.
+# Any of the following properties are overridable with -D.
+# Usage:
+#   oozie job -Duser=$USER -Dstart_time=2017-10-03T00:00Z -submit -config oozie/interlanguage/daily/coordinator.properties
+#
+# NOTE:  The $oozie_directory must be synced to HDFS so that all relevant
+#        .xml files exist there when this job is submitted.
+
+
+name_node                         = hdfs://analytics-hadoop
+job_tracker                       = resourcemanager.analytics.eqiad.wmnet:8032
+queue_name                        = default
+
+user                              = hdfs
+
+# Base path in HDFS to refinery.
+# When submitting this job for production, you should
+# override this to point directly at a deployed
+# directory name, and not the 'symbolic' 'current' directory.
+# E.g.  /wmf/refinery/2015-01-05T17.59.18Z--7bb7f07
+refinery_directory                = ${name_node}/wmf/refinery/current
+
+# Base path in HDFS to oozie files.
+# Other files will be used relative to this path.
+oozie_directory                   = ${refinery_directory}/oozie
+artifacts_directory               = ${refinery_directory}/artifacts
+
+# Version of Hive UDF jar to import
+refinery_jar_version              = 0.0.53
+
+# HDFS path to hive-site.xml file.  This is needed to run hive actions.
+hive_site_xml                     = ${name_node}/user/hive/hive-site.xml
+
+# HDFS path to coordinator to run for each webrequest_source.
+coordinator_file                  = ${oozie_directory}/interlanguage/daily/coordinator.xml
+
+# HDFS path to workflow to run.
+workflow_file                     = ${oozie_directory}/interlanguage/daily/workflow.xml
+
+# Initial import time of the webrequest dataset.
+start_time                        = 2017-10-01T00:00Z
+
+# Time to stop running this coordinator.  Year 3000 == never!
+stop_time                         = 3000-01-01T00:00Z
+
+# HDFS path to refine webrequest dataset definitions
+webrequest_datasets_file          = ${oozie_directory}/webrequest/datasets.xml
+webrequest_data_directory         = ${name_node}/wmf/data/wmf/webrequest
+
+# HDFS path to interlanguage definitions
+interlanguage_datasets_file       = ${oozie_directory}/interlanguage/datasets.xml
+interlanguage_data_directory      = ${name_node}/wmf/data/wmf/interlanguage
+
+# Fully qualified Hive table name.
+webrequest_table                  = wmf.webrequest
+interlanguage_navigation_table    = wmf.interlanguage_navigation
+
+
+# HDFS path to workflow to mark a directory as done
+mark_directory_done_workflow_file = ${oozie_directory}/util/mark_directory_done/workflow.xml
+
+# The email address where to send SLA alerts
+sla_alert_contact                 = [email protected]
+
+# Workflow to send an error email
+send_error_email_workflow_file    = ${oozie_directory}/util/send_error_email/workflow.xml
+
+# Coordinator to start.
+oozie.coord.application.path      = ${coordinator_file}
+oozie.use.system.libpath          = true
+oozie.action.external.stats.write = true
diff --git a/oozie/interlanguage/daily/coordinator.xml b/oozie/interlanguage/daily/coordinator.xml
new file mode 100644
index 0000000..d367fc7
--- /dev/null
+++ b/oozie/interlanguage/daily/coordinator.xml
@@ -0,0 +1,115 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<coordinator-app xmlns="uri:oozie:coordinator:0.4"
+    xmlns:sla="uri:oozie:sla:0.2"
+    name="interlanguage-navigation-daily-coord"
+    frequency="${coord:days(1)}"
+    start="${start_time}"
+    end="${stop_time}"
+    timezone="Universal">
+
+    <parameters>
+
+        <!-- Required properties. -->
+        <property><name>name_node</name></property>
+        <property><name>job_tracker</name></property>
+        <property><name>queue_name</name></property>
+        <property><name>artifacts_directory</name></property>
+        <property><name>refinery_jar_version</name></property>
+        <property><name>hive_site_xml</name></property>
+
+        <property><name>workflow_file</name></property>
+
+        <property><name>start_time</name></property>
+        <property><name>stop_time</name></property>
+
+        <property><name>webrequest_datasets_file</name></property>
+        <property><name>webrequest_data_directory</name></property>
+
+        <property><name>interlanguage_datasets_file</name></property>
+        <property><name>interlanguage_data_directory</name></property>
+
+        <property><name>webrequest_table</name></property>
+        <property><name>interlanguage_navigation_table</name></property>
+
+        <property><name>mark_directory_done_workflow_file</name></property>
+        <property><name>sla_alert_contact</name></property>
+        <property><name>send_error_email_workflow_file</name></property>
+    </parameters>
+
+    <controls>
+        <!--
+        By having materialized jobs not timeout, we ease backfilling incidents
+        after recoverable hiccups on the dataset producers.
+        -->
+        <timeout>-1</timeout>
+
+        <!--
+        interlanguage aggregation is not high priority, so we limit
+        concurrency for resource sharing.
+        -->
+        <concurrency>2</concurrency>
+
+        <throttle>2</throttle>
+    </controls>
+
+    <datasets>
+        <!--
+        Include refined and interlanguage datasets files.
+        $webrequest_datasets_file will be used as the input events
+        $interlanguage_datasets_file will be used as the output events
+        -->
+        <include>${webrequest_datasets_file}</include>
+        <include>${interlanguage_datasets_file}</include>
+    </datasets>
+
+    <input-events>
+        <data-in name="text_refined_input" dataset="webrequest_text">
+            <instance>${coord:current(0)}</instance>
+        </data-in>
+    </input-events>
+
+    <output-events>
+        <data-out name="interlanguage_navigation_daily_output" dataset="interlanguage_navigation_daily">
+            <instance>${coord:current(0)}</instance>
+        </data-out>
+    </output-events>
+
+    <action>
+        <workflow>
+            <app-path>${workflow_file}</app-path>
+            <configuration>
+
+                <property>
+                    <name>year</name>
+                    <value>${coord:formatTime(coord:nominalTime(), "yyyy")}</value>
+                </property>
+                <property>
+                    <name>month</name>
+                    <value>${coord:formatTime(coord:nominalTime(), "MM")}</value>
+                </property>
+                <property>
+                    <name>day</name>
+                    <value>${coord:formatTime(coord:nominalTime(), "dd")}</value>
+                </property>
+                <property>
+                    <!-- To mark directory done after success -->
+                    <name>interlanguage_navigation_dataset_directory</name>
+                    <value>${coord:dataOut('interlanguage_navigation_daily_output')}</value>
+                </property>
+
+            </configuration>
+        </workflow>
+        <sla:info>
+            <!--
+                Use the action's actual time as the SLA base, since it is the
+                time used to compute the timeout.
+                Set the SLA to two days, since webrequest has a 5-hour SLA and
+                this job is not urgent.
+            -->
+            <sla:nominal-time>${coord:actualTime()}</sla:nominal-time>
+            <sla:should-end>${2 * DAYS}</sla:should-end>
+            <sla:alert-events>end_miss</sla:alert-events>
+            <sla:alert-contact>${sla_alert_contact}</sla:alert-contact>
+        </sla:info>
+    </action>
+</coordinator-app>
diff --git a/oozie/interlanguage/daily/interlanguage_navigation.hql b/oozie/interlanguage/daily/interlanguage_navigation.hql
new file mode 100644
index 0000000..3844226
--- /dev/null
+++ b/oozie/interlanguage/daily/interlanguage_navigation.hql
@@ -0,0 +1,63 @@
+-- Overview:
+--      This query works like this:
+--          * filters webrequest for user (not spider) pageviews to wikis
+--            with referrers from a wiki in the same project family
+--          * aggregates view count by project family, source project (previous)
+--            and destination project (current)
+--
+-- Parameters:
+--      source_table      -- Fully qualified table name to compute the
+--                           aggregation from.
+--      artifacts_directory
+--                        -- The artifact directory where to find
+--                           jar files to import for UDFs
+--      refinery_jar_version
+--                        -- Version of the jar to import for UDFs
+--      destination_table -- Fully qualified table name to fill in
+--                           aggregated values.
+--      year              -- year of partition to aggregate
+--      month             -- month of partition to aggregate, left zero-padded
+--      day               -- day of partition to aggregate, left zero-padded
+--
+-- Usage:
+--     hive -f interlanguage_navigation.hql                                             \
+--         -d artifacts_directory=hdfs://analytics-hadoop/wmf/refinery/current/artifacts \
+--         -d refinery_jar_version=0.0.53                                               \
+--         -d source_table=wmf.webrequest                                               \
+--         -d destination_table=wmf.interlanguage_navigation                            \
+--         -d year=2017                                                                 \
+--         -d month=10                                                                  \
+--         -d day=03
+--
+-- example: ADD JAR hdfs://analytics-hadoop/wmf/refinery/current/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.0.53.jar;
+ADD JAR ${artifacts_directory}/org/wikimedia/analytics/refinery/refinery-hive-${refinery_jar_version}.jar;
+
+SET parquet.compression = SNAPPY;
+CREATE TEMPORARY FUNCTION normalize_host AS 'org.wikimedia.analytics.refinery.hive.GetHostPropertiesUDF';
+
+INSERT OVERWRITE TABLE ${destination_table}
+    PARTITION(date='${year}-${month}-${day}')
+
+     SELECT normalized_host.project_family,
+            normalized_host.project AS current_project,
+            normalize_host(parse_url(referer, 'HOST')).project AS previous_project,
+            COUNT(*) AS navigation_count
+
+       FROM ${source_table}
+
+      WHERE webrequest_source='text'
+        AND year=${year} AND month=${month} AND day=${day}
+        AND is_pageview
+        AND agent_type = 'user'
+        -- The project is the same as the referer project
+        AND normalized_host.project_family = normalize_host(parse_url(referer, 'HOST')).project_family
+        -- The referer host has no .zero, .m, or other qualifiers
+        AND size(normalize_host(parse_url(referer, 'HOST')).qualifiers) = 0
+        -- The referer path begins with /wiki/, like a normal article path
+        AND parse_url(referer,'PATH') LIKE '/wiki/%'
+        AND normalized_host.project <> normalize_host(parse_url(referer, 'HOST')).project
+
+      GROUP BY normalized_host.project_family,
+            normalized_host.project,
+            normalize_host(parse_url(referer, 'HOST')).project
+;
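The query's WHERE clause can be approximated in Python for readability (a sketch, not the real GetHostPropertiesUDF: the host parsing below assumes a plain `language.family.org` hostname, so any extra qualifier such as `.m` or `.zero` fails the length check):

```python
from urllib.parse import urlparse

def qualifies(referer, current_project, current_family, is_pageview, agent_type):
    """Mirror of the query's filters (hypothetical helper, not the UDF)."""
    if not is_pageview or agent_type != 'user':
        return False
    parsed = urlparse(referer)
    # The referer path must look like a normal article path.
    if not (parsed.path or '').startswith('/wiki/'):
        return False
    parts = parsed.hostname.split('.') if parsed.hostname else []
    # Exactly language.family.org: qualifiers like .m or .zero are excluded.
    if len(parts) != 3 or parts[2] != 'org':
        return False
    ref_lang, ref_family = parts[0], parts[1]
    # Same project family, but a different language than the current project.
    return ref_family == current_family and ref_lang != current_project
```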
diff --git a/oozie/interlanguage/daily/workflow.xml b/oozie/interlanguage/daily/workflow.xml
new file mode 100644
index 0000000..75bf630
--- /dev/null
+++ b/oozie/interlanguage/daily/workflow.xml
@@ -0,0 +1,158 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<workflow-app xmlns="uri:oozie:workflow:0.4"
+    name="interlanguage-navigation-daily-wf-${year}-${month}-${day}">
+
+    <parameters>
+
+        <!-- Default values for inner oozie settings -->
+        <property>
+            <name>oozie_launcher_queue_name</name>
+            <value>${queue_name}</value>
+        </property>
+        <property>
+            <name>oozie_launcher_memory</name>
+            <value>2048</value>
+        </property>
+        <!-- Required properties -->
+        <property><name>name_node</name></property>
+        <property><name>job_tracker</name></property>
+        <property><name>queue_name</name></property>
+        <!-- Aggregation related configuration properties-->
+        <property>
+            <name>hive_interlanguage_aggregation</name>
+            <!-- This is relative to the containing directory of this file. -->
+            <value>interlanguage_navigation.hql</value>
+            <description>Hive script to run.</description>
+        </property>
+        <property><name>artifacts_directory</name></property>
+        <property><name>refinery_jar_version</name></property>
+        <property>
+            <name>hive_site_xml</name>
+            <description>hive-site.xml file path in HDFS</description>
+        </property>
+
+        <property>
+            <name>interlanguage_navigation_dataset_directory</name>
+            <description>Directory to generate the done flag in</description>
+        </property>
+
+        <property>
+            <name>webrequest_table</name>
+            <description>Hive table to refine</description>
+        </property>
+        <property>
+            <name>interlanguage_navigation_table</name>
+            <description>The destination table to store aggregated data in.</description>
+        </property>
+
+        <property>
+            <name>year</name>
+            <description>The partition's year</description>
+        </property>
+        <property>
+            <name>month</name>
+            <description>The partition's month, left padded</description>
+        </property>
+        <property>
+            <name>day</name>
+            <description>The partition's day, left padded</description>
+        </property>
+
+        <property>
+            <name>mark_directory_done_workflow_file</name>
+            <description>Workflow for marking a directory done</description>
+        </property>
+        <property>
+            <name>send_error_email_workflow_file</name>
+            <description>Workflow for sending an email</description>
+        </property>
+
+    </parameters>
+
+    <start to="aggregate"/>
+
+    <action name="aggregate">
+        <hive xmlns="uri:oozie:hive-action:0.2">
+            <job-tracker>${job_tracker}</job-tracker>
+            <name-node>${name_node}</name-node>
+            <job-xml>${hive_site_xml}</job-xml>
+            <configuration>
+                <property>
+                    <name>mapreduce.job.queuename</name>
+                    <value>${queue_name}</value>
+                </property>
+                <!--make sure oozie:launcher runs in a low priority queue -->
+                <property>
+                    <name>oozie.launcher.mapred.job.queue.name</name>
+                    <value>${oozie_launcher_queue_name}</value>
+                </property>
+                <property>
+                    <name>oozie.launcher.mapreduce.map.memory.mb</name>
+                    <value>${oozie_launcher_memory}</value>
+                </property>
+                <property>
+                    <name>hive.exec.scratchdir</name>
+                    <value>/tmp/hive-${user}</value>
+                </property>
+            </configuration>
+            <script>${hive_interlanguage_aggregation}</script>
+            <param>artifacts_directory=${artifacts_directory}</param>
+            <param>refinery_jar_version=${refinery_jar_version}</param>
+            <param>source_table=${webrequest_table}</param>
+            <param>destination_table=${interlanguage_navigation_table}</param>
+            <param>year=${year}</param>
+            <param>month=${month}</param>
+            <param>day=${day}</param>
+        </hive>
+
+        <ok to="mark_aggregated_interlanguage_navigation_dataset_done"/>
+        <error to="send_error_email"/>
+    </action>
+
+    <action name="mark_aggregated_interlanguage_navigation_dataset_done">
+        <sub-workflow>
+            <app-path>${mark_directory_done_workflow_file}</app-path>
+            <configuration>
+                <property>
+                    <name>directory</name>
+                    <value>${interlanguage_navigation_dataset_directory}</value>
+                </property>
+            </configuration>
+        </sub-workflow>
+        <ok to="end"/>
+        <error to="send_error_email"/>
+    </action>
+
+    <action name="send_error_email">
+        <sub-workflow>
+            <app-path>${send_error_email_workflow_file}</app-path>
+            <!-- email address is passed through this propagation -->
+            <propagate-configuration/>
+            <configuration>
+                <property>
+                    <name>parent_name</name>
+                    <value>${wf:name()}</value>
+                </property>
+                <property>
+                    <name>parent_failed_action</name>
+                    <value>${wf:lastErrorNode()}</value>
+                </property>
+                <property>
+                    <name>parent_error_code</name>
+                    <value>${wf:errorCode(wf:lastErrorNode())}</value>
+                </property>
+                <property>
+                    <name>parent_error_message</name>
+                    <value>${wf:errorMessage(wf:lastErrorNode())}</value>
+                </property>
+            </configuration>
+        </sub-workflow>
+        <ok to="kill"/>
+        <error to="kill"/>
+    </action>
+
+    <kill name="kill">
+        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
+    </kill>
+    <end name="end"/>
+</workflow-app>
diff --git a/oozie/interlanguage/datasets.xml b/oozie/interlanguage/datasets.xml
new file mode 100644
index 0000000..18875b5
--- /dev/null
+++ b/oozie/interlanguage/datasets.xml
@@ -0,0 +1,35 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+Defines reusable datasets for aggregated interlanguage navigation data.
+Use this dataset in your coordinator.xml files by setting:
+
+    ${start_time}                   - the initial instance of your data.
+                                      Example: 2014-04-01T00:00Z
+    ${interlanguage_data_directory} - Path to the interlanguage data directory.
+                                      Example: /wmf/data/wmf/interlanguage
+-->
+
+<datasets>
+
+    <!--
+    The interlanguage_navigation_daily dataset contains aggregated pageviews
+    that are referred from another wiki within the same project family but
+    in a different language.  This allows the tracking of inter-language user
+    navigation.
+
+    Note that we do not use "${...}" but "${"$"}{...}", as dataset files are
+    passed to EL twice in cascade, and in the first EL level, ${MONTH}
+    evaluates to the string "${MONTH}". Hence, we escape the dollar sign in
+    "${...}" to "${"$"}{...}". At the first EL level, "${"$"}" gets turned
+    into a dollar sign, and "{...}" is just passed along. Hence, we arrive
+    at "${...}" as input for the second EL level. There, the variables hold
+    their expected values, and we can start unpadding them.
+    -->
+    <dataset name="interlanguage_navigation_daily"
+             frequency="${coord:days(1)}"
+             initial-instance="${start_time}"
+             timezone="Universal">
+        <uri-template>${interlanguage_data_directory}/navigation/daily/date=${YEAR}-${"$"}{MONTH + 0}-${"$"}{DAY + 0}</uri-template>
+        <done-flag>_SUCCESS</done-flag>
+    </dataset>
+</datasets>
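The two-pass EL escaping described in the comment above can be sketched with a tiny substituter (illustrative only, not Oozie's EL engine; real Oozie EL additionally evaluates arithmetic such as `MONTH + 0` to unpad values):

```python
import re

def el_eval(template, variables):
    """One pass of a minimal EL-like evaluator.

    ${"$"} yields a literal dollar sign; ${NAME} substitutes a variable;
    unknown variables pass through unchanged, as the dataset comment
    describes for the first EL level.
    """
    def sub(match):
        expr = match.group(1)
        if expr == '"$"':
            return '$'
        return str(variables.get(expr, '${%s}' % expr))
    return re.sub(r'\$\{([^}]*)\}', sub, template)

# First pass leaves ${"$"}{MONTH} as ${MONTH}; the second pass resolves it.
```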

-- 
To view, visit https://gerrit.wikimedia.org/r/365517
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I35bc02e046931735f89c0e7bcda676da172a81aa
Gerrit-PatchSet: 7
Gerrit-Project: analytics/refinery
Gerrit-Branch: master
Gerrit-Owner: Amire80 <[email protected]>
Gerrit-Reviewer: Amire80 <[email protected]>
Gerrit-Reviewer: Joal <[email protected]>
Gerrit-Reviewer: Milimetric <[email protected]>
Gerrit-Reviewer: Nuria <[email protected]>
Gerrit-Reviewer: Ottomata <[email protected]>
