Joal has submitted this change and it was merged.

Change subject: Include webrequest refine oozie job into load one
......................................................................


Include webrequest refine oozie job into load one

Load and refine oozie jobs are currently run in sequence,
using two bundles for the exact same type of data.
We have not experienced real advantages of having them
decoupled, so we merge them into a single job, reusing
the load one.

Bug: T130731
Change-Id: Id8c8cfdca786a87a8757881d076ada679ddcf069
---
M hive/webrequest/create_webrequest_raw_table.hql
M oozie/webrequest/load/README.md
M oozie/webrequest/load/bundle.properties
M oozie/webrequest/load/bundle.xml
M oozie/webrequest/load/coordinator.xml
R oozie/webrequest/load/refine_webrequest.hql
M oozie/webrequest/load/workflow.xml
D oozie/webrequest/refine/README.md
D oozie/webrequest/refine/bundle.properties
D oozie/webrequest/refine/bundle.xml
D oozie/webrequest/refine/coordinator.xml
D oozie/webrequest/refine/workflow.xml
12 files changed, 169 insertions(+), 505 deletions(-)

Approvals:
  Ottomata: Looks good to me, but someone else must approve
  Joal: Verified; Looks good to me, approved
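
For readers skimming the diff below, the control flow at the end of the merged workflow can be sketched as a small transition table. This is only an illustration read from the workflow.xml hunks in this change, not the Oozie definition itself. The names "check_raw_dataset" and "end" are hypothetical placeholders, and the error target of mark_refined_dataset_done is an assumption, since the diff does not show them.

```python
# Transition table for the tail of the merged load workflow, as read from the
# workflow.xml hunks in this change.
# ASSUMPTIONS: "check_raw_dataset" and "end" are hypothetical names (the diff
# does not show the name of the action preceding mark_raw_dataset_done, nor
# the node that follows mark_refined_dataset_done). The error target of
# mark_refined_dataset_done is also assumed.
TRANSITIONS = {
    "check_raw_dataset": {"ok": "mark_raw_dataset_done", "error": "kill"},
    "mark_raw_dataset_done": {"ok": "refine", "error": "send_error_email"},
    "refine": {"ok": "mark_refined_dataset_done", "error": "send_error_email"},
    "mark_refined_dataset_done": {"ok": "end", "error": "send_error_email"},
}
TERMINAL = {"end", "kill", "send_error_email"}


def walk(start, failing=frozenset()):
    """Return the nodes visited from `start`; actions in `failing` take their error edge."""
    path = [start]
    node = start
    while node not in TERMINAL:
        outcome = "error" if node in failing else "ok"
        node = TRANSITIONS[node][outcome]
        path.append(node)
    return path


if __name__ == "__main__":
    print(" -> ".join(walk("check_raw_dataset")))
    print(" -> ".join(walk("check_raw_dataset", failing={"refine"})))
```

The point of the sketch is that refinement now runs only after the raw dataset has been marked done within the same workflow, instead of being triggered by a separate coordinator waiting on the raw done-flag.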



diff --git a/hive/webrequest/create_webrequest_raw_table.hql b/hive/webrequest/create_webrequest_raw_table.hql
index 49cdddc..09908b8 100644
--- a/hive/webrequest/create_webrequest_raw_table.hql
+++ b/hive/webrequest/create_webrequest_raw_table.hql
@@ -16,6 +16,8 @@
 --         --database wmf_raw
 --
 
+ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;
+
 CREATE EXTERNAL TABLE IF NOT EXISTS `webrequest` (
     `hostname`          string  COMMENT 'Source node hostname',
     `sequence`          bigint  COMMENT 'Per host sequence number',
diff --git a/oozie/webrequest/load/README.md b/oozie/webrequest/load/README.md
index 37e0a78..45d70a7 100644
--- a/oozie/webrequest/load/README.md
+++ b/oozie/webrequest/load/README.md
@@ -1,4 +1,4 @@
-# Basic verification for webrequest logs
+# Verify webrequest logs and refine them
 
 The basic verification analyzes each log line's sequence number and
 computes per host statistics. It detects holes, duplicates, and nulls
@@ -9,10 +9,13 @@
 duplicates, holes, or nulls, the directory gets a ```_SUCCESS```
 marker.
 
+Then data get refined, meaning converted from raw JSON
+logs imported from Kafka into a clustered-bucketed table
+stored in Parquet format with newly computed fields.
+
 # Outline
 
-* ```bundle.properties``` can be used to inject the whole verification
-  pipeline into oozie.
+* ```bundle.properties``` can be used to inject the whole pipeline into oozie.
 * ```bundle.xml``` injects separate coordinators for each of the
   webrequest_sources.
 * ```coordinator.xml``` injects a workflow for each dataset.
@@ -23,10 +26,11 @@
     (so the statistics are easily queryable and need not be recomputed
     when drilling in)
   * and puts per dataset information into separate files,
-  * analyzes those files to determine whether or not the dataset is
-    ok, and
-  * finally writes the ```_SUCCESS``` marker to the dataset, if it is
-    ok.
+  * analyzes those files to determine whether or not the raw dataset is
+    ok
+  * writes the ```_SUCCESS``` marker to the raw dataset, if it is ok.
+  * In that case, compute the new refined partition, and
+  * writes the ```_SUCCESS``` marker to the refined partition, if it is ok.
 
 Note that we add the partition to the table before verification, and
 do not drop the partition if there is an error. Hence, the table might
@@ -35,4 +39,7 @@
 table is not meant for researchers.
 
 Icinga monitoring for the ```_SUCCESS``` marker is not part of this
-setup and can be found at {{Citation needed}}.
\ No newline at end of file
+setup and can be found at {{Citation needed}}.
+
+Please update the record_version if you change the refined table content
+definition and/or schema.
\ No newline at end of file
diff --git a/oozie/webrequest/load/bundle.properties b/oozie/webrequest/load/bundle.properties
index 5846e1e..8eb354e 100644
--- a/oozie/webrequest/load/bundle.properties
+++ b/oozie/webrequest/load/bundle.properties
@@ -32,9 +32,6 @@
 # HDFS path to workflow to run.
 workflow_file                     = ${oozie_directory}/webrequest/load/workflow.xml
 
-# HDFS path to webrequest dataset definition
-datasets_raw_file                 = ${oozie_directory}/webrequest/datasets_raw.xml
-
 # Initial import time of the webrequest dataset.
 start_time                        = 2014-04-01T00:00Z
 
@@ -53,8 +50,17 @@
 # HDFS path to hive-site.xml file.  This is needed to run hive actions.
 hive_site_xml                     = ${oozie_directory}/util/hive/hive-site.xml
 
-# Fully qualified Hive table name.
-table                             = wmf_raw.webrequest
+# Fully qualified raw webrequests hive table name.
+webrequest_raw_table              = wmf_raw.webrequest
+
+# Fully qualified webrequest hive table name.
+webrequest_table                  = wmf.webrequest
+
+# Version of Hive UDF jar to import
+refinery_jar_version              = 0.0.30
+
+# Record version to keep track of changes
+record_version                    = 0.0.15
 
 # Hive table name.
 statistics_table                  = wmf_raw.webrequest_sequence_stats
@@ -69,9 +75,18 @@
 error_data_loss_threshold         = 5
 warning_data_loss_threshold       = 1
 
+# HDFS path to webrequest dataset definition
+webrequest_raw_datasets_file      = ${oozie_directory}/webrequest/datasets_raw.xml
+
 # HDFS path to directory where webrequest raw data is time bucketed.
 webrequest_raw_data_directory     = ${name_node}/wmf/data/raw/webrequest
 
+# HDFS path to webrequest dataset definition
+webrequest_datasets_file          = ${oozie_directory}/webrequest/datasets.xml
+
+# HDFS path to directory where webrequest data is time bucketed.
+webrequest_data_directory         = ${name_node}/wmf/data/wmf/webrequest
+
 # The email address where to send SLA alerts
 sla_alert_contact                 = [email protected]
 
diff --git a/oozie/webrequest/load/bundle.xml b/oozie/webrequest/load/bundle.xml
index 3e2af23..6f43c48 100644
--- a/oozie/webrequest/load/bundle.xml
+++ b/oozie/webrequest/load/bundle.xml
@@ -12,14 +12,20 @@
         <property><name>workflow_file</name></property>
         <property><name>start_time</name></property>
         <property><name>stop_time</name></property>
-        <property><name>datasets_raw_file</name></property>
+        <property><name>webrequest_raw_datasets_file</name></property>
         <property><name>webrequest_raw_data_directory</name></property>
+        <property><name>webrequest_datasets_file</name></property>
+        <property><name>webrequest_data_directory</name></property>
 
         <property><name>hive_site_xml</name></property>
         <property><name>add_partition_workflow_file</name></property>
-        <property><name>table</name></property>
+        <property><name>refinery_jar_version</name></property>
+        <property><name>artifacts_directory</name></property>
+        <property><name>webrequest_raw_table</name></property>
+        <property><name>webrequest_table</name></property>
         <property><name>statistics_table</name></property>
         <property><name>statistics_hourly_table</name></property>
+        <property><name>record_version</name></property>
         <property><name>data_loss_check_directory_base</name></property>
         <property><name>error_data_loss_threshold</name></property>
         <property><name>warning_data_loss_threshold</name></property>
diff --git a/oozie/webrequest/load/coordinator.xml b/oozie/webrequest/load/coordinator.xml
index 1ebfb1d..b8a6ab8 100644
--- a/oozie/webrequest/load/coordinator.xml
+++ b/oozie/webrequest/load/coordinator.xml
@@ -16,15 +16,21 @@
         <property><name>workflow_file</name></property>
         <property><name>start_time</name></property>
         <property><name>stop_time</name></property>
-        <property><name>datasets_raw_file</name></property>
+        <property><name>webrequest_raw_datasets_file</name></property>
         <property><name>webrequest_raw_data_directory</name></property>
+        <property><name>webrequest_datasets_file</name></property>
+        <property><name>webrequest_data_directory</name></property>
 
         <property><name>hive_site_xml</name></property>
         <property><name>add_partition_workflow_file</name></property>
-        <property><name>table</name></property>
+        <property><name>refinery_jar_version</name></property>
+        <property><name>artifacts_directory</name></property>
+        <property><name>webrequest_raw_table</name></property>
+        <property><name>webrequest_table</name></property>
         <property><name>statistics_table</name></property>
         <property><name>statistics_hourly_table</name></property>
         <property><name>webrequest_source</name></property>
+        <property><name>record_version</name></property>
         <property><name>data_loss_check_directory_base</name></property>
         <property><name>error_data_loss_threshold</name></property>
         <property><name>warning_data_loss_threshold</name></property>
@@ -76,17 +82,29 @@
 
     <datasets>
         <!--
-        Include the given $datasets_raw_file file.  This should
-        define the "webrequest_*_raw" datasets for this coordinator.
+        Include the given datasets files.  This should
+        define the "webrequest_*_raw" and "webrequest_*" datasets
         -->
-        <include>${datasets_raw_file}</include>
+        <include>${webrequest_raw_datasets_file}</include>
+        <include>${webrequest_datasets_file}</include>
     </datasets>
 
     <input-events>
-        <data-in name="input" dataset="webrequest_${webrequest_source}_raw_unchecked">
+        <data-in name="raw_unchecked_input" dataset="webrequest_${webrequest_source}_raw_unchecked">
             <instance>${coord:current(0)}</instance>
         </data-in>
     </input-events>
+
+    <output-events>
+        <data-out name="raw_output" dataset="webrequest_${webrequest_source}_raw">
+            <instance>${coord:current(0)}</instance>
+        </data-out>
+
+        <data-out name="refined_output" dataset="webrequest_${webrequest_source}">
+            <instance>${coord:current(0)}</instance>
+        </data-out>
+    </output-events>
+
 
     <action>
         <workflow>
@@ -112,8 +130,12 @@
                     <value>${coord:formatTime(coord:nominalTime(), "H")}</value>
                 </property>
                 <property>
-                    <name>location</name>
-                    <value>${coord:dataIn('input')}</value>
+                    <name>webrequest_raw_location</name>
+                    <value>${coord:dataOut('raw_output')}</value>
+                </property>
+                <property>
+                    <name>webrequest_location</name>
+                    <value>${coord:dataOut('refined_output')}</value>
                 </property>
             </configuration>
         </workflow>
@@ -123,7 +145,7 @@
                 to compute timeout
             -->
             <sla:nominal-time>${coord:actualTime()}</sla:nominal-time>
-            <sla:should-end>${3 * HOURS}</sla:should-end>
+            <sla:should-end>${4 * HOURS}</sla:should-end>
             <sla:alert-events>end_miss</sla:alert-events>
             <sla:alert-contact>${sla_alert_contact}</sla:alert-contact>
         </sla:info>
diff --git a/oozie/webrequest/refine/refine_webrequest.hql b/oozie/webrequest/load/refine_webrequest.hql
similarity index 100%
rename from oozie/webrequest/refine/refine_webrequest.hql
rename to oozie/webrequest/load/refine_webrequest.hql
diff --git a/oozie/webrequest/load/workflow.xml b/oozie/webrequest/load/workflow.xml
index 9ceb290..d1a5f82 100644
--- a/oozie/webrequest/load/workflow.xml
+++ b/oozie/webrequest/load/workflow.xml
@@ -28,8 +28,20 @@
             <description>hive-site.xml file path in HDFS</description>
         </property>
         <property>
-            <name>table</name>
-            <description>Hive table to partition.</description>
+            <name>refinery_jar_version</name>
+            <description>Version of the refinery-hive jar file to import for UDFs</description>
+        </property>
+        <property>
+            <name>artifacts_directory</name>
+            <description>Path in HDFS to artifacts. refinery-hive.jar should be here.</description>
+        </property>
+        <property>
+            <name>webrequest_raw_table</name>
+            <description>Raw webrequests hive table</description>
+        </property>
+        <property>
+            <name>webrequest_table</name>
+            <description>Webrequests hive table</description>
         </property>
         <property>
             <name>webrequest_source</name>
@@ -52,8 +64,16 @@
             <description>The partition's hour</description>
         </property>
         <property>
-            <name>location</name>
-            <description>HDFS path(s) naming the input dataset.</description>
+            <name>record_version</name>
+            <description>The record_version at the given moment</description>
+        </property>
+        <property>
+            <name>webrequest_raw_location</name>
+            <description>HDFS path(s) naming the raw webrequest dataset</description>
+        </property>
+        <property>
+            <name>webrequest_location</name>
+            <description>HDFS path(s) naming the webrequest dataset</description>
         </property>
         <property>
             <name>statistics_table</name>
@@ -104,6 +124,14 @@
             <propagate-configuration/>
             <configuration>
                 <property>
+                    <name>table</name>
+                    <value>${webrequest_raw_table}</value>
+                </property>
+                <property>
+                    <name>location</name>
+                    <value>${webrequest_raw_location}</value>
+                </property>
+                <property>
                     <name>partition_spec</name>
                     <value>webrequest_source='${webrequest_source}',year=${year},month=${month},day=${day},hour=${hour}</value>
                 </property>
@@ -124,7 +152,7 @@
             <configuration>
                 <property>
                     <name>directory</name>
-                    <value>${location}</value>
+                    <value>${webrequest_raw_location}</value>
                 </property>
                 <property>
                     <name>done_file</name>
@@ -159,7 +187,7 @@
 
             <script>generate_sequence_statistics.hql</script>
 
-            <param>source_table=${table}</param>
+            <param>source_table=${webrequest_raw_table}</param>
             <param>destination_table=${statistics_table}</param>
             <param>year=${year}</param>
             <param>month=${month}</param>
@@ -229,17 +257,73 @@
                 </property>
             </configuration>
         </sub-workflow>
-        <ok to="mark_dataset_done"/>
+        <ok to="mark_raw_dataset_done"/>
         <error to="kill"/>
     </action>
 
-    <action name="mark_dataset_done">
+    <action name="mark_raw_dataset_done">
         <sub-workflow>
             <app-path>${mark_directory_done_workflow_file}</app-path>
             <configuration>
                 <property>
                     <name>directory</name>
-                    <value>${location}</value>
+                    <value>${webrequest_raw_location}</value>
+                </property>
+            </configuration>
+        </sub-workflow>
+        <ok to="refine"/>
+        <error to="send_error_email"/>
+    </action>
+
+    <action name="refine">
+        <hive xmlns="uri:oozie:hive-action:0.2">
+            <job-tracker>${job_tracker}</job-tracker>
+            <name-node>${name_node}</name-node>
+            <job-xml>${hive_site_xml}</job-xml>
+            <configuration>
+                <property>
+                    <name>mapreduce.job.queuename</name>
+                    <value>${queue_name}</value>
+                </property>
+                <!--make sure oozie:launcher runs in a low priority queue -->
+                <property>
+                    <name>oozie.launcher.mapred.job.queue.name</name>
+                    <value>${oozie_launcher_queue_name}</value>
+                </property>
+                <property>
+                    <name>oozie.launcher.mapreduce.map.memory.mb</name>
+                    <value>${oozie_launcher_memory}</value>
+                </property>
+                <property>
+                    <name>hive.exec.scratchdir</name>
+                    <value>/tmp/hive-webrequest-refine</value>
+                </property>
+            </configuration>
+
+            <script>refine_webrequest.hql</script>
+            <param>refinery_jar_version=${refinery_jar_version}</param>
+            <param>artifacts_directory=${artifacts_directory}</param>
+            <param>source_table=${webrequest_raw_table}</param>
+            <param>destination_table=${webrequest_table}</param>
+            <param>webrequest_source=${webrequest_source}</param>
+            <param>record_version=${record_version}</param>
+            <param>year=${year}</param>
+            <param>month=${month}</param>
+            <param>day=${day}</param>
+            <param>hour=${hour}</param>
+        </hive>
+
+        <ok to="mark_refined_dataset_done"/>
+        <error to="send_error_email"/>
+    </action>
+
+    <action name="mark_refined_dataset_done">
+        <sub-workflow>
+            <app-path>${mark_directory_done_workflow_file}</app-path>
+            <configuration>
+                <property>
+                    <name>directory</name>
+                    <value>${webrequest_location}</value>
                 </property>
             </configuration>
         </sub-workflow>
diff --git a/oozie/webrequest/refine/README.md b/oozie/webrequest/refine/README.md
deleted file mode 100644
index 93c6a4a..0000000
--- a/oozie/webrequest/refine/README.md
+++ /dev/null
@@ -1,24 +0,0 @@
-# Refine phase for webrequest logs
-
-This job is responsible for the refine (ETL?) phase of
-webrequest logs.  It currently converts the raw JSON
-logs imported from Kafka into a clustered-bucketed table
-stored in Parquet format.
-
-# Outline
-
-* ```bundle.properties``` can be used to inject the whole refine
-  pipeline into oozie.
-* ```bundle.xml``` injects separate coordinators for each of the
-  webrequest_sources.
-* ```coordinator.xml``` injects a workflow for each dataset.
-* ```workflow.xml```
-  * Runs a hive query to convert from JSON into the refined data.
-
-Note that this job uses the checked dataset.  If a raw webrequest import
-does not have the _SUCCESS done-flag in the directory, the data for that
-hour will not be refined until it does.
-
-Please update the record_version if you change the table content definition
-and/or schema.
-_
diff --git a/oozie/webrequest/refine/bundle.properties b/oozie/webrequest/refine/bundle.properties
deleted file mode 100644
index 4c582c1..0000000
--- a/oozie/webrequest/refine/bundle.properties
+++ /dev/null
@@ -1,73 +0,0 @@
-# Configures a bundle to manage automatically refining partitions of a Hive
-# webrequest table.  Any of the following properties are overidable with -D.
-# Usage:
-#   oozie job -Duser=$USER -Dstart_time=2015-01-05T00:00Z -submit -config oozie/webrequest/refine/bundle.properties
-#
-# NOTE:  The $oozie_directory must be synced to HDFS so that all relevant
-#        .xml files exist there when this job is submitted.
-
-
-name_node                         = hdfs://analytics-hadoop
-job_tracker                       = resourcemanager.analytics.eqiad.wmnet:8032
-queue_name                        = default
-
-user                              = hdfs
-
-# Base path in HDFS to refinery.
-# When submitting this job for production, you should
-# override this to point directly at a deployed
-# directory name, and not the 'symbolic' 'current' directory.
-# E.g.  /wmf/refinery/2015-01-05T17.59.18Z--7bb7f07
-refinery_directory                = ${name_node}/wmf/refinery/current
-
-# HDFS path to artifacts that will be used by this job.
-# E.g. refinery-hive.jar should exist here.
-artifacts_directory               = ${refinery_directory}/artifacts
-
-# Base path in HDFS to oozie files.
-# Other files will be used relative to this path.
-oozie_directory                   = ${refinery_directory}/oozie
-
-# HDFS path to coordinator to run for each webrequest_source.
-coordinator_file                  = ${oozie_directory}/webrequest/refine/coordinator.xml
-
-# HDFS path to workflow to run.
-workflow_file                     = ${oozie_directory}/webrequest/refine/workflow.xml
-
-# HDFS path to webrequest dataset definitions
-datasets_raw_file                 = ${oozie_directory}/webrequest/datasets_raw.xml
-datasets_file                     = ${oozie_directory}/webrequest/datasets.xml
-
-# Initial import time of the webrequest dataset.
-start_time                        = 2015-01-01T00:00Z
-
-# Time to stop running this coordinator.  Year 3000 == never!
-stop_time                         = 3000-01-01T00:00Z
-
-# HDFS path to workflow to mark a directory as done
-mark_directory_done_workflow_file = ${oozie_directory}/util/mark_directory_done/workflow.xml
-
-# Workflow to send an error email
-send_error_email_workflow_file    = ${oozie_directory}/util/send_error_email/workflow.xml
-
-# HDFS path to hive-site.xml file.  This is needed to run hive actions.
-hive_site_xml                     = ${oozie_directory}/util/hive/hive-site.xml
-
-# Version of Hive UDF jar to import
-refinery_jar_version              = 0.0.30
-
-# Fully qualified Hive table name.
-source_table                      = wmf_raw.webrequest
-destination_table                 = wmf.webrequest
-
-# Record version to keep track of changes
-record_version                    = 0.0.15
-
-# HDFS path to directory where webrequest data is time bucketed.
-webrequest_raw_data_directory     = ${name_node}/wmf/data/raw/webrequest
-webrequest_data_directory         = ${name_node}/wmf/data/wmf/webrequest
-
-# Coordintator to start.
-oozie.bundle.application.path     = ${oozie_directory}/webrequest/refine/bundle.xml
-oozie.use.system.libpath          = true
-oozie.action.external.stats.write = true
diff --git a/oozie/webrequest/refine/bundle.xml b/oozie/webrequest/refine/bundle.xml
deleted file mode 100644
index 1e1cb14..0000000
--- a/oozie/webrequest/refine/bundle.xml
+++ /dev/null
@@ -1,71 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<bundle-app xmlns="uri:oozie:bundle:0.2"
-    name="webrequest-refine-bundle">
-
-    <parameters>
-
-        <!-- Required properties -->
-        <property><name>queue_name</name></property>
-        <property><name>coordinator_file</name></property>
-        <property><name>name_node</name></property>
-        <property><name>job_tracker</name></property>
-        <property><name>workflow_file</name></property>
-        <property><name>mark_directory_done_workflow_file</name></property>
-        <property><name>send_error_email_workflow_file</name></property>
-
-        <property><name>start_time</name></property>
-        <property><name>stop_time</name></property>
-
-        <property><name>webrequest_raw_data_directory</name></property>
-        <property><name>datasets_raw_file</name></property>
-        <property><name>webrequest_data_directory</name></property>
-        <property><name>datasets_file</name></property>
-
-        <property><name>hive_site_xml</name></property>
-        <property><name>refinery_jar_version</name></property>
-        <property><name>artifacts_directory</name></property>
-        <property><name>source_table</name></property>
-        <property><name>destination_table</name></property>
-        <property><name>record_version</name></property>
-    </parameters>
-
-    <coordinator name="webrequest-refine-coord-maps">
-        <app-path>${coordinator_file}</app-path>
-        <configuration>
-            <property>
-                <name>webrequest_source</name>
-                <value>maps</value>
-            </property>
-        </configuration>
-    </coordinator>
-
-    <coordinator name="webrequest-refine-coord-misc">
-        <app-path>${coordinator_file}</app-path>
-        <configuration>
-            <property>
-                <name>webrequest_source</name>
-                <value>misc</value>
-            </property>
-        </configuration>
-    </coordinator>
-
-    <coordinator name="webrequest-refine-coord-text">
-        <app-path>${coordinator_file}</app-path>
-        <configuration>
-            <property>
-                <name>webrequest_source</name>
-                <value>text</value>
-            </property>
-        </configuration>
-    </coordinator>
-
-    <coordinator name="webrequest-refine-coord-upload">
-        <app-path>${coordinator_file}</app-path>
-        <configuration>
-            <property>
-                <name>webrequest_source</name>
-                <value>upload</value>
-            </property>
-        </configuration>
-    </coordinator>
-</bundle-app>
diff --git a/oozie/webrequest/refine/coordinator.xml b/oozie/webrequest/refine/coordinator.xml
deleted file mode 100644
index 12c307f..0000000
--- a/oozie/webrequest/refine/coordinator.xml
+++ /dev/null
@@ -1,128 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<coordinator-app xmlns="uri:oozie:coordinator:0.4"
-    name="webrequest-refine-coord-${webrequest_source}"
-    frequency="${coord:hours(1)}"
-    start="${start_time}"
-    end="${stop_time}"
-    timezone="Universal">
-
-    <parameters>
-
-        <!-- Required properties -->
-        <property><name>queue_name</name></property>
-        <property><name>name_node</name></property>
-        <property><name>job_tracker</name></property>
-        <property><name>workflow_file</name></property>
-        <property><name>mark_directory_done_workflow_file</name></property>
-        <property><name>send_error_email_workflow_file</name></property>
-
-        <property><name>start_time</name></property>
-        <property><name>stop_time</name></property>
-
-        <property><name>webrequest_raw_data_directory</name></property>
-        <property><name>datasets_raw_file</name></property>
-        <property><name>webrequest_data_directory</name></property>
-        <property><name>datasets_file</name></property>
-
-        <property><name>hive_site_xml</name></property>
-        <property><name>refinery_jar_version</name></property>
-        <property><name>artifacts_directory</name></property>
-        <property><name>source_table</name></property>
-        <property><name>destination_table</name></property>
-        <property><name>webrequest_source</name></property>
-        <property><name>record_version</name></property>
-    </parameters>
-
-    <controls>
-        <!--
-        By having materialized jobs not timeout, we ease backfilling incidents
-        after recoverable hiccups on the dataset producers.
-        -->
-        <timeout>-1</timeout>
-
-        <!--
-        Refining is not too cheap, so we limit
-        concurrency.
-
-        Note, that this is per coordinator. So if we run this
-        coordinator for say 4 webrequest_sources (see bundle.xml :-)),
-        we effectively compute sequence statistics for up to 8
-        datasets in parallel.
-
-        Also note, that back-filling is not limited by the
-        coordinator's frequency, so back-filling works nicely
-        even-though the concurrency is low.
-        -->
-        <concurrency>2</concurrency>
-
-
-        <!--
-        Since we expect only one incarnation per hourly dataset, the
-        default throttle of 12 is way to high, and there is not need
-        to keep that many materialized jobs around.
-
-        By resorting to 2, we keep the hdfs checks on the datasets
-        low, while still being able to easily feed the concurrency.
-        -->
-        <throttle>2</throttle>
-    </controls>
-
-    <datasets>
-        <!--
-        Include both raw and refined datasets files.
-        $datasets_raw_file will be used as the input events,
-        and $datasets_file will be used to determine output
-        location in which to add a done-flag.
-        -->
-        <include>${datasets_raw_file}</include>
-        <include>${datasets_file}</include>
-    </datasets>
-
-    <input-events>
-        <!--
-        For now, since we definitly want the refined data to exist, even if
-        there is some faulty data (e.g. missing or duplicate), we rely on the
-        *_partioned datasets.  This means that data in the refined webrequest
-        table may be missing some data.  Be warned!  Check the
-        wmf_raw.webrequest_sequence_stats table if you are unsure of the
-        quality of this data.
-        -->
-        <data-in name="raw_input" dataset="webrequest_${webrequest_source}_raw">
-            <instance>${coord:current(0)}</instance>
-        </data-in>
-    </input-events>
-
-    <output-events>
-        <data-out name="refined_output" dataset="webrequest_${webrequest_source}">
-            <instance>${coord:current(0)}</instance>
-        </data-out>
-    </output-events>
-
-    <action>
-        <workflow>
-            <app-path>${workflow_file}</app-path>
-            <configuration>
-                <property>
-                    <name>year</name>
-                    <value>${coord:formatTime(coord:nominalTime(), "y")}</value>
-                </property>
-                <property>
-                    <name>month</name>
-                    <value>${coord:formatTime(coord:nominalTime(), "M")}</value>
-                </property>
-                <property>
-                    <name>day</name>
-                    <value>${coord:formatTime(coord:nominalTime(), "d")}</value>
-                </property>
-                <property>
-                    <name>hour</name>
-                    <value>${coord:formatTime(coord:nominalTime(), "H")}</value>
-                </property>
-                <property>
-                    <name>destination_dataset_directory</name>
-                    <value>${coord:dataOut('refined_output')}</value>
-                </property>
-            </configuration>
-        </workflow>
-    </action>
-</coordinator-app>
diff --git a/oozie/webrequest/refine/workflow.xml b/oozie/webrequest/refine/workflow.xml
deleted file mode 100644
index 7a9dcad..0000000
--- a/oozie/webrequest/refine/workflow.xml
+++ /dev/null
@@ -1,176 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<workflow-app xmlns="uri:oozie:workflow:0.4"
-    name="webrequest-refine-wf-${webrequest_source}-${year}-${month}-${day}-${hour}">
-
-    <parameters>
-
-        <!-- Default values for inner oozie settings -->
-        <property>
-            <name>oozie_launcher_queue_name</name>
-            <value>${queue_name}</value>
-        </property>
-        <property>
-            <name>oozie_launcher_memory</name>
-            <value>256</value>
-        </property>
-
-        <!-- Required properties -->
-        <property><name>queue_name</name></property>
-        <property><name>name_node</name></property>
-        <property><name>job_tracker</name></property>
-
-        <property>
-            <name>hive_script</name>
-            <!-- This is relative to the containing directory of this file. -->
-            <value>refine_webrequest.hql</value>
-            <description>Hive script to run.</description>
-        </property>
-
-        <property>
-            <name>hive_site_xml</name>
-            <description>hive-site.xml file path in HDFS</description>
-        </property>
-        <property>
-            <name>refinery_jar_version</name>
-            <description>Version of the refinery-hive jar file to import for UDFs</description>
-        </property>
-        <property>
-            <name>artifacts_directory</name>
-            <description>Path in HDFS to artifacts.  refinery-hive.jar should be here.</description>
-        </property>
-        <property>
-            <name>source_table</name>
-            <description>Hive table to refine</description>
-        </property>
-        <property>
-            <name>destination_table</name>
-            <description>The destinaton table to store refined data in.</description>
-        </property>
-        <property>
-            <name>webrequest_source</name>
-            <description>The partition's webrequest_source</description>
-        </property>
-        <property>
-            <name>record_version</name>
-            <description>The record_version at the given moment</description>
-        </property>
-        <property>
-            <name>year</name>
-            <description>The partition's year</description>
-        </property>
-        <property>
-            <name>month</name>
-            <description>The partition's month</description>
-        </property>
-        <property>
-            <name>day</name>
-            <description>The partition's day</description>
-        </property>
-        <property>
-            <name>hour</name>
-            <description>The partition's hour</description>
-        </property>
-        <property>
-            <name>mark_directory_done_workflow_file</name>
-            <description>Workflow for marking a directory done</description>
-        </property>
-        <property>
-            <name>send_error_email_workflow_file</name>
-            <description>Workflow for sending an error email</description>
-        </property>
-        <property>
-            <name>destination_dataset_directory</name>
-            <description>Directory to generate the done flag in</description>
-        </property>
-    </parameters>
-
-    <start to="refine"/>
-
-    <action name="refine">
-        <hive xmlns="uri:oozie:hive-action:0.2">
-            <job-tracker>${job_tracker}</job-tracker>
-            <name-node>${name_node}</name-node>
-            <job-xml>${hive_site_xml}</job-xml>
-            <configuration>
-                <property>
-                    <name>mapreduce.job.queuename</name>
-                    <value>${queue_name}</value>
-                </property>
-                <!--make sure oozie:launcher runs in a low priority queue -->
-                <property>
-                    <name>oozie.launcher.mapred.job.queue.name</name>
-                    <value>${oozie_launcher_queue_name}</value>
-                </property>
-                <property>
-                    <name>oozie.launcher.mapreduce.map.memory.mb</name>
-                    <value>${oozie_launcher_memory}</value>
-                </property>
-                <property>
-                    <name>hive.exec.scratchdir</name>
-                    <value>/tmp/hive-${user}</value>
-                </property>
-            </configuration>
-
-            <script>${hive_script}</script>
-            <param>refinery_jar_version=${refinery_jar_version}</param>
-            <param>artifacts_directory=${artifacts_directory}</param>
-            <param>source_table=${source_table}</param>
-            <param>destination_table=${destination_table}</param>
-            <param>webrequest_source=${webrequest_source}</param>
-            <param>record_version=${record_version}</param>
-            <param>year=${year}</param>
-            <param>month=${month}</param>
-            <param>day=${day}</param>
-            <param>hour=${hour}</param>
-        </hive>
-
-        <ok to="mark_dataset_done"/>
-        <error to="send_error_email"/>
-    </action>
-
-    <action name="mark_dataset_done">
-        <sub-workflow>
-            <app-path>${mark_directory_done_workflow_file}</app-path>
-            <configuration>
-                <property>
-                    <name>directory</name>
-                    <value>${destination_dataset_directory}</value>
-                </property>
-            </configuration>
-        </sub-workflow>
-        <ok to="end"/>
-        <error to="send_error_email"/>
-    </action>
-
-    <action name="send_error_email">
-        <sub-workflow>
-            <app-path>${send_error_email_workflow_file}</app-path>
-            <propagate-configuration/>
-            <configuration>
-                <property>
-                    <name>parent_name</name>
-                    <value>${wf:name()}</value>
-                </property>
-                <property>
-                    <name>parent_failed_action</name>
-                    <value>${wf:lastErrorNode()}</value>
-                </property>
-                <property>
-                    <name>parent_error_code</name>
-                    <value>${wf:errorCode(wf:lastErrorNode())}</value>
-                </property>
-                <property>
-                    <name>parent_error_message</name>
-                    <value>${wf:errorMessage(wf:lastErrorNode())}</value>
-                </property>
-            </configuration>
-        </sub-workflow>
-        <ok to="kill"/>
-        <error to="kill"/>
-    </action>
-
-    <kill name="kill">
-        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
-    </kill>
-    <end name="end"/>
-</workflow-app>

-- 
To view, visit https://gerrit.wikimedia.org/r/285998
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Id8c8cfdca786a87a8757881d076ada679ddcf069
Gerrit-PatchSet: 4
Gerrit-Project: analytics/refinery
Gerrit-Branch: master
Gerrit-Owner: Joal <[email protected]>
Gerrit-Reviewer: Elukey <[email protected]>
Gerrit-Reviewer: Joal <[email protected]>
Gerrit-Reviewer: Nuria <[email protected]>
Gerrit-Reviewer: Ottomata <[email protected]>
