Mforns has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/271253

Change subject: Improve the format of the browser report
......................................................................

Improve the format of the browser report

Makes the browser report output to an external table instead of
a set of files. Also splits the data into some new columns and
changes the percentage to an absolute value of view counts.

In addition to that, adds the long tail of the report (that was
previously cut off) to the data by collapsing it into a single
row where all dimension values are 'other'. Thus preserving
privacy and ensuring that the data doesn't get too big.

Bug: T126282
Change-Id: I19074c3f449a81c0eeeeb666964bbcfae0f05363
---
A hive/browser/create_browser_general_table.hql
M oozie/browser/general/README.md
M oozie/browser/general/browser_general.hql
M oozie/browser/general/coordinator.properties
M oozie/browser/general/coordinator.xml
M oozie/browser/general/workflow.xml
6 files changed, 217 insertions(+), 113 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/analytics/refinery 
refs/changes/53/271253/1

diff --git a/hive/browser/create_browser_general_table.hql 
b/hive/browser/create_browser_general_table.hql
new file mode 100644
index 0000000..4e39c2c
--- /dev/null
+++ b/hive/browser/create_browser_general_table.hql
@@ -0,0 +1,38 @@
+-- Create table statement for browser general data.
+--
+-- This is an intermediate table that serves as a base for various
+-- traffic reports, i.e.: mobile web browser breakdown, desktop os
+-- breakdown, or desktop+mobile web os+browser breakdown, etc.
+-- It has a weekly granularity, is partitioned by year, month and day,
+-- and is an external table stored as tsv files.
+--
+-- Note that the long tail of the table (meaning the rows that have
+-- a view_count relatively smaller than a given threshold) are collapsed
+-- into a single row with all dimension values equal to 'Other'
+-- (some columns may have a default value different from 'Other').
+-- This ensures the data in this table is not privacy sensitive,
+-- and that the size of the files keeps considerably small.
+--
+-- Usage
+--     hive -f create_browser_general_table.hql --database wmf
+--
+
+CREATE EXTERNAL TABLE IF NOT EXISTS `browser_general`(
+    `access_method`     string     COMMENT '(desktop|mobile web|mobile app)',
+    `os_family`         string     COMMENT 'OS family: Windows, Android, etc.',
+    `os_major`          string     COMMENT 'OS major version: 8, 10, etc.',
+    `browser_family`    string     COMMENT 'Browser family: Chrome, Safari, 
etc.',
+    `browser_major`     string     COMMENT 'Browser major version: 47, 11, 
etc.',
+    `view_count`        bigint     COMMENT 'Number of pageviews.'
+)
+PARTITIONED BY (
+    `year`              int     COMMENT 'Unpadded year of request',
+    `month`             int     COMMENT 'Unpadded month of request',
+    `day`               int     COMMENT 'Unpadded day of request'
+)
+ROW FORMAT DELIMITED
+FIELDS TERMINATED BY '\t'
+LINES TERMINATED BY '\n'
+STORED AS TEXTFILE
+LOCATION '/wmf/data/archive/browser/general'
+;
diff --git a/oozie/browser/general/README.md b/oozie/browser/general/README.md
index d2388e3..7f65450 100644
--- a/oozie/browser/general/README.md
+++ b/oozie/browser/general/README.md
@@ -1,25 +1,21 @@
 # Browser usage
 
-This job computes weekly browser usage reports from the table:
-pageview_hourly. It generates 2 TSV datasets: mobile report, and
-desktop+mobile report.
-
-Output is archived in the folders:
-```archive/browser/general/mobile_web-{year}-{month}-{day}``` and
-```archive/browser/general/desktop_and_mobile_web-{year}-{month}-{day}```
+This job computes weekly browser usage stats from pageview_hourly.
+The results serve as an intermediate table for various traffic reports,
+i.e.: mobile web browser breakdown, desktop os breakdown, or
+desktop+mobile web os+browser breakdown, etc. Output is stored in
+the table: wmf.browser_general
 
 # Outline
 
 * ```browser_general.hql``` is the hive query that collects the
-  data from the tables, aggregates it, and writes the reports in
-  the given destination directory. It is actually a template with
-  some dynamic parameters: input and output paths, time info and
-  a filter for access methods to include in the report.
+  data from the pageview_hourly, aggregates it, and writes the
+  results into the given destination table. It is actually a
+  template with some dynamic parameters passed in by the workflow.
 
-* ```workflow.xml``` declares the actions that oozie will take
-  when calling the hive query. In this case, it executes the
-  hive query twice, once for mobile and once for desktop+mobile.
-  It also sets up some oozie-specific configuration parameters.
+* ```workflow.xml``` sets up some oozie-specific configuration
+  parameters, and calls the hql query, passing in the necessary
+  parameters. Also defines what to do in calse of query failure.
 
 * ```coordinator.xml``` determines when the workflow shoud be
   executed and on which frequency. In this case, it depends on
diff --git a/oozie/browser/general/browser_general.hql 
b/oozie/browser/general/browser_general.hql
index 93cdc7b..f42653a 100644
--- a/oozie/browser/general/browser_general.hql
+++ b/oozie/browser/general/browser_general.hql
@@ -1,10 +1,14 @@
-
+-- Hql query for browser general data.
+--
+-- Updates the external table browser_general with traffic stats broken down
+-- by access_method, os, browser and other dimensions. This table serves as
+-- intermediate data source for various traffic reports, i.e.: mobile web
+-- browser breakdown, desktop os breakdown, or desktop+mobile web os+browser
+-- breakdown, etc.
+--
 -- Parameters:
 --     projectview_source       -- Table containing hourly projectviews.
 --     pageview_source          -- Table containing hourly pageviews.
---     destination_directory    -- Directory where to write the report.
---     access_methods           -- Comma-separated list of access methods
---                                 to include: 'xxx', 'yyy', ...
 --     year                     -- Year of interval's start date.
 --     month                    -- Month of interval's start date.
 --     day                      -- Day of interval's start date.
@@ -12,50 +16,56 @@
 --                                 date will be calculated adding this to
 --                                 the start date. Start date is included
 --                                 in the report, but end date is not.
+--     os_family_other          -- Default other value for os family.
+--     os_major_other           -- Default other value for os major.
+--     browser_family_other     -- Default other value for browser family.
+--     browser_major_other      -- Default other value for browser major.
+--     destination_table        -- Table where to write the report.
 --
 -- Usage:
---     hive -f browser_general.hql                             \
---         -d projectview_source=wmf.projectview_hourly        \
---         -d pageview_source=wmf.pageview_hourly              \
---         -d destination_directory=/tmp/foo                   \
---         -d access_methods='desktop'                         \
---         -d year=2015                                        \
---         -d month=10                                         \
---         -d day=11                                           \
+--     hive -f browser_general.hql
+--         -d projectview_source=wmf.projectview_hourly
+--         -d pageview_source=wmf.pageview_hourly
+--         -d year=2016
+--         -d month=1
+--         -d day=1
 --         -d time_window=7
---
+--         -d os_family_other=Other
+--         -d os_major_other=-
+--         -d browser_family_other=Other
+--         -d browser_major_other=-
+--         -d destination_table=wmf.browser_general
 
--- Permits cartesian join of small enough table.
-SET hive.mapred.mode = nonstrict;
 
-SET start_date = CONCAT('${year}-', LPAD(${month}, 2, '0'), '-', LPAD(${day}, 
2, '0'));
+SET mapred.reduce.tasks = 1; -- Force 1 reducer to output to a single file.
+SET hive.mapred.mode = nonstrict; -- Permits cartesian join of small enough 
table.
+SET start_date = CONCAT(${year}, '-', LPAD(${month}, 2, '0'), '-', 
LPAD(${day}, 2, '0'));
 SET end_date   = DATE_ADD(${hiveconf:start_date}, ${time_window});
 
-WITH total AS (
-    SELECT
-        SUM(view_count) as view_count_total
-    FROM
-        ${projectview_source}
-    WHERE
-        CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= 
${hiveconf:start_date}
-        AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) < 
${hiveconf:end_date}
-        AND agent_type = 'user'
-        AND access_method IN (${access_methods})
-)
-INSERT OVERWRITE DIRECTORY '${destination_directory}'
-    SELECT
-        tsv_line
-    FROM (
+WITH
+    total AS (
+        -- This select returns 1 row with the total view counts for the whole 
data set.
         SELECT
-            -- Since "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'" only
-            -- works for hive>=1.2.0 (see HIVE-5672), we have to prepare the
-            -- lines by hand through concatenation.
-            CONCAT_WS(
-                '\t',
-                CONCAT(user_agent_map['os_family'], ' ', 
user_agent_map['os_major']),
-                CONCAT(user_agent_map['browser_family'], ' ', 
user_agent_map['browser_major']),
-                CAST(ROUND(SUM(view_count) * 100 / total.view_count_total, 2) 
AS string)
-            ) AS tsv_line,
+            SUM(view_count) as view_count_total
+        FROM
+            ${projectview_source}
+        WHERE
+            CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= 
${hiveconf:start_date}
+            AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) 
< ${hiveconf:end_date}
+            AND agent_type = 'user'
+            AND access_method IN ('desktop', 'mobile web')
+    ),
+    stats AS (
+        -- This select calculates the main stats (view count sum and percent)
+        -- over the data set, grouped by os/browser family and major. The 
percent
+        -- is calculated joining (cartesian) with the total CTE.
+        SELECT
+            access_method,
+            user_agent_map['os_family'] AS os_family,
+            user_agent_map['os_major'] AS os_major,
+            user_agent_map['browser_family'] AS browser_family,
+            user_agent_map['browser_major'] AS browser_major,
+            SUM(view_count) AS view_count,
             SUM(view_count) * 100 / total.view_count_total AS percent
         FROM
             ${pageview_source}
@@ -64,13 +74,81 @@
             CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= 
${hiveconf:start_date}
             AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) 
< ${hiveconf:end_date}
             AND agent_type = 'user'
-            AND access_method IN (${access_methods})
+            AND access_method IN ('desktop', 'mobile web')
         GROUP BY
-            CONCAT(user_agent_map['os_family'], ' ', 
user_agent_map['os_major']),
-            CONCAT(user_agent_map['browser_family'], ' ', 
user_agent_map['browser_major']),
+            access_method,
+            user_agent_map['os_family'],
+            user_agent_map['os_major'],
+            user_agent_map['browser_family'],
+            user_agent_map['browser_major'],
             total.view_count_total
-        HAVING
-            (SUM(view_count) * 100 / total.view_count_total) > 0.1
-        ORDER BY percent DESC
-    ) AS tsv_lines
+    ),
+    significant_stats AS (
+        -- These are the rows whose percentage is higher than a given 
threshold.
+        -- The long tail is cut out using the percent field.
+        SELECT
+            access_method,
+            os_family,
+            os_major,
+            browser_family,
+            browser_major,
+            view_count
+        FROM stats
+        WHERE
+            os_family <> ${os_family_other}
+            AND os_major <> ${os_major_other}
+            AND browser_family <> ${browser_family_other}
+            AND browser_major <> ${browser_major_other}
+            AND percent > 0.1
+    ),
+    other_stats AS (
+        -- And these are the rows belonging to the long tail, with the
+        -- percent lower than the threshold. They are grouped all together
+        -- under the specified default other values.
+        SELECT
+            access_method,
+            ${os_family_other} AS os_family,
+            ${os_major_other} AS os_major,
+            ${browser_family_other} AS browser_family,
+            ${browser_major_other} AS browser_major,
+            SUM(view_count) AS view_count
+        FROM stats
+        WHERE
+            os_family == ${os_family_other}
+            AND os_major == ${os_major_other}
+            AND browser_family == ${browser_family_other}
+            AND browser_major == ${browser_major_other}
+            OR percent <= 0.1
+        GROUP BY
+            access_method,
+            ${os_family_other},
+            ${os_major_other},
+            ${browser_family_other},
+            ${browser_major_other}
+    ),
+    final_stats AS (
+        -- The final stats include both the significant stats and the
+        -- long tail stats grouped under the 'other' terms.
+        SELECT *
+        FROM significant_stats
+        UNION ALL
+        SELECT *
+        FROM other_stats
+    )
+
+INSERT OVERWRITE TABLE ${destination_table}
+    PARTITION(year=${year}, month=${month}, day=${day})
+SELECT
+    access_method,
+    os_family,
+    os_major,
+    browser_family,
+    browser_major,
+    CAST(ROUND(view_count, 2) AS string)
+FROM (
+    -- Order the results by view count, from high to low.
+    SELECT *
+    FROM final_stats
+    ORDER BY view_count DESC
+) AS ordered_final_stats
 ;
diff --git a/oozie/browser/general/coordinator.properties 
b/oozie/browser/general/coordinator.properties
index 5cf8c64..18cf4d2 100644
--- a/oozie/browser/general/coordinator.properties
+++ b/oozie/browser/general/coordinator.properties
@@ -4,8 +4,8 @@
 #   oozie job -Duser=$USER -Dstart_time=2015-10-11T00:00Z -submit \
 #       -config oozie/browser/general/coordinator.properties
 #
-# NOTE:  The $oozie_directory must be synced to HDFS so that all
-#        relevant .xml files exist there when this job is submitted.
+# NOTE: The $oozie_directory must be synced to HDFS so that all
+#       relevant .xml files exist there when this job is submitted.
 
 name_node                         = hdfs://analytics-hadoop
 job_tracker                       = resourcemanager.analytics.eqiad.wmnet:8032
@@ -37,8 +37,8 @@
 pageview_datasets_file            = ${oozie_directory}/pageview/datasets.xml
 pageview_data_directory           = ${name_node}/wmf/data/wmf/pageview
 
-# HDFS path where to write the reports.
-output_base_path                  = 
${name_node}/wmf/data/archive/browser/general
+# Hive table where to write the reports.
+destination_table                 = wmf.browser_general
 
 # Initial import time of the datasets.
 start_time                        = 2015-10-11T00:00Z
@@ -58,6 +58,12 @@
 # Fully qualified Hive table name for pageviews.
 pageview_source                   = wmf.pageview_hourly
 
+# Default 'other' values to use when collapsing the long tail.
+os_family_other                   = Other
+os_major_other                    = -
+bowser_family_other               = Other
+bowser_major_other                = -
+
 # Coordinator to start.
 oozie.coord.application.path      = ${coordinator_file}
 oozie.use.system.libpath          = true
diff --git a/oozie/browser/general/coordinator.xml 
b/oozie/browser/general/coordinator.xml
index 88b5b78..af8bb58 100644
--- a/oozie/browser/general/coordinator.xml
+++ b/oozie/browser/general/coordinator.xml
@@ -27,7 +27,12 @@
         <property><name>hive_site_xml</name></property>
         <property><name>projectview_source</name></property>
         <property><name>pageview_source</name></property>
-        <property><name>output_base_path</name></property>
+        <property><name>destination_table</name></property>
+
+        <property><name>os_family_other</name></property>
+        <property><name>os_major_other</name></property>
+        <property><name>browser_family_other</name></property>
+        <property><name>browser_major_other</name></property>
     </parameters>
 
     <controls>
diff --git a/oozie/browser/general/workflow.xml 
b/oozie/browser/general/workflow.xml
index 814553d..bccf98f 100644
--- a/oozie/browser/general/workflow.xml
+++ b/oozie/browser/general/workflow.xml
@@ -38,8 +38,8 @@
             <description>Table containing hourly pageviews.</description>
         </property>
         <property>
-            <name>output_base_path</name>
-            <description>Directory where to write the reports.</description>
+            <name>destination_table</name>
+            <description>Table where to write the reports.</description>
         </property>
         <property>
             <name>output_timestamp</name>
@@ -61,15 +61,32 @@
             <name>time_window</name>
             <description>Length of interval in days.</description>
         </property>
+        <property>
+            <name>os_family_other</name>
+            <description>Default other value for os family.</description>
+        </property>
+        <property>
+            <name>os_major_other</name>
+            <description>Default other value for os major.</description>
+        </property>
+        <property>
+            <name>browser_family_other</name>
+            <description>Default other value for browser family.</description>
+        </property>
+        <property>
+            <name>browser_major_other</name>
+            <description>Default other value for browser major.</description>
+        </property>
     </parameters>
 
-    <start to="compute_mobile_web"/>
+    <start to="compute_browser_general"/>
 
-    <action name="compute_mobile_web">
+    <action name="compute_browser_general">
         <hive xmlns="uri:oozie:hive-action:0.2">
             <job-tracker>${job_tracker}</job-tracker>
             <name-node>${name_node}</name-node>
             <job-xml>${hive_site_xml}</job-xml>
+
             <configuration>
                 <property>
                     <name>mapreduce.job.queuename</name>
@@ -93,52 +110,15 @@
             <script>${hive_script}</script>
             <param>projectview_source=${projectview_source}</param>
             <param>pageview_source=${pageview_source}</param>
-            
<param>destination_directory=${output_base_path}/mobile_web-${output_timestamp}</param>
-            <param>access_methods='mobile web'</param>
             <param>year=${year}</param>
             <param>month=${month}</param>
             <param>day=${day}</param>
             <param>time_window=${time_window}</param>
-        </hive>
-
-        <ok to="compute_desktop_and_mobile_web"/>
-        <error to="kill"/>
-    </action>
-
-    <action name="compute_desktop_and_mobile_web">
-        <hive xmlns="uri:oozie:hive-action:0.2">
-            <job-tracker>${job_tracker}</job-tracker>
-            <name-node>${name_node}</name-node>
-            <job-xml>${hive_site_xml}</job-xml>
-            <configuration>
-                <property>
-                    <name>mapreduce.job.queuename</name>
-                    <value>${queue_name}</value>
-                </property>
-                <!--make sure oozie:launcher runs in a low priority queue -->
-                <property>
-                    <name>oozie.launcher.mapred.job.queue.name</name>
-                    <value>${oozie_launcher_queue_name}</value>
-                </property>
-                <property>
-                    <name>oozie.launcher.mapreduce.map.memory.mb</name>
-                    <value>${oozie_launcher_memory}</value>
-                </property>
-                <property>
-                    <name>hive.exec.scratchdir</name>
-                    <value>/tmp/hive-${user}</value>
-                </property>
-            </configuration>
-
-            <script>${hive_script}</script>
-            <param>projectview_source=${projectview_source}</param>
-            <param>pageview_source=${pageview_source}</param>
-            
<param>destination_directory=${output_base_path}/desktop_and_mobile_web-${output_timestamp}</param>
-            <param>access_methods='desktop','mobile web'</param>
-            <param>year=${year}</param>
-            <param>month=${month}</param>
-            <param>day=${day}</param>
-            <param>time_window=${time_window}</param>
+            <param>os_family_other=${os_family_other}</param>
+            <param>os_major_other=${os_major_other}</param>
+            <param>browser_family_other=${browser_family_other}</param>
+            <param>browser_major_other=${browser_major_other}</param>
+            <param>destination_table=${destination_table}</param>
         </hive>
 
         <ok to="end"/>
@@ -148,5 +128,6 @@
     <kill name="kill">
         <message>Action failed, error 
message[${wf:errorMessage(wf:lastErrorNode())}]</message>
     </kill>
+
     <end name="end"/>
 </workflow-app>

-- 
To view, visit https://gerrit.wikimedia.org/r/271253
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I19074c3f449a81c0eeeeb666964bbcfae0f05363
Gerrit-PatchSet: 1
Gerrit-Project: analytics/refinery
Gerrit-Branch: master
Gerrit-Owner: Mforns <mfo...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits

Reply via email to