Nuria has submitted this change and it was merged. ( 
https://gerrit.wikimedia.org/r/351613 )

Change subject: Update daily unique devices druid loading job
......................................................................


Update daily unique devices druid loading job

Current job loads every row, while we documented that projects
with less than 1000 uniques shouldn't be taken into account because
of their variability.

This patch prevent those small-number-of-uniques projects to be loaded.

See:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Unique_Devices/Last_access_solution#Data_Quality_Analysis

Bug: T164183

Change-Id: I4a8359b53d0d37063844304d19a34025a05aab5b
---
M oozie/last_access_uniques/druid/daily/generate_daily_druid_uniques.hql
1 file changed, 16 insertions(+), 0 deletions(-)

Approvals:
  Nuria: Verified; Looks good to me, approved



diff --git 
a/oozie/last_access_uniques/druid/daily/generate_daily_druid_uniques.hql 
b/oozie/last_access_uniques/druid/daily/generate_daily_druid_uniques.hql
index 1d670d4..f6e0870 100644
--- a/oozie/last_access_uniques/druid/daily/generate_daily_druid_uniques.hql
+++ b/oozie/last_access_uniques/druid/daily/generate_daily_druid_uniques.hql
@@ -33,6 +33,20 @@
 LOCATION '${destination_directory}';
 
 
+WITH filtered_hosts AS (
+    SELECT
+        uri_host as filtered_host,
+        SUM(uniques_estimate) as checked_uniques_estimate
+    FROM ${source_table}
+    WHERE year=${year}
+      AND month=${month}
+      AND day=${day}
+    GROUP BY
+        uri_host
+    HAVING
+        SUM(uniques_estimate) >= 1000
+)
+
 INSERT OVERWRITE TABLE tmp_daily_druid_uniques_${year}_${month}_${day}
 SELECT
     CONCAT(
@@ -46,6 +60,8 @@
     uniques_offset AS uniques_offset,
     uniques_estimate AS uniques_estimate
 FROM ${source_table}
+    INNER JOIN filtered_hosts
+        ON uri_host = filtered_host
 WHERE year = ${year}
     AND month = ${month}
     AND day = ${day};

-- 
To view, visit https://gerrit.wikimedia.org/r/351613
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I4a8359b53d0d37063844304d19a34025a05aab5b
Gerrit-PatchSet: 2
Gerrit-Project: analytics/refinery
Gerrit-Branch: master
Gerrit-Owner: Joal <[email protected]>
Gerrit-Reviewer: Nuria <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits

Reply via email to