Hi Oliver, On Sun, Feb 22, 2015 at 06:22:25PM -0500, Oliver Keyes wrote: > This was narrowed > down to 27 January, and thence to 22:00-23:00 UTC on 27 January,[1] > and a breakpoint was then seen at approximately 22:35 UTC.[2]
Yay for isolating the time. > [ We see duplicates! ] How and where did you detect the duplicates? How can one reproduce your observation? > As I understand it (and > I really hope I'm wrong about this), each hostname issues a > unique-to-the-host sequence number with each request, incrementing > each time. The sequence numbers are not unique per host. For varnishes, the sequence numbers should be unique per tuple: (host, service_start_time, sequence_number_overflow_count, logging_backend). For nginx (back then still available on udp2log IIRC), the numbers are not unique at all. Have fun, Christian P.S.: I had a look at the relevant hour in the wmf.webrequest table, and I could not detect duplicates [*]. P.P.S.: On 2015-01-27, GHOST [**] got patched on WMF servers. That accounted for quite some restarts and config changes. So expect many sequence number resets during that day. [*] (See the one_hour_stats_all_sources.hql attachment) _________________________________________________________________ qchris@stat1002 // jobs: 0 // time: 10:54:08 // exit code: 0 cwd: ~ hive -f one_hour_stats_all_sources.hql -d table=wmf.webrequest -d year=2015 -d month=1 -d day=27 -d hour=22 [...] all_hosts.hostname [...] all_hosts.count_duplicate [...] cp4016.ulsfo.wmnet [...] 0 [...] amssq46.esams.wmnet [...] 0 [...] cp3005.esams.wikimedia.org [...] 0 [...] amssq59.esams.wikimedia.org [...] 0 [...] [**] https://access.redhat.com/articles/1332213 -- ---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ---- Companies' registry: 360296y in Linz Christian Aistleitner Kefermarkterstrasze 6a/3 Email: [email protected] 4293 Gutau, Austria Phone: +43 7946 / 20 5 81 Fax: +43 7946 / 20 5 81 Homepage: http://quelltextlich.at/ ---------------------------------------------------------------
-- hive -f one_hour_stats_all_sources.hql -d table=wmf_raw.webrequest -d
year=2014 -d month=10 -d day=9 -d hour=17
SELECT * FROM (
SELECT
hostname,
MIN(sequence) AS
sequence_min,
MAX(sequence) AS
sequence_max,
COUNT(*) AS
count_actual,
MAX(sequence) - MIN(sequence) + 1 AS
count_expected,
MAX(sequence) - MIN(sequence) + 1 - COUNT(hostname) AS
count_different,
COUNT(*) - COUNT(DISTINCT sequence) AS
count_duplicate,
SUM(if(sequence IS NULL,1,0)) AS
count_null_sequence,
((COUNT(*) / (MAX(sequence) - MIN(sequence) + 1)) - 1) * 100.0 AS
percent_different
FROM
${table}
WHERE
year=${year} AND month=${month} AND day=${day} AND hour=${hour}
GROUP BY
hostname, webrequest_source, year, month, day
) all_hosts WHERE (
count_duplicate != 0 -- Host has duplicates
OR
count_different != 0 -- Host has duplicates or holes
OR
count_null_sequence != 0 -- Host has NULL sequence numbers
);
signature.asc
Description: Digital signature
_______________________________________________ Analytics mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/analytics
