Hi Oliver,

On Sun, Feb 22, 2015 at 06:22:25PM -0500, Oliver Keyes wrote:
> This was narrowed
> down to 27 January, and thence to 22:00-23:00 UTC on 27 January,[1]
> and a breakpoint was then seen at approximately 22:35 UTC.[2]

Yay for isolating the time.



> [ We see duplicates! ]

How and where did you detect the duplicates?
How can one reproduce your observation?



> As I understand it (and
> I really hope I'm wrong about this), each hostname issues a
> unique-to-the-host sequence number with each request, incrementing
> each time.

The sequence numbers are not unique per host.

For varnishes, the sequence numbers should be unique per tuple:
  (host, service_start_time, sequence_number_overflow_count, logging_backend).

For nginx (back then still available on udp2log IIRC), the numbers are
not unique at all.



Have fun,
Christian



P.S.: I had a look at the relevant hour in the wmf.webrequest table,
and I could not detect duplicates [*].

P.P.S.: On 2015-01-27, GHOST [**] got patched on WMF servers. That
accounted for quite some restarts and config changes. So expect many
sequence number resets during that day.



[*]
(See the one_hour_stats_all_sources.hql attachment)
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 10:54:08 // exit code: 0
cwd: ~
hive -f one_hour_stats_all_sources.hql -d table=wmf.webrequest -d year=2015 -d 
month=1 -d day=27 -d hour=22
[...]
all_hosts.hostname      [...]       all_hosts.count_duplicate       [...]
cp4016.ulsfo.wmnet      [...]       0       [...]
amssq46.esams.wmnet     [...]       0       [...]
cp3005.esams.wikimedia.org   [...]  0       [...]
amssq59.esams.wikimedia.org  [...]  0       [...]



[**] https://access.redhat.com/articles/1332213



-- 
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
                           Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3     Email:  [email protected]
4293 Gutau, Austria          Phone:          +43 7946 / 20 5 81
                             Fax:            +43 7946 / 20 5 81
                             Homepage: http://quelltextlich.at/
---------------------------------------------------------------
-- hive -f one_hour_stats_all_sources.hql -d table=wmf_raw.webrequest -d 
year=2014 -d month=10 -d day=9 -d hour=17

SELECT * FROM (
    SELECT
        hostname,
        MIN(sequence)                                                  AS 
sequence_min,
        MAX(sequence)                                                  AS 
sequence_max,
        COUNT(*)                                                       AS 
count_actual,
        MAX(sequence) - MIN(sequence) + 1                              AS 
count_expected,
        MAX(sequence) - MIN(sequence) + 1 - COUNT(hostname)            AS 
count_different,
        COUNT(*) - COUNT(DISTINCT sequence)                            AS 
count_duplicate,
        SUM(if(sequence IS NULL,1,0))                                  AS 
count_null_sequence,
        ((COUNT(*) / (MAX(sequence) - MIN(sequence) + 1)) - 1) * 100.0 AS 
percent_different
      FROM
        ${table}
      WHERE
        year=${year} AND month=${month} AND day=${day} AND hour=${hour}
      GROUP BY
        hostname, webrequest_source, year, month, day
    ) all_hosts WHERE (
        count_duplicate != 0      -- Host has duplicates
    OR
       count_different != 0      -- Host has duplicates or holes
    OR
       count_null_sequence != 0  -- Host has NULL sequence numbers
);

Attachment: signature.asc
Description: Digital signature

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics

Reply via email to