Aha, so wmf_raw.webrequest is expected to have duplicates? Okay! That could do it :). I'll re-run across wmf.webrequest; thanks Christian for the spot, and Andrew for having thought 3 stages ahead as usual :D
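As an aside, the per-host sequence check mentioned in the quoted thread below could be sketched roughly as follows. This is only an illustration under stated assumptions: the `(hostname, sequence)` row shape and the function name `sequence_stats` are hypothetical, and this is not the actual job behind wmf_raw.webrequest_sequence_stats. The idea is that if each host numbers its requests 1, 2, 3, ..., the expected row count between the lowest and highest sequence number seen is `max - min + 1`, so duplicates and losses fall out of simple counts.

```python
from collections import defaultdict


def sequence_stats(rows):
    """Summarise duplicates and losses per host.

    `rows` is an iterable of (hostname, sequence) pairs; both field
    names are assumptions made for this sketch.
    """
    seqs_by_host = defaultdict(list)
    for host, seq in rows:
        seqs_by_host[host].append(seq)

    stats = {}
    for host, seqs in seqs_by_host.items():
        lowest, highest = min(seqs), max(seqs)
        # Sequence numbers step by 1, so this many rows are expected.
        expected = highest - lowest + 1
        distinct = len(set(seqs))
        stats[host] = {
            "expected": expected,
            "actual": len(seqs),
            # Rows whose sequence number was already seen for this host.
            "duplicates": len(seqs) - distinct,
            # Sequence numbers in [lowest, highest] that never arrived.
            "missing": expected - distinct,
        }
    return stats


if __name__ == "__main__":
    sample = [("host-a", 1), ("host-a", 2), ("host-a", 2), ("host-a", 4)]
    print(sequence_stats(sample))
```

A table that has been de-duplicated should report zero duplicates for every host, while a raw table may not.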

On 23 February 2015 at 09:35, Andrew Otto <[email protected]> wrote:
> We should address automatic duplicate cleaning very soon, as Christian
> warned a while ago. He manually cleaned up duplicates a few times but we
> know it's a problem that needs solving.
>
> Duplicates are already cleaned up, in the refined table. There should never
> be any duplicates in the wmf.webrequest table.
>
> https://gerrit.wikimedia.org/r/#/c/177522/
>
> Seeing as this was merged on Jan 26, it is possible that it was not yet
> deployed on Jan 27, when Oliver noticed duplicates.
>
> We should be calculating a per-host arithmetic series over the sequence
> numbers when data is loaded.
>
> Please see the wmf_raw.webrequest_sequence_stats tables, for hourly
> partition statistics, including duplicates and losses.
>
> -Ao
>
> On Feb 23, 2015, at 09:01, Dan Andreescu <[email protected]> wrote:
>
> We should address automatic duplicate cleaning very soon, as Christian
> warned a while ago. He manually cleaned up duplicates a few times but we
> know it's a problem that needs solving.
>
> On Mon, Feb 23, 2015 at 6:22 AM, Christian Aistleitner
> <[email protected]> wrote:
>>
>> Hi Oliver,
>>
>> On Sun, Feb 22, 2015 at 06:46:37PM -0500, Oliver Keyes wrote:
>> > And, an additional point; I don't understand why, if dupes is the
>> > problem, the Hive query was not hit as badly by this as the equivalent
>> > UDF.
>>
>> just shooting in the dark, since you did not provide your query, but
>> if you by accident had been querying the
>>
>>   wmf_raw.webrequest
>>
>> (database name ending in “_raw”) table instead of
>>
>>   wmf.webrequest
>>
>> (no “_raw” in the database name), the difference you described would
>> be plausible (and given the patching of GHOST, they'd even be
>> expected).
>>
>> Have fun,
>> Christian
>>
>> --
>> ---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
>> Companies' registry: 360296y in Linz
>> Christian Aistleitner
>> Kefermarkterstrasze 6a/3     Email:  [email protected]
>> 4293 Gutau, Austria          Phone:  +43 7946 / 20 5 81
>>                              Fax:    +43 7946 / 20 5 81
>>                              Homepage: http://quelltextlich.at/
>> ---------------------------------------------------------------
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics

-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
