> We should address automatic duplicate cleaning very soon, as Christian warned
> a while ago. He manually cleaned up duplicates a few times but we know it's
> a problem that needs solving.

Duplicates are already cleaned up in the refined table. There should never be any duplicates in the wmf.webrequest table.
https://gerrit.wikimedia.org/r/#/c/177522/

Since this was merged on Jan 26, it is possible that it had not yet been deployed on Jan 27, when Oliver noticed the duplicates.

> We should be calculating a per-host arithmetic series over the sequence
> numbers
> when data is loaded.

Please see the wmf_raw.webrequest_sequence_stats tables for hourly partition statistics, including duplicates and losses.

-Ao

> On Feb 23, 2015, at 09:01, Dan Andreescu <[email protected]> wrote:
>
> We should address automatic duplicate cleaning very soon, as Christian warned
> a while ago. He manually cleaned up duplicates a few times but we know it's
> a problem that needs solving.
>
> On Mon, Feb 23, 2015 at 6:22 AM, Christian Aistleitner
> <[email protected]> wrote:
> Hi Oliver,
>
> On Sun, Feb 22, 2015 at 06:46:37PM -0500, Oliver Keyes wrote:
> > And, an additional point; I don't understand why, if dupes is the
> > problem, the Hive query was not hit as badly by this as the equivalent
> > UDF.
>
> just shooting in the dark, since you did not provide your query, but
> if you by accident had been querying the
>
>   wmf_raw.webrequest
>
> (database name ending in “_raw”) table instead of
>
>   wmf.webrequest
>
> (no “_raw” in the database name), the difference you described would
> be plausible (and given the patching of GHOST, they'd even be
> expected).
>
> Have fun,
> Christian
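For anyone unfamiliar with the per-host sequence check discussed above, here is a minimal sketch of the idea in Python. It is not the actual job that populates wmf_raw.webrequest_sequence_stats. It assumes each host stamps its requests with a sequence number that increases by exactly 1, and the function name and the output field names (`actual`, `expected`, `duplicates`, `lost`) are made up for illustration.

```python
from collections import defaultdict


def sequence_stats(rows):
    """Summarise one hourly partition per host (illustrative only).

    rows: iterable of (hostname, sequence) pairs.
    Assumption: each host's sequence numbers increase by 1 per request,
    so a complete run from min to max holds max - min + 1 entries.
    """
    per_host = defaultdict(list)
    for hostname, sequence in rows:
        per_host[hostname].append(sequence)

    stats = {}
    for hostname, seqs in per_host.items():
        actual = len(seqs)            # rows actually present
        distinct = len(set(seqs))     # rows after removing repeats
        expected = max(seqs) - min(seqs) + 1
        stats[hostname] = {
            "actual": actual,
            "expected": expected,
            "duplicates": actual - distinct,  # repeated sequence numbers
            "lost": expected - distinct,      # gaps in the run
        }
    return stats


if __name__ == "__main__":
    # Hypothetical sample: cp1 repeats sequence 2 and skips 3 and 4.
    sample = [("cp1", 1), ("cp1", 2), ("cp1", 2), ("cp1", 5), ("cp2", 10)]
    for host, s in sorted(sequence_stats(sample).items()):
        print(host, s)
```

A host with non-zero `duplicates` or `lost` for a given hour is the kind of partition one would want to inspect before trusting counts from the raw table.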
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
