> We should address automatic duplicate cleaning very soon, as Christian warned 
> a while ago.  He manually cleaned up duplicates a few times but we know it's 
> a problem that needs solving.
Duplicates are already cleaned up, in the refined table.  There should never be 
any duplicates in the wmf.webrequest table.

https://gerrit.wikimedia.org/r/#/c/177522/ 

Seeing as this was merged on Jan 26, it is possible that it was not yet 
deployed on Jan 27, when Oliver is noticing duplicates.

> We should be calculating a per-host arithmetic series over the sequence 
> numbers when data is loaded.

Please see the wmf_raw.webrequest_sequence_stats tables, for hourly partition 
statistics, including duplicates and losses.
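
For anyone who wants to see the idea in miniature, here is a rough Python 
sketch of a per-host check over sequence numbers. It is only an 
illustration, assuming each row carries a hostname and a sequence number 
that increases by one per request on that host. The function name, the 
dictionary keys, and the host names are made up for the example; they are 
not the columns of the actual stats table, and this is not how the 
production job is implemented.

```python
from collections import defaultdict


def sequence_stats(rows):
    """Summarise duplicates and losses per host.

    rows: iterable of (hostname, sequence) pairs for one partition.
    All names here are hypothetical and chosen for illustration.
    """
    seqs_by_host = defaultdict(list)
    for hostname, sequence in rows:
        seqs_by_host[hostname].append(sequence)

    stats = {}
    for hostname, seqs in seqs_by_host.items():
        distinct = len(set(seqs))
        # Sequence numbers step by one, so a gap-free run from min to max
        # should contain exactly (max - min + 1) distinct values.
        expected = max(seqs) - min(seqs) + 1
        stats[hostname] = {
            "actual": len(seqs),                # rows seen, duplicates included
            "expected": expected,               # rows a complete run would have
            "duplicates": len(seqs) - distinct, # extra copies of a sequence number
            "lost": expected - distinct,        # sequence numbers never seen
        }
    return stats


# Hypothetical sample: cp1001 repeats sequence 2 and is missing 3 and 4.
rows = [
    ("cp1001", 1), ("cp1001", 2), ("cp1001", 2), ("cp1001", 5),
    ("cp1002", 10), ("cp1002", 11),
]
stats = sequence_stats(rows)
for host in sorted(stats):
    print(host, stats[host])
```

Note that a check like this cannot see losses at the very start or end of 
a partition, since it only compares against the minimum and maximum 
sequence numbers that actually arrived.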

-Ao




> On Feb 23, 2015, at 09:01, Dan Andreescu <[email protected]> wrote:
> 
> We should address automatic duplicate cleaning very soon, as Christian warned 
> a while ago.  He manually cleaned up duplicates a few times but we know it's 
> a problem that needs solving.
> 
> On Mon, Feb 23, 2015 at 6:22 AM, Christian Aistleitner 
> <[email protected]> wrote:
> Hi Oliver,
> 
> On Sun, Feb 22, 2015 at 06:46:37PM -0500, Oliver Keyes wrote:
> > And, an additional point; I don't understand why, if dupes is the
> > problem, the Hive query was not hit as badly by this as the equivalent
> > UDF.
> 
> just shooting in the dark, since you did not provide your query, but
> if you by accident had been querying the
> 
>   wmf_raw.webrequest
> 
> (database name ending in “_raw”) table instead of
> 
>   wmf.webrequest
> 
> (no “_raw” in the database name), the difference you described would
> be plausible (and given the patching of GHOST, they'd even be
> expected).
> 
> 
> Have fun,
> Christian
> 
> 
> 
> --
> ---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
>                            Companies' registry: 360296y in Linz
> Christian Aistleitner
> Kefermarkterstrasze 6a/3     Email:  [email protected]
> 4293 Gutau, Austria          Phone:          +43 7946 / 20 5 81
>                              Fax:            +43 7946 / 20 5 81
>                              Homepage: http://quelltextlich.at/
> ---------------------------------------------------------------
> 
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
> 

