Aha, so wmf_raw.webrequest is expected to have duplicates? Okay! That
could do it :). I'll re-run across wmf.webrequest; thanks Christian
for the spot, and Andrew for having thought 3 stages ahead as usual :D
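
For anyone following along, here is a rough sketch of the per-host sequence-number check mentioned in the quoted thread below. It is only an illustration under assumptions: it assumes each host emits consecutive integer sequence numbers within a partition, and the function name, field names, and host names are hypothetical, not the actual columns of wmf_raw.webrequest_sequence_stats.

```python
from collections import defaultdict


def sequence_stats(rows):
    """Summarise duplicates and losses per host.

    rows: iterable of (hostname, sequence) pairs for one partition.
    Assumes sequence numbers are consecutive integers per host, so the
    expected row count is an arithmetic series: max - min + 1.
    """
    seqs_by_host = defaultdict(list)
    for hostname, sequence in rows:
        seqs_by_host[hostname].append(sequence)

    stats = {}
    for hostname, seqs in seqs_by_host.items():
        distinct = len(set(seqs))
        expected = max(seqs) - min(seqs) + 1
        stats[hostname] = {
            "rows": len(seqs),                  # rows actually present
            "expected": expected,               # rows a gap-free host would have
            "duplicates": len(seqs) - distinct,  # repeated sequence numbers
            "missing": expected - distinct,      # sequence numbers never seen
        }
    return stats


# Hypothetical data: host-a repeats sequence 2 and never delivers 3.
example_rows = [
    ("host-a", 1), ("host-a", 2), ("host-a", 2), ("host-a", 4),
    ("host-b", 10), ("host-b", 11),
]
stats = sequence_stats(example_rows)
print(stats["host-a"])  # {'rows': 4, 'expected': 4, 'duplicates': 1, 'missing': 1}
```

A host with non-zero duplicates in the raw data would inflate any count run against it, which would fit the difference I was seeing.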

On 23 February 2015 at 09:35, Andrew Otto <[email protected]> wrote:
> We should address automatic duplicate cleaning very soon, as Christian
> warned a while ago.  He manually cleaned up duplicates a few times but we
> know it's a problem that needs solving.
>
> Duplicates are already cleaned up, in the refined table.  There should never
> be any duplicates in the wmf.webrequest table.
>
> https://gerrit.wikimedia.org/r/#/c/177522/
>
> Seeing as this was merged on Jan 26, it is possible that it was not yet
> deployed on Jan 27, when Oliver noticed the duplicates.
>
> We should be calculating a per-host arithmetic series over the sequence
> numbers when data is loaded.
>
> Please see the wmf_raw.webrequest_sequence_stats tables for hourly
> partition statistics, including duplicates and losses.
>
> -Ao
>
> On Feb 23, 2015, at 09:01, Dan Andreescu <[email protected]> wrote:
>
> We should address automatic duplicate cleaning very soon, as Christian
> warned a while ago.  He manually cleaned up duplicates a few times but we
> know it's a problem that needs solving.
>
> On Mon, Feb 23, 2015 at 6:22 AM, Christian Aistleitner
> <[email protected]> wrote:
>>
>> Hi Oliver,
>>
>> On Sun, Feb 22, 2015 at 06:46:37PM -0500, Oliver Keyes wrote:
>> > And, an additional point; I don't understand why, if dupes is the
>> > problem, the Hive query was not hit as badly by this as the equivalent
>> > UDF.
>>
>> just shooting in the dark, since you did not provide your query, but
>> if you by accident had been querying the
>>
>>   wmf_raw.webrequest
>>
>> (database name ending in “_raw”) table instead of
>>
>>   wmf.webrequest
>>
>> (no “_raw” in the database name), the difference you described would
>> be plausible (and given the patching of GHOST, they'd even be
>> expected).
>>
>>
>> Have fun,
>> Christian
>>
>>
>>
>> --
>> ---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
>>                            Companies' registry: 360296y in Linz
>> Christian Aistleitner
>> Kefermarkterstrasze 6a/3     Email:  [email protected]
>> 4293 Gutau, Austria          Phone:          +43 7946 / 20 5 81
>>                              Fax:            +43 7946 / 20 5 81
>>                              Homepage: http://quelltextlich.at/
>> ---------------------------------------------------------------
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>



-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation

