Re: [Analytics] udp2log shutdown (for analytics instances) next week

Andrew Otto Mon, 27 Apr 2015 10:16:45 -0700

Hi again!

Today I turned off most udp2log webrequest filters.  For now, I have left
the Fundraising filters, as well as the 5xx and sampled-1000 filters
running.  All of these filters are now running on erbium.  oxygen's udp2log
instance has been shut off.


Instead of constantly updating this thread, I will track this here:
https://phabricator.wikimedia.org/T97294

Thanks!

On Tue, Apr 21, 2015 at 3:49 PM, Andrew Otto <[email protected]> wrote:

> Hi all!
>
> Now that all data that is generated by udp2log is also being generated by
> the Analytics Cluster, we are finally ready to turn off analytics udp2log
> instances.  I will start with the ones that are used to generate the logs
> on stat1002 at /a/squid/archive.  The (identical) cluster generated logs
> can be found on stat1002 at /a/log/webrequest/archive.  I will paste the
> contents of the README file in /a/squid/archive describing the differences
> at the bottom of this email.
>
> If you use any of the logs in /a/squid/archive for regular statistics, you
> will need to switch your code to use files in /a/log/webrequest/archive
> instead.  I plan to start turning off udp2log instances on Monday, April
> 27th (that’s next week!).
>
>
> From the README:
>
> [@stat1002:/a/squid/archive] $ cat README.migrate-to-hive.2015-02-17
> ***********************************************************************
> *                                                                     *
> *  This directory will go stale once udp2log is turned off.           *
> *  Please use the corresponding TSVs from /a/log/webrequest/archive/  *
> *  instead.                                                           *
> *                                                                     *
> ***********************************************************************
>
>
>
> The TSV files in this directory underneath /a/squid/archive get
> generated by udp2log and suffer from
>
> * Sub-par data quality (E.g.: udp2log had an inherent loss).
> * Lack of a way to backfill/fix data.
> * Some files consuming https requests twice, which made filtering
>   necessary.
> * Confusing naming scheme, where each file covered 24 hours, though not
>   midnight to midnight but ~06:30 previous day to ~06:30 current day.
>
> The new TSVs at /a/log/webrequest/archive/ contain the same
> information but get generated by Hive, and address the above four
> issues:
>
> * By using Hive's webrequest table as input, the inherent loss is
>   gone. Also statistics on the hour's data quality are available.
> * Hive makes it possible to backfill/fix data.
> * Only data from the varnishes gets picked up. So https traffic no
>   longer gets duplicated.
> * The files now cover 24 hours from midnight to midnight. No more
>   stitching/cutting is needed to get the logs for a given day.
>
>
> Please migrate to using the Hive-generated TSVs from
>
>   /a/log/webrequest/archive/
>
>
> Thanks!  I’ll keep you updated as this happens.
>
> -Andrew Otto
>
>
>
>

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
