Ok thanks for the heads up!

On Mon, 27 Apr 2015, Andrew Otto wrote:

Hi again!
Today I turned off most udp2log webrequest filters.  For now, I have left
the Fundraising filters, as well as the 5xx and sampled-1000 filters,
running.  All of these filters are now running on erbium.  oxygen's udp2log
instance has been shut off.

Instead of constantly updating this thread, I will track this here: 
https://phabricator.wikimedia.org/T97294

Thanks!

On Tue, Apr 21, 2015 at 3:49 PM, Andrew Otto <[email protected]> wrote:
      Hi all!

      Now that all data that is generated by udp2log is also being generated
      by the Analytics Cluster, we are finally ready to turn off analytics
      udp2log instances.  I will start with the ones that are used to
      generate the logs on stat1002 at /a/squid/archive.  The (identical)
      cluster generated logs can be found on stat1002 at
      /a/log/webrequest/archive.  I will paste the contents of the README
      file in /a/squid/archive describing the differences at the bottom of
      this email.

      If you use any of the logs in /a/squid/archive for regular statistics,
      you will need to switch your code to use files in
      /a/log/webrequest/archive instead.  I plan to start turning off udp2log
      instances on Monday, April 27th (that’s next week!).
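
For scripts that hard-code the old directory, the switch described above could be sketched in Python roughly as follows. This is only an illustration: the helper name `migrate_path` is hypothetical, and it assumes that a file keeps the same relative path under both directories, which the announcement does not state. Please check the actual layout on stat1002 before relying on it.

```python
from pathlib import PurePosixPath

# Directories named in the announcement.
OLD_ROOT = PurePosixPath("/a/squid/archive")
NEW_ROOT = PurePosixPath("/a/log/webrequest/archive")


def migrate_path(path: str) -> str:
    """Rewrite a path under OLD_ROOT to the same relative path under NEW_ROOT.

    Assumption (not confirmed by the announcement): the relative layout
    below both directories is identical. Paths outside OLD_ROOT are
    returned unchanged.
    """
    p = PurePosixPath(path)
    try:
        relative = p.relative_to(OLD_ROOT)
    except ValueError:
        # Not under the old udp2log archive, so leave it alone.
        return path
    return str(NEW_ROOT / relative)
```

Calling `migrate_path` on each configured input path leaves unrelated paths untouched, so it can be applied to a whole list of inputs without special-casing.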


      From the README:

      [@stat1002:/a/squid/archive] $ cat README.migrate-to-hive.2015-02-17
      ***********************************************************************
      *                                                                     *
      *  This directory will run stale once udp2log will get turned off.    *
      *  Please use the corresponding TSVs from /a/log/webrequest/archive/  *
      *  instead.                                                           *
      *                                                                     *
      ***********************************************************************



      The TSV files in this directory underneath /a/squid/archive get
      generated by udp2log and suffer from

      * Sub-par data quality (E.g.: udp2log had an inherent loss).
      * Lack of a way to backfill/fix data.
      * Some files consuming https requests twice, which made filtering
        necessary.
      * Confusing naming scheme, where each file covered 24 hours, but not
        midnight to midnight, but ~06:30 previous day to ~06:30 current day.

      The new TSVs at /a/log/webrequest/archive/ contain the same
      information but get generated by Hive, and address the above four
      issues:

      * By using Hive's webrequest table as input, the inherent loss is
        gone. Also statistics on the hour's data quality are available.
      * Hive data allows backfilling/fixing data.
      * Only data from the varnishes gets picked up. So https traffic no
        longer gets duplicated.
      * The files now cover 24 hours from midnight to midnight. No more
        stitching/cutting is needed to get the logs for a given day.


      Please migrate to using the Hive-generated TSVs from

        /a/log/webrequest/archive/


      Thanks!  I’ll keep you updated as this happens.

      -Andrew Otto





_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
