Hi again!

Today I turned off most udp2log webrequest filters. For now, I have left the Fundraising filters, as well as the 5xx and sampled-1000 filters, running. All of these filters are now running on erbium. oxygen's udp2log instance has been shut off.

Instead of constantly updating this thread, I will track this here: https://phabricator.wikimedia.org/T97294

Thanks!

On Tue, Apr 21, 2015 at 3:49 PM, Andrew Otto <[email protected]> wrote:

> Hi all!
>
> Now that all data that is generated by udp2log is also being generated by
> the Analytics Cluster, we are finally ready to turn off analytics udp2log
> instances. I will start with the ones that are used to generate the logs
> on stat1002 at /a/squid/archive. The (identical) cluster generated logs
> can be found on stat1002 at /a/log/webrequest/archive. I will paste the
> contents of the README file in /a/squid/archive describing the differences
> at the bottom of this email.
>
> If you use any of the logs in /a/squid/archive for regular statistics, you
> will need to switch your code to use files in /a/log/webrequest/archive
> instead. I plan to start turning off udp2log instances on Monday April
> 27th (that’s next week!).
>
>
> From the README:
>
> [@stat1002:/a/squid/archive] $ cat README.migrate-to-hive.2015-02-17
> ***********************************************************************
> *                                                                     *
> * This directory will run stale once udp2log will get turned off.     *
> * Please use the corresponding TSVs from /a/log/webrequest/archive/   *
> * instead.                                                            *
> *                                                                     *
> ***********************************************************************
>
>
>
> The TSV files in this directory underneath /a/squid/archive get
> generated by udp2log and suffer from
>
> * Sub-par data quality (E.g.: udp2log had an inherent loss).
> * Lack of a way to backfill/fix data.
> * Some files consuming https requests twice, which made filtering
>   necessary.
> * Confusing naming scheme, where each file covered 24 hours, but not
>   midnight to midnight, but ~06:30 previous day to ~06:30 current day.
>
> The new TSVs at /a/log/webrequest/archive/ contain the same
> information but get generated by Hive, and address the above four
> issues:
>
> * By using Hive's webrequest table as input, the inherent loss is
>   gone. Also statistics on the hour's data quality are available.
> * Hive data allows to backfill/fix data.
> * Only data from the varnishes gets picked up. So https traffic no
>   longer gets duplicated.
> * The files now cover 24 hours from midnight to midnight. No more
>   stitching/cutting is needed to get the logs for a given day.
>
>
> Please migrate to using the Hive-generated TSVs from
>
>     /a/log/webrequest/archive/
>
>
> Thanks! I’ll keep you updated as this happens.
>
> -Andrew Otto
