I've updated the migration plan with in particular the clean-up queries that would delete the old history data and the old XWikiTranslator activity stream data. The volumes are much more reasonable now (100K activity stream and 30k history) so I don't think we need to delete more.
I'm finalizing the custom mapping migration locally (this will add significant history as every document will be resaved). Can the queries be reviewed ? http://l10n.xwiki.org/xwiki/bin/view/L10N/L10NMigrationPlan Ludovic 2013/1/24 Ludovic Dubost <[email protected]> > Hi, > > I've published a migration plan: > > http://l10n.xwiki.org/xwiki/bin/view/L10N/L10NMigrationPlan > > Review welcome > > Ludovic > > > 2013/1/24 Ludovic Dubost <[email protected]> > >> FYI, here are the running times of these queries: >> >> mysql> create table l10n_stats_author as select year(xwr_date)*100+ >> month(xwr_date) as l10n_month, XWR_AUTHOR as l10n_author, count(distinct >> XWR_DOCID, XWR_VERSION1, XWR_VERSION2) as l10n_count from xwikircs group by >> 1,2; >> Query OK, 763 rows affected (1 min 28.36 sec) >> Records: 763 Duplicates: 0 Warnings: 0 >> >> mysql> delete from l10n_stats_author where l10n_month=201301 or >> l10n_month=201302; >> Query OK, 29 rows affected (0.00 sec) >> >> mysql> insert into l10n_stats_author (select year(xwr_date)*100+ >> month(xwr_date) as l10n_month, XWR_AUTHOR as l10n_author, count(distinct >> XWR_DOCID, XWR_VERSION1, XWR_VERSION2) as l10n_count from xwikircs where >> xwr_date >= '2012-12-01' group by 1,2); >> Query OK, 35 rows affected (0.07 sec) >> Records: 35 Duplicates: 0 Warnings: 0 >> >> Update is very fast >> >> Ludovic >> >> >> 2013/1/24 Ludovic Dubost <[email protected]> >> >>> The custom statistics data structure is not that complicated. >>> I've already the SQL query that does the initial version (based on >>> xwikircs for now) >>> >>> create table l10n_stats_author as select year(xwr_date)*100+ >>> month(xwr_date) as l10n_month, XWR_AUTHOR as l10n_author, count(distinct >>> XWR_DOCID, XWR_VERSION1, XWR_VERSION2) as l10n_count from xwikircs group by >>> 1,2; >>> delete from l10n_stats_author where l10n_month=201301 or >>> l10n_month=201302; >>> insert into l10n_stats_author (select year(xwr_date)*100+ >>> month(xwr_date) as l10n_month, XWR_AUTHOR as l10n_author, count(distinct >>> XWR_DOCID, XWR_VERSION1, XWR_VERSION2) as l10n_count from xwikircs where >>> xwr_date >= '2012-12-01' group by 1,2); >>> select * from l10n_stats_author where l10n_month=201301 or >>> l10n_month=201302; >>> >>> >>> Putting this in a scheduler job should not be complicated >>> >>> Ludovic >>> >>> >>> >>> 2013/1/24 Sergiu Dumitriu <[email protected]> >>> >>>> On 01/24/2013 12:03 AM, Ludovic Dubost wrote: >>>> > Working on l10n change to custom mapping showed up an issue with the >>>> l10n >>>> > history. >>>> > We have a huge history table with 2,3M lines (as well as an activity >>>> > stream) and we use the history table for extracting contributor >>>> statistics. >>>> > >>>> > However looking at it more closely shows that the source of most data >>>> in >>>> > the history table is useless. >>>> > See here a grouping by dates showing that localisation of already >>>> close to >>>> > 2M lines in 6 month in 2012. >>>> > >>>> > | 201201 | 318346 | >>>> > | 201202 | 311403 | >>>> > | 201203 | 728703 | >>>> > | 201204 | 271657 | >>>> > | 201205 | 296384 | >>>> > | 201206 | 120463 | >>>> > >>>> > Here is an example of history for an entry >>>> > >>>> > mysql> select >>>> > >>>> xwr_author,xwr_version1,xwr_version2,xwr_date,xwr_comment,cast(xwr_isdiff >>>> > as unsigned), length(xwr_patch) from xwikircs where >>>> xwr_docid=596494851 ; >>>> > >>>> +-----------------------+--------------+--------------+---------------------+-------------+------------------------------+-------------------+ >>>> > | xwr_author | xwr_version1 | xwr_version2 | xwr_date >>>> > | xwr_comment | cast(xwr_isdiff as unsigned) | length(xwr_patch) | >>>> > >>>> +-----------------------+--------------+--------------+---------------------+-------------+------------------------------+-------------------+ >>>> > | XWiki.XWikiTranslator | 1 | 1 | 2010-02-23 >>>> 11:38:27 >>>> > | | 1 | 163 | >>>> > | XWiki.XWikiTranslator | 2 | 1 | 2010-02-23 >>>> 11:40:26 >>>> > | | 1 | 330 | >>>> > | XWiki.rbuj | 3 | 1 | 2010-03-04 >>>> 00:24:24 >>>> > | | 1 | 189 | >>>> > | XWiki.rbuj | 4 | 1 | 2010-03-04 >>>> 01:02:52 >>>> > | | 1 | 223 | >>>> > | XWiki.rbuj | 5 | 1 | 2010-07-30 >>>> 01:12:58 >>>> > | | 0 | 5026 | >>>> > | XWiki.XWikiTranslator | 6 | 1 | 2012-01-23 >>>> 11:40:25 >>>> > | | 1 | 115 | >>>> > | XWiki.XWikiTranslator | 7 | 1 | 2012-01-23 >>>> 11:58:25 >>>> > | | 1 | 115 | >>>> > | XWiki.XWikiTranslator | 8 | 1 | 2012-01-23 >>>> 13:31:52 >>>> > | | 1 | 234 | >>>> > | XWiki.XWikiTranslator | 8 | 2 | 2012-01-23 >>>> 15:36:02 >>>> > | Prepared | 1 | 115 | >>>> > | XWiki.XWikiTranslator | 8 | 3 | 2012-01-24 >>>> 02:01:12 >>>> > | Prepared | 0 | 5670 | >>>> > | XWiki.XWikiTranslator | 8 | 4 | 2012-01-25 >>>> 02:04:00 >>>> > | Prepared | 1 | 115 | >>>> > | XWiki.XWikiTranslator | 8 | 5 | 2012-01-26 >>>> 02:02:29 >>>> > | Prepared | 1 | 115 | >>>> > | XWiki.XWikiTranslator | 8 | 6 | 2012-01-27 >>>> 02:01:18 >>>> > | Prepared | 1 | 115 | >>>> > | XWiki.XWikiTranslator | 8 | 7 | 2012-02-04 >>>> 02:01:08 >>>> > | Prepared | 1 | 115 | >>>> > | XWiki.XWikiTranslator | 8 | 8 | 2012-02-07 >>>> 02:02:03 >>>> > | Prepared | 0 | 5670 | >>>> > | XWiki.XWikiTranslator | 8 | 9 | 2012-02-08 >>>> 02:01:20 >>>> > | Prepared | 1 | 115 | >>>> > | XWiki.XWikiTranslator | 8 | 10 | 2012-02-14 >>>> 02:01:07 >>>> > | Prepared | 1 | 116 | >>>> > | XWiki.XWikiTranslator | 8 | 11 | 2012-02-28 >>>> 02:01:32 >>>> > | Prepared | 1 | 116 | >>>> > | XWiki.XWikiTranslator | 8 | 12 | 2012-03-03 >>>> 02:01:20 >>>> > | Prepared | 1 | 116 | >>>> > | XWiki.XWikiTranslator | 8 | 13 | 2012-03-14 >>>> 02:01:50 >>>> > | Prepared | 0 | 5671 | >>>> > | XWiki.XWikiTranslator | 8 | 14 | 2012-03-14 >>>> 18:23:06 >>>> > | Prepared | 1 | 116 | >>>> > | XWiki.XWikiTranslator | 8 | 15 | 2012-03-14 >>>> 18:32:38 >>>> > | Prepared | 1 | 116 | >>>> > | XWiki.XWikiTranslator | 8 | 16 | 2012-03-14 >>>> 18:37:17 >>>> > | Prepared | 1 | 116 | >>>> > | XWiki.XWikiTranslator | 8 | 17 | 2012-03-24 >>>> 02:01:34 >>>> > | Prepared | 1 | 116 | >>>> > | XWiki.XWikiTranslator | 8 | 18 | 2012-03-26 >>>> 16:40:35 >>>> > | Prepared | 0 | 5671 | >>>> > | XWiki.XWikiTranslator | 8 | 19 | 2012-03-28 >>>> 02:01:23 >>>> > | Prepared | 1 | 116 | >>>> > | XWiki.XWikiTranslator | 8 | 20 | 2012-03-29 >>>> 02:01:45 >>>> > | Prepared | 1 | 116 | >>>> > | XWiki.XWikiTranslator | 8 | 21 | 2012-04-11 >>>> 02:01:36 >>>> > | Prepared | 1 | 116 | >>>> > | XWiki.XWikiTranslator | 8 | 22 | 2012-04-19 >>>> 15:20:10 >>>> > | Prepared | 1 | 116 | >>>> > | XWiki.XWikiTranslator | 8 | 24 | 2012-04-29 >>>> 02:01:45 >>>> > | Prepared | 1 | 116 | >>>> > | XWiki.XWikiTranslator | 8 | 25 | 2012-05-04 >>>> 18:55:30 >>>> > | Prepared | 1 | 116 | >>>> > | XWiki.XWikiTranslator | 8 | 26 | 2012-05-04 >>>> 19:02:09 >>>> > | Prepared | 1 | 116 | >>>> > | XWiki.XWikiTranslator | 8 | 27 | 2012-05-10 >>>> 02:04:36 >>>> > | Prepared | 1 | 116 | >>>> > | XWiki.XWikiTranslator | 8 | 28 | 2012-05-30 >>>> 02:02:08 >>>> > | Prepared | 0 | 5671 | >>>> > | XWiki.XWikiTranslator | 8 | 29 | 2012-06-07 >>>> 10:44:28 >>>> > | Prepared | 0 | 5782 | >>>> > >>>> +-----------------------+--------------+--------------+---------------------+-------------+------------------------------+-------------------+ >>>> > >>>> > >>>> http://l10n.xwiki.org/xwiki/bin/view/XE/XEXWikiCoreResources_2107067965_core-menu-watchlist-add-page_nl?viewer=history&showminor=true >>>> > >>>> > Most entries are "no changes" and have been cause by the L10NUpdater >>>> which >>>> > wrongefully saved the document with no changes. I believe this must >>>> have >>>> > been fixed (by Thomas M.?) mid 2012. >>>> > >>>> > Now the 2M lines impact performance significantly and loads the DB for >>>> > nothing (and in the activity stream as well). >>>> > >>>> > I suggest we clean up the history and activity stream. We have 2 >>>> > possibilities: >>>> > >>>> > For xwikircs: >>>> > >>>> > 1/ Clean up only the bad data from XWikiTranslator when there are no >>>> > changes: >>>> > >>>> > This is complicated as you need to verify if the change is actually a >>>> > change and you cannot do that just with sql queries. It could be very >>>> long >>>> > >>>> > 2/ Clean up old data from pre-201206 from XWikiTranslator >>>> > >>>> > Simpler if it is safe to delete by date in the DB. After discussion >>>> with >>>> > sergui this is a bit complicated because you need to make sure you >>>> don't >>>> > delete the latest full version before the versions you keep. So you >>>> would >>>> > have to do it by API which will take ages. >>>> > >>>> > 3/ Clean up up old data from pre-201206 from all users >>>> > >>>> > This is simpler as you can safely delete from the database everything >>>> older >>>> > than a certain versions. Cleans-up even more but would loose >>>> contributor >>>> > statistics unless we store 2012 contributor counts in an alternate >>>> table >>>> > which would then be regularly updated >>>> >>>> +1 for 3/, without the extra stats table yet. See below. >>>> >>>> Technically, we could also use /2, since we'll still have history >>>> summary (who changed when), but some actual versions won't be >>>> retrievable. Is the actual revision important? IMHO, only if we want to >>>> investigate some foul play. >>>> >>>> > In any case we should probably create this intermediary table for >>>> > statistics as it would be much faster anyway. >>>> > >>>> > For activitystream: >>>> > >>>> > 1/ Clean up old data from XWikiTranslator 201206 or earlier >>>> > >>>> > 2/ Clean up old date from everybody 201206 or earlier >>>> >>>> +1 for 1/ >>>> >>>> > What value do we see in the l10n history and actvity stream and which >>>> > solutions do the other commiters suggest ? >>>> > >>>> > I would say it's interesting for contributor statistics (counting >>>> number of >>>> > contribution by translators) but beyond that we can delete the data. >>>> > So we would fix that by storing monthly statistics in a table and >>>> updating >>>> > the latest 2 month through a scheduler job. This means that we can >>>> also >>>> > delete history over 2 month. >>>> >>>> We can get contributor activity from the activity stream, independently >>>> from the RCS table. Since the activity stream doesn't have inter-version >>>> dependencies like the RCS does, we can freely discard irrelevant rows >>>> from it, and use the remaining valid data. The same information is >>>> present in the activity stream and in the RCS table: who did what, when, >>>> and where. >>>> >>>> I mean, this is a good fast solution for the moment. We could still make >>>> a custom statistics data structure for translation contributors, but >>>> that would take longer, and thus it delays the migration of the l10n >>>> wiki to a newer version, and delays the performance improvement that >>>> we'd gain by dropping data. >>>> >>>> -- >>>> Sergiu Dumitriu >>>> http://purl.org/net/sergiu >>>> _______________________________________________ >>>> devs mailing list >>>> [email protected] >>>> http://lists.xwiki.org/mailman/listinfo/devs >>>> >>> >>> >>> >>> -- >>> Ludovic Dubost >>> Founder and CEO >>> Blog: http://blog.ludovic.org/ >>> XWiki: http://www.xwiki.com >>> Skype: ldubost GTalk: ldubost >>> >> >> >> >> -- >> Ludovic Dubost >> Founder and CEO >> Blog: http://blog.ludovic.org/ >> XWiki: http://www.xwiki.com >> Skype: ldubost GTalk: ldubost >> > > > > -- > Ludovic Dubost > Founder and CEO > Blog: http://blog.ludovic.org/ > XWiki: http://www.xwiki.com > Skype: ldubost GTalk: ldubost > -- Ludovic Dubost Founder and CEO Blog: http://blog.ludovic.org/ XWiki: http://www.xwiki.com Skype: ldubost GTalk: ldubost _______________________________________________ devs mailing list [email protected] http://lists.xwiki.org/mailman/listinfo/devs

