ArielGlenn has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/325945 )
Change subject: cleanup of README for general configuration and sample config file
......................................................................

cleanup of README for general configuration and sample config file

* whitespace cleanup
* get rid of unused config options halt, forcenormal, perdumpindex in docs and/or sample config
* add docs for stubs options, per-wiki config
* get rid of dead dblists in sample config, add tabledocs option
* add other standard options to sample config to fill it out some

Bug: T152679
Change-Id: I6a80ecdee474449d200979fca2ffe0839f45f0a4
---
M xmldumps-backup/doc/README.config
M xmldumps-backup/samples/wikidump.conf.sample
2 files changed, 97 insertions(+), 63 deletions(-)

git pull ssh://gerrit.wikimedia.org:29418/operations/dumps refs/changes/45/325945/1

diff --git a/xmldumps-backup/doc/README.config b/xmldumps-backup/doc/README.config
index 2e17831..d4d76f4 100644
--- a/xmldumps-backup/doc/README.config
+++ b/xmldumps-backup/doc/README.config
@@ -24,14 +24,14 @@
 ===Structure of a configuration file

-Each section of the configuration file starts with a name in brackets, with
+Each section of the configuration file starts with a name in brackets, with
 no leading spaces. For example:

 [wiki]

 This would introduce the options related to the wikis that are processed.

-The following sections are recognized and must be present, even if no
+The following sections are recognized and must be present, even if no
 configuration options are provided for the section:

 wiki, output, reporting, database, tools, cleanup, chunks
@@ -43,51 +43,44 @@
 The wiki section accepts the following configuration options:

-dblist -- File with list of all databases for which dumps will be generated
-                Default value: none
-skipdblist -- ... except for the ones in this file. (This is a bit odd;
-                why not just list the ones you want and be done with it?
-                Because the WMF list is generated automatically and used
-                for other things, so it is not feasible to remove dbs
-                from it by hand and still keep it in sync as new projects
-                are created.)
-                Default value: none
-privatelist -- File with list of databases which should have dumps produced
-                that are put in the "private" dirctory. At WMF this means
-                wikis that are not publically readable by the world.
-                Default value: none
-flaggedrevslist -- File with list of databases which have flagged revisions
-                enabled. (Really, we should be able to determine this
-                another way instead of keeping a separate list, right?)
-wikidatalist -- File with list of databases which act as a wikibase
-                repo. For Wikimedia projects this currently consists
-                of the project 'wikidata'.
-globalusagelist -- File with list of databases which act as a media
-                repo with the GlobalUsage extension. For Wikimedia projects
-                this currently consists of the project 'commons'.
-biglist -- File with list of large wikis for which no history dumps are
-                generated because they are too huge. (This must be an old
-                deprecated option; these days we do not care how big they
-                are, we dump them anyways.)
-                Default value: none
-dir -- Full path to the root directory of the MediaWiki installation for which
-                dumps are produced. This assumes one installation for
-                multiple wikis, nd therefore one LocalSettings.php or
+dblist -- File with list of all databases for which dumps will be generated
+                Default value: none
+skipdblist -- ... except for the ones in this file. (This is a bit odd;
+                why not just list the ones you want and be done with it?
+                Because the WMF list is generated automatically and used
+                for other things, so it is not feasible to remove dbs
+                from it by hand and still keep it in sync as new projects
+                are created.)
+                Default value: none
+privatelist -- File with list of databases which should have dumps produced
+                that are put in the "private" dirctory. At WMF this means
+                wikis that are not publically readable by the world.
+                Default value: none
+flowlist -- File with list of databases which have the Flow extension
+                enabled on them; these will have Flow page content dumped.
+                Default value: none
+dir -- Full path to the root directory of the MediaWiki installation
+                for which dumps are produced. This assumes one installation
+                for multiple wikis, nd therefore one LocalSettings.php or
                 equivalent that covers all the projects. At WMF this is done
-                by having the files InitialiseSetttings.php and
+                by having the files InitialiseSetttings.php and
                 CommonSettings.php which have various if stanzas depending on
                 what it enabled on specific projects.
-                Default value: none
-halt -- what does this do?
-                Default value: 0
+                Per-wiki configuration of this option can be done in separate
+                sections, as described later.
+                Default value: none
+tablejobs -- Full path to the yaml file describing the tables to be dumped
+                via mysql for each wiki. It is fine to add tables here that
+                do not exist on all wikis; table existence will be checked
+                before a dump is attempted.

 Of those options, the following are required: ...

 === Output section
-public -- full path to directory under which all dumps will be created,
-                in subdirectories named for the name of the database
+public -- full path to directory under which all dumps will be created,
+                in subdirectories named for the name of the database
                 (wikiproject) being dumped, in subdirectories by date
                 Default value: /dumps/public
 private -- full path to directory under which all dumps of private wikis
@@ -98,22 +91,19 @@
 temp -- full path to directory under which temporary files will be created;
                 this should not be the same as the public or private directory.
                 Default value: /dumps/temp
-index -- name of the top-level index file for all projects that is
+index -- name of the top-level index file for all projects that is
                 automatically created by the monitoring process
                 Default value: index.html
 webroot -- url to root of the web directory which serves the public files
                 (this is simply the web url that gets people to the content in
                 the "public" directory defined earlier)
                 Default value: http://localhost/dumps
-templatedir -- directory in which various template files such as those for mail or
-                error reports, rss feed updates or the per-project-and-date html files
+templatedir -- directory in which various template files such as those for mail or
+                error reports, rss feed updates or the per-project-and-date html files
                 are found
                 Default value: home
-perdumpindex -- name of the index file created for a dump for a given project
-                on a given date
-                Default value: index.html

-The above options do not have to be specified in the config file,
+The above options do not have to be specified in the config file,
 since default values are provided.

 === Reporting section
@@ -133,7 +123,7 @@
                 any more
                 Default value: 3600

-The above options do not have to be specified in the config file,
+The above options do not have to be specified in the config file,
 since default values are provided.

 === Database section
@@ -146,7 +136,7 @@
                 config value has.
                 Default value: 16M

-The above options do not have to be specified in the config file,
+The above options do not have to be specified in the config file,
 since default values are provided.

 === Tools section
@@ -173,11 +163,11 @@
                 Default value:/bin/grep
 checkforbz2footer -- Location of the checkforbz2footer binary
                 This is part of the mwbzutils package.
-                Default value: /usr/local/bin/checkforbz2footer
+                Default value: /usr/local/bin/checkforbz2footer
 recompressxml -- Location of the recompressxml binary
                 Default value: /usr/local/bin/recompressxml

-The above options do not have to be specified in the config file,
+The above options do not have to be specified in the config file,
 since default values are provided.

 === Cleanup section
@@ -185,34 +175,34 @@
                 removing the oldest one each time a new one is created
                 Default value: 3

-The above option does not have to be specified in the config file,
+The above option does not have to be specified in the config file,
 since a default is provided.

 === Chunks section
-chunksEnabled -- buggy. set to any value to enable. Why? Because
+chunksEnabled -- buggy. set to any value to enable. Why? Because
                 any string value counts as "true", even the value... "False" :-D
                 Default value: False
 pagesPerChunkHistory Set to a comma separated ist of starting page ID nums
-                in order to generate a set of stub files each one
+                in order to generate a set of stub files each one
                 starting from the next pageID.
                 Example: pagesPerChunkHistory=5000,5000,100000,100000
                 This would generate four chunks, containing:
-                1 to 5000, 5001 through 10000, 10001 through 110000,
+                1 to 5000, 5001 through 10000, 10001 through 110000,
                 110001 through end
                 Alternatively you can provide one number in which case
                 the job will be split into chunks each containing that
                 number of pages.
                 Example: pagesPerChunkHistory=50000
                 This will generate a number of chunks with pages from
-                1 through 50000, 50001 through 100000, 100001 through
+                1 through 50000, 50001 through 100000, 100001 through
                 150000, and so on.
                 Default value: False
 revsPerChunkHistory -- currently disabled, do not use!
                 Default value: False
-pagesPerChunkAbstract -- as pagesPerChunkHistory but for the abstract
+pagesPerChunkAbstract -- as pagesPerChunkHistory but for the abstract
                 generation phase
                 Default value: False
 checkpointTime -- save checkpoints of files containing revision text
@@ -223,12 +213,28 @@
                 written, and opening a new file for the next
                 portion of the XML output. This can be useful if you
                 want to produce a large number of smaller files as input
-                to XML-crunching scripts, or if you are dumping
-                a very large wiki which has a tendency to fail
+                to XML-crunching scripts, or if you are dumping
+                a very large wiki which has a tendency to fail
                 somewhere in the middle (*cough*en wikipedia*cough*).
                 Default value: 0 (no checkpoints produced)

-The above options do not have to be specified in the config file,
+The above options do not have to be specified in the config file,
+since default values are provided.
+
+=== Stubs section (i.e.: [stubs])
+orderrevs -- set to 1 if it is desired that the dump is ordered
+                by revision id within each page
+                Default: 0 (false)
+minpages -- stubs (revision metadata) are retrieved in smallish
+                (hopefully) resultsets such that the retrieval query
+                for any set is not too slow; specify minimum number
+                of pages for which to retrieve revisions
+                Default: 1
+maxrevs -- maximum number of revisions to retrieve at one time,
+                subject to the minpages setting
+                Default: 50000
+
+The above options do not have to be specified in the config file,
 since default values are provided.

 === Other formats section (i.e.: [otherformats])
@@ -236,5 +242,26 @@
                 compression of pages-articles.
                 Default value: 0 (no multistream files produced)

-The above options do not have to be specified in the config file,
+The above options do not have to be specified in the config file,
 since default values are provided.
+
+=== Per-wiki configuration
+The following settings may be overriden for specific wikis by specifying
+their name (the name of the db in the database) as a section header,
+e.g. [elwiktionary]:
+
+dir
+user
+password
+max_allowed_packet
+orderrevs
+minpages
+maxrevs
+multistream
+chunksEnabled
+jobsperbatch
+pagesPerChunkHistory
+pagesPerChunkAbstract
+chunksForAbstract
+checkpointTime
+recombineHistory
diff --git a/xmldumps-backup/samples/wikidump.conf.sample b/xmldumps-backup/samples/wikidump.conf.sample
index f0c1911..6394952 100644
--- a/xmldumps-backup/samples/wikidump.conf.sample
+++ b/xmldumps-backup/samples/wikidump.conf.sample
@@ -4,17 +4,17 @@
 dblist=/home/ariel/src/mediawiki/testing/backup/all.dblist
 skipdblist=/home/ariel/src/mediawiki/testing/backup/skip.dblist
 privatelist=/home/ariel/src/mediawiki/testing/backup/private.dblist
-flaggedrevslist=/home/ariel/src/mediawiki/testing/backup/flagged.dblist
-wikidatalist=/home/ariel/src/mediawiki/testing/backup/wikidata.dblist
-biglist=/home/ariel/src/mediawiki/testing/backup/big.dblist
+flowlist=/home/ariel/src/mediawiki/testing/backup/flow.dblist
 dir=/home/ariel/src/mediawiki/1.16wmf4/phase3
-forcenormal=0
+tablejobs=/home/ariel/srv/mediawiki/testing/backup/tablejobs.yaml

 [output]
 public=/home/ariel/src/mediawiki/testing/dumps/public
 private=/home/ariel/src/mediawiki/testing/dumps/private
+temp=/home/ariel/src/mediawiki/testing/dumps/temp
 index=backup-index.html
 webroot=http://localhost/mydumps
+templatedir=/home/ariel/src/mediawiki/testing/dumps/templs

 [reporting]
 staleage=3600
@@ -26,6 +26,7 @@
 [database]
 user=root
 password=""
+max_allowed_packet=32M

 [tools]
 php=/usr/bin/php
@@ -44,3 +45,9 @@
 chunksEnabled=1
 pagesPerChunkHistory=10000,50000,50000,50000,50000
 pagesPerChunkAbstract=100000,100000
+
+[otherformats]
+multistream=1
+
+[elwikt]
+dir=/var/www/html/elwikt

--
To view, visit https://gerrit.wikimedia.org/r/325945
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I6a80ecdee474449d200979fca2ffe0839f45f0a4
Gerrit-PatchSet: 1
Gerrit-Project: operations/dumps
Gerrit-Branch: master
Gerrit-Owner: ArielGlenn <ar...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
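
The per-wiki override and the pagesPerChunkHistory example documented in this change can be illustrated with a short Python sketch. This is not the dumps code itself: the helper names get_setting and chunk_ranges are hypothetical, the fallback order (per-wiki section first, then the general section) is an assumption based on the README wording, and the config text is a trimmed stand-in modelled on wikidump.conf.sample.

```python
import configparser

# Trimmed stand-in for wikidump.conf.sample; values are placeholders.
SAMPLE = """
[wiki]
dir=/home/ariel/src/mediawiki/1.16wmf4/phase3

[chunks]
chunksEnabled=1
pagesPerChunkHistory=5000,5000,100000,100000

[elwikt]
dir=/var/www/html/elwikt
"""


def get_setting(conf, wiki, section, option):
    """Hypothetical helper: use the per-wiki section if it sets the
    option, otherwise fall back to the general section."""
    if conf.has_option(wiki, option):
        return conf.get(wiki, option)
    return conf.get(section, option)


def chunk_ranges(spec):
    """Turn a comma-separated pagesPerChunkHistory value into
    (first_page_id, last_page_id) pairs, as in the README example.
    The final chunk runs to the end, shown here as None.
    Only the multi-value form is handled in this sketch."""
    sizes = [int(s) for s in spec.split(",")]
    if len(sizes) < 2:
        raise ValueError("single-number form is not covered by this sketch")
    ranges = []
    start = 1
    for size in sizes[:-1]:
        ranges.append((start, start + size - 1))
        start += size
    ranges.append((start, None))
    return ranges


conf = configparser.ConfigParser()
conf.read_string(SAMPLE)

elwikt_dir = get_setting(conf, "elwikt", "wiki", "dir")
other_dir = get_setting(conf, "enwiki", "wiki", "dir")
history_chunks = chunk_ranges(
    get_setting(conf, "elwikt", "chunks", "pagesPerChunkHistory"))
```

Here the wiki named elwikt picks up its own dir, any wiki without a section of its own (enwiki in this sketch) gets the value from [wiki], and the four-value pagesPerChunkHistory setting yields the four page ranges listed in the README text.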