From the CNET interview with Brion:
http://news.cnet.com/8301-17939_109-10103177-2.html

> The text alone is less than 500 MB compressed.

That statement struck me, as I wouldn't have thought that the big wikis
could fit in that, much less all of them.

So I went and spent some CPU on calculations:

I first looked at dewiki:
$ 7z e -so dewiki-20081011-pages-meta-history.xml.7z \
    | sed -n 's/\s*<text xml:space="preserve">\([^<]*\)\(<\/text>\)\?/\1/gp' \
    | bzip2 -9 | wc -c
325915907 bytes = 310.8 MB

Not bad for a 5.1 GB 7z file. :)
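
As an aside, here is a throwaway helper that reproduces the byte-to-MB/GB
conversions I'm doing by hand in this mail (purely illustrative, not part of
the pipelines, and the function name is arbitrary):

$ # tiny ad-hoc helper, only to show the unit conversions (MB = bytes/2^20)
$ tomb() { awk -v b="$1" 'BEGIN { printf "%d bytes = %.1f MB = %.2f GB\n", b, b/2^20, b/2^30 }'; }
$ tomb 325915907
325915907 bytes = 310.8 MB = 0.30 GB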


Then I moved on to enwiki, beginning with the current versions:
$ bzcat enwiki-20081008-pages-meta-current.xml.bz2 \
    | sed -n 's/\s*<text xml:space="preserve">\([^<]*\)\(<\/text>\)\?/\1/gp' \
    | bzip2 -9 | wc -c
253648578 bytes = 241.9 MB

Again, a gigantic file (7.8 GB bz2) was reduced to less than 500 MB.
Maybe it *can* be done after all. The full history has many more
revisions, but its compression ratio is also greater.


So I had to turn to the beast: the enwiki history files. As there
hasn't been a successful enwiki history dump in the last few months, I
used an old dump I had, which is nearly a year old and fills 18 GB.

$ 7z e -so enwiki-20080103-pages-meta-history.xml.7z \
    | sed -n 's/\s*<text xml:space="preserve">\([^<]*\)\(<\/text>\)\?/\1/gp' \
    | bzip2 -9 | wc -c
1092104465 bytes = 1041.5 MB = 1.02 GB


So, where did that 'less than 500 MB' number come from? Also note that
I used bzip2 instead of gzip, so external storage will be using much
more space (plus indexes, ids...).
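
If someone wants to put a number on the bzip2 vs. gzip difference for
exactly this data, a one-pass sketch like the following should do. It
assumes bash process substitution; the gzip size lands on stderr, the
bzip2 size on stdout:

$ 7z e -so dewiki-20081011-pages-meta-history.xml.7z \
    | sed -n 's/\s*<text xml:space="preserve">\([^<]*\)\(<\/text>\)\?/\1/gp' \
    | tee >(gzip -9 | wc -c >&2) \
    | bzip2 -9 | wc -c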

Nonetheless, it is impressive how much the size of *already compressed*
files shrinks just by stripping the revision metadata.

For comparison, dewiki-20081011-stub-meta-history.xml.gz, which contains
the remaining metadata, is 1.7 GB. 1.7 GB + 310.8 MB (about 2 GB in total)
is still much less than the 51.4 GB of dewiki-20081011-pages-meta-history.xml.bz2!


Maybe we should investigate new ways of storing the dumps compressed.
Could we achieve similar gains by increasing the bzip2 block size to
counteract the noise of the revision metadata?
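(For what it's worth, bzip2 -9 already uses its maximum 900 kB block size,
so a really bigger window would mean switching to something like LZMA. A
quick, untested sketch of that experiment, assuming xz is available -
essentially redoing by hand what the .7z dumps already do:

$ 7z e -so dewiki-20081011-pages-meta-history.xml.7z \
    | xz --lzma2=preset=9,dict=64MiB \
    | wc -c    # compare against the 51.4 GB .bz2 and the 5.1 GB .7z
)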
Or perhaps I used a wrong regex, and large chunks of data were not
taken into account?
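One thing worth double-checking: with sed -n '.../gp', only lines where a
substitution actually happened get printed, so continuation lines of
multi-line revisions may be getting dropped. An independent count of the
raw <text> payload, e.g. the awk sketch below, compared against the
uncompressed 'sed ... | wc -c' output, would show whether large chunks
are being missed:

$ 7z e -so dewiki-20081011-pages-meta-history.xml.7z \
    | LC_ALL=C awk '
        # skip empty revisions:  <text xml:space="preserve" />
        /<text[^>]*\/>/      { next }
        # opening tag: start counting after it
        /<text[^>]*>/        { inside = 1; sub(/.*<text[^>]*>/, "") }
        # closing tag on this line: count up to it and stop
        inside && /<\/text>/ { sub(/<\/text>.*/, ""); total += length($0); inside = 0; next }
        # any other line inside a revision text (plus its newline)
        inside               { total += length($0) + 1 }
        END                  { print total, "bytes of <text> payload" }'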

