Interesting. I realize the dump is extremely large, but if 7zip is really the bottleneck, then to me the solutions are straightforward:
1. Offer an uncompressed version of the dump for download. Bandwidth is cheap and downloads can be resumed, unlike this dump process. (A sketch of what a resumable fetch could look like is at the end of this mail.)
2. The WMF offers a service whereby they mail the uncompressed dump to you on a hard drive. You pay for the drive and a service charge.

Cheers,

On Wed, Dec 24, 2008 at 5:03 PM, Erik Zachte <[email protected]> wrote:

> Hi Brian, Brion once explained to me that the post processing of the dump
> is the main bottleneck.
>
> Compressing articles with tens of thousands of revisions is a major
> resource drain.
> Right now every dump is even compressed twice, into bzip2 (for wider
> platform compatibility) and 7zip format (for 20 times smaller downloads).
> This may no longer be needed as 7zip presumably gained better support on
> major platforms over the years.
> Apart from that the job could gain from parallelization and better error
> recovery.
>
> Erik Zachte
>
> ________________________________________
>
> I am still quite shocked at the amount of time the English Wikipedia takes
> to dump, especially since we seem to have close links to folks who work at
> MySQL. To me it seems that one of two things must be the case:
>
> 1. Wikipedia has outgrown MySQL, in the sense that, while we can put data
> in, we cannot get it all back out.
> 2. Despite aggressive hardware purchases over the years, the correct
> hardware has still not been purchased.
>
> I wonder which of these is the case. Presumably #2?
>
> Cheers,
> Brian

--
(Not sent from my iPhone)
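P.S. Since option 1 leans on resumability, here is a minimal sketch in Python of what a resumable fetch of an uncompressed dump could look like, using HTTP Range requests. The URL and file names are hypothetical, and it assumes the server honors Range (replies 206 Partial Content):

    import os
    import urllib.request

    URL = "https://example.org/enwiki-pages-meta-history.xml"  # hypothetical
    DEST = "enwiki-pages-meta-history.xml"

    def resume_download(url, dest, chunk_size=1 << 20):
        # Resume from however many bytes are already on disk.
        offset = os.path.getsize(dest) if os.path.exists(dest) else 0
        req = urllib.request.Request(url, headers={"Range": "bytes=%d-" % offset})
        with urllib.request.urlopen(req) as resp, open(dest, "ab") as out:
            if offset and resp.status != 206:
                raise RuntimeError("server ignored the Range header")
            while True:
                chunk = resp.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)

    resume_download(URL, DEST)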

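P.P.S. On Erik's point that the job could gain from parallelization: below is a rough sketch, not the actual dump code, of compressing pre-split dump chunks into both bzip2 and 7zip concurrently. It assumes bzip2 and the p7zip 7za binary are on PATH, and the chunk file names are made up:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    CHUNKS = ["enwiki-part%02d.xml" % i for i in range(16)]  # hypothetical split

    def compress(chunk):
        # Produce both formats Erik mentions: .bz2 and .7z.
        subprocess.run(["bzip2", "-kf", chunk], check=True)  # -k keeps the input
        subprocess.run(["7za", "a", chunk + ".7z", chunk], check=True)

    # Each worker just waits on child processes, so threads are enough.
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(compress, CHUNKS))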