Anthony wrote:
> I've looked at the numbers and thought about this in detail and I don't
> think so. What definitely *would* be much more user friendly is to use a
> compression scheme which allows random access, so that end users don't
> have to decompress everything all at once in the first place.
>
> The uncompressed full history English Wikipedia dump is reaching (and
> more likely has already exceeded) the size which will fit on the largest
> consumer hard drives. So just dealing with such a large file is a problem
> in itself. And "an enormous text file" is not very useful without an
> index, so you've gotta import the thing into some sort of database
> anyway, which, unless you're a database guru, is going to take longer
> than a simple decompression.
>
> In the long term (and considering how long it's taking to just produce a
> usable dump, the long term may never come), the most user friendly dump
> would already be compressed, indexed, and ready for random access, so a
> reuser could just download and go (or even download on the fly as
> needed). It could be done, but I make no bet on whether or not it will
> be done.
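For anyone unfamiliar with the idea, here's a minimal sketch of that kind of indexed, random-access compression: each record is compressed as an independent stream, and a small side index maps a record to its byte offset, so a reader can decompress just the one record it wants. This is a toy illustration only (the record layout and index format here are made up for the example, not anyone's actual dump format):

```python
import bz2

# Toy "multistream" archive: each record is a complete, independently
# compressed bzip2 stream, so any one of them can be decompressed
# without touching the rest of the file.
records = ["<page>Alpha</page>", "<page>Beta</page>", "<page>Gamma</page>"]

archive = bytearray()
index = {}  # record number -> (byte offset, compressed length); hypothetical index format
for i, text in enumerate(records):
    blob = bz2.compress(text.encode("utf-8"))
    index[i] = (len(archive), len(blob))
    archive += blob

def read_record(archive, index, i):
    """Random access: seek to the stream's offset and decompress only it."""
    offset, length = index[i]
    return bz2.decompress(bytes(archive[offset:offset + length])).decode("utf-8")

print(read_record(archive, index, 1))  # -> <page>Beta</page>
```

Since every chunk is a plain bzip2 stream, a dump built this way stays readable by ordinary bzip2 tools (they just decompress the streams one after another), while a reader that has the index gets random access for free.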
I did make indexed, random-access, backwards-compatible XML dumps:
http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040812.html

It wouldn't be hard to plug into the dump process (just replace bzip2 with a new DumpPipeOutput), but so far nobody has seemed interested in it. And there's the added benefit of the offline reader I implemented using those files.

_______________________________________________
foundation-l mailing list
[email protected]
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
