On Tue, Feb 17, 2009 at 2:24 PM, Rogan Creswick <[email protected]> wrote:
> On Tue, Feb 17, 2009 at 1:15 PM, Eric Wilhelm
> <[email protected]> wrote:
>> `man 7z` implies that `7z x -so` gives you something resembling
>> `gunzip -c`
>
> Yup -- I just got that to work. 7z does some "too fancy for my
> tastes" pipe detection / etc.
Just following up with the solution I'm using now. I learned a few interesting things in the process, and thought I'd share.

I needed to extract the top N most-revised Wikipedia pages -- each page is in a <page> tag, and each <page> contains a number of <revision> tags.

The first attempt was to use a streaming XML parser to read in each <page> element, count the revisions, and write it to disk if it made the cut. This failed horribly: many Wikipedia pages have more than 3 GB of revisions, Java can't address more than 3 GB on a 32-bit machine, and I was (foolishly) using Java's serialization API to serialize the content through a gzip stream -- that only nets you about 50% compression, whereas serializing plain text through the same stream will compress things down to 20-25% of the original size (or better).

Since the out-of-memory issue took me by surprise (the entire data set is only 17 GB, 7zipped, and I hadn't counted the articles yet), and it was a Friday, I decided to scan the whole archive for <page> tags and record the byte offsets. Grep was the perfect tool (--byte-offset), and ran fast enough to complete over the weekend. Using the output of grep to calculate and sort the page entries by size was a minor adventure, but not too troublesome. (There are over 11 million pages, by the way, and the largest has 52 GB of revision text!)

That obviously dictated a fixed-memory solution, so I hacked out another Java app that does not use an XML parser at all. Instead it searches for <page> tags, streams each page's content through a (100 MB buffered) gzip stream to disk, using a UUID as the file name, and counts the <revision> tags as it goes. The UUIDs and revision counts are tracked, and once the threshold of articles is passed, they are used to delete the smaller files on disk.

Unfortunately, even gzipped, the articles I'm extracting will still need over 110 GB of space -- and 188 hours to extract.
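The offset-recording pass was actually done with grep --byte-offset, but the same fixed-memory scan is easy to sketch in Java. This is just an illustration (the class and method names are mine, not from the original app): it reads the stream one byte at a time and records the starting offset of every "<page>" occurrence, which works here because '<' only appears at the start of the needle.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

// Records the byte offset of every "<page>" occurrence in a stream,
// reading one buffered byte at a time so memory use stays constant --
// roughly what `grep --byte-offset '<page>'` reported per match.
public class PageOffsets {
    static List<Long> scan(InputStream in) throws IOException {
        byte[] needle = "<page>".getBytes(StandardCharsets.US_ASCII);
        List<Long> offsets = new ArrayList<>();
        long pos = 0;       // absolute byte position in the stream
        int matched = 0;    // how many needle bytes matched so far
        int b;
        InputStream buf = new BufferedInputStream(in);
        while ((b = buf.read()) != -1) {
            if (b == needle[matched]) {
                matched++;
                if (matched == needle.length) {
                    // pos is the index of the needle's last byte
                    offsets.add(pos - needle.length + 1);
                    matched = 0;
                }
            } else {
                // '<' occurs only at needle[0], so this simple restart is safe
                matched = (b == needle[0]) ? 1 : 0;
            }
            pos++;
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        String sample = "<mediawiki><page>a</page>..<page>b</page></mediawiki>";
        System.out.println(scan(new ByteArrayInputStream(
                sample.getBytes(StandardCharsets.US_ASCII))));
        // prints [11, 27]
    }
}
```

Sorting the resulting offsets by the gap to the next <page> gives each page's size without ever holding a page in memory.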
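A stripped-down sketch of that second app might look like the following. To be clear about assumptions: the names are invented, the buffer is shrunk from 100 MB so the example stays small, input is taken line-by-line rather than byte-by-byte, and the "delete the smaller files once the threshold is passed" bookkeeping is reduced to returning a map of file-to-revision-count.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;
import java.util.zip.GZIPOutputStream;

public class PageExtractor {
    // Streams each <page>...</page> block through a buffered gzip stream
    // to a UUID-named file, counting <revision> tags along the way.
    // Returns file -> revision count, so the smallest entries can be
    // deleted once the top-N threshold is known.
    static Map<Path, Integer> extract(BufferedReader in, Path outDir)
            throws IOException {
        Map<Path, Integer> counts = new LinkedHashMap<>();
        GZIPOutputStream gz = null;
        Path current = null;
        int revisions = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.contains("<page>")) {
                current = outDir.resolve(UUID.randomUUID() + ".gz");
                // The real app used a ~100 MB buffer; 64 KB here.
                gz = new GZIPOutputStream(new BufferedOutputStream(
                        Files.newOutputStream(current), 64 * 1024));
                revisions = 0;
            }
            if (gz != null) {
                gz.write((line + "\n").getBytes(StandardCharsets.UTF_8));
                if (line.contains("<revision>")) revisions++;
                if (line.contains("</page>")) {
                    gz.close();
                    counts.put(current, revisions);
                    gz = null;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<page>\n<revision>r1</revision>\n<revision>r2</revision>\n</page>\n"
                   + "<page>\n<revision>r3</revision>\n</page>\n";
        Path dir = Files.createTempDirectory("pages");
        System.out.println(extract(
                new BufferedReader(new StringReader(xml)), dir).values());
        // prints [2, 1]
    }
}
```

Memory stays flat regardless of page size, since nothing is parsed or retained beyond the current line and the per-file counters.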
(Thanks to 'pv', the pipe viewer, for providing a progress bar & time estimates!)

I have been able to set up a cromfs that's 7zip-compressed, however, so that's my next attempt: writing to the cromfs instead of through a gzipped stream.

Thanks for all the suggestions!

--Rogan

_______________________________________________
PLUG mailing list
[email protected]
http://lists.pdxlinux.org/mailman/listinfo/plug
