I have earlier encouraged you to post examples of how you use GNU Parallel.

Today I had to split a 200 GB gz-file into smaller files. The file
contained records of 4 lines each, so I had to unpack the .gz file, cut it
into chunks of around 10 MB on 4-line record boundaries, and gzip each
chunk under a unique name:

zcat big.gz | parallel --block 10M -L4 --pipe gzip -1 '>'small.{#}.gz

The limiting factor here was GNU Parallel itself, which is not uncommon
when using --pipe.

The functions spreadstdin() and write_record_to_pipe() are to blame. They
could be sped up by rewriting them in C/C++. But it might even be
sufficient to split the work into a reader process (which would read a
chunk, find the split point, and put it on a queue), a few writer processes
(which, given a chunk, would write it to the user's program), and a manager
process (which would communicate between the reader and the writers and
spawn new writer processes if needed), so that fork does not have to be
called for every block. Any takers?


/Ole
