On Thu, Mar 27, 2014 at 2:32 PM, David <dgpick...@aol.com> wrote:
> Ole,
>
> Yes, the idea is to level the parallel loads without a single point
> bottleneck of a serial reader. In the world of big data, you want the
> parallel processes to use their logical id to seek to the desired position
> in the desired starting file, find the first new record, and read through
> the record containing the desired end offset. Once the file division master
> thread builds a listing of file names, sizes, and sets the chunk size, then
> the parallel threads/processes can be created and begin reading in parallel.
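[Editor's note: the per-worker seek David describes could look roughly like the following shell sketch. It is not an existing GNU Parallel feature; it assumes newline-terminated records, a single input file, and GNU coreutils, and bigfile, N, ID and yourprogram are placeholder names.]

  # Sketch only: worker $ID of $N seeks into bigfile, aligns to the next
  # record boundary, and processes every record that starts inside its
  # byte range (the record straddling the end of the range is included).
  FILE=bigfile
  N=8                                   # number of parallel workers
  ID=3                                  # this worker's logical id, 0..N-1
  SIZE=$(stat -c %s "$FILE")            # GNU stat
  CHUNK=$(( (SIZE + N - 1) / N ))
  END=$(( ID * CHUNK + CHUNK ))

  if [ "$ID" -eq 0 ]; then
    SEEK=0; SKIP=0                      # first worker starts at byte 0
  else
    SEEK=$(( ID * CHUNK - 1 )); SKIP=1  # back up 1 byte so a record starting
  fi                                    # exactly on the boundary is kept

  tail -c +"$(( SEEK + 1 ))" "$FILE" |  # tail -c +K is 1-based
    LC_ALL=C awk -v pos="$SEEK" -v end="$END" -v skip="$SKIP" '
      skip && NR == 1 { pos += length($0) + 1; next }  # drop partial first record
      pos >= end      { exit }                         # next chunk owns this record
                      { print; pos += length($0) + 1 }
    ' | yourprogram

[Backing up one byte before the boundary means a record that begins exactly on the boundary is kept by the worker that owns it, and the straddling record is never processed twice. A wrapper or GNU Parallel itself could start N such workers with ID=0..N-1.]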
I understand the idea. It does require the input to be a file, and not a
pipe, and currently GNU Parallel does not support that.

If GNU Parallel were to support it, I imagine I would have a process that
would figure out where to chop the blocks, and then pass that to the main
program, which would then start a 'dd skip=XXX if=file | your_program'
(one way the chopping could work is sketched after this mail).

The file could possibly be given as -a argument to --pipe:

  parallel -a bigfile --pipe --block 1g --recend '\n\n' yourprogram # Not implemented

If that were implemented, what should this do (multiple -a):

  parallel -a file1 -a file2 --pipe --block 1g --recend '\n\n' yourprogram # Not implemented

> Reading sequentially and sending the records down pipes in an array in
> rotation is an alternative,

That is what --round-robin does now.

> but prone to several problems: 1) One slow pipe
> can block the reading of input. It might be possible to skip slow pipes
> with some sort of per pipe buffering and non-blocking i/o. I wrote a
> buffering pipe fitting that can help soften this, but that adds overhead
> with an extra pipe and process.

I can highly recommend mbuffer: extremely small overhead.

> 2) Sometimes each parallel processing is
> not N times slower than reading a file and writing a pipe. 3) The read is
> not subject to any parallelism to speed it.

Yep. All true.

> Reading file names and assigning them to parallel threads in size descending
> order in zigzag rotation (1 to N to 1 to N . . . ) for size leveling has
> parallel reading, but despite size leveling, often the largest files
> dominate the run time. If there are not N files, there will not be any N
> way parallelism.

I am wondering if that really is a job for GNU Parallel. I often use GNU
Parallel for tasks where file size does not matter at all (e.g. renaming a
file). Would it not make more sense if you sorted the input by file size?

  ls files | sort --by-size | parallel 'yourprogram < {}' # Pseudo code

  find . -type f |
    perl -e 'print map {$_,"\n"} sort { chomp($a,$b); -s $a <=> -s $b } <>' |
    parallel -k ls -l

> It might be nice to have an option to have chunk sizes increased to modulo
> 8192

--block 1M = --block 1048576, so try this:

  cat bigfile | parallel --pipe --block 1M --recend '' wc

> or the like so pages are less split, but really, if there is a
> delimiter, chunk edge pages are always split.

Yep.

> An option for undelimited, fixed length records could provide the record
> size, so chunks could always be in modulo-record-size bytes.

Elaborate on why '--recend "" --block 8k --pipe' does not solve that.

> Does parallel ever worry about unicode, euc and such that might need to work
> in n-byte or variable wide characters? I guess if you knew it was a utf-8
> file, you could find the character boundaries, but not all systems have such
> nice middle of file sync indicators.

GNU Parallel passes that worry on to Perl. So nothing in GNU Parallel
specifically deals with multibyte charsets.

/Ole
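[Editor's note: below is a minimal sketch of the "figure out where to chop the blocks, then start 'dd skip=XXX if=file | your_program'" idea mentioned above. None of it is an existing GNU Parallel option; it assumes newline-terminated records, bash, and a GNU dd that understands iflag=skip_bytes,count_bytes. FILE, N and yourprogram are placeholders.]

  #!/bin/bash
  # Sketch only: compute record-aligned byte ranges once, then let GNU
  # Parallel start one 'dd | yourprogram' pipeline per range.
  FILE=bigfile
  N=8
  SIZE=$(stat -c %s "$FILE")            # GNU stat
  CHUNK=$(( SIZE / N ))                 # assumes SIZE is much larger than N

  # Align each nominal boundary to the start of the next record: the length
  # of the line containing byte (boundary-1), measured from that byte,
  # tells us where the following record begins.
  bounds=(0)
  for i in $(seq 1 $(( N - 1 ))); do
    p=$(( i * CHUNK ))
    len=$(tail -c +"$p" "$FILE" | head -n 1 | wc -c)
    bounds+=( $(( p - 1 + len )) )
  done
  bounds+=( "$SIZE" )

  # Emit one "start length" pair per chunk and hand them to parallel.
  for i in $(seq 0 $(( N - 1 ))); do
    echo "${bounds[i]} $(( bounds[i+1] - bounds[i] ))"
  done |
    parallel --colsep ' ' \
      "dd if=$FILE bs=1M skip={1} count={2} iflag=skip_bytes,count_bytes status=none | yourprogram"

[A hypothetical 'parallel -a bigfile --pipe' could then amount to generating these ranges internally, avoiding the serial read through a pipe altogether.]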