On Thu, Mar 27, 2014 at 2:32 PM, David <dgpick...@aol.com> wrote:
> Ole,
>
> Yes, the idea is to level the parallel loads without a single point
> bottleneck of a serial reader. In the world of big data, you want the
> parallel processes to use their logical id to seek to the desired position
> in the desired starting file, find the first new record, and read through
> the record containing the desired end offset. Once the file division master
> thread builds a listing of file names, sizes, and sets the chunk size, then
> the parallel threads/processes can be created and begin reading in parallel.
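[Editor's note: the per-worker seek David describes could look roughly like the following shell sketch. It is not an existing GNU Parallel feature; it assumes newline-terminated records, a single input file, and GNU coreutils, and bigfile, N, ID and yourprogram are placeholder names.]

  # Sketch only: worker $ID of $N seeks into bigfile, aligns to the next
  # record boundary, and processes every record that starts inside its
  # byte range (the record straddling the end of the range is included).
  FILE=bigfile
  N=8                                   # number of parallel workers
  ID=3                                  # this worker's logical id, 0..N-1
  SIZE=$(stat -c %s "$FILE")            # GNU stat
  CHUNK=$(( (SIZE + N - 1) / N ))
  END=$(( ID * CHUNK + CHUNK ))

  if [ "$ID" -eq 0 ]; then
    SEEK=0; SKIP=0                      # first worker starts at byte 0
  else
    SEEK=$(( ID * CHUNK - 1 )); SKIP=1  # back up 1 byte so a record starting
  fi                                    # exactly on the boundary is kept

  tail -c +"$(( SEEK + 1 ))" "$FILE" |  # tail -c +K is 1-based
    LC_ALL=C awk -v pos="$SEEK" -v end="$END" -v skip="$SKIP" '
      skip && NR == 1 { pos += length($0) + 1; next }  # drop partial first record
      pos >= end      { exit }                         # next chunk owns this record
                      { print; pos += length($0) + 1 }
    ' | yourprogram

[Backing up one byte before the boundary means a record that begins exactly on the boundary is kept by the worker that owns it, and the straddling record is never processed twice. A wrapper or GNU Parallel itself could start N such workers with ID=0..N-1.]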
I understand the idea. It does require the input to be a file, and not a
pipe, and currently GNU Parallel does not support that.

If GNU Parallel were to support it, I imagine I would have a process that
would figure out where to chop the blocks, and then pass that to the main
program, which would then start a 'dd skip=XXX if=file | your_program'
(one way the chopping could work is sketched after this mail).

The file could possibly be given as -a argument to --pipe:

  parallel -a bigfile --pipe --block 1g --recend '\n\n' yourprogram # Not implemented

If that were implemented, what should this do (multiple -a):

  parallel -a file1 -a file2 --pipe --block 1g --recend '\n\n' yourprogram # Not implemented

> Reading sequentially and sending the records down pipes in an array in
> rotation is an alternative,

That is what --round-robin does now.

> but prone to several problems: 1) One slow pipe
> can block the reading of input. It might be possible to skip slow pipes
> with some sort of per pipe buffering and non-blocking i/o. I wrote a
> buffering pipe fitting that can help soften this, but that adds overhead
> with an extra pipe and process.

I can highly recommend mbuffer: extremely small overhead.

> 2) Sometimes each parallel processing is
> not N times slower than reading a file and writing a pipe. 3) The read is
> not subject to any parallelism to speed it.

Yep. All true.

> Reading file names and assigning them to parallel threads in size descending
> order in zigzag rotation (1 to N to 1 to N . . . ) for size leveling has
> parallel reading, but despite size leveling, often the largest files
> dominate the run time. If there are not N files, there will not be any N
> way parallelism.

I am wondering if that really is a job for GNU Parallel. I often use GNU
Parallel for tasks where file size does not matter at all (e.g. renaming a
file). Would it not make more sense if you sorted the input by file size?

  ls files | sort --by-size | parallel 'yourprogram < {}' # Pseudo code

  find . -type f |
    perl -e 'print map {$_,"\n"} sort { chomp($a,$b); -s $a <=> -s $b } <>' |
    parallel -k ls -l

> It might be nice to have an option to have chunk sizes increased to modulo
> 8192

--block 1M = --block 1048576, so try this:

  cat bigfile | parallel --pipe --block 1M --recend '' wc

> or the like so pages are less split, but really, if there is a
> delimiter, chunk edge pages are always split.

Yep.

> An option for undelimited, fixed length records could provide the record
> size, so chunks could always be in modulo-record-size bytes.

Elaborate on why '--recend "" --block 8k --pipe' does not solve that.

> Does parallel ever worry about unicode, euc and such that might need to work
> in n-byte or variable wide characters? I guess if you knew it was a utf-8
> file, you could find the character boundaries, but not all systems have such
> nice middle of file sync indicators.

GNU Parallel passes that worry on to Perl. So nothing in GNU Parallel
specifically deals with multibyte charsets.

/Ole
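[Editor's note: below is a minimal sketch of the "figure out where to chop the blocks, then start 'dd skip=XXX if=file | your_program'" idea mentioned above. None of it is an existing GNU Parallel option; it assumes newline-terminated records, bash, and a GNU dd that understands iflag=skip_bytes,count_bytes. FILE, N and yourprogram are placeholders.]

  #!/bin/bash
  # Sketch only: compute record-aligned byte ranges once, then let GNU
  # Parallel start one 'dd | yourprogram' pipeline per range.
  FILE=bigfile
  N=8
  SIZE=$(stat -c %s "$FILE")            # GNU stat
  CHUNK=$(( SIZE / N ))                 # assumes SIZE is much larger than N

  # Align each nominal boundary to the start of the next record: the length
  # of the line containing byte (boundary-1), measured from that byte,
  # tells us where the following record begins.
  bounds=(0)
  for i in $(seq 1 $(( N - 1 ))); do
    p=$(( i * CHUNK ))
    len=$(tail -c +"$p" "$FILE" | head -n 1 | wc -c)
    bounds+=( $(( p - 1 + len )) )
  done
  bounds+=( "$SIZE" )

  # Emit one "start length" pair per chunk and hand them to parallel.
  for i in $(seq 0 $(( N - 1 ))); do
    echo "${bounds[i]} $(( bounds[i+1] - bounds[i] ))"
  done |
    parallel --colsep ' ' \
      "dd if=$FILE bs=1M skip={1} count={2} iflag=skip_bytes,count_bytes status=none | yourprogram"

[A hypothetical 'parallel -a bigfile --pipe' could then amount to generating these ranges internally, avoiding the serial read through a pipe altogether.]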