Ole,

Well, I suspect that, again, 'dd skip=XXX if=file' would read sequentially to 
find record delimiters.  The speed and the parallelism come from seeking to an 
offset, looking for the next delimiter, and starting just after it.  I am not 
sure there is currently a general UNIX tool that does that.  If read()/fgets() 
are supported under the covers by mmap() (often true now), then just N pages 
are initially involved for N parallel readers -- no extra reading, page 
faults, etc.
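
Roughly, the core trick for one worker looks like the sketch below, assuming 
newline-delimited records, GNU tail (which seeks rather than reads for '-c +' 
on a regular file), and a made-up starting offset; OFF, FILE and your_program 
are placeholders, and none of this is an existing GNU Parallel option:

    OFF=1073741824                   # say, the start of the second 1 GiB chunk
    tail -c +"$OFF" "$FILE" |        # lseek()s to offset OFF-1 ('+N' is 1-based)
        sed 1d |                     # drop through the first newline from there,
                                     # resyncing on a record boundary
        your_program
    # Starting one byte early means a boundary that lands exactly on a record
    # start still keeps that record.  The previous worker, symmetrically, reads
    # through the first newline at or after the last byte of its own range, so
    # every record goes to exactly one worker.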

It would be nice to have an alternative to pipe input that is more powerful 
and parallel-friendly than 'cat * |'.  Parallel could include and integrate 
such an input reader.  There might be pipes involved under the covers, unless 
you want to do a syscall() replacement of read() in the child programs -- 
which does not sound simple, portable or reliable -- to hide the initial seek, 
the delimiter search and the early EOF!  Including concatenation and 
fixed-length-record support would also help the user make the right choice.  
One just points parallel at a pile of files to be read N ways in parallel, and 
parallel takes care of the concatenation and the division for even loading.
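
The bookkeeping is modest.  A hypothetical wrapper (not a GNU Parallel 
feature; N and the '*.log' pile are placeholders, GNU stat is assumed, and the 
file names are assumed to contain no whitespace) could turn a pile of files 
into "file, starting offset" pairs, one per worker, for a seek-and-resync 
reader like the one above:

    N=8                                       # desired number of workers
    stat -c '%s %n' -- *.log |                # GNU stat: "size name" per file
    awk -v n="$N" '
        { size[NR] = $1; name[NR] = $2; total += $1 }
        END {
            chunk = int((total + n - 1) / n)  # bytes per worker across the pile
            pos = 0; k = 0    # pos: start of file i in the logical concatenation
            for (i = 1; i <= NR; i++) {
                # every worker whose start offset falls inside file i starts there
                while (k < n && k * chunk < pos + size[i]) {
                    printf "%s %d\n", name[i], k * chunk - pos
                    k++
                }
                pos += size[i]
            }
        }'

Each pair then drives one worker; since the division is by bytes over the 
logical concatenation, uneven file sizes do not unbalance the workers.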

Parallel has become a sort of shell extension, and this could add a valuable 
facet to its capabilities.  Also, parallel processing often produces multiple 
files in one stage that need to be fed into a subsequent stage.  Having this 
ensures that any variation in file sizes in stage n does not imbalance stage 
n+1.

To support such multiple-stage processing without files, I suppose parallel 
might be made capable of reading records from N named pipes in a non-blocking, 
record-integrity-preserving fashion, to feed the next stage with minimum 
latency and no intermediate files, file names or file space?  It would be 
somewhat like 'sort -m <(sort a) <(sort b) <(sort c) ...', which by virtue of 
the nature of sort has low pipe-blocking overhead.  It is the opposite of 
round robin: reading the pipes in rotation, non-blocking, taking only full 
records (perhaps using poll()/select() -- I have found concurrent (lwp) 
threads with blocking reads to carry higher overhead).  The pipe names might 
be managed under the covers by parallel, like the ksh/bash <(...) construct 
(well, hopefully better, as bash leaves a trail of named pipes in /var/tmp on 
systems without /dev/fd/# pipes.  How do you know, in general, that a pipe is 
never going to be put into use again?)
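
Done by hand today, that staging looks something like the sketch below 
(stage1, stage2 and the part files are placeholders; the point is only that 
the pipes are created, fed, merged and cleaned up with no intermediate data 
files left behind):

    dir=$(mktemp -d) || exit 1
    trap 'rm -rf "$dir"' EXIT              # no trail of named pipes afterwards
    mkfifo "$dir/p1" "$dir/p2" "$dir/p3"
    stage1 < part1 > "$dir/p1" &           # stage-n workers write whole records
    stage1 < part2 > "$dir/p2" &
    stage1 < part3 > "$dir/p3" &
    sort -m "$dir/p1" "$dir/p2" "$dir/p3" | stage2   # stage n+1 reads as data arrives
    wait

A poll()/select() reader could do the same for streams that are not 
merge-friendly the way sorted input to 'sort -m' is.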

I will look into mbuffer -- rewriting things must be rife among UNIX 
developers!  I rewrote comm at one point.  I also have a rewrite of join, 
m1join, that can do many-to-one joins without seeks (pipe-friendly); it has no 
Cartesian capability, which is not pipe-friendly, but many problems are 
one-to-one or many-to-one, not many-to-many.

Best,

David

 

-----Original Message-----
From: Ole Tange <o...@tange.dk>
To: David <dgpick...@aol.com>
Cc: parallel <parallel@gnu.org>
Sent: Thu, Mar 27, 2014 11:05 am
Subject: Re: File divide to feed parallel


On Thu, Mar 27, 2014 at 2:32 PM, David <dgpick...@aol.com> wrote:
> Ole,
>
> Yes, the idea is to level the parallel loads without a single point
> bottleneck of a serial reader.  In the world of big data, you want the
> parallel processes to use their logical id to seek to the desired position
> in the desired starting file, find the first new record, and read through
> the record containing the desired end offset.  Once the file division master
> thread builds a listing of file names, sizes, and sets the chunk size, then
> the parallel threads/processes can be created and begin reading in parallel.

I understand the idea. It does require the input to be a file, and not
a pipe, and currently GNU Parallel does not support that.

If GNU Parallel was to support it, I imagine I would have a process
that would figure out where to chop the blocks, and then pass that to
the main program which would then start a 'dd skip=XXX if=file |
your_program'

The file could possibly be given as -a argument to --pipe:

    parallel -a bigfile --pipe --block 1g --recend '\n\n' yourprogram  # Not implemented

If that was implemented, what should this do (multiple -a):

    parallel -a file1 -a file2 --pipe --block 1g --recend '\n\n' yourprogram  # Not implemented

> Reading sequentially and sending the records down pipes in an array in
> rotation is an alternative,

That is what --round-robin does now.

> but prone to several problems:  1) One slow pipe
> can block the reading of input.  It might be possible to skip slow pipes
> with some sort of per pipe buffering and non-blocking i/o.  I wrote a
> buffering pipe fitting that can help soften this, but that adds overhead
> with an extra pipe and process.

I can highly recommend mbuffer: extremely small overhead.

>  2) Sometimes each parallel processing is
> not N times slower than reading a file and writing a pipe.  3) The read is
> not subject to any parallelism to speed it.

Yep. All true.

> Reading file names and assigning them to parallel threads in size descending
> order in zigzag rotation (1 to N to 1 to N . . . ) for size leveling has
> parallel reading, but despite size leveling, often the largest files
> dominate the run time.  If there are not N files, there will not be any N
> way parallelism.

I am wondering if that really is a job for GNU Parallel? I often use
GNU Parallel for tasks, where file size does not matter at all (e.g.
rename a file).

Would it not make more sense if you sorted the input by file size?

    ls files | sort --by-size | parallel 'your program < {}'

    find . -type f | perl -e 'print map {$_,"\n"} sort { chomp($a,$b); -s $a <=> -s $b } <>' | parallel -k ls -l

> It might be nice to have an option to have chunk sizes increased to modulo
> 8192

--block 1M = --block 1048576, so try this:

    cat bigfile | parallel --pipe --block 1M --recend '' wc

> or the like so pages are less split, but really, if there is a
> delimiter, chunk edge pages are always split.

Yep.

> An option for undelimited, fixed length records could provide the record
> size, so chunks could always be in modulo-record-size bytes.

Elaborate why '--recend "" --block 8k --pipe' does not solve that.

> Does parallel ever worry about unicode, euc and such that might need to work
> in n-byte or variable wide characters?  I guess if you knew it was a utf-8
> file, you could find the character boundaries, but not all systems have such
> nice middle of file sync indicators.

GNU Parallel passes that worry on to Perl. So nothing in GNU Parallel
specifically deals with multibyte charsets.


/Ole

 
