Ole,

Yes, the idea is to level the parallel loads without the single-point 
bottleneck of a serial reader.  In the world of big data, you want each 
parallel process to use its logical id to seek to the desired position in the 
desired starting file, find the first new record, and read through the record 
containing the desired end offset.  Once the file-division master thread 
builds a listing of file names and sizes and sets the chunk size, the parallel 
threads/processes can be created and begin reading in parallel.
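
To make that concrete, here is a rough Python sketch of what one worker would 
do for a single newline-delimited file (just an illustration of the scheme 
above, not anything GNU Parallel actually does; the function name and 
arguments are mine):

  import os

  def read_chunk(path, worker_id, n_workers):
      # Each worker owns the records whose first byte falls in [start, end).
      size = os.path.getsize(path)
      chunk = size // n_workers
      start = worker_id * chunk
      end = size if worker_id == n_workers - 1 else start + chunk
      with open(path, "rb") as f:
          if start > 0:
              f.seek(start - 1)
              f.readline()          # skip the record the previous worker finishes
          while f.tell() < end:     # read through the record containing 'end'
              line = f.readline()
              if not line:
                  break
              yield line

No record is lost or duplicated: a record straddling a chunk edge is read 
entirely by the worker whose chunk it starts in, and the next worker's first 
readline() skips past it.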

Reading sequentially and sending the records down an array of pipes in 
rotation is an alternative, but it is prone to several problems:  1) One slow 
pipe can block the reading of input.  It might be possible to skip slow pipes 
with some sort of per-pipe buffering and non-blocking i/o; I wrote a buffering 
pipe fitting that can help soften this, but that adds overhead in the form of 
an extra pipe and process (a rough sketch of the idea follows below).  2) The 
per-record processing is not always N times slower than reading a file and 
writing a pipe, so the single reader can still be the bottleneck.  3) The read 
itself is not subject to any parallelism to speed it up.
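
For what it is worth, the "skip slow pipes" idea in 1) might look roughly like 
this (a hand-waved sketch of the idea, not the pipe fitting I actually wrote; 
the pipe fds and the record source are assumed to come from elsewhere):

  import os, select
  from collections import deque

  def drain(fd, queue):
      # Write as much queued data as the pipe will accept right now.
      while queue:
          try:
              n = os.write(fd, queue[0])
          except BlockingIOError:
              return                   # pipe buffer is full; try again later
          if n < len(queue[0]):        # partial write
              queue[0] = queue[0][n:]
              return
          queue.popleft()

  def distribute(records, pipe_fds):
      for fd in pipe_fds:
          os.set_blocking(fd, False)   # a full pipe must not block the reader
      queues = {fd: deque() for fd in pipe_fds}
      for i, rec in enumerate(records):
          queues[pipe_fds[i % len(pipe_fds)]].append(rec)
          pending = [fd for fd in pipe_fds if queues[fd]]
          _, writable, _ = select.select([], pending, [], 0)  # poll, don't wait
          for fd in writable:
              drain(fd, queues[fd])
      while any(queues.values()):      # final flush, now willing to wait
          pending = [fd for fd in pipe_fds if queues[fd]]
          _, writable, _ = select.select([], pending, [])
          for fd in writable:
              drain(fd, queues[fd])

The cost is exactly the buffering and bookkeeping mentioned above: slow pipes 
just accumulate a backlog in memory instead of stalling the reader.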

Reading file names and assigning them to parallel threads in descending size 
order, in zigzag rotation (1 to N, then N to 1, then 1 to N . . . ) for size 
leveling, does give parallel reading, but despite the size leveling the 
largest files often dominate the run time.  And if there are not N files, 
there is no N-way parallelism at all.
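
Concretely, the zigzag assignment is just this (a toy sketch; files is a list 
of (name, size) pairs):

  def zigzag_assign(files, n_workers):
      # Deal the files out largest first: 1..N, then N..1, then 1..N, ...
      buckets = [[] for _ in range(n_workers)]
      by_size = sorted(files, key=lambda f: f[1], reverse=True)
      for i, f in enumerate(by_size):
          pos = i % (2 * n_workers)
          worker = pos if pos < n_workers else 2 * n_workers - 1 - pos
          buckets[worker].append(f)
      return buckets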

It might be nice to have an option to round chunk sizes up to a multiple of 
8192 or the like so that fewer pages are split across chunks, but really, if 
there is a delimiter, chunk-edge pages are always split.

An option for undelimited, fixed-length records could take the record size, so 
chunks could always be a multiple of the record size in bytes.
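
Either option is just rounding the nominal chunk size up, where align would be 
8192 for page alignment or the fixed record length for undelimited records (a 
sketch, not an existing parallel option):

  def aligned_chunk_size(total_size, n_chunks, align):
      nominal = -(-total_size // n_chunks)   # ceiling division
      return -(-nominal // align) * align    # round up to a multiple of align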

Does parallel ever worry about Unicode, EUC and the like, which might need to 
be handled as n-byte or variable-width characters?  I guess if you knew it was 
a UTF-8 file, you could find the character boundaries, but not all encodings 
have such nice middle-of-file sync indicators.
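
The UTF-8 trick is that every continuation byte matches 10xxxxxx, so from an 
arbitrary byte offset you step forward at most three bytes to reach a 
character boundary (sketch):

  def next_utf8_boundary(buf, offset):
      # Skip continuation bytes (0b10xxxxxx) until a lead byte or ASCII byte.
      while offset < len(buf) and (buf[offset] & 0xC0) == 0x80:
          offset += 1
      return offset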

Best,

David


-----Original Message-----
From: Ole Tange <o...@tange.dk>
To: David <dgpick...@aol.com>
Cc: parallel <parallel@gnu.org>
Sent: Thu, Mar 27, 2014 4:51 am
Subject: Re: File divide to feed parallel


On Wed, Mar 26, 2014 at 9:32 PM, David <dgpick...@aol.com> wrote:
> ETL programs like Ab Initio know how to tell parallel processes to split up
> big files and process each part separately, even when the files are linefeed
> delimited (they all agree to search up (or down) for the dividing linefeed
> closest to N bytes down file).  Does anyone know of a utility that can split
> a file this way (without reading it sequentially)?  Is this in gnu parallel?

GNU Parallel will do that except it will read it sequentially.

> It'd be nice to be able to take a list of mixed size files and divide them
> by size into N chunks of approximately equal lines, estimated using byte
> sizes and with an algorithm for searching for the record delimiter
> (linefeed) such that no records are lost.  Sort of a mixed input leveller
> for parallel loads.  If it is part of parallel, then parallel can launch
> processing for each chunk and to combine the chunks.

That is what --pipe does (except it reads sequentially):

  cat files* | parallel --pipe --block 10m wc

/Ole
