Pádraig Brady wrote:
> Glen Lenker wrote:
>> On Wed, Feb 11, 2009 at 10:12:42PM -0800, Nima Nikzad wrote:
>>> Hello coreutils community,
>>> I am new to the exciting world of open source development and thought I
>>> should introduce myself and get some feedback on what I am working on.
>>> My name is Nima Nikzad and I, along with a small group of students, am
>>> working on changes to coreutils as part of a project for Prof. Paul
>>> Eggert's software engineering course at UCLA. Specifically, we are
>>> looking at how we can better utilize modern multi-core processors when
>>> sorting and merging files. My group is first taking a look at how to
>>> parallelize the external sort phase of sort's operation. Another group
>>> of students is taking a look at in-memory sort while we are working on
>>> this.
>>>
>>> Has anyone already tried their hand at parallelizing this code? Are
>>> there any special considerations we should keep in mind while moving
>>> forward? My group was looking to tackle the problem by using OpenMP, as
>>> we have some experience working with it in the past and like that we
>>> can do quite a bit with it while (hopefully) having little impact on
>>> the structure of the existing code. Does anyone have any feedback on
>>> threading technologies that would be appropriate for this project, or
>>> does anyone think OpenMP is a poor choice?
>>>
>>> I look forward to hearing your suggestions! We are looking to have
>>> something implemented and benchmarked soon and I will be sure to keep
>>> in contact with the community. Please feel free to reach me at
>>> nnikzad at ucla.edu. Thank you!
>>>
>>> - Nima Nikzad
>>
>> Hi everyone,
>>
>> My name is Glen Lenker. I am the project leader for the second group
>> working to parallelize sort under Paul Eggert.
>>
>> As mentioned above, my group is primarily focused on parallelizing the
>> "in-memory" sort, but at the moment we are still considering working to
>> parallelize other aspects of the current sort as well. At the moment we
>> are considering using the Gnulib thread module as our threading library.
>>
>> Jim: We heard from Paul Eggert today that you have looked into how sort
>> would benefit from parallelization. I am particularly interested in your
>> approach. If you don't mind, perhaps you could start the discussion on
>> this.
>
> Well, I was thinking along the lines of something more "UNIXy".
>
> I'm worried a threading solution would complicate things.
> The trade-off is increased data copying between processes,
> but that also gives you the advantage of allowing you to
> run the processes on separate hosts if desired.
>
> Also, the "UNIXy" solution should be quite easy to implement
> and act as a baseline for comparison for any threading solution
> you might think is better.
>
> Parameters to vary when testing solutions would be
> (a rough timing sketch follows the list):
>  * files on SSD/ramdisk vs non-cached files on hard disk
>  * data < ram_size
>  * data > ram_size
>  * number of CPUs > data_size/ram_size
>  * number of CPUs < data_size/ram_size
>  * expensive sorting (-g for example) vs simple (ASCII data)
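>
> As a rough illustration only (assuming a Linux box, root access to
> drop the page cache, a test file big.txt, and some candidate wrapper
> called parallel_sort here purely as a placeholder), the uncached-disk
> comparison could be timed along these lines:
>
>   # flush dirty pages and drop the page cache so the hard disk
>   # is actually hit, then time the baseline sort
>   sync
>   echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
>   time sort big.txt >/dev/null
>
>   # same again for the candidate parallel wrapper
>   sync
>   echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
>   time ./parallel_sort big.txt >/dev/null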
> Here is a very quick idea of splitting the data to sort
> between different sort processes....
>
>   sort -m <(read_chunk 1 2 "$file" | sort) \
>           <(read_chunk 2 2 "$file" | sort)
>
> A wrapper script could be created to auto-determine
> (or be told with parameters) the correct number
> of sort processes to execute, bearing in mind that
> for traditional hard disks, starting more than one
> per spindle will only slow things down as they
> fight over the disk head. Note the wrapper script
> should probably take more than one file as a parameter
> so that pre-split files (possibly on different media)
> can be processed.
>
> The `read_chunk` process above is currently awkward and
> inefficient to implement with dd and split. As a first step
> I think it would be very useful to add a --number option to
> `split`, like:
>   --number=5    # split input into 5 chunks
>   --number=2/5  # output chunk 2 of 5 to stdout
> In text mode, this should only split on a line boundary,
> adding any extra data to the last chunk.
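To make the wrapper idea above a little more concrete, here is a rough,
untested sketch. It assumes the proposed split --number=K/N option is
available (writing chunk K of N to stdout, splitting only on line
boundaries), and it simply defaults to 2 sort processes; a real wrapper
would pick that from the CPU and spindle count:

  #!/bin/bash
  # Parallel sort sketch: split the input into n line-aligned chunks,
  # sort each chunk in its own process, then merge them with sort -m.
  # Relies on the proposed `split --number=K/N` behaviour.
  file=$1
  n=${2:-2}    # number of sort processes; ideally CPU/spindle count

  cmd="sort -m"
  for ((i = 1; i <= n; i++)); do
    # Build the pipeline as a string so all the process substitutions
    # are created by the single eval below.
    cmd+=" <(split --number=$i/$n \"\$file\" | sort)"
  done
  eval "$cmd"

Invoked as e.g. "./psort big.txt 4" it would merge four independently
sorted chunks; the name psort is just for illustration.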
I just noticed this today which encompasses many of the ideas
I was describing above:

http://github.com/erikfrey/bashreduce/tree/master

cheers,
Pádraig.