Pádraig Brady wrote:
> Glen Lenker wrote:
>> On Wed, Feb 11, 2009 at 10:12:42PM -0800, Nima Nikzad wrote:
>>> Hello coreutils community,
>>> I am new to the exciting world of open source development and thought I
>>> should introduce myself and get some feedback on what I am working on.
>>> My name is Nima Nikzad and I, along with a small group of students, am
>>> working on changes to coreutils as part of a project for Prof. Paul
>>> Eggert's software engineering course at UCLA. Specifically, we are
>>> looking at how we can better utilize modern multi-core processors when
>>> sorting and merging files. My group is first taking a look at how to
>>> parallelize the external sort phase of sort's operation. Another group
>>> of students is taking a look at in-memory sort while we are working on
>>> this.
>>>
>>> Has anyone already tried their hand at parallelizing this code? Are
>>> there any special considerations we should keep in mind while moving
>>> forward? My group was looking to tackle the problem by using OpenMP, as
>>> we have some experience working with it in the past and like that we
>>> can do quite a bit with it while (hopefully) having little impact on
>>> the structure of the existing code. Does anyone have any feedback on
>>> threading technologies that would be appropriate for this project, or
>>> does anyone think OpenMP is a poor choice?
>>>
>>> I look forward to hearing your suggestions! We are looking to have
>>> something implemented and benchmarked soon and I will be sure to keep
>>> in contact with the community. Please feel free to reach me at
>>> nnikzad at ucla.edu. Thank you!
>>>
>>> - Nima Nikzad
>>
>> Hi everyone,
>>
>> My name is Glen Lenker. I am the project leader for the second group
>> working to parallelize sort under Paul Eggert.
>>
>> As mentioned above, my group is primarily focused on parallelizing the
>> "in-memory" sort, but at the moment we are still considering working to
>> parallelize other aspects of the current sort as well. At the moment we
>> are considering using the Gnulib thread module as our threading library.
>>
>> Jim: We heard from Paul Eggert today that you have looked into how sort
>> would benefit from parallelization. I am particularly interested in your
>> approach. If you don't mind, perhaps you could start the discussion on
>> this.
>
> Well, I was thinking along the lines of something more "UNIXy".
>
> I'm worried a threading solution would complicate things.
> The trade-off is increased data copying between processes,
> but that also gives you the advantage of allowing you to
> run the processes on separate hosts if desired.
>
> Also, the "UNIXy" solution should be quite easy to implement
> and act as a baseline for comparison for any threading solution
> you might think is better.
>
> Parameters to vary when testing solutions would be
> (a rough timing sketch follows the list):
>  * files on SSD/ramdisk vs non-cached files on hard disk
>  * data < ram_size
>  * data > ram_size
>  * number of CPUs > data_size/ram_size
>  * number of CPUs < data_size/ram_size
>  * expensive sorting (-g for example) vs simple (ASCII data)
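>
> As a rough illustration only (assuming a Linux box, root access to
> drop the page cache, a test file big.txt, and some candidate wrapper
> called parallel_sort here purely as a placeholder), the uncached-disk
> comparison could be timed along these lines:
>
>   # flush dirty pages and drop the page cache so the hard disk
>   # is actually hit, then time the baseline sort
>   sync
>   echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
>   time sort big.txt >/dev/null
>
>   # same again for the candidate parallel wrapper
>   sync
>   echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
>   time ./parallel_sort big.txt >/dev/null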
> Here is a very quick idea of splitting the data to sort
> between different sort processes....
>
>   sort -m <(read_chunk 1 2 "$file" | sort) \
>           <(read_chunk 2 2 "$file" | sort)
>
> A wrapper script could be created to auto-determine
> (or be told with parameters) the correct number
> of sort processes to execute, bearing in mind that
> for traditional hard disks, starting more than one
> per spindle will only slow things down as they
> fight over the disk head. Note the wrapper script
> should probably take more than one file as a parameter
> so that pre-split files (possibly on different media)
> can be processed.
>
> The `read_chunk` process above is currently awkward and
> inefficient to implement with dd and split. As a first step
> I think it would be very useful to add a --number option to
> `split`, like:
>   --number=5    # split input into 5 chunks
>   --number=2/5  # output chunk 2 of 5 to stdout
> In text mode, this should only split on a line boundary,
> adding any extra data to the last chunk.
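To make the wrapper idea above a little more concrete, here is a rough,
untested sketch. It assumes the proposed split --number=K/N option is
available (writing chunk K of N to stdout, splitting only on line
boundaries), and it simply defaults to 2 sort processes; a real wrapper
would pick that from the CPU and spindle count:

  #!/bin/bash
  # Parallel sort sketch: split the input into n line-aligned chunks,
  # sort each chunk in its own process, then merge them with sort -m.
  # Relies on the proposed `split --number=K/N` behaviour.
  file=$1
  n=${2:-2}    # number of sort processes; ideally CPU/spindle count

  cmd="sort -m"
  for ((i = 1; i <= n; i++)); do
    # Build the pipeline as a string so all the process substitutions
    # are created by the single eval below.
    cmd+=" <(split --number=$i/$n \"\$file\" | sort)"
  done
  eval "$cmd"

Invoked as e.g. "./psort big.txt 4" it would merge four independently
sorted chunks; the name psort is just for illustration.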
I just noticed this today which encompasses many of the ideas
I was describing above:

http://github.com/erikfrey/bashreduce/tree/master

cheers,
Pádraig.