On Thursday, January 21, 2016 at 7:48:44 AM UTC-5, Robby Findler wrote:
> It may be the overhead of communicating the data is dominating the
> time spent working.
> 
> Would it work to the main place open the file, count the number of
> lines, and then just tell the worker places which chunks of the file
> are theirs? Or maybe just do the counting at the byte level and then
> have some way to clean up the partial lines that each place would get?
> 
> Robby

The sequential version takes 10 microseconds per line (including input, 
processing, output), so the processing part is clearly less than that. Maybe 
that amount of work is too fine-grained? Can anyone with a lot of places 
experience comment on the suitability of a 10-microsecond unit of work for 
places?

The overhead should be limited to copying a byte string to the worker place, 
and then copying a different byte string to the output place. The CPU time and 
GC time for the places version is much higher than I expected.

I think my approach is pretty natural - hand off input records to workers for 
processing. I think your suggestion might complicate the program quite a bit.

Although in this *particular* case, the input file is fixed length, I'm not 
sure I'd feel comfortable depending on all 45M records being exactly the same 
size. As an experiment, I may verify that they are, and then compute a file 
offset for each worker.
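
For what it's worth, a rough sketch of the byte-level chunking idea (which 
doesn't depend on fixed-length records) could look like this - split the file 
into roughly equal byte ranges, then snap each boundary forward to the next 
newline so every worker sees only whole lines. The chunk-bounds name is 
illustrative, not part of my program:

    #lang racket
    ;; Split path into n byte ranges, each ending on a newline boundary.
    ;; Returns a list of (start . end) pairs.
    (define (chunk-bounds path n)
      (define size (file-size path))
      (define approx (quotient size n))
      (call-with-input-file path
        (lambda (in)
          (let loop ([start 0] [k 1] [bounds '()])
            (cond
              [(= k n) (reverse (cons (cons start size) bounds))]
              [else
               (file-position in (* k approx))
               (read-bytes-line in 'any)     ; discard the partial line
               (let ([end (file-position in)])
                 (loop end (add1 k) (cons (cons start end) bounds)))])))))

Each worker place would then open the file itself, seek to its start offset, 
and read lines until it reaches its end offset - so no record data crosses a 
place channel at all.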

*But*, that's not really what I had in mind for language supported parallelism 
- that's not very different from me just firing up N Racket processes via the 
operating system and using command-line cat to combine their individual output 
files.

The good news is that in preparation for creating the places version I switched 
from:

(fprintf e-out "~a\t~a\t..." field1 field2 ...)

to

(write-bytes (bytes-append field1 #"\t" field2 #"\t" ...) e-out)

so that the parse-case function could return (bytes-append ...) instead of 
doing the actual output. This shaved off about 20% from the runtime! So the 
sequential version is now only 4.23x the C program, whereas Ruby is 15.7x the 
C program.
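
In case it helps anyone else, a minimal self-contained sketch of that refactor 
(parse-case and the field names here are illustrative):

    #lang racket
    ;; Build one byte string per record so the caller does a single
    ;; write-bytes instead of a formatted fprintf per field.
    (define (parse-case field1 field2)
      (bytes-append field1 #"\t" field2 #"\n"))

    (define e-out (current-output-port))
    (write-bytes (parse-case #"foo" #"bar") e-out)

Returning the assembled byte string also means the same parse-case can feed 
either the sequential writer or a place channel unchanged.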

The sequential version would be fine in production, but after many years of 
seeing only one of my cores pegged at 100% by Ruby, I was really looking 
forward to seeing all four cores pegged :) Actually, the parallel version 
*does* in fact peg all the cores, but that's not exciting when it actually 
takes longer in total!

I was naively hoping for about a 3x speedup for the 4 core places version. I 
still wonder if causing the worker place channels to get huge is a problem - 
that might explain the very high GC time.

If anyone knows how to use buffered asynchronous channels with places, or 
simply how to block when attempting to put more than N messages in a place 
channel, I *think* that would help, but maybe not enough.
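
The pattern I have in mind is credit-based flow control over the place channel 
itself: the producer allows at most max-in-flight unacknowledged messages, and 
the worker replies 'ack after handling each one. A rough sketch (names are 
illustrative, not from my program):

    #lang racket
    (define max-in-flight 1024)

    ;; Producer side: returns the updated in-flight count, blocking for
    ;; one 'ack when the window is full.
    (define (put-bounded ch msg in-flight)
      (cond
        [(< in-flight max-in-flight)
         (place-channel-put ch msg)
         (add1 in-flight)]
        [else
         (place-channel-get ch)       ; wait for one 'ack
         (place-channel-put ch msg)
         in-flight]))

    ;; Worker side: process each record, then acknowledge it.
    (define (worker-loop ch handle)
      (let loop ()
        (define msg (place-channel-get ch))
        (unless (eq? msg 'done)
          (handle msg)
          (place-channel-put ch 'ack)
          (loop))))

Since place channels are bidirectional, the acks ride back on the same channel 
the records go out on; the producer just has to drain the remaining acks 
before sending 'done. That would cap the number of byte strings queued per 
worker, which is exactly what I suspect is driving the GC time.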

Brian
