On Thursday, January 21, 2016 at 7:48:44 AM UTC-5, Robby Findler wrote:
> It may be the overhead of communicating the data is dominating the
> time spent working.
>
> Would it work to have the main place open the file, count the number
> of lines, and then just tell the worker places which chunks of the
> file are theirs? Or maybe just do the counting at the byte level and
> then have some way to clean up the partial lines that each place
> would get?
>
> Robby
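For concreteness, the byte-level version of that suggestion could be sketched roughly like this (function names are made up for illustration; the trick is that each worker seeks one byte before its chunk and discards through the next newline, so every line is owned by exactly one worker):

```racket
#lang racket

;; Split a file of `size` bytes into n byte ranges, one per worker.
(define (chunk-bounds path n)
  (define size (file-size path))
  (define step (quotient size n))
  (for/list ([i (in-range n)])
    (cons (* i step)
          (if (= i (sub1 n)) size (* (add1 i) step)))))

;; Process every line that *starts* in [start, end).
(define (process-chunk path start end handle-line)
  (call-with-input-file path
    (lambda (in)
      (unless (zero? start)
        ;; Back up one byte and discard through the next newline.
        ;; If the byte at start-1 is itself a newline, this consumes
        ;; only it, leaving us exactly at a line boundary; otherwise
        ;; it discards a partial line that the previous worker owns.
        (file-position in (sub1 start))
        (read-bytes-line in 'any))
      (let loop ()
        (when (< (file-position in) end)
          (define line (read-bytes-line in 'any))
          (unless (eof-object? line)
            (handle-line line)
            (loop)))))))
```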
The sequential version takes 10 microseconds per line (including input, processing, and output), so the processing part is clearly less than that. Maybe that amount of work is too fine-grained? Can anyone with a lot of places experience comment on whether a 10-microsecond unit of work is suitable for places? The overhead should be limited to copying a byte string to the worker place, and then copying a different byte string to the output place. The CPU time and GC time for the places version are much higher than I expected.

I think my approach is pretty natural - hand off input records to workers for processing. I think your suggestion might complicate the program quite a bit. Although in this *particular* case the input file is fixed length, I'm not sure I'd feel comfortable depending on all 45M records being exactly the same size. As an experiment, I may verify that they are, and then compute a file offset for each worker. *But* that's not really what I had in mind for language-supported parallelism - it's not very different from me just firing up N Racket processes via the operating system and using command-line cat to combine their individual output files.

The good news is that in preparation for creating the places version I switched from:

    (fprintf e-out "~a\t~a\t..." field1 field2 ...)

to:

    (write-bytes (bytes-append field1 #"\t" field2 #"\t" ...) e-out)

so that the parse-case function could return (bytes-append ...) instead of doing the actual output. This shaved about 20% off the runtime! So the sequential version is now only 4.23x the C program, where Ruby is 15.7x the C program.

The sequential version would be fine in production, but after many years of seeing only one of my cores pegged at 100% by Ruby, I was really looking forward to seeing all four cores pegged :) Actually, the parallel version *does* in fact peg all the cores, but that's not exciting when it actually takes longer in total!
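The record hand-off shape I'm describing is roughly the following sketch (the byte-string processing here is a placeholder, not the real parse-case; place channels are bidirectional, so the worker replies on the same channel it reads from):

```racket
#lang racket

;; Spawn a worker place that receives one record per message and
;; replies with the processed byte string; 'done shuts it down.
;; Note the place body cannot capture surrounding bindings, so the
;; per-record work must live inside it (or in its own module).
(define (make-worker)
  (place ch
    (let loop ()
      (define msg (place-channel-get ch))
      (unless (eq? msg 'done)
        ;; Placeholder for the real per-record parsing:
        (place-channel-put ch (bytes-append msg #"\t"))
        (loop)))))

(module+ main
  (define w (make-worker))
  (place-channel-put w #"record-1")
  (void (write-bytes (place-channel-get w)))
  (place-channel-put w 'done)
  (place-wait w))
```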
I was naively hoping for about a 3x speedup from the 4-core places version. I still wonder whether letting the worker place channels grow huge is a problem - that might explain the very high GC time. If anyone knows how to use buffered asynchronous channels with places, or simply how to block when attempting to put more than N messages on a place channel, I *think* that would help, but maybe not enough.

Brian
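P.S. In case it helps the discussion: there is no built-in bound on a place channel, but back-pressure can be faked from the producing side by capping the number of unacknowledged records per worker - all names and the reply-per-record protocol here are illustrative assumptions, not a real API:

```racket
#lang racket

;; Send `records` to a worker over place channel `ch`, allowing at
;; most `limit` records in flight; `emit` consumes each reply.
(define limit 1024)

(define (feed-worker ch records emit)
  (let loop ([records records] [in-flight 0])
    (cond
      [(null? records)
       ;; No more input: drain the outstanding replies.
       (for ([_ (in-range in-flight)])
         (emit (place-channel-get ch)))]
      [(>= in-flight limit)
       ;; Window full: block for one reply before sending more,
       ;; which keeps the channel's backlog (and GC pressure) bounded.
       (emit (place-channel-get ch))
       (loop records (sub1 in-flight))]
      [else
       (place-channel-put ch (car records))
       (loop (cdr records) (add1 in-flight))])))
```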