On Thursday, January 21, 2016 at 2:53:48 PM UTC-5, Brian Adkins wrote:
> 
> Ok, so the huge place channel for each worker isn't the main issue. I changed 
> the code so the output process sends a message to the main process every N 
> messages, and the main process waits on a message from the output process 
> every N messages. This has the effect of setting a limit on the place 
> channels.
> 
> Through some trial and error, I discovered that waiting every 8,000 messages 
> provides the best performance. So, the size of the place channel for each 
> worker is < 8,000 / num-workers (currently 4 workers), and the size of the 
> place channel for the output process is < 8,000 messages.
> 
> I don't know the exact count since I don't know the ratio of consumer to 
> producer.
> 
> My input lines are all exactly 300 bytes, so the output channel holds < 2.3 MB 
> and each worker's channel holds < 600 KB.
> 
> Current elapsed time for the sequential version is 2.467 s; the places version 
> is 3.732 s, i.e. more than 50% slower.
> 
> Limiting the place channel size also reduced GC time:
> 
> cpu time: 13249 real time: 3012 gc time: 280
> 
> The ratio of CPU time to real time is 4.4, which is good, but minimizing the 
> elapsed time is the goal.
> 
> So, in summary, copying byte strings to other places is too expensive in my 
> scenario.
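
For anyone wanting to see the shape of that throttling scheme, here is a rough 
sketch (not my actual code; the real version also distributes lines across the 
4 worker places, and names like ack-every are made up):

#lang racket
(require racket/place)

;; The output place sends an ack back every `ack-every` messages, and the
;; producer blocks waiting for that ack, so the channel to the output
;; place can't grow without bound.

(define ack-every 8000)

(define (start-output-place)
  (place ch
    (let loop ([n 1])
      (define msg (place-channel-get ch))
      (unless (eq? msg 'done)
        ;; ... write the result to the output port here ...
        (when (zero? (modulo n ack-every))
          (place-channel-put ch 'ack))
        (loop (add1 n))))))

(define (feed-lines in out-pl)
  (let loop ([n 1])
    (define line (read-bytes-line in))
    (cond
      [(eof-object? line) (place-channel-put out-pl 'done)]
      [else
       (place-channel-put out-pl line)
       (when (zero? (modulo n ack-every))
         (place-channel-get out-pl))   ; block until the output place acks
       (loop (add1 n))])))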

I did some more experimenting by adding a sleep to the processing function to 
see how expensive the per-line work had to be for the sequential and places 
versions to break even.
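
For reference, one way to get a delay at this granularity is a busy-wait on the 
clock, since a plain (sleep ...) is much too coarse; a rough sketch (not 
necessarily the exact code I used):

#lang racket

;; Sketch only: spin until roughly `usecs` microseconds have elapsed.
(define (busy-wait usecs)
  (define deadline (+ (current-inexact-milliseconds) (/ usecs 1000.0)))
  (let loop ()
    (when (< (current-inexact-milliseconds) deadline)
      (loop))))

;; The per-line work then becomes, roughly:
;;   (define (process-line line)
;;     (busy-wait 2.1)    ; artificial extra cost per line
;;     (soundex line))    ; the existing real work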

Adding a 2.1 microsecond sleep gives the following:

Sequential:
cpu time: 3574 real time: 3887 gc time: 77 (operating system elapsed = 4.321s)

Places:
cpu time: 13791 real time: 2958 gc time: 231 (operating system elapsed = 3.659s)
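
(The cpu/real/gc lines above are milliseconds, in the format Racket's time form 
prints when wrapping the entry point, e.g. with run-places-version standing in 
for the actual call; the "operating system elapsed" figures are wall-clock time 
for the whole run, measured outside Racket.)

(time (run-places-version "input.txt"))
;; prints: cpu time: <ms> real time: <ms> gc time: <ms>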

Less sleep, e.g. 2.0 microseconds, puts the sequential version ahead.

I then increased the sleep as follows:

17.5 microseconds => places ~ 1/2 as long as sequential
32.0 microseconds => places ~ 1/3 as long as sequential
50.0 microseconds => places ~ 1/4 as long as sequential
100 microseconds => places ~ 1/3.8 as long as sequential (only 4 workers)

The work itself is ~ 10 microseconds, so it would appear that a (32 + 10) = 42 
microsecond unit of work is required to get a 3x speedup on 4 cores when 
~ 300 bytes need to be copied twice into place channels.

In hindsight, I guess that's not too surprising: the majority of the work is 
copying bytes from one place to another anyway (soundex isn't that slow), so if 
the parallel version has to copy each line twice while the useful work is 
roughly equivalent to copying the line once, getting the data into and out of 
places is going to be expensive.

If the work were parsing HTML or JSON, then the places version would probably 
be worth it on a 4-core machine.

Brian
