 > As for the ReceiveData() function, that line can be directly replaced
 > with your recvfrom() call (I just was too lazy to look up recvfrom()
 > when I wrote that example).
Well, if you care to write the full code you will find that
this statement is not quite what it looks like.

It's not much different. This is what a modified version of your program looks like:

typedef struct {
  short header_len;  // This field is always present, and always first
  short data_len;    // This field is always present, and always second
  /* header contents, including header version, type, etc. go here */
  char data[NET_MULTICAST_PAYLOAD];
} NET_RX_STRUCT;

NET_RX_STRUCT msg;
rxin_char=(void*)(&timf1_char[timf1p_pa]);
timf1p_pa=(timf1p_pa+ad_read_bytes)&timf1_bytemask;
for(j=0; j<ad_read_bytes; j+=NET_MULTICAST_PAYLOAD)
  {
  recvfrom(netfd.rec_rx,&msg,sizeof(NET_RX_STRUCT),0,
                                 (struct sockaddr *) &rx_addr,&addrlen);
  memcpy(&rxin_char[j], ((char *)&msg) + msg.header_len, NET_MULTICAST_PAYLOAD);
  }


For the time being I kept the data size constant, so as not to mix the issues (variable data size adds 5-6 lines). And yes, there is a memcpy. But read below...

By the way, there appears to be an inconsistency in your program. Here:

  timf1p_pa=(timf1p_pa+ad_read_bytes)&timf1_bytemask;

you make sure that the address pointer for the circular buffer wraps around, but I see no such protection in the for() loop. Or am I missing something ?

 > even if it
 did, it matters very little on modern CPUs (packets this size will
 remain entirely within the CPU cache).
Linrad is intended to run on elderly computers and it is also
intended to run at much higher bandwidths on modern ones.
You suggest that the data is put into a buffer to which a
pointer is returned by ReceiveData. The next step would be to
store the payload into a circular buffer. This will cause the data
to be written to memory twice.

An extra memcpy makes little difference in this loop. Note that recvfrom() needs to do the equivalent of a memcpy() anyway. I wrote a little test program (see the bottom of this mail) to measure the speed difference between one and two copies when the destination buffer is larger than the cache.

On a Pentium MMX 166MHz, a Thinkpad laptop with X running, I get:

Single copy: 10000000 loops in 129.44 seconds, or 79.11 MB/s.
Double copy: 10000000 loops in 147.21 seconds, or 69.56 MB/s.

The first copy takes about two cycles per byte; adding a second copy costs less than 0.3 cycles per byte.

On a Pentium II 350MHz (rescued from the garbage a month ago):

Single copy: 10000000 loops in 55.76 seconds, or 183.64 MB/s.
Double copy: 10000000 loops in 66.36 seconds, or 154.30 MB/s.

The ratio is similar: the first copy takes just under 2 cycles/byte, and the second adds 0.36 cycles/byte.

Your scenario will likely be even closer, since the kernel will need to read the UDP datagrams from main memory, too.

Processing of the data is in another thread that requires
hundreds of packets in the circular buffer. It will fetch its
input from memory because other threads have been using the
cache in the meantime.

Is there any way at all that you can avoid that, and process the data as it comes in ? My first big multi-threaded program was a real-time streaming video encoder for a quad Pentium Pro machine, and switching processing from a frame at a time to a macroblock (16x16 pixels) at a time sped the encoder up tremendously, even though the required number of operations almost doubled.

 > Zero-copy architectures make
 sense for hi-speed packet switching on slow computers; as soon as you
 add any processing on the data, that single extra copy gets lost in
 the noise. Cache line/block alignment is much more important for
 performance.
Actually, this is not in agreement with my observations. It does depend
on how efficiently the "processing" is done.

In some cases, yes. I've re-written a fixed-point FFT for ARM so that reading the first word triggers the loading of a full cache line, and the FFT never has to wait for its data. But even that would get lost in the noise once you actually started processing the data.

The most demanding task is the full bandwidth, full dynamic
range FFT. It would be identical in all computers and it does
not make any sense to do it in more than one computer.

Why ? Because this one computer would be much faster than the others ?

 > Do you want to have an exact, synchronized display on
 multiple machines ?
This would be the case also if raw data were used.

[snip]

The "innocent" slave does not have to know that a data stream is
"cooked". It can be processed as if it were raw data, but a clever
slave can make use of complex information that it might want to
ask for. If you want to compute the noise floor power density,
for example, you want to know what percentage of samples were
blanked out because of noise pulses. Normally one would not care
at all.

So would it be correct to say that:

(a) if all computers were equally fast, there is little advantage in cooked mode over raw mode (other than energy conservation), and

(b) cooked mode allows slow slaves that would normally not be able to keep up with the FFTs to still display the data.

 > It's what everybody else does. I know that that's not much of an
 argument ('50 million Elvis fans can't be wrong'), but in 15 years of
 working on network protocols, this is pretty much the only way that
 I've seen working reliably for successful sampled AV or radio
 projects (I've seen a similar system used on an antenna array for
 MIMO trials). Conversely, I have never ever seen a combined
 raw/cooked protocol that worked, or better: that remained working. Or
 it evolved into something like WAV: a historical accident that
 everyone loves to hate.
OK. Some day you will see an exception. Linrad needs a protocol
for cooked data, and raw data is just an add-on which I will do
in the same protocol.

I sincerely wish you good luck on that, and I'm interested to see how it evolves.

By the way, when reading with recvfrom I can ask for a very large
record. Then the function will return the record size, and I can
put the header at the end of the record just as well as at the
start. On the other hand, if I read just two bytes to get the
record size, I cannot read again to get the rest of the same packet.

If you use MSG_PEEK you can. Normally you get around that by defining a maximum size for the data field. In this example it would likely be best to have header + data fit within one Ethernet frame, so depending on the header size the maximum data size would be between 1300 and 1400 bytes.

JDB.

PS: As a last nag, there is a reason these things are called *headers* ;-)

--- begin of memcopyspeed.c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

#define BUFSIZE 1024
#define RINGBUFSIZE (BUFSIZE * 1024) /* 1MB, larger than cache */
#define NUM_LOOPS 10000000

int main(int argc, char ** argv) {

  int i, j, delta1, delta2, num_loops = NUM_LOOPS;
  struct timeval start, now;
  char rec_data[BUFSIZE], copy_data[BUFSIZE], ringbuf[RINGBUFSIZE];

  if(argc == 2)
    num_loops = atol(argv[1]);

  /* Initialize rec_data to keep the compiler from complaining */
  for(i = 0; i < BUFSIZE; i++)
        rec_data[i] = 0;

  j = 0;
  gettimeofday(&start, NULL);

  for(i = 0; i < num_loops; i++) {
    memcpy(&ringbuf[j], rec_data, BUFSIZE);
    j += BUFSIZE;
    if(j >= RINGBUFSIZE)
      j = 0;
  }

  gettimeofday(&now, NULL);
  delta1 = now.tv_usec - start.tv_usec;
  delta1 += 1000000 * (now.tv_sec - start.tv_sec);

  fprintf(stderr, "Single copy: %d loops in %.2f seconds, "
                  "or %.2f MB/s.\n",
                  num_loops, delta1 / 1.0e6,
                  ((double)num_loops) * BUFSIZE / delta1);

  j = 0;
  gettimeofday(&start, NULL);

  for(i = 0; i < num_loops; i++) {
    memcpy(copy_data, rec_data, BUFSIZE);
    memcpy(&ringbuf[j], copy_data, BUFSIZE);
    j += BUFSIZE;
    if(j >= RINGBUFSIZE)
      j = 0;
  }

  gettimeofday(&now, NULL);
  delta2 = now.tv_usec - start.tv_usec;
  delta2 += 1000000 * (now.tv_sec - start.tv_sec);

  fprintf(stderr, "Double copy: %d loops in %.2f seconds, "
                  "or %.2f MB/s.\n",
                  num_loops, delta2 / 1.0e6,
                  ((double)num_loops) * BUFSIZE / delta2);

  return 0;

}
--- end of memcopyspeed.c
--
In protocol design, perfection has been reached not when there is nothing left to add, but when there is nothing left to take away.
                   -- RFC 1925, "Fundamental Truths of Networking"

#############################################################
This message is sent to you because you are subscribed to
 the mailing list <linrad@antennspecialisten.se>.
To unsubscribe, E-mail to: <[EMAIL PROTECTED]>
To switch to the DIGEST mode, E-mail to <[EMAIL PROTECTED]>
To switch to the INDEX mode, E-mail to <[EMAIL PROTECTED]>
Send administrative queries to  <[EMAIL PROTECTED]>
