On Sun, 2005-07-31 at 17:35 +0300, Gilad Ben-Yossef wrote:
> > 
> > 
> > I should add the encryption optional, depending on load and source. (And
> > more important, how fanatical is the client)
> > (There's no way in hell, I'll be able to process and encrypt two OC48
> > links in real time...)
> 
> I believe you can, but this WILL require much more explanation...

I hear hardware-assisted (or hardware-based) assembly and encryption coming. 
As I wish to preserve my current work, I would rather (strike that: must) stay
in software-only-lala-land.

> > Interesting... that might work.
> > Let me first point out that once the cells/frames have been processed, I
> > don't care much for timing. (Which bodes well for your solution).
> > However, I'm ****very**** tight on CPU and memory bandwidth. (Even a
> > dual Opteron machine with two memory banks tends to suffocate at a
> > certain point.)
> > No matter what I do I just can't afford to add memcpy's to my system. 
> 
> Understandable attitude, but it may be wrong. Take a look at this paper 
> from last year's OLS, for example: 
> 
> http://www.linuxsymposium.org/proceedings/reprints/Reprint-Ronciak-OLS2004.pdf
> 
> These guys from Intel thought that having a zero-copy receive path for 
> network packets, where the card DMAs straight into the user-space 
> program's buffer, would be a big win due to saving a memcpy.
> They implemented and tested it. The result? Performance was *worse* 
> without the extra copy, not better. It turned out the extra copy actually 
> pre-loaded the cache and gained more than it cost.

*Very* interesting reading. Nice catch indeed. 
Nevertheless, I wonder if cache soiling wouldn't be a problem under
extra (or extreme) loads. 
It's fairly probable that you'll have the following scenario:
A. IRQ
B. Software IRQ.
(B1. Possibly another IRQ raised here?)
C. DMA to SKB.
D. SKB to user buffers (L2 preloaded)
E. Boom. Hardware IRQ raised. (And the cycle starts over.)
F. User mode processing. (L2 contents lost)

In this case, by the time the user-space actually gets a hold of the
data, the L1/2 contents have been flushed. Don't forget that as the data
set and throughput grows the effect of the CPU cache diminishes
considerably. (Hence 9MB L3 Titanics and the 8MB L3 Xeon MPs)

Moreover, people tend to overlook the fact that DMA, even on a fast
PCI-X 1.0 / 2.0 bus (PCI in this case), is *pretty* expensive latency-wise
and limited bandwidth-wise; it is conceivable that the changes
they made to the driver to allow for zero-copy DMA added enough
latency to cancel any positive effect they might have gotten by going
zero-copy in the first place. 
It would have been nice to see the same test running on a Dual Opteron
machine (Or Xeon) with 2 x 2 port GbE NICs. A quad machine with two PCI-
X bridges is even better.
 
On the other hand, the article was written by people who, unlike me,
know what they're talking about. So go figure.

At least in my experience, once you have multiple GbE or ATM cards, it's
all about *pure* memory bandwidth first (and lots and lots [and lots]
of it) and raw CPU power second.

> 
> Does this fit your scenario? I have no idea. But there is a lesson here: 
> don't assume anything. Build a quick pilot and measure. You may very 
> well find out that your bottlenecks are in completely different areas (for 
> example - are your network drivers interrupt driven? You might very well 
> find that your system gets into livelock on interrupts before any 
> issues stemming from memcpy of data show up).

Even with a good NAPI Ethernet driver (like Intel's e1000), IRQ time
is indeed a problem. Combined with the software IRQ part of the driver,
I lose 30-50% of CPU time before I even start doing any real "work".
I'm actually thinking about ways to disable IRQs altogether, manually
polling the devices for RX frames periodically. (I don't care much about
the extra load when the system is idle.)

> As the man said, premature optimization is the root of all evil.

Umm... I hear that from my team leader on a daily basis. 
On the other hand, I doubt that the other extreme (use C# now, rewrite
if we're slower than a dead snail later) is any better...
I'm still not convinced that the reduced development time of doing it in
user space will come even close to covering the cost of rewriting the
whole package if things don't perform up to par. 


> As I already said, you can use Linux sendfile() to avoid the last copy 
> if you're not messing with the data after it reaches the disk. Won't help 
> the decryption case, unless you also happen to have a hardware 
> encryption engine, which is a good idea anyway.

Baaah! They'll deduct the encryption hardware price off my paycheck ;) 
To be honest, I don't worry much about encryption. I doubt that it'll be
used in any real high-bandwidth case. It will be used in cases where
security matters most and bandwidth is *very* low to begin with.

> 
> > Oh... Thanks for the help. I appreciate it.
> 
> Thanks for the interesting subject :-)

Hehe... I do my best to serve :-)

