Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10

Amin Tootoonchian Wed, 15 Dec 2010 13:05:46 -0800

I am talking about jumbo Ethernet frames here. By batching, I mean
batching outgoing messages together and writing to the underlying
layer which would be the TCP write buffer. The TCP buffer is not
limited to MTU or anything like that, so in most cases my code flushes
more than 64KB to the TCP write buffer. The gain is due to issuing a
single system call with a larger buffer rather than many system calls
with tiny buffers (e.g., 128 bytes you mentioned).
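
To make the system-call argument concrete, here is a minimal sketch in Python. It is not the NOX patch itself. It counts send calls for the same set of replies written one by one and written as a single batch. The names CountingSender, send_one_by_one, send_batched and compare_send_calls are hypothetical, and the 128-byte reply size is only borrowed from the figure mentioned in this thread. The counter tracks Python-level sendall calls, which should roughly match system calls as long as the data fits in the socket buffer.

```python
import socket

MSG_SIZE = 128  # assumed reply size, borrowed from the 128-byte figure in this thread


class CountingSender:
    """Wraps a socket and counts how many send calls are issued on it."""

    def __init__(self, sock):
        self.sock = sock
        self.calls = 0

    def sendall(self, data):
        self.calls += 1
        self.sock.sendall(data)


def send_one_by_one(sender, messages):
    # One send call per reply: many calls, each with a tiny buffer.
    for msg in messages:
        sender.sendall(msg)


def send_batched(sender, messages):
    # Concatenate all replies and hand them over in a single call.
    sender.sendall(b"".join(messages))


def recv_exact(sock, nbytes):
    # Read until nbytes have arrived (or the peer closes the connection).
    chunks = []
    while nbytes > 0:
        chunk = sock.recv(nbytes)
        if not chunk:
            break
        chunks.append(chunk)
        nbytes -= len(chunk)
    return b"".join(chunks)


def compare_send_calls(count=32):
    # Returns (calls without batching, calls with batching, same bytes received).
    left, right = socket.socketpair()
    try:
        messages = [bytes([i % 256]) * MSG_SIZE for i in range(count)]
        total = count * MSG_SIZE
        plain = CountingSender(left)
        send_one_by_one(plain, messages)
        first = recv_exact(right, total)
        batched = CountingSender(left)
        send_batched(batched, messages)
        second = recv_exact(right, total)
        return plain.calls, batched.calls, first == second
    finally:
        left.close()
        right.close()
```

Calling compare_send_calls() shows the receiver getting identical bytes either way, while the batched path issues one send call instead of one per reply.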


I do not sacrifice delay for throughput here. I keep a write buffer
and keep appending to it until the underlying socket is ready for
writes. Once it is ready for a write operation, buffered replies are
flushed to the underlying layer immediately. This is quite different
from Nagle's algorithm and will not add any delays.
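
The buffering scheme described above can be sketched roughly as follows. This is an illustration in Python using the standard selectors module, not the boost asynchronous I/O code in the patch. BatchingWriter, queue, on_writable and flush_when_ready are hypothetical names. The point is that no timer is involved: replies accumulate only while the socket is not ready, and everything pending goes out as soon as it reports writable.

```python
import selectors
import socket


class BatchingWriter:
    """Accumulates replies and flushes them when the socket is writable."""

    def __init__(self, sock):
        sock.setblocking(False)
        self.sock = sock
        self.pending = bytearray()
        self.flushes = 0

    def queue(self, reply):
        # Appending is cheap and never blocks; nothing is sent here.
        self.pending += reply

    def on_writable(self):
        # Called when the event loop reports the socket as ready for writes.
        if not self.pending:
            return 0
        try:
            sent = self.sock.send(self.pending)
        except BlockingIOError:
            return 0  # not ready after all; keep the data for the next event
        del self.pending[:sent]  # keep whatever the kernel did not accept
        self.flushes += 1
        return sent


def flush_when_ready(replies):
    # Returns (bytes seen by the peer, number of flushes, bytes left pending).
    left, right = socket.socketpair()
    writer = BatchingWriter(left)
    sel = selectors.DefaultSelector()
    sel.register(left, selectors.EVENT_WRITE)
    try:
        for reply in replies:
            writer.queue(reply)
        # The socket is writable at once, so the whole batch leaves in one send.
        for _key, _events in sel.select(timeout=1):
            writer.on_writable()
        return right.recv(4096), writer.flushes, len(writer.pending)
    finally:
        sel.close()
        left.close()
        right.close()
```

In a real event loop one would probably register for write readiness only while there is pending data, so that an idle connection does not wake the loop repeatedly.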

Amin

On Wed, Dec 15, 2010 at 3:47 PM, kk yap <yap...@stanford.edu> wrote:
> Oh.. another point, if you are batching the frames, then what about
> delay?  There seems to be a trade-off between delay and throughput,
> and we have gone for the former by disabling Nagle's algorithm.
>
> Regards
> KK
>
> On 15 December 2010 12:46, kk yap <yap...@stanford.edu> wrote:
>> Hi Amin,
>>
>> Just to clarify, do your jumbo frames refer to the OpenFlow messages
>> or the frames in the datapath?   By OpenFlow messages, I am assuming
>> you use a TCP connection between NOX and the switches, and you are
>> batching the messages into jumbo frames of 9000 bytes before sending
>> them out.  By frames in the datapath, I mean jumbo Ethernet frames are
>> being sent in the datapath.  The latter does not make any sense to me,
>> because OpenFlow should send 128 bytes to the controller by default.
>>
>> Thanks.
>>
>> Regards
>> KK
>>
>> On 15 December 2010 12:36, Amin Tootoonchian <a...@cs.toronto.edu> wrote:
>>> I double checked. It does slightly improve the performance (on the
>>> order of a few thousand replies/sec). Larger MTUs decrease the CPU
>>> workload (by decreasing the number of transfers across the bus) and
>>> this means that more CPU cycles are available to the controller to
>>> process requests. However, I am not suggesting that people should use
>>> jumbo frames. Apparently running with more user-space threads does the
>>> trick here. Anyway, I should trust a profiler rather than guessing, so
>>> I will get back with a definite answer once I have done a more
>>> thorough evaluation.
>>>
>>> Cheers,
>>> Amin
>>>
>>> On Wed, Dec 15, 2010 at 2:51 PM, kk yap <yap...@stanford.edu> wrote:
>>>> Random curiosity: Why would jumbo frames increase replies per sec?
>>>>
>>>> Regards
>>>> KK
>>>>
>>>> On 15 December 2010 11:45, Amin Tootoonchian <a...@cs.toronto.edu> wrote:
>>>>> I missed that. The single core throughput is ~250k replies/sec, two
>>>>> cores ~450k replies/sec, three cores ~650k replies/sec, four cores
>>>>> ~800k replies/sec. These numbers are higher than what I reported in my
>>>>> previous post. That is most probably because, right now, I am testing
>>>>> with MTU 9000 (jumbo frames) and with more user-space threads.
>>>>>
>>>>> Cheers,
>>>>> Amin
>>>>>
>>>>> On Wed, Dec 15, 2010 at 12:36 AM, Martin Casado <cas...@nicira.com> wrote:
>>>>>> Also, do you mind posting the single core throughput?
>>>>>>
>>>>>>> [cross-posting to nox-dev, openflow-discuss, ovs-discuss]
>>>>>>>
>>>>>>> I have prepared a patch based on NOX Zaku that improves its
>>>>>>> performance by a factor of >10. This implies that a single controller
>>>>>>> instance can run a large network with near a million flow initiations
>>>>>>> per second. I am writing to open up a discussion and get feedback from
>>>>>>> the community.
>>>>>>>
>>>>>>> Here are some preliminary results:
>>>>>>>
>>>>>>> - Benchmark configuration:
>>>>>>>   * Benchmark: Throughput test of cbench (controller benchmarker) with
>>>>>>> 64 switches. Cbench is a part of the OFlops package
>>>>>>> (http://www.openflowswitch.org/wk/index.php/Oflops). Under throughput
>>>>>>> mode, cbench sends a batch of ofp_packet_in messages to the controller
>>>>>>> and counts the number of replies it gets back.
>>>>>>>   * Benchmarker machine: HP ProLiant DL320 equipped with a 2.13GHz
>>>>>>> quad-core Intel Xeon processor (X3210), and 4GB RAM
>>>>>>>   * Controller machine: Dell PowerEdge 1950 equipped with two 2.00GHz
>>>>>>> quad-core Intel Xeon processors (E5405), and 4GB RAM
>>>>>>>   * Connectivity: 1Gbps
>>>>>>>
>>>>>>> - Benchmark results:
>>>>>>>   * NOX Zaku: ~60k replies/sec (NOX Zaku only utilizes a single core).
>>>>>>>   * Patched NOX: ~650k replies/sec (utilizing only 4 cores out of 8
>>>>>>> available cores). The sustained controller->benchmarker throughput is
>>>>>>> ~400Mbps.
>>>>>>>
>>>>>>> The patch updates the asynchronous harness of NOX to a standard
>>>>>>> library (boost asynchronous I/O library) which simplifies the code
>>>>>>> base. It fixes the code in several areas, including but not limited
>>>>>>> to:
>>>>>>>
>>>>>>> - Multi-threading: The patch enables having any number of worker
>>>>>>> threads running on multiple cores.
>>>>>>>
>>>>>>> - Batching: Serving requests individually and sending replies one by
>>>>>>> one is quite inefficient. The patch tries to batch requests together
>>>>>>> where possible, as well as replies (which reduces the number of system
>>>>>>> calls significantly).
>>>>>>>
>>>>>>> - Memory allocation: The standard C++ memory allocator is not robust
>>>>>>> in multi-threaded environments. Google's Thread-Caching Malloc
>>>>>>> (TCMalloc) or Hoard memory allocator perform much better for NOX.
>>>>>>>
>>>>>>> - Fully asynchronous operation: The patched version avoids wasting CPU
>>>>>>> cycles polling sockets, or event/timer dispatchers when not necessary.
>>>>>>>
>>>>>>> I would like to add that the patched version should perform much
>>>>>>> better than what I reported above (the number reported is with a run
>>>>>>> on 4 CPU cores). I guess a single NOX instance running on a machine
>>>>>>> with 8 CPU cores should handle well above 1 million flow initiation
>>>>>>> requests per second. Also having a more capable machine should help to
>>>>>>> serve more requests! The code will be made available soon and I will
>>>>>>> post updates as well.
>>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Amin
>>>>>>> _______________________________________________
>>>>>>> openflow-discuss mailing list
>>>>>>> openflow-disc...@lists.stanford.edu
>>>>>>> https://mailman.stanford.edu/mailman/listinfo/openflow-discuss
>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> nox-dev mailing list
>>>>> nox-dev@noxrepo.org
>>>>> http://noxrepo.org/mailman/listinfo/nox-dev_noxrepo.org
>>>>>
>>>>
>>>
>>
>
