Add a user guide on debugging and troubleshooting common issues and bottlenecks found in the sample application model.
Signed-off-by: Vipin Varghese <vipin.vargh...@intel.com>
Acked-by: Marko Kovacevic <marko.kovace...@intel.com>
---
 doc/guides/howto/debug_troubleshoot_guide.rst | 375 ++++++++++++++++++
 doc/guides/howto/index.rst                    |   1 +
 2 files changed, 376 insertions(+)
 create mode 100644 doc/guides/howto/debug_troubleshoot_guide.rst

diff --git a/doc/guides/howto/debug_troubleshoot_guide.rst b/doc/guides/howto/debug_troubleshoot_guide.rst
new file mode 100644
index 000000000..f2e337bb1
--- /dev/null
+++ b/doc/guides/howto/debug_troubleshoot_guide.rst
@@ -0,0 +1,375 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+   Copyright(c) 2018 Intel Corporation.
+
+.. _debug_troubleshoot_via_pmd:
+
+Debug & Troubleshoot guide via PMD
+==================================
+
+DPDK applications can range in design from a single thread with a simple
+processing stage to multiple threads with complex pipeline stages. These
+applications can use poll mode drivers, which help offload CPU cycles. A
+few deployment models are
+
+ * single primary
+ * multiple primary
+ * single primary single secondary
+ * single primary multiple secondary
+
+In all the above cases, it is a tedious task to isolate, debug and
+understand odd behaviour which occurs randomly or periodically. The goal
+of this guide is to share and explore a few commonly seen patterns and
+behaviours, and then to isolate and identify the root cause via step by
+step debugging at various processing stages.
+
+Application Overview
+--------------------
+
+Let us take an example application as a reference to explain the issues
+and patterns commonly seen. The sample application under discussion uses
+a single primary model with various pipeline stages, and makes use of
+PMDs and libraries such as service cores, mempool, pkt mbuf, event,
+crypto, QoS and eth.
+
+The overview of an application modelled using PMDs is shown in
+:numref:`dtg_sample_app_model`.
+
+.. _dtg_sample_app_model:
+
+.. figure:: img/dtg_sample_app_model.*
+
+   Overview of the pipeline stages of an application
+
+Bottleneck Analysis
+-------------------
+
+To debug bottlenecks and performance issues the application is run in an
+environment matching the conditions below
+
+#. Linux 64-bit|32-bit
+#. DPDK PMDs and libraries are used
+#. Libraries and PMDs are either static or shared, but not both
+#. Machine flag optimizations of gcc or the compiler are kept constant
+
+Is there a mismatch in the packet rate (received < sent)?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+RX port and associated core :numref:`dtg_rx_rate`.
+
+.. _dtg_rx_rate:
+
+.. figure:: img/dtg_rx_rate.*
+
+   RX send rate compared against received rate
+
+#. Are the generic configurations correct?
+
+   - What is the port speed and duplex? rte_eth_link_get()
+   - Are packets of larger sizes dropped? rte_eth_dev_get_mtu()
+   - Are only specific MACs received? rte_eth_promiscuous_get()
+
+#. Are there NIC specific drops?
+
+   - Check rte_eth_rx_queue_info_get() for nb_desc and scattered_rx
+   - Is RSS enabled? rte_eth_dev_rss_hash_conf_get()
+   - Are packets spread across all queues? rte_eth_stats_get()
+   - Are stats for RX and drops updated on the same queue? Check the
+     receive thread.
+   - Does the packet not reach the PMD? Check whether the offloads for
+     the port and queue match the traffic pattern being sent.
+
+#. If the problem still persists, it might be at the RX lcore thread
+
+   - Is the RX thread, distributor or event RX adapter processing less
+     than required?
+   - Is the application built as a processing pipeline with an RX stage?
+     If there are multiple port-pairs tied to a single RX core, try to
+     debug using rte_prefetch_non_temporal(), which hints that the mbuf
+     in cache is temporary.
+
+Are there packet drops (receive|transmit)?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+RX-TX port and associated cores :numref:`dtg_rx_tx_drop`.
+
+.. _dtg_rx_tx_drop:
+
+.. figure:: img/dtg_rx_tx_drop.*
+
+   RX-TX drops
+
+#. At RX
+
+   - Get the RX queue count?
+     nb_rx_queues using rte_eth_dev_info_get()
+   - Are there misses, errors or queue errors? rte_eth_stats_get() for
+     imissed, ierrors, q_errors, rx_nombuf and rte_mbuf_ref_count
+
+#. At TX
+
+   - Are you transmitting in bulk? Check the application for TX
+     descriptor overhead.
+   - Are there TX errors? rte_eth_stats_get() for oerrors and q_errors
+   - Do specific scenarios fail to release mbufs? Check the
+     rte_mbuf_ref_count of those packets.
+   - Is the packet multi-segmented? Check whether the port and queue
+     offloads are set.
+
+Are there object drops at the producer point for the ring?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Producer point for the ring :numref:`dtg_producer_ring`.
+
+.. _dtg_producer_ring:
+
+.. figure:: img/dtg_producer_ring.*
+
+   Producer point for rings
+
+#. Performance for the producer
+
+   - Fetch the ring type via rte_ring_dump() for flags (RING_F_SP_ENQ)
+   - If '(burst enqueue - actual enqueue) > 0' check rte_ring_count() or
+     rte_ring_free_count()
+   - Does 'burst or single enqueue' always return 0? If rte_ring_full()
+     is true then the next stage is not pulling the content at the
+     desired rate.
+
+Are there object drops at the consumer point for the ring?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Consumer point for the ring :numref:`dtg_consumer_ring`.
+
+.. _dtg_consumer_ring:
+
+.. figure:: img/dtg_consumer_ring.*
+
+   Consumer point for rings
+
+#. Performance for the consumer
+
+   - Fetch the ring type via rte_ring_dump() for flags (RING_F_SC_DEQ)
+   - If '(burst dequeue - actual dequeue) > 0' check
+     rte_ring_free_count()
+   - Does 'burst or single dequeue' always return 0? Check whether the
+     ring is empty via rte_ring_empty()
+
+Are packets or objects not processed at the desired rate?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Memory objects close to NUMA :numref:`dtg_mempool`.
+
+.. _dtg_mempool:
+
+.. figure:: img/dtg_mempool.*
+
+   Memory objects have to be close to the device per NUMA
+
+#. Is the performance low?
+
+   - Are packets received from multiple NICs?
+     rte_eth_dev_count_all()
+   - Are the NIC interfaces on different sockets? Use
+     rte_eth_dev_socket_id()
+   - Is the mempool created with the right socket? rte_mempool_create()
+     or rte_pktmbuf_pool_create()
+   - Are the drops on a specific socket? If yes, check if there are
+     sufficient objects via rte_mempool_get_count() or
+     rte_mempool_avail_count()
+   - Is 'rte_mempool_get_count() or rte_mempool_avail_count()' zero? The
+     application requires more objects, hence reconfigure the number of
+     elements in rte_mempool_create().
+   - Is there a single RX thread for multiple NICs? Try having multiple
+     lcores read from a fixed interface, or we might be hitting the
+     cache limit, so increase cache_size in rte_mempool_create().
+
+#. Is the performance low for some scenarios?
+
+   - Check if there are sufficient objects in the mempool via
+     rte_mempool_avail_count()
+   - Is a failure seen for some packets? We might be getting packets
+     with 'size > mbuf data size'.
+   - Is the NIC offload or the application handling multi segment mbufs?
+     Check whether the special packets are contiguous with
+     rte_pktmbuf_is_contiguous().
+   - If separate user threads are used to access mempool objects, use
+     rte_mempool_cache_create() for non DPDK threads.
+   - Is the error reproducible with 1GB hugepages? If not, try debugging
+     the issue with a lookup table or objects with rte_mem_lock_page().
+
+.. note::
+
+   A stall in the release of mbufs can occur because
+
+   * The processing pipeline is too heavy
+   * There are too many stages
+   * TX is not transferring at the desired rate
+   * Multi segment is not offloaded at the TX device.
+   * Application misuse scenarios, such as
+
+     - not freeing packets
+     - an invalid rte_pktmbuf_refcnt_set
+     - an invalid rte_pktmbuf_prefree_seg
+
+Is there a difference in performance for crypto?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Crypto device and PMD :numref:`dtg_crypto`.
+
+.. _dtg_crypto:
+
+.. figure:: img/dtg_crypto.*
+
+   CRYPTO and interaction with PMD device
+
+#. Are the generic configurations correct?
+
+   - Get the total number of crypto devices via
+     rte_cryptodev_count()
+   - Cross check that the software or hardware flags are configured
+     properly: rte_cryptodev_info_get() for feature_flags
+
+#. Is enqueue request > actual enqueue (drops)?
+
+   - Is the queue pair set up for the right NUMA node? Check socket_id
+     using rte_cryptodev_queue_pair_setup().
+   - Is the session_pool created from the same socket_id as the queue
+     pair? If not, create it on the same NUMA node.
+   - Is the enqueue thread on the same socket_id as the object? If not,
+     try to put it on the same NUMA node.
+   - Are there errors and drops? Check err_count using
+     rte_cryptodev_stats()
+   - Do multiple threads enqueue or dequeue from the same queue pair?
+     Try debugging with separate threads.
+
+#. Is enqueue rate > dequeue rate?
+
+   - Is the dequeue lcore thread on the same socket_id?
+   - If software crypto is in use, check whether the crypto library is
+     built with the right (SIMD) flags, or check whether the queue pair
+     uses the CPU ISA for feature_flags AVX|SSE|NEON using
+     rte_cryptodev_info_get()
+   - Is hardware assisted crypto showing performance variance? Check
+     whether the hardware is on the same NUMA socket as the queue pair
+     and session pool.
+
+Worker functions not giving performance?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Custom worker function :numref:`dtg_distributor_worker`.
+
+.. _dtg_distributor_worker:
+
+.. figure:: img/dtg_distributor_worker.*
+
+   Custom worker function performance drops
+
+#. Performance
+
+   - Are thread context switches too frequent? Identify the lcore with
+     rte_lcore() and the lcore index mapping with rte_lcore_index().
+     Performance is best when the mapping of thread and core is 1:1.
+   - What are the lcore roles (type or state)? Fetch the roles, such as
+     RTE, OFF and SERVICE, using rte_eal_lcore_role().
+   - Does the application have multiple functions running on the same
+     service core? Registered functions may be exceeding the desired
+     time slots while running on the same service core.
+   - Is the function running on an RTE core? Check if there are
+     conflicting functions running on the same CPU core via
+     rte_thread_get_affinity().
+
+#. Debug
+
+   - What is the mode of operation? The master core, lcore, service core
+     and NUMA count can be fetched with rte_eal_get_configuration().
+   - Does it occur only in special scenarios? Analyze the run logic with
+     rte_dump_stack(), rte_dump_registers() and rte_memdump() for more
+     insights.
+   - Does 'perf' show data processing or memory stalls in functions?
+     Check the instructions being generated for the functions using
+     objdump.
+
+Service functions are not frequent enough?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Service functions on service cores :numref:`dtg_service`.
+
+.. _dtg_service:
+
+.. figure:: img/dtg_service.*
+
+   Functions running on service cores
+
+#. Performance
+
+   - Get the service core count using rte_service_lcore_count() and
+     compare it with the result of rte_eal_get_configuration()
+   - Check that the registered service is available using
+     rte_service_get_by_name(), rte_service_get_count() and
+     rte_service_get_name()
+   - Is the given service running in parallel on multiple lcores?
+     rte_service_probe_capability() and rte_service_map_lcore_get()
+   - Is the service running? rte_service_runstate_get()
+
+#. Debug
+
+   - Find how many services are running on a specific service lcore via
+     rte_service_lcore_count_services()
+   - Generic debug via rte_service_dump()
+
+Is there a bottleneck in eventdev?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+#. Are the generic configurations correct?
+
+   - Get the eventdev devices? rte_event_dev_count()
+   - Are they created on the correct socket_id? rte_event_dev_socket_id()
+   - Check the HW or SW capabilities? rte_event_dev_info_get() for
+     event_qos, queue_all_types, burst_mode, multiple_queue_port,
+     max_event_queue|dequeue_depth
+   - Is a packet stuck in a queue? Check for stages (event queues) where
+     packets are looped back to the same or previous stages.
+
+#. Performance drops in enqueue (event count > actual enqueue)?
+
+   - Dump the eventdev information?
+     rte_event_dev_dump()
+   - Check the stats for the eventdev queues and ports
+   - Check the inflight count and the current queue element for
+     enqueue|dequeue
+
+How to debug QoS via TM?
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+TM on the TX interface :numref:`dtg_qos_tx`.
+
+.. _dtg_qos_tx:
+
+.. figure:: img/dtg_qos_tx.*
+
+   Traffic Manager just before TX
+
+#. Is the configuration right?
+
+   - Get the current capabilities of the DPDK port for max nodes,
+     levels, private shaper, shared shaper, sched_n_children and
+     stats_mask using rte_tm_capabilities_get()
+   - Check whether the current leaves are configured identically by
+     fetching leaf_nodes_identical using rte_tm_capabilities_get()
+   - Get the leaf nodes for a DPDK port via
+     rte_tm_get_number_of_leaf_nodes()
+   - Check the level capabilities via rte_tm_level_capabilities_get()
+     for n_nodes
+
+     - max, nonleaf_max, leaf_max
+     - identical, non_identical
+     - shaper_private_supported
+     - stats_mask
+     - cman WRED packet|byte supported
+     - cman head drop supported
+
+   - Check the node capabilities via rte_tm_node_capabilities_get() for
+     n_nodes
+
+     - shaper_private_supported
+     - stats_mask
+     - cman WRED packet|byte supported
+     - cman head drop supported
+
+   - Debug via stats: rte_tm_stats_update() and rte_tm_node_stats_read()
+
+Packet is not of the right format?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Packet capture before and after processing :numref:`dtg_pdump`.
+
+.. _dtg_pdump:
+
+.. figure:: img/dtg_pdump.*
+
+   Capture points of traffic at RX-TX
+
+#. Where to capture packets?
+
+   - Enable pdump in the primary process to allow a secondary process to
+     access the queue-pairs for the ports. Packets are then copied over
+     in the RX|TX callbacks by the secondary process using ring buffers.
+   - To capture a packet in the middle of a pipeline stage, user
+     specific hooks or callbacks have to be used to copy the packets.
+     These packets can be shared with the secondary process via user
+     defined custom rings.
+
+Issue still persists?
+~~~~~~~~~~~~~~~~~~~~~
+
+#. Are there custom or vendor specific offload meta data?
+
+   - From the PMD: check for META data errors and drops.
+   - From the application: check for META data errors and drops.
+
+#. Is multi-process used for configuration and data processing?
+
+   - Check whether enabling or disabling features from the secondary
+     process is supported.
+
+#. Are there drops for certain scenarios for packets or objects?
+
+   - Check the user private data in the objects by dumping the details
+     for debug.
+
+How to develop custom code to debug?
+------------------------------------
+
+- For a single process, the debug functionality has to be added to the
+  same process.
+- For multiple processes, the debug functionality can be added to a
+  secondary multi-process application.
+
+.. note::
+
+   The primary's debug functions can be invoked via
+
+   #. a timer call-back
+   #. a service function running under a service core
+   #. a USR1 or USR2 signal handler
diff --git a/doc/guides/howto/index.rst b/doc/guides/howto/index.rst
index a642a2be1..9527fa84d 100644
--- a/doc/guides/howto/index.rst
+++ b/doc/guides/howto/index.rst
@@ -18,3 +18,4 @@ HowTo Guides
     virtio_user_as_exceptional_path
     packet_capture_framework
     telemetry
+    debug_troubleshoot_guide
-- 
2.17.1