G'Day Fletcher,

On Thu, Jan 24, 2008 at 09:58:32AM -0800, Fletcher Cocquyt wrote:
> 
>    Fellow dtracers - I am diagnosing a busy spamassassin/mailman/sendmail
>    system - the CPU and memory usage seems normal, but iotop shows I/O
>    and network are very busy (and initiating TCP connections is quite
>    slow sometimes (even after tuning the stack))
> 
> 
>    I suspect DNS latency - how can I use Dtrace to quantify latency due
>    to DNS lookups on this server?

There isn't a DNS provider yet, or plans to write one.  We do need
something to serve this role, as DNS latencies are a common woe.  DNS
probes were added to the Mozilla DTrace provider as part of firefox:

http://blogs.sun.com/jmr/entry/dtrace_mozilla_dns_timing_mystery

Which demonstrates how useful this data can be.

You can hit up the individual processes with the pid provider, which
will be easier if requests funnel through one process, such as nscd
(Sun's standard local caching deamon), or named (BIND, which you could
setup as a forwarder).

Digging using the pid provider in this case is easier than it sounds -
the DNS lookup latency, which we are interested in, will dwarf others.
For example, here I traced the a.out segment of nscd while I performed
a host lookup:

# dtrace -Fn 'BEGIN { last = timestamp; } 
pid$target:a.out::entry,pid$target:a.out::return { trace(timestamp - last); 
last = timestamp; }' -p `pgrep nscd`
dtrace: description 'BEGIN ' matched 801 probes
CPU FUNCTION                                 
  0  -> switcher                                     6132453361
  0    -> _nscd_restart_if_cfgfile_changed                 5578
  0    <- _nscd_restart_if_cfgfile_changed                32471
  0    -> validate_pheader                                 2508
  0    <- validate_pheader                                 2151
  0    -> if_selfcred_return_per_user_door                 1777
  0    <- if_selfcred_return_per_user_door                 1988
  0    -> lookup                                           1939
  0      -> nsc_lookup                                     3474
  0        -> lookup_int                                   1505
  0          -> get_cache_ctx                              4978
  0            -> get_cache_idx                            1725
  0            <- get_cache_idx                            2608
  0          <- get_cache_ctx                              1695
  0          -> nsc_get_db                                 1646
  0          <- nsc_get_db                                 2172
  0          -> check_config                               1987
  0            -> nsw_config_in_phdr                       2304
  0            <- nsw_config_in_phdr                       1913
  0          <- check_config                               2049
  0          -> check_db_file                              1677
  0          <- check_db_file                             15843
  0          -> lookup_cache                               2036
  0            -> hash_find                                1917
  0              -> nsc_db_cis_key_gethash                 1639
  0                -> cis_gethash                          1730
  0                <- cis_gethash                          3025
  0              <- nsc_db_cis_key_gethash                 1612
  0            <- hash_find                                2064
  0            -> nsc_db_cis_key_compar                    1905
  0            <- nsc_db_cis_key_compar                    2292
  0            -> nsc_db_cis_key_compar                     831
  0            <- nsc_db_cis_key_compar                    1175
  0            -> nsc_db_cis_key_compar                     709
  0            <- nsc_db_cis_key_compar                    1065
  0            -> nsc_db_cis_key_compar                     717
  0            <- nsc_db_cis_key_compar                    1262
  0            -> nsc_db_cis_key_compar                     784
  0            <- nsc_db_cis_key_compar                    1035
  0            -> nsc_db_cis_key_compar                     724
  0            <- nsc_db_cis_key_compar                    1100
  0            -> queue_adjust                             4504
  0            <- queue_adjust                             1768
  0          <- lookup_cache                               2102
  0          -> _nscd_get_clearance                        2225
  0          <- _nscd_get_clearance                        4161
  0          -> nss_psearch                                2228
  0            -> set_initf_key                            3403
  0            <- set_initf_key                            2349
  0            -> nss_search                               1724
  0              -> getparams                              1782
  0                -> nscd_initf                           1617
  0                  -> get_initf_key                      1430
  0                  <- get_initf_key                      1616
  0                <- nscd_initf                           1553
  0              <- getparams                              2424
  0              -> try_local                              2038
  0              <- try_local                              1664
  0              -> _nscd_get_nsw_state                    2198
  0                -> _get_nsw_state_int                   1643
  0                  -> _nscd_mutex_lock                   4128
  0                    -> _nscd_get                        1747
  0                      -> _nscd_is_int_addr              2106
  0                        -> _nscd_get_db_entry              5274
  0                          -> calc_hash                  1578
  0                          <- calc_hash                  2011
  0                          -> scan_hash                  1609
  0                          <- scan_hash                  1403
  0                        <- _nscd_get_db_entry              1378
  0                      <- _nscd_is_int_addr              2266
  0                    <- _nscd_get                        1809
  0                  <- _nscd_mutex_lock                   1918
  0                  -> _nscd_mutex_unlock                 1917
  0                    -> _nscd_release                    1718
  0                    <- _nscd_release                    2237
  0                  <- _nscd_mutex_unlock                 1825
  0                <- _get_nsw_state_int                   1788
  0              <- _nscd_get_nsw_state                    1492
  0              -> _nscd_get_smf_state                    2513
  0              <- _nscd_get_smf_state                    2243
  0              -> try_local2                             1454
  0              <- try_local2                             1689
  0              -> search_dns_withttl                     2024
  0              <- search_dns_withttl                     1823
  0              -> retry_test                            69505
  0              <- retry_test                             1923
  0              -> _nscd_get_smf_state                     967
  0              <- _nscd_get_smf_state                    1116
  0              -> try_local2                              700
  0              <- try_local2                              845
  0              -> search_dns_withttl                      950
  0                -> get_dns_funcs                        1651
  0                  -> get_lib_func                       1484
  0                  <- get_lib_func                       1656
  0                <- get_dns_funcs                        1184
  0              <- search_dns_withttl                164291454
  0              -> retry_test                             6615
  0              <- retry_test                             3033
  0              -> _nscd_put_nsw_state                    2427
  0                -> _put_nsw_state_int                   1765
  0                  -> _nscd_mutex_lock                   3284
  0                    -> _nscd_get                        1895
  0                      -> _nscd_is_int_addr              2480
  0                        -> _nscd_get_db_entry              6470
  0                          -> calc_hash                  1624
  0                          <- calc_hash                  2095
  0                          -> scan_hash                  1683
  0                          <- scan_hash                  1463
  0                        <- _nscd_get_db_entry              1642
  0                      <- _nscd_is_int_addr              2734
  0                    <- _nscd_get                        2036
  0                  <- _nscd_mutex_lock                   1903
  0                  -> _nscd_mutex_unlock                 1770
  0                    -> _nscd_release                    1742
  0                    <- _nscd_release                    2148
  0                  <- _nscd_mutex_unlock                 1970
  0                <- _put_nsw_state_int                   1796
  0              <- _nscd_put_nsw_state                    1368
  0            <- nss_search                               1887
  0            -> clear_initf_key                          2140
  0            <- clear_initf_key                          2251
  0          <- nss_psearch                                2240
  0          -> _nscd_release_clearance                    2921
  0          <- _nscd_release_clearance                    3870
  0          -> nscd_signal                                1775
  0          <- nscd_signal                                1744
  0          -> dup_packed_buffer                          2289
  0          <- dup_packed_buffer                          4124
  0          -> start_threads                              6257
  0          <- start_threads                              1899
  0        <- lookup_int                                   2260
  0      <- nsc_lookup                                     1454
^C

So search_dns_withttl() looks interesting.

A bit more digging showed that libresolv's res_nquery() looked even better,
since it has the lookup string as arg1.

Tracing res_nquery() can be performed like this:

------------------------------------------------------------
#!/usr/sbin/dtrace -s

#pragma D option quiet

dtrace:::BEGIN
{
        printf("%-20s %-10s %10s %s\n", "TIME", "FUNC", "DELTA(ms)", "ARG");
}

pid$target:libresolv:res_nquery:entry
{
        self->start = timestamp;
        self->arg1 = copyinstr(arg1);

}

pid$target:libresolv:res_nquery:return
/self->start/
{
        this->delta = (timestamp - self->start) / 1000000;
        printf("%-20Y %-10s %10d %s\n", walltimestamp, probefunc, this->delta,
            self->arg1);
        self->start = 0;
        self->arg1 = 0;
}
------------------------------------------------------------

An example of this running:

   # ./nscd_dns_lookups.d -p `pgrep nscd`
   TIME                 FUNC        DELTA(ms) ARG
   2008 Jan 24 11:23:11 res_nquery         21 wikis.sun.com
   2008 Jan 24 11:23:17 res_nquery       6110 www.sun.co.uk
   2008 Jan 24 11:23:49 res_nquery          0 grouper

"grouper" was a local lookup, and returned quickly.  "www.sun.co.uk"
really did take 6.1 seconds to return.

I spent about ten minutes writing nscd_dns_lookups.d as a demo - don't
treat this as the best possible way to get this data.

We should have a provider for nscd, however this won't trace applications
that talk to DNS servers directly.  There might be a better way to do DNS
tracing once the DTrace network providers exist.

Brendan

-- 
Brendan
[CA, USA]
_______________________________________________
dtrace-discuss mailing list
dtrace-discuss@opensolaris.org

Reply via email to