My general question is: what meaning should I give to lossy traceroutes when pings show no problem?

Can I expect that backbone routers should never give me timeouts on a traceroute through them, so that lots of asterisks from these systems indicate a packet loss problem that needs to be fixed?

Or are these traceroute asterisks essentially meaningless, and to be expected on any busy link?

More specifically, is anyone else getting lots of *s for NYC1.gblx.net for traceroutes through them? If I do three traceroutes through there, at least one will show losses at or beyond the NYC1 hops (and, the *s beyond NYC1 might be getting lost in NYC1, rather than indicating a different error). But, Global Crossing's on-line tools don't show any loss.

I am at simons-rock.edu, in Western Mass, and we connect via Boston. A few days ago, our users of a database that's hosted at our parent campus, bard.edu, started complaining of frequent (but intermittent) delays. Bard is in the Hudson Valley, and connects via Poughkeepsie. Both of our local providers connect to Global Crossing. Once before, we saw similar database symptoms, and that time, Bard had a problem dropping packets at their gateway. So I think these symptoms mean packet loss is happening somewhere. However, this time, pings from Simon's Rock to Bard, and vice-versa, show essentially no errors; typically, a run of 1000 pings gets through with no loss at all.

Still, despite the good pings, traceroutes from either end show lots of asterisks at or after Global Crossing's NYC1.gblx.net links. I have opened a ticket with our provider, who has opened one with Global Crossing; and Bard has done the same with their end, but no significant response so far. (Bard's Graduate campus, located in New York City, is having similar poor database performance, so I'm pretty sure it is not just my end. Staff at the main Bard campus have no troubles, so it seems a network problem, not a server problem.)

As I understand it, an asterisk in traceroute means that the sending machine did not get any reply to a given probe packet. Since the traceroute packets are sent with small TTL values, the sender expects a reply from whichever router decrements the TTL to zero. But I don't know if big routers are just lazy about sending such responses, or if these asterisks really indicate packets getting lost. (As far as I remember, when things work well, I never see *s at the central links, but I have not really done any baseline testing of the link from here to Bard while the database was working.)
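
To make the asterisk counting concrete, here is a minimal sketch in Python. It assumes the usual traceroute text layout, where each answered probe prints as "N ms" and each unanswered probe prints as a lone "*". The function name hop_loss is made up for illustration; it is not part of any traceroute tool. The sample string is hop 8 of the Simon's Rock to Bard trace in this message, without the trailing loss summary that -S prints.

```python
import re

# One answered probe looks like "55.909 ms" or "4 ms".
RTT = re.compile(r"\d+(?:\.\d+)? ms\b")


def hop_loss(hop_text):
    """Return (answered, unanswered, loss_percent) for one hop's probes.

    hop_text is the part of a traceroute line after the hop number.
    Each lone '*' token is one probe with no reply; each 'N ms' is one reply.
    """
    answered = len(RTT.findall(hop_text))
    unanswered = sum(1 for tok in hop_text.split() if tok == "*")
    total = answered + unanswered
    loss = 100.0 * unanswered / total if total else 0.0
    return answered, unanswered, loss


# Hop 8 from the Simon's Rock to Bard trace (-q5, so five probes).
sample = ("* * wbs-connect.gigabitethernet1-0-2.asr1.jfk1.gblx.net "
          "(64.211.195.6) 55.909 ms 73.803 ms *")
print(hop_loss(sample))  # (2, 3, 60.0), matching the "(60% loss)" in the trace
```

Note that this only counts missing replies. It cannot tell whether the probe, the router's reply, or neither was actually dropped, which is exactly the question above.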

So, another question is why pings work so well when traceroutes work so poorly. (By experiment, I believe our database application performs more like traceroute than like ping.) Is it packet size? Different handling for different sorts of traffic? Magic?

Here are some sample traceroutes each way:
Simon's Rock to Bard:

2h189:bin skbohrer$ traceroute -q5 -S bip.bard.edu
traceroute to bip.bard.edu (192.246.228.16), 64 hops max, 40 byte packets
 1  10.30.2.1 (10.30.2.1) 1.514 ms 1.791 ms 0.684 ms 0.761 ms 0.712 ms (0% loss)
 2  michael.simons-rock.edu (208.81.88.1) 2.509 ms 1.882 ms 0.899 ms 1.345 ms 2.057 ms (0% loss)
 3  64.213.79.249 (64.213.79.249) 104.294 ms 10.605 ms 17.106 ms 18.987 ms 38.740 ms (0% loss)
 4  pos2-0-155M.cr2.BOS1.gblx.net (67.17.70.166) 21.962 ms 20.411 ms 8.394 ms 23.308 ms 10.192 ms (0% loss)
 5  so1-2-0-2488M.scr2.NYC1.gblx.net (67.17.94.158) 15.738 ms 14.582 ms 17.306 ms 24.444 ms 15.466 ms (0% loss)
 6  ae3-30g.scr3.NYC1.gblx.net (67.17.104.189) 15.586 ms 13.358 ms ae0-30G.scr4.NYC1.gblx.net (67.16.139.2) 13.875 ms 13.495 ms 12.780 ms (0% loss)
 7  e5-1-30G.ar9.NYC1.gblx.net (67.16.142.54) 75.184 ms lag1.ar9.NYC1.gblx.net (67.16.142.50) 15.766 ms 11.947 ms * e5-1-30G.ar9.NYC1.gblx.net (67.16.142.54) 25.916 ms (20% loss)
 8  * * wbs-connect.gigabitethernet1-0-2.asr1.jfk1.gblx.net (64.211.195.6) 55.909 ms 73.803 ms * (60% loss)
 9  * pghknyshj42-xe-0-3-0.lightower.net (72.22.160.150) 16.521 ms 21.817 ms 23.715 ms 17.236 ms (20% loss)
10  pghknyshj91-ae0-66.lightower.net (72.22.160.165) 76.257 ms 27.712 ms 20.372 ms 18.923 ms 55.355 ms (0% loss)
11  kgtnnykgj91-ae3.66.lightower.net (72.22.160.107) 18.088 ms 51.631 ms 19.052 ms 20.876 ms 22.942 ms (0% loss)
12  BardCollege-cust.customer.hvdata.net (64.72.66.234) 51.243 ms 47.800 ms 32.835 ms 19.040 ms 55.661 ms (0% loss)
13  *^C


Bard to SR (their version of traceroute doesn't have the handy -S option):

SRDB/users/usrsr/finrep: traceroute mail.simons-rock.edu
trying to get source for mail.simons-rock.edu
source should be 10.20.11.23
traceroute to hedwig.simons-rock.edu (208.81.88.14) from 10.20.11.23 (10.20.11.23), 30 hops max
outgoing MTU = 1500
 1  hcrcgw (10.20.11.1)  1 ms  0 ms  0 ms
 2  hyphen (192.246.235.1)  1 ms  1 ms  1 ms
 3  BardCollege-hvdn.customer.hvdata.net (64.72.66.233)  1 ms  1 ms  1 ms
 4  pghknyshj91-xe-5-2-0.lightower.net (72.22.160.106)  2 ms  2 ms  2 ms
 5  pghknyshj42-ae0-66.lightower.net (72.22.160.159)  27 ms  2 ms  2 ms
 6  nycmnyzrj42-xe-0-3-0.lightower.net (72.22.160.151)  4 ms  4 ms  4 ms
 7  ve463.ar9.NYC1.gblx.net (64.211.195.5)  4 ms  4 ms  4 ms
 8  * ae0-40G.scr1.NYC1.gblx.net (67.16.138.253)  4 ms  4 ms
 9  pos5-0-2488M.cr1.BOS1.gblx.net (67.17.94.57)  9 ms pos9-0-2488M.cr2.BOS1.gblx.net (67.17.94.157)  9 ms  11 ms
10  pos1-0-0-155M.ar1.BOS1.gblx.net (67.17.70.165)  14 ms  10 ms  9 ms
11  64.213.79.250 (64.213.79.250)  15 ms  15 ms  18 ms
^C


For more automated testing, I used -m10 to set the max hops so that the traces stop within the backbone network, which avoids any issue of the boxes at the ends not really responding to traceroutes. That way, I could assume any * was a real timeout. I also used -q4 for 4 queries to each hop. Out of a few hundred traceroutes in each direction, more than 75% from SR to Bard, and more than 94% from Bard to SR, showed an asterisk at or past the NYC1 hops. There were zero asterisks on the links before NYC1 from either side.
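
A tally like the one above could be scripted roughly as follows. This is a sketch under stated assumptions, not the procedure actually used for the numbers in this message: it assumes each saved trace has one hop per line, and the function names star_at_or_past and fraction_with_stars are hypothetical. The two sample traces are excerpts of hops 6 to 8 from the Bard to SR trace in this message.

```python
def star_at_or_past(trace_text, marker="NYC1"):
    """True if any probe timed out at or after the first hop naming marker."""
    reached = False
    for line in trace_text.splitlines():
        if marker in line:
            reached = True
        # A lone '*' token is one unanswered probe on this hop.
        if reached and "*" in line.split():
            return True
    return False


def fraction_with_stars(traces, marker="NYC1"):
    """Fraction (0.0 to 1.0) of traces showing a '*' at or past marker."""
    if not traces:
        return 0.0
    hits = sum(1 for t in traces if star_at_or_past(t, marker))
    return hits / len(traces)


# Excerpts from the Bard to SR trace: hops 6-7 are clean, hop 8 lost a probe.
clean = (" 6  nycmnyzrj42-xe-0-3-0.lightower.net (72.22.160.151)  4 ms  4 ms  4 ms\n"
         " 7  ve463.ar9.NYC1.gblx.net (64.211.195.5)  4 ms  4 ms  4 ms")
lossy = clean + "\n 8  * ae0-40G.scr1.NYC1.gblx.net (67.16.138.253)  4 ms  4 ms"

print(fraction_with_stars([clean, lossy]))  # 0.5
```

In real use, the list would hold the saved output of each traceroute -m10 -q4 run. One limitation of this sketch: a hop whose probes all time out prints no hostname, so the marker has to appear on that hop or an earlier one for the asterisks to be counted.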

Thanks for any insights.

Steve Bohrer
Network Administrator
ITS, Bard College at Simon's Rock
413-528-7645



