Comments below .... On Fri, May 28, 2010 at 8:36 AM, Bruno Randolf <b...@einfach.org> wrote: > On Friday 28 May 2010 10:58:10 Robert Brown wrote: >> It looks like the code I >> added has made the corruption problem much worse. > > welcome to the wonderful world of ath5k "heisenbugs"! :) > > actually we can be happy that you have something repeatable here. i'm still > trying to reproduce the same problem. can you give us more information how you > transfer the data? how do you detect the duplicates?
I'm glad I have something reproducible too. I'm just FTPing a 1 Gb file containing random data to my Acer. Here's my duplicate packet detection code. Note that I have not yet found a case where a packet printed out by this code turned up duplicated in the destination file. So, as far as I know, the duplication is happening via some other means than the hardware sending the same packet twice. The patch does vastly increase the rate of errors, however, maybe just due to CPU usage. I've running the ath5k code compiled with debugging, so that may be a factor too. For my test run I turned on the "dumprx" and "reset" debugging options. The first is needed to see the output the patch generates. *** drivers/net/wireless/ath/ath5k/base.h.~1~ Tue May 18 15:06:03 2010 --- drivers/net/wireless/ath/ath5k/base.h Thu May 27 13:00:47 2010 *************** *** 166,171 **** --- 166,175 ---- #ifdef CONFIG_ATH5K_DEBUG struct ath5k_dbg_info debug; /* debug info */ + + #define HIST_SIZE 100 + unsigned char history[HIST_SIZE]; + #endif /* CONFIG_ATH5K_DEBUG */ struct ath5k_buf *bufptr; /* allocated buffer ptr */ *** drivers/net/wireless/ath/ath5k/base.c.~1~ Tue May 18 15:06:03 2010 --- drivers/net/wireless/ath/ath5k/base.c Thu May 27 13:00:21 2010 *************** *** 2081,2087 **** sc->curband->bitrates[rxs->rate_idx].hw_value_short) rxs->flag |= RX_FLAG_SHORTPRE; ! ath5k_debug_dump_skb(sc, skb, "RX ", 0); ath5k_update_beacon_rssi(sc, skb, rs.rs_rssi); --- 2081,2101 ---- sc->curband->bitrates[rxs->rate_idx].hw_value_short) rxs->flag |= RX_FLAG_SHORTPRE; ! ! ! #define OFFSET 1000 ! if (skb->len > OFFSET + HIST_SIZE) { ! unsigned char *data = skb->data + OFFSET; ! if (memcmp(sc->history, data, HIST_SIZE) == 0) { ! ath5k_debug_dump_skb(sc, skb, "ZZ ", 0); ! } ! memcpy(sc->history, data, HIST_SIZE); ! } ! ! ! ! ! // ath5k_debug_dump_skb(sc, skb, "RX ", 0); ath5k_update_beacon_rssi(sc, skb, rs.rs_rssi); > also i'm wondering: how can it happen that the repeated packet passes the > upper layers unnoticed? My repeated packet detection hack is not perfect and given how often it fires it's probably showing TCP retransmissions or something similar that's totally normal.. The packets displayed appear in the destination file but I've yet to see them appear repeated in the destination file. >> For now, however, here are some of the debugging messages generated by the >> resets I saw while transferring data. Perhaps the presence of repeated hex >> values in the "rx queue" messages or the info in the "R" receive buffer >> dumps contains a clue about what's going on. > > i think, the repeated" rx queue hex" prints should be o.k. since this is just > the RXDP (current rx descriptor for the hardware to use), and it does not have > to change if no packet is received. more comments below: > >> (ath5k_reset:2955): resetting >> May 27 19:58:59 ugg kernel: [68460.373320] ath5k phy0: >> (ath5k_txq_cleanup:1666): beacon queue 0 >> May 27 19:58:59 ugg kernel: [68460.373462] ath5k phy0: >> (ath5k_txq_cleanup:1675): txq [0] 312206e0, link (null) >> May 27 19:58:59 ugg kernel: [68460.373615] rx queue 31220200, link f12201c0 >> May 27 19:58:59 ugg kernel: [68460.373631] R (f12201e0 312201e0) >> 31220200 32ac7020 00000000 00000a00 109d8065 109d8065 * >> May 27 19:58:59 ugg kernel: [68460.428534] rx queue 31220220, link f1220200 >> May 27 19:58:59 ugg kernel: [68460.484518] rx queue 31220260, link f1220240 >> May 27 19:58:59 ugg kernel: [68460.540524] rx queue 31220260, link f1220240 >> May 27 19:58:59 ugg kernel: [68460.597526] rx queue 31220280, link f1220260 >> May 27 19:58:59 ugg kernel: [68460.653502] rx queue 312202c0, link f12202a0 >> May 27 19:58:59 ugg kernel: [68460.709503] rx queue 31220320, link f1220300 >> >> This is interesting. Repeated packet in the middle of the reset. > > very interesting! - there should be no packet reception during a reset!!! I don't understand the reset code or the strategy it's implementing. Naively, I would expect the reset code to take a lock and set a global "we are resetting" boolean, a flag that packet reception and transmission would check, but I don't think that's what's going on. > also i am wondering why you have so many resets? (a review of the reset code > is the next thing we should do... it gives us trouble in other places too...) I have no clue about the frequency of resets. I'll have to track down the reasons the chip can be reset and make the code print out why each reset happened. > i looked at the debug output more. it prints descriptors which are still in > the RX queue just at the beginning of the reset function. > > [68301.828987] rx queue 31220120, link f12200e0 > [68301.829002] R (f1220100 31220100) 31220120 2a343020 00000000 00000a00 > 12bd8608 12bd8608 * > > [68301.888512] rx queue 31220180, link f1220140 > [68301.888540] R (f1220160 31220160) 31220180 370f5020 00000000 00000a00 > 21bd8608 21bd8608 ! > > [68307.773523] rx queue 312202a0, link f1220260 > [68307.773531] R (f1220280 31220280) 312202a0 370f5020 00000000 00000a00 > 11cd8123 11cd8123 * > > [68389.913677] rx queue 312201c0, link f1220180 > [68389.913688] R (f12201a0 312201a0) 312201c0 32ac4020 00000000 00000a00 > 224d8608 224d8608 * > > [68389.972482] rx queue 312201e0, link f12201a0 > [68389.972493] R (f12201c0 312201c0) 312201e0 32ac7020 00000000 00000a00 > 22ad8075 22ad8075 * > > [68390.308460] rx queue 312202c0, link f1220280 > [68390.308476] R (f12202a0 312202a0) 312202c0 32a8f020 00000000 00000a00 > 10ad805c 10ad805c * > > [68460.373615] rx queue 31220200, link f12201c0 > [68460.373631] R (f12201e0 312201e0) 31220200 32ac7020 00000000 00000a00 > 109d8065 109d8065 * > > the fields in the R... line are in this order: > > ds, > (unsigned long long)bf->daddr, > ds->ds_link, > ds->ds_data, > rd->rx_ctl.rx_control_0, > rd->rx_ctl.rx_control_1, > rd->u.rx_stat.rx_status_0, > rd->u.rx_stat.rx_status_0, > !done ? ' ' : (rs->rs_status == 0) ? '*' : '!'); > > i'm not sure if it's relevant or not, but if we look at the adresses in the > last 5 lines, they are all somehow related... > > none of these descriptors was a JUMBO b.t.w. > > bruno _______________________________________________ ath5k-devel mailing list ath5k-devel@lists.ath5k.org https://lists.ath5k.org/mailman/listinfo/ath5k-devel