[PATCH bpf-next] libbpf: Fix unintentional success return code in bpf_object__load
There are code paths where EINVAL is returned directly without setting errno. In that case, errno could be 0, which would mask the failure. For example, if a careless programmer set log_level to 1 out of laziness, they would have to spend a long time trying to figure out why. Fixes: 4f33ddb4e3e2 ("libbpf: Propagate EPERM to caller on program load") Signed-off-by: Alex Gartrell --- tools/lib/bpf/libbpf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 2e2523d8bb6d..8f9e7d281225 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -6067,7 +6067,7 @@ load_program(struct bpf_program *prog, struct bpf_insn *insns, int insns_cnt, free(log_buf); goto retry_load; } - ret = -errno; + ret = errno ? -errno : -LIBBPF_ERRNO__LOAD; cp = libbpf_strerror_r(errno, errmsg, sizeof(errmsg)); pr_warn("load bpf program failed: %s\n", cp); pr_perm_msg(ret); -- 2.26.0
Re: [PATCH] net: Provide lwtstate_free for non-lwt config
On Wed, Aug 19, 2015 at 4:28 PM, Eric Dumazet wrote: > We've seen similar patches earlier today on netdev ;) Ah, the mailing list equivalent of someone letting you know that you left your fly down :) Sorry for the noise! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: Provide lwtstate_free for non-lwt config
On Wed, Aug 19, 2015 at 3:16 PM, Alex Gartrell wrote: > Fixes: df383e624 ("lwtunnel: fix memory leak") Ah sorry, this is PATCH net-next, obviously :/ -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] net: Provide lwtstate_free for non-lwt config
Fixes: df383e624 ("lwtunnel: fix memory leak") Signed-off-by: Alex Gartrell --- include/net/lwtunnel.h | 4 1 file changed, 4 insertions(+) diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h index 34fd8f7..cfee539 100644 --- a/include/net/lwtunnel.h +++ b/include/net/lwtunnel.h @@ -93,6 +93,10 @@ int lwtunnel_input6(struct sk_buff *skb); #else +static inline void lwtstate_free(struct lwtunnel_state *lws) +{ +} + static inline struct lwtunnel_state * lwtstate_get(struct lwtunnel_state *lws) { -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Panic with demuxed ipv4 multicast udp sockets on 4.0.4
Hey Gregory, On Sun, Aug 2, 2015 at 2:28 PM, Gregory Hoggarth wrote: > I will apply the new suggested patch, reverting previous patch, and test > overnight and update tomorrow. Did this solve your problem? If not, would you mind sharing a repro? -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next] ebpf: Allow dereferences of PTR_TO_STACK registers
mov %rsp, %r1 ; r1 = rsp add $-8, %r1; r1 = rsp - 8 store_q $123, -8(%rsp) ; *(u64*)r1 = 123 <- valid store_q $123, (%r1) ; *(u64*)r1 = 123 <- previously invalid mov $0, %r0 exit; Always need to exit And we'd get the following error: 0: (bf) r1 = r10 1: (07) r1 += -8 2: (7a) *(u64 *)(r10 -8) = 999 3: (7a) *(u64 *)(r1 +0) = 999 R1 invalid mem access 'fp' Unable to load program We already know that a register is a stack address and the appropriate offset, so we should be able to validate those references as well. Signed-off-by: Alex Gartrell --- kernel/bpf/verifier.c | 6 - samples/bpf/test_verifier.c | 59 + 2 files changed, 64 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 039d866..cd307df 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -648,6 +648,9 @@ static int check_mem_access(struct verifier_env *env, u32 regno, int off, struct verifier_state *state = &env->cur_state; int size, err = 0; + if (state->regs[regno].type == PTR_TO_STACK) + off += state->regs[regno].imm; + size = bpf_size_to_bytes(bpf_size); if (size < 0) return size; @@ -667,7 +670,8 @@ static int check_mem_access(struct verifier_env *env, u32 regno, int off, if (!err && t == BPF_READ && value_regno >= 0) mark_reg_unknown_value(state->regs, value_regno); - } else if (state->regs[regno].type == FRAME_PTR) { + } else if (state->regs[regno].type == FRAME_PTR || + state->regs[regno].type == PTR_TO_STACK) { if (off >= 0 || off < -MAX_BPF_STACK) { verbose("invalid stack off=%d size=%d\n", off, size); return -EACCES; diff --git a/samples/bpf/test_verifier.c b/samples/bpf/test_verifier.c index 6936059..ee0f110 100644 --- a/samples/bpf/test_verifier.c +++ b/samples/bpf/test_verifier.c @@ -822,6 +822,65 @@ static struct bpf_test tests[] = { .result = ACCEPT, .prog_type = BPF_PROG_TYPE_SCHED_CLS, }, + { + "PTR_TO_STACK store/load", + .insns = { + BPF_MOV64_REG(BPF_REG_1, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -10), + BPF_ST_MEM(BPF_DW, BPF_REG_1, 2, 0xfaceb00c), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 2), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + }, + { + "PTR_TO_STACK store/load - bad alignment on off", + .insns = { + BPF_MOV64_REG(BPF_REG_1, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -8), + BPF_ST_MEM(BPF_DW, BPF_REG_1, 2, 0xfaceb00c), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 2), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .errstr = "misaligned access off -6 size 8", + }, + { + "PTR_TO_STACK store/load - bad alignment on reg", + .insns = { + BPF_MOV64_REG(BPF_REG_1, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -10), + BPF_ST_MEM(BPF_DW, BPF_REG_1, 8, 0xfaceb00c), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 8), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .errstr = "misaligned access off -2 size 8", + }, + { + "PTR_TO_STACK store/load - out of bounds low", + .insns = { + BPF_MOV64_REG(BPF_REG_1, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -8), + BPF_ST_MEM(BPF_DW, BPF_REG_1, 8, 0xfaceb00c), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 8), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .errstr = "invalid stack off=-79992 size=8", + }, + { + "PTR_TO_STACK store/load - out of bounds high", + .insns = { + BPF_MOV64_REG(BPF_REG_1, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -8), + BPF_ST_MEM(BPF_DW, BPF_REG_1, 8, 0xfaceb00c), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 8), + BPF_EXIT_INSN(), + }, + .result = REJECT, +
Re: [RFC PATCH net-next] ebpf: Allow dereferences of PTR_TO_STACK registers
On Tue, Jul 21, 2015 at 8:00 PM, Alexei Starovoitov wrote: > On Tue, Jul 21, 2015 at 07:00:40PM -0700, Alex Gartrell wrote: >> mov %rsp, %r1 ; r1 = rsp >> add $-8, %r1; r1 = rsp - 8 >> store_q $123, -8(%rsp) ; *(u64*)r1 = 123 <- valid >> store_q $123, (%r1) ; *(u64*)r1 = 123 <- previously invalid >> mov $0, %r0 >> exit; Always need to exit > > Is this your new eBPF assembler syntax? :) > imo gnu style looks ugly... ;) If you think this is ugly, you'll love the "instruction" I added to be compatible with the map fd -> immediate conversion hack :) > It's great to see such in-depth understanding of verifier!! > >> And we'd get the following error: >> >> 0: (bf) r1 = r10 >> 1: (07) r1 += -8 >> 2: (7a) *(u64 *)(r10 -8) = 999 >> 3: (7a) *(u64 *)(r1 +0) = 999 >> R1 invalid mem access 'fp' >> >> Unable to load program >> >> We already know that a register is a stack address and the appropriate >> offset, so we should be able to validate those references as well. > > yes, we can teach verifier to do that. > Though llvm doesn't generate such code. It's small enough change. I happened upon this as I was playing around with the bytecode in our 4.0 kernels. I believe that we can write general purpose utilities without needing to write C code for each use case that do things like filtering/counting packets or syscalls and outputting that data into maps at low cost, but I'm still just prototyping so I'm not ready to be an assertive jerk about it (yet) > real_off is missing alignment and bounds checks. > something like: > if (state->regs[regno].type == PTR_TO_STACK) > off += state->regs[regno].imm; > if (off % size != 0) > ... Yeah, I'm an idiot and assumed that a bounds check happened in the check_stack_read function. I'll find a way to do this without copy-pasta'ing but I'm going to stick to my moral high ground and not mutate a parameter (this lead to a bug in a job interview 6 years ago and I've never forgiven myself because the interviewer was an OpenBSD guy) > else if (state->regs[regno].type == FRAME_PTR || == PTR_TO_STACK) > .. as-is here ... > > would fix it. > > please add few accept and reject tests for this to test_verifier.c as well. psh, tests... I'll update this stuff and submit a patch. -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH net-next] ebpf: Allow dereferences of PTR_TO_STACK registers
mov %rsp, %r1 ; r1 = rsp add $-8, %r1; r1 = rsp - 8 store_q $123, -8(%rsp) ; *(u64*)r1 = 123 <- valid store_q $123, (%r1) ; *(u64*)r1 = 123 <- previously invalid mov $0, %r0 exit; Always need to exit And we'd get the following error: 0: (bf) r1 = r10 1: (07) r1 += -8 2: (7a) *(u64 *)(r10 -8) = 999 3: (7a) *(u64 *)(r1 +0) = 999 R1 invalid mem access 'fp' Unable to load program We already know that a register is a stack address and the appropriate offset, so we should be able to validate those references as well. Signed-off-by: Alex Gartrell --- kernel/bpf/verifier.c | 9 + 1 file changed, 9 insertions(+) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 039d866..5dfbece 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -676,6 +676,15 @@ static int check_mem_access(struct verifier_env *env, u32 regno, int off, err = check_stack_write(state, off, size, value_regno); else err = check_stack_read(state, off, size, value_regno); + } else if (state->regs[regno].type == PTR_TO_STACK) { + int real_off = state->regs[regno].imm + off; + + if (t == BPF_WRITE) + err = check_stack_write( + state, real_off, size, value_regno); + else + err = check_stack_read( + state, real_off, size, value_regno); } else { verbose("R%d invalid mem access '%s'\n", regno, reg_type_str[state->regs[regno].type]); -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Why return E2BIG from bpf map update?
On Tue, Jul 21, 2015 at 2:40 AM, Daniel Borkmann wrote: > On 07/21/2015 12:24 AM, Alexei Starovoitov wrote: >> >> On 7/20/15 3:15 PM, Alex Gartrell wrote: >>> >>> The ship has probably sailed on this one, but it seems like ENOSPC >>> makes more sense than E2BIG. Any chance of changing it so that poor >>> ebpf library maintainers in the future don't have to wonder how their >>> argument list got too big? >> >> >> sorry, too late. >> It's in tests and even document in bpf manpage: >> "E2BIG - indicates that the number of elements in the map reached the >> max_entries limit specified at map creation time." >> I read E2BIG as "too big" and not as "argument list is too long" :) > > > If some libraries do an strerror(3) on errno then it certainly sounds > a bit weird, "no space left on device" perhaps also a bit misleading. > The bpf(2) manpage was actually submitted/discussed longer time ago, > but I still didn't see it in Michael's tree yet, will ping him again. I was just being whiny :) The options really are ENOMEM -- really should mean allocation failed E2BIG -- really should mean argument list too big ENOSPC -- really should mean that a physical device is full I suppose there's also EDQUOT -- Disk Quota Exceeded And if you're really stretching EXFULL - Exchange Full but I've never seen either of the last two in my life. So clearly this is all very subjective and people who complain about it on mailing lists (me) are the worst. The only complaint I'd have though is E2BIG actually does mean that your bpf_attr argument is too big as well, so that part confused me for a couple of minutes. But, the EINVAL errno has similarly been abused to death so I don't think it's that big of a deal. /bikeshedding -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Why return E2BIG from bpf map update?
The ship has probably sailed on this one, but it seems like ENOSPC makes more sense than E2BIG. Any chance of changing it so that poor ebpf library maintainers in the future don't have to wonder how their argument list got too big? net-next/master:kernel/bpf/hashtab.c=static int htab_map_update_elem(struct bpf_map *map, void *key, void *value, -- net-next/master:kernel/bpf/hashtab.c- net-next/master:kernel/bpf/hashtab.c- if (!l_old && unlikely(htab->count >= map->max_entries)) { net-next/master:kernel/bpf/hashtab.c- /* if elem with this 'key' doesn't exist and we've reached net-next/master:kernel/bpf/hashtab.c-* max_entries limit, fail insertion of new elem net-next/master:kernel/bpf/hashtab.c-*/ net-next/master:kernel/bpf/hashtab.c: ret = -E2BIG; net-next/master:kernel/bpf/hashtab.c- goto err; net-next/master:kernel/bpf/hashtab.c- } net-next/master:kernel/bpf/hashtab.c- net-next/master:kernel/bpf/hashtab.c- if (l_old && map_flags == BPF_NOEXIST) { net-next/master:kernel/bpf/hashtab.c- /* elem already exists */ -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH,v2 net] net: sched: validate that class is found in qdisc_tree_decrease_qlen
We have an application that invokes tc to delete the root every time the config changes. As a result we stress the cleanup code and were seeing the following panic: crash> bt PID: 630839 TASK: 8823c990d280 CPU: 14 COMMAND: "tc" [... snip ...] #8 [8820ceec17a0] page_fault at 8160a8c2 [exception RIP: htb_qlen_notify+24] RIP: a0841718 RSP: 8820ceec1858 RFLAGS: 00010282 RAX: RBX: RCX: 88241747b400 RDX: 88241747b408 RSI: RDI: 8811fb27d000 RBP: 8820ceec1868 R8: 88120cdeff24 R9: 88120cdeff30 R10: 0bd4 R11: a0840919 R12: a0843340 R13: R14: 0001 R15: 8808dae5c2e8 ORIG_RAX: CS: 0010 SS: 0018 #9 [...] qdisc_tree_decrease_qlen at 81565375 #10 [...] fq_codel_dequeue at a084e0a0 [sch_fq_codel] #11 [...] fq_codel_reset at a084e2f8 [sch_fq_codel] #12 [...] qdisc_destroy at 81560d2d #13 [...] htb_destroy_class at a08408f8 [sch_htb] #14 [...] htb_put at a084095c [sch_htb] #15 [...] tc_ctl_tclass at 815645a3 #16 [...] rtnetlink_rcv_msg at 81552cb0 [... snip ...] To my understanding, the following situation is taking place. tc_ctl_tclass -> htb_delete -> class is deleted from clhash -> htb_put -> qdisc_destroy -> fq_codel_reset -> fq_codel_dequeue -> qdidsc_tree_decrease_qlen -> cl = htb_get # returns NULL, removed in htb_delete -> htb_qlen_notify(sch, NULL) # BOOM This patch checks cl for 0 before invoking qlen_notify and put. The notify is not necessary in this case, as the parent has already been deactivated anyway. Signed-off-by: Alex Gartrell --- net/sched/sch_api.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index f06aa01..46be5e5 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -762,8 +762,15 @@ void qdisc_tree_decrease_qlen(struct Qdisc *sch, unsigned int n) cops = sch->ops->cl_ops; if (cops->qlen_notify) { cl = cops->get(sch, parentid); - cops->qlen_notify(sch, cl); - cops->put(sch, cl); + /* This will be 0 in the event that cl is being +* destroyed, in which case we should not attempt +* to invoke qlen_notify upon it (it must be +* cleaned up otherwise anyway. +*/ + if (cl != 0) { + cops->qlen_notify(sch, cl); + cops->put(sch, cl); + } } sch->q.qlen -= n; __qdisc_qstats_drop(sch, drops); -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net] net: sched: validate that class is found in qdisc_tree_decrease_qlen
First, vehement apologies for sending an RFC patch against such an old kernel. This is (potentially) a bug fix for a crash that we've been seeing in 3.2 and 3.10, and, AFAICT, the underlying issue has not been addressed. We have an application that invokes tc to delete the root every time the config changes. As a result we stress the cleanup code and were seeing the following panic: crash> bt PID: 630839 TASK: 8823c990d280 CPU: 14 COMMAND: "tc" [... snip ...] #8 [8820ceec17a0] page_fault at 8160a8c2 [exception RIP: htb_qlen_notify+24] RIP: a0841718 RSP: 8820ceec1858 RFLAGS: 00010282 RAX: RBX: RCX: 88241747b400 RDX: 88241747b408 RSI: RDI: 8811fb27d000 RBP: 8820ceec1868 R8: 88120cdeff24 R9: 88120cdeff30 R10: 0bd4 R11: a0840919 R12: a0843340 R13: R14: 0001 R15: 8808dae5c2e8 ORIG_RAX: CS: 0010 SS: 0018 #9 [...] qdisc_tree_decrease_qlen at 81565375 #10 [...] fq_codel_dequeue at a084e0a0 [sch_fq_codel] #11 [...] fq_codel_reset at a084e2f8 [sch_fq_codel] #12 [...] qdisc_destroy at 81560d2d #13 [...] htb_destroy_class at a08408f8 [sch_htb] #14 [...] htb_put at a084095c [sch_htb] #15 [...] tc_ctl_tclass at 815645a3 #16 [...] rtnetlink_rcv_msg at 81552cb0 [... snip ...] To my understanding, the following situation is taking place. tc_ctl_tclass -> htb_delete -> class is deleted from clhash -> htb_put -> qdisc_destroy -> fq_codel_reset -> fq_codel_dequeue -> qdidsc_tree_decrease_qlen -> cl = htb_get # returns NULL, removed in htb_delete -> htb_qlen_notify(sch, NULL) # BOOM This patch checks cl for 0 before invoking qlen_notify and put. The notify is not necessary in this case, as the parent has already been deactivated anyway. Signed-off-by: Alex Gartrell --- net/sched/sch_api.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index f06aa01..46be5e5 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -762,8 +762,15 @@ void qdisc_tree_decrease_qlen(struct Qdisc *sch, unsigned int n) cops = sch->ops->cl_ops; if (cops->qlen_notify) { cl = cops->get(sch, parentid); - cops->qlen_notify(sch, cl); - cops->put(sch, cl); + /* This will be 0 in the event that cl is being +* destroyed, in which case we should not attempt +* to invoke qlen_notify upon it (it must be +* cleaned up otherwise anyway. +*/ + if (cl != 0) { + cops->qlen_notify(sch, cl); + cops->put(sch, cl); + } } sch->q.qlen -= n; __qdisc_qstats_drop(sch, drops); -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RGC PATCH v3.10] net: sched: validate that class is found in qdisc_tree_decrease_qlen
On Fri, Jul 17, 2015 at 5:10 PM, Cong Wang wrote: > Hmm, but in htb_delete() we do reset the leaf qdisc before removing the > class from ha > > if (!cl->level) { > qlen = cl->un.leaf.q->q.qlen; > qdisc_reset(cl->un.leaf.q); > qdisc_tree_decrease_qlen(cl->un.leaf.q, qlen); > } > > therefore, the leaf fq_codel qdisc is supposed to have 0 skb after that, > that is, the second fq_codel_reset() is supposed to return NULL immediately? > Why is qdidsc_tree_decrease_qlen() called in the second fq_codel_reset()? > > Or there is a race condition between ->delete() and ->put()? In which new > skb can be enqueued? This makes the most sense to me, but I have ~32 hours of experience with this subsystem :) Taking that for granted, it seems like it'd be appropriate to note the invariant in the code i've changed with a WARN_ON and to skip it, and then to otherwise find a way to close the hole. Do you agree? -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RGC PATCH v3.10] net: sched: validate that class is found in qdisc_tree_decrease_qlen
First, vehement apologies for sending an RFC patch against such an old kernel. This is (potentially) a bug fix for a crash that we've been seeing in 3.2 and 3.10, and, AFAICT, the underlying issue has not been addressed. We have an application that invokes tc to delete the root every time the config changes. As a result we stress the cleanup code and were seeing the following panic: crash> bt PID: 630839 TASK: 8823c990d280 CPU: 14 COMMAND: "tc" #0 [8820ceec1450] machine_kexec at 81036865 #1 [8820ceec14a0] crash_kexec at 810aa3e8 #2 [8820ceec1570] oops_end at 8160b458 #3 [8820ceec15a0] no_context at 816000e2 #4 [8820ceec1600] __bad_area_nosemaphore at 816002c7 #5 [8820ceec1650] bad_area at 81600515 #6 [8820ceec1680] __do_page_fault at 8160df9e #7 [8820ceec1790] do_page_fault at 8160e09e #8 [8820ceec17a0] page_fault at 8160a8c2 [exception RIP: htb_qlen_notify+24] RIP: a0841718 RSP: 8820ceec1858 RFLAGS: 00010282 RAX: RBX: RCX: 88241747b400 RDX: 88241747b408 RSI: RDI: 8811fb27d000 RBP: 8820ceec1868 R8: 88120cdeff24 R9: 88120cdeff30 R10: 0bd4 R11: a0840919 R12: a0843340 R13: R14: 0001 R15: 8808dae5c2e8 ORIG_RAX: CS: 0010 SS: 0018 #9 [8820ceec1870] qdisc_tree_decrease_qlen at 81565375 #10 [8820ceec18a0] fq_codel_dequeue at a084e0a0 [sch_fq_codel] #11 [8820ceec1940] fq_codel_reset at a084e2f8 [sch_fq_codel] #12 [8820ceec1960] qdisc_destroy at 81560d2d #13 [8820ceec1980] htb_destroy_class at a08408f8 [sch_htb] #14 [8820ceec19a0] htb_put at a084095c [sch_htb] #15 [8820ceec19b0] tc_ctl_tclass at 815645a3 #16 [8820ceec1a60] rtnetlink_rcv_msg at 81552cb0 #17 [8820ceec1ae0] netlink_rcv_skb at 8156cfd1 #18 [8820ceec1b10] rtnetlink_rcv at 8154f7f5 #19 [8820ceec1b30] netlink_unicast at 8156c929 #20 [8820ceec1b80] netlink_sendmsg at 8156cc69 #21 [8820ceec1c10] sock_sendmsg at 81528cf6 #22 [8820ceec1d60] ___sys_sendmsg at 815290bc #23 [8820ceec1f00] __sys_sendmsg at 8152c589 #24 [8820ceec1f70] sys_sendmsg at 8152c5e2 #25 [8820ceec1f80] system_call_fastpath at 816128c2 RIP: 7f31695448b0 RSP: 7fffb1509538 RFLAGS: 00010293 RAX: 002e RBX: 816128c2 RCX: 0030 RDX: RSI: 7fffb15054d0 RDI: 0005 RBP: 7fffb150a5f0 R8: 0007 R9: R10: 7fffb1505230 R11: 0246 R12: 8152c5e2 R13: 8820ceec1f78 R14: 00642760 R15: 7fffb15095b0 ORIG_RAX: 002e CS: 0033 SS: 002b To my understanding, the following situation is taking place. tc_ctl_tclass -> htb_delete -> class is deleted from clhash -> htb_put -> qdisc_destroy -> fq_codel_reset -> fq_codel_dequeue -> qdidsc_tree_decrease_qlen -> cl = htb_get # returns NULL, removed in htb_delete -> htb_qlen_notify(sch, NULL) # BOOM This patch checks cl for 0 before invoking qlen_notify and put. The notify is not necessary in this case, as the parent has already been deactivated anyway. Signed-off-by: Alex Gartrell --- net/sched/sch_api.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index 2d2f079..f5d48dd 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -700,8 +700,15 @@ void qdisc_tree_decrease_qlen(struct Qdisc *sch, unsigned int n) cops = sch->ops->cl_ops; if (cops->qlen_notify) { cl = cops->get(sch, parentid); - cops->qlen_notify(sch, cl); - cops->put(sch, cl); + /* This will be 0 in the event that cl is being +* destroyed, in which case we should not attempt +* to invoke qlen_notify upon it (it must be +* cleaned up otherwise anyway. +*/ + if (cl != 0) { + cops->qlen_notify(sch, cl); + cops->put(sch, cl); + } } sch->q.qlen -= n; } -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Panic with demuxed ipv4 multicast udp sockets on 4.0.4
Hey everyone, There's some kind of nasty condition in which sk_rx_dst points to an apparently garbage datastructure and it's blowing up in the early demux code because dst->ops is NULL. The packet in question was for bit torrent local peer discovery https://en.wikipedia.org/wiki/Local_Peer_Discovery . We're seeing this on about a 1/200 chance of panic per day. crash> bt PID: 1899532 TASK: 88000826cf00 CPU: 9 COMMAND: "hhvm.node.1" #0 [88047fc23990] machine_kexec at 8103bf05 #1 [88047fc239e0] crash_kexec at 810cb4e8 #2 [88047fc23ab0] oops_end at 81006468 #3 [88047fc23ae0] no_context at 8167aac1 #4 [88047fc23b40] __bad_area_nosemaphore at 8167acb9 #5 [88047fc23b90] bad_area_nosemaphore at 8167aceb #6 [88047fc23ba0] __do_page_fault at 81044ac5 #7 [88047fc23c10] do_page_fault at 81044eec #8 [88047fc23c20] page_fault at 81686c02 [exception RIP: udp_v4_early_demux+481] RIP: 816249a1 RSP: 88047fc23cd8 RFLAGS: 00010246 RAX: 880248ee4500 RBX: 093a RCX: 0002 RDX: RSI: RDI: 880248ee4500 RBP: 88047fc23d48 R8: R9: R10: 0001 R11: c9000199f3a0 R12: 88006f8a6300 R13: 81cbb1c0 R14: 0001 R15: 880474798f00 ORIG_RAX: CS: 0010 SS: #9 [88047fc23cd0] udp_v4_early_demux at 81624bb3 #10 [88047fc23d50] ip_rcv_finish at 815f3055 #11 [88047fc23d80] ip_rcv at 815f3952 #12 [88047fc23dc0] __netif_receive_skb_core at 815b96d4 #13 [88047fc23e30] __netif_receive_skb at 815b9911 #14 [88047fc23e50] process_backlog at 815b99f0 #15 [88047fc23ea0] net_rx_action at 815ba1e8 #16 [88047fc23f30] __do_softirq at 81054ce6 #17 [88047fc23f90] irq_exit at 81055075 #18 [88047fc23fa0] smp_call_function_single_interrupt at 810319f5 #19 [88047fc23fb0] call_function_single_interrupt at 8168637a --- --- #20 [8800792dff58] call_function_single_interrupt at 8168637a RIP: 006e7b4c RSP: 7f4c8ba38b80 RFLAGS: 0216 RAX: 006b RBX: 816851f2 RCX: 7f49f4de84d6 RDX: 7f49f4de84d8 RSI: 7f48dbcce731 RDI: RBP: 7f4c8ba38bd0 R8: 006b R9: R10: 7f48dbcce737 R11: 7f49f4de84e0 R12: 7f4adab85198 R13: 0014 R14: 7f4adaaa4c00 R15: ORIG_RAX: ff04 CS: 0033 SS: 002b crash> print *(struct *dst_entry)0x880248ee4500 A syntax error in expression, near `*dst_entry)0x880248ee4500'. gdb: gdb request failed: print *(struct *dst_entry)0x880248ee4500 crash> print *(struct dst_entry*)0x880248ee4500 $1 = { callback_head = { next = 0x880248ee4d00, func = 0x0 }, child = 0x13eacdfb7df67f6b, dev = 0x880113975d00, ops = 0x0, _metrics = 13729079323838086211, expires = 103079215104, path = 0x24c8d3baa, from = 0x0, xfrm = 0x6, input = 0x0, output = 0x0, flags = 5536, pending_confirm = 33114, error = -1, obsolete = -1, header_len = 0, trailer_len = 0, tclassid = 0, __pad_to_align_refcnt = {0, 704374636708}, __refcnt = { counter = 14 }, __use = 2097153, lastuse = 536576, { next = 0x0, rt_next = 0x0, rt6_next = 0x0, dn_next = 0x0 } } -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH,v2 net-next] ipvs: skb_orphan in case of forwarding
On Tue, Jul 7, 2015 at 6:33 PM, Simon Horman wrote: > Is there any possibility you could investigate which stable trees are > effected by this bug? Well this was certainly a problem in 3.10 (we had our own hacky solution at the time) and every kernel since then would have also had this problem. I'm pretty sure this isn't a problem in 3.4 or before. -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH,v2 net-next] ipvs: skb_orphan in case of forwarding
On Sun, Jul 5, 2015 at 8:50 PM, Simon Horman wrote: > Is it possible to get a 'Fixes:' tag? I suppose it'd be appropriate to say Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.") As that is what introduces tcp early_demux, but that's just a guess as I haven't bisected it (not even sure my test would run on that code base). -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH,v2 net-next] ipvs: skb_orphan in case of forwarding
On Sun, Jul 5, 2015 at 3:13 PM, Julian Anastasov wrote: > May be the patch fixes crashes? If yes, Simon > should apply it for ipvs/net tree, otherwise after > the merge window... Yeah this is definitely a crash-fix and it's existed since at least 3.10. -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH,v2 net-next] ipvs: skb_orphan in case of forwarding
It is possible that we bind against a local socket in early_demux when we are actually going to want to forward it. In this case, the socket serves no purpose and only serves to confuse things (particularly functions which implicitly expect sk_fullsock to be true, like ip_local_out). Additionally, skb_set_owner_w is totally broken for non full-socks. Signed-off-by: Alex Gartrell --- net/netfilter/ipvs/ip_vs_xmit.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c index bf66a86..99d4a41 100644 --- a/net/netfilter/ipvs/ip_vs_xmit.c +++ b/net/netfilter/ipvs/ip_vs_xmit.c @@ -527,6 +527,21 @@ static inline int ip_vs_tunnel_xmit_prepare(struct sk_buff *skb, return ret; } +/* In the event of a remote destination, it's possible that we would have + * matches against an old socket (particularly a TIME-WAIT socket). This + * causes havoc down the line (ip_local_out et. al. expect regular sockets + * and invalid memory accesses will happen) so simply drop the association + * in this case. +*/ +static inline void ip_vs_drop_early_demux_sk(struct sk_buff *skb) +{ + /* If dev is set, the packet came from the LOCAL_IN callback and +* not from a local TCP socket. +*/ + if (skb->dev) + skb_orphan(skb); +} + /* return NF_STOLEN (sent) or NF_ACCEPT if local=1 (not sent) */ static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb, struct ip_vs_conn *cp, int local) @@ -538,12 +553,21 @@ static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb, ip_vs_notrack(skb); else ip_vs_update_conntrack(skb, cp, 1); + + /* Remove the early_demux association unless it's bound for the +* exact same port and address on this host after translation. +*/ + if (!local || cp->vport != cp->dport || + !ip_vs_addr_equal(cp->af, &cp->vaddr, &cp->daddr)) + ip_vs_drop_early_demux_sk(skb); + if (!local) { skb_forward_csum(skb); NF_HOOK(pf, NF_INET_LOCAL_OUT, NULL, skb, NULL, skb_dst(skb)->dev, dst_output_sk); } else ret = NF_ACCEPT; + return ret; } @@ -557,6 +581,7 @@ static inline int ip_vs_send_or_cont(int pf, struct sk_buff *skb, if (likely(!(cp->flags & IP_VS_CONN_F_NFCT))) ip_vs_notrack(skb); if (!local) { + ip_vs_drop_early_demux_sk(skb); skb_forward_csum(skb); NF_HOOK(pf, NF_INET_LOCAL_OUT, NULL, skb, NULL, skb_dst(skb)->dev, dst_output_sk); @@ -845,6 +870,8 @@ ip_vs_prepare_tunneled_skb(struct sk_buff *skb, int skb_af, struct ipv6hdr *old_ipv6h = NULL; #endif + ip_vs_drop_early_demux_sk(skb); + if (skb_headroom(skb) < max_headroom || skb_cloned(skb)) { new_skb = skb_realloc_headroom(skb, max_headroom); if (!new_skb) -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next] ipvs: skb_orphan in case of forwarding
It is possible that we bind against a local socket in early_demux when we are actually going to want to forward it. In this case, the socket serves no purpose and only serves to confuse things (particularly functions which implicitly expect sk_fullsock to be true, like ip_local_out). Additionally, skb_set_owner_w is totally broken for non full-socks. Signed-off-by: Alex Gartrell --- net/netfilter/ipvs/ip_vs_xmit.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c index bf66a86..65526f4 100644 --- a/net/netfilter/ipvs/ip_vs_xmit.c +++ b/net/netfilter/ipvs/ip_vs_xmit.c @@ -527,6 +527,21 @@ static inline int ip_vs_tunnel_xmit_prepare(struct sk_buff *skb, return ret; } +/* In the event of a remote destination, it's possible that we would have + * matches against an old socket (particularly a TIME-WAIT socket). This + * causes havoc down the line (ip_local_out et. al. expect regular sockets + * and invalid memory accesses will happen) so simply drop the association + * in this case. +*/ +static inline void ip_vs_drop_early_demux_sk(struct sk_buff *skb) +{ + /* If dev is set, the packet came from the LOCAL_IN callback and +* not from a local TCP socket. +*/ + if (skb->dev) + skb_orphan(skb); +} + /* return NF_STOLEN (sent) or NF_ACCEPT if local=1 (not sent) */ static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb, struct ip_vs_conn *cp, int local) @@ -538,12 +553,21 @@ static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb, ip_vs_notrack(skb); else ip_vs_update_conntrack(skb, cp, 1); + + /* Remove the early_demux association unless it's bound for the +* exact same port and address on this host after translation. +*/ + if (!local || cp->vport != cp->dport || + !ip_vs_addr_equal(cp->af, &cp->vaddr, &cp->caddr)) + ip_vs_drop_early_demux_sk(skb); + if (!local) { skb_forward_csum(skb); NF_HOOK(pf, NF_INET_LOCAL_OUT, NULL, skb, NULL, skb_dst(skb)->dev, dst_output_sk); } else ret = NF_ACCEPT; + return ret; } @@ -557,6 +581,7 @@ static inline int ip_vs_send_or_cont(int pf, struct sk_buff *skb, if (likely(!(cp->flags & IP_VS_CONN_F_NFCT))) ip_vs_notrack(skb); if (!local) { + ip_vs_drop_early_demux_sk(skb); skb_forward_csum(skb); NF_HOOK(pf, NF_INET_LOCAL_OUT, NULL, skb, NULL, skb_dst(skb)->dev, dst_output_sk); @@ -845,6 +870,8 @@ ip_vs_prepare_tunneled_skb(struct sk_buff *skb, int skb_af, struct ipv6hdr *old_ipv6h = NULL; #endif + ip_vs_drop_early_demux_sk(skb); + if (skb_headroom(skb) < max_headroom || skb_cloned(skb)) { new_skb = skb_realloc_headroom(skb, max_headroom); if (!new_skb) -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk
Hey On Fri, Jul 3, 2015 at 1:32 AM, Julian Anastasov wrote: > To summarize: > - we should call skb_orphan as soon as possible after > deciding if packets goes to local or remote real server > but only for skb->sk set by early_demux, not for packets > sent by TCP Yeah, agree > - if packets go to local server IPVS should not touch > skb->dst, skb->sk, etc (NF_ACCEPT case) Yeah, the thing is that early demux could totally match for a socket that existed before we created the service, and in that instance it might make the most sense to retain the connection and simply NF_ACCEPT. The problem with that approach though is that is that the behavior changes if early_demux is not enabled. I believe that we should just do the consistent thing and always drop the early_demux result if bound for non-local, as you've said. The interesting thing though is that, for the purposes of routing, enabling early_demux does change the behavior. I suspect that's a bug, but it's far enough away from actual use cases that it's probably fine (who is out there tearing down addresses and setting up routes in their place?) > - for skb->sk set by early_demux, skb_orphan should happen before > skb_set_owner_w in ip_vs_prepare_tunneled_skb because > skb_set_owner_w will try to increase sk_wmem_alloc which is > wrong for early_demux phase Yeah that's my thinking as well. > - reaching skb_set_owner_w code for skb->sk set by eraly_demux > looks wrong to me, it can happen on: > - redirect (DNAT), if somehow we have socket too > - IPVS redirect: if we forward both to local and remote > real servers > - not likely for forward, nobody forwards traffic > destined to local IP to remote host What do you think of the following: commit f04c42f8041cc4ccc4cb2a30c1058136dd497a83 Author: Alex Gartrell Date: Wed Jul 1 13:24:46 2015 -0700 ipvs: orphan_skb in case of forwarding It is possible that we bind against a local socket in early_demux when we are actually going to want to forward it. In this case, the socket serves no purpose and only serves to confuse things (particularly functions which implicitly expect sk_fullsock to be true, like ip_local_out). Additionally, skb_set_owner_w is totally broken for non full-socks. Signed-off-by: Alex Gartrell diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c index bf66a86..3efe719 100644 --- a/net/netfilter/ipvs/ip_vs_xmit.c +++ b/net/netfilter/ipvs/ip_vs_xmit.c @@ -527,6 +527,19 @@ static inline int ip_vs_tunnel_xmit_prepare(struct sk_buff *skb, return ret; } +/* In the event of a remote destination, it's possible that we would have + * matches against an old socket (particularly a TIME-WAIT socket). This + * causes havoc down the line (ip_local_out et. al. expect regular sockets + * and invalid memory accesses will happen) so simply drop the association + * in this case +*/ +static inline void ip_vs_drop_early_demux_sk(struct sk_buff *skb) { + /* If dev is set, the packet came from the LOCAL_IN callback and +* not from a local TCP socket */ + if (skb->dev) + skb_orphan(skb); +} + /* return NF_STOLEN (sent) or NF_ACCEPT if local=1 (not sent) */ static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb, struct ip_vs_conn *cp, int local) @@ -539,6 +552,7 @@ static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb, else ip_vs_update_conntrack(skb, cp, 1); if (!local) { + ip_vs_drop_early_demux_sk(skb); skb_forward_csum(skb); NF_HOOK(pf, NF_INET_LOCAL_OUT, NULL, skb, NULL, skb_dst(skb)->dev, dst_output_sk); @@ -557,6 +571,7 @@ static inline int ip_vs_send_or_cont(int pf, struct sk_buff *skb, if (likely(!(cp->flags & IP_VS_CONN_F_NFCT))) ip_vs_notrack(skb); if (!local) { + ip_vs_drop_early_demux_sk(skb); skb_forward_csum(skb); NF_HOOK(pf, NF_INET_LOCAL_OUT, NULL, skb, NULL, skb_dst(skb)->dev, dst_output_sk); @@ -845,6 +860,8 @@ ip_vs_prepare_tunneled_skb(struct sk_buff *skb, int skb_af, struct ipv6hdr *old_ipv6h = NULL; #endif + ip_vs_drop_early_demux_sk(skb); + if (skb_headroom(skb) < max_headroom || skb_cloned(skb)) { new_skb = skb_realloc_headroom(skb, max_headroom); if (!new_skb) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk
On Thu, Jul 2, 2015 at 2:18 PM, Alex Gartrell wrote: > If early demux was enabled, we'd use the route from the socket Actually now that I think about it, this is probably broken, because we don't reply to the packet but instead silently drop it. -- Alex Gartrell http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk
On Thu, Jul 2, 2015 at 1:44 AM, Julian Anastasov wrote: > I think, your patch from January is almost > good: I'll rebase it, add your other suggestions, test it, and send it in. > And the patch from Eric for IPVS looks good too. Are we sure that we want to change the semantics of set_owner_w to orphan it? It works for us but that's not the behavior I'd expect from that function and might burn someone later? I've actually been looking through the code more for other uses of set_owner_w and I noticed this weird quirk: The test was simple: 0) Enable ip_forward 1) Add an address to loopback and listen on it 2) Accept a connection and close it (creating a TIME-WAIT socket) 3) Add a new route to a gre tunnel If early demux was enabled, we'd use the route from the socket If early demux was disabled, we'd forward using the gre tunnel Should we just replicate this behavior in ipvs? if (!skb->dev && skb->sk) return NF_ACCEPT; -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk
On Wed, Jul 1, 2015 at 4:26 PM, Eric Dumazet wrote: > I think you are mistaken Alex. Indeed, I was! Should be unsurpising. > > socket early demux cannot possibly set skb->destructor to sock_rfree() Yeah I will admit adding the code to sock_rfree reflexively out of paranoia. > If skb->destructor is set by early demux, it correctly points to sock_edemux() > > And this one correctly handles all socket variants. Yes, the problem appears to be in ip_vs_prepare_tunneled_skb (ip_vs_xmit.c:859 in 4.0) if (skb_headroom(skb) < max_headroom || skb_cloned(skb)) { new_skb = skb_realloc_headroom(skb, max_headroom); if (!new_skb) goto error; if (skb->sk) skb_set_owner_w(new_skb, skb->sk); consume_skb(skb); skb = new_skb; } skb_set_owner_w sets sock_wfree. I'll figure out how to ensure that we're using an appropriate destructor here. Appreciate the patience! -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk
If we early-demux bind a TCP_TIMEWAIT socket to an skb and then orphan it (as we need to do in the ipvs forwarding case), sock_wfree and sock_rfree are going to reach into the inet_timewait_sock and mess with fields that don't exist. Signed-off-by: Alex Gartrell --- net/core/sock.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/net/core/sock.c b/net/core/sock.c index 1e1fe9a..b37328f 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1620,6 +1620,9 @@ void sock_wfree(struct sk_buff *skb) struct sock *sk = skb->sk; unsigned int len = skb->truesize; + if (sk->sk_state == TCP_TIME_WAIT) + return; + if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) { /* * Keep a reference on sk_wmem_alloc, this will be released @@ -1665,6 +1668,9 @@ void sock_rfree(struct sk_buff *skb) struct sock *sk = skb->sk; unsigned int len = skb->truesize; + if (sk->sk_state == TCP_TIME_WAIT) + return; + atomic_sub(len, &sk->sk_rmem_alloc); sk_mem_uncharge(sk, len); } -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Intel-wired-lan] [PATCH 1/1] igb: Use ARRAY_SIZE instead fo sizeof(a)/sizeof(a[0])
On Tue, Jun 30, 2015 at 1:16 PM, Fujinaka, Todd wrote: > I still would say no if I'm allowed, because to guarantee that this change - > that I don't think fixes anything - works in all cases, we need to do an > incredible amount of regression testing. Every variant of every Intel part > that uses this driver (and there are many) should be tested and will end up > being used by the community. > Validation is really simple: diff old_module.ko new_module.ko And this is a good defensive measure, as it'll save you when someone screws up and changes your array to a pointer to an array (you'll get a build failure instead of 0). -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC] net/unix: SO_REUSEPORT for AF_UNIX
On Sun, Jun 28, 2015 at 2:52 AM, Conrad Hoffmann wrote: > Support the SO_REUSEPORT option for AF_UNIX (aka AF_LOCAL) sockets. Note > that unlike the IP implementations, the semantics for AF_UNIX sockets are > those of the original BSD implementation, i.e. each socket that > successfully reuses a port completely takes over from the previous > listener. This is a really weird corner case from a user's perspective. I think sharing it is more reasonable, as you can always signal the other process out of band. -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
We've released a generic netlink python library -- gnlpy
Hey everyone, tl;dr; pure python generic netlink library with simple clients for ipvs and taskstats here: https://github.com/facebook/gnlpy At Facebook we rely upon ipvs for most of our layer-4 load balancing needs. It's mostly worked pretty great for us. The standard way to interact with ipvs is ipvsadm, which will use netlink sockets to make rpc calls to the kernel (things like adding/remove services and real servers). Because we run many, many instances of ipvs, we ended up scripting this away with a separate program that shells out to ipvsadm. The down side to this approach was that we had to format the arguments and parse the result of the tool, which was kind of a pain. Additionally, we'd sometimes get into a bad state where the exec would fail or the binary wouldn't be there and the whole thing would kind of break. To my knowledge this never caused any kind of large scale incident, but it was an annoying thing to deal with. At some point, we made some changes to ipvs (heterogenous pools) and we were in the position of needing to roll a new ipvsadm binary to take advantage of the functionality. That was fundamentally unappealing to yours truly, so I hacked together gnlpy (Generic NetLink PYthon library) instead. It's been in production for several months. At some point thereafter, someone on the lvs-devel list mentioned wanting to interact with ipvs through python and I made a vague assurance that we'd open source this thing. Well here we are. If you take a look at the code, you'll quickly notice that we haven't gone through the trouble of implementing every RPC call for the families we support and we certainly haven't gone through the trouble of implementing every generic netlink family. We welcome your pull requests :) -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html