from:"\"Alex Gartrell\""

[PATCH bpf-next] libbpf: Fix unintentional success return code in bpf_object__load

2020-08-26 Thread Alex Gartrell

There are code paths where EINVAL is returned directly without setting
errno. In that case, errno could be 0, which would mask the
failure. For example, if a careless programmer set log_level to 1
out of laziness, they would have to spend a long time trying to figure
out why.

Fixes: 4f33ddb4e3e2 ("libbpf: Propagate EPERM to caller on program load")
Signed-off-by: Alex Gartrell 
---
 tools/lib/bpf/libbpf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2e2523d8bb6d..8f9e7d281225 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -6067,7 +6067,7 @@ load_program(struct bpf_program *prog, struct bpf_insn 
*insns, int insns_cnt,
free(log_buf);
goto retry_load;
}
-   ret = -errno;
+   ret = errno ? -errno : -LIBBPF_ERRNO__LOAD;
cp = libbpf_strerror_r(errno, errmsg, sizeof(errmsg));
pr_warn("load bpf program failed: %s\n", cp);
pr_perm_msg(ret);
-- 
2.26.0

Re: [PATCH] net: Provide lwtstate_free for non-lwt config

2015-08-19 Thread Alex Gartrell

On Wed, Aug 19, 2015 at 4:28 PM, Eric Dumazet  wrote:
> We've seen similar patches earlier today on netdev ;)

Ah, the mailing list equivalent of someone letting you know that you
left your fly down :)

Sorry for the noise!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] net: Provide lwtstate_free for non-lwt config

2015-08-19 Thread Alex Gartrell

On Wed, Aug 19, 2015 at 3:16 PM, Alex Gartrell  wrote:
> Fixes: df383e624 ("lwtunnel: fix memory leak")

Ah sorry, this is PATCH net-next, obviously :/
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] net: Provide lwtstate_free for non-lwt config

2015-08-19 Thread Alex Gartrell

Fixes: df383e624 ("lwtunnel: fix memory leak")

Signed-off-by: Alex Gartrell 
---
 include/net/lwtunnel.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index 34fd8f7..cfee539 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -93,6 +93,10 @@ int lwtunnel_input6(struct sk_buff *skb);
 
 #else
 
+static inline void lwtstate_free(struct lwtunnel_state *lws)
+{
+}
+
 static inline struct lwtunnel_state *
 lwtstate_get(struct lwtunnel_state *lws)
 {
-- 
Alex Gartrell 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Panic with demuxed ipv4 multicast udp sockets on 4.0.4

2015-08-12 Thread Alex Gartrell

Hey Gregory,

On Sun, Aug 2, 2015 at 2:28 PM, Gregory Hoggarth
 wrote:
> I will apply the new suggested patch, reverting previous patch, and test 
> overnight and update tomorrow.

Did this solve your problem?  If not, would you mind sharing a repro?

-- 
Alex Gartrell 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH net-next] ebpf: Allow dereferences of PTR_TO_STACK registers

2015-07-23 Thread Alex Gartrell

mov %rsp, %r1   ; r1 = rsp
add $-8, %r1; r1 = rsp - 8
store_q $123, -8(%rsp)  ; *(u64*)r1 = 123  <- valid
store_q $123, (%r1) ; *(u64*)r1 = 123  <- previously invalid
mov $0, %r0
exit; Always need to exit

And we'd get the following error:

0: (bf) r1 = r10
1: (07) r1 += -8
2: (7a) *(u64 *)(r10 -8) = 999
3: (7a) *(u64 *)(r1 +0) = 999
R1 invalid mem access 'fp'

Unable to load program

We already know that a register is a stack address and the appropriate
offset, so we should be able to validate those references as well.

Signed-off-by: Alex Gartrell 
---
 kernel/bpf/verifier.c   |  6 -
 samples/bpf/test_verifier.c | 59 +
 2 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 039d866..cd307df 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -648,6 +648,9 @@ static int check_mem_access(struct verifier_env *env, u32 
regno, int off,
struct verifier_state *state = &env->cur_state;
int size, err = 0;
 
+   if (state->regs[regno].type == PTR_TO_STACK)
+   off += state->regs[regno].imm;
+
size = bpf_size_to_bytes(bpf_size);
if (size < 0)
return size;
@@ -667,7 +670,8 @@ static int check_mem_access(struct verifier_env *env, u32 
regno, int off,
if (!err && t == BPF_READ && value_regno >= 0)
mark_reg_unknown_value(state->regs, value_regno);
 
-   } else if (state->regs[regno].type == FRAME_PTR) {
+   } else if (state->regs[regno].type == FRAME_PTR ||
+  state->regs[regno].type == PTR_TO_STACK) {
if (off >= 0 || off < -MAX_BPF_STACK) {
verbose("invalid stack off=%d size=%d\n", off, size);
return -EACCES;
diff --git a/samples/bpf/test_verifier.c b/samples/bpf/test_verifier.c
index 6936059..ee0f110 100644
--- a/samples/bpf/test_verifier.c
+++ b/samples/bpf/test_verifier.c
@@ -822,6 +822,65 @@ static struct bpf_test tests[] = {
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
},
+   {
+   "PTR_TO_STACK store/load",
+   .insns = {
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -10),
+   BPF_ST_MEM(BPF_DW, BPF_REG_1, 2, 0xfaceb00c),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 2),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   },
+   {
+   "PTR_TO_STACK store/load - bad alignment on off",
+   .insns = {
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -8),
+   BPF_ST_MEM(BPF_DW, BPF_REG_1, 2, 0xfaceb00c),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 2),
+   BPF_EXIT_INSN(),
+   },
+   .result = REJECT,
+   .errstr = "misaligned access off -6 size 8",
+   },
+   {
+   "PTR_TO_STACK store/load - bad alignment on reg",
+   .insns = {
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -10),
+   BPF_ST_MEM(BPF_DW, BPF_REG_1, 8, 0xfaceb00c),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 8),
+   BPF_EXIT_INSN(),
+   },
+   .result = REJECT,
+   .errstr = "misaligned access off -2 size 8",
+   },
+   {
+   "PTR_TO_STACK store/load - out of bounds low",
+   .insns = {
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -8),
+   BPF_ST_MEM(BPF_DW, BPF_REG_1, 8, 0xfaceb00c),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 8),
+   BPF_EXIT_INSN(),
+   },
+   .result = REJECT,
+   .errstr = "invalid stack off=-79992 size=8",
+   },
+   {
+   "PTR_TO_STACK store/load - out of bounds high",
+   .insns = {
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -8),
+   BPF_ST_MEM(BPF_DW, BPF_REG_1, 8, 0xfaceb00c),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 8),
+   BPF_EXIT_INSN(),
+   },
+       .result = REJECT,
+

Re: [RFC PATCH net-next] ebpf: Allow dereferences of PTR_TO_STACK registers

2015-07-22 Thread Alex Gartrell

On Tue, Jul 21, 2015 at 8:00 PM, Alexei Starovoitov  wrote:
> On Tue, Jul 21, 2015 at 07:00:40PM -0700, Alex Gartrell wrote:
>> mov %rsp, %r1   ; r1 = rsp
>> add $-8, %r1; r1 = rsp - 8
>> store_q $123, -8(%rsp)  ; *(u64*)r1 = 123  <- valid
>> store_q $123, (%r1) ; *(u64*)r1 = 123  <- previously invalid
>> mov $0, %r0
>> exit; Always need to exit
>
> Is this your new eBPF assembler syntax? :)
> imo gnu style looks ugly... ;)

If you think this is ugly, you'll love the "instruction" I added to be
compatible with the map fd -> immediate conversion hack :)

> It's great to see such in-depth understanding of verifier!!
>
>> And we'd get the following error:
>>
>>   0: (bf) r1 = r10
>>   1: (07) r1 += -8
>>   2: (7a) *(u64 *)(r10 -8) = 999
>>   3: (7a) *(u64 *)(r1 +0) = 999
>>   R1 invalid mem access 'fp'
>>
>>   Unable to load program
>>
>> We already know that a register is a stack address and the appropriate
>> offset, so we should be able to validate those references as well.
>
> yes, we can teach verifier to do that.
> Though llvm doesn't generate such code. It's small enough change.

I happened upon this as I was playing around with the bytecode in our
4.0 kernels.  I believe that we can write general purpose utilities
without needing to write C code for each use case that do things like
filtering/counting packets or syscalls and outputting that data into
maps at low cost, but I'm still just prototyping so I'm not ready to
be an assertive jerk about it (yet)

> real_off is missing alignment and bounds checks.
> something like:
> if (state->regs[regno].type == PTR_TO_STACK)
> off += state->regs[regno].imm;
> if (off % size != 0)
> ...

Yeah, I'm an idiot and assumed that a bounds check happened in the
check_stack_read function.  I'll find a way to do this without
copy-pasta'ing but I'm going to stick to my moral high ground and not
mutate a parameter (this lead to a bug in a job interview 6 years ago
and I've never forgiven myself because the interviewer was an OpenBSD
guy)

> else if (state->regs[regno].type == FRAME_PTR || == PTR_TO_STACK)
> .. as-is here ...
>
> would fix it.
>
> please add few accept and reject tests for this to test_verifier.c as well.

psh, tests...

I'll update this stuff and submit a patch.

-- 
Alex Gartrell 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH net-next] ebpf: Allow dereferences of PTR_TO_STACK registers

2015-07-21 Thread Alex Gartrell

mov %rsp, %r1   ; r1 = rsp
add $-8, %r1; r1 = rsp - 8
store_q $123, -8(%rsp)  ; *(u64*)r1 = 123  <- valid
store_q $123, (%r1) ; *(u64*)r1 = 123  <- previously invalid
mov $0, %r0
exit; Always need to exit

And we'd get the following error:

0: (bf) r1 = r10
1: (07) r1 += -8
2: (7a) *(u64 *)(r10 -8) = 999
3: (7a) *(u64 *)(r1 +0) = 999
R1 invalid mem access 'fp'

Unable to load program

We already know that a register is a stack address and the appropriate
offset, so we should be able to validate those references as well.

Signed-off-by: Alex Gartrell 
---
 kernel/bpf/verifier.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 039d866..5dfbece 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -676,6 +676,15 @@ static int check_mem_access(struct verifier_env *env, u32 
regno, int off,
err = check_stack_write(state, off, size, value_regno);
else
err = check_stack_read(state, off, size, value_regno);
+   } else if (state->regs[regno].type == PTR_TO_STACK) {
+   int real_off = state->regs[regno].imm + off;
+
+   if (t == BPF_WRITE)
+   err = check_stack_write(
+   state, real_off, size, value_regno);
+   else
+   err = check_stack_read(
+   state, real_off, size, value_regno);
} else {
verbose("R%d invalid mem access '%s'\n",
regno, reg_type_str[state->regs[regno].type]);
-- 
Alex Gartrell 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Why return E2BIG from bpf map update?

2015-07-21 Thread Alex Gartrell

On Tue, Jul 21, 2015 at 2:40 AM, Daniel Borkmann  wrote:
> On 07/21/2015 12:24 AM, Alexei Starovoitov wrote:
>>
>> On 7/20/15 3:15 PM, Alex Gartrell wrote:
>>>
>>> The ship has probably sailed on this one, but it seems like ENOSPC
>>> makes more sense than E2BIG.  Any chance of changing it so that poor
>>> ebpf library maintainers in the future don't have to wonder how their
>>> argument list got too big?
>>
>>
>> sorry, too late.
>> It's in tests and even document in bpf manpage:
>> "E2BIG - indicates that the number of elements in the map reached the
>> max_entries limit specified at map creation time."
>> I read E2BIG as "too big" and not as "argument list is too long" :)
>
>
> If some libraries do an strerror(3) on errno then it certainly sounds
> a bit weird, "no space left on device" perhaps also a bit misleading.
> The bpf(2) manpage was actually submitted/discussed longer time ago,
> but I still didn't see it in Michael's tree yet, will ping him again.

I was just being whiny :)

The options really are
ENOMEM -- really should mean allocation failed
E2BIG -- really should mean argument list too big
ENOSPC -- really should mean that a physical device is full

I suppose there's also
EDQUOT -- Disk Quota Exceeded

And if you're really stretching
EXFULL - Exchange Full

but I've never seen either of the last two in my life.

So clearly this is all very subjective and people who complain about
it on mailing lists (me) are the worst.

The only complaint I'd have though is E2BIG actually does mean that
your bpf_attr argument is too big as well, so that part confused me
for a couple of minutes.  But, the EINVAL errno has similarly been
abused to death so I don't think it's that big of a deal.

/bikeshedding

-- 
Alex Gartrell 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Why return E2BIG from bpf map update?

2015-07-20 Thread Alex Gartrell

The ship has probably sailed on this one, but it seems like ENOSPC
makes more sense than E2BIG.  Any chance of changing it so that poor
ebpf library maintainers in the future don't have to wonder how their
argument list got too big?

net-next/master:kernel/bpf/hashtab.c=static int
htab_map_update_elem(struct bpf_map *map, void *key, void *value,
--
net-next/master:kernel/bpf/hashtab.c-
net-next/master:kernel/bpf/hashtab.c-   if (!l_old &&
unlikely(htab->count >= map->max_entries)) {
net-next/master:kernel/bpf/hashtab.c-   /* if elem with this
'key' doesn't exist and we've reached
net-next/master:kernel/bpf/hashtab.c-* max_entries limit,
fail insertion of new elem
net-next/master:kernel/bpf/hashtab.c-*/
net-next/master:kernel/bpf/hashtab.c:   ret = -E2BIG;
net-next/master:kernel/bpf/hashtab.c-   goto err;
net-next/master:kernel/bpf/hashtab.c-   }
net-next/master:kernel/bpf/hashtab.c-
net-next/master:kernel/bpf/hashtab.c-   if (l_old && map_flags == BPF_NOEXIST) {
net-next/master:kernel/bpf/hashtab.c-   /* elem already exists */

-- 
Alex Gartrell 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH,v2 net] net: sched: validate that class is found in qdisc_tree_decrease_qlen

2015-07-20 Thread Alex Gartrell

We have an application that invokes tc to delete the root every time the
config changes. As a result we stress the cleanup code and were seeing the
following panic:

  crash> bt
  PID: 630839  TASK: 8823c990d280  CPU: 14  COMMAND: "tc"
   [... snip ...]
   #8 [8820ceec17a0] page_fault at 8160a8c2
  [exception RIP: htb_qlen_notify+24]
  RIP: a0841718  RSP: 8820ceec1858  RFLAGS: 00010282
  RAX:   RBX:   RCX: 88241747b400
  RDX: 88241747b408  RSI:   RDI: 8811fb27d000
  RBP: 8820ceec1868   R8: 88120cdeff24   R9: 88120cdeff30
  R10: 0bd4  R11: a0840919  R12: a0843340
  R13:   R14: 0001  R15: 8808dae5c2e8
  ORIG_RAX:   CS: 0010  SS: 0018
   #9 [...] qdisc_tree_decrease_qlen at 81565375
  #10 [...] fq_codel_dequeue at a084e0a0 [sch_fq_codel]
  #11 [...] fq_codel_reset at a084e2f8 [sch_fq_codel]
  #12 [...] qdisc_destroy at 81560d2d
  #13 [...] htb_destroy_class at a08408f8 [sch_htb]
  #14 [...] htb_put at a084095c [sch_htb]
  #15 [...] tc_ctl_tclass at 815645a3
  #16 [...] rtnetlink_rcv_msg at 81552cb0
  [... snip ...]

To my understanding, the following situation is taking place.

  tc_ctl_tclass
   -> htb_delete
 -> class is deleted from clhash
   -> htb_put
 -> qdisc_destroy
   -> fq_codel_reset
 -> fq_codel_dequeue
   -> qdidsc_tree_decrease_qlen
 -> cl = htb_get # returns NULL, removed in htb_delete
   -> htb_qlen_notify(sch, NULL) # BOOM

This patch checks cl for 0 before invoking qlen_notify and put.  The notify
is not necessary in this case, as the parent has already been deactivated
anyway.

Signed-off-by: Alex Gartrell 
---
 net/sched/sch_api.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f06aa01..46be5e5 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -762,8 +762,15 @@ void qdisc_tree_decrease_qlen(struct Qdisc *sch, unsigned 
int n)
cops = sch->ops->cl_ops;
if (cops->qlen_notify) {
cl = cops->get(sch, parentid);
-   cops->qlen_notify(sch, cl);
-   cops->put(sch, cl);
+   /* This will be 0 in the event that cl is being
+* destroyed, in which case we should not attempt
+* to invoke qlen_notify upon it (it must be
+* cleaned up otherwise anyway.
+*/
+   if (cl != 0) {
+   cops->qlen_notify(sch, cl);
+   cops->put(sch, cl);
+   }
}
sch->q.qlen -= n;
__qdisc_qstats_drop(sch, drops);
-- 
Alex Gartrell 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH net] net: sched: validate that class is found in qdisc_tree_decrease_qlen

2015-07-20 Thread Alex Gartrell

First, vehement apologies for sending an RFC patch against such an old
kernel.  This is (potentially) a bug fix for a crash that we've been seeing
in 3.2 and 3.10, and, AFAICT, the underlying issue has not been addressed.

We have an application that invokes tc to delete the root every time the
config changes. As a result we stress the cleanup code and were seeing the
following panic:

  crash> bt
  PID: 630839  TASK: 8823c990d280  CPU: 14  COMMAND: "tc"
   [... snip ...]
   #8 [8820ceec17a0] page_fault at 8160a8c2
  [exception RIP: htb_qlen_notify+24]
  RIP: a0841718  RSP: 8820ceec1858  RFLAGS: 00010282
  RAX:   RBX:   RCX: 88241747b400
  RDX: 88241747b408  RSI:   RDI: 8811fb27d000
  RBP: 8820ceec1868   R8: 88120cdeff24   R9: 88120cdeff30
  R10: 0bd4  R11: a0840919  R12: a0843340
  R13:   R14: 0001  R15: 8808dae5c2e8
  ORIG_RAX:   CS: 0010  SS: 0018
   #9 [...] qdisc_tree_decrease_qlen at 81565375
  #10 [...] fq_codel_dequeue at a084e0a0 [sch_fq_codel]
  #11 [...] fq_codel_reset at a084e2f8 [sch_fq_codel]
  #12 [...] qdisc_destroy at 81560d2d
  #13 [...] htb_destroy_class at a08408f8 [sch_htb]
  #14 [...] htb_put at a084095c [sch_htb]
  #15 [...] tc_ctl_tclass at 815645a3
  #16 [...] rtnetlink_rcv_msg at 81552cb0
  [... snip ...]

To my understanding, the following situation is taking place.

  tc_ctl_tclass
   -> htb_delete
 -> class is deleted from clhash
   -> htb_put
 -> qdisc_destroy
   -> fq_codel_reset
 -> fq_codel_dequeue
   -> qdidsc_tree_decrease_qlen
 -> cl = htb_get # returns NULL, removed in htb_delete
   -> htb_qlen_notify(sch, NULL) # BOOM

This patch checks cl for 0 before invoking qlen_notify and put.  The notify
is not necessary in this case, as the parent has already been deactivated
anyway.

Signed-off-by: Alex Gartrell 
---
 net/sched/sch_api.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f06aa01..46be5e5 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -762,8 +762,15 @@ void qdisc_tree_decrease_qlen(struct Qdisc *sch, unsigned 
int n)
cops = sch->ops->cl_ops;
if (cops->qlen_notify) {
cl = cops->get(sch, parentid);
-   cops->qlen_notify(sch, cl);
-   cops->put(sch, cl);
+   /* This will be 0 in the event that cl is being
+* destroyed, in which case we should not attempt
+* to invoke qlen_notify upon it (it must be
+* cleaned up otherwise anyway.
+*/
+   if (cl != 0) {
+   cops->qlen_notify(sch, cl);
+   cops->put(sch, cl);
+   }
}
sch->q.qlen -= n;
__qdisc_qstats_drop(sch, drops);
-- 
Alex Gartrell 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RGC PATCH v3.10] net: sched: validate that class is found in qdisc_tree_decrease_qlen

2015-07-17 Thread Alex Gartrell

On Fri, Jul 17, 2015 at 5:10 PM, Cong Wang  wrote:
> Hmm, but in htb_delete() we do reset the leaf qdisc before removing the
> class from ha
>
> if (!cl->level) {
> qlen = cl->un.leaf.q->q.qlen;
> qdisc_reset(cl->un.leaf.q);
> qdisc_tree_decrease_qlen(cl->un.leaf.q, qlen);
> }
>
> therefore, the leaf fq_codel qdisc is supposed to have 0 skb after that,
> that is, the second fq_codel_reset() is supposed to return NULL immediately?
> Why is qdidsc_tree_decrease_qlen() called in the second fq_codel_reset()?
>
> Or there is a race condition between ->delete() and ->put()? In which new
> skb can be enqueued?

This makes the most sense to me, but I have ~32 hours of experience
with this subsystem :)

Taking that for granted, it seems like it'd be appropriate to note the
invariant in the code i've changed with a WARN_ON and to skip it, and
then to otherwise find a way to close the hole.  Do you agree?

-- 
Alex Gartrell 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RGC PATCH v3.10] net: sched: validate that class is found in qdisc_tree_decrease_qlen

2015-07-17 Thread Alex Gartrell

First, vehement apologies for sending an RFC patch against such an old
kernel.  This is (potentially) a bug fix for a crash that we've been seeing
in 3.2 and 3.10, and, AFAICT, the underlying issue has not been addressed.

We have an application that invokes tc to delete the root every time the
config changes. As a result we stress the cleanup code and were seeing the
following panic:

  crash> bt
  PID: 630839  TASK: 8823c990d280  CPU: 14  COMMAND: "tc"
   #0 [8820ceec1450] machine_kexec at 81036865
   #1 [8820ceec14a0] crash_kexec at 810aa3e8
   #2 [8820ceec1570] oops_end at 8160b458
   #3 [8820ceec15a0] no_context at 816000e2
   #4 [8820ceec1600] __bad_area_nosemaphore at 816002c7
   #5 [8820ceec1650] bad_area at 81600515
   #6 [8820ceec1680] __do_page_fault at 8160df9e
   #7 [8820ceec1790] do_page_fault at 8160e09e
   #8 [8820ceec17a0] page_fault at 8160a8c2
  [exception RIP: htb_qlen_notify+24]
  RIP: a0841718  RSP: 8820ceec1858  RFLAGS: 00010282
  RAX:   RBX:   RCX: 88241747b400
  RDX: 88241747b408  RSI:   RDI: 8811fb27d000
  RBP: 8820ceec1868   R8: 88120cdeff24   R9: 88120cdeff30
  R10: 0bd4  R11: a0840919  R12: a0843340
  R13:   R14: 0001  R15: 8808dae5c2e8
  ORIG_RAX:   CS: 0010  SS: 0018
   #9 [8820ceec1870] qdisc_tree_decrease_qlen at 81565375
  #10 [8820ceec18a0] fq_codel_dequeue at a084e0a0 [sch_fq_codel]
  #11 [8820ceec1940] fq_codel_reset at a084e2f8 [sch_fq_codel]
  #12 [8820ceec1960] qdisc_destroy at 81560d2d
  #13 [8820ceec1980] htb_destroy_class at a08408f8 [sch_htb]
  #14 [8820ceec19a0] htb_put at a084095c [sch_htb]
  #15 [8820ceec19b0] tc_ctl_tclass at 815645a3
  #16 [8820ceec1a60] rtnetlink_rcv_msg at 81552cb0
  #17 [8820ceec1ae0] netlink_rcv_skb at 8156cfd1
  #18 [8820ceec1b10] rtnetlink_rcv at 8154f7f5
  #19 [8820ceec1b30] netlink_unicast at 8156c929
  #20 [8820ceec1b80] netlink_sendmsg at 8156cc69
  #21 [8820ceec1c10] sock_sendmsg at 81528cf6
  #22 [8820ceec1d60] ___sys_sendmsg at 815290bc
  #23 [8820ceec1f00] __sys_sendmsg at 8152c589
  #24 [8820ceec1f70] sys_sendmsg at 8152c5e2
  #25 [8820ceec1f80] system_call_fastpath at 816128c2
  RIP: 7f31695448b0  RSP: 7fffb1509538  RFLAGS: 00010293
  RAX: 002e  RBX: 816128c2  RCX: 0030
  RDX:   RSI: 7fffb15054d0  RDI: 0005
  RBP: 7fffb150a5f0   R8: 0007   R9: 
  R10: 7fffb1505230  R11: 0246  R12: 8152c5e2
  R13: 8820ceec1f78  R14: 00642760  R15: 7fffb15095b0
  ORIG_RAX: 002e  CS: 0033  SS: 002b

To my understanding, the following situation is taking place.

  tc_ctl_tclass
   -> htb_delete
 -> class is deleted from clhash
   -> htb_put
 -> qdisc_destroy
   -> fq_codel_reset
 -> fq_codel_dequeue
   -> qdidsc_tree_decrease_qlen
 -> cl = htb_get # returns NULL, removed in htb_delete
   -> htb_qlen_notify(sch, NULL) # BOOM

This patch checks cl for 0 before invoking qlen_notify and put.  The notify
is not necessary in this case, as the parent has already been deactivated
anyway.

Signed-off-by: Alex Gartrell 
---
 net/sched/sch_api.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 2d2f079..f5d48dd 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -700,8 +700,15 @@ void qdisc_tree_decrease_qlen(struct Qdisc *sch, unsigned 
int n)
cops = sch->ops->cl_ops;
if (cops->qlen_notify) {
cl = cops->get(sch, parentid);
-   cops->qlen_notify(sch, cl);
-   cops->put(sch, cl);
+   /* This will be 0 in the event that cl is being
+* destroyed, in which case we should not attempt
+* to invoke qlen_notify upon it (it must be
+* cleaned up otherwise anyway.
+*/
+   if (cl != 0) {
+   cops->qlen_notify(sch, cl);
+   cops->put(sch, cl);
+   }
    }
sch->q.qlen -= n;
}
-- 
Alex Gartrell 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Panic with demuxed ipv4 multicast udp sockets on 4.0.4

2015-07-10 Thread Alex Gartrell

Hey everyone,

There's some kind of nasty condition in which sk_rx_dst points to an
apparently garbage datastructure and it's blowing up in the early
demux code because dst->ops is NULL.  The packet in question was for
bit torrent local peer discovery
https://en.wikipedia.org/wiki/Local_Peer_Discovery .  We're seeing
this on about a 1/200 chance of panic per day.

crash> bt
PID: 1899532  TASK: 88000826cf00  CPU: 9   COMMAND: "hhvm.node.1"
 #0 [88047fc23990] machine_kexec at 8103bf05
 #1 [88047fc239e0] crash_kexec at 810cb4e8
 #2 [88047fc23ab0] oops_end at 81006468
 #3 [88047fc23ae0] no_context at 8167aac1
 #4 [88047fc23b40] __bad_area_nosemaphore at 8167acb9
 #5 [88047fc23b90] bad_area_nosemaphore at 8167aceb
 #6 [88047fc23ba0] __do_page_fault at 81044ac5
 #7 [88047fc23c10] do_page_fault at 81044eec
 #8 [88047fc23c20] page_fault at 81686c02
[exception RIP: udp_v4_early_demux+481]
RIP: 816249a1  RSP: 88047fc23cd8  RFLAGS: 00010246
RAX: 880248ee4500  RBX: 093a  RCX: 0002
RDX:   RSI:   RDI: 880248ee4500
RBP: 88047fc23d48   R8:    R9: 
R10: 0001  R11: c9000199f3a0  R12: 88006f8a6300
R13: 81cbb1c0  R14: 0001  R15: 880474798f00
ORIG_RAX:   CS: 0010  SS: 
 #9 [88047fc23cd0] udp_v4_early_demux at 81624bb3
#10 [88047fc23d50] ip_rcv_finish at 815f3055
#11 [88047fc23d80] ip_rcv at 815f3952
#12 [88047fc23dc0] __netif_receive_skb_core at 815b96d4
#13 [88047fc23e30] __netif_receive_skb at 815b9911
#14 [88047fc23e50] process_backlog at 815b99f0
#15 [88047fc23ea0] net_rx_action at 815ba1e8
#16 [88047fc23f30] __do_softirq at 81054ce6
#17 [88047fc23f90] irq_exit at 81055075
#18 [88047fc23fa0] smp_call_function_single_interrupt at 810319f5
#19 [88047fc23fb0] call_function_single_interrupt at 8168637a
---  ---
#20 [8800792dff58] call_function_single_interrupt at 8168637a
RIP: 006e7b4c  RSP: 7f4c8ba38b80  RFLAGS: 0216
RAX: 006b  RBX: 816851f2  RCX: 7f49f4de84d6
RDX: 7f49f4de84d8  RSI: 7f48dbcce731  RDI: 
RBP: 7f4c8ba38bd0   R8: 006b   R9: 
R10: 7f48dbcce737  R11: 7f49f4de84e0  R12: 7f4adab85198
R13: 0014  R14: 7f4adaaa4c00  R15: 
ORIG_RAX: ff04  CS: 0033  SS: 002b
crash> print *(struct *dst_entry)0x880248ee4500
A syntax error in expression, near `*dst_entry)0x880248ee4500'.
gdb: gdb request failed: print *(struct *dst_entry)0x880248ee4500
crash> print *(struct dst_entry*)0x880248ee4500
$1 = {
  callback_head = {
next = 0x880248ee4d00,
func = 0x0
  },
  child = 0x13eacdfb7df67f6b,
  dev = 0x880113975d00,
  ops = 0x0,
  _metrics = 13729079323838086211,
  expires = 103079215104,
  path = 0x24c8d3baa,
  from = 0x0,
  xfrm = 0x6,
  input = 0x0,
  output = 0x0,
  flags = 5536,
  pending_confirm = 33114,
  error = -1,
  obsolete = -1,
  header_len = 0,
  trailer_len = 0,
  tclassid = 0,
  __pad_to_align_refcnt = {0, 704374636708},
  __refcnt = {
counter = 14
  },
  __use = 2097153,
  lastuse = 536576,
  {
next = 0x0,
rt_next = 0x0,
rt6_next = 0x0,
dn_next = 0x0
  }
}

-- 
Alex Gartrell 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH,v2 net-next] ipvs: skb_orphan in case of forwarding

2015-07-08 Thread Alex Gartrell

On Tue, Jul 7, 2015 at 6:33 PM, Simon Horman  wrote:

> Is there any possibility you could investigate which stable trees are
> effected by this bug?

Well this was certainly a problem in 3.10 (we had our own hacky
solution at the time) and every kernel since then would have also had
this problem.  I'm pretty sure this isn't a problem in 3.4 or before.

-- 
Alex Gartrell 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH,v2 net-next] ipvs: skb_orphan in case of forwarding

2015-07-05 Thread Alex Gartrell

On Sun, Jul 5, 2015 at 8:50 PM, Simon Horman  wrote:
> Is it possible to get a 'Fixes:' tag?

I suppose it'd be appropriate to say

Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.")

As that is what introduces tcp early_demux, but that's just a guess as
I haven't bisected it (not even sure my test would run on that code
base).

-- 
Alex Gartrell 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH,v2 net-next] ipvs: skb_orphan in case of forwarding

2015-07-05 Thread Alex Gartrell

On Sun, Jul 5, 2015 at 3:13 PM, Julian Anastasov  wrote:
> May be the patch fixes crashes? If yes, Simon
> should apply it for ipvs/net tree, otherwise after
> the merge window...

Yeah this is definitely a crash-fix and it's existed since at least 3.10.

-- 
Alex Gartrell 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH,v2 net-next] ipvs: skb_orphan in case of forwarding

2015-07-05 Thread Alex Gartrell

It is possible that we bind against a local socket in early_demux when we
are actually going to want to forward it.  In this case, the socket serves
no purpose and only serves to confuse things (particularly functions which
implicitly expect sk_fullsock to be true, like ip_local_out).
Additionally, skb_set_owner_w is totally broken for non full-socks.

Signed-off-by: Alex Gartrell 
---
 net/netfilter/ipvs/ip_vs_xmit.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index bf66a86..99d4a41 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -527,6 +527,21 @@ static inline int ip_vs_tunnel_xmit_prepare(struct sk_buff 
*skb,
return ret;
 }
 
+/* In the event of a remote destination, it's possible that we would have
+ * matches against an old socket (particularly a TIME-WAIT socket). This
+ * causes havoc down the line (ip_local_out et. al. expect regular sockets
+ * and invalid memory accesses will happen) so simply drop the association
+ * in this case.
+*/
+static inline void ip_vs_drop_early_demux_sk(struct sk_buff *skb)
+{
+   /* If dev is set, the packet came from the LOCAL_IN callback and
+* not from a local TCP socket.
+*/
+   if (skb->dev)
+   skb_orphan(skb);
+}
+
 /* return NF_STOLEN (sent) or NF_ACCEPT if local=1 (not sent) */
 static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb,
 struct ip_vs_conn *cp, int local)
@@ -538,12 +553,21 @@ static inline int ip_vs_nat_send_or_cont(int pf, struct 
sk_buff *skb,
ip_vs_notrack(skb);
else
ip_vs_update_conntrack(skb, cp, 1);
+
+   /* Remove the early_demux association unless it's bound for the
+* exact same port and address on this host after translation.
+*/
+   if (!local || cp->vport != cp->dport ||
+   !ip_vs_addr_equal(cp->af, &cp->vaddr, &cp->daddr))
+   ip_vs_drop_early_demux_sk(skb);
+
if (!local) {
skb_forward_csum(skb);
NF_HOOK(pf, NF_INET_LOCAL_OUT, NULL, skb,
NULL, skb_dst(skb)->dev, dst_output_sk);
} else
ret = NF_ACCEPT;
+
return ret;
 }
 
@@ -557,6 +581,7 @@ static inline int ip_vs_send_or_cont(int pf, struct sk_buff 
*skb,
if (likely(!(cp->flags & IP_VS_CONN_F_NFCT)))
ip_vs_notrack(skb);
if (!local) {
+   ip_vs_drop_early_demux_sk(skb);
skb_forward_csum(skb);
NF_HOOK(pf, NF_INET_LOCAL_OUT, NULL, skb,
NULL, skb_dst(skb)->dev, dst_output_sk);
@@ -845,6 +870,8 @@ ip_vs_prepare_tunneled_skb(struct sk_buff *skb, int skb_af,
struct ipv6hdr *old_ipv6h = NULL;
 #endif
 
+   ip_vs_drop_early_demux_sk(skb);
+
if (skb_headroom(skb) < max_headroom || skb_cloned(skb)) {
new_skb = skb_realloc_headroom(skb, max_headroom);
if (!new_skb)
-- 
Alex Gartrell 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH net-next] ipvs: skb_orphan in case of forwarding

2015-07-05 Thread Alex Gartrell

It is possible that we bind against a local socket in early_demux when we
are actually going to want to forward it.  In this case, the socket serves
no purpose and only serves to confuse things (particularly functions which
implicitly expect sk_fullsock to be true, like ip_local_out).
Additionally, skb_set_owner_w is totally broken for non full-socks.

Signed-off-by: Alex Gartrell 
---
 net/netfilter/ipvs/ip_vs_xmit.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index bf66a86..65526f4 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -527,6 +527,21 @@ static inline int ip_vs_tunnel_xmit_prepare(struct sk_buff 
*skb,
return ret;
 }
 
+/* In the event of a remote destination, it's possible that we would have
+ * matches against an old socket (particularly a TIME-WAIT socket). This
+ * causes havoc down the line (ip_local_out et. al. expect regular sockets
+ * and invalid memory accesses will happen) so simply drop the association
+ * in this case.
+*/
+static inline void ip_vs_drop_early_demux_sk(struct sk_buff *skb)
+{
+   /* If dev is set, the packet came from the LOCAL_IN callback and
+* not from a local TCP socket.
+*/
+   if (skb->dev)
+   skb_orphan(skb);
+}
+
 /* return NF_STOLEN (sent) or NF_ACCEPT if local=1 (not sent) */
 static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb,
 struct ip_vs_conn *cp, int local)
@@ -538,12 +553,21 @@ static inline int ip_vs_nat_send_or_cont(int pf, struct 
sk_buff *skb,
ip_vs_notrack(skb);
else
ip_vs_update_conntrack(skb, cp, 1);
+
+   /* Remove the early_demux association unless it's bound for the
+* exact same port and address on this host after translation.
+*/
+   if (!local || cp->vport != cp->dport ||
+   !ip_vs_addr_equal(cp->af, &cp->vaddr, &cp->caddr))
+   ip_vs_drop_early_demux_sk(skb);
+
if (!local) {
skb_forward_csum(skb);
NF_HOOK(pf, NF_INET_LOCAL_OUT, NULL, skb,
NULL, skb_dst(skb)->dev, dst_output_sk);
} else
ret = NF_ACCEPT;
+
return ret;
 }
 
@@ -557,6 +581,7 @@ static inline int ip_vs_send_or_cont(int pf, struct sk_buff 
*skb,
if (likely(!(cp->flags & IP_VS_CONN_F_NFCT)))
ip_vs_notrack(skb);
if (!local) {
+   ip_vs_drop_early_demux_sk(skb);
skb_forward_csum(skb);
NF_HOOK(pf, NF_INET_LOCAL_OUT, NULL, skb,
NULL, skb_dst(skb)->dev, dst_output_sk);
@@ -845,6 +870,8 @@ ip_vs_prepare_tunneled_skb(struct sk_buff *skb, int skb_af,
struct ipv6hdr *old_ipv6h = NULL;
 #endif
 
+   ip_vs_drop_early_demux_sk(skb);
+
if (skb_headroom(skb) < max_headroom || skb_cloned(skb)) {
new_skb = skb_realloc_headroom(skb, max_headroom);
if (!new_skb)
-- 
Alex Gartrell 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-03 Thread Alex Gartrell

Hey

On Fri, Jul 3, 2015 at 1:32 AM, Julian Anastasov  wrote:

> To summarize:
> - we should call skb_orphan as soon as possible after
> deciding if packets goes to local or remote real server
> but only for skb->sk set by early_demux, not for packets
> sent by TCP

Yeah, agree

> - if packets go to local server IPVS should not touch
> skb->dst, skb->sk, etc (NF_ACCEPT case)

Yeah, the thing is that early demux could totally match for a socket
that existed before we created the service, and in that instance it
might make the most sense to retain the connection and simply
NF_ACCEPT.  The problem with that approach though is that is that the
behavior changes if early_demux is not enabled.  I believe that we
should just do the consistent thing and always drop the early_demux
result if bound for non-local, as you've said.

The interesting thing though is that, for the purposes of routing,
enabling early_demux does change the behavior.  I suspect that's a
bug, but it's far enough away from actual use cases that it's probably
fine (who is out there tearing down addresses and setting up routes in
their place?)

> - for skb->sk set by early_demux, skb_orphan should happen before
> skb_set_owner_w in ip_vs_prepare_tunneled_skb because
> skb_set_owner_w will try to increase sk_wmem_alloc which is
> wrong for early_demux phase

Yeah that's my thinking as well.

> - reaching skb_set_owner_w code for skb->sk set by eraly_demux
> looks wrong to me, it can happen on:
> - redirect (DNAT), if somehow we have socket too
> - IPVS redirect: if we forward both to local and remote
> real servers
> - not likely for forward, nobody forwards traffic
> destined to local IP to remote host


What do you think of the following:

commit f04c42f8041cc4ccc4cb2a30c1058136dd497a83
Author: Alex Gartrell 
Date:   Wed Jul 1 13:24:46 2015 -0700

ipvs: orphan_skb in case of forwarding

It is possible that we bind against a local socket in early_demux when we
are actually going to want to forward it.  In this case, the socket serves
no purpose and only serves to confuse things (particularly functions which
implicitly expect sk_fullsock to be true, like ip_local_out).
    Additionally, skb_set_owner_w is totally broken for non full-socks.

Signed-off-by: Alex Gartrell 

diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index bf66a86..3efe719 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -527,6 +527,19 @@ static inline int
ip_vs_tunnel_xmit_prepare(struct sk_buff *skb,
return ret;
 }

+/* In the event of a remote destination, it's possible that we would have
+ * matches against an old socket (particularly a TIME-WAIT socket). This
+ * causes havoc down the line (ip_local_out et. al. expect regular sockets
+ * and invalid memory accesses will happen) so simply drop the association
+ * in this case
+*/
+static inline void ip_vs_drop_early_demux_sk(struct sk_buff *skb) {
+   /* If dev is set, the packet came from the LOCAL_IN callback and
+* not from a local TCP socket */
+   if (skb->dev)
+   skb_orphan(skb);
+}
+
 /* return NF_STOLEN (sent) or NF_ACCEPT if local=1 (not sent) */
 static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb,
 struct ip_vs_conn *cp, int local)
@@ -539,6 +552,7 @@ static inline int ip_vs_nat_send_or_cont(int pf,
struct sk_buff *skb,
else
ip_vs_update_conntrack(skb, cp, 1);
if (!local) {
+   ip_vs_drop_early_demux_sk(skb);
skb_forward_csum(skb);
NF_HOOK(pf, NF_INET_LOCAL_OUT, NULL, skb,
NULL, skb_dst(skb)->dev, dst_output_sk);
@@ -557,6 +571,7 @@ static inline int ip_vs_send_or_cont(int pf,
struct sk_buff *skb,
if (likely(!(cp->flags & IP_VS_CONN_F_NFCT)))
ip_vs_notrack(skb);
if (!local) {
+   ip_vs_drop_early_demux_sk(skb);
skb_forward_csum(skb);
NF_HOOK(pf, NF_INET_LOCAL_OUT, NULL, skb,
NULL, skb_dst(skb)->dev, dst_output_sk);
@@ -845,6 +860,8 @@ ip_vs_prepare_tunneled_skb(struct sk_buff *skb, int skb_af,
struct ipv6hdr *old_ipv6h = NULL;
 #endif

+   ip_vs_drop_early_demux_sk(skb);
+
if (skb_headroom(skb) < max_headroom || skb_cloned(skb)) {
new_skb = skb_realloc_headroom(skb, max_headroom);
if (!new_skb)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-02 Thread Alex Gartrell

On Thu, Jul 2, 2015 at 2:18 PM, Alex Gartrell  wrote:
> If early demux was enabled, we'd use the route from the socket

Actually now that I think about it, this is probably broken, because
we don't reply to the packet but instead silently drop it.

-- 
Alex Gartrell http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-02 Thread Alex Gartrell

On Thu, Jul 2, 2015 at 1:44 AM, Julian Anastasov  wrote:
> I think, your patch from January is almost
> good:

I'll rebase it, add your other suggestions, test it, and send it in.

> And the patch from Eric for IPVS looks good too.

Are we sure that we want to change the semantics of set_owner_w to
orphan it?  It works for us but that's not the behavior I'd expect
from that function and might burn someone later?

I've actually been looking through the code more for other uses of
set_owner_w and I noticed this weird quirk:

The test was simple:
0) Enable ip_forward
1) Add an address to loopback and listen on it
2) Accept a connection and close it (creating a TIME-WAIT socket)
3) Add a new route to a gre tunnel

If early demux was enabled, we'd use the route from the socket
If early demux was disabled, we'd forward using the gre tunnel

Should we just replicate this behavior in ipvs?

if (!skb->dev && skb->sk) return NF_ACCEPT;

-- 
Alex Gartrell 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-01 Thread Alex Gartrell

On Wed, Jul 1, 2015 at 4:26 PM, Eric Dumazet  wrote:
> I think you are mistaken Alex.

Indeed, I was!  Should be unsurpising.

>
> socket early demux cannot possibly set skb->destructor to sock_rfree()

Yeah I will admit adding the code to sock_rfree reflexively out of paranoia.

> If skb->destructor is set by early demux, it correctly points to sock_edemux()
>
> And this one correctly handles all socket variants.

Yes, the problem appears to be in ip_vs_prepare_tunneled_skb
(ip_vs_xmit.c:859 in 4.0)

if (skb_headroom(skb) < max_headroom || skb_cloned(skb)) {
new_skb = skb_realloc_headroom(skb, max_headroom);
if (!new_skb)
goto error;
if (skb->sk)
skb_set_owner_w(new_skb, skb->sk);
consume_skb(skb);
skb = new_skb;
}

skb_set_owner_w sets sock_wfree.

I'll figure out how to ensure that we're using an appropriate destructor here.

Appreciate the patience!

-- 
Alex Gartrell 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-01 Thread Alex Gartrell

If we early-demux bind a TCP_TIMEWAIT socket to an skb and then orphan it
(as we need to do in the ipvs forwarding case), sock_wfree and sock_rfree
are going to reach into the inet_timewait_sock and mess with fields that
don't exist.

Signed-off-by: Alex Gartrell 
---
 net/core/sock.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/core/sock.c b/net/core/sock.c
index 1e1fe9a..b37328f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1620,6 +1620,9 @@ void sock_wfree(struct sk_buff *skb)
struct sock *sk = skb->sk;
unsigned int len = skb->truesize;
 
+   if (sk->sk_state == TCP_TIME_WAIT)
+   return;
+
if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) {
/*
 * Keep a reference on sk_wmem_alloc, this will be released
@@ -1665,6 +1668,9 @@ void sock_rfree(struct sk_buff *skb)
struct sock *sk = skb->sk;
unsigned int len = skb->truesize;
 
+   if (sk->sk_state == TCP_TIME_WAIT)
+   return;
+
atomic_sub(len, &sk->sk_rmem_alloc);
sk_mem_uncharge(sk, len);
 }
-- 
Alex Gartrell 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Intel-wired-lan] [PATCH 1/1] igb: Use ARRAY_SIZE instead fo sizeof(a)/sizeof(a[0])

2015-06-30 Thread Alex Gartrell

On Tue, Jun 30, 2015 at 1:16 PM, Fujinaka, Todd  wrote:

> I still would say no if I'm allowed, because to guarantee that this change - 
> that I don't think fixes anything - works in all cases, we need to do an 
> incredible amount of regression testing. Every variant of every Intel part 
> that uses this driver (and there are many) should be tested and will end up 
> being used by the community.
>

Validation is really simple: diff old_module.ko new_module.ko

And this is a good defensive measure, as it'll save you when someone
screws up and changes your array to a pointer to an array (you'll get
a build failure instead of 0).

-- 
Alex Gartrell 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC] net/unix: SO_REUSEPORT for AF_UNIX

2015-06-28 Thread Alex Gartrell

On Sun, Jun 28, 2015 at 2:52 AM, Conrad Hoffmann  wrote:
> Support the SO_REUSEPORT option for AF_UNIX (aka AF_LOCAL) sockets. Note
> that unlike the IP implementations, the semantics for AF_UNIX sockets are
> those of the original BSD implementation, i.e. each socket that
> successfully reuses a port completely takes over from the previous
> listener.

This is a really weird corner case from a user's perspective.  I think
sharing it is more reasonable, as you can always signal the other
process out of band.

-- 
Alex Gartrell
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

We've released a generic netlink python library -- gnlpy

2015-05-20 Thread Alex Gartrell


Hey everyone,

tl;dr; pure python generic netlink library with simple clients for ipvs 
and taskstats here: https://github.com/facebook/gnlpy


At Facebook we rely upon ipvs for most of our layer-4 load balancing 
needs.  It's mostly worked pretty great for us.  The standard way to 
interact with ipvs is ipvsadm, which will use netlink sockets to make 
rpc calls to the kernel (things like adding/remove services and real 
servers).  Because we run many, many instances of ipvs, we ended up 
scripting this away with a separate program that shells out to ipvsadm.


The down side to this approach was that we had to format the arguments 
and parse the result of the tool, which was kind of a pain. 
Additionally, we'd sometimes get into a bad state where the exec would 
fail or the binary wouldn't be there and the whole thing would kind of 
break.  To my knowledge this never caused any kind of large scale 
incident, but it was an annoying thing to deal with.


At some point, we made some changes to ipvs (heterogenous pools) and we 
were in the position of needing to roll a new ipvsadm binary to take 
advantage of the functionality.  That was fundamentally unappealing to 
yours truly, so I hacked together gnlpy (Generic NetLink PYthon library) 
instead.  It's been in production for several months.


At some point thereafter, someone on the lvs-devel list mentioned 
wanting to interact with ipvs through python and I made a vague 
assurance that we'd open source this thing.  Well here we are.


If you take a look at the code, you'll quickly notice that we haven't 
gone through the trouble of implementing every RPC call for the families 
we support and we certainly haven't gone through the trouble of 
implementing every generic netlink family.  We welcome your pull requests :)


--
Alex Gartrell 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH bpf-next] libbpf: Fix unintentional success return code in bpf_object__load

Re: [PATCH] net: Provide lwtstate_free for non-lwt config

Re: [PATCH] net: Provide lwtstate_free for non-lwt config

[PATCH] net: Provide lwtstate_free for non-lwt config

Re: Panic with demuxed ipv4 multicast udp sockets on 4.0.4

[PATCH net-next] ebpf: Allow dereferences of PTR_TO_STACK registers

Re: [RFC PATCH net-next] ebpf: Allow dereferences of PTR_TO_STACK registers

[RFC PATCH net-next] ebpf: Allow dereferences of PTR_TO_STACK registers

Re: Why return E2BIG from bpf map update?

Why return E2BIG from bpf map update?

[PATCH,v2 net] net: sched: validate that class is found in qdisc_tree_decrease_qlen

[PATCH net] net: sched: validate that class is found in qdisc_tree_decrease_qlen

Re: [RGC PATCH v3.10] net: sched: validate that class is found in qdisc_tree_decrease_qlen

[RGC PATCH v3.10] net: sched: validate that class is found in qdisc_tree_decrease_qlen

Panic with demuxed ipv4 multicast udp sockets on 4.0.4

Re: [PATCH,v2 net-next] ipvs: skb_orphan in case of forwarding

Re: [PATCH,v2 net-next] ipvs: skb_orphan in case of forwarding

Re: [PATCH,v2 net-next] ipvs: skb_orphan in case of forwarding

[PATCH,v2 net-next] ipvs: skb_orphan in case of forwarding

[PATCH net-next] ipvs: skb_orphan in case of forwarding

Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

[PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

Re: [Intel-wired-lan] [PATCH 1/1] igb: Use ARRAY_SIZE instead fo sizeof(a)/sizeof(a[0])

Re: [PATCH RFC] net/unix: SO_REUSEPORT for AF_UNIX

We've released a generic netlink python library -- gnlpy

28 matches

Site Navigation

Mail list logo

Footer information