Re: [iovisor-dev] Print slow commit operations in nfsslower

2022-11-26 Thread Yonghong Song
On Tue, Nov 15, 2022 at 2:04 AM TATSUKAWA KOSUKE(立川 江介)
 wrote:
>
> Hi,
>
> I was using nfsslower in bcc tools and noticed that it doesn't print out
> slow commit operations, which sometimes cause long delays in NFS writes.
> I tried adding code to trace slow commit operations and created the
> attached patch.

The concept and code looks okay to me. One thing is you could use
RAW_TRACEPOINT_PROBE macro to be consistent with some other
bcc tools raw tracepoint usages.

Please submit a pull request.

>
> Please let me hear your opinions.
>
> NFS commits are quite slow and are not performed often, so the patch
> shouldn't incur visible overhead.
>
> Below is a sample output from the modified tool when writing a large
> file using dd command.  Most commits are issued from kworker threads,
> so the commit output may be mostly dropped when filtered by PID.
>
> TIME COMM   PIDT BYTES   OFF_KB   LAT(ms) FILENAME
> ...
> 16:46:11 dd 1212   W 1048576 651264 46.58 testfile
> 16:46:11 dd 1212   W 1048576 653312 54.41 testfile
> 16:46:11 dd 1212   W 1048576 654336 18.96 testfile
> 16:46:11 dd 1212   W 1048576 655360 49.05 testfile
> 16:46:11 dd 1212   W 1048576 657408 82.96 testfile
> 16:46:11 dd 1212   W 1048576 659456109.25 testfile
> 16:46:12 dd 1212   W 1048576 660480163.55 testfile
> 16:46:12 dd 1212   W 1048576 662528205.44 testfile
> 16:46:13 dd 1212   W 1048576 663552751.02 testfile
> 16:46:37 kworker/u8:5   1207   C 0   027449.05 testfile
> 16:46:37 kworker/u8:5   1207   C 0   026725.16 testfile
> 16:46:37 kworker/u8:5   1207   C 0   027592.04 testfile
> 16:46:37 kworker/u8:4   1206   C 0   022188.75 testfile
> 16:46:37 kworker/u8:4   1206   C 0   026092.59 testfile
> 16:46:37 kworker/u8:4   1206   C 0   027268.90 testfile
> 16:46:37 kworker/u8:4   1206   C 0   022303.24 testfile
> 16:46:37 kworker/u8:4   1206   C 0   027081.34 testfile
> 16:46:37 dd 1212   W 1048576 664576   24337.80 testfile
> 16:46:38 dd 1212   W 1048576 958464 61.77 testfile
> 16:46:38 dd 1212   W 1048576 960512 56.60 testfile
> 16:46:38 dd 1212   W 1048576 963584 55.75 testfile
> 16:46:38 dd 1212   W 1048576 965632 54.84 testfile
> ...
>
> Best regards.
> ---
> Kosuke TATSUKAWA  | 1st Platform Software Division
>   | NEC Solution Innovators
>   | tatsu-...@nec.com
>
> 
> diff -rup a/tools/nfsslower.py b/tools/nfsslower.py
> --- a/tools/nfsslower   2022-11-14 12:39:23.677073240 +0900
> +++ b/tools/nfsslower   2022-11-14 14:44:49.960073240 +0900
> @@ -9,6 +9,8 @@
>  # This script traces some common NFS operations: read, write, opens and
>  # getattr. It measures the time spent in these operations, and prints details
>  # for each that exceeded a threshold.
> +# The script also traces commit operations, which is specific to nfs and 
> could
> +# be pretty slow.
>  #
>  # WARNING: This adds low-overhead instrumentation to these NFS operations,
>  # including reads and writes from the file system cache. Such reads and 
> writes
> @@ -25,6 +27,8 @@
>  # Currently there aren't any tracepoints available for nfs_read_file,
>  # nfs_write_file and nfs_open_file, nfs_getattr does have entry and exit
>  # tracepoints but we chose to use kprobes for consistency
> +# Raw tracepoints are used to trace nfs:nfs_initiate_commit and
> +# nfs:nfs_commit_done.
>  #
>  # 31-Aug-2017   Samuel Nair created this. Should work with NFSv{3,4}
>
> @@ -41,8 +45,8 @@ examples = """
>  ./nfsslower -p 121  # trace pid 121 only
>  """
>  parser = argparse.ArgumentParser(
> -description="""Trace READ, WRITE, OPEN \
> -and GETATTR NFS calls slower than a threshold,\
> +description="""Trace READ, WRITE, OPEN, GETATTR \
> +and COMMIT NFS calls slower than a threshold,\
>  supports NFSv{3,4}""",
>  formatter_class=argparse.RawDescriptionHelpFormatter,
>  epilog=examples)
> @@ -66,11 +70,13 @@ bpf_text = """
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #define TRACE_READ 0
>  #define TRACE_WRITE 1
>  #define TRACE_OPEN 2
>  #define TRACE_GETATTR 3
> +#define TRACE_COMMIT 4
>
>  struct val_t {
>  u64 ts;
> @@ -79,6 +85,12 @@ struct val_t {
>  struct dentry *d;
>  };
>
> +struct commit_t {
> +u64 ts;
> +u64 offset;
> +u64 count;
> +};
> +
>  struct data_t {
>  // XXX: switch some to u32's when supported
>  u64 ts_us;
> @@ -93,6 +105,7 @@ struct data_t {
>
>  BPF_HASH(entryinfo, u64, struct val_t);
>  BPF_PERF_OUTPUT(events);
> +BPF_HASH(commitinfo, u64, struct commit_t);
>
>  int trace_rw_entry(struct pt_regs *ctx, struct kiocb *iocb,
>  struct iov_iter *data)

Re: [iovisor-dev] Looking for some bcc issues for new bie

2022-08-10 Thread Yonghong Song
Hi, Jules,

Just go to https://github.com/iovisor/bcc/issues and you can find
quite some issues for you to investigate. Thanks!

Yonghong

On Wed, Aug 10, 2022 at 3:02 AM Jules Irenge  wrote:
>
> Hi
> I am looking for some bcc issues for newbie.
> Any link ?
>
> Jules
> ..
> 


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#2018): https://lists.iovisor.org/g/iovisor-dev/message/2018
Mute This Topic: https://lists.iovisor.org/mt/92933167/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [iovisor-dev] #bcc Count map.ringbuf_reserve() failures

2021-11-02 Thread Yonghong Song
On Tue, Nov 2, 2021 at 7:31 AM Eelco Chaudron  wrote:
>
> [Edited Message Follows]
>
> On Tue, Nov 2, 2021 at 03:24 PM, Eelco Chaudron wrote:
>
> Hi,
>
> Was wondering if there is a way to count the number of times 
> map.ringbuf_reserve() fails for a BPF_RINGBUF_OUTPUT buffer?
>
> This way I can get notified in userspace that I have missed events, and might 
> need to increase the buffer size.
>
> Cheers,
>
> Eelco
>
> Thought I added the #bcc tag but I no longer see it :( So just in case, it's 
> not clear, this is with BCC.

You can check return value of map.ringbuf_reserve(). If the
reservation failed, you can notify user space through map, another
side channel ringbuf, perf buf, etc. Depending on your program type
and program running context, you might be able to use
bpf_send_signal() helper to send a signal to the *current* process.

>
> 


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#2009): https://lists.iovisor.org/g/iovisor-dev/message/2009
Mute This Topic: https://lists.iovisor.org/mt/86767077/21656
Mute #bcc:https://lists.iovisor.org/g/iovisor-dev/mutehashtag/bcc
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [iovisor-dev] Access packet payload in TC egress programs

2021-10-25 Thread Yonghong Song
On Fri, Oct 22, 2021 at 12:31 AM Federico Parola
 wrote:
>
> Thanks for the answer, I wasn't aware of the existence of that helper.
> I have two additional comments:
>
> 1. The documentation of the helper says that passing a length of zero
> should pull the whole length of the packet [1], however with that
> parameter the length of direct accessible data stays unchanged. I think
> there is a mismatch in the behavior and the documentation.

The source code is
BPF_CALL_2(bpf_skb_pull_data, struct sk_buff *, skb, u32, len)
{
/* Idea is the following: should the needed direct read/write
 * test fail during runtime, we can pull in more data and redo
 * again, since implicitly, we invalidate previous checks here.
 *
 * Or, since we know how much we need to make read/writeable,
 * this can be done once at the program beginning for direct
 * access case. By this we overcome limitations of only current
 * headroom being accessible.
 */
return bpf_try_make_writable(skb, len ? : skb_headlen(skb));
}

So if len is 0, it will only try to make *existing* linear data to be
writable, so
you are right. It seems we are not not trying to pull more data in. I will
check with other kernel developers later.

>
> 2. I'd like to avoid re-parsing all the headers after I have pulled new
> data. To do so I save the offset I just reached (the end of the TCP
> header), pull data, get the new data and data_end pointers and add the
> offset to data. However the verifier does not accept my accesses to the
> packet from this point on. Here is some example code:

The current behavior is after data pull, you will need to reparse the packet.
There are a lot of helpers fitting in this case:

bool bpf_helper_changes_pkt_data(void *func)
{
if (func == bpf_skb_vlan_push ||
func == bpf_skb_vlan_pop ||
func == bpf_skb_store_bytes ||
func == bpf_skb_change_proto ||
func == bpf_skb_change_head ||
func == sk_skb_change_head ||
func == bpf_skb_change_tail ||
func == sk_skb_change_tail ||
func == bpf_skb_adjust_room ||
func == sk_skb_adjust_room ||
func == bpf_skb_pull_data ||
func == sk_skb_pull_data ||
func == bpf_clone_redirect ||
func == bpf_l3_csum_replace ||
func == bpf_l4_csum_replace ||
func == bpf_xdp_adjust_head ||
func == bpf_xdp_adjust_meta ||
func == bpf_msg_pull_data ||
func == bpf_msg_push_data ||
func == bpf_msg_pop_data ||
func == bpf_xdp_adjust_tail ||
#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
func == bpf_lwt_seg6_store_bytes ||
func == bpf_lwt_seg6_adjust_srh ||
func == bpf_lwt_seg6_action ||
#endif
#ifdef CONFIG_INET
func == bpf_sock_ops_store_hdr_opt ||
#endif
func == bpf_lwt_in_push_encap ||
func == bpf_lwt_xmit_push_encap)
return true;

return false;
}

It is possible that we could fine tune this behavior as some helpers
like bpf_skb_pull_data() may not need to start over again. But I
could miss some conditions.

Could you post your questions at b...@vger.kernel.org?
Networking people in the mailing list may give you a better
answer why this behavior for bpf_skb_pull_data() and whether
it can be improved.

>
> unsigned payload_offset = (void *)tcph + (tcph->doff << 2) - data;
> bpf_skb_pull_data(ctx, ctx->len);
> data = (void *)(long)ctx->data;
> data_end = (void *)(long)ctx->data_end;
>
> struct tls_record_hdr *rech = data + payload_offset;
> if ((void *)(rech + 1) > data_end)
>  return TC_ACT_OK;
>
> if (rech->type == TLS_CONTENT_TYPE_HANDSAHKE)
>  bpf_trace_printk("It's a handshake");
>
> Running this code gives me the error "R1 offset is outside of the
> packet" even if I performed the correct check on packet boundaries. If I
> re-parse all header the code is accepted. Is there a way to solve the
> problem?
>
> [1]
> https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L2312
>
> On 20/10/21 08:11, Y Song wrote:
> > On Tue, Oct 19, 2021 at 8:13 AM Federico Parola
> >  wrote:
> >>
> >> Dear all,
> >> how can I access the payload of the packet in a program attached to the
> >> TC egress hook (SCHED_CLS attached to clsact qdisc)?
> >> ctx->data_end points to the end of the L4 header, while on the ingress
> >> hook it points to the end of the packet (tested on kernel v5.14).
> >
> > This could be the case that linear data only covers up to the end of
> > L4 header. In such cases, you can use bpf_skb_pull_data() helper
> > to get more data into linear region and after that your ctx->data_end
> > will point to much later packet data.
> >
> >>
> >> Best regards,
> >> Federico Parola
> >>
> >>
> >>
> >>
> >>
>
>
> 
>
>


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

Re: [iovisor-dev] Access packet payload in TC egress programs

2021-10-20 Thread Yonghong Song
On Tue, Oct 19, 2021 at 8:13 AM Federico Parola
 wrote:
>
> Dear all,
> how can I access the payload of the packet in a program attached to the
> TC egress hook (SCHED_CLS attached to clsact qdisc)?
> ctx->data_end points to the end of the L4 header, while on the ingress
> hook it points to the end of the packet (tested on kernel v5.14).

This could be the case that linear data only covers up to the end of
L4 header. In such cases, you can use bpf_skb_pull_data() helper
to get more data into linear region and after that your ctx->data_end
will point to much later packet data.

>
> Best regards,
> Federico Parola
>
>
> 
>
>


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#2003): https://lists.iovisor.org/g/iovisor-dev/message/2003
Mute This Topic: https://lists.iovisor.org/mt/86442134/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [iovisor-dev] libffi trampolines and stack traces? : was Overly brief stack traces for Java/linux ?

2021-04-17 Thread Yonghong Song
On Fri, Apr 16, 2021 at 12:56 PM Yonghong Song via lists.iovisor.org
 wrote:
>
> On Thu, Apr 15, 2021 at 11:09 PM Bradley Schatz
>  wrote:
> >
> > I still digging into this issue, and have hacked memleak/bcc to show 
> > addresses when they cant be resolved. In other places, I'm seeing jit 
> > complied java stack frames showing up alongside C ones, which is the 
> > expected behaviour when using Java with -XX:+PreserveFramePointer.
> >
> > I'm still seeing these allocation traces with only two frames per the below.
> >
> > 11534336 bytes in 11 allocations from stack
> > [7efeebb115d4 unknown] [jna7632521838566054573.tmp ]
> > [7efec27f7440 unknown] [perf-15445.map]
> > 14680064 bytes in 14 allocations from stack
> > [7efeebb115d4 unknown] [jna7632521838566054573.tmp ]
> > [7efec27f7380 unknown] [perf-15445.map]
> >
> > I suspect the lower frame above is a libffi generated trampoline. Are there 
> > any known issues with the eBPF stack tracing infrastructure and such 
> > trampolines? Workarounds?
>
> The bpf stack unwinder is using the kernel one which is the frame
> pointer based and it may have issues with generated trampoline code
> which may mess up frame pointer based stack chain. One possibility is
> to use perf call-graph "dwarf" mode to get the raw data to user space
> and use more powerful library like libunwind etc. to unwind the stack.
> But I haven't do this together with bpf program yet, and cannot
> describe whether and how to do "dwarf" call-graph with bpf program
> together.

I briefly looked at the perf and kernel code and experimented with
perf dwarf mode.
It is possible for bpf to copy user stack to user space. You could do
the following:
 - when you do perf event open, the sample_type needs to be set
properly. The following
is what callchain dwarf mode had:
 sample_type = PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|
  PERF_SAMPLE_ADDR|PERF_SAMPLE_CALLCHAIN|PERF_SAMPLE_CPU|
  PERF_SAMPLE_PERIOD|PERF_SAMPLE_REGS_USER|
  PERF_SAMPLE_STACK_USER|PERF_SAMPLE_DATA_SRC

I do not think you need all of them, you definitely need PERF_SAMPLE_REGS_USER,
maybe you can PERF_SAMPLE_RAW and PERF_SAMPLE_REGS_USER
to see whether it works or not.

In kernel header file  linux:include/linux/perf_event.h, we have
struct perf_sample_data {
  ...
struct perf_regsregs_user;
struct perf_regsregs_intr;
u64 stack_user_size;
}

If you have access to perf_sample_data, you can read regs_user and find
stack pointer and then you can copy some bytes (say 2KB) to user space
for analysis. Using perf_event_output may be too expensive. perf is using
user mmap memory. You can also use a map for this purpose.

How to access perf_sample_data?
Typically sampling bpf program has the signature like
  int do_sample(struct bpf_perf_event_data *ctx)
During program run, the program really got the following ctx
 struct bpf_perf_event_data_kern {
bpf_user_pt_regs_t *regs;
struct perf_sample_data *data;
struct perf_event *event;
 };
So you can use bpf_probe_read_kernel() to get 'data' and then other
information inside perf_sample_data.

You can experiment this way if you really like to see whether it can
solve your problem or not. If it works, I guess we can add a bpf helper
to do the copy as such copy length tends to be big.

>
> >
> > Thanks,
> > Bradley
> >
> >
> >
> > On 7/4/21, 9:04 pm, "Bradley Schatz"  wrote:
> >
> > >What does '[unknown] [perf-18047.map]' mean? Does this mean
> > >   perf-18047.map is not found? If the perf-.map file cannot be 
> > found,
> > >symbolization won't be possible. Maybe you want to double check 
> > this?
> >
> > The file perf-18047.map is there and from other parts of the stack 
> > trace I can see it being used to successfully resolve symbols.
> >
> > Thanks!
> >
> >
> >
> > On 7/4/21, 4:44 am, "Y Song"  wrote:
> >
> > On Mon, Apr 5, 2021 at 10:08 PM Bradley Schatz
> >  wrote:
> > >
> > > Thanks for the suggestion. I found a tunable to keep the JNI 
> > shared library in memory after loading. As you can see below, it is no 
> > longer showing as deleted.
> > >
> > > 13238272 bytes in 404 allocations from stack
> > > [unknown] [jna257690384454344.tmp]
> > > [unknown] [perf-18047.map]
> >
> > I have no experience with perf-map-agent,

Re: [iovisor-dev] libffi trampolines and stack traces? : was Overly brief stack traces for Java/linux ?

2021-04-16 Thread Yonghong Song
On Thu, Apr 15, 2021 at 11:09 PM Bradley Schatz
 wrote:
>
> I still digging into this issue, and have hacked memleak/bcc to show 
> addresses when they cant be resolved. In other places, I'm seeing jit 
> complied java stack frames showing up alongside C ones, which is the expected 
> behaviour when using Java with -XX:+PreserveFramePointer.
>
> I'm still seeing these allocation traces with only two frames per the below.
>
> 11534336 bytes in 11 allocations from stack
> [7efeebb115d4 unknown] [jna7632521838566054573.tmp ]
> [7efec27f7440 unknown] [perf-15445.map]
> 14680064 bytes in 14 allocations from stack
> [7efeebb115d4 unknown] [jna7632521838566054573.tmp ]
> [7efec27f7380 unknown] [perf-15445.map]
>
> I suspect the lower frame above is a libffi generated trampoline. Are there 
> any known issues with the eBPF stack tracing infrastructure and such 
> trampolines? Workarounds?

The bpf stack unwinder is using the kernel one which is the frame
pointer based and it may have issues with generated trampoline code
which may mess up frame pointer based stack chain. One possibility is
to use perf call-graph "dwarf" mode to get the raw data to user space
and use more powerful library like libunwind etc. to unwind the stack.
But I haven't do this together with bpf program yet, and cannot
describe whether and how to do "dwarf" call-graph with bpf program
together.

>
> Thanks,
> Bradley
>
>
>
> On 7/4/21, 9:04 pm, "Bradley Schatz"  wrote:
>
> >What does '[unknown] [perf-18047.map]' mean? Does this mean
> >   perf-18047.map is not found? If the perf-.map file cannot be 
> found,
> >symbolization won't be possible. Maybe you want to double check this?
>
> The file perf-18047.map is there and from other parts of the stack trace 
> I can see it being used to successfully resolve symbols.
>
> Thanks!
>
>
>
> On 7/4/21, 4:44 am, "Y Song"  wrote:
>
> On Mon, Apr 5, 2021 at 10:08 PM Bradley Schatz
>  wrote:
> >
> > Thanks for the suggestion. I found a tunable to keep the JNI shared 
> library in memory after loading. As you can see below, it is no longer 
> showing as deleted.
> >
> > 13238272 bytes in 404 allocations from stack
> > [unknown] [jna257690384454344.tmp]
> > [unknown] [perf-18047.map]
>
> I have no experience with perf-map-agent, but the following is what I 
> guess:
>[perf-18047.map] is used to find the mapping between address and 
> symbol.
> What does '[unknown] [perf-18047.map]' mean? Does this mean
> perf-18047.map is not found? If the perf-.map file cannot be 
> found,
> symbolization won't be possible. Maybe you want to double check this?
>
> >
> > No improvement in granularity though.
> >
> > In the VM I'm using -XX:+PreserveFramePointer 
> -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints. In perf_maps_agent, 
> I'm using "unfoldall"
> >
> > Any other suggestions?
> >
> > Thanks!
> >
> >
> >
> >
> > On 3/4/21, 2:42 am, "Y Song"  wrote:
> >
> > On Wed, Mar 31, 2021 at 11:25 PM Bradley Schatz
> >  wrote:
> > >
> > > Hi,
> > >
> > >
> > >
> > > I’m just starting to come to grips with bcc & perf-map-agent 
> for introspecting java on linux, with the goal of identifying what appears to 
> be an off-heap memory leak (using memleak).
> > >
> > >
> > >
> > > I appear to be getting reliable stack decoding for jvm 
> library code and for jit’ed java methods (see below for an example of the 
> former). However I am seeing some very short stack traces which don’t seem to 
> decode (the latter three stacks of below).
> > >
> > >
> > >
> > > It’s looking to me like the frame starting with “jna…” is 
> likely the native JNI shared library for the FFI library “JNA”.
> > >
> > >
> > >
> > > Any suggestions as to why these latter three are so brief 
> and/or how I can increase the resolution?
> >
> > I can see the file has been marked as deleted.
> >
> > 34603008 bytes in 33 allocations from stack
> >
> >   [unknown] [jna9005484735610534564.tmp (deleted)]
> >
> >   [unknown] [perf-31566.map]
> >
> >96468992 bytes in 92 allocations from stack
> >
> >   [unknown] [jna9005484735610534564.tmp (deleted)]
> >
> >   [unknown] [perf-31566.map]
> >
> > So the file has been removed in userspace and 

Re: [iovisor-dev] Overly brief stack traces for Java/linux ?

2021-04-06 Thread Yonghong Song
On Mon, Apr 5, 2021 at 10:08 PM Bradley Schatz
 wrote:
>
> Thanks for the suggestion. I found a tunable to keep the JNI shared library 
> in memory after loading. As you can see below, it is no longer showing as 
> deleted.
>
> 13238272 bytes in 404 allocations from stack
> [unknown] [jna257690384454344.tmp]
> [unknown] [perf-18047.map]

I have no experience with perf-map-agent, but the following is what I guess:
   [perf-18047.map] is used to find the mapping between address and symbol.
What does '[unknown] [perf-18047.map]' mean? Does this mean
perf-18047.map is not found? If the perf-.map file cannot be found,
symbolization won't be possible. Maybe you want to double check this?

>
> No improvement in granularity though.
>
> In the VM I'm using -XX:+PreserveFramePointer -XX:+UnlockDiagnosticVMOptions 
> -XX:+DebugNonSafepoints. In perf_maps_agent, I'm using "unfoldall"
>
> Any other suggestions?
>
> Thanks!
>
>
>
>
> On 3/4/21, 2:42 am, "Y Song"  wrote:
>
> On Wed, Mar 31, 2021 at 11:25 PM Bradley Schatz
>  wrote:
> >
> > Hi,
> >
> >
> >
> > I’m just starting to come to grips with bcc & perf-map-agent for 
> introspecting java on linux, with the goal of identifying what appears to be 
> an off-heap memory leak (using memleak).
> >
> >
> >
> > I appear to be getting reliable stack decoding for jvm library code and 
> for jit’ed java methods (see below for an example of the former). However I 
> am seeing some very short stack traces which don’t seem to decode (the latter 
> three stacks of below).
> >
> >
> >
> > It’s looking to me like the frame starting with “jna…” is likely the 
> native JNI shared library for the FFI library “JNA”.
> >
> >
> >
> > Any suggestions as to why these latter three are so brief and/or how I 
> can increase the resolution?
>
> I can see the file has been marked as deleted.
>
> 34603008 bytes in 33 allocations from stack
>
>   [unknown] [jna9005484735610534564.tmp (deleted)]
>
>   [unknown] [perf-31566.map]
>
>96468992 bytes in 92 allocations from stack
>
>   [unknown] [jna9005484735610534564.tmp (deleted)]
>
>   [unknown] [perf-31566.map]
>
> So the file has been removed in userspace and current bcc won't be
> pass to parse it since it takes the file name as
> "jna9005484735610534564.tmp (deleted)"
> The file name is actually taken from /proc//maps.
>
> I am not sure whether you can hack to parse "jna9005484735610534564.tmp" 
> or not.
> But I would consider it is unsafe to do that as the original file
> related info may just
> exist in kernel and there is a reference to it. For user space, it is
> either gone or
> could be replaced by something else. So the safest way is to find a place 
> to
> do symbolization before file is gone or keep tmp file a little bit longer.
> >
> >
> >
> > Apologies if this is the wrong place for such a question. Thank you for 
> your help.
> >
> >
> >
> > Kind regards,
> >
> > Bradley
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >119408 bytes in 71 allocations from stack
> >
> >   os::malloc(unsigned long, MemoryType, NativeCallStack 
> const&)+0xb5 [libjvm.so]
> >
> >   CodeBlob::set_oop_maps(OopMapSet*) [clone .part.5]+0x75 
> [libjvm.so]
> >
> >   CodeBlob::CodeBlob(char const*, CodeBuffer*, int, int, 
> int, int, OopMapSet*)+0xe3 [libjvm.so]
> >
> >   nmethod::nmethod(Method*, int, int, int, CodeOffsets*, 
> int, DebugInformationRecorder*, Dependencies*, CodeBuffer*, int, OopMapSet*, 
> ExceptionHandlerTable*, ImplicitExceptionTable*, AbstractCompiler*, int)+0x4d 
> [libjvm.so]
> >
> >   nmethod::new_nmethod(methodHandle, int, int, 
> CodeOffsets*, int, DebugInformationRecorder*, Dependencies*, CodeBuffer*, 
> int, OopMapSet*, ExceptionHandlerTable*, ImplicitExceptionTable*, 
> AbstractCompiler*, int)+0x219 [libjvm.so]
> >
> >   ciEnv::register_method(ciMethod*, int, CodeOffsets*, int, 
> CodeBuffer*, int, OopMapSet*, ExceptionHandlerTable*, 
> ImplicitExceptionTable*, AbstractCompiler*, int, bool, bool, RTMState)+0x1b1 
> [libjvm.so]
> >
> >   Compile::Compile(ciEnv*, C2Compiler*, ciMethod*, int, 
> bool, bool, bool)+0xe60 [libjvm.so]
> >
> >   C2Compiler::compile_method(ciEnv*, ciMethod*, int)+0xa3 
> [libjvm.so]
> >
> >   
> CompileBroker::invoke_compiler_on_method(CompileTask*)+0x808 [libjvm.so]
> >
> >   CompileBroker::compiler_thread_loop()+0x6d8 [libjvm.so]
> >
> >   JavaThread::thread_main_inner()+0x1c7 [libjvm.so]
> >
> >   JavaThread::run()+0x2fa [libjvm.so]
> >
> 

Re: [iovisor-dev] Overly brief stack traces for Java/linux ?

2021-04-02 Thread Yonghong Song
On Wed, Mar 31, 2021 at 11:25 PM Bradley Schatz
 wrote:
>
> Hi,
>
>
>
> I’m just starting to come to grips with bcc & perf-map-agent for 
> introspecting java on linux, with the goal of identifying what appears to be 
> an off-heap memory leak (using memleak).
>
>
>
> I appear to be getting reliable stack decoding for jvm library code and for 
> jit’ed java methods (see below for an example of the former). However I am 
> seeing some very short stack traces which don’t seem to decode (the latter 
> three stacks of below).
>
>
>
> It’s looking to me like the frame starting with “jna…” is likely the native 
> JNI shared library for the FFI library “JNA”.
>
>
>
> Any suggestions as to why these latter three are so brief and/or how I can 
> increase the resolution?

I can see the file has been marked as deleted.

34603008 bytes in 33 allocations from stack

  [unknown] [jna9005484735610534564.tmp (deleted)]

  [unknown] [perf-31566.map]

   96468992 bytes in 92 allocations from stack

  [unknown] [jna9005484735610534564.tmp (deleted)]

  [unknown] [perf-31566.map]

So the file has been removed in userspace and current bcc won't be
pass to parse it since it takes the file name as
"jna9005484735610534564.tmp (deleted)"
The file name is actually taken from /proc//maps.

I am not sure whether you can hack to parse "jna9005484735610534564.tmp" or not.
But I would consider it is unsafe to do that as the original file
related info may just
exist in kernel and there is a reference to it. For user space, it is
either gone or
could be replaced by something else. So the safest way is to find a place to
do symbolization before file is gone or keep tmp file a little bit longer.
>
>
>
> Apologies if this is the wrong place for such a question. Thank you for your 
> help.
>
>
>
> Kind regards,
>
> Bradley
>
>
>
>
>
>
>
>
>
>119408 bytes in 71 allocations from stack
>
>   os::malloc(unsigned long, MemoryType, NativeCallStack 
> const&)+0xb5 [libjvm.so]
>
>   CodeBlob::set_oop_maps(OopMapSet*) [clone .part.5]+0x75 
> [libjvm.so]
>
>   CodeBlob::CodeBlob(char const*, CodeBuffer*, int, int, int, 
> int, OopMapSet*)+0xe3 [libjvm.so]
>
>   nmethod::nmethod(Method*, int, int, int, CodeOffsets*, int, 
> DebugInformationRecorder*, Dependencies*, CodeBuffer*, int, OopMapSet*, 
> ExceptionHandlerTable*, ImplicitExceptionTable*, AbstractCompiler*, int)+0x4d 
> [libjvm.so]
>
>   nmethod::new_nmethod(methodHandle, int, int, CodeOffsets*, int, 
> DebugInformationRecorder*, Dependencies*, CodeBuffer*, int, OopMapSet*, 
> ExceptionHandlerTable*, ImplicitExceptionTable*, AbstractCompiler*, 
> int)+0x219 [libjvm.so]
>
>   ciEnv::register_method(ciMethod*, int, CodeOffsets*, int, 
> CodeBuffer*, int, OopMapSet*, ExceptionHandlerTable*, 
> ImplicitExceptionTable*, AbstractCompiler*, int, bool, bool, RTMState)+0x1b1 
> [libjvm.so]
>
>   Compile::Compile(ciEnv*, C2Compiler*, ciMethod*, int, bool, 
> bool, bool)+0xe60 [libjvm.so]
>
>   C2Compiler::compile_method(ciEnv*, ciMethod*, int)+0xa3 
> [libjvm.so]
>
>   CompileBroker::invoke_compiler_on_method(CompileTask*)+0x808 
> [libjvm.so]
>
>   CompileBroker::compiler_thread_loop()+0x6d8 [libjvm.so]
>
>   JavaThread::thread_main_inner()+0x1c7 [libjvm.so]
>
>   JavaThread::run()+0x2fa [libjvm.so]
>
>   java_start(Thread*)+0x102 [libjvm.so]
>
>   start_thread+0xf3 [libpthread-2.28.so]
>
>34603008 bytes in 33 allocations from stack
>
>   [unknown] [jna9005484735610534564.tmp (deleted)]
>
>   [unknown] [perf-31566.map]
>
>96468992 bytes in 92 allocations from stack
>
>   [unknown] [jna9005484735610534564.tmp (deleted)]
>
>   [unknown] [perf-31566.map]
>
>295698432 bytes in 282 allocations from stack
>
>   [unknown] [jna9005484735610534564.tmp (deleted)]
>
>   [unknown] [perf-31566.map]
>
>
>
> 


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#1982): https://lists.iovisor.org/g/iovisor-dev/message/1982
Mute This Topic: https://lists.iovisor.org/mt/81769854/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [iovisor-dev] Questions about runqlen

2021-03-20 Thread Yonghong Song
On Tue, Mar 16, 2021 at 4:00 AM Abel Wu  wrote:
>
> Hi, when I looked into the runqlen script yesterday, I found that,
> sadly, I misunderstood the "queue length" all the time not only the
> "length" part but also the "queue" part.

Could you file an "issue" for the question? This issue, the
questions/answers can be easily tracked.

>
> Queue
> =
> Only CFS runqueues are taken into account. This makes sense when
> main workloads are all under CFS scheduler, which is common in
> cloud scenario. But what I don't quite follow is that the selected
> queue is task->se.cfs_rq which is from a task view, rather than the
> top level cfs_rq from a cpu view. I suppose the task view is not
> enough to draw the whole picture of saturation?
>
> Length
> ==
> Within this scope length means the number of schedulable entities,
> that is cfs_rq->nr_running. From time sharing point of view, it is
> OK because it represents how many units involved in scheduling of
> this cfs_rq. But what about from execution point of view in which
> the number of tasks (cfs_rq->h_nr_running) will be used?
>
> And besides the above, without the shares information of each entity,
> how could runqlen help us optimizing the performance? Maybe we should
> always focus on occupancy rather than length?

There are some answers in this issue:
  https://github.com/iovisor/bcc/issues/3093

To be accurate for cgroup/task-group environments, you may
need to traverse to the root. Could you check and experiment
whether this can solve your issue? if this is the case, we may
need to enhance runqlen.py. Maybe you could help provide
a pull request? Thanks!

>
> It would be very much appreciated if someone can shed some light.
>
> Thanks & Best regards,
> Abel
>
>
> 
>
>


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#1979): https://lists.iovisor.org/g/iovisor-dev/message/1979
Mute This Topic: https://lists.iovisor.org/mt/81373124/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [iovisor-dev] BCC and passing packet from XDP to user-mode app #bcc

2021-03-20 Thread Yonghong Song
On Thu, Mar 18, 2021 at 4:49 AM Federico Parola
 wrote:
>
> Hi,
> the virtual function you are looking for is perf_submit_skb():
>
> https://github.com/iovisor/bcc/blob/c8de00e1746e242cdcd68b4673a083bb467cd35e/src/cc/export/helpers.h#L193
>
> Strangely it is not documented in the reference guide.

Thanks, Federico and others. Maybe one of you can add it to the
reference_guide.md? We
do have events.perf_submit there. Thanks!

>
> Best regards,
> Federico Parola
>
> On 18/03/21 10:29, v.a.bon...@gmail.com wrote:
> > Hi!
> > Is it possible to pass full ethernet packet from XDP to user-mode app
> > using BCC?
> > I wrote C code like this:
> > BPF_PERF_OUTPUT(captured_data);
> > capture(struct xdp_md *ctx)
> > {
> >  events.perf_submit(ctx, ...);
> > }
> > But there is no flags argument in perf_submit function (but
> > bpf_perf_event_output has such argument).
> > Without BCC I can write such code to pass full packet to user-mode:
> > struct packet_info
> > {
> >  uint32_t packet_len;
> >  uint32_t iface_id;
> > };
> > struct bpf_map_def SEC("maps") captured_data =
> > {
> >  .type= BPF_MAP_TYPE_PERF_EVENT_ARRAY,
> >  .key_size= sizeof(u32),
> >  .value_size  = sizeof(u32),
> >  .max_entries = MAX_CPUS
> > };
> > SEC("xdp")
> > int capture_kern(struct xdp_md *ctx)
> > {
> >  u32 len = ctx->data_end - ctx->data;
> >  u64 flags = BPF_F_CURRENT_CPU;
> >  flags |= (u64)len << 32;
> >  struct packet_info info = {len, ctx->ingress_ifindex};
> >  bpf_perf_event_output(ctx, _data, flags, , sizeof(info));
> >  return XDP_PASS;
> > }
> > How can I do the same when using BCC?
> >
>
>
> 
>
>


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#1978): https://lists.iovisor.org/g/iovisor-dev/message/1978
Mute This Topic: https://lists.iovisor.org/mt/81425283/21656
Mute #bcc:https://lists.iovisor.org/g/iovisor-dev/mutehashtag/bcc
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [iovisor-dev] BCC Support for BPF Subprograms with Tail Calls (Kernel 5.10 Feature)

2021-02-25 Thread Yonghong Song
On Wed, Feb 24, 2021 at 12:24 PM  wrote:
>
> Hello,
>
> I was wondering if BCC implements the new BPF feature (as of kernel 5.10) to 
> allow BPF programs to utilize both BPF tail calls and BPF subprograms. This 
> behavior is described near the end of this section of the BPF reference 
> guide. I am interested in this functionality to extend a BPF program in order 
> to reach the limit of 8KB of stack space.

You can use bpf tail calls today. You can look at
bcc/tests/cc/test_prog_table.cc for an example. bcc does not support
subprogram yet. In the future we do plan to be more libbpf compatible
so we can use those features.
BTW, the stack limit is 512 bytes not 8KB.

>
> Thanks,
>
> Jake
> 


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#1963): https://lists.iovisor.org/g/iovisor-dev/message/1963
Mute This Topic: https://lists.iovisor.org/mt/80886632/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [iovisor-dev] android adeb KASAN_SHADOW_SCALE_SHIFT

2021-02-17 Thread Yonghong Song
Unfortunately, the value is defined in Makefile,

```
ifeq ($(CONFIG_KASAN_SW_TAGS), y)
KASAN_SHADOW_SCALE_SHIFT := 4
else ifeq ($(CONFIG_KASAN_GENERIC), y)
KASAN_SHADOW_SCALE_SHIFT := 3
endif

KBUILD_CFLAGS += -DKASAN_SHADOW_SCALE_SHIFT=$(KASAN_SHADOW_SCALE_SHIFT)
KBUILD_CPPFLAGS += -DKASAN_SHADOW_SCALE_SHIFT=$(KASAN_SHADOW_SCALE_SHIFT)
KBUILD_AFLAGS += -DKASAN_SHADOW_SCALE_SHIFT=$(KASAN_SHADOW_SCALE_SHIFT)
```

We could add something above in helpers.h e.g.,
```
#if defined(__aarch64__)
#if defined(CONFIG_KASAN_SW_TAGS)
#define KASAN_SHADOW_SCALE_SHIFT 4
#elif defined(CONFIG_KASAN_GENERIC)
#define KASAN_SHADOW_SCALE_SHIFT 3
#endif
#endif
```

You can also add the above code to the tool itself.

On Wed, Feb 10, 2021 at 10:18 AM katrina lulz
 wrote:
>
> Hi *,
> I managed to setup adeb on a pixel4 with custom kernel compiled as suggested 
> by adeb's README.
> The setup is working  fine for some BCC tools, as vfsstat but a few as 
> opensnoop and the trace command return the following error:
>
> In file included from ./arch/arm64/include/asm/thread_info.h:13:
> ./arch/arm64/include/asm/memory.h:136:24: error: use of undeclared identifier 
> 'KASAN_SHADOW_SCALE_SHIFT'
> return kimage_vaddr - KIMAGE_VADDR;
>
> I verified by the config.gz on the device that IKHEADERS and the other BPF 
> related configs are correctly enabled.
> Any ideas on how to fix the above error?
>
> thanks,
> best.
> 


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#1952): https://lists.iovisor.org/g/iovisor-dev/message/1952
Mute This Topic: https://lists.iovisor.org/mt/80538782/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [iovisor-dev] BPF perf event: runq length

2021-02-17 Thread Yonghong Song
On Mon, Feb 15, 2021 at 3:45 AM Raga lahari  wrote:
>
> Hi,
>
>
> I am trying to write a BPF perf event program to get CPU runq length.  The 
> Following is the code snippet. I am observing that a big integer (len is 
> 2839296536 ) as queue length in trace output for some instances.
>
>
> Can someone please let me know that  whether this approach helps to get 
> length?

Take a look at bcc tool runqlen.py. Did you get abnormal len with runqlen.py?

>
>
> struct cfs_rq_partial {
>
> struct load_weight load;
>
> unsigned long runnable_weight;
>
> unsigned int nr_running;
>
> unsigned int h_nr_running;
>
> };
>
>
> #define _(P) ({typeof(P) val = 0; bpf_probe_read(, sizeof(val), ); 
> val;})
>
>
> SEC("perf_event")
>
> int do_sample(struct bpf_perf_event_data *ctx)
>
> {
>
>
>
> struct cfs_rq_partial *my_q = NULL;
>
> struct task_struct *task = NULL;
>
> unsigned int len;
>
>
>
> task = (struct task_struct *)bpf_get_current_task();
>
> my_q = _(task->se.cfs_rq);
>
> len = _(my_q->nr_running);
>
> bpf_printk("len is %u", len);
>
>
>
>   …..
>
> }
>
>
> I have tested with another program and confirmed that cfs_rq has 
> runnable_weight filed.
>
>
>
>
> Regards,
> Ragalahari
>
> 


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#1951): https://lists.iovisor.org/g/iovisor-dev/message/1951
Mute This Topic: https://lists.iovisor.org/mt/80651333/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [iovisor-dev] Weird behaviour when updating a hash map from userspace

2021-01-15 Thread Yonghong Song
On Fri, Jan 15, 2021 at 12:42 PM William Findlay
 wrote:
>
> Hi all.
>
> Currently debugging a very strange behaviour with eBPF hash maps and was 
> wondering if anyone else has run into a similar issue? I am using libbpf-rs 
> with BPF CO-RE and my kernel version is 5.9.14.
>
> My setup: I have a map with some compound key and I am updating it once from 
> userspace using libbpf and once (later) from a BPF program, using the same 
> key both times, but with different values.
>
> Here's the weird part: Somehow both key,value pairs are being stored in the 
> map, according to output from bpftool. Even more bizarre, the value provided 
> from userspace is essentially a "ghost value" the entire time -- all map 
> lookups fail until the map has been updated from a BPF program as described 
> above.
>
> To be clear, the weirdness is two-fold:
>
> Lookup should not fail after updating the map the first time; and
> The second value should be overwriting the first one.
>
> After performing both updates, here is the output from bpftool showcasing the 
> weird behaviour:
>
> [{
> "key": {
> "id": 3069983010007500772,
> "device": 48
> },
> "value": 10
> },{
> "key": {
> "id": 3069983010007500772,
> "device": 48
> },
> "value": 40
> }
>
> ]

Does your key data structure have padding? Different padding values
will cause different actual keys.

If padding is not an issue in your case, could you construct a test
case (best if no rust involved) so we can take a deep look?
You can file a task to document this issue if you intend to send a test case.
Thanks!

>
> This behaviour also seems to be inconsistent between different maps and yet 
> consistent between different runs. For some maps, I get the expected result 
> and for others I get this weirdness instead.
>
> Is this possibly a bug in the kernel? Any assistance would be greatly 
> appreciated.
>
> Regards,
> William
[...]


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#1947): https://lists.iovisor.org/g/iovisor-dev/message/1947
Mute This Topic: https://lists.iovisor.org/mt/79712510/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [iovisor-dev] verifier: variable offset stack access question

2020-12-29 Thread Yonghong Song
On Fri, Dec 25, 2020 at 5:41 PM Andrei Matei  wrote:
>
> For posterity, I think I can now answer my own question. I suspect
> things were different in 2018 (because otherwise I don’t see how the
> referenced exchange makes sense); here’s my understanding about the
> verifier’s rules for stack accesses today:
>
> There’s two distinct aspects relevant to the use of variable stack offsets:
>
> 1) “Direct” stack access with variable offset. This is simply
> forbidden; you can’t read or write from a dynamic offset in the stack
> because, in the case of reads, the verifier doesn’t know what type of
> memory would be returned (is it “misc” data? Is it a spilled
> register?) and, in the case of writes, what stack slot’s memory type
> should be updated.
> Separately, when reading from the stack with a fixed offset, the
> respective memory needs to have been initialized (i.e. written to)
> before.
>
> 2) Passing pointers to the stack to helper functions which will write
> through the pointer (such as bpf_probe_read_user()). Here, if the
> stack offset is variable, then all the memory that falls within the
> possible bounds has to be initialized.
> If the offset is fixed, then the memory doesn’t necessarily need to be
> initialized (at least not if the helper’s argument is of type
> ARG_PTR_TO_UNINIT_MEM). Why the restriction in the variable offset
> case? Because, in that case, it cannot be known what memory the helper
> will end up initializing; if the verifier pretended that all the
> memory within the offset bounds would be initialized then further
> reads could leak uninitialized stack memory.

I think your above assessment is kind of correct. For any read/write
to stack in bpf programs, the stack offset must be known so the
verifier knows exactly what the program tries to do. For helpers,
variable length of stack is permitted and the verifier will do
analysis to ensure the stack meets the memory (esp. initialized
memory) requirement as stated in helper proto definition.

>
>
> 
>
>


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#1945): https://lists.iovisor.org/g/iovisor-dev/message/1945
Mute This Topic: https://lists.iovisor.org/mt/79191519/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [iovisor-dev] verifier: variable offset stack access question

2020-12-29 Thread Yonghong Song
On Wed, Dec 23, 2020 at 2:21 PM Andrei Matei  wrote:
>
> Hello Yonghong, all,
>
> I'm curious about a verifier workaround that Yonghong provided two years ago, 
> in this thread.
> Brendan Gregg was asking about accessing stack buffers through a register 
> with a variable offset, and Yonghong suggested a memset as a solution:
> "
> You can initialize the array with ' ' to workaround the issue:
> struct data_t data;
> uint64_t max = sizeof(data.argv);
> const char *argp = NULL;
> memset(, ' ', sizeof(data));
> bpf_probe_read(, sizeof(argp), (void *)&__argv[0]);
> uint64_t len = bpf_probe_read_str(, max, argp);
> len &= 0x; // to avoid: "math between fp pointer and register 
> errs"
> bpf_trace_printk("len: %d\\n", len); // sanity check: len is indeed valid
> "
>
> My question is - how does the memset help? I sort of understand the trouble 
> with variable stack access (different regions of the stack can hold memory of 
> different types), and I've looked through the verifier's code but I've failed 
> to get a clue.

I cannot remember details. Here, what "memset" did is to initialize
related bytes in stack to 0. I guess maybe at that point
bpf_probe_read_str requires an initialized memory?

Right now, bpf_probe_read_str does not require initialized memory, so
memset may not be necessary.

>
> As far as actually trying the trick, I've had difficulty importing  
> in my bpf program. I'm not working in the context of BCC, so maybe that makes 
> the difference. I've tried zero-ing out my buffer manually, and it didn't 
> seem to change anything. I've had better success allocating my buffer using 
> map memory rather than stack memory, but I'm still curious what a memset 
> could do for me.

A lot of string.h functions are implemented as external functions in
glibc. This won't work for bpf programs as the bpf program is not
linked against glibc. The clang compiler will translate the above
memset to some stores if memset() size is not big enough. Better,
using clang __builtin_memset() so it won't have any relation with
glibc.

>
> Thanks a lot!
>
> - Andrei
> 


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#1944): https://lists.iovisor.org/g/iovisor-dev/message/1944
Mute This Topic: https://lists.iovisor.org/mt/79191519/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [iovisor-dev] BPF Maps with wildcards

2020-11-19 Thread Yonghong Song
On Thu, Nov 19, 2020 at 9:57 AM Marinos Dimolianis
 wrote:
>
> Thanks for the response.
> LPM is actually the closest solution however I wanted a structure closer to 
> the way TCAMs operate in which you can have wildcards also in the interim 
> bits.
> I believe that something like that does not exist and I need to implement it 
> using available structures in
eBPF/XDP.

Right, BPF does not have TCAM style maps. If you organize data
structure properly, you may be able to use LPM.

>
> Στις Πέμ, 19 Νοε 2020 στις 5:27 π.μ., ο/η Y Song  έγραψε:
>>
>> On Wed, Nov 18, 2020 at 6:20 AM  wrote:
>> >
>> > Hi all, I am trying to find a way to represent wildcards in BPF Map Keys?
>> > I could not find anything relevant to that, does anyone know anything 
>> > further.
>> > Are there any efforts towards that functionality?
>>
>> The closest map is lpm (trie) map. You may want to take a look.


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#1939): https://lists.iovisor.org/g/iovisor-dev/message/1939
Mute This Topic: https://lists.iovisor.org/mt/78341110/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [iovisor-dev] BPF Maps with wildcards

2020-11-18 Thread Yonghong Song
On Wed, Nov 18, 2020 at 6:20 AM  wrote:
>
> Hi all, I am trying to find a way to represent wildcards in BPF Map Keys?
> I could not find anything relevant to that, does anyone know anything further.
> Are there any efforts towards that functionality?

The closest map is lpm (trie) map. You may want to take a look.


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#1936): https://lists.iovisor.org/g/iovisor-dev/message/1936
Mute This Topic: https://lists.iovisor.org/mt/78341110/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [iovisor-dev] Tracepoint/Kprobe for tracking inbound connections

2020-09-30 Thread Yonghong Song
On Tue, Sep 29, 2020 at 4:14 AM Kanthi P  wrote:
>
> Hi,
>
> I am looking for tracking inbound connections on a system using 
> tracepoints/kprobes.
>
> I was checking "trace_inet_sock_set_state", with which we can track the state 
> changes during connection establishment and closure. It seems straightforward 
> to track total connections, but since we only want inbound, one way would be 
> to look at what are the ip addresses/ports on which a node listens to and 
> while tracking the state changes, I can see if the local address/port matches 
> to the one this system listens on and based on that make a decision whether 
> its an inbound connection or not. This looks a bit roundabout way for me, so 
> thought of reaching for suggestions to do it simpler.
>
> Another way is to store the socker address when TCP_SYN_RECV to 
> TCP_ESTABLISHED state change happens and during closure we can check if it is 
> for this socket, so we know its inbound connection. But this would make the 
> map size grow too high as we have about 50k concurrent connections.
>
> Can you suggest a better way to do this?

Maybe you can use sk_local_storage? You can attach a piece of
information to the socket during TCP_SYN_RECV and later on during
TCP_ESTABLISHED to check that data, and you can delete that data from
the socket if you do not need it any more,
all in bpf program.

>
> Thanks,
> Kanthi
>
> 


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#1913): https://lists.iovisor.org/g/iovisor-dev/message/1913
Mute This Topic: https://lists.iovisor.org/mt/77193554/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [iovisor-dev] Load BPF program at boot-time?

2020-09-08 Thread Yonghong Song
On Sun, Sep 6, 2020 at 7:55 AM Shung-Hsi Yu  wrote:
>
> Hi,
>
> Is it possible to load a BPF program at boot time?

It is possible. See the patch below:
https://lore.kernel.org/bpf/20200819042759.51280-1-alexei.starovoi...@gmail.com/

I tried to load a BPF program and pin it in bpffs system. The system could
be extended to load bpf program, even attach it if other subsystem is ready.
But this needs kernel work.

> What I'm trying to achieve is to trace every single call to a certain
> function since the kernel starts, without missing anything.
>
> More specifically, I'm trying to debug iommu_alloc failures by looking
> at the stacktrace to find out which subsystem/driver allocated too
> many IOMMU slots on a ppc64le system, which I do not have direct
> access to.
>
> I've considered writing a systemd unit file that loads a BPF program
> before the sysinit target[1], but I'm not sure if that's early enough.
> An alternative seems to be to use boot-time tracing with ftrace[2]
> instead (which I end up doing), but it requires recompiling the kernel
> inorder to add tracepoints to retrieve the function call arguments,
> and there isn't an easy way to stop tracing to prevent the tracing
> buffer overflows (I end up writing a systemd unit file that sets a
> ftrace event trigger that turns off tracing).

bpf program seems a good choice here since it can store arbitrary
data in its maps and based on the tracing state, it can stop tracing.

There are still some potential issues relating to not recompile kernel
and just change bpf programs and recompile bpf programs and
rebooting should just work, which is not available today. I guess this
probably can be improved. If you are interested, please take a look
at the above patch and may improve kernel to cover your use case.

>
> Maybe there is a better way to do something like this?
>
>
> Much thanks,
> Shung-Hsi Yu
>
> [1]: https://www.freedesktop.org/software/systemd/man/bootup.html
> [2]: https://www.kernel.org/doc/html/latest/trace/boottime-trace.html
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1911): https://lists.iovisor.org/g/iovisor-dev/message/1911
Mute This Topic: https://lists.iovisor.org/mt/76667790/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Reading Pinned maps in eBPF Programs

2020-08-16 Thread Yonghong Song
On Fri, Aug 14, 2020 at 12:05 PM Ian  wrote:
>
> Hello BPF Community!
>
> Hope you are all doing well. I am trying to have a user space program create 
> a BPF Hash map with a single element containing its PID. This map could then 
> be read by all the BPF programs loaded by the user space program. Any event 
> the BPF programs would handle would first compare the PID with the user space 
> program. If the PIDs matched (this is a single threaded application) the 
> event will be thrown out to eliminate events being processed that are from 
> the user space programs own feedback. I was doing some research into this and 
> found a similar post here: 
> https://lists.iovisor.org/g/iovisor-dev/message/1389?p=,,,20,0,0,0::Created,,Userspace+Maps,20,2,0,23673879
>  that discusses the possibility of this in C++ and BCC. I am curious as to 
> how this could be possible using the standard BPF functions and Libbpf 
> library on Ubuntu 20.04 and Linux Kernel v5.4. NOTE: BTF is not currently 
> compiled into this kernel.
>
> I have created and pinned the map in my user space program like this:
>
> char map_name[] = "pid_map";
>
> int fd = bpf_create_map_name(BPF_MAP_TYPE_HASH, _name, sizeof(u32), 
> sizeof(u32), 1, 0) };
>
> u32 key = 1;
>
> bpf_map_update_elem(fd, , , BPF_ANY);
>
> char pid_map_path[] = "/sys/fs/bpf/pid_map";
>
> bpf_obj_pin(fd, _map_path);
>
> NOTE: Error checking and some syntax stuff was removed for brevity.
>
> In my BPF programs I know I cannot "open" a map using bpf_obj_open. 
> Therefore, I need a reference. I looked into the link provided above, 
> essentially in the BPF program all they did was define the map as an extern 
> map def. So I reproduced this in my BPF program like this:

You can use bpf_obj_get() API to get a reference to the pinned map.

>
> u32 *pid = bpf_map_lookup_elem(_map, );
> extern struct bpf_map_def pid_map;
>
> To see if the BPF Loading process would catch the matching map names. 
> Interestingly this would result in a libbpf error:
> libbpf: failed to find BTF for extern 'pid_map': -3
>
> Looking at this error message it would appear that there is a way to get this 
> kind of functionality using BTF. The error message to implies that some sort 
> of BTF metadata is being searched in some location to match the extern map I 
> have declared. Knowing this I am curious as to how I can create a reference 
> for multiple BPF programs that could read the data in the pid_map to prevent 
> feedback issues. I have looked into libbpf and the standard BPF.h functions 
> but couldn't really find anything that seemed plausible. One thing I did see 
> and am also curious about is the usage of BPF_ANNOTATE_KV_PAIR. This macro 
> seemed like a possibility but my lack of understanding of BTF has not been 
> able to confirm it. I also wasn't sure if using bpf_helpers.h in a user space 
> program was ideal.

BPF_ANNOTATE_KV_PAIR is old way to provide map key/value types, mostly
for pretty print. bcc still uses it. libbpf can use more advanced
mechanisms with direct .maps section attribute.

>
>
> Thank you so much in advance for any response! I really have been amazed at 
> how responsive the community is here. You all have helped me learn so much 
> about BPF!
>
> Ian

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1893): https://lists.iovisor.org/g/iovisor-dev/message/1893
Mute This Topic: https://lists.iovisor.org/mt/76194102/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] BPF Concurrency

2020-06-21 Thread Yonghong Song
On Sun, Jun 21, 2020 at 4:17 PM Kanthi P  wrote:
>
> Thanks Andrii. __sync_fetch_and_add doesn't seem to work as expected, it is 
> adding the increment, but it is returning the wrong value.
> I am actually hitting the same issue mentioned here: 
> https://lists.iovisor.org/g/iovisor-dev/topic/problems_with/23670176?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,20,23670176
>
> Can anyone suggest if it is fixed recently? I am on 4.15 kernel.

You cannot use the return value. A recent llvm should return an error
if you try to use it.

There is some preliminary work to have more atomic operations in the
BPF ISA. https://reviews.llvm.org/D72184. We could add a version of
fetch_and_add with proper return value. This may take some time as we
need to ensure kernel has proper support.

>
> Thanks,
> Kanthi
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1872): https://lists.iovisor.org/g/iovisor-dev/message/1872
Mute This Topic: https://lists.iovisor.org/mt/74407447/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] #bcc - skb_network_header crashes in a BPF Kernel trace function

2020-05-07 Thread Yonghong Song
On Wed, May 6, 2020 at 11:00 PM Yonghong Song via lists.iovisor.org
 wrote:
>
> On Wed, May 6, 2020 at 9:26 AM  wrote:
> >
> > Hello - I am looking to trace ip_forward_finish. The intent is to trace 
> > latency of all TCP connections going through a linux based gateway router.  
> > Hence thought of tracing ip_forward_finish kernel function. And capture the 
> > time-stamp of SYN, SYN-ACK and ACK messages at the router.
> >
> > The issue is accessing iphdr inside the trace function crashes with the 
> > below error:
> >
> > bpf: Failed to load program: Permission denied
> > 0: (79) r6 = *(u64 *)(r1 +96)
> > 1: (b7) r1 = 0
> > 2: (6b) *(u16 *)(r10 -24) = r1
> > 3: (bf) r3 = r6
> > 4: (07) r3 += 192
> > 5: (bf) r1 = r10
> > 6: (07) r1 += -24
> > 7: (b7) r2 = 2
> > 8: (85) call bpf_probe_read#4
> > 9: (69) r1 = *(u16 *)(r10 -24)
> > 10: (55) if r1 != 0x8 goto pc+7
> >  R0=inv(id=0) R1=inv8 R6=inv(id=0) R10=fp0
> > 11: (69) r1 = *(u16 *)(r6 +196)
> > R6 invalid mem access 'inv'
>
> You did not show the code which actually caused the problem.
>
> > bpf_probe_read(_Hdr, sizeof(ip_Hdr), (void*)ip_hdr(skb));
> >
> > if ( (ip_Hdr.protocol != IPPROTO_TCP))
> >  return 0;
> >
> > return 0;
> > }
>
> There must be code after "if ( (ip_Hdr.protocol != IPPROTO_TCP)) return 0;" .
> You may need bpf_probe_read() for memory accesses there.
>
> >
> > HINT: The invalid mem access 'inv' error can happen if you try to 
> > dereference memory without first using bpf_probe_read() to copy it to the 
> > BPF stack. Sometimes the bpf_probe_read is automatic by the bcc rewriter, 
> > other times you'll need to be explicit.
> >
> > The code fragment I originally had was as below and the crash occurs when 
> > an access to ip_Hdr->protocol is made. And I also checked that ip_Hdr is 
> > not null.
> >
> > int trace_forward_finish(struct pt_regs *ctx,struct net *net, struct sock 
> > *sk, struct sk_buff *skb)
> > {
> >
> > if (skb->protocol != htons(ETH_P_IP)) return 0;
> >
> > struct iphdr* ip_Hdr = (struct iphdr *) skb_network_header(skb);
> >
> > if (ip_Hdr->protocol != IPPROTO_TCP)
> >  return 0;
> >
> >
> > /// Other code
> >
> >   }
> >
> > Per the HINT in the message, I did try to change to bpf_probe_read but 
> > still the same outcome
> >
> > int trace_forward_finish(struct pt_regs *ctx,struct net *net, struct sock 
> > *sk, struct sk_buff *skb)
> > {
> > if (skb->protocol != htons(ETH_P_IP)) return 0;
> >
> > struct iphdr ip_Hdr;
> > bpf_probe_read(_Hdr, sizeof(ip_Hdr), (void*)ip_hdr(skb));

I see the issue now. ip_hdr(skb) eventually transforms to

static inline unsigned char *skb_network_header(const struct sk_buff *skb)
{
return skb->head + skb->network_header;
}

The above two pointer dereferences need bpf probe read.
Unfortunately, you may need put the above function in your bpf program
so you could use bpf_probe_read to access skb->head and skb->network_header.

> >
> > if ( (ip_Hdr.protocol != IPPROTO_TCP))
> >  return 0;
> >
> > return 0;
> > }
> >
> > Any help would be appreciated.
> >
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1844): https://lists.iovisor.org/g/iovisor-dev/message/1844
Mute This Topic: https://lists.iovisor.org/mt/74032161/21656
Mute #bcc: https://lists.iovisor.org/mk?hashtag=bcc=2590197
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] #bcc - skb_network_header crashes in a BPF Kernel trace function

2020-05-07 Thread Yonghong Song
On Wed, May 6, 2020 at 9:26 AM  wrote:
>
> Hello - I am looking to trace ip_forward_finish. The intent is to trace 
> latency of all TCP connections going through a linux based gateway router.  
> Hence thought of tracing ip_forward_finish kernel function. And capture the 
> time-stamp of SYN, SYN-ACK and ACK messages at the router.
>
> The issue is accessing iphdr inside the trace function crashes with the below 
> error:
>
> bpf: Failed to load program: Permission denied
> 0: (79) r6 = *(u64 *)(r1 +96)
> 1: (b7) r1 = 0
> 2: (6b) *(u16 *)(r10 -24) = r1
> 3: (bf) r3 = r6
> 4: (07) r3 += 192
> 5: (bf) r1 = r10
> 6: (07) r1 += -24
> 7: (b7) r2 = 2
> 8: (85) call bpf_probe_read#4
> 9: (69) r1 = *(u16 *)(r10 -24)
> 10: (55) if r1 != 0x8 goto pc+7
>  R0=inv(id=0) R1=inv8 R6=inv(id=0) R10=fp0
> 11: (69) r1 = *(u16 *)(r6 +196)
> R6 invalid mem access 'inv'

You did not show the code which actually caused the problem.

> bpf_probe_read(_Hdr, sizeof(ip_Hdr), (void*)ip_hdr(skb));
>
> if ( (ip_Hdr.protocol != IPPROTO_TCP))
>  return 0;
>
> return 0;
> }

There must be code after "if ( (ip_Hdr.protocol != IPPROTO_TCP)) return 0;" .
You may need bpf_probe_read() for memory accesses there.

>
> HINT: The invalid mem access 'inv' error can happen if you try to dereference 
> memory without first using bpf_probe_read() to copy it to the BPF stack. 
> Sometimes the bpf_probe_read is automatic by the bcc rewriter, other times 
> you'll need to be explicit.
>
> The code fragment I originally had was as below and the crash occurs when an 
> access to ip_Hdr->protocol is made. And I also checked that ip_Hdr is not 
> null.
>
> int trace_forward_finish(struct pt_regs *ctx,struct net *net, struct sock 
> *sk, struct sk_buff *skb)
> {
>
> if (skb->protocol != htons(ETH_P_IP)) return 0;
>
> struct iphdr* ip_Hdr = (struct iphdr *) skb_network_header(skb);
>
> if (ip_Hdr->protocol != IPPROTO_TCP)
>  return 0;
>
>
> /// Other code
>
>   }
>
> Per the HINT in the message, I did try to change to bpf_probe_read but still 
> the same outcome
>
> int trace_forward_finish(struct pt_regs *ctx,struct net *net, struct sock 
> *sk, struct sk_buff *skb)
> {
> if (skb->protocol != htons(ETH_P_IP)) return 0;
>
> struct iphdr ip_Hdr;
> bpf_probe_read(_Hdr, sizeof(ip_Hdr), (void*)ip_hdr(skb));
>
> if ( (ip_Hdr.protocol != IPPROTO_TCP))
>  return 0;
>
> return 0;
> }
>
> Any help would be appreciated.
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1843): https://lists.iovisor.org/g/iovisor-dev/message/1843
Mute This Topic: https://lists.iovisor.org/mt/74032161/21656
Mute #bcc: https://lists.iovisor.org/mk?hashtag=bcc=2590197
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Array brace-enclosed initialization

2020-03-23 Thread Yonghong Song
On Mon, Mar 23, 2020 at 10:05 AM Federico Parola  wrote:
>
> Hello everybody,
> in my XDP eBPF program I'm trying to initialize an array with a 
> brace-enclosed list, however my code is rejected by the verifier.
> Here is a simple piece of code to replicate the problem:
>
> #include 
>
> #ifndef __section
> # define __section(NAME)  \
>__attribute__((section(NAME), used))
> #endif
>
> #ifndef BPF_FUNC
> # define BPF_FUNC(NAME, ...)  \
>(*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
> #endif
>
> #ifndef printk
> # define printk(fmt, ...)  \
> ({ \
> char fmt[] = fmt;  \
> trace_printk(fmt, sizeof(fmt), ##__VA_ARGS__); \
> })
> #endif
>
> static void BPF_FUNC(trace_printk, const char *fmt, int fmt_size, ...);
>
> __section("prog")
> int xdp_prog(struct xdp_md *ctx) {
>   int i;
>   int array[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

The init list is too long. The compiler puts {0, 1, 2, ..., 9} in a readonly
section. That is why you got the failure below.

If you use native clang compilation and a recent kernel, libbpf
should be able to automatically create a map for you so your
code will work.

>
> #pragma nounroll
>   for (i = 0; i < 10; i++) {
> printk("%d", array[i]);
>   }
>
>   return XDP_PASS;
> }
>
> char __license[] __section("license") = "GPL"
>
> This is the error reported by the verifier:
>
> 0: (b7) r6 = 0
> 1: (b7) r7 = 25637
> 2: (b7) r8 = 0
> 3: (73) *(u8 *)(r10 -2) = r6
> last_idx 3 first_idx 0
> regs=40 stack=0 before 2: (b7) r8 = 0
> regs=40 stack=0 before 1: (b7) r7 = 25637
> regs=40 stack=0 before 0: (b7) r6 = 0
> 4: (6b) *(u16 *)(r10 -4) = r7
> 5: (18) r1 = 0x0
> 7: (0f) r1 += r8
> 8: (61) r3 = *(u32 *)(r1 +0)
> R1 invalid mem access 'inv'
> processed 8 insns (limit 100) max_states_per_insn 0 total_states 0 
> peak_states 0 mark_read 0
>
> I tried compiling with clang-6 and clang-9 with optimizaton set to O1 and O2, 
> but I get this error in all cases.
> If I initialize the array in another way (e.g. with a loop) the program works 
> correctly.
> The eBPF bytecode generated by clang is the following:
>
> .text
> .file "test.c"
> .section prog,"ax",@progbits
> .globl xdp_prog# -- Begin function xdp_prog
> .p2align 3
> .type xdp_prog,@function
> xdp_prog:   # @xdp_prog
> # %bb.0:
> r6 = 0
> r7 = 25637
> r8 = 0
> LBB0_1: # =>This Inner Loop Header: Depth=1
> *(u8 *)(r10 - 2) = r6
> *(u16 *)(r10 - 4) = r7
> r1 = .L__const.xdp_prog.array ll
> r1 += r8
> r3 = *(u32 *)(r1 + 0)
> r1 = r10
> r1 += -4
> r2 = 3
> call 6
> r8 += 4
> if r8 != 24 goto LBB0_1
> # %bb.2:
> r0 = 2
> exit
> .Lfunc_end0:
> .size xdp_prog, .Lfunc_end0-xdp_prog
> # -- End function
> .type .L__const.xdp_prog.array,@object # @__const.xdp_prog.array
> .section .rodata,"a",@progbits
> .p2align 2
> .L__const.xdp_prog.array:
> .long 0   # 0x0
> .long 1   # 0x1
> .long 2   # 0x2
> .long 3   # 0x3
> .long 4   # 0x4
> .long 5   # 0x5
> .long 6   # 0x6
> .long 7   # 0x7
> .long 8   # 0x8
> .long 9   # 0x9
> .size .L__const.xdp_prog.array, 40
>
> .type .L__const.xdp_prog.fmt,@object # @__const.xdp_prog.fmt
> .section .rodata.str1.1,"aMS",@progbits,1
> .L__const.xdp_prog.fmt:
> .asciz "%d"
> .size .L__const.xdp_prog.fmt, 3
>
> .type __license,@object   # @__license
> .section license,"aw",@progbits
> .globl __license
> __license:
> .asciz "GPL"
> .size __license, 4
>
>
> .addrsig
> .addrsig_sym xdp_prog
> .addrsig_sym __license
>
> It seems like the array in the stack is not initialized in the code. With 
> some declarations the code works, for example decalring the array in the 
> following way:
> int array[10] = {0, 1, 2, 3};
> everything works, the generated bytecode is the following:
>
> .text
> .file "test.c"
> .section prog,"ax",@progbits
> .globl xdp_prog# -- Begin function xdp_prog
> .p2align 3
> .type xdp_prog,@function
> xdp_prog:   # @xdp_prog
> # %bb.0:
> r1 = 2
> *(u32 *)(r10 - 36) = r1
> r6 = 0
> *(u32 *)(r10 - 8) = r6
> *(u32 *)(r10 - 12) = r6
> *(u32 *)(r10 - 16) = r6
> *(u32 *)(r10 - 20) = r6
> *(u32 *)(r10 - 24) = r6
> *(u32 *)(r10 - 28) = r6
> *(u32 *)(r10 - 4) = r6
> r1 = 3
> *(u32 *)(r10 - 32) = r1
> r1 = 1
> *(u32 *)(r10 - 40) = r1
> *(u8 *)(r10 - 42) = r6
> r7 = 25637
> *(u16 *)(r10 - 44) = r7
> r1 = r10
> r1 += -44
> r2 = 3
> r3 = 1
> call 6
> r8 = 4
> LBB0_1: # =>This Inner Loop Header: Depth=1
> r1 = r10
> r1 += -40
> r1 += r8
> r3 = *(u32 *)(r1 + 0)
> *(u8 *)(r10 - 42) = r6
> *(u16 *)(r10 - 44) = r7
> r1 

Re: [iovisor-dev] Can multiple BPF programs use same per-cpu perf ring buffer?

2020-02-17 Thread Yonghong Song
On Sun, Feb 16, 2020 at 8:43 PM Hayden Livingston
 wrote:
>
> Imagine I have a per-cpu perf ring buffer for all my cpus.
>
> Now I have two eBPF programs.
>
> In both these eBPF programs I do bpf_update_elem(myFD, ,
> , BPF_ANY)
>
> Will this mean that multiple eBPF programs will be able to write their
> data into a single buffer (of course associated with cpu).
>
> This would be amazing if it is truly possible. It seems like it should
> be possible.

Yes, you can do this.

>
> I have not tried yet.
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1807): https://lists.iovisor.org/g/iovisor-dev/message/1807
Mute This Topic: https://lists.iovisor.org/mt/71343794/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Why is BPF_PERF_OUTPUT max_entries set to total processor count?

2020-02-16 Thread Yonghong Song
On Sun, Feb 16, 2020 at 5:09 PM Hayden Livingston
 wrote:
>
> Thanks. I had to re-read your reply and the kernel code multiple
> times, but I think I get it now. Please confirm.
>
> It is this call is made by user mode code:
>
> fd = bpf_create_map(BPF_MAP_TYPE_PERF_EVENT_ARRAY, /*key_size*/
> sizeof(int), /*value_size*/ sizeof(int), NUM_POSSIBLE_CPUS, 0);
>
> key is smp_processor_id. value is perf_events fd. This is why the map
> is both is key integer and value integer.
>
> Why so many indirections? Is it to support pinning where user program
> can different ring buffers?

Perf event ring buffer is per cpu.

>
> To me it seems the kernel code that uses cpu index to look into array
> could just to told fd directly.

Yes, it is what it did in the kernel. Each array element holds one ring buffer.

>
> On Sun, Feb 16, 2020 at 1:50 PM Y Song  wrote:
> >
> > PERF_EVENT_OUTPUT map is to hold per cpu ring buffers created by
> > perf_event_open.
> > That is why its typical size is the number of cpus on the host.
> >
> > On Sun, Feb 16, 2020 at 1:52 AM Hayden Livingston
> >  wrote:
> > >
> > > I'm very confused why BCC creates a map of number of processors for
> > > the perf_events output map.
> > >
> > > I can imagine it being 1 since all it does is act as a kernel-user
> > > mode intermediary and it is true that the code cannot be preempted.
> > >
> > > Or if it can be preempted then I can imagine that since there can't be
> > > more than processor count it is the max depth one has to worry about.
> > >
> > > Is my thinking flawed? Or maybe there is a completely different reason?
> > >
> > >
> > >
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1803): https://lists.iovisor.org/g/iovisor-dev/message/1803
Mute This Topic: https://lists.iovisor.org/mt/71322089/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Why is BPF_PERF_OUTPUT max_entries set to total processor count?

2020-02-16 Thread Yonghong Song
PERF_EVENT_OUTPUT map is to hold per cpu ring buffers created by
perf_event_open.
That is why its typical size is the number of cpus on the host.

On Sun, Feb 16, 2020 at 1:52 AM Hayden Livingston
 wrote:
>
> I'm very confused why BCC creates a map of number of processors for
> the perf_events output map.
>
> I can imagine it being 1 since all it does is act as a kernel-user
> mode intermediary and it is true that the code cannot be preempted.
>
> Or if it can be preempted then I can imagine that since there can't be
> more than processor count it is the max depth one has to worry about.
>
> Is my thinking flawed? Or maybe there is a completely different reason?
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1801): https://lists.iovisor.org/g/iovisor-dev/message/1801
Mute This Topic: https://lists.iovisor.org/mt/71322089/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Build error on current Amazon Linux 2

2019-09-18 Thread Yonghong Song
What is your llvm version? What is your cmake output? Looks like llvm
libraries are not really linked?

On Wed, Sep 18, 2019 at 7:56 PM  wrote:
>
> Before I started to make a small change to bcc, I thought that I ought to 
> verify that 'master' would actually build -- I'm running on Amazon Linux 2 
> AMI (HVM), SSD Volume Type - ami-0b69ea66ff7391e80. I followed the 
> instructions in INSTALL.md and the make actually fails:
>
> Scanning dependencies of target bcc-lua
>
> [ 38%] Building C object src/lua/CMakeFiles/bcc-lua.dir/src/main.c.o
>
> [ 39%] Linking C executable bcc-lua
>
> [ 39%] Built target bcc-lua
>
> Scanning dependencies of target bps
>
> [ 39%] Building C object introspection/CMakeFiles/bps.dir/bps.c.o
>
> [ 39%] Linking C executable bps
>
> [ 39%] Built target bps
>
> Scanning dependencies of target FollyRequestContextSwitch
>
> [ 40%] Building CXX object 
> examples/cpp/CMakeFiles/FollyRequestContextSwitch.dir/FollyRequestContextSwitch.cc.o
>
> [ 40%] Linking CXX executable FollyRequestContextSwitch
>
> ../../src/cc/frontends/clang/libclang_frontend.a(b_frontend_action.cc.o): In 
> function `ebpf::BTypeVisitor::genParamIndirectAssign(clang::FunctionDecl*, 
> std::__cxx11::basic_string, std::allocator 
> >&, char const**)':
>
> b_frontend_action.cc:(.text+0x1d55): undefined reference to 
> `clang::Rewriter::getRewrittenText[abi:cxx11](clang::SourceRange) const'
>
> ../../src/cc/frontends/clang/libclang_frontend.a(b_frontend_action.cc.o): In 
> function `ebpf::BTypeVisitor::genParamDirectAssign(clang::FunctionDecl*, 
> std::__cxx11::basic_string, std::allocator 
> >&, char const**)':
>
> b_frontend_action.cc:(.text+0x32d7): undefined reference to 
> `clang::Rewriter::getRewrittenText[abi:cxx11](clang::SourceRange) const'
>
> ../../src/cc/frontends/clang/libclang_frontend.a(b_frontend_action.cc.o): In 
> function `ebpf::BTypeVisitor::VisitBinaryOperator(clang::BinaryOperator*)':
>
> b_frontend_action.cc:(.text+0x4353): undefined reference to 
> `clang::Rewriter::getRewrittenText[abi:cxx11](clang::SourceRange) const'
>
> ../../src/cc/frontends/clang/libclang_frontend.a(b_frontend_action.cc.o): In 
> function `ebpf::BTypeVisitor::VisitFunctionDecl(clang::FunctionDecl*)':
>
>
> Has anyone else seen this?
>
> Thanks
>
> Philip
>
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1787): https://lists.iovisor.org/g/iovisor-dev/message/1787
Mute This Topic: https://lists.iovisor.org/mt/34195978/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] CPU Concurrency Issues

2019-08-20 Thread Yonghong Song
On Tue, Aug 20, 2019 at 12:05 PM Arnaldo Carvalho de Melo
 wrote:
>
> Em Tue, Aug 20, 2019 at 11:22:50AM -0700, Yonghong Song escreveu:
> > On Tue, Aug 20, 2019 at 11:02 AM Matthew Ahrens  wrote:
> > > On Tue, Aug 20, 2019 at 10:47 AM Yonghong Song  wrote:
> > > > Also we have per cpu counter to prevent when bpf program interrupted 
> > > > then another bpf to run for tracing programs.
>
> > > I think that means that if an interrupt fires while the bpf program is 
> > > run, the interrupt will run, but if the interrupt causes another tracing 
> > > event to fire, the associated bpf program will not run (i.e. the event 
> > > will be ignored / dropped).  Is that right?
>
> > Yes. See
> > https://github.com/torvalds/linux/blob/master/kernel/trace/bpf_trace.c#L88-L97
>
> Yonghong, I thought there would be some counter to at least let users
> know that drop happened, has this ever surfaced? Or is there some way to
> know about those drops that I'm missing?

You did not miss anything. Currently, there are no counters to count those
drops due to nmi or due to bpf program already running on that cpu.

There is effort by Daniel Xu to expose nhit/nmisses counters
from k/uprobe trace infra. Even kprobe is not a miss, bpf program may not
fire due to the above reasons.
https://lore.kernel.org/bpf/20190820214819.16154-1-...@dxuuu.xyz/T/#t

debugfs has k/uprobe_profile to count nhit/nmisses from k/uprobe trace infra.

We could add a counter into trace_event_call->event to count hit/miss. The hit
can also be counted by bpf program itself. The "miss" should be rare, and
most bpf programs e.g. in bcc are designed to tolerate occasional probe miss,
which should not affect much on the final aggregation results.

How strongly do you feel such a bpf prog hit/miss counter for tracing programs
is needed?

>
> - Arnaldo

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1783): https://lists.iovisor.org/g/iovisor-dev/message/1783
Mute This Topic: https://lists.iovisor.org/mt/32960072/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] CPU Concurrency Issues

2019-08-20 Thread Yonghong Song
On Tue, Aug 20, 2019 at 11:02 AM Matthew Ahrens  wrote:
>
>
>
> On Tue, Aug 20, 2019 at 10:47 AM Yonghong Song  wrote:
>>
>> On Tue, Aug 20, 2019 at 10:38 AM Matthew Ahrens  wrote:
>> >
>> > On Mon, Aug 19, 2019 at 4:30 PM Yonghong Song  wrote:
>> >>
>> >> On Mon, Aug 19, 2019 at 1:54 PM  wrote:
>> >> >
>> >> > Hi all,
>> >> >
>> >> > I'm trying to verify that there are no concurrency issues with an 
>> >> > approach I'm using cpu_id as a key to a HASH_MAP.   My understanding is 
>> >> > that bcc disables preemption but the details are fuzzy and I haven't 
>> >> > been able to find anything in the code.  Can anyone shed some light on 
>> >> > this?
>> >>
>> >> preemption is a kernel thing, bcc does not disable it.  You need to
>> >> check kernel configuration CONFIG_PREEMPT in your host.
>> >
>> >
>> > When running bpf code from a kprobe / kretprobe, does anything ensure that 
>> > cpu_id doesn't change while the bpf is running (e.g. due to preemption)?  
>> > Does anything ensure that no other bpf code runs on this CPU while this 
>> > kprobe is running (e.g. due to an interrupt firing and hitting a different 
>> > kprobe)?  If either of those things can happen, it seems difficult to 
>> > atomically increment an entry in a HASH_MAP (even when using the cpu_id as 
>> > a key).
>>
>> bpf program run is wrapped in preempt_disable()/preempt_enable()
>> region. Also we have per cpu counter to prevent when bpf program
>> interrupted then another bpf to run for tracing programs. Therefore,
>> for tracing programs, the situation you mentioned in the above won't
>> happen.
>>
>> The only thing which can happen is networking and tracing program
>> share the same map, e.g.,
>>networking bpf program can be nmi interrupted and a bpf tracing
>> program (perf_event type) may be running.
>>
>
> Great, thanks Yonghong!  And I get that this is part of the infrastructure 
> that calls into the bpf code (e.g. perf_events), not bcc.
>
> > Also we have per cpu counter to prevent when bpf program interrupted then 
> > another bpf to run for tracing programs.
>
> I think that means that if an interrupt fires while the bpf program is run, 
> the interrupt will run, but if the interrupt causes another tracing event to 
> fire, the associated bpf program will not run (i.e. the event will be ignored 
> / dropped).  Is that right?

Yes. See
https://github.com/torvalds/linux/blob/master/kernel/trace/bpf_trace.c#L88-L97

>
> --matt
>
>>
>>
>>
>> >
>> > --matt
>>
>> 
>>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1780): https://lists.iovisor.org/g/iovisor-dev/message/1780
Mute This Topic: https://lists.iovisor.org/mt/32960072/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] CPU Concurrency Issues

2019-08-20 Thread Yonghong Song
On Tue, Aug 20, 2019 at 10:38 AM Matthew Ahrens  wrote:
>
> On Mon, Aug 19, 2019 at 4:30 PM Yonghong Song  wrote:
>>
>> On Mon, Aug 19, 2019 at 1:54 PM  wrote:
>> >
>> > Hi all,
>> >
>> > I'm trying to verify that there are no concurrency issues with an approach 
>> > I'm using cpu_id as a key to a HASH_MAP.   My understanding is that bcc 
>> > disables preemption but the details are fuzzy and I haven't been able to 
>> > find anything in the code.  Can anyone shed some light on this?
>>
>> preemption is a kernel thing, bcc does not disable it.  You need to
>> check kernel configuration CONFIG_PREEMPT in your host.
>
>
> When running bpf code from a kprobe / kretprobe, does anything ensure that 
> cpu_id doesn't change while the bpf is running (e.g. due to preemption)?  
> Does anything ensure that no other bpf code runs on this CPU while this 
> kprobe is running (e.g. due to an interrupt firing and hitting a different 
> kprobe)?  If either of those things can happen, it seems difficult to 
> atomically increment an entry in a HASH_MAP (even when using the cpu_id as a 
> key).

bpf program run is wrapped in preempt_disable()/preempt_enable()
region. Also we have per cpu counter to prevent when bpf program
interrupted then another bpf to run for tracing programs. Therefore,
for tracing programs, the situation you mentioned in the above won't
happen.

The only thing which can happen is networking and tracing program
share the same map, e.g.,
   networking bpf program can be nmi interrupted and a bpf tracing
program (perf_event type) may be running.



>
> --matt

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1778): https://lists.iovisor.org/g/iovisor-dev/message/1778
Mute This Topic: https://lists.iovisor.org/mt/32960072/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] CPU Concurrency Issues

2019-08-19 Thread Yonghong Song
On Mon, Aug 19, 2019 at 1:54 PM  wrote:
>
> Hi all,
>
> I'm trying to verify that there are no concurrency issues with an approach 
> I'm using cpu_id as a key to a HASH_MAP.   My understanding is that bcc 
> disables preemption but the details are fuzzy and I haven't been able to find 
> anything in the code.  Can anyone shed some light on this?

preemption is a kernel thing, bcc does not disable it.  You need to
check kernel configuration CONFIG_PREEMPT in your host.

>
>
>
> Thank you,
>
> Brad Lewis
>
>
>
>
>
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1775): https://lists.iovisor.org/g/iovisor-dev/message/1775
Mute This Topic: https://lists.iovisor.org/mt/32960072/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] agenda: IO Visor TSC/Dev Meeting

2019-08-06 Thread Yonghong Song
On Tue, Aug 6, 2019 at 7:31 AM  wrote:
>
> Hello Brenden!
>
> I’m not sure if it counts as an agenda item, but I’m interested in
> recording process events using tracepoints, and I would like to know
> what are the best practices when attempting to do so.

Which tracepoint do you have in mind for your particular use case?

>
> Due to project goals (endpoint monitoring) one of the requirements is
> to avoid losing any event data.
>
> It is probably not a surprise given the limits imposed by the verifier,
> but I’m having trouble with variadic functions and long strings.
>
> The following are some events I would like to capture with reasonable
> success:
>
> String padding, causing the string I need to be truncated:
>
> bash -c “ /bin/rm -rf /home”

The recent kernel (5.3) added bounded loop support up to 1M
instructions. You can have a bounded loop like
   start = ...
   for (i = 0; i < 256 && start < end && start[i] == ' ')
 start++;
The verifier should be able to handle this properly.

In the old kernel, you will have to manually unroll the loop
and do the checking.

>
> Argument padding, causing the BPF program to not reach the last
> elements:
>
> sudo bash --verbose --verbose .. --verbose -c ‘printf
> “SELINUX=disabled\nSELINUXTYPE=targeted\n” > /etc/selinux/config’

Not sure what is the issue here.
Maybe you can describe your bpf program and tracepoint setup
with more details. So we can understand better about the problem.

>
> I thought about trying to (tail?) call additional BPF programs to work
> around the second issue, but I’m not sure how to proceed with the first
> one.
>
> Thanks!
>
> Alessandro Gario
>
> On Mon, 2019-08-05 at 20:55 -0700, Brenden Blanco wrote:
> > Hi All,
> >
> > We have the bi-weekly phone conference scheduled for two days from
> > now, does
> > anybody have a discussion topic to add to the agenda? As a reminder,
> > we are
> > planning to hold the meeting only if agenda items are proposed.
> >
> > Cheers,
> > Brenden
> >
> >
> >
>
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1770): https://lists.iovisor.org/g/iovisor-dev/message/1770
Mute This Topic: https://lists.iovisor.org/mt/32737902/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev][PATCHv5 RFC 1/3] BPF: New helper to obtainnamespace data from current task.

2019-08-05 Thread Yonghong Song
eeing this once but then it’s never repeated running in later 
> runs of the same sample code.
>
>
>
> [ 75.097213] CPU: 0 PID: 1382 Comm: bash Not tainted 
> 5.3.0-rc1patch12-00120-g5d8525007580-dirty #7
>
> [   75.097213] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
> Desktop Reference Platform, BIOS 6.00 04/13/2018
>
> [   75.097215] RIP: 0010:idr_alloc_u32+0xa8/0xc0
>
> [   75.097216] Code: 31 d2 48 89 e6 48 89 ef e8 e5 54 00 00 31 c0 48 8b 7c 24 
> 20 65 48 33 3c 25 28 00 00 00 75 1a 48 83 c4 28 5b 5d 41 5c 41 5d c3 <0f> 0b 
> 0d 04 00 80 00 89 47 04 e9 78 ff ff ff e8 14 ed 93 ff 0f 1f
>
> [   75.097216] RSP: 0018:ac39007c7d08 EFLAGS: 00010046
>
> [   75.097217] RAX:  RBX:  RCX: 
> 0001
>
> [   75.097218] RDX: ac39007c7d5c RSI: 0d99 RDI: 
> bea46b88
>
> [   75.097218] RBP: bea46b88 R08: 0a20 R09: 
> 
>
> [   75.097218] R10: 0120 R11:  R12: 
> 
>
> [   75.097219] R13: ac39007c7d5c R14: 012c R15: 
> bea46b80
>
> [   75.097219] FS:  7f7ba7691b40() GS:a0fef5c0() 
> knlGS:
>
> [   75.097220] CS:  0010 DS:  ES:  CR0: 80050033
>
> [   75.097220] CR2: 022eea64 CR3: 00022f498004 CR4: 
> 003606f0
>
> [   75.097246] DR0:  DR1:  DR2: 
> 
>
> [   75.097246] DR3:  DR6: fffe0ff0 DR7: 
> 0400
>
> [   75.097247] Call Trace:
>
> [   75.097251]  ? kmem_cache_alloc+0x161/0x600
>
> [   75.097252]  idr_alloc_cyclic+0x53/0xb0
>
> [   75.097254]  alloc_pid+0xc1/0x2a0
>
> [   75.097255]  copy_process+0xdc4/0x1c20
>
> [   75.097256]  _do_fork+0x74/0x360
>
> [   75.097257]  ? recalc_sigpending+0x17/0x50
>
> [   75.097258]  __x64_sys_clone+0x7f/0xa0
>
> [   75.097260]  do_syscall_64+0x55/0x130
>
> [   75.097261]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [   75.097263] RIP: 0033:0x7f7ba6d6538b
>
> [   75.097264] Code: db 45 85 f6 0f 85 95 01 00 00 64 4c 8b 04 25 10 00 00 00 
> 31 d2 4d 8d 90 d0 02 00 00 31 f6 bf 11 00 20 01 b8 38 00 00 00 0f 05 <48> 3d 
> 00 f0 ff ff 0f 87 de 00 00 00 85 c0 41 89 c5 0f 85 e5 00 00
>
> [   75.097265] RSP: 002b:7ffc9a6cf180 EFLAGS: 0246 ORIG_RAX: 
> 0038
>
> [   75.097265] RAX: ffda RBX:  RCX: 
> 7f7ba6d6538b
>
> [   75.097266] RDX:  RSI:  RDI: 
> 01200011
>
> [   75.097266] RBP: 7ffc9a6cf1b0 R08: 7f7ba7691b40 R09: 
> 0002
>
> [   75.097266] R10: 7f7ba7691e10 R11: 0246 R12: 
> 
>
> [   75.097267] R13: 004cd15e R14:  R15: 
> 
>
> [   75.097268] ---[ end trace 2937ef4d051e39cd ]---
>
>
>
> I need a couple of days to debug this and then submit the patch if nobody 
> finds anything odd in the code.
>
> I hope tomorrow I’ll have good news on this if nobody finds anything odd in 
> the code first.
>
>
>
> Bests
>
>
>
> From: Y Song
> Sent: 05 August 2019 16:16
> To: carlos antonio neira bustos
> Cc: Alexei Starovoitov; Tom Herbert via iovisor-dev; Eric Biederman
> Subject: Re: [iovisor-dev][PATCHv5 RFC 1/3] BPF: New helper to 
> obtainnamespace data from current task.
>
>
>
> Carlos,
>
>
>
> Any update on this? It would be good to submit the patch earlier for a
>
> release cycle
>
> so we got a chance it may be accepted in the current release cycle.
>
>
>
> On Fri, Jul 26, 2019 at 3:41 AM carlos antonio neira bustos
>
>  wrote:
>
> >
>
> > Thanks Yonghong!.
>
> > Yes, I was looking at filename_lookup today on how to do it, I'll work on 
> > that.
>
> > Thank you very much.
>
> >
>
> >
>
> > El vie., 26 de jul. de 2019 01:18, Yonghong Song  
> > escribió:
>
> >>
>
> >> On Thu, Jul 25, 2019 at 12:11 PM Alexei Starovoitov
>
> >>  wrote:
>
> >> >
>
> >> > On Thu, Jul 25, 2019 at 11:53 AM neirac  wrote:
>
> >> > >
>
> >> > > Hey Yonghong,
>
> >> > > I have changed getname_kernel interface to specify the allocation type,
>
> >> > > all previous callers preserve GFP_KERNEL as the allocation type, after 
> >> > > this change
>
> >> > > getname_kernel could be used with GFP_ATOMIC in ebpf helpers.
>
> >> > > If this change goes in, I could complete bpf_get_current_pidns_info 
> >> > > helper.
>
> >> > > Let me know your thoughts.
>
> >> >
>
> >> > please figure out how to get pidns_info w/o calling into 
> >> > getname_kernel().
>
> >>
>
> >> Looks like we can call filename_lookup() directly. We do not need to
>
> >> call getname_kernel().
>
> >> We know the exact path name. We can do a atomic allocation inside the 
> >> helper.
>
> >> Carlos, could you take a look along this line? In this way, we do not
>
> >> need to change
>
> >> any files under linux/fs.
>
> >>
>
> >> 
>
> >>
>
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1766): https://lists.iovisor.org/g/iovisor-dev/message/1766
Mute This Topic: https://lists.iovisor.org/mt/32734768/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev][PATCHv5 RFC 1/3] BPF: New helper to obtain namespace data from current task.

2019-08-05 Thread Yonghong Song
Carlos,

Any update on this? It would be good to submit the patch earlier for a
release cycle
so we got a chance it may be accepted in the current release cycle.

On Fri, Jul 26, 2019 at 3:41 AM carlos antonio neira bustos
 wrote:
>
> Thanks Yonghong!.
> Yes, I was looking at filename_lookup today on how to do it, I'll work on 
> that.
> Thank you very much.
>
>
> El vie., 26 de jul. de 2019 01:18, Yonghong Song  
> escribió:
>>
>> On Thu, Jul 25, 2019 at 12:11 PM Alexei Starovoitov
>>  wrote:
>> >
>> > On Thu, Jul 25, 2019 at 11:53 AM neirac  wrote:
>> > >
>> > > Hey Yonghong,
>> > > I have changed getname_kernel interface to specify the allocation type,
>> > > all previous callers preserve GFP_KERNEL as the allocation type, after 
>> > > this change
>> > > getname_kernel could be used with GFP_ATOMIC in ebpf helpers.
>> > > If this change goes in, I could complete bpf_get_current_pidns_info 
>> > > helper.
>> > > Let me know your thoughts.
>> >
>> > please figure out how to get pidns_info w/o calling into getname_kernel().
>>
>> Looks like we can call filename_lookup() directly. We do not need to
>> call getname_kernel().
>> We know the exact path name. We can do a atomic allocation inside the helper.
>> Carlos, could you take a look along this line? In this way, we do not
>> need to change
>> any files under linux/fs.
>>
>> 
>>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1764): https://lists.iovisor.org/g/iovisor-dev/message/1764
Mute This Topic: https://lists.iovisor.org/mt/31207586/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] the size of BPF_MAP_TYPE_PERCPU_ARRAY doesn't match the number of CPU

2019-08-01 Thread Yonghong Song
On Thu, Aug 1, 2019 at 7:02 PM  wrote:
>
> map defined:
> struct bpf_map_def SEC("maps/protocount") proto_count = {
> .type = BPF_MAP_TYPE_PERCPU_ARRAY,
> .key_size = sizeof(__u32),
> .value_size = sizeof(__u32),
> .max_entries = 1,
> };
>
> userspace code:
> int32_t * allocArray(size_t ln) { return (int32_t*) malloc(ln * 
> sizeof(int32_t)); }
> void sum(int32_t* arr, size_t ln, void* sum) {
> int32_t* s = (int32_t*)sum;
> int i=0;
> for(i=0;i printf("cpu %d: %d\n", i, arr[i]);
>     (*s)+=arr[i];
> }
> sum = (void*)s;
> }
>
>
> Hi Yonghong Song,
> The code is as above.
> When I use 32bit key, the count result is in CPU 0 and 2, when I use 64bit 
> key, the result is in CPU 0 and 1 as expect.

For array map, the key type must be 32bit int.
```
int array_map_alloc_check(union bpf_attr *attr)
{
bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
int numa_node = bpf_map_attr_numa_node(attr);

/* check sanity of attributes */
if (attr->max_entries == 0 || attr->key_size != 4 ||
attr->value_size == 0 ||
attr->map_flags & ~ARRAY_CREATE_FLAG_MASK ||
!bpf_map_flags_access_ok(attr->map_flags) ||
(percpu && numa_node != NUMA_NO_NODE))
return -EINVAL;

if (attr->value_size > KMALLOC_MAX_SIZE)
/* if value_size is bigger, the user space won't be able to
 * access the elements.
 */
return -E2BIG;

return 0;
}
```


I guess you mean value size. here.
When you got the values from kernel, the value size is rounded to 8. See
https://github.com/torvalds/linux/blob/master/kernel/bpf/arraymap.c#L81

So if you use 64bit value size, you will get correct value.
If you use 32bit value size, you should iterate through with int64_t
size, but only read the first 4 bytes for each iteration.

>
> Thanks
> forrest chen
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1759): https://lists.iovisor.org/g/iovisor-dev/message/1759
Mute This Topic: https://lists.iovisor.org/mt/32664687/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] the size of BPF_MAP_TYPE_PERCPU_ARRAY doesn't match the number of CPU

2019-08-01 Thread Yonghong Song
On Wed, Jul 31, 2019 at 1:34 AM  wrote:
>
> Hi all,
>
> I define a BPF_MAP_TYPE_PERCPU_ARRAY and use it to count packets in the xdp 
> program. When I read the map from userspace program, I find that the entry 
> number doesn't match local CPU numbers. I have 2 CPUs in my VM, but the count 
> result appear in index 0 and 2, my expectation is index 0 and 1.
> So why the counting result always appear in index 0 and 2 (or CPU 0 and 2), 
> even when my VM only have 2 cores. Does it because I run the program in VM?

Could you post how to reproduce the issue so people can help?
If you do provide reproducible steps, could you submit an issue to
help tracking?

Thanks,

>
> Thanks,
> forrest chen
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1757): https://lists.iovisor.org/g/iovisor-dev/message/1757
Mute This Topic: https://lists.iovisor.org/mt/32664687/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev][PATCHv5 RFC 1/3] BPF: New helper to obtain namespace data from current task.

2019-07-25 Thread Yonghong Song
On Thu, Jul 25, 2019 at 12:11 PM Alexei Starovoitov
 wrote:
>
> On Thu, Jul 25, 2019 at 11:53 AM neirac  wrote:
> >
> > Hey Yonghong,
> > I have changed getname_kernel interface to specify the allocation type,
> > all previous callers preserve GFP_KERNEL as the allocation type, after this 
> > change
> > getname_kernel could be used with GFP_ATOMIC in ebpf helpers.
> > If this change goes in, I could complete bpf_get_current_pidns_info helper.
> > Let me know your thoughts.
>
> please figure out how to get pidns_info w/o calling into getname_kernel().

Looks like we can call filename_lookup() directly. We do not need to
call getname_kernel().
We know the exact path name. We can do a atomic allocation inside the helper.
Carlos, could you take a look along this line? In this way, we do not
need to change
any files under linux/fs.

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1752): https://lists.iovisor.org/g/iovisor-dev/message/1752
Mute This Topic: https://lists.iovisor.org/mt/31207586/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev][PATCHv5 RFC 1/3] BPF: New helper to obtain namespace data from current task.

2019-07-23 Thread Yonghong Song
On Tue, Jul 23, 2019 at 7:37 AM Carlos Antonio Neira Bustos
 wrote:
>
> Hey Yonghong,
> I also needed to replace the call to __getname as allocations on the slab 
> cache
> could sleep, that makes getname_kernel the only name cache consumer that does 
> not block.
> What do you think ?

Maybe the following alternative approach is better to preserve the
existing behavior.
Change getname_kernel(filename) to getname_kernel(filename, flags) and
the *flags* will
be used for kmem_cache_alloc and kmalloc. Existing behavior won't
change and bpf helper
will call getname_kernel(filename, GFP_ATOMIC). More code changes but
cleaner and not
unnecessarily overusing kernel atomic memory region.


>
>
> From 639233cc05e96a81054b049d16d1b9193ae667e9 Mon Sep 17 00:00:00 2001
> From: Carlos 
> Date: Mon, 22 Jul 2019 17:50:02 -0400
> Subject: [PATCH] [bpf-next 1/1] BPF: getname_kernel don't block on
>  allocation.
>
> This patch calls directly kmem_cache_alloc to specify the allocation type as 
> GFP_ATOMIC,
> the reason is that currently ebpf cannot call functions that could sleep and 
> future work
> will need to call kern_path that currently could sleep on allocation.
> ---
>  fs/namei.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 209c51a5226c..a4269ca52503 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -122,6 +122,10 @@
>   * PATH_MAX includes the nul terminator --RR.
>   */
>
> +/* [July 2019 neirac] getname_kernel changed allocation type to GFP_ATOMIC,
> + * as ebpf needs to use kern_path.
> + */
> +
>  #define EMBEDDED_NAME_MAX  (PATH_MAX - offsetof(struct filename, iname))
>
>  struct filename *
> @@ -215,7 +219,7 @@ getname_kernel(const char * filename)
> struct filename *result;
> int len = strlen(filename) + 1;
>
> -   result = __getname();
> +   result = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
> if (unlikely(!result))
> return ERR_PTR(-ENOMEM);
>
> @@ -225,7 +229,7 @@ getname_kernel(const char * filename)
> const size_t size = offsetof(struct filename, iname[1]);
> struct filename *tmp;
>
> -   tmp = kmalloc(size, GFP_KERNEL);
> +   tmp = kmalloc(size, GFP_ATOMIC);
> if (unlikely(!tmp)) {
> __putname(result);
> return ERR_PTR(-ENOMEM);
> --
> 2.11.0
>
>
>
> On Fri, Jul 19, 2019 at 09:06:54AM -0700, Y Song wrote:
> > On Fri, Jul 19, 2019 at 6:02 AM cnb  wrote:
> > >
> > > kern_path is the one, should I move this and dependencies to a new file 
> > > called bfp_namei.c, in there I'll change it to use GFP_ATOMIC or create a 
> > > new function to replace kern_path. What do you think?
> >
> > I think adding GFP_ATOMIC to function getname_kernel() is reasonable.
> >
> > const size_t size = offsetof(struct filename, iname[1]);
> > struct filename *tmp;
> >
> > tmp = kmalloc(size, GFP_KERNEL);
> >
> > /* fs/open.c */
> > struct audit_names;
> > struct filename {
> > const char  *name;  /* pointer to actual string */
> > const __user char   *uptr;  /* original userland pointer */
> > int refcnt;
> > struct audit_names  *aname;
> > const char  iname[];
> > };
> >
> > The size is pretty small. on x64, it should be 25 bytes. If the system
> > cannot honor
> > this 25 bytes in non-blocking mode, it is probably already in stress,
> > bpf subsystem itself may not work reliably as it uses some GFP_ATOMIC
> > as well for some map (e.g., update) helpers.
> >
> > So let us have this patch (adding GFP_ATOMIC) in the first series. We can
> > think of alternatives (e.g., separate functions)  if anybody objects.
> >
> > >
> > > El jue., 18 de jul. de 2019 22:10, Y Song  escribió:
> > >>
> > >> On Thu, Jul 18, 2019 at 6:21 PM carlos antonio neira bustos
> > >>  wrote:
> > >> >
> > >> > Hi,
> > >> > Yes, I'm still interested as I need this capability at $WORK, but 
> > >> > haven't had the time to re-write the dentry functions we need, as 
> > >> > currently they could sleep.
> > >> > I could resume work on this next monday when I get back from vacations.
> > >> > I really appreciate your help and guidance on this.
> > >>
> > >> Great. Which dentry function are you referring to? dput? kern_path
> > >> cannot be used as it uses
> > >> kmalloc without GFP_ATOMIC. See 
> > >> https://github.com/iovisor/bcc/issues/1329.
> > >>
> > >> >
> > >> > Bests
> > >> >
> > >> >
> > >> > El jue., 18 de jul. de 2019 20:29, Y Song  
> > >> > escribió:
> > >> >>
> > >> >> Hi, Carlos,
> > >> >>
> > >> >> Are you still interested in upstreaming this patch? Looks like there
> > >> >> still a desire
> > >> >> to make bcc work inside the containers.
> > >> >>
> > >> >> The bpf-next will open next week. If you would like, could you submit 
> > >> >> again?
> > >> >> I will review 

Re: [iovisor-dev] [PATCH] Report proper module on kernel backtrace

2019-07-22 Thread Yonghong Song
Jiri,

Thanks for fixing the issue. Could you submit the patch
as a pull request? The pull request is the standard bcc
review mechanism.

Yonghong

On Mon, Jul 22, 2019 at 1:36 AM Jiri Olsa  wrote:
>
> Raghavendra Rao reported that memleak does not display
> proper name of the related kernel module, but just the
> "kernel" string, like here for xfs module functions:
>
>   131072 bytes in 4 allocations from stack
>   ..
>   bvec_alloc+0x92 [kernel]
>   bio_alloc_bioset+0x13f [kernel]
>   xfs_add_to_ioend+0x2df [kernel]
>   xfs_do_writepage+0x148 [kernel]
>   write_cache_pages+0x171 [kernel]
>   xfs_vm_writepages+0x59 [kernel]
>   do_writepages+0x43 [kernel]
>   ...
>
> The kernel resolver code is parsing /proc/kallsyms, which
> already has the module information in.
>
> This patch is adding support to parse the module info
> from /proc/kallsyms and initialize the module with
> proper value.
>
> Above memleak backtrace now looks like:
>
>   131072 bytes in 4 allocations from stack
>
>   bvec_alloc+0x92 [kernel]
>   bio_alloc_bioset+0x13f [kernel]
>   xfs_add_to_ioend+0x2df [xfs]
>   xfs_do_writepage+0x148 [xfs]
>   write_cache_pages+0x171 [kernel]
>   xfs_vm_writepages+0x59 [xfs]
>   do_writepages+0x43 [kernel]
>   ...
>
> Reported-by: Raghavendra Rao 
> Signed-off-by: Jiri Olsa 
> ---
>  src/cc/bcc_proc.c  | 20 ++--
>  src/cc/bcc_proc.h  |  2 +-
>  src/cc/bcc_syms.cc |  8 
>  src/cc/libbpf  |  2 +-
>  src/cc/syms.h  |  5 +++--
>  tests/cc/test_c_api.cc |  2 +-
>  6 files changed, 28 insertions(+), 11 deletions(-)
>
> diff --git a/src/cc/bcc_proc.c b/src/cc/bcc_proc.c
> index c8f4041ed27b..d6e17282eed6 100644
> --- a/src/cc/bcc_proc.c
> +++ b/src/cc/bcc_proc.c
> @@ -208,7 +208,7 @@ done:
>
>  int bcc_procutils_each_ksym(bcc_procutils_ksymcb callback, void *payload) {
>char line[2048];
> -  char *symname, *endsym;
> +  char *symname, *endsym, *modname, *endmod = NULL;
>FILE *kallsyms;
>unsigned long long addr;
>
> @@ -237,7 +237,23 @@ int bcc_procutils_each_ksym(bcc_procutils_ksymcb 
> callback, void *payload) {
>  while (*endsym && !isspace(*endsym)) endsym++;
>  *endsym = '\0';
>
> -callback(symname, addr, payload);
> +// Parse module name if it's available
> +modname = endsym + 1;
> +while (*modname && isspace(*endsym)) modname++;
> +
> +if (*modname && *modname == '[') {
> +  endmod = ++modname;
> +  while (*endmod && *endmod != ']') endmod++;
> +  if (*endmod)
> +*(endmod) = '\0';
> +  else
> +endmod = NULL;
> +}
> +
> +if (!endmod)
> +  modname = "kernel";
> +
> +callback(symname, modname, addr, payload);
>}
>
>fclose(kallsyms);
> diff --git a/src/cc/bcc_proc.h b/src/cc/bcc_proc.h
> index a56f680d5f56..368f27d9e0c3 100644
> --- a/src/cc/bcc_proc.h
> +++ b/src/cc/bcc_proc.h
> @@ -42,7 +42,7 @@ typedef struct mod_info {
>  typedef int (*bcc_procutils_modulecb)(mod_info *, int, void *);
>
>  // Symbol name, address, payload
> -typedef void (*bcc_procutils_ksymcb)(const char *, uint64_t, void *);
> +typedef void (*bcc_procutils_ksymcb)(const char *, const char *, uint64_t, 
> void *);
>
>  char *bcc_procutils_which_so(const char *libname, int pid);
>  char *bcc_procutils_which(const char *binpath);
> diff --git a/src/cc/bcc_syms.cc b/src/cc/bcc_syms.cc
> index a946fe1a1048..95ec8bad398a 100644
> --- a/src/cc/bcc_syms.cc
> +++ b/src/cc/bcc_syms.cc
> @@ -47,9 +47,9 @@ bool ProcStat::is_stale() {
>  ProcStat::ProcStat(int pid)
>  : procfs_(tfm::format("/proc/%d/exe", pid)), inode_(getinode_()) {}
>
> -void KSyms::_add_symbol(const char *symname, uint64_t addr, void *p) {
> +void KSyms::_add_symbol(const char *symname, const char *modname, uint64_t 
> addr, void *p) {
>KSyms *ks = static_cast(p);
> -  ks->syms_.emplace_back(symname, addr);
> +  ks->syms_.emplace_back(symname, modname, addr);
>  }
>
>  void KSyms::refresh() {
> @@ -67,13 +67,13 @@ bool KSyms::resolve_addr(uint64_t addr, struct bcc_symbol 
> *sym, bool demangle) {
>if (syms_.empty())
>  goto unknown_symbol;
>
> -  it = std::upper_bound(syms_.begin(), syms_.end(), Symbol("", addr));
> +  it = std::upper_bound(syms_.begin(), syms_.end(), Symbol("", "", addr));
>if (it != syms_.begin()) {
>  it--;
>  sym->name = (*it).name.c_str();
>  if (demangle)
>sym->demangle_name = sym->name;
> -sym->module = "kernel";
> +sym->module = (*it).mod.c_str();
>  sym->offset = addr - (*it).addr;
>  return true;
>}
> diff --git a/src/cc/libbpf b/src/cc/libbpf
> index 45ad8626010b..33b017498543 16
> --- a/src/cc/libbpf
> +++ b/src/cc/libbpf
> @@ -1 +1 @@
> -Subproject commit 45ad8626010b384306bc277c46a4107f21414b22
> +Subproject commit 33b017498543167b65fa948d3a0267794c78787f
> diff --git a/src/cc/syms.h b/src/cc/syms.h
> 

Re: [iovisor-dev][PATCHv5 RFC 1/3] BPF: New helper to obtain namespace data from current task.

2019-07-19 Thread Yonghong Song
On Fri, Jul 19, 2019 at 6:02 AM cnb  wrote:
>
> kern_path is the one, should I move this and dependencies to a new file 
> called bfp_namei.c, in there I'll change it to use GFP_ATOMIC or create a new 
> function to replace kern_path. What do you think?

I think adding GFP_ATOMIC to function getname_kernel() is reasonable.

const size_t size = offsetof(struct filename, iname[1]);
struct filename *tmp;

tmp = kmalloc(size, GFP_KERNEL);

/* fs/open.c */
struct audit_names;
struct filename {
const char  *name;  /* pointer to actual string */
const __user char   *uptr;  /* original userland pointer */
int refcnt;
struct audit_names  *aname;
const char  iname[];
};

The size is pretty small. on x64, it should be 25 bytes. If the system
cannot honor
this 25 bytes in non-blocking mode, it is probably already in stress,
bpf subsystem itself may not work reliably as it uses some GFP_ATOMIC
as well for some map (e.g., update) helpers.

So let us have this patch (adding GFP_ATOMIC) in the first series. We can
think of alternatives (e.g., separate functions)  if anybody objects.

>
> El jue., 18 de jul. de 2019 22:10, Y Song  escribió:
>>
>> On Thu, Jul 18, 2019 at 6:21 PM carlos antonio neira bustos
>>  wrote:
>> >
>> > Hi,
>> > Yes, I'm still interested as I need this capability at $WORK, but haven't 
>> > had the time to re-write the dentry functions we need, as currently they 
>> > could sleep.
>> > I could resume work on this next monday when I get back from vacations.
>> > I really appreciate your help and guidance on this.
>>
>> Great. Which dentry function are you referring to? dput? kern_path
>> cannot be used as it uses
>> kmalloc without GFP_ATOMIC. See https://github.com/iovisor/bcc/issues/1329.
>>
>> >
>> > Bests
>> >
>> >
>> > El jue., 18 de jul. de 2019 20:29, Y Song  escribió:
>> >>
>> >> Hi, Carlos,
>> >>
>> >> Are you still interested in upstreaming this patch? Looks like there
>> >> still a desire
>> >> to make bcc work inside the containers.
>> >>
>> >> The bpf-next will open next week. If you would like, could you submit 
>> >> again?
>> >> I will review the patch on bpf-next as well to make sure we made
>> >> forward progress.
>> >>
>> >> Please let me know. Thanks!
>> >>
>> >> Yonghong
>> >>
>> >> On Tue, Apr 16, 2019 at 6:31 PM Carlos Antonio Neira Bustos
>> >>  wrote:
>> >> >
>> >> > As a bpf program cannot sleep, I needed to add a spinlock to kern_path, 
>> >> > as
>> >> > it calls getname_kernel() which may sleep.
>> >> > The inode is accessed directly, as we are just interested in the 
>> >> > inode's s_dev.
>> >> > Let me know if this approach is the correct one.
>> >> > ---
>> >> > From 35b7bfcbf6524ec807f107302209a0fb07614cc8 Mon Sep 17 00:00:00 2001
>> >> > From: Carlos 
>> >> > Date: Tue, 16 Apr 2019 17:10:46 -0400
>> >> > Subject: [PATCH] [PATCH bpf-next 1/3] BPF: New helper to obtain 
>> >> > namespace data
>> >> >   from current task
>> >> >
>> >> > This helper obtains the active namespace from current and returns pid, 
>> >> > tgid,
>> >> > device and namespace id as seen from that namespace, allowing to 
>> >> > instrument
>> >> > a process inside a container.
>> >> > Device is read from /proc/self/ns/pid, as in the future it's possible 
>> >> > that
>> >> > different pid_ns files may belong to different devices, according
>> >> > to the discussion between Eric Biederman and Yonghong in 2017 linux 
>> >> > plumbers
>> >> > conference.
>> >> > Currently bpf_get_current_pid_tgid(), is used to do pid filtering in 
>> >> > bcc's
>> >> > scripts but this helper returns the pid as seen by the root namespace 
>> >> > which is
>> >> > fine when a bcc script is not executed inside a container.
>> >> > When the process of interest is inside a container, pid filtering will 
>> >> > not work
>> >> > if bpf_get_current_pid_tgid() is used. This helper addresses this 
>> >> > limitation
>> >> > returning the pid as it's seen by the current namespace where the 
>> >> > script is
>> >> > executing.
>> >> >
>> >> > This helper has the same use cases as bpf_get_current_pid_tgid() as it 
>> >> > can be
>> >> > used to do pid filtering even inside a container.
>> >> >
>> >> > For example a bcc script using bpf_get_current_pid_tgid() 
>> >> > (tools/funccount.py):
>> >> >
>> >> > u32 pid = bpf_get_current_pid_tgid() >> 32;
>> >> > if (pid != )
>> >> > return 0;
>> >> > Could be modified to use bpf_get_current_pidns_info() as follows:
>> >> >
>> >> > struct bpf_pidns pidns;
>> >> > bpf_get_current_pidns_info(, sizeof(struct bpf_pidns));
>> >> > u32 pid = pidns.tgid;
>> >> > u32 nsid = pidns.nsid;
>> >> > if ((pid != ) && (nsid != 
>> >> > ))
>> >> > return 0;
>> >> >
>> >> > To find out the name PID 

Re: [iovisor-dev][PATCHv5 RFC 1/3] BPF: New helper to obtain namespace data from current task.

2019-07-18 Thread Yonghong Song
On Thu, Jul 18, 2019 at 6:21 PM carlos antonio neira bustos
 wrote:
>
> Hi,
> Yes, I'm still interested as I need this capability at $WORK, but haven't had 
> the time to re-write the dentry functions we need, as currently they could 
> sleep.
> I could resume work on this next monday when I get back from vacations.
> I really appreciate your help and guidance on this.

Great. Which dentry function are you referring to? dput? kern_path
cannot be used as it uses
kmalloc without GFP_ATOMIC. See https://github.com/iovisor/bcc/issues/1329.

>
> Bests
>
>
> El jue., 18 de jul. de 2019 20:29, Y Song  escribió:
>>
>> Hi, Carlos,
>>
>> Are you still interested in upstreaming this patch? Looks like there
>> still a desire
>> to make bcc work inside the containers.
>>
>> The bpf-next will open next week. If you would like, could you submit again?
>> I will review the patch on bpf-next as well to make sure we made
>> forward progress.
>>
>> Please let me know. Thanks!
>>
>> Yonghong
>>
>> On Tue, Apr 16, 2019 at 6:31 PM Carlos Antonio Neira Bustos
>>  wrote:
>> >
>> > As a bpf program cannot sleep, I needed to add a spinlock to kern_path, as
>> > it calls getname_kernel() which may sleep.
>> > The inode is accessed directly, as we are just interested in the inode's 
>> > s_dev.
>> > Let me know if this approach is the correct one.
>> > ---
>> > From 35b7bfcbf6524ec807f107302209a0fb07614cc8 Mon Sep 17 00:00:00 2001
>> > From: Carlos 
>> > Date: Tue, 16 Apr 2019 17:10:46 -0400
>> > Subject: [PATCH] [PATCH bpf-next 1/3] BPF: New helper to obtain namespace 
>> > data
>> >   from current task
>> >
>> > This helper obtains the active namespace from current and returns pid, 
>> > tgid,
>> > device and namespace id as seen from that namespace, allowing to instrument
>> > a process inside a container.
>> > Device is read from /proc/self/ns/pid, as in the future it's possible that
>> > different pid_ns files may belong to different devices, according
>> > to the discussion between Eric Biederman and Yonghong in 2017 linux 
>> > plumbers
>> > conference.
>> > Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
>> > scripts but this helper returns the pid as seen by the root namespace 
>> > which is
>> > fine when a bcc script is not executed inside a container.
>> > When the process of interest is inside a container, pid filtering will not 
>> > work
>> > if bpf_get_current_pid_tgid() is used. This helper addresses this 
>> > limitation
>> > returning the pid as it's seen by the current namespace where the script is
>> > executing.
>> >
>> > This helper has the same use cases as bpf_get_current_pid_tgid() as it can 
>> > be
>> > used to do pid filtering even inside a container.
>> >
>> > For example a bcc script using bpf_get_current_pid_tgid() 
>> > (tools/funccount.py):
>> >
>> > u32 pid = bpf_get_current_pid_tgid() >> 32;
>> > if (pid != )
>> > return 0;
>> > Could be modified to use bpf_get_current_pidns_info() as follows:
>> >
>> > struct bpf_pidns pidns;
>> > bpf_get_current_pidns_info(, sizeof(struct bpf_pidns));
>> > u32 pid = pidns.tgid;
>> > u32 nsid = pidns.nsid;
>> > if ((pid != ) && (nsid != ))
>> > return 0;
>> >
>> > To find out the name PID namespace id of a process, you could use this 
>> > command:
>> >
>> > $ ps -h -o pidns -p 
>> >
>> > Or this other command:
>> >
>> > $ ls -Li /proc//ns/pid
>> >
>> > Signed-off-by: Carlos Antonio Neira Bustos 
>> > ---
>> >  include/linux/bpf.h  |  1 +
>> >  include/uapi/linux/bpf.h | 25 ++-
>> >  kernel/bpf/core.c|  1 +
>> >  kernel/bpf/helpers.c | 65 
>> > 
>> >  kernel/trace/bpf_trace.c |  2 ++
>> >  5 files changed, 93 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> > index e4d4c1771ab0..4393f8f088cc 100644
>> > --- a/include/linux/bpf.h
>> > +++ b/include/linux/bpf.h
>> > @@ -987,6 +987,7 @@ extern const struct bpf_func_proto 
>> > bpf_sk_redirect_map_proto;
>> >  extern const struct bpf_func_proto bpf_spin_lock_proto;
>> >  extern const struct bpf_func_proto bpf_spin_unlock_proto;
>> >  extern const struct bpf_func_proto bpf_get_local_storage_proto;
>> > +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
>> >
>> >  /* Shared helpers among cBPF and eBPF. */
>> >  void bpf_user_rnd_init_once(void);
>> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> > index 31a27dd337dc..7bf457875c31 100644
>> > --- a/include/uapi/linux/bpf.h
>> > +++ b/include/uapi/linux/bpf.h
>> > @@ -2500,6 +2500,18 @@ union bpf_attr {
>> >   * Return
>> >   * 0 if iph and th are a valid SYN cookie ACK, or a negative 
>> > error
>> >   * otherwise.
>> > + *
>> > + * int bpf_get_current_pidns_info(struct bpf_pidns_info 

Re: [iovisor-dev][PATCHv5 RFC 1/3] BPF: New helper to obtain namespace data from current task.

2019-07-18 Thread Yonghong Song
Hi, Carlos,

Are you still interested in upstreaming this patch? Looks like there
still a desire
to make bcc work inside the containers.

The bpf-next will open next week. If you would like, could you submit again?
I will review the patch on bpf-next as well to make sure we made
forward progress.

Please let me know. Thanks!

Yonghong

On Tue, Apr 16, 2019 at 6:31 PM Carlos Antonio Neira Bustos
 wrote:
>
> As a bpf program cannot sleep, I needed to add a spinlock to kern_path, as
> it calls getname_kernel() which may sleep.
> The inode is accessed directly, as we are just interested in the inode's 
> s_dev.
> Let me know if this approach is the correct one.
> ---
> From 35b7bfcbf6524ec807f107302209a0fb07614cc8 Mon Sep 17 00:00:00 2001
> From: Carlos 
> Date: Tue, 16 Apr 2019 17:10:46 -0400
> Subject: [PATCH] [PATCH bpf-next 1/3] BPF: New helper to obtain namespace data
>   from current task
>
> This helper obtains the active namespace from current and returns pid, tgid,
> device and namespace id as seen from that namespace, allowing to instrument
> a process inside a container.
> Device is read from /proc/self/ns/pid, as in the future it's possible that
> different pid_ns files may belong to different devices, according
> to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
> conference.
> Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
> scripts but this helper returns the pid as seen by the root namespace which is
> fine when a bcc script is not executed inside a container.
> When the process of interest is inside a container, pid filtering will not 
> work
> if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
> returning the pid as it's seen by the current namespace where the script is
> executing.
>
> This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
> used to do pid filtering even inside a container.
>
> For example a bcc script using bpf_get_current_pid_tgid() 
> (tools/funccount.py):
>
> u32 pid = bpf_get_current_pid_tgid() >> 32;
> if (pid != )
> return 0;
> Could be modified to use bpf_get_current_pidns_info() as follows:
>
> struct bpf_pidns pidns;
> bpf_get_current_pidns_info(, sizeof(struct bpf_pidns));
> u32 pid = pidns.tgid;
> u32 nsid = pidns.nsid;
> if ((pid != ) && (nsid != ))
> return 0;
>
> To find out the name PID namespace id of a process, you could use this 
> command:
>
> $ ps -h -o pidns -p 
>
> Or this other command:
>
> $ ls -Li /proc//ns/pid
>
> Signed-off-by: Carlos Antonio Neira Bustos 
> ---
>  include/linux/bpf.h  |  1 +
>  include/uapi/linux/bpf.h | 25 ++-
>  kernel/bpf/core.c|  1 +
>  kernel/bpf/helpers.c | 65 
> 
>  kernel/trace/bpf_trace.c |  2 ++
>  5 files changed, 93 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index e4d4c1771ab0..4393f8f088cc 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -987,6 +987,7 @@ extern const struct bpf_func_proto 
> bpf_sk_redirect_map_proto;
>  extern const struct bpf_func_proto bpf_spin_lock_proto;
>  extern const struct bpf_func_proto bpf_spin_unlock_proto;
>  extern const struct bpf_func_proto bpf_get_local_storage_proto;
> +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
>
>  /* Shared helpers among cBPF and eBPF. */
>  void bpf_user_rnd_init_once(void);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 31a27dd337dc..7bf457875c31 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2500,6 +2500,18 @@ union bpf_attr {
>   * Return
>   * 0 if iph and th are a valid SYN cookie ACK, or a negative 
> error
>   * otherwise.
> + *
> + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 
> size_of_pidns)
> + * Description
> + * Copies into *pidns* pid, namespace id and tgid as seen by the
> + * current namespace and also device from /proc/self/ns/pid.
> + * *size_of_pidns* must be the size of *pidns*
> + *
> + * This helper is used when pid filtering is needed inside a
> + * container as bpf_get_current_tgid() helper returns always the
> + * pid id as seen by the root namespace.
> + * Return
> + * 0 on success -EINVAL on error.
>   */
>  #define __BPF_FUNC_MAPPER(FN)  \
> FN(unspec), \
> @@ -2602,7 +2614,8 @@ union bpf_attr {
> FN(skb_ecn_set_ce), \
> FN(get_listener_sock),  \
> FN(skc_lookup_tcp), \
> -   FN(tcp_check_syncookie),
> +   FN(tcp_check_syncookie),\
> +   FN(get_current_pidns_info),
>
>  /* integer value in 

Re: [iovisor-dev] math between pkt pointer and register with unbounded min value is not allowed #verifier

2019-07-17 Thread Yonghong Song
Hi, Simon,

Finally got some time to deep dive into the verifier for this issue.
I filed an issue https://github.com/iovisor/bcc/issues/2463 with
possible suggestion to add a new checksum calculation helper
for xdp. I forgot what is your ultimate workaround for this issue.
Could you comment on the issue?

Thanks!

Yonghong

On Thu, Mar 7, 2019 at 9:37 AM Y Song  wrote:
>
> On Wed, Mar 6, 2019 at 7:08 AM  wrote:
> >
> > I'm playing with bcc to prototype an UDP load balancer.
> >
> > I'm facing an issue that I didn't succeed to understand...
> >
> > In my code I tried to validate my UDP packet using code like this :
> >
> > struct udphdr *udp;
> > udp = iph + 1;
> > if (udp + 1 > data_end)
> > return XDP_DROP;
> > __u16 udp_len = bpf_ntohs(udp->len);
> > //__u16 udp_len = 8;
> > if (udp_len < 8)
> > return XDP_DROP;
> > if (udp_len > 512) // TODO use a more approriate max value
> > return XDP_DROP;
> > if ((void *) udp + udp_len > data_end)
> > return XDP_DROP;
> >
> > And the verifier does not like it ..
>
> This is caused by compiler optimizations.
>
> >
> > 28: (71) r2 = *(u8 *)(r7 +23)
> > 29: (b7) r0 = 2
> > 30: (55) if r2 != 0x11 goto pc+334
> >  R0=inv2 R1=pkt_end(id=0,off=0,imm=0) R2=inv17 R3=inv5 
> > R6=ctx(id=0,off=0,imm=0) R7=pkt(id=0,off=0,r=34,imm=0) 
> > R8=pkt(id=0,off=34,r=34,imm=0) R9=pkt(id=0,off=14,r=34,imm=0) 
> > R10=fp0,call_-1
> > 31: (bf) r2 = r8
> > 32: (07) r2 += 8
> > 33: (b7) r0 = 1
> > 34: (2d) if r2 > r1 goto pc+330
> >  R0=inv1 R1=pkt_end(id=0,off=0,imm=0) R2=pkt(id=0,off=42,r=42,imm=0) 
> > R3=inv5 R6=ctx(id=0,off=0,imm=0) R7=pkt(id=0,off=0,r=42,imm=0) 
> > R8=pkt(id=0,off=34,r=42,imm=0) R9=pkt(id=0,off=14,r=42,imm=0) 
> > R10=fp0,call_-1
> > 35: (69) r3 = *(u16 *)(r7 +38)
> > 36: (dc) r3 = be16 r3
>
> r3 get the value from memory, its value could be any one as permitted
> by the type.
>
> > 37: (bf) r2 = r3
> > 38: (07) r2 += -8
> > 39: (57) r2 &= 65535
> > 40: (b7) r0 = 1
> > 41: (25) if r2 > 0x1f8 goto pc+323
>
> test is done by r2. We indeed get better range for r2 (below:
> R2=inv(id=0,umax_value=504,var_off=(0x0; 0x1ff)) )
> but r3 range is not tightened.
>
> >  R0=inv1 R1=pkt_end(id=0,off=0,imm=0) 
> > R2=inv(id=0,umax_value=504,var_off=(0x0; 0x1ff)) R3=inv(id=0) 
> > R6=ctx(id=0,off=0,imm=0) R7=pkt(id=0,off=0,r=42,imm=0) 
> > R8=pkt(id=0,off=34,r=42,imm=0) R9=pkt(id=0,off=14,r=42,imm=0) 
> > R10=fp0,call_-1
> > 42: (bf) r2 = r7
> > 43: (0f) r2 += r3
> > math between pkt pointer and register with unbounded min value is not 
> > allowed
>
> Here, we use r3 to do arith and the verifier not happy. If we use the
> tightened old r2, it may work.
> The compiler does the right thing, just verifier is not advanced enough.
>
> >
> > I'm pretty sure the issue is about udp_len, that's why I tried to validate 
> > its value before to use it ... but without success...
> > When I set udp_len to 8 (just for testing) this seems to works. Any idea 
> > about that ?
>
> Yes, you will need some source workaround. You could try below (untested):
>
> if (udp_len > 512) // TODO use a more approriate max value
>   return XDP_DROP;
> +  udp_len = udp_len & 0x1ff;
> if ((void *) udp + udp_len > data_end)
>  return XDP_DROP;
>
> >
> > Full code is available here : 
> > https://gist.github.com/sbernard31/d4fee7518a1ff130452211c0d355b3f7
> >
> > (I'm using python-bpfcc  0.8.0-4 from debian sid with a 4.19.12 kernel)
> > (I don't know if this is the right place for this kind of question, )
> >
> > 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1735): https://lists.iovisor.org/g/iovisor-dev/message/1735
Mute This Topic: https://lists.iovisor.org/mt/30285987/21656
Mute #verifier: https://lists.iovisor.org/mk?hashtag=verifier=2590197
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Behavior of bpf_obj_get

2019-05-30 Thread Yonghong Song
On Thu, May 30, 2019 at 11:26 AM Adam Drescher
 wrote:
>
> Thank you for your answer, that makes sense.
>
> This leads me to a follow up question: is there a standard way to see
> if a pinned map has been reloaded, without closing and re-opening the
> map file every time to check? From your answer, now I know we cannot

If you take a reference count for the pinned map, the map cannot be
removed.
If you are talking about the content change of a pinned map, the applicaiton
has to check the map key/values to find it. Just like shared memory.

> do this by monitoring a change in the map's file descriptor. We cannot
> use the map ID from bpf_obj_get_info_by_fd, as the info still
> corresponds to the old map. As far as I can tell, we don't have access
> to the BPF object or program pointers in a standalone userspace
> daemon.
You got a map fd, you should use map fd to do standard map lookup/insert/delete
operations.

>
> On Thu, May 30, 2019 at 12:00 PM Y Song  wrote:
> >
> > On Thu, May 30, 2019 at 8:40 AM Adam Drescher  
> > wrote:
> > >
> > > I am seeing unexpected behavior from bpf_obj_get, although this is
> > > likely due to my inexperience with BPF.
> > >
> > > In a loader program, I create a pinned map at
> > > "sys/fs/bpf/test/xdp_stats_map". In a separate statistics program, I
> > > access the pinned map via file descriptor -- I got the file descriptor
> > > from a call to bpf_obj_get and provided the pathname above.
> > >
> > > In a polling loop, I call bpf_obj_get on the same pathname and compare
> > > this value to the original. However, instead of getting the same file
> > > descriptor, the file descriptor returned by bpf_obj_get increments by
> > > 1 each invocation (so it returns 4, 5, 6, 7, ...). I am not doing
> > > anything externally to reload the map or change the file descriptor.
> > > Why is this happening? Looking at samples/bpf/fds_example.c as example
> > > usage of bpf_obj_get, I would expect to get the same file descriptor
> > > back. Any ideas?
> >
> > No, you won't get the same file descriptor although they pointing to
> > the same map.
> >   - the original map fd holds a reference count in the kernel.
> >   - bpf_obj_get returns a new fd and map reference count is increased
> > by 1 as well
> >
> > You can close the original map fd and the fd returned by bpf_obj_get() would
> > still be valid, and vice verse.
> >
> > >
> > > Relevant code (filename has already been populated elsewhere):
> > >
> > > #ifndef PATH_MAX
> > > #define PATH_MAX4096
> > > #endif
> > > char filename[PATH_MAX];
> > >
> > > static void stats_poll(int map_fd, __u32 map_type, int interval)
> > > {
> > > int fd;
> > > while (1) {
> > > fd = bpf_obj_get(filename);
> > > printf("filename: %s\n", filename);
> > > printf("bpf_obj_get: %d - map_fd: %d\n", fd, map_fd);
> > > sleep(interval);
> > > }
> > > }
> > >
> > > Relevant output:
> > >  - BPF map (bpf_map_type:6) id:42 name:xdp_stats_map key_size:4
> > > value_size:16 max_entries:5
> > > filename: /sys/fs/bpf/test/xdp_stats_map
> > > bpf_obj_get: 4 - map_fd: 3
> > >
> > > filename: /sys/fs/bpf/test/xdp_stats_map
> > > bpf_obj_get: 5 - map_fd: 3
> > >
> > > filename: /sys/fs/bpf/test/xdp_stats_map
> > > bpf_obj_get: 6 - map_fd: 3
> > >
> > > 
> > >

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1703): https://lists.iovisor.org/g/iovisor-dev/message/1703
Mute This Topic: https://lists.iovisor.org/mt/31870056/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Do you know if I can use a bpf file from bcc for snort ?

2019-05-27 Thread Yonghong Song
On Mon, May 27, 2019 at 4:04 AM Dorian ROSSE  wrote:
>
> Hello everybody,
>
>
> Do you know if I can use a bpf file from bcc for snort ?

You mean a bpf program, right?
Do you mean to have a bpf program to do L7 parsing?
If simple one, it should work. See bcc/examples/networking/http_filter/*.
But since kernel verifier currently does not support loops,
the complex options may not be supported.

>
> Thank you in advance to answer if I can and how to do my ask,
>
> Regards.
>
>
> Dorian ROSSE.
>
>
>
> Provenance : Courrier pour Windows 10
>
>
>
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1686): https://lists.iovisor.org/g/iovisor-dev/message/1686
Mute This Topic: https://lists.iovisor.org/mt/31808311/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Facing an error while compiling for bpf using clang

2019-05-22 Thread Yonghong Song
 bpf needs 3.7.1 and higher version. Later clang (>= 3.7.1) has more features.

On Tue, May 21, 2019 at 11:13 PM Prashanth Fernando
 wrote:
>
> Hi,
>
> The clang version I am using is 3.4.2
>
> clang --version
> clang version 3.4.2 (tags/RELEASE_34/dot2-final)
> Target: x86_64-redhat-linux-gnu
> Thread model: posix
>
> Tks,
> PRashanth
>
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1677): https://lists.iovisor.org/g/iovisor-dev/message/1677
Mute This Topic: https://lists.iovisor.org/mt/31716444/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] R? min value is negative, either use unsigned or 'var &= const' #verifier

2019-04-11 Thread Yonghong Song
On Thu, Apr 11, 2019 at 8:37 AM Simon  wrote:
>
> I finally discover that checksum can be calculated via incremental update. 
> (see RFC 1624)
>
> Using it, I didn't have to deal with dynamic sized payload and so no more 
> issue with the verifier.

Glad you find a solution!

>
> So I go back to use bcc :)
>
> Again, Thx a lot Yonghong for your time !
>
> Please don't forget to keep me in touch about that even if I'm not impacted 
> anymore, I'm curious to know how you move forward on this !

Sure.

>
> (By the way did you get the point about the error just above, is the verifier 
> able to understand this kind of dynamic check ?)

Verifier understands some dynamic check if these dynamic check is
resolved to be constant vs. variable or constant vs. constant compares
during path sensitive verification.

>
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1647): https://lists.iovisor.org/g/iovisor-dev/message/1647
Mute This Topic: https://lists.iovisor.org/mt/30315706/21656
Mute #verifier: https://lists.iovisor.org/mk?hashtag=verifier=2590197
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] minutes: IO Visor TSC/Dev Meeting

2019-04-05 Thread Yonghong Song
Hi, Jiong,

To follow up the iovisor meeting discussion, the below is my prototype
for an end_loop
instruction in llvm:
https://github.com/yonghong-song/llvm/commit/b83226772100317092cae6478229ed6ca3b9903c
The goal is to help verifier to just focus on these marked cases,
rejecting any other backedges.

Please let me and John know if you get some better idea for bounded
loop support in llvm/verifier.

Thanks,

Yonghong

On Wed, Apr 3, 2019 at 11:49 AM Brenden Blanco  wrote:
>
> Hi All,
>
> Thanks for the good discussion today! Below are my notes.
>
> Thanks,
> Brenden
>
> === Discussion ===
[...]
> Yonghong:
> * Compile once run anywhere work continues
>   * bitfield handling bugs in IR/debuginfo
>
> Daniel:
> * Global support work continues
>   * BTF side patches submitted to bpf mailing list
>   * tests included
>
> Jiong:
> * 32 bit patch set
>   * test methodology improvements
>   * updated patches later in the week
>   * some concerns around shifts, to be addressed in later improvements
>
> Andrii:
> * BTF and compile-once work integration
>   to share prototype tool with Saeed
>
> Brendan:
> * Is there a tool to measure queue latency in qdisc->netdev layer?
>   * debian/ubuntu are packaging bpftrace
>   * except libbcc renamed to libbpf_cc
>   * some issues with mixing iovisor's libbcc and debian's
>
> Jesper:
> * Fedora adding packaging support for libbpf
>
> Alexei:
> * systemd also adding support for libbpf - link to be provided?
>
> === Attendees ===
> Brenden Blanco
> Andril Nakryiko
> Daniel Borkmann
> Jesper Brouer
> Jiong Wang
> Marco Leogrande
> Michael Savisko
> Paul Chaignon
> Quentin Monnet
> Alexei Starovoitov
> Saeed
> Flavio
> Rony
> Jonathan Lemon
> Brendan Gregg
> Dan Siemon
> Joe Stringer
> John
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1644): https://lists.iovisor.org/g/iovisor-dev/message/1644
Mute This Topic: https://lists.iovisor.org/mt/30527411/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] math between pkt pointer and register with unbounded min value is not allowed #verifier

2019-03-27 Thread Yonghong Song
On Wed, Mar 27, 2019 at 1:48 PM Pablo Alvarez via Lists.Iovisor.Org
 wrote:
>
> Why is it required that llvm compile the BPF code with -O2? That seems
> to be part of what is causing these verifier problems...

-O0 won't work as helper call will become an indirect call
  static void *(*bpf_map_lookup_elem)(void *map, void *key) =
(void *) BPF_FUNC_map_lookup_elem;
Another reason is below value tracking in verifier.

The verifier does not handle value spill well. It kept track of map
pointers, packet pointers, etc.
But if a particular value is spilled, later on, when reload from
stack, it will become unknown.
For example,
r1 = 10;/* verifier state: r1 = 10 */
*(r10 - 40) = r1;
   r2 = *(r10 - 40);  /* verifier state: r2, unknown */

The reason is that keep tracking of values could increase verification
time quite a bit.
This is measured sometime back, but it may warrant to do some measurement again
at this moment to see whether can relax this restriction.

So practically -O0 won't work. -O1 may or may not work and most people
do not use it.
-O2 is preferred. Also for performance reasons, we want to make -O2
work as BPF JIT
inside the kernel does not do any optimization, it is simple one insn
each time translation.

>
> On 3/27/19 4:23 PM, Yonghong Song wrote:
> > On Wed, Mar 27, 2019 at 10:17 AM Jiong Wang  
> > wrote:
> >>
> >>> On 27 Mar 2019, at 16:43, Simon  wrote:
> >>>
> >>> Thx a lot for your time Jiong.
> >>>
> >>> The more I played with bpf/xdp, the more I understand that the challenge 
> >>> is about making "optimized byte code" compliant for the verifier.
> >>>
> >>> How could I do this kind of checks my self ? I mean looking how llvm 
> >>> optimized my code ? (to be able to do same kind of analyses you do above?)
> >> Just my humble opinion, I would recommend:
> >>
> >>1.  get used to verifier rejection information, for example:
> >>
> >>R0=inv1 R1=pkt(id=0,off=0,r=42,imm=0) R2=pkt_end(id=0,off=0,imm=0) 
> >> R3=inv(id=0) R4=inv(id=0,umax_value=504,var_off=(0x0; 0x1ff)) R5=inv5 
> >> R10=fp0,call_-1
> >> 40: (0f) r1 += r3
> >> math between pkt pointer and register with unbounded min value is not 
> >> allowed
> >>
> >>It tells you the status of each registers at the rejection point,
> >>for example, now R3 is “inv”, meaning a scalar value (not a 
> >> pointer),
> >>and is without value range, then r4 has value range, and maximum 
> >> value
> >>is 504.
> > If you use BPF constructor debug=16 flag, it will print out the
> > register state for every insn if you are even more curious.
> >
> >>2. known what verifier will reject. Could refer to:
> >>
> >>   
> >> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/verifier?id=473c5daa86ffe91e937856cc32b4faa61db2e3e3
> >>
> >>   those are unit examples of what will be rejected, and some of them 
> >> are with
> >>   meaningful test name or comments so could be easy to understand.
> > To resolve this issue, llvm may need to do more:
> >- prevent/undo optimization which may cause ultimate verifier rejections.
> >- provide hints (e.g., through BTF) to verifier so verifier may
> > selectively do some analysis
> >  or enable some tracking for the cases where BTF instructed to
> > handle. For example,
> >  BTF may tell verifier two register have the same state at a
> > particular point and verifier
> >  only needs to check these two registers with limited range and no
> > others, etc.
> >
> >> Regards,
> >> Jiong
> >>
> >>
> >>
> >>>
> >>
> >>
> >>
> >
> >
>
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1635): https://lists.iovisor.org/g/iovisor-dev/message/1635
Mute This Topic: https://lists.iovisor.org/mt/30285987/21656
Mute #verifier: https://lists.iovisor.org/mk?hashtag=verifier=2590197
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] R? min value is negative, either use unsigned or 'var &= const' #verifier

2019-03-20 Thread Yonghong Song
Hi, Jiong,

Thanks for your interest to help with this issue.
You can reproduce with the code at
  
https://github.com/sbernard31/udploadbalancer/tree/bf71e99fbd0c3f806a43076fc12a47e966422839
Using command:
  sudo python ulb.py lo -vip 10.41.44.13 -rs
00:00:00:00:00:00/127.0.0.1 -p 5683 5684
You need to have bcc installed in the system.

Yonghong


On Tue, Mar 19, 2019 at 11:23 PM Yonghong Song via Lists.Iovisor.Org
 wrote:
>
> On Tue, Mar 19, 2019 at 9:06 AM Simon  wrote:
> >
> > The compiler is doing optimization which make verifier fail. It is possible 
> > an early compiler with less optimizations may work.
> >
> > Maybe a silly question, but does it make sense to try to change compiler 
> > optimization option ? (I tried  to play with -O option without success)
>
> Maybe. I have not looked at this yet from compiler side. Sometimes you
> won't have an easy compiler option to turn off. Tuning -O may not
> help. Lowering -O to -O1/-O0 may help to remove this particular
> optimization, but may introduce more spills which verifier will also
> reject.
>
> >
> > Please keep me informed about your progress about this issue :)
>
> Sure. Will let you know if I have made progress in this.
>
> >
> > Thx again.
> >
> >
> >
> >
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1625): https://lists.iovisor.org/g/iovisor-dev/message/1625
Mute This Topic: https://lists.iovisor.org/mt/30315706/21656
Mute #verifier: https://lists.iovisor.org/mk?hashtag=verifier=2590197
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] R? min value is negative, either use unsigned or 'var &= const' #verifier

2019-03-20 Thread Yonghong Song
On Tue, Mar 19, 2019 at 9:06 AM Simon  wrote:
>
> The compiler is doing optimization which make verifier fail. It is possible 
> an early compiler with less optimizations may work.
>
> Maybe a silly question, but does it make sense to try to change compiler 
> optimization option ? (I tried  to play with -O option without success)

Maybe. I have not looked at this yet from compiler side. Sometimes you
won't have an easy compiler option to turn off. Tuning -O may not
help. Lowering -O to -O1/-O0 may help to remove this particular
optimization, but may introduce more spills which verifier will also
reject.

>
> Please keep me informed about your progress about this issue :)

Sure. Will let you know if I have made progress in this.

>
> Thx again.
>
>
>
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1619): https://lists.iovisor.org/g/iovisor-dev/message/1619
Mute This Topic: https://lists.iovisor.org/mt/30315706/21656
Mute #verifier: https://lists.iovisor.org/mk?hashtag=verifier=2590197
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] R? min value is negative, either use unsigned or 'var &= const' #verifier

2019-03-18 Thread Yonghong Song
On Mon, Mar 18, 2019 at 4:18 AM Simon  wrote:
>
> Thx a lot again for your time and your detailed explanation.
>
> About the workaround you proposed, I didn't get where I should repeat __u16 
> udp_len = bpf_ntohs(udp->len);.. I tried several spot but didn't succeed to 
> make it works.

This is a tough issue. I spent a couple of hours trying various source
workaround and did not succeed.
To illustrate my experiment, the following is what I tried to do to
move the code udp_len calculation and its usage closer to each other
to avoid register spill/refills.

```
-bash-4.4$ diff ulb.c ulb.c.org
61,62c61,62
< static inline int ipv4_l4_csum(struct udphdr *data_start, __u32 data_size,
< __u64 *csum, struct iphdr *iph, void
*data_end) {
---
> static inline void ipv4_l4_csum(void *data_start, __u32 data_size,
> __u64 *csum, struct iphdr *iph) {
70,87c70,71
<
<   data_start = iph + 1;
<   if (data_start + 1 > data_end)
<   return XDP_DROP;
<   data_size = bpf_ntohs(data_start->len);
<   if (data_size < 8)
<   return -1;
<   if (data_size > 512)
<   return -1;
<   data_size = data_size & 0x1ff;
<   data_start = (void *)data_start + data_size;
<   if ((void *) data_start <= data_end) {
< *csum = bpf_csum_diff(0, 0, (void *)data_start - data_size,
data_size, *csum);
< *csum = csum_fold_helper(*csum);
< return 0;
<   }
<
<   return -1;
---
>   *csum = bpf_csum_diff(0, 0, data_start, data_size, *csum);
>   *csum = csum_fold_helper(*csum);
140a125,132
> __u16 udp_len = bpf_ntohs(udp->len);
> if (udp_len < 8)
> return XDP_DROP;
> if (udp_len > 512) // TODO use a more approriate max value
> return XDP_DROP;
> udp_len = udp_len & 0x1ff;
> if ((void *) udp + udp_len > data_end)
> return XDP_DROP;
200,203c192
< __u16 udp_len = 0;
< int ret = ipv4_l4_csum(udp, udp_len, , iph, data_end) ;
< if (ret == -1)
<   return XDP_DROP;
---
> ipv4_l4_csum(udp, udp_len, , iph) ;
-bash-4.4$
```

But my attempt above not working. The main reason is the compiler is
always trying to
use the original assignment register
   data_start = iph + 1;   /* data_start is in r8 */
   /* r1 = r8, and below r1 is used for data_start */
   if (data_start   + 1 > data_end)
  return XDP_DROP;
   ...
   *csum = bpf_csum_diff(0, 0, (void *)data_start - data_size,
data_size, *csum);  /* data_start in r8 */

The data_start is refined in r1 while r8 is used in the final argument passing.

I think I need to look at kernel verifier side and compiler side.

> By the way repeat it, means probably shifting byte again ... so I pretty sure 
> I didn't  get well what you mean :/ ... Did you succeed to make it works with 
> this workaround ?
>
> This is maybe a stupid question, but it seems to me that I'm not doing so 
> exotic code here. I mean calculate a checksum with XDP and I fall in not so 
> easy "issues". Is there something I do which is totally wrong or uncommon ?

You are not. The compiler is doing optimization which make verifier
fail. It is possible an early compiler with less optimizations may
work.

>
> I will take a look at SCALAR_VALUE issue later on.But if you or anybody has 
> cycles and want to look into this issue,feel free to do so.
>
> I didn't have enough skill to do that, I'm not even sure to understand what 
> "spill/reload" means ?

Something like:
udp_len = ...
...
... = udp_len

The generated code:
 store_to_stack for udp_len;
 ...
 get udp_len value from stack
 use udp_len

> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1613): https://lists.iovisor.org/g/iovisor-dev/message/1613
Mute This Topic: https://lists.iovisor.org/mt/30315706/21656
Mute #verifier: https://lists.iovisor.org/mk?hashtag=verifier=2590197
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] R? min value is negative, either use unsigned or 'var &= const' #verifier

2019-03-17 Thread Yonghong Song
On Mon, Mar 11, 2019 at 11:13 PM Yonghong Song via Lists.Iovisor.Org
 wrote:
>
> On Mon, Mar 11, 2019 at 4:08 AM Simon  wrote:
> >
> > I tried to understand again this verifier error again and probably my 
> > previous post does not contain enough information.
> >
> > I  understand that :
> >
> > 93: (67) r0 <<= 32
> > 294: (c7) r0 s>>= 32
> > 295: (b7) r1 = 0
> > 296: (b7) r2 = 0
> > 297: (bf) r3 = r8
> > 298: (79) r4 = *(u64 *)(r10 -40)
> > 299: (bf) r5 = r0
> > 300: (85) call bpf_csum_diff#28
> > R4 min value is negative, either use unsigned or 'var &= const'
> >
> > is about this line  (in ipv4_l4_csum)
> >
> >   *csum = bpf_csum_diff(0, 0, data_start, data_size, *csum);
> >
> > R1=0,
> > R2=0,
> > R3= R8=pkt(id=0,off=34,r=42,imm=0) = data_start =  a pointer to struct 
> > udphdr *udp
> > R4= something in the stack  = data_size = __u16 udp_len
> >
> > So I can not understand how this bring to R4 min value is negative, either 
> > use unsigned or 'var &= const'
>
> I took a brief look. Indeed, it is very strange. I can see proper
> value of udp_len is stored into
> r10 - 40, but when it is retrieved later, the value became unkown
>
> I will try to experiment with this problem later this week.
>
> Could you do me a favor to make it reproducible with python2? My env.
> flexible to rebuild/retry with kernel is python2 friendly.
>
>
> >
> > 298: (79) r4 = *(u64 *)(r10 -40)
> >
> > As I understand this line, r4 will get a value in the stack 
> > (R10=fp0,call_-1 fp-48=pkt) and cast  this value in a u64, so unsigned. 
> > (min value = 0)

The reason for the failure is due spill/reload does not preserve the
original register state for scalar value.
Look at the kernel function:
static bool is_spillable_regtype(enum bpf_reg_type type)
{
switch (type) {
case PTR_TO_MAP_VALUE:
case PTR_TO_MAP_VALUE_OR_NULL:
case PTR_TO_STACK:
case PTR_TO_CTX:
case PTR_TO_PACKET:
case PTR_TO_PACKET_META:
case PTR_TO_PACKET_END:
case PTR_TO_FLOW_KEYS:
case CONST_PTR_TO_MAP:
case PTR_TO_SOCKET:
case PTR_TO_SOCKET_OR_NULL:
case PTR_TO_SOCK_COMMON:
case PTR_TO_SOCK_COMMON_OR_NULL:
case PTR_TO_TCP_SOCK:
case PTR_TO_TCP_SOCK_OR_NULL:
return true;
default:
return false;
}
}

The original variable udp_len is defined as
__u16 udp_len = bpf_ntohs(udp->len);
if (udp_len < 8)
return XDP_DROP;
if (udp_len > 512) // TODO use a more approriate max value
return XDP_DROP;
udp_len = udp_len & 0x1ff;

and it is saved to stack, and its register type is SCALAR_VALUE.
But since SCALAR_VALUE is not part of is_spillable_regtype() so
when the value is reloaded from stack to register, the original
state is not copied and the worst case scalar value is assumed
and this will incur the error you see.

I tried to add SCALAR_VALUE to the is_spillable_regtype()
and it does not work and more changes in verifier is required.

The trick like `data_size = data_size & 0x1ff` may not work
as the compiler may optimize it away and the original spill
may still exist.

The workaound could be to repeat
  __u16 udp_len = bpf_ntohs(udp->len);
here udp is a pointer to the package, even it is spilled, its register
state can be restored.

I will take a look at SCALAR_VALUE issue later on.
But if you or anybody has cycles and want to look into this issue,
feel free to do so.

Thanks!

Yonghong

> >
> > (By the way I can not understand why this is a u64 and not a u16 as udp_len 
> > variable or u32 as  data_size parameter of ipv4_l4_csum function or u32 as 
> > tosize from bpf_csum_diff function...)
> >
> > I tried to use the &= tricks like :
> >
> > data_size = data_size & 0x1ff;
> > *csum = bpf_csum_diff(0, 0, data_start, data_size, *csum);
> >
> > Same issue ...
> >
> > Here a more longer trace from the verifier :
> >
> > R0=inv(id=0,umax_value=4295032831,var_off=(0x0; 0x1))
> > R1=inv(id=0,umax_value=65536,var_off=(0x0; 0x1))
> > R6=ctx(id=0,off=0,imm=0)
> > R7=pkt(id=0,off=0,r=42,imm=0)
> > R8=pkt(id=0,off=34,r=42,imm=0)
> > R9=pkt(id=0,off=30,r=42,imm=0)
> > R10=fp0,call_-1 fp-48=pkt
> > 239: (57) r0 &= 65535
> > 240: (0f) r0 += r1
> > 241: (bf) r1 = r0
> > 242: (77) r1 >>= 16
> > 243: (0f) r1 += r0
> > 244: (a7) r1 ^= -1
> > 245: (6b) *(u16 *)(r7 +24) = r1
> > 246: (b7) r1 = 0
> > 247: (6b) *(u16 *)(r7 +40) = r1
> > 248: (b7) r1 = 0
> > 249: (b7) r2 = 0
> >

Re: [iovisor-dev] R? min value is negative, either use unsigned or 'var &= const' #verifier

2019-03-12 Thread Yonghong Song
On Mon, Mar 11, 2019 at 4:08 AM Simon  wrote:
>
> I tried to understand again this verifier error again and probably my 
> previous post does not contain enough information.
>
> I  understand that :
>
> 93: (67) r0 <<= 32
> 294: (c7) r0 s>>= 32
> 295: (b7) r1 = 0
> 296: (b7) r2 = 0
> 297: (bf) r3 = r8
> 298: (79) r4 = *(u64 *)(r10 -40)
> 299: (bf) r5 = r0
> 300: (85) call bpf_csum_diff#28
> R4 min value is negative, either use unsigned or 'var &= const'
>
> is about this line  (in ipv4_l4_csum)
>
>   *csum = bpf_csum_diff(0, 0, data_start, data_size, *csum);
>
> R1=0,
> R2=0,
> R3= R8=pkt(id=0,off=34,r=42,imm=0) = data_start =  a pointer to struct udphdr 
> *udp
> R4= something in the stack  = data_size = __u16 udp_len
>
> So I can not understand how this bring to R4 min value is negative, either 
> use unsigned or 'var &= const'

I took a brief look. Indeed, it is very strange. I can see proper
value of udp_len is stored into
r10 - 40, but when it is retrieved later, the value became unkown

I will try to experiment with this problem later this week.

Could you do me a favor to make it reproducible with python2? My env.
flexible to rebuild/retry with kernel is python2 friendly.


>
> 298: (79) r4 = *(u64 *)(r10 -40)
>
> As I understand this line, r4 will get a value in the stack (R10=fp0,call_-1 
> fp-48=pkt) and cast  this value in a u64, so unsigned. (min value = 0)
>
> (By the way I can not understand why this is a u64 and not a u16 as udp_len 
> variable or u32 as  data_size parameter of ipv4_l4_csum function or u32 as 
> tosize from bpf_csum_diff function...)
>
> I tried to use the &= tricks like :
>
> data_size = data_size & 0x1ff;
> *csum = bpf_csum_diff(0, 0, data_start, data_size, *csum);
>
> Same issue ...
>
> Here a more longer trace from the verifier :
>
> R0=inv(id=0,umax_value=4295032831,var_off=(0x0; 0x1))
> R1=inv(id=0,umax_value=65536,var_off=(0x0; 0x1))
> R6=ctx(id=0,off=0,imm=0)
> R7=pkt(id=0,off=0,r=42,imm=0)
> R8=pkt(id=0,off=34,r=42,imm=0)
> R9=pkt(id=0,off=30,r=42,imm=0)
> R10=fp0,call_-1 fp-48=pkt
> 239: (57) r0 &= 65535
> 240: (0f) r0 += r1
> 241: (bf) r1 = r0
> 242: (77) r1 >>= 16
> 243: (0f) r1 += r0
> 244: (a7) r1 ^= -1
> 245: (6b) *(u16 *)(r7 +24) = r1
> 246: (b7) r1 = 0
> 247: (6b) *(u16 *)(r7 +40) = r1
> 248: (b7) r1 = 0
> 249: (b7) r2 = 0
> 250: (79) r3 = *(u64 *)(r10 -48)
> 251: (b7) r4 = 4
> 252: (b7) r5 = 0
> 253: (85) call bpf_csum_diff#28
> 254: (67) r0 <<= 32
> 255: (c7) r0 s>>= 32
> 256: (b7) r1 = 0
> 257: (b7) r2 = 0
> 258: (bf) r3 = r9
> 259: (b7) r4 = 4
> 260: (bf) r5 = r0
> 261: (85) call bpf_csum_diff#28
> 262: (71) r1 = *(u8 *)(r7 +23)
> 263: (dc) r1 = be32 r1
> 264: (63) *(u32 *)(r10 -24) = r1
> 265: (67) r0 <<= 32
> 266: (c7) r0 s>>= 32
> 267: (bf) r9 = r10
> 268: (07) r9 += -24
> 269: (b7) r1 = 0
> 270: (b7) r2 = 0
> 271: (bf) r3 = r9
> 272: (b7) r4 = 4
> 273: (bf) r5 = r0
> 274: (85) call bpf_csum_diff#28
> 275: (79) r1 = *(u64 *)(r10 -40)
> 276: (dc) r1 = be32 r1
> 277: (63) *(u32 *)(r10 -24) = r1
> 278: (67) r0 <<= 32
> 279: (c7) r0 s>>= 32
> 280: (b7) r1 = 0
> 281: (b7) r2 = 0
> 282: (bf) r3 = r9
> 283: (b7) r4 = 4
> 284: (bf) r5 = r0
> 285: (85) call bpf_csum_diff#28
> 286: (67) r0 <<= 32
> 287: (c7) r0 s>>= 32
> 288: (b7) r1 = 0
> 289: (b7) r2 = 0
> 290: (bf) r3 = r8
> 291: (79) r4 = *(u64 *)(r10 -40)
> 292: (bf) r5 = r0
> 293: (85) call bpf_csum_diff#28
>
> I reference the commit instead of repository to keep the link consistent over 
> the time : 
> https://github.com/sbernard31/udploadbalancer/tree/5ca93d0893a60bc70a75f30eb5cfde496a9e5d93
>
> Again do not hesitate to redirect me to better place if I'm not asking at the 
> right place :)
>
> Thx again for your time.
>
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1607): https://lists.iovisor.org/g/iovisor-dev/message/1607
Mute This Topic: https://lists.iovisor.org/mt/30315706/21656
Mute #verifier: https://lists.iovisor.org/mk?hashtag=verifier=2590197
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] math between pkt pointer and register with unbounded min value is not allowed #verifier

2019-03-08 Thread Yonghong Song
On Fri, Mar 8, 2019 at 9:22 AM Simon  wrote:
>
>
> 35: (69) r3 = *(u16 *)(r7 +38)
> 36: (dc) r3 = be16 r3
>
> r3 get the value from memory, its value could be any one as permitted
> by the type.
>
> Does it mean that r3 is considered as be16 ? I do not understand why as I 
> explicitly convert it in u16.

The be16 is to convert r3 with big endian encoding. If the host system
is big endian, it will do nothing. Otherwise,
it will convert from little endian to big endian.

>
> This output language is a readable format of bpf bytecode, right ? Is there 
> any documentation to lean/understand it ?

Yes, there is no documentation. It intends to be self explanatory. I
guess "be16" is special and may need some documentation. Otherwise
assembly-style codes should be easy to understand.

>
> The compiler does the right thing, just verifier is not advanced enough.
>
> Is it worthy to share this issue of verifier.c with bpf maintainers ? The 
> compiler which is used here is clang which is called by bcc, right ?

I am also a regular kernel/bpf reviewer. The bpf maintainers/community
are aware of this limitation. As you mentioned, the verifier is
already very complex. To implement complex tracking like described in
this thread will make verifier even more complex, hence this is
delayed. One of reason is that we have reasonable, although
unpleasant, workarounds.

Yes, it is compiled with clang.

>
> Yes, you will need some source workaround. You could try below (untested):
> + udp_len = udp_len & 0x1ff;
>
> I tested it and it seems to work. Thx a lot !!
>
> But that means I can not use the u16 max value ?

You can. I add that because you have a test to limit the range of the
value to 511.

>
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1605): https://lists.iovisor.org/g/iovisor-dev/message/1605
Mute This Topic: https://lists.iovisor.org/mt/30285987/21656
Mute #verifier: https://lists.iovisor.org/mk?hashtag=verifier=2590197
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] math between pkt pointer and register with unbounded min value is not allowed #verifier

2019-03-07 Thread Yonghong Song
On Wed, Mar 6, 2019 at 7:08 AM  wrote:
>
> I'm playing with bcc to prototype an UDP load balancer.
>
> I'm facing an issue that I didn't succeed to understand...
>
> In my code I tried to validate my UDP packet using code like this :
>
> struct udphdr *udp;
> udp = iph + 1;
> if (udp + 1 > data_end)
> return XDP_DROP;
> __u16 udp_len = bpf_ntohs(udp->len);
> //__u16 udp_len = 8;
> if (udp_len < 8)
> return XDP_DROP;
> if (udp_len > 512) // TODO use a more approriate max value
> return XDP_DROP;
> if ((void *) udp + udp_len > data_end)
> return XDP_DROP;
>
> And the verifier does not like it ..

This is caused by compiler optimizations.

>
> 28: (71) r2 = *(u8 *)(r7 +23)
> 29: (b7) r0 = 2
> 30: (55) if r2 != 0x11 goto pc+334
>  R0=inv2 R1=pkt_end(id=0,off=0,imm=0) R2=inv17 R3=inv5 
> R6=ctx(id=0,off=0,imm=0) R7=pkt(id=0,off=0,r=34,imm=0) 
> R8=pkt(id=0,off=34,r=34,imm=0) R9=pkt(id=0,off=14,r=34,imm=0) R10=fp0,call_-1
> 31: (bf) r2 = r8
> 32: (07) r2 += 8
> 33: (b7) r0 = 1
> 34: (2d) if r2 > r1 goto pc+330
>  R0=inv1 R1=pkt_end(id=0,off=0,imm=0) R2=pkt(id=0,off=42,r=42,imm=0) R3=inv5 
> R6=ctx(id=0,off=0,imm=0) R7=pkt(id=0,off=0,r=42,imm=0) 
> R8=pkt(id=0,off=34,r=42,imm=0) R9=pkt(id=0,off=14,r=42,imm=0) R10=fp0,call_-1
> 35: (69) r3 = *(u16 *)(r7 +38)
> 36: (dc) r3 = be16 r3

r3 get the value from memory, its value could be any one as permitted
by the type.

> 37: (bf) r2 = r3
> 38: (07) r2 += -8
> 39: (57) r2 &= 65535
> 40: (b7) r0 = 1
> 41: (25) if r2 > 0x1f8 goto pc+323

test is done by r2. We indeed get better range for r2 (below:
R2=inv(id=0,umax_value=504,var_off=(0x0; 0x1ff)) )
but r3 range is not tightened.

>  R0=inv1 R1=pkt_end(id=0,off=0,imm=0) 
> R2=inv(id=0,umax_value=504,var_off=(0x0; 0x1ff)) R3=inv(id=0) 
> R6=ctx(id=0,off=0,imm=0) R7=pkt(id=0,off=0,r=42,imm=0) 
> R8=pkt(id=0,off=34,r=42,imm=0) R9=pkt(id=0,off=14,r=42,imm=0) R10=fp0,call_-1
> 42: (bf) r2 = r7
> 43: (0f) r2 += r3
> math between pkt pointer and register with unbounded min value is not allowed

Here, we use r3 to do arith and the verifier not happy. If we use the
tightened old r2, it may work.
The compiler does the right thing, just verifier is not advanced enough.

>
> I'm pretty sure the issue is about udp_len, that's why I tried to validate 
> its value before to use it ... but without success...
> When I set udp_len to 8 (just for testing) this seems to works. Any idea 
> about that ?

Yes, you will need some source workaround. You could try below (untested):

if (udp_len > 512) // TODO use a more approriate max value
  return XDP_DROP;
+  udp_len = udp_len & 0x1ff;
if ((void *) udp + udp_len > data_end)
 return XDP_DROP;

>
> Full code is available here : 
> https://gist.github.com/sbernard31/d4fee7518a1ff130452211c0d355b3f7
>
> (I'm using python-bpfcc  0.8.0-4 from debian sid with a 4.19.12 kernel)
> (I don't know if this is the right place for this kind of question, )
>
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1600): https://lists.iovisor.org/g/iovisor-dev/message/1600
Mute This Topic: https://lists.iovisor.org/mt/30285987/21656
Mute #verifier: https://lists.iovisor.org/mk?hashtag=verifier=2590197
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Sharing eBPF map between two eBPF kernel programs

2019-03-07 Thread Yonghong Song
On Mon, Mar 4, 2019 at 10:45 PM Kanthi P  wrote:
>
> Thanks for that!
>
> I think those work because bcc has given utilities to do that.
>
> I am using C and we use libbpf to load programs and to do map operations.
>
> Do you know if we can achieve it in this way too?

You should be able to do the work with C as well. Just use APIs from libbpf.h,
e.g., bpf_prog_get_next_id, bpf_prog_get_fd_by_id, bpf_map_get_fd_by_id.
Yes, we may miss bpf_map_get_next_id and bpf_obj_get_info_by_fd.
You can just add prototype to bcc/src/cc/libbpf.h and the implementation
is libbpf submodule.

We can add the missing piece in bcc/src/cc/{libbpf.h, libbpf.c}.

>
> Thanks,
> Kanthi
>
> On Tue, Mar 5, 2019 at 2:21 AM Y Song  wrote:
>>
>> On Mon, Mar 4, 2019 at 9:52 AM Kanthi P  wrote:
>> >
>> > Hi,
>> >
>> > I have two eBPF programs, each with its own user and kernel programs.
>> >
>> > One of the programs defines an eBPF map.
>> >
>> > How do I access this map from another eBPF kernel program?
>>
>> In examples/cpp, we have UseExternalMap.cc, which gives an example of
>> how to share maps between two processes with C++ interface.
>> Basically, if you know the ID of the map in another application (you
>> can see id use bpftool), you can get a fd for that map by ID with API
>> `bpf_map_get_fd_by_id()`, and example shows how to inject the new `fd`
>> to bcc compilation system.
>>
>> We do not have a python way to share map yet bwtween two different processes.
>>
>> Please take a look. If you feel the interface needs to improve to
>> support your use case, we can do that.
>>
>> >
>> > PS:
>> >
>> > I can pin the map and that helps me access this map from other program's 
>> > user space.
>> > But looking for a direct way in which I can access this map from other 
>> > programs's kernel space.
>> >
>> > Thanks!
>> > 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1599): https://lists.iovisor.org/g/iovisor-dev/message/1599
Mute This Topic: https://lists.iovisor.org/mt/30216635/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Sharing eBPF map between two eBPF kernel programs

2019-03-04 Thread Yonghong Song
On Mon, Mar 4, 2019 at 9:52 AM Kanthi P  wrote:
>
> Hi,
>
> I have two eBPF programs, each with its own user and kernel programs.
>
> One of the programs defines an eBPF map.
>
> How do I access this map from another eBPF kernel program?

In examples/cpp, we have UseExternalMap.cc, which gives an example of
how to share maps between two processes with C++ interface.
Basically, if you know the ID of the map in another application (you
can see id use bpftool), you can get a fd for that map by ID with API
`bpf_map_get_fd_by_id()`, and example shows how to inject the new `fd`
to bcc compilation system.

We do not have a python way to share map yet bwtween two different processes.

Please take a look. If you feel the interface needs to improve to
support your use case, we can do that.

>
> PS:
>
> I can pin the map and that helps me access this map from other program's user 
> space.
> But looking for a direct way in which I can access this map from other 
> programs's kernel space.
>
> Thanks!
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1594): https://lists.iovisor.org/g/iovisor-dev/message/1594
Mute This Topic: https://lists.iovisor.org/mt/30216635/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Atomicity of load and store instructions in eBPF

2019-02-09 Thread Yonghong Song
On Fri, Feb 8, 2019 at 2:48 PM Mauricio Vasquez
 wrote:
>
>
> On 2/7/19 3:49 PM, Yonghong Song wrote:
> > On Thu, Feb 7, 2019 at 8:15 AM Mauricio Vasquez
> >  wrote:
> >> Hello folks,
> >>
> >> We have an eBPF program that shares a 64-bits integer with userspace
> >> using an array map. The eBPF program only reads the integer while the
> >> user-space application only writes it.
> >>
> >> We know this is possible that under some conditions, for example the
> >> integer is not aligned to cache line size in x86_64, the reader can see
> >> a partially updated value. We wonder if eBPF have some guarantees about
> > not just cache line, you mean not in its naturally aligned boundary?
>
> Exactly.
>
> > One load/store is actually broken into multiple loads/stores by the 
> > compiler?
> >
> >> atomicity of the load operation in this case, or if this is totally
> >> dependent on the target architecture.
> >>
> >> Is there a way to implement this without using bpf_spin_lock?
> > So you mean both user space and bpf program are using bpf_spin_lock?
>
> Please consider this example based on BCC
> https://gist.github.com/mauriciovasquezbernal/73ade7609eadcf23fb254401ed945cb3.
> In this case the counter member is naturally aligned, so in x86_64 the
> program will never see a partial update. However this could be not true
> for all architectures. AFAIK bpf_spin_lock is the only way to implement
> this example in a safe way for all architectures.
>
> In other words, it is not possible to make assumptions about load or
> store operations atomicity in eBPF, this is dependent on the
> architecture where the program is run.
>
> Is that statement correct?

There are some memory model discussion in the netdev mailing list.
I think this shows bpf memory model is really tied to underlying arch indeed.

Some kinds of lock is the only way to guarantee the atomacity.
bpf_spin_lock is a kernel bpf helper. I guess user space could implement
the spin_lock for map element as well based on how kernel bpf_spin_lock
is implemented.

>
> Thanks,
>
> Mauricio.
>
>
> >> Thanks,
> >>
> >> Mauricio.
> >>
> >>
> >>
> >>
> > 
> >
> >

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1574): https://lists.iovisor.org/g/iovisor-dev/message/1574
Mute This Topic: https://lists.iovisor.org/mt/29691062/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Atomicity of load and store instructions in eBPF

2019-02-07 Thread Yonghong Song
On Thu, Feb 7, 2019 at 8:15 AM Mauricio Vasquez
 wrote:
>
> Hello folks,
>
> We have an eBPF program that shares a 64-bits integer with userspace
> using an array map. The eBPF program only reads the integer while the
> user-space application only writes it.
>
> We know this is possible that under some conditions, for example the
> integer is not aligned to cache line size in x86_64, the reader can see
> a partially updated value. We wonder if eBPF have some guarantees about

not just cache line, you mean not in its naturally aligned boundary?
One load/store is actually broken into multiple loads/stores by the compiler?

> atomicity of the load operation in this case, or if this is totally
> dependent on the target architecture.
>
> Is there a way to implement this without using bpf_spin_lock?

So you mean both user space and bpf program are using bpf_spin_lock?

>
> Thanks,
>
> Mauricio.
>
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1566): https://lists.iovisor.org/g/iovisor-dev/message/1566
Mute This Topic: https://lists.iovisor.org/mt/29691062/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Question about BPF map type BPF_MAP_TYPE_PERCPU_HASH

2019-01-31 Thread Yonghong Song
On Thu, Jan 31, 2019 at 12:09 AM Raga lahari  wrote:
>
> Hi everyone,
>
>
> I am using BPF_MAP_TYPE_HASH in BPF map definition and my program is working  
> as expected.
>
> I have read eBPF-map type  BPF_MAP_TYPE_PERCPU_HASH allows for lock free uses 
> of hash-tables in eBPF for high performance needs.
>
> I am trying to modify BPF map type from BPF_MAP_TYPE_HASH to 
> BPF_MAP_TYPE_PERCPU_HASH. But, when I run the program,  I am getting issues 
> in  BPF map iteration (using bpf_map_get_next_key)
>
>
> Can someone help me to know any points need to consider to use 
> BPF_MAP_TYPE_PERCPU_HASH map type .

You can take a look at bcc/tests/cc/{test_array_table.cc,
test_hash_table.cc, test_bpf_table.cc}.
There are a few examples there testing percpu maps.

>
>
>
> Thanks,
>
> Ragalahari
>
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1560): https://lists.iovisor.org/g/iovisor-dev/message/1560
Mute This Topic: https://lists.iovisor.org/mt/29604090/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Simple bpf_map reader prog

2018-12-24 Thread Yonghong Song
This is exactly what `bpftool map` is doing. The source code
https://github.com/torvalds/linux/blob/master/tools/bpf/bpftool/map.c

You can find the map you are interested by iterating all the maps in the system
and then iterating to get all map key/values for that map.

On Mon, Dec 24, 2018 at 9:41 AM Иван Иванов  wrote:
>
> Can you give link or some code to understand, how can i get data from maps in 
> c. I saw example, but as i begginer in such programming, i can understand 
> map_fd, obj...etc.
> I create xdp prog, inserted in via ip link set... And can see data with 
> bpftool in hex. How can i simply write a prog to get data from map without 
> writing code to load. I am now at start of the way  )
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1548): https://lists.iovisor.org/g/iovisor-dev/message/1548
Mute This Topic: https://lists.iovisor.org/mt/28847664/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Unexpectedly high latencies are showed by self-written script (bpf_ktime_get_ns issue?)

2018-12-03 Thread Yonghong Song
On Mon, Dec 3, 2018 at 3:51 AM Aleksei Zakharov  wrote:
>
>
>
> 29.11.2018, 07:07, "Y Song" :
> > On Mon, Nov 26, 2018 at 4:29 AM Aleksei Zakharov  
> > wrote:
> >>  Hi, everyone!
> >>  I've written a small script using bcc python lib for qemu block io 
> >> latency tracing.
> >>  I attach two uprobes to block_acct_start and block_acct_done.
> >>
> >>  On some servers i can see unexpectedly high latencies:
> >>  ~# ./qemuioslower 4
> >>  Tracing for qemu I/O... Ctrl-C to end
> >>  TIME LATENCY SIZE(KB) IO TYPE
> >>  0.0 18446744073709551 8192 write
> >>  9.7294584 18446744073709551 0 flush
> >>  24.815983 18446744073709551 4096 write
> >>  25.422378 18446744073709551 0 flush
> >>
> >>  For me it looks like bpf_ktime_get_ns in stop() function returned time 
> >> that is less, than bpf_ktime_get_ns() in start(), but why does this happen?
> >>  May someone help me with understanding this behavior?
> >
> > Maybe you want to check with bcc issue
> > https://github.com/iovisor/bcc/issues/728.
> > Basically there is a kernel bug related to bpf_ktime_get_ns() and it
> > is fixed in 4.9. Not sure about your kernel version.
> Thanks for your answer!
> I've tried to reproduce this with perf example from issue:
> perf record -F 9 --cpu 0 --clockid CLOCK_MONOTONIC -- sleep 1
> But it looks normal.
> Also, i'm using 4.15 kernel, where this bug should be fixed already.
> I tried to run different tools from bcc-tools, and i didn't find any strange 
> output.
>
> Is there any special case (non-buggy behavior) where such latencies could be 
> shown?
> We use servers with 2-4 numa nodes, might this be the cause?

If start() and stop() execute on two different cpus, their raw time
may be backward
since two different cpus mostly like have slightly different realtime
clock numbers.
Could you this be an issue for you? This all depends on your kernel timer setup.

>
> Also, I tried to trace latency between scsi_req_enqueue and scsi_req_complete,
> and i didn't see such high latencies in this case.
>
> >
> >>  BPF code:
> >>  #include 
> >>  BPF_HASH(req, u64);
> >>  enum BlockAcctType {
> >>  BLOCK_ACCT_READ,
> >>  BLOCK_ACCT_WRITE,
> >>  BLOCK_ACCT_FLUSH,
> >>  BLOCK_MAX_IOTYPE,
> >>  };
> >>  struct data_t {
> >>  u64 ts;
> >>  u64 lat;
> >>  u64 bytes;
> >>  enum BlockAcctType type;
> >>  };
> >>  typedef struct BlockAcctCookie {
> >>  int64_t bytes;
> >>  int64_t start_time_ns;
> >>  enum BlockAcctType type;
> >>  } BlockAcctCookie;
> >>  BPF_PERF_OUTPUT(events);
> >>
> >>  void start(struct pt_regs *ctx) {
> >>  u32 pid = bpf_get_current_pid_tgid();
> >>  if (FILTER_PID)
> >>  return;
> >>  u64 key = 0;
> >>  bpf_probe_read(, sizeof(key), (void *)PT_REGS_PARM1(ctx));
> >>  u64 ts = bpf_ktime_get_ns();
> >>  req.update(, );
> >>  }
> >>
> >>  void stop(struct pt_regs *ctx,void *first, struct BlockAcctCookie 
> >> *cookie) {
> >>  struct data_t data = {};
> >>  u64 key = 0;
> >>  bpf_probe_read(, sizeof(key), (void *)PT_REGS_PARM1(ctx));
> >>  data.ts = bpf_ktime_get_ns();
> >>  data.type = cookie->type;
> >>  data.bytes = cookie->bytes;
> >>  u64 *s_time = req.lookup();
> >>  if (s_time != 0) {
> >>  data.lat = (data.ts - *s_time);
> >>  data.lat /= 1000;
> >>  if (data.lat > MIN_US) {
> >>  events.perf_submit(ctx, , sizeof(data));
> >>  }
> >>  req.delete();
> >>  }
> >>  }
> >>
> >>  Kernel: Linux hostname 4.15.0-36-generic #39~16.04.1-Ubuntu SMP Tue Sep 
> >> 25 08:59:23 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> >>  bcc packages versions:
> >>  bcc-tools 0.7.0-1
> >>  libbcc 0.7.0-1
> >>  libcc1-0:amd64 5.4.0-6ubuntu1~16.04.10
> >>  python-bcc 0.7.0-1
> >>
> >>  --
> >>  Regards,
> >>  Aleksei Zakharov
> >>
> >>  
>
> --
> Regards,
> Aleksei Zakharov
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1543): https://lists.iovisor.org/g/iovisor-dev/message/1543
Mute This Topic: https://lists.iovisor.org/mt/28320360/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] question about per_cpu maps

2018-11-29 Thread Yonghong Song
On Thu, Nov 29, 2018 at 5:48 AM Pablo Alvarez via Lists.Iovisor.Org
 wrote:
>
> Good to know, thanks!
>
> There is no mention of the non-preemption in the bpf or tc-bpf man pages
> or the bcc tutorials. Is it possible to change that? I would be happy to
> add a note about this if pointed in the right direction.

Yes. bpf man page severely lags behind the current state. Yes, you can help
make the change. linux-man mailing list is probably proper place to send patches
(https://www.spinics.net/lists/linux-man/).

>
> For example, this paragraph in the man bpf page could be edited (ALLCAPS
> additions)
>
> "eBPF programs can be attached to different events.  These events can
>be the arrival of network packets, tracing events, classification
>events by network queueing  disciplines (for eBPF programs attached
>to a tc(8) classifier), and other types that may be added in the
>future.  A new event triggers execution of the eBPF program, which
>may store information about the event in eBPF maps. Beyond storing
>data, eBPF programs may call a fixed set of in-kernel helper
>functions. EBPF PROGRAMS ARE NEVER PREEMPTED BY THE KERNEL. THIS
> INCLUDES TAIL CALLS TO OTHER EBPF PROGRAMS."
>
> Best regards
>
> Pablo Alvarez
>
>
> On 11/28/18 17:52, Mauricio Vasquez wrote:
> >
> > On 11/28/18 1:50 PM, Pablo Alvarez via Lists.Iovisor.Org wrote:
> >> Dear eBPF community
> >>
> >> If I understand correctly, per_cpu maps are meant to avoid threading
> >> issues, and for example the need to call BPF_XADD to keep counters safe.
> >> There is something I am unclear about:
> >>
> >> - Is it possible for a BPF program to be preempted by another BPF
> >> program on the same CPU?
> >
> > eBPF programs are never preempted by the kernel, this allows to keep
> > counters in per_cpu maps without need to use synchronized primitives on
> > them.
> >
> > This is also guaranteed that there is not preemption in a chain of tail
> > calls, then per_cpu maps can also be used to store "global variables".
> >
> >> If this is the case, it seems that the following scenario could arise:
> >>
> >> BPF program 1 on cpu 0 reads a counter and is preempted before
> >> incrementing it
> >> BPF program 2 on cpu 0 reads the same counter, increments it, and
> >> finishes
> >> BPF program 1 on cpu 0 increments the counter
> >>
> >> At which point both programs would have read the same value for the
> >> counter, with possible problems ensuing.
> >>
> >>
> >> Is this a valid scenario? Am I missing something about how the per_cpu
> >> maps are intended to be used?
> >>
> >> Thanks
> >>
> >> Pablo Alvarez
> >>
> >>
> >>
> >>
> >
> >
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1540): https://lists.iovisor.org/g/iovisor-dev/message/1540
Mute This Topic: https://lists.iovisor.org/mt/28471975/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Unexpectedly high latencies are showed by self-written script (bpf_ktime_get_ns issue?)

2018-11-28 Thread Yonghong Song
On Mon, Nov 26, 2018 at 4:29 AM Aleksei Zakharov  wrote:
>
> Hi, everyone!
> I've written a small script using bcc python lib for qemu block io latency 
> tracing.
> I attach two uprobes to block_acct_start and block_acct_done.
>
> On some servers i can see unexpectedly high latencies:
> ~# ./qemuioslower 4
> Tracing for qemu I/O... Ctrl-C to end
> TIME   LATENCYSIZE(KB)   IO TYPE
> 0.018446744073709551 8192   write
> 9.7294584  18446744073709551 0  flush
> 24.815983  18446744073709551 4096   write
> 25.422378  18446744073709551 0  flush
>
> For me it looks like bpf_ktime_get_ns in stop() function returned time that 
> is less, than bpf_ktime_get_ns() in start(), but why does this happen?
> May someone help me with understanding this behavior?

Maybe you want to check with bcc issue
https://github.com/iovisor/bcc/issues/728.
Basically there is a kernel bug related to bpf_ktime_get_ns() and it
is fixed in 4.9. Not sure about your kernel version.

>
> BPF code:
> #include 
> BPF_HASH(req, u64);
> enum BlockAcctType {
> BLOCK_ACCT_READ,
> BLOCK_ACCT_WRITE,
> BLOCK_ACCT_FLUSH,
> BLOCK_MAX_IOTYPE,
> };
> struct data_t {
> u64 ts;
> u64 lat;
> u64 bytes;
> enum BlockAcctType type;
> };
> typedef struct BlockAcctCookie {
> int64_t bytes;
> int64_t start_time_ns;
> enum BlockAcctType type;
> } BlockAcctCookie;
> BPF_PERF_OUTPUT(events);
>
> void start(struct pt_regs *ctx) {
> u32 pid = bpf_get_current_pid_tgid();
> if (FILTER_PID)
> return;
> u64 key = 0;
> bpf_probe_read(, sizeof(key), (void *)PT_REGS_PARM1(ctx));
> u64 ts = bpf_ktime_get_ns();
> req.update(, );
> }
>
> void stop(struct pt_regs *ctx,void *first, struct BlockAcctCookie *cookie) {
> struct data_t data = {};
> u64 key = 0;
> bpf_probe_read(, sizeof(key), (void *)PT_REGS_PARM1(ctx));
> data.ts = bpf_ktime_get_ns();
> data.type = cookie->type;
> data.bytes = cookie->bytes;
> u64 *s_time = req.lookup();
> if (s_time != 0) {
> data.lat = (data.ts - *s_time);
> data.lat /= 1000;
> if (data.lat > MIN_US) {
> events.perf_submit(ctx, , sizeof(data));
> }
> req.delete();
> }
> }
>
> Kernel: Linux hostname 4.15.0-36-generic #39~16.04.1-Ubuntu SMP Tue Sep 25 
> 08:59:23 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> bcc packages versions:
> bcc-tools 0.7.0-1
> libbcc0.7.0-1
> libcc1-0:amd645.4.0-6ubuntu1~16.04.10
> python-bcc0.7.0-1
>
>
> --
> Regards,
> Aleksei Zakharov
>
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1538): https://lists.iovisor.org/g/iovisor-dev/message/1538
Mute This Topic: https://lists.iovisor.org/mt/28320360/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] eBPF TC for adjusting the packet length

2018-11-17 Thread Yonghong Song
On Sun, Nov 18, 2018 at 6:55 AM Kanthi P  wrote:
>
> Hi,
>
> Can anyone suggest a way to adjust the packet length(for example to strip off 
> the payload and have only headers) using eBPF TC program?

You may check helper bpf_skb_change_tail(), which could serve your purpose.

>
> Thanks,
> Kanthi
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1526): https://lists.iovisor.org/g/iovisor-dev/message/1526
Mute This Topic: https://lists.iovisor.org/mt/28214488/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] minutes: IO Visor TSC/Dev Meeting

2018-10-08 Thread Yonghong Song
On Wed, Oct 3, 2018 at 11:26 AM Brenden Blanco  wrote:
>
> Hi all,
>
> Thanks for joining the call this week. Here are my notes of the items 
> discussed.
>
> -Brenden
>
> === Discussion ===
>
> Yonghong:
> - BTF support in LLVM WIP
>   - prototype is approaching functionality
>   - to post to llvm mailing list soon (enabled through -g -target=bpf mode)

The link to the llvm patch which generates .BTF and .BTF.ext sections:
https://reviews.llvm.org/D52950

>
> William:
> - OVS + AF_XDP performance improvements
>   - more batching on tx side
>   - add more asynchronous logic
>   - goes from 5Mpps -> 14Mpps forwarding
> - tested ixgbe AF_XDP v2 support - seems to be working
>
> Daniel:
> - kTLS and sockmap integration WIP with John
>   - planning to post this week
> - bpf plumbers track submissions is closed
>
> JohnF:
> - Working on Cilium pieces for kTLS+sockmap
>
> Joe:
> - Socket lookup helpers were merged this week
>
> Jesper:
> - working on XDP paper with John and Daniel
> - Gave talk in Paris @Kernel Recipes
>   - 
> http://people.netfilter.org/hawk/presentations/KernelRecipes2018/XDP_Kernel_Recipes_2018.pdf
>
> Mauricio:
> - Working to address review comments this week on bpf queue and stackmap
>
> === Attendees ===
> Brenden Blanco
> Yonghong Song
> Jakub Kicinski
> Mauricio Vasquez
> JohnF
> Daniel Borkmann
> Joe Stringer
> William Tu
> Jesper Brouer
> Jiong Wang
> Nic Viljoen
> Quentin Monnet
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1492): https://lists.iovisor.org/g/iovisor-dev/message/1492
Mute This Topic: https://lists.iovisor.org/mt/26721077/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] [PATCH RFC] bpf: Symbol sizes and types in object file

2018-09-19 Thread Yonghong Song
On Wed, Sep 19, 2018 at 1:20 AM Paul Chaignon  wrote:
>
> On Wed, Sept 19, 2018 at 8:17 AM, Yonghong Song wrote:
> > On 9/18/18 12:18 AM, Yutaro Hayakawa wrote:
> > > Hello,
> > >
> > > Paul, thank you for your help. I’m Yutaro a reporter of this issue.
> > >
> > >>
> > >> On Mon, Sep 17, 2018 at 3:46 PM Alexei Starovoitov
> > >> mailto:alexei.starovoi...@gmail.com>>
> > >> wrote:
> > >>>
> > >>> On Mon, Sep 17, 2018 at 11:29:13PM +0200, Paul Chaignon wrote:
> > >>>> I am sending here as an RFC instead of LLVM's mailing list as I'm
> > >>>> not sure
> > >>>> this is the intended behavior, and the "fix" may therefore not be
> > >>>> needed.
> > >>>>
> > >>>> Clang-compiled object files currently don't include the symbol sizes 
> > >>>> and
> > >>>> types.  Some tools however need that information.  For example,
> > >>>> ctfconvert
> > >>>> uses that information to generate FreeBSD's CTF representation from ELF
> > >>>> files.
> > >>>> With this patch, that information is included in object files.
> > >>>>
> > >>>> Signed-off-by: Paul Chaignon  > >>>> <mailto:paul.chaig...@orange.com>>
> > >>>> Reported-by: Yutaro Hayakawa  > >>>> <mailto:yhayakawa3...@gmail.com>>
> > >>>> ---
> > >>>> Index: lib/Target/BPF/MCTargetDesc/BPFMCAsmInfo.h
> > >>>> ===
> > >>>> --- lib/Target/BPF/MCTargetDesc/BPFMCAsmInfo.h(revision 341679)
> > >>>> +++ lib/Target/BPF/MCTargetDesc/BPFMCAsmInfo.h(working copy)
> > >>>> @@ -30,8 +30,8 @@
> > >>>> WeakRefDirective = "\t.weak\t";
> > >>>>
> > >>>> UsesELFSectionDirectiveForBSS = true;
> > >>>> -HasSingleParameterDotFile = false;
> > >>>> -HasDotTypeDotSizeDirective = false;
> > >>>> +HasSingleParameterDotFile = true;
> > >>>> +HasDotTypeDotSizeDirective = true;
> > >>>>
> > >>>> SupportsDebugInformation = true;
> > >>>> ExceptionsType = ExceptionHandling::DwarfCFI;
> > >>>
> > >>> make sense to me, but I wonder why pahole's dwarf->btf converter
> > >>> doesn't have this issue.
> > >>> Could you describe what exactly are you trying to do with llvm
> > >>> generated elf file?
> > >>
> > >> Currently, BTF does not deal with symbols, so that is why it is okay.
> > >> llvm-objdump does not need symbol size either. It only needs symbol
> > >> offset.
> > >> that is why we did not have issues so far.
> > >>
> > >> In my future proposed func support in BTF, I also need offset only.
> > >>
> > >> In your use case, the size is merely to satisfy conversion to CTF or 
> > >> there
> > >> are some other needs? What did you do with CTF?
> > >
> > > Describing the background, I’m trying to port eBPF to FreeBSD
> > > (https://github.com/YutaroHayakawa/generic-ebpf)
> > >
> > > What I want to do with CTF is very similar to that of BTF, but I
> > > currently focusing on assisting verification. Since FreeBSD
> > > already has library to deal with CTF, we prefer to use CTF instead of BTF.
> > >
> > > In FreeBSD, we usually use tool called ctfconvert
> > > (https://www.freebsd.org/cgi/man.cgi?query=ctfconvert=1=0=FreeBSD+10.0-RELEASE
> > > <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.freebsd.org_cgi_man.cgi-3Fquery-3Dctfconvert-26sektion-3D1-26apropos-3D0-26manpath-3DFreeBSD-2B10.0-2DRELEASE=DwMFaQ=5VD0RTtNlTh3ycd41b3MUw=DA8e1B5r073vIqRrFz7MRA=iNlZa9ix8b8zAn4H53xuaTcSjVa5mlX0j4jNsEgF7gY=7c-EOY-jp0QTHfqm121Y9ODTslqckSpZBzQSrP7Qi6o=>)
> > > to generate CTF.
> > >
> > > And the problem is ctfconvert is aware of symbol type and size. It
> > > cannot generate correct CTF if they were missing.
> > >
> > > To solve this issue I need the change.
> >
> > Yutaro/Paul,
> >
> > Thanks for explanation. Now I understand your use case.
> > The change looks good to me. Please send me and Alexei a formal patch
> > through standard "git send-email" or just an email with
> > all necessary patch information, and I will apply
> > to llvm trunk.
>
> Don't we need to cc llvm-commits?  I can also send via Phabricator if
> that's fine with you.

You can also cc llvm-commits.

> >
> > Yonghong
> >
> > >
> > > Regards,
> > > Yutaro
> > >

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1483): https://lists.iovisor.org/g/iovisor-dev/message/1483
Mute This Topic: https://lists.iovisor.org/mt/25729284/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] [PATCH RFC] bpf: Symbol sizes and types in object file

2018-09-17 Thread Yonghong Song
On Mon, Sep 17, 2018 at 3:46 PM Alexei Starovoitov
 wrote:
>
> On Mon, Sep 17, 2018 at 11:29:13PM +0200, Paul Chaignon wrote:
> > I am sending here as an RFC instead of LLVM's mailing list as I'm not sure
> > this is the intended behavior, and the "fix" may therefore not be needed.
> >
> > Clang-compiled object files currently don't include the symbol sizes and
> > types.  Some tools however need that information.  For example, ctfconvert
> > uses that information to generate FreeBSD's CTF representation from ELF
> > files.
> > With this patch, that information is included in object files.
> >
> > Signed-off-by: Paul Chaignon 
> > Reported-by: Yutaro Hayakawa 
> > ---
> > Index: lib/Target/BPF/MCTargetDesc/BPFMCAsmInfo.h
> > ===
> > --- lib/Target/BPF/MCTargetDesc/BPFMCAsmInfo.h(revision 341679)
> > +++ lib/Target/BPF/MCTargetDesc/BPFMCAsmInfo.h(working copy)
> > @@ -30,8 +30,8 @@
> >  WeakRefDirective = "\t.weak\t";
> >
> >  UsesELFSectionDirectiveForBSS = true;
> > -HasSingleParameterDotFile = false;
> > -HasDotTypeDotSizeDirective = false;
> > +HasSingleParameterDotFile = true;
> > +HasDotTypeDotSizeDirective = true;
> >
> >  SupportsDebugInformation = true;
> >  ExceptionsType = ExceptionHandling::DwarfCFI;
>
> make sense to me, but I wonder why pahole's dwarf->btf converter
> doesn't have this issue.
> Could you describe what exactly are you trying to do with llvm generated elf 
> file?

Currently, BTF does not deal with symbols, so that is why it is okay.
llvm-objdump does not need symbol size either. It only needs symbol offset.
that is why we did not have issues so far.

In my future proposed func support in BTF, I also need offset only.

In your use case, the size is merely to satisfy conversion to CTF or there
are some other needs? What did you do with CTF?

>
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1480): https://lists.iovisor.org/g/iovisor-dev/message/1480
Mute This Topic: https://lists.iovisor.org/mt/25729284/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] [PATCH 0/2] tools: Skip backward time entries

2018-08-28 Thread Yonghong Song
Jiri,

The change looks good. Could you generate a pull request against the
bcc/iovisor github repository?

Thanks!

Yonghong
On Tue, Aug 28, 2018 at 5:01 AM Jiri Olsa  wrote:
>
> hi,
> on RHEL7 we are facing timing issues in some bcc-tools
> scripts. The time subsystem might return backward timestamp
> via bpf_ktime_get_ns and thus screw up latency computation.
>
> This seems to be RHEL7 kernel issue and needs to be fixed,
> however meanwhile it'd be nice if affected scripts display
> sane values despite the kernel issue.
>
> This patchset harden 2 scripts (xfsslower and ext4dist)
> we saw this behaviour so far, to ensure the latency is
> always > 0.
>
> thanks,
> jirka
>
>
> ---
> Jiri Olsa (2):
>   tools: Skip backward time entries in xfsslower
>   tools: Skip backward time entries in ext4dist
>
>  tools/ext4dist.py  | 10 --
>  tools/xfsslower.py |  9 -
>  2 files changed, 16 insertions(+), 3 deletions(-)
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1466): https://lists.iovisor.org/g/iovisor-dev/message/1466
Mute This Topic: https://lists.iovisor.org/mt/25038969/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] [PATCH] BPF: helpers: New helper bpf_get_current_pidns_info to obtain namespace data from current task.

2018-08-08 Thread Yonghong Song
On Wed, Aug 8, 2018 at 6:32 PM, Carlos Neira  wrote:
> Yonghong Song,
> Let me know what you think of this last patch

Looks good. Only one comment in the below. After you addressed it, you can
send the patch to netdev with patch prefix [PATCH bpf-next].


>
> From 49d9dbcbd52c82dc44b6351ac1824538a6db155a Mon Sep 17 00:00:00 2001
> From: cneira 
> Date: Wed, 8 Aug 2018 17:40:54 -0400
> Subject: [PATCH] BPF: helpers: New helper to obtain namespace data from
>  current task
>
> This helper obtains the active namespace from current and returns pid, tgid,
> device and namespace id as seen from that namespace, allowing to instrument
> a process inside a container.
> Device is read from /proc/self/ns/pid, as in the future it's possible that
> different pid_ns files may belong to different devices, according
> to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
> conference.
>
> Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
> scripts but this helper returns the pid as seen by the root namespace which is
> fine when a bcc script is not executed inside a container.
> When the process of interest is inside a container, pid filtering will not 
> work
> if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
> returning the pid as it's seen by the current namespace where the script is
> executing.
>
> This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
> used to do pid filtering even inside a container.
>
> For example a bcc script using bpf_get_current_pid_tgid() 
> (tools/funccount.py):
>
> u32 pid = bpf_get_current_pid_tgid() >> 32;
> if (pid != )
> return 0;
>
> Could be modified to use bpf_get_current_pidns_info() as follows:
>
> struct bpf_pidns pidns;
> bpf_get_current_pid_tgid(, sizeof(struct bpf_pidns));
> u32 pid = pidns.tgid;
> u32 nsid = pidns.nsid;
> if ((pid != ) && (nsid != ))
> return 0;
>
> To find out the name PID namespace id of a process, you could use this 
> command:
>
> $ ps -h -o pidns -p 
>
> Or this other command:
>
> $ ls -Li /proc//ns/pid
>
> Signed-off-by: Carlos Antonio Neira Bustos 
> ---
>  include/linux/bpf.h   |  1 +
>  include/uapi/linux/bpf.h  | 25 +++-
>  kernel/bpf/core.c |  1 +
>  kernel/bpf/helpers.c  | 64 
> +++
>  kernel/trace/bpf_trace.c  |  2 +
>  samples/bpf/Makefile  |  3 ++
>  samples/bpf/trace_ns_info_user.c  | 35 +
>  samples/bpf/trace_ns_info_user_kern.c | 45 ++
>  tools/include/uapi/linux/bpf.h| 25 +++-
>  tools/testing/selftests/bpf/bpf_helpers.h |  3 ++
>  10 files changed, 202 insertions(+), 2 deletions(-)
>  create mode 100644 samples/bpf/trace_ns_info_user.c
>  create mode 100644 samples/bpf/trace_ns_info_user_kern.c
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index cd8790d2c6ed..3f4b999f7c99 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -787,6 +787,7 @@ extern const struct bpf_func_proto bpf_get_stack_proto;
>  extern const struct bpf_func_proto bpf_sock_map_update_proto;
>  extern const struct bpf_func_proto bpf_sock_hash_update_proto;
>  extern const struct bpf_func_proto bpf_get_current_cgroup_id_proto;
> +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
>
>  extern const struct bpf_func_proto bpf_get_local_storage_proto;
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index dd5758dc35d3..22fe6b0b6a74 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -173,6 +173,16 @@ enum bpf_attach_type {
> __MAX_BPF_ATTACH_TYPE
>  };
>
> +/* helper bpf_get_current_pidns_info will store the following
> + * data, dev will contain major/minor from /proc/self/pid.
> + */
> +struct bpf_pidns_info {
> +   __u32 dev;
> +   __u32 nsid;
> +   __u32 tgid;
> +   __u32 pid;
> +};

Please move this definition to the end of file.

> +
>  #define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
>
>  /* cgroup-bpf attach flags used in BPF_PROG_ATTACH command
> @@ -2113,6 +2123,18 @@ union bpf_attr {
>   * the shared data.
>   * Return
>   * Pointer to the local storage area.
> + *
> + * int bpf_get_current_pidns(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> + * Description
> + * Copies into *pidns* pid, namespace id and tgid as seen by the
> + * c

Re: [iovisor-dev] BTF examples or documentation?

2018-08-08 Thread Yonghong Song
On Wed, Aug 8, 2018 at 2:16 PM, Daniel Borkmann  wrote:
> On 08/08/2018 10:57 PM, Brendan Gregg wrote:
>> On Wed, Aug 8, 2018 at 1:28 PM, PJ Waskiewicz
>>  wrote:
>>> Hi folks,
>>>
>>> We (Maciek and I) were on the call today, and were experiencing some 
>>> audio/mic issues.  We're looking for any documentation or usage models that 
>>> people might have in flight already for BTF.  Outside of the kernel source, 
>>> we are drawing a blank.  Any pointers here would be most helpful.
>>
>> I expect to use BTF as a source for kernel struct information, as a
>> lightweight version of debuginfo (and hopefully one day BTF info will
>> be included in the kernel binary, so it's always there). I have no
>> docs to share yet. If you figure anything out, please share anywhere.
>> Here, blog post, talk, etc. :-)

Currently, we are still in the design and prototyping stage. Will
definitely share for comments
once we got some concept design and RFC implementation.

>
> There are some initial bits on BTF (mostly on the workflow for now) in
> the Cilium doc that can be found here:
>
>  - http://cilium.readthedocs.io/en/latest/bpf/#llvm
>  - http://cilium.readthedocs.io/en/latest/bpf/#bpftool
>
> Perhaps that helps to get started.
>
> Cheers,
> Daniel
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1452): https://lists.iovisor.org/g/iovisor-dev/message/1452
Mute This Topic: https://lists.iovisor.org/mt/24233525/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] [PATCH] BPF: helpers: New helper bpf_get_current_pidns_info to obtain namespace data from current task.

2018-08-07 Thread Yonghong Song
On Tue, Aug 7, 2018 at 3:47 PM, Carlos Neira  wrote:
> Quentin,
>
> Thanks a lot for your feedback, I think I'm getting closer to finish this one.
> I have included the changes suggested, let me know if there is anything else 
> that needs to be changed or improved.
>
> From fceaa121bbff332e7ad4efd41660d96c006cbb3d Mon Sep 17 00:00:00 2001
> From: cneira 
> Date: Tue, 7 Aug 2018 18:43:20 -0400
> Subject: [PATCH] BPF: helpers: New helper to obtain namespace data from
>  current task
>
> This helper obtains the active namespace from current and returns pid, tgid,
> device major/minor and namespace as seen from that ns, allowing to instrument
> process inside a container.
> Major and  minor are obtained from /proc/self/pid, as in the future it's
> possible that different pid_ns files may belong to different devices, 
> according
> to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
> conference.
>
> Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
> scripts but this helper returns the pid as seen by the root namespace which is
> fine when a bcc script is not executed inside a container.
> When the process of interest is inside a container, pid filtering will not 
> work
> if bpf_get_current_pid_tgid is used. This helper addresses this limitation

bpf_get_current_pid_tgid()

> returning the pid as it's seen by the current namespace where the script is
> executing.
>
> This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
> used to do pid filtering even inside a container.
>
> For example a bcc script using bpf_get_current_pid_tgid() 
> (tools/funccount.py):
>
> u32 pid = bpf_get_current_pid_tgid() >> 32;
> if (pid != ) { return 0; }
>
> Could be modified to use bpf_get_current_pidns_info() as follows:
>
> struct bpf_pidns pidns;
> bpf_get_current_pid_tgid(, sizeof(struct bpf_pidns));
> u32 pid = pidns.tgid;
> if (pid != ) { return 0; }

This may not enough as different name spaces could have the same tgid.
Could you add comparison for pidns.nsid as well? You also need to tell how
user can find nsid inside the container for a particular pid.

>
> Signed-off-by: Carlos Antonio Neira Bustos 
> ---
>  include/linux/bpf.h   |  1 +
>  include/uapi/linux/bpf.h  | 25 +++-
>  kernel/bpf/core.c |  1 +
>  kernel/bpf/helpers.c  | 64 
> +++
>  kernel/trace/bpf_trace.c  |  2 +
>  samples/bpf/Makefile  |  3 ++
>  samples/bpf/trace_ns_info_user.c  | 35 +
>  samples/bpf/trace_ns_info_user_kern.c | 45 ++
>  tools/include/uapi/linux/bpf.h| 25 +++-
>  tools/testing/selftests/bpf/bpf_helpers.h |  3 ++
>  10 files changed, 202 insertions(+), 2 deletions(-)
>  create mode 100644 samples/bpf/trace_ns_info_user.c
>  create mode 100644 samples/bpf/trace_ns_info_user_kern.c
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index cd8790d2c6ed..3f4b999f7c99 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -787,6 +787,7 @@ extern const struct bpf_func_proto bpf_get_stack_proto;
>  extern const struct bpf_func_proto bpf_sock_map_update_proto;
>  extern const struct bpf_func_proto bpf_sock_hash_update_proto;
>  extern const struct bpf_func_proto bpf_get_current_cgroup_id_proto;
> +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
>
>  extern const struct bpf_func_proto bpf_get_local_storage_proto;
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index dd5758dc35d3..22fe6b0b6a74 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -173,6 +173,16 @@ enum bpf_attach_type {
> __MAX_BPF_ATTACH_TYPE
>  };
>
> +/* helper bpf_get_current_pidns_info will store the following
> + * data, dev will contain major/minor from /proc/self/pid.
> + */
> +struct bpf_pidns_info {
> +   __u32 dev;
> +   __u32 nsid;
> +   __u32 tgid;
> +   __u32 pid;
> +};

It is awkward to put the definition here between bpf_attach_type
and the below MAX_BPF_ATTACH_TYPE. Please put this definition
at the end of bpf.h.

Remember to sync the include/uapi/linux/bpf.h to tools directory
after the patch.

> +
>  #define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
>
>  /* cgroup-bpf attach flags used in BPF_PROG_ATTACH command
> @@ -2113,6 +2123,18 @@ union bpf_attr {
>   * the shared data.
>   * Return
>   * Pointer to the local storage area.
> + *
> + * int bpf_get_current_pidns(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> + * Description
> + * Copies into *pidns* pid, namespace id and tgid as seen by the
> + * current namespace and also device from /proc/self/ns/pid.
> + * *size_of_pidns* must be the size of *pidns*
> + *
> + * This helper is 

Re: [iovisor-dev] Notification when an eBPF map is modified

2018-08-06 Thread Yonghong Song
On Mon, Aug 6, 2018 at 2:52 PM, Raffaele Sommese  wrote:
>> Okay, the htab_map_update_elem is indeed called, but you cannot trace it.
>> The following kernel code in kernel/bpf/syscall.c explained the reason:
>>
>> /* must increment bpf_prog_active to avoid kprobe+bpf triggering from
>>  * inside bpf map update or delete otherwise deadlocks are possible
>>  */
>> preempt_disable();
>> __this_cpu_inc(bpf_prog_active);
>> if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
>> map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
>> err = bpf_percpu_hash_update(map, key, value, attr->flags);
>> } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
>> err = bpf_percpu_array_update(map, key, value, attr->flags);
>> } else if (IS_FD_ARRAY(map)) {
>> rcu_read_lock();
>> err = bpf_fd_array_map_update_elem(map, f.file, key, value,
>>attr->flags);
>> rcu_read_unlock();
>> } else if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
>> rcu_read_lock();
>> err = bpf_fd_htab_map_update_elem(map, f.file, key, value,
>>   attr->flags);
>> rcu_read_unlock();
>> } else {
>> rcu_read_lock();
>> err = map->ops->map_update_elem(map, key, value, 
>> attr->flags);
>> rcu_read_unlock();
>> }
>> __this_cpu_dec(bpf_prog_active);
>> preempt_enable();
>>
>> The bpf_prog_active will prevent later kprobe for htab_map_update_elem.
>>
>> How can we solve this problem then? One possible solution is as follows:
>>   . disassemble vmlinux to find a proper place in function "map_update_elem"
>> where you can get the "map" (struct bpf_map *map) in a register, e.g.,
>> the insn offset inside map_update_elem is OFFSET and this OFFSET
>> should be outside the above preempt/__this_cpu_{inc/dec} region.
>>. improve trace.py to trace function+offset. the possible format could be
>>  trace.py 'map_update_elem+OFFSET ...'
>> The attach_kprobe API should already support function_name + offset format.
>
> I think that this way can be very tricky and platform depended, I have
> found another solution. If I attach my bpf program to bpf_map_new_fd
> with a kprobe and a kretprobe I can recover the mapping between (fd of
> map-pid) and id or the name of the map (and save it). I have tested it
> and it seems to work.
> Then I can trace map_update_elem syscall and read the data (I'm
> interested only into the key) from the userspace.
> I attach the code here, it can be helpful if other people that want to
> address this problem.
> https://gist.github.com/raffysommy/45cf0544f34eb0e5fbf533f4d9a3b955
> Thank you again for the support and for your time.

Yes, this approach should work too. I am thinking whether we could do
it with one invocation of trace.py...

> Raffaele
>
>
> --
> 
> Raffaele Sommese
> Mail:raffyso...@gmail.com
> About me:https://about.me/r4ffy
> Gpg Key:http://www.r4ffy.info/Openpgp.asc
> GPG key ID: 0x830b1428cf91db2a on http://pgp.mit.edu:11371/

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1435): https://lists.iovisor.org/g/iovisor-dev/message/1435
Mute This Topic: https://lists.iovisor.org/mt/21386293/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Notification when an eBPF map is modified

2018-08-06 Thread Yonghong Song
On Mon, Aug 6, 2018 at 11:53 AM, Raffaele Sommese  wrote:
> Il giorno lun 6 ago 2018 alle ore 19:40 Y Song  ha 
> scritto:
>>
>> On Mon, Aug 6, 2018 at 10:17 AM, Raffaele Sommese  
>> wrote:
>> >> bpf tracepoints have been removed from recent linux so the you need to
>> >> use kprobe to trace update/delete.
>> >>
>> >> typical map_update_elem and map_delete_elem first argument is
>> >> 'struct bpf_map *map', you can get name and id from there:
>> >>
>> > Hello again :)
>> > It seems that there is 2 function that can be traced inside the
>> > kernel, one is map_update_elem, and it is the syscall, the other one
>> > is the BPF helper.
>> > I have successful attach my ebpf code to the first one, but it doesn't
>> > have as parameter struct bpf_map *map (it have a union bpf_attr).
>> > If I attach my program to the bpf_map_update_elem (that I think is the
>> > function name of BPF helper), I don't receive any event.
>> > I'm using the last version of bcc and of kernel.
>> > I try also with kprobe program of perf kernel suite with the same results.
>> > I was looking for this helper BPF_CALL_4 (bpf_map_update_elem, struct
>> > bpf_map *, map, void *, key,  void *, value, u64, flags)
>>
>> Please directly use the map lookup function for the specific map.
>> For example, for hashmap, the verifier is smart enough to
>> change the byte code to call the underlying hashmap map lookup function.
>
> Thank you, right now I will try only to implement a solution for hashmap.
> I have detected a strange behavior for lookup I can receive the event
> when the map was looked, but for the updates, I don't receive
> anything.
> I have checked the kernel and there was the map_gen_lookup.
> The strange thing is that if I use kprobe tool I can see the event on
> htab_map_update_elem.
> Here is my test code: (I have tried with lookup and it works)
> https://gist.github.com/raffysommy/1dabe5bf9487d974f3acd1f7a32ed01c
> https://gist.github.com/raffysommy/587f61c14d3e157f86da1aadd07442b1

Okay, the htab_map_update_elem is indeed called, but you cannot trace it.
The following kernel code in kernel/bpf/syscall.c explained the reason:

/* must increment bpf_prog_active to avoid kprobe+bpf triggering from
 * inside bpf map update or delete otherwise deadlocks are possible
 */
preempt_disable();
__this_cpu_inc(bpf_prog_active);
if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
err = bpf_percpu_hash_update(map, key, value, attr->flags);
} else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
err = bpf_percpu_array_update(map, key, value, attr->flags);
} else if (IS_FD_ARRAY(map)) {
rcu_read_lock();
err = bpf_fd_array_map_update_elem(map, f.file, key, value,
   attr->flags);
rcu_read_unlock();
} else if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
rcu_read_lock();
err = bpf_fd_htab_map_update_elem(map, f.file, key, value,
  attr->flags);
rcu_read_unlock();
} else {
rcu_read_lock();
err = map->ops->map_update_elem(map, key, value, attr->flags);
rcu_read_unlock();
}
__this_cpu_dec(bpf_prog_active);
preempt_enable();

The bpf_prog_active will prevent later kprobe for htab_map_update_elem.

How can we solve this problem then? One possible solution is as follows:
  . disassemble vmlinux to find a proper place in function "map_update_elem"
where you can get the "map" (struct bpf_map *map) in a register, e.g.,
the insn offset inside map_update_elem is OFFSET and this OFFSET
should be outside the above preempt/__this_cpu_{inc/dec} region.
   . improve trace.py to trace function+offset. the possible format could be
 trace.py 'map_update_elem+OFFSET ...'
The attach_kprobe API should already support function_name + offset format.

> Thanks again,
> Raffaele
>
> --
> 
> Raffaele Sommese
> Mail:raffyso...@gmail.com
> About me:https://about.me/r4ffy
> Gpg Key:http://www.r4ffy.info/Openpgp.asc
> GPG key ID: 0x830b1428cf91db2a on http://pgp.mit.edu:11371/

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1433): https://lists.iovisor.org/g/iovisor-dev/message/1433
Mute This Topic: https://lists.iovisor.org/mt/21386293/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor] [iovisor-dev] using BPF for driver memory leak detection

2018-08-06 Thread Yonghong Song
On Mon, Aug 6, 2018 at 2:44 AM, Krishna Chaitanya
 wrote:
> On Mon, Aug 6, 2018 at 3:12 PM Krishna Chaitanya
>  wrote:
>>
>> Hi,
>>
>> If we want to use BPF for memory leak for driver modules, we can only
>> print outstanding allocations which could potentially be freed at a
>> later time.
>>
>> Is there a simple way we could pre-load BPF (before driver insertion)
>> and check for the outstanding entries after the driver removal, which
>> are confirmed leaks?
>>
>> This is useful to debug memory allocations at insmod/rmmod.
>>
>> Currently, without insmod, BPF compilation fails as tracepoints are
>> not installed. And also if tracing is active, rmmod fails with
>> "resource in use".

Which tracepoint you tried to attach? For kernel memory, a few kmem
tracepoints are attached in memleak. If you want to attach a tracepoint
only defined by the module, yes, you may have to wait until the module
is inserted into the kernel.

>>
> +dev
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#2): https://lists.iovisor.org/g/main/message/2
Mute This Topic: https://lists.iovisor.org/mt/24211362/21656
Group Owner: main+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/main/leave/2589919/776971353/xyzzy  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] [RFC PATCH v2 0/3] Implement bpf map queue

2018-08-03 Thread Yonghong Song
On Thu, Aug 2, 2018 at 10:29 AM, Mauricio Vasquez
 wrote:
> Bpf queue map is a new kind of map that provides a LIFO/FIFO queue
> implementation.
>
> In some applications, like a SNAT, it is necessary to keep track of
> a pool of free elemenets, network ports in this case, then a queue
> can be used for that purpose.

I did not check the detailed implementation for percpu freelist implementation,
but the general framework looks sound. I think the patch is ready to send
to netdev with "[PATCH bpf-next]" prefix for more review if bpf-next does
not close soon.

Thanks!

>
> v2:
> - precharge memory on map allocation
> - add percpu free list when prealloc is enabled
>
> Signed-off-by: Mauricio Vasquez B 
>
> ---
>
> Mauricio Vasquez B (3):
>   bpf: add bpf queue map
>   selftests/bpf: add test cases for BPF_MAP_TYPE_QUEUE
>   bpf: add sample for BPF_MAP_TYPE_QUEUE
>
>
>  include/linux/bpf_types.h   |1
>  include/uapi/linux/bpf.h|5 +
>  kernel/bpf/Makefile |2
>  kernel/bpf/queuemap.c   |  285 
> +++
>  kernel/bpf/syscall.c|   61 +--
>  kernel/bpf/verifier.c   |   10 +
>  samples/bpf/.gitignore  |1
>  samples/bpf/Makefile|3
>  samples/bpf/test_map_in_map_user.c  |9 -
>  samples/bpf/test_queuemap.sh|   37 
>  samples/bpf/test_queuemap_kern.c|   51 ++
>  samples/bpf/test_queuemap_user.c|   53 ++
>  tools/include/uapi/linux/bpf.h  |5 +
>  tools/testing/selftests/bpf/test_maps.c |   71 
>  14 files changed, 569 insertions(+), 25 deletions(-)
>  create mode 100644 kernel/bpf/queuemap.c
>  create mode 100755 samples/bpf/test_queuemap.sh
>  create mode 100644 samples/bpf/test_queuemap_kern.c
>  create mode 100644 samples/bpf/test_queuemap_user.c
>
> --
> Signature
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1426): https://lists.iovisor.org/g/iovisor-dev/message/1426
Mute This Topic: https://lists.iovisor.org/mt/24146803/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] [RFC PATCH v2 1/3] bpf: add bpf queue map

2018-08-03 Thread Yonghong Song
On Thu, Aug 2, 2018 at 10:30 AM, Mauricio Vasquez
 wrote:
> Bpf queue implements a LIFO/FIFO data containers for ebpf programs.
>
> It allows to push an element to the queue by using the update operation
> and to pop an element from the queue by using the lookup operation.
>
> A use case for this is to keep track of a pool of elements, like
> network ports in a SNAT.
>
> Signed-off-by: Mauricio Vasquez B 
> ---
>  include/linux/bpf_types.h |1
>  include/uapi/linux/bpf.h  |5 +
>  kernel/bpf/Makefile   |2
>  kernel/bpf/queuemap.c |  285 
> +
>  kernel/bpf/syscall.c  |   61 +++---
>  kernel/bpf/verifier.c |   10 +-
>  6 files changed, 346 insertions(+), 18 deletions(-)
>  create mode 100644 kernel/bpf/queuemap.c
>
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index c5700c2d5549..6c7a62f3fe43 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -58,3 +58,4 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
>  #endif
>  #endif
> +BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 0ebaaf7f3568..2c171c40eb45 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -120,6 +120,7 @@ enum bpf_map_type {
> BPF_MAP_TYPE_CPUMAP,
> BPF_MAP_TYPE_XSKMAP,
> BPF_MAP_TYPE_SOCKHASH,
> +   BPF_MAP_TYPE_QUEUE,
>  };
>
>  enum bpf_prog_type {
> @@ -255,6 +256,10 @@ enum bpf_attach_type {
>  /* Flag for stack_map, store build_id+offset instead of pointer */
>  #define BPF_F_STACK_BUILD_ID   (1U << 5)
>
> +/* Flags for queue_map, type of queue */
> +#define BPF_F_QUEUE_FIFO   (1U << 16)
> +#define BPF_F_QUEUE_LIFO   (2U << 16)
> +
>  enum bpf_stack_build_id_status {
> /* user space need an empty entry to identify end of a trace */
> BPF_STACK_BUILD_ID_EMPTY = 0,
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index f27f5496d6fe..30f02ef66635 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -2,7 +2,7 @@
>  obj-y := core.o
>
>  obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
> -obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
> bpf_lru_list.o lpm_trie.o map_in_map.o
> +obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
> bpf_lru_list.o lpm_trie.o map_in_map.o queuemap.o
>  obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>  obj-$(CONFIG_BPF_SYSCALL) += btf.o
>  ifeq ($(CONFIG_NET),y)
> diff --git a/kernel/bpf/queuemap.c b/kernel/bpf/queuemap.c
> new file mode 100644
> index ..b4f870ac9f3a
> --- /dev/null
> +++ b/kernel/bpf/queuemap.c
> @@ -0,0 +1,285 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * queuemap.c: BPF queue map
> + *
> + * Copyright (c) 2018 Politecnico di Torino
> + */
> +#include 
> +#include 
> +#include 
> +#include "percpu_freelist.h"
> +
> +#define QUEUE_CREATE_FLAG_MASK \
> +   (BPF_F_NO_PREALLOC | BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY | \
> +BPF_F_QUEUE_FIFO | BPF_F_QUEUE_LIFO)
> +
> +enum queue_type {
> +   QUEUE_FIFO = (BPF_F_QUEUE_FIFO >> 16),
> +   QUEUE_LIFO = (BPF_F_QUEUE_LIFO >> 16),
> +};
> +
> +struct bpf_queue {
> +   struct bpf_map map;
> +   struct list_head head;
> +   struct pcpu_freelist freelist;
> +   void *nodes;
> +   enum queue_type type;
> +   raw_spinlock_t lock;
> +   atomic_t count;
> +   u32 node_size;
> +};
> +
> +struct queue_node {
> +   struct pcpu_freelist_node fnode;
> +   struct bpf_queue *queue;
> +   struct list_head list;
> +   struct rcu_head rcu;
> +   char element[0] __aligned(8);
> +};
> +
> +static bool queue_map_is_prealloc(struct bpf_queue *queue) {
> +   return !(queue->map.map_flags & BPF_F_NO_PREALLOC);
> +}
> +
> +/* Called from syscall */
> +static int queue_map_alloc_check(union bpf_attr *attr)
> +{
> +   /* check sanity of attributes */
> +   if (attr->max_entries == 0 || attr->key_size != 0 ||
> +   attr->value_size == 0 || attr->map_flags & 
> ~QUEUE_CREATE_FLAG_MASK)
> +   return -EINVAL;
> +
> +   if ((attr->map_flags >> 16) != QUEUE_FIFO &&
> +   (attr->map_flags >> 16) != QUEUE_LIFO) {
> +   return -EINVAL;
> +   }
> +
> +   if (attr->value_size > KMALLOC_MAX_SIZE)
> +   /* if value_size is bigger, the user space won't be able to
> +* access the elements.
> +*/
> +   return -E2BIG;
> +
> +   return 0;
> +}
> +
> +static int prealloc_init(struct bpf_queue *queue)
> +{
> +   u32 node_size = sizeof(struct queue_node) +
> +   round_up(queue->map.value_size, 8);
> +   u32 num_entries = queue->map.max_entries;
> +   int err;
> +
> +   queue->nodes = bpf_map_area_alloc(node_size * num_entries,
> +  

Re: [iovisor-dev] [RFC PATCH 3/3] bpf: add sample for BPF_MAP_TYPE_QUEUE

2018-08-03 Thread Yonghong Song
On Thu, Aug 2, 2018 at 10:22 AM, Mauricio Vasquez
 wrote:
>
>
> On 08/01/2018 12:41 AM, Y Song wrote:
>>
>> On Tue, Jul 31, 2018 at 1:53 PM, Mauricio Vasquez
>>  wrote:
>>>
>>> The example is made by two parts, a eBPF program that consumes elements
>>> from a FIFO queue and prints them in the screen and a user space
>>> application that inserts new elements into the queue each time this is
>>> executed.
>>>
>>> Signed-off-by: Mauricio Vasquez B 
>>> ---
>>>   samples/bpf/.gitignore   |1 +
>>>   samples/bpf/Makefile |3 ++
>>>   samples/bpf/test_queuemap.sh |   37 +
>>>   samples/bpf/test_queuemap_kern.c |   56
>>> ++
>>>   samples/bpf/test_queuemap_user.c |   53
>>> 
>>
>> Could you put the test in tools/testing/selftests/bpf? This way, it
>> will be executed by various
>> kernel testing infrastructure.
>
>
> I am not sure, it is more an example than a test, real tests are already in
> tools/testing/selftests/bpf/test_maps.c.
> Maybe we can rename it to queue_map_sample or something similar, or remove
> it if you don't see any value on it.

There are still some values as it has bpf program doing "pop" operation.

>
>> For /sys/fs/bpf mount, you can check whether it is available or not,
>> if not, your test script
>> can mount it and tear it down during cleanup.
>>
>>
>>>   5 files changed, 150 insertions(+)
>>>   create mode 100755 samples/bpf/test_queuemap.sh
>>>   create mode 100644 samples/bpf/test_queuemap_kern.c
>>>   create mode 100644 samples/bpf/test_queuemap_user.c
>>>
>>> diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
>>> index 8ae4940025f8..d7e518c1b3ed 100644
>>> --- a/samples/bpf/.gitignore
>>> +++ b/samples/bpf/.gitignore
>>> @@ -26,6 +26,7 @@ test_lru_dist
>>>   test_map_in_map
>>>   test_overhead
>>>   test_probe_write_user
>>> +test_queuemap
>>>   trace_event
>>>   trace_output
>>>   tracex1
>>> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
>>> index f88d5683d6ee..624f4f4b81db 100644
>>> --- a/samples/bpf/Makefile
>>> +++ b/samples/bpf/Makefile
>>> @@ -53,6 +53,7 @@ hostprogs-y += xdpsock
>>>   hostprogs-y += xdp_fwd
>>>   hostprogs-y += task_fd_query
>>>   hostprogs-y += xdp_sample_pkts
>>> +hostprogs-y += test_queuemap
>>>
>>>   # Libbpf dependencies
>>>   LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
>>> @@ -109,6 +110,7 @@ xdpsock-objs := xdpsock_user.o
>>>   xdp_fwd-objs := xdp_fwd_user.o
>>>   task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
>>>   xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
>>> +test_queuemap-objs := bpf_load.o test_queuemap_user.o
>>>
>>>   # Tell kbuild to always build the programs
>>>   always := $(hostprogs-y)
>>> @@ -166,6 +168,7 @@ always += xdpsock_kern.o
>>>   always += xdp_fwd_kern.o
>>>   always += task_fd_query_kern.o
>>>   always += xdp_sample_pkts_kern.o
>>> +always += test_queuemap_kern.o
>>>
>>>   HOSTCFLAGS += -I$(objtree)/usr/include
>>>   HOSTCFLAGS += -I$(srctree)/tools/lib/
>>> diff --git a/samples/bpf/test_queuemap.sh b/samples/bpf/test_queuemap.sh
>>> new file mode 100755
>>> index ..ed08c1fa8c2c
>>> --- /dev/null
>>> +++ b/samples/bpf/test_queuemap.sh
>>> @@ -0,0 +1,37 @@
>>> +#!/bin/bash
>>> +# SPDX-License-Identifier: GPL-2.0
>>> +
>>> +[[ -z $TC ]] && TC='tc'
>>> +[[ -z $IP ]] && IP='ip'
>>> +
>>> +TEST_QUEUE_USER='./test_queuemap'
>>> +TEST_QUEUE_BPF='./test_queuemap_kern.o'
>>> +
>>> +function config {
>>> +   $IP netns add ns1
>>> +   $IP link add ve1 type veth peer name vens1
>>> +   $IP link set dev ve1 up
>>> +   $IP link set dev ve1 mtu 1500
>>> +   $IP link set dev vens1 netns ns1
>>> +
>>> +   $IP -n ns1 link set dev lo up
>>> +   $IP -n ns1 link set dev vens1 up
>>> +   $IP -n ns1 addr add 10.1.1.101/24 dev vens1
>>> +
>>> +   $IP addr add 10.1.1.1/24 dev ve1
>>> +   $TC qdisc add dev ve1 clsact
>>> +   $TC filter add dev ve1 ingress bpf da obj $TEST_QUEUE_BPF sec
>>> test_queue
>>> +}
>>> +
>>> +function cleanup {
>>> +   set +e
>>> +   [[ -z $DEBUG ]] || set +x
>>> +   $IP netns delete ns1 >& /dev/null
>>> +   $IP link del ve1 >& /dev/null
>>> +   rm -f /sys/fs/bpf/tc/globals/queue
>>> +   [[ -z $DEBUG ]] || set -x
>>> +   set -e
>>> +}
>>> +
>>> +cleanup
>>> +config
>>> diff --git a/samples/bpf/test_queuemap_kern.c
>>> b/samples/bpf/test_queuemap_kern.c
>>> new file mode 100644
>>> index ..88d4795dd832
>>> --- /dev/null
>>> +++ b/samples/bpf/test_queuemap_kern.c
>>> @@ -0,0 +1,56 @@
>>> +/*
>>> + * Copyright (c) 2018 Politecnico di Torino
>>> + *
>>> + * This program is free software; you can redistribute it and/or
>>> + * modify it under the terms of version 2 of the GNU General Public
>>> + * License as published by the Free Software Foundation.
>>> + */
>>> +#define KBUILD_MODNAME "foo"
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 

Re: [iovisor-dev] Notification when an eBPF map is modified

2018-08-01 Thread Yonghong Song
On Wed, Aug 1, 2018 at 2:36 AM, Raffaele Sommese  wrote:
> Hello everybody,
> I was looking for a similar mechanism,
> I need to trace an event on map update/delete, I have tried with
> tracepoint but I can recover only the file descriptor of map and I
> need the map id too (or the map name).
> Is there some other solution to trace this event and recover this data?

bpf tracepoints have been removed from recent linux so the you need to
use kprobe to trace update/delete.

typical map_update_elem and map_delete_elem first argument is
'struct bpf_map *map', you can get name and id from there:

struct bpf_map {
/* The first two cachelines with read-mostly members of which some
 * are also accessed in fast-path (e.g. ops, max_entries).
 */
const struct bpf_map_ops *ops cacheline_aligned;
struct bpf_map *inner_map_meta;
#ifdef CONFIG_SECURITY
void *security;
#endif
enum bpf_map_type map_type;
u32 key_size;
u32 value_size;
u32 max_entries;
u32 map_flags;
u32 pages;
u32 id;
int numa_node;
u32 btf_key_type_id;
u32 btf_value_type_id;
struct btf *btf;
bool unpriv_array;
/* 55 bytes hole */

/* The 3rd and 4th cacheline with misc members to avoid false sharing
 * particularly with refcounting.
 */
struct user_struct *user cacheline_aligned;
atomic_t refcnt;
atomic_t usercnt;
struct work_struct work;
char name[BPF_OBJ_NAME_LEN];
};


> I prefer to avoid to modify the kernel code.
> Thank You,
> Best Regards
> Raffaele
> Il giorno sab 17 feb 2018 alle ore 18:41 Jesper Dangaard Brouer via
> iovisor-dev  ha scritto:
>>
>>
>>
>> On Sat, 17 Feb 2018 13:49:22 + Teng Qin via iovisor-dev 
>>  wrote:
>>
>> > > We were looking for a mechanism transparent to the eBPF program, though.
>> > > A possible rational is to have an hot-standby copy of the program
>> > > (including the state) in some other location, but I don't want my
>> > > dataplane to be aware of that.
>> > > Thanks,
>> > >
>> > > fulvio
>> >
>> >
>> > You could also (use another BPF program or ftrace) to trace the
>> > bpf_map_update_elem Tracepoint. But in that case you get all update calls
>> > and would need to filter for the one you are interested on your own:)
>>
>> That is a good idea.
>>
>> Try it out via perf-record to see if it contains what you need:
>>
>>  $ perf record -e bpf:bpf_map_update_elem -a
>>
>>  $ perf script
>>  xdp_redirect_ma  2273 [011] 261187.968223: bpf:bpf_map_update_elem: map 
>> type= ufd=4 key=[00 00 00 00] val=[07 00 00 00]
>>
>>
>> Looking at the above output and tracepoint kernel code, we should
>> extend that with a map_id to easily identify/filter what map you are
>> interested in.
>>
>> See patch below signature (not even compile tested).
>>
>> Example for attaching to tracepoints see:
>>  samples/bpf/xdp_monitor_*.c
>>
>> --
>> Best regards,
>>   Jesper Dangaard Brouer
>>   MSc.CS, Principal Kernel Engineer at Red Hat
>>   LinkedIn: http://www.linkedin.com/in/brouer
>>
>> tracepoint: add map id to bpf tracepoints
>>
>> From: Jesper Dangaard Brouer 
>>
>>
>> ---
>>  include/trace/events/bpf.h |   12 +---
>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/trace/events/bpf.h b/include/trace/events/bpf.h
>> index 150185647e6b..e6479ba45261 100644
>> --- a/include/trace/events/bpf.h
>> +++ b/include/trace/events/bpf.h
>> @@ -140,7 +140,7 @@ TRACE_EVENT(bpf_map_create,
>> __entry->flags   = map->map_flags;
>> __entry->ufd = ufd;
>> ),
>> -
>> +// TODO also add map_id here
>> TP_printk("map type=%s ufd=%d key=%u val=%u max=%u flags=%x",
>>   __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB),
>>   __entry->ufd, __entry->size_key, __entry->size_value,
>> @@ -199,15 +199,18 @@ DECLARE_EVENT_CLASS(bpf_obj_map,
>> __field(u32, type)
>> __field(int, ufd)
>> __string(path, pname->name)
>> +   __field(u32, map_id)
>> ),
>>
>> TP_fast_assign(
>> __assign_str(path, pname->name);
>> __entry->type = map->map_type;
>> __entry->ufd  = ufd;
>> +   __entry->map_id = map->id;
>> ),
>>
>> -   TP_printk("map type=%s ufd=%d path=%s",
>> +   TP_printk("map id=%u type=%s ufd=%d path=%s",
>> + __entry->map_id,
>>   __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB),
>>   __entry->ufd, __get_str(path))
>>  );
>> @@ -244,6 +247,7 @@ DECLARE_EVENT_CLASS(bpf_map_keyval,
>> __dynamic_array(u8, val, map->value_size)
>> __field(bool, val_trunc)
>> __field(int, ufd)
>> +   __field(u32, map_id)
>> ),
>>
>> TP_fast_assign(
>> @@ -255,9 

Re: [iovisor-dev] [RFC PATCH 1/3] bpf: add bpf queue map

2018-07-31 Thread Yonghong Song
On Tue, Jul 31, 2018 at 1:53 PM, Mauricio Vasquez
 wrote:
> Bpf queue implements a LIFO/FIFO data containers for ebpf programs.
>
> It allows to push an element to the queue by using the update operation
> and to pop an element from the queue by using the lookup operation.
>
> A use case for this is to keep track of a pool of elements, like
> network ports in a SNAT.
>
> Signed-off-by: Mauricio Vasquez B 
> ---
>  include/linux/bpf_types.h |1
>  include/uapi/linux/bpf.h  |5 +
>  kernel/bpf/Makefile   |2
>  kernel/bpf/queuemap.c |  204 
> +
>  kernel/bpf/syscall.c  |   61 ++---
>  kernel/bpf/verifier.c |   10 ++
>  6 files changed, 265 insertions(+), 18 deletions(-)
>  create mode 100644 kernel/bpf/queuemap.c
>
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index c5700c2d5549..6c7a62f3fe43 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -58,3 +58,4 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
>  #endif
>  #endif
> +BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 0ebaaf7f3568..2c171c40eb45 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -120,6 +120,7 @@ enum bpf_map_type {
> BPF_MAP_TYPE_CPUMAP,
> BPF_MAP_TYPE_XSKMAP,
> BPF_MAP_TYPE_SOCKHASH,
> +   BPF_MAP_TYPE_QUEUE,
>  };
>
>  enum bpf_prog_type {
> @@ -255,6 +256,10 @@ enum bpf_attach_type {
>  /* Flag for stack_map, store build_id+offset instead of pointer */
>  #define BPF_F_STACK_BUILD_ID   (1U << 5)
>
> +/* Flags for queue_map, type of queue */
> +#define BPF_F_QUEUE_FIFO   (1U << 16)
> +#define BPF_F_QUEUE_LIFO   (2U << 16)
> +
>  enum bpf_stack_build_id_status {
> /* user space need an empty entry to identify end of a trace */
> BPF_STACK_BUILD_ID_EMPTY = 0,
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index f27f5496d6fe..30f02ef66635 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -2,7 +2,7 @@
>  obj-y := core.o
>
>  obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
> -obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
> bpf_lru_list.o lpm_trie.o map_in_map.o
> +obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
> bpf_lru_list.o lpm_trie.o map_in_map.o queuemap.o
>  obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>  obj-$(CONFIG_BPF_SYSCALL) += btf.o
>  ifeq ($(CONFIG_NET),y)
> diff --git a/kernel/bpf/queuemap.c b/kernel/bpf/queuemap.c
> new file mode 100644
> index ..af179562c9b7
> --- /dev/null
> +++ b/kernel/bpf/queuemap.c
> @@ -0,0 +1,204 @@
> +/* Copyright (c) 2018 Politecnico di Torino
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + */

It is totally up to you, but you could replace the above licensing portion from
"This program" to 'more details" with
/* SPDX-License-Identifier: GPL-2.0 */

> +#include 
> +#include 
> +#include 
> +
> +#define QUEUE_CREATE_FLAG_MASK \
> +   (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY | \
> +BPF_F_QUEUE_FIFO | BPF_F_QUEUE_LIFO)
> +
> +enum queue_type {
> +   QUEUE_FIFO = (BPF_F_QUEUE_FIFO >> 16),
> +   QUEUE_LIFO = (BPF_F_QUEUE_LIFO >> 16),
> +};
> +
> +struct bpf_queue {
> +   struct bpf_map map;
> +   struct list_head head;
> +   enum queue_type type;
> +   raw_spinlock_t lock;
> +   atomic_t count;
> +};
> +
> +struct queue_node {
> +   struct list_head list;
> +   struct rcu_head rcu;
> +   char element[0] __aligned(8);
> +};
> +
> +/* Called from syscall */
> +static int queue_map_alloc_check(union bpf_attr *attr)
> +{
> +   /* check sanity of attributes */
> +   if (attr->max_entries == 0 || attr->key_size != 0 ||
> +   attr->value_size == 0 || attr->map_flags & 
> ~QUEUE_CREATE_FLAG_MASK)
> +   return -EINVAL;
> +
> +   if ((attr->map_flags >> 16) != QUEUE_FIFO &&
> +   (attr->map_flags >> 16) != QUEUE_LIFO) {
> +   return -EINVAL;
> +   }
> +
> +   if (attr->value_size > KMALLOC_MAX_SIZE)
> +   /* if value_size is bigger, the user space won't be able to
> +* access the elements.
> +*/
> +   return -E2BIG;
> +
> +   return 0;
> +}
> +
> +static struct bpf_map *queue_map_alloc(union bpf_attr *attr)
> +{
> +   struct bpf_queue *queue;
> +
> +   queue = 

Re: [iovisor-dev] Helper functions available for XDP?

2018-07-24 Thread Yonghong Song
On Tue, Jul 24, 2018 at 9:19 AM, Andrew Wang  wrote:
> Then in this case is it possible to replace TCP checksum on ingress with
> XDP? How do I test if the skb is writable?

There is no skb in XDP. The helper bpf_l4_csum_replace cannot be used.
You need to use other csum functions and can directly modify the packet data.

>
> On Tue, Jul 24, 2018 at 12:04 PM, Yonghong Song  wrote:
>>
>> On Tue, Jul 24, 2018 at 1:54 AM, François  wrote:
>> > On Mon, Jul 23, 2018 at 10:08:08PM +0200, Paul Chaignon wrote:
>> >> On Mon, Jul 23, 2018 at 02:21:05PM -0400, Andrew Wang wrote:
>> >> > Hi
>> >> >
>> >> > I am writing a bpf program for packet processing and have loaded my
>> >> > ingress
>> >> > function at BPF.XDP.
>> >> >
>> >> > I'm updating the destination IPv6 address and want to update the TCP
>> >> > checksum, but when I try to call the helper functions "bpf_csum_diff"
>> >> > or
>> >> > "bpf_l4_csum_replace", I get a "unknown func ". Are
>> >> > these
>> >> > functions not available for XDP program types?
>> >>
>> >> XDP programs cannot use bpf_l4_csum_replace, but they can use
>> >> bpf_csum_diff since commit 205c380 ("bpf: add csum_diff helper to xdp
>> >> as
>> >> well") which landed in v4.16-rc1.
>> >
>> > I was thinking maybe those helpers were not available because no skb
>> > was construct. But after checking, the helper you're mentionning is in
>> > the tc_cls. Also, the csum_diff helper lies in the same switch
>> > statement as l4_csum_replace.
>> >
>> > Can you elaborate on why one can be used and not the other?
>>
>> bpf_csum_diff only accesses packet data and can be used in both cls_act
>> and xdp.
>> bpf_l4_csum_replace accesses skb data structure, e.g., it needs to
>> replace the csum
>> and hence needs to test whether skb is writable. It also different
>> calculation based on
>> the existing checksum state (skb->ip_summed), so bpf_l4_csum_replace
>> cannot be
>> used in xdp.
>>
>> >
>> > Thanks!
>> >
>> >
>> >
>>
>> 
>>
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1399): https://lists.iovisor.org/g/iovisor-dev/message/1399
Mute This Topic: https://lists.iovisor.org/mt/23795612/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Fixing stack trace function names by argument introspection

2018-07-24 Thread Yonghong Song
On Tue, Jul 24, 2018 at 9:06 AM, ma...@kevac.org  wrote:
>
>
> On Mon, Jul 23, 2018 at 7:48 AM, Y Song  wrote:
>>
>>
>> We did not have such an example in BCC. In Facebook, we have a bpf
>> program to catch
>> stack traces for python programs. It is very similar to what you want
>> to achieve in the above.
>
>
> Can you share it with me? Maybe I can use it as an example.

It is a piece of complicated software, let me see how much I can do.

>
>>
>> Basically, you need to walk the stack by yourself. Since verifier do
>> not support unbounded loops,
>> you need to have a fully-unrollable loop with progma unroll.
>
>
> I have never used pragma unroll before, but I understand what it is supposed
> to do.
> Quick search gives me usages for CUDA and several little known examples for
> gcc/clang.
> Are you talking about them?
> https://stackoverflow.com/questions/4071690/tell-gcc-to-specifically-unroll-a-loop

Yes, it is `#pragma unroll`.

>
>>
>> During each loop iteration, you can access the frame pointer, you need
>> some mechanism to
>> get the real function name based on that level frame pointer and then
>> you move on
>> to the next. In bpf program, you can access current task structure,
>> which contains some
>> data related to TLS which could be used by the bpf program.
>>
>
> As far as I understand it would work if my program is built with frame
> pointers. In that case going throigh stack trace shold be straightforward.
> Never done it before though :-)

Not 100% whether just frame pointers are enough for you or not.
Remeber, on the stack, typically only frame pointer (if available),
function return address, spills, and #7 and later arguments.
It is very likely the case that `fn_name` in `execute_fn(fn_name)` may
not on the stack and you need to find a different way to access it.

> But usually programs are build omitting frame pointers. In that case you
> need additional info from DWARF and code is much more complex. Right?
> Are you suggesting implementing all this?

We are talking about stack walking inside bpf programs, dwarf option is
certainly out of question. In that case, you may need to use
perf record dwarf...

>
> Sorry for newbie questions :-)

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1398): https://lists.iovisor.org/g/iovisor-dev/message/1398
Mute This Topic: https://lists.iovisor.org/mt/23787733/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Helper functions available for XDP?

2018-07-24 Thread Yonghong Song
On Tue, Jul 24, 2018 at 1:54 AM, François  wrote:
> On Mon, Jul 23, 2018 at 10:08:08PM +0200, Paul Chaignon wrote:
>> On Mon, Jul 23, 2018 at 02:21:05PM -0400, Andrew Wang wrote:
>> > Hi
>> >
>> > I am writing a bpf program for packet processing and have loaded my ingress
>> > function at BPF.XDP.
>> >
>> > I'm updating the destination IPv6 address and want to update the TCP
>> > checksum, but when I try to call the helper functions "bpf_csum_diff" or
>> > "bpf_l4_csum_replace", I get a "unknown func ". Are these
>> > functions not available for XDP program types?
>>
>> XDP programs cannot use bpf_l4_csum_replace, but they can use
>> bpf_csum_diff since commit 205c380 ("bpf: add csum_diff helper to xdp as
>> well") which landed in v4.16-rc1.
>
> I was thinking maybe those helpers were not available because no skb
> was construct. But after checking, the helper you're mentionning is in
> the tc_cls. Also, the csum_diff helper lies in the same switch
> statement as l4_csum_replace.
>
> Can you elaborate on why one can be used and not the other?

bpf_csum_diff only accesses packet data and can be used in both cls_act and xdp.
bpf_l4_csum_replace accesses skb data structure, e.g., it needs to
replace the csum
and hence needs to test whether skb is writable. It also different
calculation based on
the existing checksum state (skb->ip_summed), so bpf_l4_csum_replace cannot be
used in xdp.

>
> Thanks!
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1395): https://lists.iovisor.org/g/iovisor-dev/message/1395
Mute This Topic: https://lists.iovisor.org/mt/23795612/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Accessing pinned eBPF map from the kernel

2018-07-19 Thread Yonghong Song
On Thu, Jul 19, 2018 at 2:49 PM, Hyunseok  wrote:
> Thanks for your reply.
>
> BPF_TABLE("extern") seems to work only if the eBPF program is loaded by the
> same userspace process which creates the map, like in this example.

No. The example uses a locally-created map to illustrate the process,
but a pinned map
works in a similar way.

The user space application:
   . using bpf_obj_get to get a FD for the pinned map.
   . create TableDesc with the FD.
   . Add TableDesc to the local_ts.
   . Create a BPF object with the local_ts.
   . ...

In the bpf program itself, declare the pinned map with "extern" type.

>
> But, what if a map is created (and pinned) by process A, and an eBPF program
> is loaded by another process B?
> Is there any way for the eBPF program to access the pinned map?

Just follow the above process, it should work. Facebook has used the
same mechanism for a while to access externally-pinned maps.

>
> bpf_obj_get cannot be used by eBPF programs as it is a userspace API.

bpf_obj_get() intends to be used in userspace, not kernel.

>
>
> On Thu, Jul 19, 2018 at 3:33 PM, Y Song  wrote:
>>
>> The following is an example in C++ to import an external map to BPF
>> modules.
>> https://github.com/iovisor/bcc/blob/master/examples/cpp/UseExternalMap.cc
>> You can use libbpf function `bpf_obj_get` to get a map fd in the above
>> example.
>>
>> On Wed, Jul 18, 2018 at 11:48 AM,   wrote:
>> > Hi,
>> >
>> > I have an eBPF map created and pinned by a userspace process.
>> >
>> > Now I would like several eBPF programs to access this pinned eBPF map.
>> >
>> > Is there any bcc APIs that can be used?
>> >
>> > BPF_TABLE(), etc creates a new eBPF map, not loads an existing pinned
>> > map.
>> >
>> > Thanks,
>> > -hs
>> > 
>
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1389): https://lists.iovisor.org/g/iovisor-dev/message/1389
Mute This Topic: https://lists.iovisor.org/mt/23673879/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Accessing pinned eBPF map from the kernel

2018-07-19 Thread Yonghong Song
The following is an example in C++ to import an external map to BPF modules.
https://github.com/iovisor/bcc/blob/master/examples/cpp/UseExternalMap.cc
You can use libbpf function `bpf_obj_get` to get a map fd in the above example.

On Wed, Jul 18, 2018 at 11:48 AM,   wrote:
> Hi,
>
> I have an eBPF map created and pinned by a userspace process.
>
> Now I would like several eBPF programs to access this pinned eBPF map.
>
> Is there any bcc APIs that can be used?
>
> BPF_TABLE(), etc creates a new eBPF map, not loads an existing pinned map.
>
> Thanks,
> -hs
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1386): https://lists.iovisor.org/g/iovisor-dev/message/1386
Mute This Topic: https://lists.iovisor.org/mt/23673879/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] Verifier error: variable stack access var_off

2018-07-18 Thread Yonghong Song
The kernel needs to a constant offset from the stack for write. The
corresponding kernel verifier code below:

/* stack accesses must be at a fixed offset, so that we can
 * determine what type of data were returned.
 * See check_stack_read().
 */
if (!tnum_is_const(reg->var_off)) {
char tn_buf[48];

tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
verbose(env, "variable stack access var_off=%s
off=%d size=%d",
tn_buf, off, size);
return -EACCES;
}

You can initialize the array with ' ' to workaround the issue:
struct data_t data;
uint64_t max = sizeof(data.argv);
const char *argp = NULL;
memset(, ' ', sizeof(data));
bpf_probe_read(, sizeof(argp), (void *)&__argv[0]);
uint64_t len = bpf_probe_read_str(, max, argp);
len &= 0x; // to avoid: "math between fp pointer and register errs"
bpf_trace_printk("len: %d\\n", len); // sanity check: len is indeed valid

// data.argv[len] = ' ';


On Sun, Jul 1, 2018 at 6:20 PM, Teng Qin  wrote:
> Firstly, note that bpf_probe_read_str adds an extra \0 to the read string.
> So your max should be sizeof(data.argv) - 1 instead in order for
> data.argv[len] = ' ' to work (from Verifier's perspective, logically you
> don't need that extra delimiter~)
>
> Then, Yonghong had a patch a few month ago addressing very similar issue.
> See the example in patch series
> bpf: improve verifier ARG_CONST_SIZE_OR_ZERO semantics
> Does your Kernel have those patches?
>
> However, even with all those the data.argv[len] = ' ' part still fails with
> something about stack offset not being fixed. I will try debug more to see
> how to fix that. For now, you can use a per-CPU array of size 1 for the data
> instead of allocating it on the stack. The following works for me:
> #include 
> #include 
> #include 
>
> #define ARGSIZE  128
>
> struct data_t {
> char argv[ARGSIZE];
> };
>
> BPF_PERF_OUTPUT(events);
> BPF_PERCPU_ARRAY(mem, struct data_t, 1);
> //
> // Here's what I'm trying to do. Let's say this has:
> //   __argv[0] = "ls"
> //   __argv[1] = "-l"
> // I'm trying to create a buffer with "ls -l", by doing bpf_probe_read_str()
> for
> // each element into the buffer, while keeping track of the length of
> // the previous read so I can insert a space delimiter at that offset,
> // and begin the next read after the delimiter.
> //
> int on_event(struct pt_regs *ctx,
> const char __user *filename,
> const char __user *const __user *__argv,
> const char __user *const __user *__envp)
> {
> int zero = 0;
> struct data_t* data = mem.lookup();
> if (!data)
>   return 0;
>
> uint64_t max = sizeof(data->argv) - 1;
> const char *argp = NULL;
> bpf_probe_read(, sizeof(argp), (void *)&__argv[0]);
> uint64_t len = bpf_probe_read_str(&(data->argv), max, argp);
> len &= 0x; // to avoid: "math between fp pointer and register
> errs"
> bpf_trace_printk("len: %d\\n", len); // sanity check: len is indeed
> valid
>
> data->argv[len] = ' ';
>
> events.perf_submit(ctx, data, len);
> out:
> return 0;
> }
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1385): https://lists.iovisor.org/g/iovisor-dev/message/1385
Mute This Topic: https://lists.iovisor.org/mt/22908709/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] GCOV coverage on BPF program

2018-06-19 Thread Yonghong Song
On Tue, Jun 19, 2018 at 7:54 AM, William Tu  wrote:
> Hi,
>
> I'm trying to run coverage tests on my bpf program and it compiles fails.
> I guess BPF does not support '--coverage' flags?

This will not work.
'-coverage' flag will introduce some additional global functions called
by the program, e.g., __llvm_gcov_init,  __llvm_gcov_flush, etc.
global memset functions are
called from one of these functions, global variable __llvm_gcov_ctr is
also introduced.

New clang compiler is much more tolerant on global functions in order
to support function calls in bpf bytecode.

Even if you compile successfully, you may not load the program as
the program compiled with '-coverage' will reference to some
global variables where the loader cannot resolve.

>
> root@boxes:~/ovs# clang-4.0 -Wstrict-prototypes -Wall -Wextra
> -Wno-sign-compare -Wpointer-arith -Wformat -Wformat-security
> -Wswitch-enum -Wunused-parameter -Wbad-function-cast -Wcast-align
> -Wstrict-prototypes -Wold-style-definition -Wmissing-prototypes
> -Wmissing-field-initializers -fno-strict-aliasing -Wswitch-bool
> -Wlogical-not-parentheses -Wsizeof-array-argument --coverage
> -D__NR_CPUS__=4 -O2 -Wall -Werror -emit-llvm -I./include -I./include
> -Wno-error=pointer-arith -I/include/uapi/ -c bpf/datapath.c -o - |
> llc-4.0 -march=bpf -filetype=obj -o bpf/datapath.o
> LLVM ERROR: Cannot select: 0x8b2e90: ch,glue = BPFISD::CALL 0x8b2f60,
> TargetExternalSymbol:i64'memset', Register:i64 %R1, Register:i64 %R2,
> Register:i64 %R3, 0x8b2f60:1
>   0x8b2ef8: i64 = TargetExternalSymbol'memset'
>   0x8b3168: i64 = Register %R1
>   0x8b3098: i64 = Register %R2
>   0x8b2fc8: i64 = Register %R3
>   0x8b2f60: ch,glue = CopyToReg 0x8b3030, Register:i64 %R3,
> Constant:i64<1408>, 0x8b3030:1
> 0x8b2fc8: i64 = Register %R3
> 0x8b3308: i64 = Constant<1408>
> 0x8b3030: ch,glue = CopyToReg 0x8b3100, Register:i64 %R2,
> Constant:i64<0>, 0x8b3100:1
>   0x8b3098: i64 = Register %R2
>   0x8b3238: i64 = Constant<0>
>   0x8b3100: ch,glue = CopyToReg 0x8b31d0, Register:i64 %R1, 0x73f0f8
> 0x8b3168: i64 = Register %R1
> 0x73f0f8: i64 = BPFISD::Wrapper TargetGlobalAddress:i64<[176 x
> i64]* @__llvm_gcov_ctr> 0
>   0x8b2cb8: i64 = TargetGlobalAddress<[176 x i64]* @__llvm_gcov_ctr> 0
> In function: __llvm_gcov_flush
>
> Thanks
> William
>
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#1337): https://lists.iovisor.org/g/iovisor-dev/message/1337
Mute This Topic: https://lists.iovisor.org/mt/22433570/21656
Group Owner: iovisor-dev+ow...@lists.iovisor.org
Unsubscribe: https://lists.iovisor.org/g/iovisor-dev/unsub  
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [iovisor-dev] [oss-drivers] Re: [PATCH RFC 0/4] Initial 32-bit eBPF encoding support

2017-09-22 Thread Yonghong Song via iovisor-dev



On 9/22/17 9:24 AM, Jakub Kicinski wrote:

On Thu, 21 Sep 2017 11:56:55 -0700, Alexei Starovoitov wrote:

On Wed, Sep 20, 2017 at 12:20:40AM +0100, Jiong Wang via iovisor-dev wrote:

On 18/09/2017 22:29, Daniel Borkmann wrote:

On 09/18/2017 10:47 PM, Jiong Wang wrote:

Hi,

    Currently, LLVM eBPF backend always generate code in 64-bit mode,
this may
cause troubles when JITing to 32-bit targets.

    For example, it is quite common for XDP eBPF program to access
some packet
fields through base + offset that the default eBPF will generate
BPF_ALU64 for
the address formation, later when JITing to 32-bit hardware,
BPF_ALU64 needs
to be expanded into 32 bit ALU sequences even though the address
space is
32-bit that the high bits is not significant.

    While a complete 32-bit mode implemention may need an new ABI
(something like
-target-abi=ilp32), this patch set first add some initial code so we
could
construct 32-bit eBPF tests through hand-written assembly.

    A new 32-bit register set is introduced, its name is with "w"
prefix and LLVM
assembler will encode statements like "w1 += w2" into the following
8-bit code
field:

  BPF_ADD | BPF_X | BPF_ALU

BPF_ALU will be used instead of BPF_ALU64.

    NOTE, currently you can only use "w" register with ALU
statements, not with
others like branches etc as they don't have different encoding for
32-bit
target.


Great to see work in this direction! Can we also enable to use / emit
all the 32bit BPF_ALU instructions whenever possible for the currently
available bpf targets while at it (which only use BPF_ALU64 right now)?


Hi Daniel,

    Thanks for the feedback.

    I think we could also enable the use of all the 32bit BPF_ALU under
currently
available bpf targets.  As we now have 32bit register set support, we could
make
i32 type as legal type to prevent it be promoted into i64, then hook it up
with i32
ALU patterns, will look into this.


I don't think we need to gate 32bit alu generation with a flag.
Though interpreter and JITs support 32-bit since day one, the verifier
never seen such programs before, so some valid programs may get
rejected. After some time passes and we're sure that all progs
still work fine when they're optimized with 32-bit alu, we can flip
the switch in llvm and make it default.


Thinking about next steps - do we expect the 32b operations to clear the
upper halves of the registers?  The interpreter does it, and so does
x86.  I don't think we can load 32bit-only programs on 64bit hosts, so
we would need some form of data flow analysis in the kernel to prune
the zeroing for 32bit offload targets.  Is that correct?


Could you contrive an example to show the problem? If I understand 
correctly, you most worried that some natural sign extension is gone

with "clearing the upper 32-bit register" and such clearing may make
some operation, esp. memory operation not correct in 64-bit machine?







___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev