Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry

2017-12-27 Thread Masami Hiramatsu
On Wed, 27 Dec 2017 19:45:42 -0800
Alexei Starovoitov  wrote:

> On 12/27/17 6:34 PM, Masami Hiramatsu wrote:
> > On Wed, 27 Dec 2017 14:46:24 -0800
> > Alexei Starovoitov  wrote:
> >
> >> On 12/26/17 9:56 PM, Masami Hiramatsu wrote:
> >>> On Tue, 26 Dec 2017 17:57:32 -0800
> >>> Alexei Starovoitov  wrote:
> >>>
>  On Tue, Dec 26, 2017 at 04:46:59PM +0900, Masami Hiramatsu wrote:
> > Check whether the error injectable event is on function entry or not.
> > Currently it checks whether the event is an ftrace-based kprobe,
> > but that is wrong. It should check whether the event is on the entry
> > of the target function. Since error injection will override a function
> > to just return with a modified return value, that operation must
> > be done before the target function starts making its stack frame.
> >
> > As a side effect, bpf error injection no longer needs to depend on
> > the function tracer. It can work with sw-breakpoint based kprobe
> > events too.
> >
> > Signed-off-by: Masami Hiramatsu 
> > ---
> >  kernel/trace/Kconfig|2 --
> >  kernel/trace/bpf_trace.c|6 +++---
> >  kernel/trace/trace_kprobe.c |8 +---
> >  kernel/trace/trace_probe.h  |   12 ++--
> >  4 files changed, 14 insertions(+), 14 deletions(-)
> >
> > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > index ae3a2d519e50..6400e1bf97c5 100644
> > --- a/kernel/trace/Kconfig
> > +++ b/kernel/trace/Kconfig
> > @@ -533,9 +533,7 @@ config FUNCTION_PROFILER
> >  config BPF_KPROBE_OVERRIDE
> > bool "Enable BPF programs to override a kprobed function"
> > depends on BPF_EVENTS
> > -   depends on KPROBES_ON_FTRACE
> > depends on HAVE_KPROBE_OVERRIDE
> > -   depends on DYNAMIC_FTRACE_WITH_REGS
> > default n
> > help
> >  Allows BPF to override the execution of a probed function and
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index f6d2327ecb59..d663660f8392 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -800,11 +800,11 @@ int perf_event_attach_bpf_prog(struct perf_event *event,
> > 	int ret = -EEXIST;
> >
> > 	/*
> > -	 * Kprobe override only works for ftrace based kprobes, and only if they
> > -	 * are on the opt-in list.
> > +	 * Kprobe override only works if they are on the function entry,
> > +	 * and only if they are on the opt-in list.
> > 	 */
> > if (prog->kprobe_override &&
> > -   (!trace_kprobe_ftrace(event->tp_event) ||
> > +   (!trace_kprobe_on_func_entry(event->tp_event) ||
> >  !trace_kprobe_error_injectable(event->tp_event)))
> > return -EINVAL;
> >
> > diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> > index 91f4b57dab82..265e3e27e8dc 100644
> > --- a/kernel/trace/trace_kprobe.c
> > +++ b/kernel/trace/trace_kprobe.c
> > @@ -88,13 +88,15 @@ static nokprobe_inline unsigned long trace_kprobe_nhit(struct trace_kprobe *tk)
> > 	return nhit;
> >  }
> >
> > -int trace_kprobe_ftrace(struct trace_event_call *call)
> > +bool trace_kprobe_on_func_entry(struct trace_event_call *call)
> >  {
> > 	struct trace_kprobe *tk = (struct trace_kprobe *)call->data;
> > -	return kprobe_ftrace(&tk->rp.kp);
> > +
> > +	return kprobe_on_func_entry(tk->rp.kp.addr, tk->rp.kp.symbol_name,
> > +				    tk->rp.kp.offset);
> 
>  That would be nice, but did you test this?
> >>>
> > > Yes, because the jprobe, which was the only official user of modifying the
> > > execution path using kprobes, checked it the same way. (and kretprobe also does it)
> >>>
>  My understanding that kprobe will restore all regs and
>  here we need to override return ip _and_ value.
> >>>
> > > yes, no problem. kprobe restores all regs from pt_regs, including regs->ip.
> >>>
>  Could you add a patch with the test the way Josef did
>  or describe the steps to test this new mode?
> >>>
> > > Do you mean the patch below? If so, it should work without any change.
> >>>
> >>>  [PATCH v10 4/5] samples/bpf: add a test for bpf_override_return
> >>
> >> yeah. I expect bpf_override_return test to work as-is.
> >> I'm asking for the test for new functionality added by this patch.
> >> In particular kprobe on func entry without ftrace.
> >> How did you test it?
> >
> > This function is used in kretprobe and jprobe. Jprobe was the user of
> > "modifying the instruction pointer to another function" in kprobes.
> > If it doesn't work, jprobe also doesn't work; this means you can not
> > modify IP by kprobes anymore.

Re: [RFC PATCH bpf-next v2 4/4] error-injection: Support fault injection framework

2017-12-27 Thread Masami Hiramatsu
On Wed, 27 Dec 2017 19:49:28 -0800
Alexei Starovoitov  wrote:

> On 12/27/17 5:38 PM, Masami Hiramatsu wrote:
> > On Wed, 27 Dec 2017 14:49:46 -0800
> > Alexei Starovoitov  wrote:
> >
> >> On 12/27/17 12:09 AM, Masami Hiramatsu wrote:
> >>> On Tue, 26 Dec 2017 18:12:56 -0800
> >>> Alexei Starovoitov  wrote:
> >>>
>  On Tue, Dec 26, 2017 at 04:48:25PM +0900, Masami Hiramatsu wrote:
> > Support the in-kernel fault-injection framework via debugfs.
> > This allows you to inject a conditional error into a specified
> > function using debugfs interfaces.
> >
> > Signed-off-by: Masami Hiramatsu 
> > ---
> >  Documentation/fault-injection/fault-injection.txt |5 +
> >  kernel/Makefile   |1
> >  kernel/fail_function.c|  169 +
> >  lib/Kconfig.debug |   10 +
> >  4 files changed, 185 insertions(+)
> >  create mode 100644 kernel/fail_function.c
> >
> > diff --git a/Documentation/fault-injection/fault-injection.txt b/Documentation/fault-injection/fault-injection.txt
> > index 918972babcd8..6243a588dd71 100644
> > --- a/Documentation/fault-injection/fault-injection.txt
> > +++ b/Documentation/fault-injection/fault-injection.txt
> > @@ -30,6 +30,11 @@ o fail_mmc_request
> >injects MMC data errors on devices permitted by setting
> >debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request
> >
> > +o fail_function
> > +
> > +  injects error return on specific functions by setting debugfs entries
> > +  under /sys/kernel/debug/fail_function. No boot option supported.
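Since the hunk above documents only the directory, a usage example has to be guessed at; the file names and values below are an assumption based on the patch description (only the /sys/kernel/debug/fail_function path is from the patch itself), shown as unexecuted commands:

```sh
# Hypothetical sketch: force open_ctree() to fail with -ENOMEM (-12).
# Per-function file layout (inject/retval/probability) is an assumption.
cd /sys/kernel/debug/fail_function
echo open_ctree > inject            # arm error injection on the function
printf %#x -12 > open_ctree/retval  # the error value to force
echo 100 > probability              # standard fault-injection knobs
echo -1 > times                     # no limit on injection count
# ...trigger a btrfs mount, which should now fail with -ENOMEM...
echo > inject                       # disarm
```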
> 
>  I like it.
>  Could you document it a bit better?
> >>>
> >>> Yes, I will do in next series.
> >>>
>  In particular retval is configurable, but without an example no one
>  will be able to figure out how to use it.
> >>>
> > > Ah, right. BTW, as I pointed out in the cover mail, should we store the
> > > expected error value range in the injectable list? e.g.
> >>>
> >>> ALLOW_ERROR_INJECTION(open_ctree, -1, -MAX_ERRNO)
> >>>
> >>> And provide APIs to check/get it.
> >>
> >> I'm afraid such a check would be too costly.
> >> Right now we have only two functions marked, but I expect hundreds more
> >> will be added in the near future as soon as developers realize the
> >> potential of such error injection.
> >> All of the ALLOW_ERROR_INJECTION marks add 8 bytes of overhead each to .data.
> >> Multiply by 1k and we have 8k of data spent on marks.
> >> If we add max/min range marks, that doubles it for very little use.
> >> I think marking the function only is enough.
> >
> > Sorry, I don't think so.
> > Even if it takes 16 bytes more for each point, I don't think it is
> > any overhead for machines these days. Even if so, we can provide
> > a kconfig to reduce it.
> > I mean, we are living in the GB-order memory era, and it will be bigger
> > in the future. Why do we have to worry about hundreds of 16-byte memory
> > pieces? It will take a few KB, and even if we mark thousands of
> > functions, it never reaches 1MB in a GB memory pool. :)
> >
> > Of course, for many small-footprint embedded devices (like those having
> > less than 128MB memory), this feature can be an overhead. But they
> > can cut the table out via kconfig.
> 
> I still disagree on wasting 16 bytes * num_of_funcs of .data here.
> The trade-off of usability vs memory is just not worth it. Sorry.
> Please focus on testing your changes instead.

Then what happens if the user sets an invalid retval for those functions?
Even if we limit the injectable functions, it can cause a problem.

For example:

 obj = func_return_object();
 if (!obj) {
handling_error...;
 }
 obj->field = x;

In this case, obviously func_return_object() must return NULL if there is
an error, not -ENOMEM. But without the correct retval information, how would
you check that the BPF code doesn't cause trouble?
Currently it seems you are expecting only functions which return an error code.

 ret = func_return_state();
 if (ret < 0) {
handling_error...;
 }

But how can we distinguish those?

If we have the error range for each function, we can ensure what the
*correct* error code is: NULL, an errno, or any other error number. :)

At least fail_function needs this feature so it can check the
return value when setting it up.

Thank you,

-- 
Masami Hiramatsu 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry

2017-12-27 Thread Alexei Starovoitov

On 12/27/17 8:16 PM, Steven Rostedt wrote:

On Wed, 27 Dec 2017 19:45:42 -0800
Alexei Starovoitov  wrote:


I don't think that's the case. My reading of the current
trace_kprobe_ftrace() -> arch_check_ftrace_location()
is that it will not be true for the old mcount case.


In the old mcount case, you can't use ftrace to return without calling
the function. That is, no modification of the return ip, unless you
created a trampoline that could handle arbitrary stack frames, and
remove them from the stack before returning back to the function.


correct. I was saying that trace_kprobe_ftrace() won't let us do
bpf_override_return with old mcount.



As far as the rest of your arguments, it very much puzzles me that
you claim that this patch is supposed to work based on historical
reasoning whereas you did NOT test it.


I believe that Masami is saying that the modification of the IP from
kprobes has been very well tested. But I'm guessing that you still want
a test case for using kprobes in this particular instance. It's not the
implementation of modifying the IP that you are worried about, but the
implementation of BPF using it in this case. Right?


exactly. No doubt that old code works.
But it doesn't mean that bpf_override_return() will continue to
work in kprobes that are not ftrace based.
I suspect Josef's existing test case will cover this situation.
Probably only special .config is needed to disable ftrace, so
"kprobe on entry but not ftrace" check will kick in.
But I didn't get an impression that this situation was tested.
Instead I see only logical reasoning that it's _supposed_ to work.
That's not enough.
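The "special .config" mentioned above is not spelled out anywhere in the thread; a plausible guess at the relevant fragment (unverified, option names only as a sketch) would simply build kprobes and BPF events without the function tracer, so the non-ftrace path is the one exercised:

```
# CONFIG_FUNCTION_TRACER is not set
# CONFIG_KPROBES_ON_FTRACE is not set
CONFIG_KPROBES=y
CONFIG_BPF_EVENTS=y
CONFIG_BPF_KPROBE_OVERRIDE=y
```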



Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry

2017-12-27 Thread Steven Rostedt
On Wed, 27 Dec 2017 19:45:42 -0800
Alexei Starovoitov  wrote:

> I don't think that's the case. My reading of the current
> trace_kprobe_ftrace() -> arch_check_ftrace_location()
> is that it will not be true for the old mcount case.

In the old mcount case, you can't use ftrace to return without calling
the function. That is, no modification of the return ip, unless you
created a trampoline that could handle arbitrary stack frames, and
remove them from the stack before returning back to the function.

> 
> As far as the rest of your arguments, it very much puzzles me that
> you claim that this patch is supposed to work based on historical
> reasoning whereas you did NOT test it.

I believe that Masami is saying that the modification of the IP from
kprobes has been very well tested. But I'm guessing that you still want
a test case for using kprobes in this particular instance. It's not the
implementation of modifying the IP that you are worried about, but the
implementation of BPF using it in this case. Right?

-- Steve


Re: [RFC PATCH bpf-next v2 4/4] error-injection: Support fault injection framework

2017-12-27 Thread Alexei Starovoitov

On 12/27/17 5:38 PM, Masami Hiramatsu wrote:

On Wed, 27 Dec 2017 14:49:46 -0800
Alexei Starovoitov  wrote:


On 12/27/17 12:09 AM, Masami Hiramatsu wrote:

On Tue, 26 Dec 2017 18:12:56 -0800
Alexei Starovoitov  wrote:


On Tue, Dec 26, 2017 at 04:48:25PM +0900, Masami Hiramatsu wrote:

Support the in-kernel fault-injection framework via debugfs.
This allows you to inject a conditional error into a specified
function using debugfs interfaces.

Signed-off-by: Masami Hiramatsu 
---
 Documentation/fault-injection/fault-injection.txt |5 +
 kernel/Makefile   |1
 kernel/fail_function.c|  169 +
 lib/Kconfig.debug |   10 +
 4 files changed, 185 insertions(+)
 create mode 100644 kernel/fail_function.c

diff --git a/Documentation/fault-injection/fault-injection.txt b/Documentation/fault-injection/fault-injection.txt
index 918972babcd8..6243a588dd71 100644
--- a/Documentation/fault-injection/fault-injection.txt
+++ b/Documentation/fault-injection/fault-injection.txt
@@ -30,6 +30,11 @@ o fail_mmc_request
   injects MMC data errors on devices permitted by setting
   debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request

+o fail_function
+
+  injects error return on specific functions by setting debugfs entries
+  under /sys/kernel/debug/fail_function. No boot option supported.


I like it.
Could you document it a bit better?


Yes, I will do in next series.


In particular retval is configurable, but without an example no one
will be able to figure out how to use it.


Ah, right. BTW, as I pointed out in the cover mail, should we store the
expected error value range in the injectable list? e.g.

ALLOW_ERROR_INJECTION(open_ctree, -1, -MAX_ERRNO)

And provide APIs to check/get it.


I'm afraid such a check would be too costly.
Right now we have only two functions marked, but I expect hundreds more
will be added in the near future as soon as developers realize the
potential of such error injection.
All of the ALLOW_ERROR_INJECTION marks add 8 bytes of overhead each to .data.
Multiply by 1k and we have 8k of data spent on marks.
If we add max/min range marks, that doubles it for very little use.
I think marking the function only is enough.


Sorry, I don't think so.
Even if it takes 16 bytes more for each point, I don't think it is
any overhead for machines these days. Even if so, we can provide
a kconfig to reduce it.
I mean, we are living in the GB-order memory era, and it will be bigger
in the future. Why do we have to worry about hundreds of 16-byte memory
pieces? It will take a few KB, and even if we mark thousands of
functions, it never reaches 1MB in a GB memory pool. :)

Of course, for many small-footprint embedded devices (like those having
less than 128MB memory), this feature can be an overhead. But they
can cut the table out via kconfig.


I still disagree on wasting 16 bytes * num_of_funcs of .data here.
The trade-off of usability vs memory is just not worth it. Sorry.
Please focus on testing your changes instead.




Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry

2017-12-27 Thread Alexei Starovoitov

On 12/27/17 6:34 PM, Masami Hiramatsu wrote:

On Wed, 27 Dec 2017 14:46:24 -0800
Alexei Starovoitov  wrote:


On 12/26/17 9:56 PM, Masami Hiramatsu wrote:

On Tue, 26 Dec 2017 17:57:32 -0800
Alexei Starovoitov  wrote:


On Tue, Dec 26, 2017 at 04:46:59PM +0900, Masami Hiramatsu wrote:

Check whether the error injectable event is on function entry or not.
Currently it checks whether the event is an ftrace-based kprobe,
but that is wrong. It should check whether the event is on the entry
of the target function. Since error injection will override a function
to just return with a modified return value, that operation must
be done before the target function starts making its stack frame.

As a side effect, bpf error injection no longer needs to depend on
the function tracer. It can work with sw-breakpoint based kprobe
events too.

Signed-off-by: Masami Hiramatsu 
---
 kernel/trace/Kconfig|2 --
 kernel/trace/bpf_trace.c|6 +++---
 kernel/trace/trace_kprobe.c |8 +---
 kernel/trace/trace_probe.h  |   12 ++--
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index ae3a2d519e50..6400e1bf97c5 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -533,9 +533,7 @@ config FUNCTION_PROFILER
 config BPF_KPROBE_OVERRIDE
bool "Enable BPF programs to override a kprobed function"
depends on BPF_EVENTS
-   depends on KPROBES_ON_FTRACE
depends on HAVE_KPROBE_OVERRIDE
-   depends on DYNAMIC_FTRACE_WITH_REGS
default n
help
 Allows BPF to override the execution of a probed function and
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f6d2327ecb59..d663660f8392 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -800,11 +800,11 @@ int perf_event_attach_bpf_prog(struct perf_event *event,
int ret = -EEXIST;

/*
-* Kprobe override only works for ftrace based kprobes, and only if they
-* are on the opt-in list.
+* Kprobe override only works if they are on the function entry,
+* and only if they are on the opt-in list.
 */
if (prog->kprobe_override &&
-   (!trace_kprobe_ftrace(event->tp_event) ||
+   (!trace_kprobe_on_func_entry(event->tp_event) ||
 !trace_kprobe_error_injectable(event->tp_event)))
return -EINVAL;

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 91f4b57dab82..265e3e27e8dc 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -88,13 +88,15 @@ static nokprobe_inline unsigned long trace_kprobe_nhit(struct trace_kprobe *tk)
	return nhit;
 }

-int trace_kprobe_ftrace(struct trace_event_call *call)
+bool trace_kprobe_on_func_entry(struct trace_event_call *call)
 {
	struct trace_kprobe *tk = (struct trace_kprobe *)call->data;
-	return kprobe_ftrace(&tk->rp.kp);
+
+	return kprobe_on_func_entry(tk->rp.kp.addr, tk->rp.kp.symbol_name,
+				    tk->rp.kp.offset);


That would be nice, but did you test this?


Yes, because the jprobe, which was the only official user of modifying the
execution path using kprobes, checked it the same way. (and kretprobe also does it)


My understanding that kprobe will restore all regs and
here we need to override return ip _and_ value.


yes, no problem. kprobe restores all regs from pt_regs, including regs->ip.


Could you add a patch with the test the way Josef did
or describe the steps to test this new mode?


Do you mean the patch below? If so, it should work without any change.

 [PATCH v10 4/5] samples/bpf: add a test for bpf_override_return


yeah. I expect bpf_override_return test to work as-is.
I'm asking for the test for new functionality added by this patch.
In particular kprobe on func entry without ftrace.
How did you test it?


This function is used in kretprobe and jprobe. Jprobe was the user of
"modifying the instruction pointer to another function" in kprobes.
If it doesn't work, jprobe also doesn't work; this means you can not
modify IP by kprobes anymore.
Anyway, until linux-4.13, that was well tested by the kprobe smoke test.


and how I can repeat the test?
I'm still not sure that it works correctly.


That works correctly because it checks that the given address is on the entry
point (the first instruction) of a function, using kallsyms.

The reason why I made another flag for ftrace is that there are two modes
of ftrace dynamic instrumentation: fentry and mcount.
With the new fentry mode, ftrace is put on the first instruction
of the function, so it will work as you expected.
With the traditional gcc mcount, ftrace is called after the call
frame for _mcount() has been made. This means that if you modify the ip,
it will not work, or will cause trouble, because the _mcount call frame
is still on the stack.

So the current ftrace-based check does not always work; it depends on the case.
Of course, in most 

Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry

2017-12-27 Thread Masami Hiramatsu
On Wed, 27 Dec 2017 14:46:24 -0800
Alexei Starovoitov  wrote:

> On 12/26/17 9:56 PM, Masami Hiramatsu wrote:
> > On Tue, 26 Dec 2017 17:57:32 -0800
> > Alexei Starovoitov  wrote:
> >
> >> On Tue, Dec 26, 2017 at 04:46:59PM +0900, Masami Hiramatsu wrote:
> >>> Check whether the error injectable event is on function entry or not.
> >>> Currently it checks whether the event is an ftrace-based kprobe,
> >>> but that is wrong. It should check whether the event is on the entry
> >>> of the target function. Since error injection will override a function
> >>> to just return with a modified return value, that operation must
> >>> be done before the target function starts making its stack frame.
> >>>
> >>> As a side effect, bpf error injection no longer needs to depend on
> >>> the function tracer. It can work with sw-breakpoint based kprobe
> >>> events too.
> >>>
> >>> Signed-off-by: Masami Hiramatsu 
> >>> ---
> >>>  kernel/trace/Kconfig|2 --
> >>>  kernel/trace/bpf_trace.c|6 +++---
> >>>  kernel/trace/trace_kprobe.c |8 +---
> >>>  kernel/trace/trace_probe.h  |   12 ++--
> >>>  4 files changed, 14 insertions(+), 14 deletions(-)
> >>>
> >>> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> >>> index ae3a2d519e50..6400e1bf97c5 100644
> >>> --- a/kernel/trace/Kconfig
> >>> +++ b/kernel/trace/Kconfig
> >>> @@ -533,9 +533,7 @@ config FUNCTION_PROFILER
> >>>  config BPF_KPROBE_OVERRIDE
> >>>   bool "Enable BPF programs to override a kprobed function"
> >>>   depends on BPF_EVENTS
> >>> - depends on KPROBES_ON_FTRACE
> >>>   depends on HAVE_KPROBE_OVERRIDE
> >>> - depends on DYNAMIC_FTRACE_WITH_REGS
> >>>   default n
> >>>   help
> >>>Allows BPF to override the execution of a probed function and
> >>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> >>> index f6d2327ecb59..d663660f8392 100644
> >>> --- a/kernel/trace/bpf_trace.c
> >>> +++ b/kernel/trace/bpf_trace.c
> >>> @@ -800,11 +800,11 @@ int perf_event_attach_bpf_prog(struct perf_event *event,
> >>>   int ret = -EEXIST;
> >>>
> >>>   /*
> >>> -  * Kprobe override only works for ftrace based kprobes, and only if they
> >>> -  * are on the opt-in list.
> >>> +  * Kprobe override only works if they are on the function entry,
> >>> +  * and only if they are on the opt-in list.
> >>>*/
> >>>   if (prog->kprobe_override &&
> >>> - (!trace_kprobe_ftrace(event->tp_event) ||
> >>> + (!trace_kprobe_on_func_entry(event->tp_event) ||
> >>>!trace_kprobe_error_injectable(event->tp_event)))
> >>>   return -EINVAL;
> >>>
> >>> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> >>> index 91f4b57dab82..265e3e27e8dc 100644
> >>> --- a/kernel/trace/trace_kprobe.c
> >>> +++ b/kernel/trace/trace_kprobe.c
> >>> @@ -88,13 +88,15 @@ static nokprobe_inline unsigned long trace_kprobe_nhit(struct trace_kprobe *tk)
> >>>   return nhit;
> >>>  }
> >>>
> >>> -int trace_kprobe_ftrace(struct trace_event_call *call)
> >>> +bool trace_kprobe_on_func_entry(struct trace_event_call *call)
> >>>  {
> >>>   struct trace_kprobe *tk = (struct trace_kprobe *)call->data;
> >>> - return kprobe_ftrace(&tk->rp.kp);
> >>> +
> >>> + return kprobe_on_func_entry(tk->rp.kp.addr, tk->rp.kp.symbol_name,
> >>> + tk->rp.kp.offset);
> >>
> >> That would be nice, but did you test this?
> >
> > Yes, because the jprobe, which was the only official user of modifying the
> > execution path using kprobes, checked it the same way. (and kretprobe also does it)
> >
> >> My understanding that kprobe will restore all regs and
> >> here we need to override return ip _and_ value.
> >
> > yes, no problem. kprobe restores all regs from pt_regs, including regs->ip.
> >
> >> Could you add a patch with the test the way Josef did
> >> or describe the steps to test this new mode?
> >
> > Do you mean the patch below? If so, it should work without any change.
> >
> >  [PATCH v10 4/5] samples/bpf: add a test for bpf_override_return
> 
> yeah. I expect bpf_override_return test to work as-is.
> I'm asking for the test for new functionality added by this patch.
> In particular kprobe on func entry without ftrace.
> How did you test it?

This function is used in kretprobe and jprobe. Jprobe was the user of
"modifying the instruction pointer to another function" in kprobes.
If it doesn't work, jprobe also doesn't work; this means you can not
modify IP by kprobes anymore.
Anyway, until linux-4.13, that was well tested by the kprobe smoke test.

> and how I can repeat the test?
> I'm still not sure that it works correctly.

That works correctly because it checks that the given address is on the entry
point (the first instruction) of a function, using kallsyms.

The reason why I made another flag for ftrace was, there are 2 modes
for ftrace dynamic instrumentation, fentry and mcount.
With new fentry mode, ftrace will be put on the first instruction
of the function, so it will work as you 

Re: Hand Patching a BTRFS Superblock?

2017-12-27 Thread Qu Wenruo


On 2017-12-28 09:46, Stirling Westrup wrote:
> Here's my situation: I have a network file server containing a 12TB
> BTRFS spread out over four devices (sda-sdd) which I am trying to
> recover. I do have a backup, but it's about 3 months old, and while I
> could certainly rebuild everything from that if I really had to, I
> would far rather not have to rerip my latest DVDs. So, I am willing to
> experiment if it might save me a few hundred hours of reconstruction.
> I don't currently have another 12 TB of space anywhere for making a
> scratch copy.
> 
> A few days ago sdb developed hard errors and I can no longer mount the
> filesystem. sdb is no longer even recognized as a valid btrfs drive.
> However, when I ran ddrescue over the drive I managed to make a clone
> (sdf) which contains all but 12K of the original drive. However, those
> missing 12K are all in the various superblocks, so the cloned drive is
> still unreadable.
> 
> In the hopes that I was only missing a few bits of the superblocks, I
> started out by dd-ing the first 4M of sdd into sdf in the hopes that
> ddrescue would overwrite much of the superblocks, and the final bits
> from sdd would make things usable.
> 
> No such luck. I now have a drive sdf which claims to be identical to
> sdd but which is a clone of sdb. In case it matters, sda and sdc are
> each 4TB while sdb and sdd are each 2TB drives; sde is my boot drive
> and sdf is a 2TB clone of sdb.
> 
> What I need to do is to somehow patch sdf's primary superblock so it
> contains the correct device number and UUID_SUB for sdb, so that I can
> attempt some sort of recovery. Right now my linux is (understandably)
> quite confused by the situation:

Did you try "btrfs rescue super-recover"?

Remember to use the devel branch from git, as there is a small bug
that prevents it from reporting the correct result.

super-recover will try to use the backup superblock to recover the
primary one.

Thanks,
Qu

> 
> videon:~ # uname -a
> Linux videon 4.4.103-18.41-default #1 SMP Wed Dec 13 14:06:33 UTC 2017
> (f66c68c) x86_64 x86_64 x86_64 GNU/Linux
> 
> videon:~ # btrfs --version
> btrfs-progs v4.5.3+20160729
> 
> videon:~ # btrfs fi show
> Label: 'Storage'  uuid: 33d2890d-f07d-4ba8-b1fc-7b4f14463b1f
> Total devices 4 FS bytes used 10.69TiB
> devid1 size 1.82TiB used 1.82TiB path /dev/sdd
> devid3 size 3.64TiB used 3.54TiB path /dev/sdc
> devid4 size 3.64TiB used 3.54TiB path /dev/sda
> *** Some devices missing
> 
> Any suggestions on how to proceed would be appreciated.
> 





Hand Patching a BTRFS Superblock?

2017-12-27 Thread Stirling Westrup
Here's my situation: I have a network file server containing a 12TB
BTRFS spread out over four devices (sda-sdd) which I am trying to
recover. I do have a backup, but it's about 3 months old, and while I
could certainly rebuild everything from that if I really had to, I
would far rather not have to rerip my latest DVDs. So, I am willing to
experiment if it might save me a few hundred hours of reconstruction.
I don't currently have another 12 TB of space anywhere for making a
scratch copy.

A few days ago sdb developed hard errors and I can no longer mount the
filesystem. sdb is no longer even recognized as a valid btrfs drive.
However, when I ran ddrescue over the drive I managed to make a clone
(sdf) which contains all but 12K of the original drive. However, those
missing 12K are all in the various superblocks, so the cloned drive is
still unreadable.

In the hopes that I was only missing a few bits of the superblocks, I
started out by dd-ing the first 4M of sdd into sdf in the hopes that
ddrescue would overwrite much of the superblocks, and the final bits
from sdd would make things usable.

No such luck. I now have a drive sdf which claims to be identical to
sdd but which is a clone of sdb. In case it matters, sda and sdc are
each 4TB while sdb and sdd are each 2TB drives; sde is my boot drive
and sdf is a 2TB clone of sdb.

What I need to do is to somehow patch sdf's primary superblock so it
contains the correct device number and UUID_SUB for sdb, so that I can
attempt some sort of recovery. Right now my linux is (understandably)
quite confused by the situation:

videon:~ # uname -a
Linux videon 4.4.103-18.41-default #1 SMP Wed Dec 13 14:06:33 UTC 2017
(f66c68c) x86_64 x86_64 x86_64 GNU/Linux

videon:~ # btrfs --version
btrfs-progs v4.5.3+20160729

videon:~ # btrfs fi show
Label: 'Storage'  uuid: 33d2890d-f07d-4ba8-b1fc-7b4f14463b1f
Total devices 4 FS bytes used 10.69TiB
devid1 size 1.82TiB used 1.82TiB path /dev/sdd
devid3 size 3.64TiB used 3.54TiB path /dev/sdc
devid4 size 3.64TiB used 3.54TiB path /dev/sda
*** Some devices missing

Any suggestions on how to proceed would be appreciated.

-- 
Stirling Westrup
Programmer, Entrepreneur.


Re: [RFC PATCH bpf-next v2 4/4] error-injection: Support fault injection framework

2017-12-27 Thread Masami Hiramatsu
On Wed, 27 Dec 2017 14:49:46 -0800
Alexei Starovoitov  wrote:

> On 12/27/17 12:09 AM, Masami Hiramatsu wrote:
> > On Tue, 26 Dec 2017 18:12:56 -0800
> > Alexei Starovoitov  wrote:
> >
> >> On Tue, Dec 26, 2017 at 04:48:25PM +0900, Masami Hiramatsu wrote:
> >>> Support the in-kernel fault-injection framework via debugfs.
> >>> This allows you to inject a conditional error into a specified
> >>> function using debugfs interfaces.
> >>>
> >>> Signed-off-by: Masami Hiramatsu 
> >>> ---
> >>>  Documentation/fault-injection/fault-injection.txt |5 +
> >>>  kernel/Makefile   |1
> >>>  kernel/fail_function.c|  169 +
> >>>  lib/Kconfig.debug |   10 +
> >>>  4 files changed, 185 insertions(+)
> >>>  create mode 100644 kernel/fail_function.c
> >>>
> >>> diff --git a/Documentation/fault-injection/fault-injection.txt b/Documentation/fault-injection/fault-injection.txt
> >>> index 918972babcd8..6243a588dd71 100644
> >>> --- a/Documentation/fault-injection/fault-injection.txt
> >>> +++ b/Documentation/fault-injection/fault-injection.txt
> >>> @@ -30,6 +30,11 @@ o fail_mmc_request
> >>>injects MMC data errors on devices permitted by setting
> >>>debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request
> >>>
> >>> +o fail_function
> >>> +
> >>> +  injects error return on specific functions by setting debugfs entries
> >>> +  under /sys/kernel/debug/fail_function. No boot option supported.
> >>
> >> I like it.
> >> Could you document it a bit better?
> >
> > Yes, I will do in next series.
> >
> >> In particular retval is configurable, but without an example no one
> >> will be able to figure out how to use it.
> >
> > Ah, right. BTW, as I pointed in the covermail, should we store the
> > expected error value range into the injectable list? e.g.
> >
> > ALLOW_ERROR_INJECTION(open_ctree, -1, -MAX_ERRNO)
> >
> > And provide APIs to check/get it.
> 
> I'm afraid such check would be too costly.
> Right now we have only two functions marked but I expect hundreds more
> will be added in the near future as soon as developers realize the
> potential of such error injection.
> Every ALLOW_ERROR_INJECTION mark adds 8 bytes of overhead to .data.
> Multiply by 1k and we have 8k of data spent on marks.
> If we add max/min range marks, that doubles it for very little use.
> I think marking the function only is enough.

Sorry, I don't think so.
Even if it takes 16 more bytes per entry, that is not a meaningful
overhead for machines these days. Even if it were, we could provide
a kconfig option to reduce it.
I mean, we are living in a GB-order memory era, and it will only get
bigger in the future. Why do we have to worry about hundreds of
16-byte entries? They will take a few KB, and even if we mark
thousands of functions, it never approaches 1MB in a GB memory pool. :)

Of course, for many small-footprint embedded devices (with, say,
less than 128MB of memory), this feature can be an overhead. But they
can cut the table out via kconfig.

Thank you,

-- 
Masami Hiramatsu 


Re: [PATCH] Btrfs: enchanse raid1/10 balance heuristic for non rotating devices

2017-12-27 Thread Qu Wenruo


On December 28, 2017 at 06:39, Timofey Titovets wrote:
> Currently the btrfs raid1/10 balancer balances requests to mirrors
> based on pid % num of mirrors.
> 
> Update the logic to make it aware of whether the underlying devices
> are non-rotational.
> 
> If one of the mirrors is non-rotational, then all read requests will
> be moved to the non-rotational device.
> 
> If both mirrors are non-rotational, calculate the sum of pending and
> in-flight requests for the queue on each bdev and use the device with
> the shortest queue.
> 
> P.S.
> Inspired by md-raid1 read balancing
> 
> Signed-off-by: Timofey Titovets 
> ---
>  fs/btrfs/volumes.c | 59 
> ++
>  1 file changed, 59 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9a04245003ab..98bc2433a920 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5216,13 +5216,30 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
> *fs_info, u64 logical, u64 len)
>   return ret;
>  }
>  
> +static inline int bdev_get_queue_len(struct block_device *bdev)
> +{
> + int sum = 0;
> + struct request_queue *rq = bdev_get_queue(bdev);
> +
> + sum += rq->nr_rqs[BLK_RW_SYNC] + rq->nr_rqs[BLK_RW_ASYNC];
> + sum += rq->in_flight[BLK_RW_SYNC] + rq->in_flight[BLK_RW_ASYNC];
> +
> + /*
> +  * Avoid switching mirrors on every minor fluctuation
> +  * by rounding the result up to a multiple of 2
> +  */
> + return ALIGN(sum, 2);
> +}
> +
>  static int find_live_mirror(struct btrfs_fs_info *fs_info,
>   struct map_lookup *map, int first, int num,
>   int optimal, int dev_replace_is_ongoing)
>  {
>   int i;
>   int tolerance;
> + struct block_device *bdev;
>   struct btrfs_device *srcdev;
> + bool all_bdev_nonrot = true;
>  
>   if (dev_replace_is_ongoing &&
>   fs_info->dev_replace.cont_reading_from_srcdev_mode ==
> @@ -5231,6 +5248,48 @@ static int find_live_mirror(struct btrfs_fs_info 
> *fs_info,
>   else
>   srcdev = NULL;
>  
> + /*
> +  * "optimal" is expected to be pid % num.
> +  * That's generally OK for spinning rust drives,
> +  * but if one of the mirrors is non-rotational,
> +  * that bdev can show better performance.
> +  *
> +  * If one of the disks is non-rotational:
> +  *  - set optimal to the non-rotational device
> +  * If both disks are non-rotational:
> +  *  - set optimal to the bdev with the shortest queue
> +  * If both disks are spinning rust:
> +  *  - leave the old pid % num choice

And I'm wondering why this case can't use the same bdev queue length?

Any special reason a spinning disk can't benefit from a shorter queue?

Thanks,
Qu

> +  */
> + for (i = 0; i < num; i++) {
> + bdev = map->stripes[i].dev->bdev;
> + if (!bdev)
> + continue;
> + if (blk_queue_nonrot(bdev_get_queue(bdev)))
> + optimal = i;
> + else
> + all_bdev_nonrot = false;
> + }
> +
> + if (all_bdev_nonrot) {
> + int qlen;
> + /* Force the following logic's choice by initializing with a large number */
> + int optimal_dev_rq_count = 1 << 24;
> +
> + for (i = 0; i < num; i++) {
> + bdev = map->stripes[i].dev->bdev;
> + if (!bdev)
> + continue;
> +
> + qlen = bdev_get_queue_len(bdev);
> +
> + if (qlen < optimal_dev_rq_count) {
> + optimal = i;
> + optimal_dev_rq_count = qlen;
> + }
> + }
> + }
> +
>   /*
>* try to avoid the drive that is the source drive for a
>* dev-replace procedure, only choose it if no other non-missing
> 





Re: btrfs balance problems

2017-12-27 Thread Duncan
James Courtier-Dutton posted on Wed, 27 Dec 2017 21:39:30 + as
excerpted:

> Thank you for your suggestion.

Please put your reply in standard list quote/reply-in-context order.  It 
makes further replies, /in/ /context/, far easier.  I've moved the rest 
of your reply to do that, but I shouldn't have to...

>> On 23 December 2017 at 11:56, Alberto Bursi 
>> wrote:
>>>
>>> On 12/23/2017 12:19 PM, James Courtier-Dutton wrote:

 During a btrfs balance, the process hogs all CPU.
 Or, to be exact, any other program that wishes to use the SSD during
 a btrfs balance is blocked for long periods. Long periods being more
 than 5 seconds.

Blocking disk access isn't hogging the CPU, it's hogging the disk IO.

Tho FWIW we don't have many complaints about btrfs hogging /ssd/ 
access[1], tho we do have some complaining about problems on legacy 
spinning-rust.

 Is there any way to multiplex SSD access while btrfs balance is
 operating, so that other applications can still access the SSD with
 relatively low latency?

 My guess is that btrfs is doing a transaction with a large number of
 SSD blocks at a time, and thus blocking other applications.

 This makes for atrocious user interactivity as well as applications
 failing because they cannot access the disk in a relatively low
 latent manner.
 For, example, this is causing a High Definition network CCTV
 application to fail.

That sort of low-latency is outside my own use-case, but I do have some 
suggestions...

 What I would really like, is for some way to limit SSD bandwidths to
 applications.
 For example the CCTV app always gets the bandwidth it needs, and all
 other applications can still access the SSD, but are rate limited.
 This would fix my particular problem.
 We have rate limiting for network applications, why not disk access
 also?

>>> On most I/O intensive programs in Linux you can use "ionice" tool to
>>> change the disk access priority of a process. [1]

AFAIK, ionice only works for some IO schedulers, not all.  It does work 
with the default CFQ scheduler, but I don't /believe/ it works with 
deadline, certainly not with noop, and I'd /guess/ it doesn't work with 
block-multiqueue (and thus not with bfq or kyber) at all, tho it's 
possible it does in the latest kernels, since multi-queue is targeted to 
eventually replace, at least as default, the older single-queue options.

So which scheduler are you using and are you on multi-queue or not?

Meanwhile, where ionice /does/ work, using normal nice 19 should place 
the process in low-priority batch mode, which should automatically lower 
the ionice priority (increasing nice), as well.  That's what I normally 
use for such things here, on gentoo, where I schedule my package builds 
at nice 19, tho I also do the actual builds on tmpfs, so they don't 
actually touch anything but memory for the build itself, only fetching 
the sources, storing the built binpkg, and installing it to the main 
system.

>>> This allows me to run I/O intensive background scripts in servers
>>> without the users noticing slowdowns or lagging, of course this means
>>> the process doing heavy I/O will run more slowly or get outright
>>> paused if higher-priority processes need a lot of access to the disk.
>>>
>>> It works on btrfs balance too, see (commandline example) [2].

There's a problem with that example.  See below.

>>> If you don't start the process with ionice as in [2], you can always
>>> change the priority later if you get the process ID. I use
>>> iotop [3], which also supports commandline arguments to integrate its
>>> output in scripts.
>>>
>>> For btrfs scrub it seems to be possible to specify the ionice options
>>> directly, while btrfs balance does not seem to have them (would be
>>> nice to add them imho). [4]
>>>
>>> For the sake of completeness, there is also "nice" tool for CPU usage
>>> priority (also used in my scripts on servers to keep the scripts from
>>> hogging the CPU for what is just a background process, and seen in [2]
>>> commandline too). [5]
>>>
>>> 1. http://man7.org/linux/man-pages/man1/ionice.1.html
>>> 2. https://unix.stackexchange.com/questions/390480/nice-and-ionice-which-one-should-come-first
>>> 3. http://man7.org/linux/man-pages/man8/iotop.8.html
>>> 4. https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-scrub
>>> 5. http://man7.org/linux/man-pages/man1/nice.1.html

> It does not help at all.
> btrfs balance's behaviour seems to be unchanged by ionice.
> It still takes 100% while working and starves all other processes of
> disk access.

100% CPU, or 100% IO?  How are you measuring?  If iotop, 99% of time 
waiting on IO for an IO-bound process isn't bad, and doesn't mean nothing 
else can do IO first (tho 99% for that CCTV process /could/ be a problem, 
if it's normally much lower and only 99% because btrfs is taking what it 
needs).

100% of 

Re: [PATCH] Btrfs: enchanse raid1/10 balance heuristic for non rotating devices

2017-12-27 Thread Qu Wenruo


On December 28, 2017 at 06:39, Timofey Titovets wrote:
> Currently the btrfs raid1/10 balancer balances requests to mirrors
> based on pid % num of mirrors.
> 
> Update the logic to make it aware of whether the underlying devices
> are non-rotational.
> 
> If one of the mirrors is non-rotational, then all read requests will
> be moved to the non-rotational device.
> 
> If both mirrors are non-rotational, calculate the sum of pending and
> in-flight requests for the queue on each bdev and use the device with
> the shortest queue.
> 
> P.S.
> Inspired by md-raid1 read balancing

Any benchmark?

It would be more persuasive.

Thanks,
Qu

> 
> Signed-off-by: Timofey Titovets 
> ---
>  fs/btrfs/volumes.c | 59 
> ++
>  1 file changed, 59 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9a04245003ab..98bc2433a920 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5216,13 +5216,30 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
> *fs_info, u64 logical, u64 len)
>   return ret;
>  }
>  
> +static inline int bdev_get_queue_len(struct block_device *bdev)
> +{
> + int sum = 0;
> + struct request_queue *rq = bdev_get_queue(bdev);
> +
> + sum += rq->nr_rqs[BLK_RW_SYNC] + rq->nr_rqs[BLK_RW_ASYNC];
> + sum += rq->in_flight[BLK_RW_SYNC] + rq->in_flight[BLK_RW_ASYNC];
> +
> + /*
> +  * Avoid switching mirrors on every minor fluctuation
> +  * by rounding the result up to a multiple of 2
> +  */
> + return ALIGN(sum, 2);
> +}
> +
>  static int find_live_mirror(struct btrfs_fs_info *fs_info,
>   struct map_lookup *map, int first, int num,
>   int optimal, int dev_replace_is_ongoing)
>  {
>   int i;
>   int tolerance;
> + struct block_device *bdev;
>   struct btrfs_device *srcdev;
> + bool all_bdev_nonrot = true;
>  
>   if (dev_replace_is_ongoing &&
>   fs_info->dev_replace.cont_reading_from_srcdev_mode ==
> @@ -5231,6 +5248,48 @@ static int find_live_mirror(struct btrfs_fs_info 
> *fs_info,
>   else
>   srcdev = NULL;
>  
> + /*
> +  * "optimal" is expected to be pid % num.
> +  * That's generally OK for spinning rust drives,
> +  * but if one of the mirrors is non-rotational,
> +  * that bdev can show better performance.
> +  *
> +  * If one of the disks is non-rotational:
> +  *  - set optimal to the non-rotational device
> +  * If both disks are non-rotational:
> +  *  - set optimal to the bdev with the shortest queue
> +  * If both disks are spinning rust:
> +  *  - leave the old pid % num choice
> +  */
> + for (i = 0; i < num; i++) {
> + bdev = map->stripes[i].dev->bdev;
> + if (!bdev)
> + continue;
> + if (blk_queue_nonrot(bdev_get_queue(bdev)))
> + optimal = i;
> + else
> + all_bdev_nonrot = false;
> + }
> +
> + if (all_bdev_nonrot) {
> + int qlen;
> + /* Force the following logic's choice by initializing with a large number */
> + int optimal_dev_rq_count = 1 << 24;
> +
> + for (i = 0; i < num; i++) {
> + bdev = map->stripes[i].dev->bdev;
> + if (!bdev)
> + continue;
> +
> + qlen = bdev_get_queue_len(bdev);
> +
> + if (qlen < optimal_dev_rq_count) {
> + optimal = i;
> + optimal_dev_rq_count = qlen;
> + }
> + }
> + }
> +
>   /*
>* try to avoid the drive that is the source drive for a
>* dev-replace procedure, only choose it if no other non-missing
> 





Re: [RFC PATCH bpf-next v2 4/4] error-injection: Support fault injection framework

2017-12-27 Thread Alexei Starovoitov

On 12/27/17 12:09 AM, Masami Hiramatsu wrote:
> On Tue, 26 Dec 2017 18:12:56 -0800
> Alexei Starovoitov  wrote:
>
>> On Tue, Dec 26, 2017 at 04:48:25PM +0900, Masami Hiramatsu wrote:
>>> Support in-kernel fault-injection framework via debugfs.
>>> This allows you to inject a conditional error into a specified
>>> function using debugfs interfaces.
>>>
>>> Signed-off-by: Masami Hiramatsu 
>>> ---
>>>  Documentation/fault-injection/fault-injection.txt |5 +
>>>  kernel/Makefile   |1
>>>  kernel/fail_function.c|  169 +
>>>  lib/Kconfig.debug |   10 +
>>>  4 files changed, 185 insertions(+)
>>>  create mode 100644 kernel/fail_function.c
>>>
>>> diff --git a/Documentation/fault-injection/fault-injection.txt 
>>> b/Documentation/fault-injection/fault-injection.txt
>>> index 918972babcd8..6243a588dd71 100644
>>> --- a/Documentation/fault-injection/fault-injection.txt
>>> +++ b/Documentation/fault-injection/fault-injection.txt
>>> @@ -30,6 +30,11 @@ o fail_mmc_request
>>>    injects MMC data errors on devices permitted by setting
>>>    debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request
>>>
>>> +o fail_function
>>> +
>>> +  injects error return on specific functions by setting debugfs entries
>>> +  under /sys/kernel/debug/fail_function. No boot option supported.
>>
>> I like it.
>> Could you document it a bit better?
>
> Yes, I will do in next series.
>
>> In particular retval is configurable, but without an example no one
>> will be able to figure out how to use it.
>
> Ah, right. BTW, as I pointed out in the cover mail, should we store the
> expected error value range in the injectable list? e.g.
>
> ALLOW_ERROR_INJECTION(open_ctree, -1, -MAX_ERRNO)
>
> And provide APIs to check/get it.

I'm afraid such a check would be too costly.
Right now we have only two functions marked, but I expect hundreds more
will be added in the near future as soon as developers realize the
potential of such error injection.
Every ALLOW_ERROR_INJECTION mark adds 8 bytes of overhead to .data.
Multiply by 1k and we have 8k of data spent on marks.
If we add max/min range marks, that doubles it for very little use.
I think marking the function only is enough.



Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry

2017-12-27 Thread Alexei Starovoitov

On 12/26/17 9:56 PM, Masami Hiramatsu wrote:
> On Tue, 26 Dec 2017 17:57:32 -0800
> Alexei Starovoitov  wrote:
>
>> On Tue, Dec 26, 2017 at 04:46:59PM +0900, Masami Hiramatsu wrote:
>>> Check whether error injectable event is on function entry or not.
>>> Currently it checks the event is ftrace-based kprobes or not,
>>> but that is wrong. It should check if the event is on the entry
>>> of target function. Since error injection will override a function
>>> to just return with modified return value, that operation must
>>> be done before the target function starts making stackframe.
>>>
>>> As a side effect, bpf error injection is no need to depend on
>>> function-tracer. It can work with sw-breakpoint based kprobe
>>> events too.
>>>
>>> Signed-off-by: Masami Hiramatsu 
>>> ---
>>>  kernel/trace/Kconfig|2 --
>>>  kernel/trace/bpf_trace.c|6 +++---
>>>  kernel/trace/trace_kprobe.c |8 +---
>>>  kernel/trace/trace_probe.h  |   12 ++--
>>>  4 files changed, 14 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
>>> index ae3a2d519e50..6400e1bf97c5 100644
>>> --- a/kernel/trace/Kconfig
>>> +++ b/kernel/trace/Kconfig
>>> @@ -533,9 +533,7 @@ config FUNCTION_PROFILER
>>>  config BPF_KPROBE_OVERRIDE
>>> bool "Enable BPF programs to override a kprobed function"
>>> depends on BPF_EVENTS
>>> -   depends on KPROBES_ON_FTRACE
>>> depends on HAVE_KPROBE_OVERRIDE
>>> -   depends on DYNAMIC_FTRACE_WITH_REGS
>>> default n
>>> help
>>>  Allows BPF to override the execution of a probed function and
>>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>>> index f6d2327ecb59..d663660f8392 100644
>>> --- a/kernel/trace/bpf_trace.c
>>> +++ b/kernel/trace/bpf_trace.c
>>> @@ -800,11 +800,11 @@ int perf_event_attach_bpf_prog(struct perf_event *event,
>>> int ret = -EEXIST;
>>>
>>> /*
>>> -* Kprobe override only works for ftrace based kprobes, and only if they
>>> -* are on the opt-in list.
>>> +* Kprobe override only works if they are on the function entry,
>>> +* and only if they are on the opt-in list.
>>>  */
>>> if (prog->kprobe_override &&
>>> -   (!trace_kprobe_ftrace(event->tp_event) ||
>>> +   (!trace_kprobe_on_func_entry(event->tp_event) ||
>>>  !trace_kprobe_error_injectable(event->tp_event)))
>>> return -EINVAL;
>>>
>>> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
>>> index 91f4b57dab82..265e3e27e8dc 100644
>>> --- a/kernel/trace/trace_kprobe.c
>>> +++ b/kernel/trace/trace_kprobe.c
>>> @@ -88,13 +88,15 @@ static nokprobe_inline unsigned long trace_kprobe_nhit(struct trace_kprobe *tk)
>>> return nhit;
>>>  }
>>>
>>> -int trace_kprobe_ftrace(struct trace_event_call *call)
>>> +bool trace_kprobe_on_func_entry(struct trace_event_call *call)
>>>  {
>>> struct trace_kprobe *tk = (struct trace_kprobe *)call->data;
>>> -   return kprobe_ftrace(&tk->rp.kp);
>>> +
>>> +   return kprobe_on_func_entry(tk->rp.kp.addr, tk->rp.kp.symbol_name,
>>> +   tk->rp.kp.offset);
>>
>> That would be nice, but did you test this?
>
> Yes, because jprobe, which was the only official user of modifying the
> execution path using kprobes, checked it the same way (and kretprobe
> also does it).
>
>> My understanding is that kprobe will restore all regs and
>> here we need to override return ip _and_ value.
>
> Yes, no problem. kprobe restores all regs from pt_regs, including regs->ip.
>
>> Could you add a patch with the test the way Josef did
>> or describe the steps to test this new mode?
>
> Do you mean the patch below? If so, it should work without any change.
>
>  [PATCH v10 4/5] samples/bpf: add a test for bpf_override_return

yeah. I expect the bpf_override_return test to work as-is.
I'm asking for a test of the new functionality added by this patch,
in particular kprobe on func entry without ftrace.
How did you test it?
And how can I repeat the test?
I'm still not sure that it works correctly.



Re: btrfs balance problems

2017-12-27 Thread James Courtier-Dutton
Hi,

Thank you for your suggestion.
It does not help at all.
btrfs balance's behaviour seems to be unchanged by ionice.
It still takes 100% while working and starves all other processes of
disk access.

How can I get btrfs balance to work in the background, without adversely
affecting other applications?

>
> On 23 December 2017 at 11:56, Alberto Bursi  wrote:
>>
>>
>> On 12/23/2017 12:19 PM, James Courtier-Dutton wrote:
>>> Hi,
>>>
>>> During a btrfs balance, the process hogs all CPU.
>>> Or, to be exact, any other program that wishes to use the SSD during a
>>> btrfs balance is blocked for long periods. Long periods being more
>>> than 5 seconds.
>>> Is there any way to multiplex SSD access while btrfs balance is
>>> operating, so that other applications can still access the SSD with
>>> relatively low latency?
>>>
>>> My guess is that btrfs is doing a transaction with a large number of
>>> SSD blocks at a time, and thus blocking other applications.
>>>
>>> This makes for atrocious user interactivity as well as applications
>>> failing because they cannot access the disk in a relatively low latent
>>> manner.
>>> For, example, this is causing a High Definition network CCTV
>>> application to fail.
>>>
>>> What I would really like, is for some way to limit SSD bandwidths to
>>> applications.
>>> For example the CCTV app always gets the bandwidth it needs, and all
>>> other applications can still access the SSD, but are rate limited.
>>> This would fix my particular problem.
>>> We have rate limiting for network applications, why not disk access also?
>>>
>>> Kind Regards
>>>
>>> James
>>>
>>
>> On most I/O intensive programs in Linux you can use "ionice" tool to
>> change the disk access priority of a process. [1]
>> This allows me to run I/O intensive background scripts in servers
>> without the users noticing slowdowns or lagging, of course this means
>> the process doing heavy I/O will run more slowly or get outright paused
>> if higher-priority processes need a lot of access to the disk.
>>
>> It works on btrfs balance too, see (commandline example) [2].
>>
>> If you don't start the process with ionice as in [2], you can always
>> change the priority later if you get the process ID. I use iotop
>> [3], which also supports commandline arguments to integrate its output
>> in scripts.
>>
>> For btrfs scrub it seems to be possible to specify the ionice options
>> directly, while btrfs balance does not seem to have them (would be nice
>> to add them imho). [4]
>>
>> For the sake of completeness, there is also "nice" tool for CPU usage
>> priority (also used in my scripts on servers to keep the scripts from
>> hogging the CPU for what is just a background process, and seen in [2]
>> commandline too). [5]
>>
>> 1. http://man7.org/linux/man-pages/man1/ionice.1.html
>> 2.
>> https://unix.stackexchange.com/questions/390480/nice-and-ionice-which-one-should-come-first
>> 3. http://man7.org/linux/man-pages/man8/iotop.8.html
>> 4. https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-scrub
>> 5. http://man7.org/linux/man-pages/man1/nice.1.html
>>
>> -Alberto


Re: [PATCH v5 06/78] xarray: Change definition of sibling entries

2017-12-27 Thread Kirill A. Shutemov
On Tue, Dec 26, 2017 at 07:13:26PM -0800, Matthew Wilcox wrote:
> On Tue, Dec 26, 2017 at 08:21:53PM +0300, Kirill A. Shutemov wrote:
> > > +/**
> > > + * xa_is_internal() - Is the entry an internal entry?
> > > + * @entry: Entry retrieved from the XArray
> > > + *
> > > + * Return: %true if the entry is an internal entry.
> > > + */
> > 
> > What does "internal entry" mean? Is it just a term for a non-value,
> > non-data pointer entry? Do we allow anybody besides the xarray
> > implementation to use internal entries?
> > 
> > Do we have it documented?
> 
> We do!  include/linux/radix-tree.h has it documented right now:

Looks good. Thanks.

-- 
 Kirill A. Shutemov


Re: [PATCH v5 05/78] xarray: Replace exceptional entries

2017-12-27 Thread Kirill A. Shutemov
On Tue, Dec 26, 2017 at 07:05:34PM -0800, Matthew Wilcox wrote:
> On Tue, Dec 26, 2017 at 08:15:42PM +0300, Kirill A. Shutemov wrote:
> > >  28 files changed, 249 insertions(+), 240 deletions(-)
> > 
> > Everything looks fine to me after a quick scan, but that's a lot of
> > changes for one patch...
> 
> Yeah.  It's pretty mechanical though.
> 
> > > - if (radix_tree_exceptional_entry(page)) {
> > > + if (xa_is_value(page)) {
> > >   if (!invalidate_exceptional_entry2(mapping,
> > >  index, page))
> > >   ret = -EBUSY;
> > 
> > invalidate_exceptional_entry? Are we going to leave the terminology here as 
> > is?
> 
> That is a great question.  If the page cache wants to call its value
> entries exceptional entries, it can continue to do that.  I think there's
> a better name for them, but I'm not sure what it is.  Right now, the
> page cache uses value entries to store:
> 
> 1. Shadow entries (for workingset)
> 2. Swap entries (for shmem)
> 3. DAX entries
> 
> I can't come up with a good name for these three things.  'nonpage' is
> the only thing which hasn't immediately fallen off my ideas list.

Yeah, naming problem...

> But I think renaming exceptional entries in the page cache is a great idea,
> and I don't want to do it as part of this patch set ;-)

Fair enough.

-- 
 Kirill A. Shutemov


Re: You will definetely be interested...

2017-12-27 Thread Sra. Angel Rania
Hi Dear,

Reading your profile has given me courage in search of a responsible
and trustworthy fellow. The past has treated me so awfully but now I
am ready to move on despite my health condition. I would like to
have a sincere and important discussion with you that will be in your
favor, likewise to you and your environment, especially to your close
family. Endeavor to reply to me; I have attached my picture in case
you long to know who emailed you. I will be waiting to hear from you
as soon as possible.
Thanks for paying attention to my mail; I will appreciate it so much
if I receive a reply from you for understandable details.

Thanks,

Mrs. Rania Hassan


Re: [PATCH v5 03/78] xarray: Add the xa_lock to the radix_tree_root

2017-12-27 Thread Kirill A. Shutemov
On Tue, Dec 26, 2017 at 07:58:15PM -0800, Matthew Wilcox wrote:
> On Tue, Dec 26, 2017 at 07:43:40PM -0800, Matthew Wilcox wrote:
> > Also add the xa_lock() and xa_unlock() family of wrappers to make it
> > easier to use the lock.  If we could rely on -fplan9-extensions in
> > the compiler, we could avoid all of this syntactic sugar, but that
> > wasn't added until gcc 4.6.
> 
> Oh, in case anyone's wondering, here's how I'd do it with plan9 extensions:
> 
> struct xarray {
> spinlock_t;
> int xa_flags;
> void *xa_head;
> };
> 
> ...
> spin_lock_irqsave(&mapping->pages, flags);
> __delete_from_page_cache(page, NULL);
> spin_unlock_irqrestore(&mapping->pages, flags);
> ...
> 
> The plan9 extensions permit passing a pointer to a struct which has an
> unnamed element to a function which is expecting a pointer to the type
> of that element.  The compiler does any necessary arithmetic to produce 
> a pointer.  It's exactly as if I had written:
> 
> spin_lock_irqsave(&mapping->pages.xa_lock, flags);
> __delete_from_page_cache(page, NULL);
> spin_unlock_irqrestore(&mapping->pages.xa_lock, flags);
> 
> More details here: https://9p.io/sys/doc/compiler.html

Yeah, that's neat.

Dealing with old compilers is frustrating...

-- 
 Kirill A. Shutemov


Re: [PATCH v5 03/78] xarray: Add the xa_lock to the radix_tree_root

2017-12-27 Thread Kirill A. Shutemov
On Tue, Dec 26, 2017 at 07:43:40PM -0800, Matthew Wilcox wrote:
> On Tue, Dec 26, 2017 at 07:54:40PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Dec 15, 2017 at 02:03:35PM -0800, Matthew Wilcox wrote:
> > > From: Matthew Wilcox 
> > > 
> > > This results in no change in structure size on 64-bit x86 as it fits in
> > > the padding between the gfp_t and the void *.
> > 
> > The patch does more than described in the subject and commit message. At
> > first I was confused about why you need to touch idr here. It took a few
> > minutes to figure it out.
> > 
> > Could you please add more to the commit message about the lock name and
> > the xa_ locking interface, since you introduce them here?
> 
> Sure!  How's this?
> 
> xarray: Add the xa_lock to the radix_tree_root
> 
> This results in no change in structure size on 64-bit x86 as it fits in
> the padding between the gfp_t and the void *.
> 
> Initialising the spinlock requires a name for the benefit of lockdep,
> so RADIX_TREE_INIT() now needs to know the name of the radix tree it's
> initialising, and so do IDR_INIT() and IDA_INIT().
> 
> Also add the xa_lock() and xa_unlock() family of wrappers to make it
> easier to use the lock.  If we could rely on -fplan9-extensions in
> the compiler, we could avoid all of this syntactic sugar, but that
> wasn't added until gcc 4.6.
> 

Looks great, thanks.

-- 
 Kirill A. Shutemov


Re: [RFC PATCH bpf-next v2 4/4] error-injection: Support fault injection framework

2017-12-27 Thread Masami Hiramatsu
On Tue, 26 Dec 2017 18:12:56 -0800
Alexei Starovoitov  wrote:

> On Tue, Dec 26, 2017 at 04:48:25PM +0900, Masami Hiramatsu wrote:
> > Support in-kernel fault-injection framework via debugfs.
> > This allows you to inject a conditional error to specified
> > function using debugfs interfaces.
> > 
> > Signed-off-by: Masami Hiramatsu 
> > ---
> >  Documentation/fault-injection/fault-injection.txt |5 +
> >  kernel/Makefile   |1 
> >  kernel/fail_function.c|  169 
> > +
> >  lib/Kconfig.debug |   10 +
> >  4 files changed, 185 insertions(+)
> >  create mode 100644 kernel/fail_function.c
> > 
> > diff --git a/Documentation/fault-injection/fault-injection.txt 
> > b/Documentation/fault-injection/fault-injection.txt
> > index 918972babcd8..6243a588dd71 100644
> > --- a/Documentation/fault-injection/fault-injection.txt
> > +++ b/Documentation/fault-injection/fault-injection.txt
> > @@ -30,6 +30,11 @@ o fail_mmc_request
> >injects MMC data errors on devices permitted by setting
> >debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request
> >  
> > +o fail_function
> > +
> > +  injects error return on specific functions by setting debugfs entries
> > +  under /sys/kernel/debug/fail_function. No boot option supported.
> 
> I like it.
> Could you document it a bit better?

Yes, I will do in next series.

> In particular retval is configurable, but without an example no one
> will be able to figure out how to use it.

Ah, right. BTW, as I pointed out in the cover mail, should we store the
expected error value range in the injectable list? e.g.

ALLOW_ERROR_INJECTION(open_ctree, -1, -MAX_ERRNO)

And provide APIs to check/get it.

const struct error_range *ei_get_error_range(unsigned long addr);


> 
> I think you can drop RFC tag from the next version of these patches.
> Thanks!

Thank you, I'll fix some errors that came from the configurations, and resend it.


Thanks!


-- 
Masami Hiramatsu 