Re: [PATCH] libbpf: Use the correct fd when attaching to perf events

2021-03-12 Thread Sultan Alsawaf
On Fri, Mar 12, 2021 at 06:33:01PM -0800, Andrii Nakryiko wrote:
> On Fri, Mar 12, 2021 at 6:22 PM Sultan Alsawaf  wrote:
> >
> > On Fri, Mar 12, 2021 at 05:31:14PM -0800, Andrii Nakryiko wrote:
> > > On Fri, Mar 12, 2021 at 1:43 PM Sultan Alsawaf  
> > > wrote:
> > > >
> > > > From: Sultan Alsawaf 
> > > >
> > > > We should be using the program fd here, not the perf event fd.
> > >
> > > Why? Can you elaborate on what issue you ran into with the current code?
> >
> > bpf_link__pin() would fail with -EINVAL when using tracepoints, kprobes, or
> > uprobes. The failure would happen inside the kernel, in 
> > bpf_link_get_from_fd()
> > right here:
> > if (f.file->f_op != _link_fops) {
> > fdput(f);
> > return ERR_PTR(-EINVAL);
> > }
> 
> kprobe/tracepoint/perf_event attachments behave like bpf_link (so
> libbpf uses user-space high-level bpf_link APIs for it), but they are
> not bpf_link-based in the kernel. So bpf_link__pin() won't work for
> such types of programs until we actually have bpf_link-backed
> attachment support in the kernel itself. I never got to implementing
> this because we already had auto-detachment properties from perf_event
> FD itself. But it would be nice to have that done as a real bpf_link
> in the kernel (with all the observability, program update,
> force-detach support).
> 
> Looking for volunteers to make this happen ;)
> 
> 
> >
> > Since bpf wasn't looking for the perf event fd, I swapped it for the 
> > program fd
> > and bpf_link__pin() worked.
> 
> But you were pinning the BPF program, not a BPF link. Which is not
> what should have happen.

This is the code in question:
link = bpf_program__attach(prog);
// make sure `link` is valid, blah blah...
bpf_link__pin(link, some_path);

Are you saying that this usage is incorrect?

Sultan


Re: [PATCH] libbpf: Use the correct fd when attaching to perf events

2021-03-12 Thread Sultan Alsawaf
On Fri, Mar 12, 2021 at 05:31:14PM -0800, Andrii Nakryiko wrote:
> On Fri, Mar 12, 2021 at 1:43 PM Sultan Alsawaf  wrote:
> >
> > From: Sultan Alsawaf 
> >
> > We should be using the program fd here, not the perf event fd.
> 
> Why? Can you elaborate on what issue you ran into with the current code?

bpf_link__pin() would fail with -EINVAL when using tracepoints, kprobes, or
uprobes. The failure would happen inside the kernel, in bpf_link_get_from_fd()
right here:
if (f.file->f_op != _link_fops) {
fdput(f);
return ERR_PTR(-EINVAL);
}

Since bpf wasn't looking for the perf event fd, I swapped it for the program fd
and bpf_link__pin() worked.

Sultan


[PATCH] libbpf: Use the correct fd when attaching to perf events

2021-03-12 Thread Sultan Alsawaf
From: Sultan Alsawaf 

We should be using the program fd here, not the perf event fd.

Fixes: 63f2f5ee856ba ("libbpf: add ability to attach/detach BPF program to perf 
event")
Signed-off-by: Sultan Alsawaf 
---
 tools/lib/bpf/libbpf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index d43cc3f29dae..3d20d57d4af5 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -9538,7 +9538,7 @@ struct bpf_link *bpf_program__attach_perf_event(struct 
bpf_program *prog,
if (!link)
return ERR_PTR(-ENOMEM);
link->detach = _link__detach_perf_event;
-   link->fd = pfd;
+   link->fd = prog_fd;
 
if (ioctl(pfd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0) {
err = -errno;
-- 
2.30.2



Re: [PATCH v2 0/4] i2c-hid: Save power by reducing i2c xfers with block reads

2020-10-16 Thread Sultan Alsawaf
On Fri, Oct 16, 2020 at 01:16:18PM +0200, Hans de Goede wrote:
> Hi,
> 
> On 9/22/20 11:19 AM, Jiri Kosina wrote:
> > On Wed, 16 Sep 2020, Sultan Alsawaf wrote:
> > 
> >> From: Sultan Alsawaf 
> >>
> >> This is a fixed resubmission of "[PATCH 0/2] i2c-hid: Save power by 
> >> reducing i2c
> >> xfers with block reads". That original patchset did not have enough fixes 
> >> for
> >> the designware i2c adapter's I2C_M_RECV_LEN feature, which is documented
> >> extensively in the original email thread.
> >>
> >> Here is the original cover letter, which still applies:
> >> "I noticed on my Dell Precision 15 5540 with an i9-9880H that simply 
> >> putting my
> >> finger on the touchpad would increase my system's power consumption by 4W, 
> >> which
> >> is quite considerable. Resting my finger on the touchpad would generate 
> >> roughly
> >> 4000 i2c irqs per second, or roughly 20 i2c irqs per touchpad irq.
> >>
> >> Upon closer inspection, I noticed that the i2c-hid driver would always 
> >> transfer
> >> the maximum report size over i2c (which is 60 bytes for my touchpad), but 
> >> all of
> >> my touchpad's normal touch events are only 32 bytes long according to the 
> >> length
> >> byte contained in the buffer sequence.
> >>
> >> Therefore, I was able to save about 2W of power by passing the 
> >> I2C_M_RECV_LEN
> >> flag in i2c-hid, which says to look for the payload length in the first 
> >> byte of
> >> the transfer buffer and adjust the i2c transaction accordingly. The only 
> >> problem
> >> though is that my i2c controller's driver allows bytes other than the 
> >> first one
> >> to be used to retrieve the payload length, which is incorrect according to 
> >> the
> >> SMBus spec, and would break my i2c-hid change since not *all* of the 
> >> reports
> >> from my touchpad are conforming SMBus block reads.
> >>
> >> This patchset fixes the I2C_M_RECV_LEN behavior in the designware i2c 
> >> driver and
> >> modifies i2c-hid to use I2C_M_RECV_LEN to save quite a bit of power. Even 
> >> if the
> >> peripheral controlled by i2c-hid doesn't support block reads, the i2c 
> >> controller
> >> drivers should cope with this and proceed with the i2c transfer using the
> >> original requested length."
> >>
> >> Sultan
> >>
> >> Sultan Alsawaf (4):
> >>   i2c: designware: Fix transfer failures for invalid SMBus block reads
> >>   i2c: designware: Ensure tx_buf_len is nonzero for SMBus block reads
> >>   i2c: designware: Allow SMBus block reads up to 255 bytes in length
> >>   HID: i2c-hid: Use block reads when possible to save power
> >>
> >>  drivers/hid/i2c-hid/i2c-hid-core.c |  5 -
> >>  drivers/i2c/busses/i2c-designware-master.c | 15 +--
> >>  2 files changed, 13 insertions(+), 7 deletions(-)
> > 
> > Hans, Benjamin, could you please give this patchset some smoke-testing? It 
> > looks good to me, but I'd like it to get some testing from your testing 
> > machinery before merging.
> 
> Sorry for being slow to respond to this. I have not gotten around to testing
> this, but I saw another email that this breaks things on at least AMD
> platforms, so I guess that this is on hold for now ?
> 
> Regards,
> 
> Hans

Hi,

Only the i2c-hid-core.c hunk is on hold. It is on hold until every i2c adapter
gets updated for proper block read functionality up to the new block read limit
of 255 bytes. This is not really a surprise.

The designware patches are good to go. Please let me know what you think of
them.

Sultan


Re: [PATCH v2 0/4] i2c-hid: Save power by reducing i2c xfers with block reads

2020-09-23 Thread Sultan Alsawaf
On Tue, Sep 22, 2020 at 09:59:44PM +0200, Jiri Kosina wrote:
> On Tue, 22 Sep 2020, Wolfram Sang wrote:
> 
> > > Hans, Benjamin, could you please give this patchset some smoke-testing? 
> > > It 
> > > looks good to me, but I'd like it to get some testing from your testing 
> > > machinery before merging.
> > 
> > Please give me some more days. I am not fully convinced yet that this
> > use of I2C_M_RECV_LEN is not broken on some controllers.
> > 
> > Plus, I'd favor if this could go via I2C tree. It is within I2C where
> > the non-trivial changes are. The HID part is just the final bit. Can we
> > agree on that?
> 
> Absolutely no problem with that. But I'd like to have this ran through 
> Benjamin/Hans first too.
> 
> Thanks,
> 
> -- 
> Jiri Kosina
> SUSE Labs
> 

I suppose the HID part does need to be held off until all the adapters are
updated with functional I2C_M_RECV_LEN bits.

I just got a Ryzen laptop which panics when using I2C_M_RECV_LEN.

So it looks like only the designware changes can be considered for merging now.

Sultan


Re: [PATCH v2 3/4] i2c: designware: Allow SMBus block reads up to 255 bytes in length

2020-09-17 Thread Sultan Alsawaf
On Thu, Sep 17, 2020 at 10:57:04PM +0200, Wolfram Sang wrote:
> On Wed, Sep 16, 2020 at 10:22:55PM -0700, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf 
> > 
> > According to the SMBus 3.0 protocol specification, block transfer limits
> > were increased from 32 bytes to 255 bytes. Remove the obsolete 32-byte
> > limitation.
> 
> Sadly, it is not that easy. We are trying to extend BLOCK_MAX to 255
> (SMBus 3 specs) but there are various things to be considered,
> especially with buffers and when passing it to userspace. Check here for
> the discussion (and you are welcome to join, of course):
> 
> http://patchwork.ozlabs.org/project/linux-i2c/list/?submitter=79741=*
> 
> > 
> > Signed-off-by: Sultan Alsawaf 
> > ---
> >  drivers/i2c/busses/i2c-designware-master.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/i2c/busses/i2c-designware-master.c 
> > b/drivers/i2c/busses/i2c-designware-master.c
> > index 22f28516bca7..5bd64bd17d94 100644
> > --- a/drivers/i2c/busses/i2c-designware-master.c
> > +++ b/drivers/i2c/busses/i2c-designware-master.c
> > @@ -433,7 +433,7 @@ i2c_dw_read(struct dw_i2c_dev *dev)
> > regmap_read(dev->map, DW_IC_DATA_CMD, );
> > if (flags & I2C_M_RECV_LEN) {
> > /* Ensure length byte is a valid value */
> > -   if (tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0)
> > +   if (tmp > 0)
> > len = i2c_dw_recv_len(dev, tmp);
> > else
> > len = i2c_dw_recv_len(dev, len);
> > -- 
> > 2.28.0
> > 

Yes, it is not that easy to make the change on a global scale. However, in the
case of the designware adapter, it really *is* that easy. This change covers the
designware adapter, and others can follow later with the much more invasive
changes that are needed.

Sultan


[PATCH v2 3/4] i2c: designware: Allow SMBus block reads up to 255 bytes in length

2020-09-16 Thread Sultan Alsawaf
From: Sultan Alsawaf 

According to the SMBus 3.0 protocol specification, block transfer limits
were increased from 32 bytes to 255 bytes. Remove the obsolete 32-byte
limitation.

Signed-off-by: Sultan Alsawaf 
---
 drivers/i2c/busses/i2c-designware-master.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/i2c/busses/i2c-designware-master.c 
b/drivers/i2c/busses/i2c-designware-master.c
index 22f28516bca7..5bd64bd17d94 100644
--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -433,7 +433,7 @@ i2c_dw_read(struct dw_i2c_dev *dev)
regmap_read(dev->map, DW_IC_DATA_CMD, );
if (flags & I2C_M_RECV_LEN) {
/* Ensure length byte is a valid value */
-   if (tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0)
+   if (tmp > 0)
len = i2c_dw_recv_len(dev, tmp);
else
len = i2c_dw_recv_len(dev, len);
-- 
2.28.0



[PATCH v2 2/4] i2c: designware: Ensure tx_buf_len is nonzero for SMBus block reads

2020-09-16 Thread Sultan Alsawaf
From: Sultan Alsawaf 

The point of adding a byte to len in i2c_dw_recv_len() is to make sure
that tx_buf_len is nonzero, so that i2c_dw_xfer_msg() can let the i2c
controller know that the i2c transaction can end. Otherwise, the i2c
controller will think that the transaction can never end for block
reads, which results in the stop-detection bit never being set and thus
the transaction timing out.

Adding a byte to len is not a reliable way to do this though; sometimes
it lets tx_buf_len become zero, which results in the scenario described
above. Therefore, just directly ensure tx_buf_len cannot be zero to fix
the issue.

Fixes: c3ae106050b9 ("i2c: designware: Implement support for SMBus block read 
and write")
Signed-off-by: Sultan Alsawaf 
---
 drivers/i2c/busses/i2c-designware-master.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/i2c/busses/i2c-designware-master.c 
b/drivers/i2c/busses/i2c-designware-master.c
index d78f48ca4886..22f28516bca7 100644
--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -395,8 +395,9 @@ i2c_dw_recv_len(struct dw_i2c_dev *dev, u8 len)
 * Adjust the buffer length and mask the flag
 * after receiving the first byte.
 */
-   len += (flags & I2C_CLIENT_PEC) ? 2 : 1;
-   dev->tx_buf_len = len - min_t(u8, len, dev->rx_outstanding);
+   if (flags & I2C_CLIENT_PEC)
+   len++;
+   dev->tx_buf_len = len - min_t(u8, len - 1, dev->rx_outstanding);
msgs[dev->msg_read_idx].len = len;
msgs[dev->msg_read_idx].flags &= ~I2C_M_RECV_LEN;
 
-- 
2.28.0



[PATCH v2 1/4] i2c: designware: Fix transfer failures for invalid SMBus block reads

2020-09-16 Thread Sultan Alsawaf
From: Sultan Alsawaf 

SMBus block reads can be broken because the read function will just skip
over bytes it doesn't like until reaching a byte that conforms to the
length restrictions for block reads. This is problematic when it isn't
known if the incoming payload is indeed a conforming block read.

According to the SMBus specification, block reads will only send the
payload length in the first byte, so we can fix this by only considering
the first byte in a sequence for block read length purposes.

In addition, when the length byte is invalid, the original transfer
length still needs to be adjusted to avoid a controller timeout.

Fixes: c3ae106050b9 ("i2c: designware: Implement support for SMBus block read 
and write")
Signed-off-by: Sultan Alsawaf 
---
 drivers/i2c/busses/i2c-designware-master.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/i2c/busses/i2c-designware-master.c 
b/drivers/i2c/busses/i2c-designware-master.c
index d6425ad6e6a3..d78f48ca4886 100644
--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -430,10 +430,12 @@ i2c_dw_read(struct dw_i2c_dev *dev)
u32 flags = msgs[dev->msg_read_idx].flags;
 
regmap_read(dev->map, DW_IC_DATA_CMD, );
-   /* Ensure length byte is a valid value */
-   if (flags & I2C_M_RECV_LEN &&
-   tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0) {
-   len = i2c_dw_recv_len(dev, tmp);
+   if (flags & I2C_M_RECV_LEN) {
+   /* Ensure length byte is a valid value */
+   if (tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0)
+   len = i2c_dw_recv_len(dev, tmp);
+   else
+   len = i2c_dw_recv_len(dev, len);
}
*buf++ = tmp;
dev->rx_outstanding--;
-- 
2.28.0



[PATCH v2 0/4] i2c-hid: Save power by reducing i2c xfers with block reads

2020-09-16 Thread Sultan Alsawaf
From: Sultan Alsawaf 

This is a fixed resubmission of "[PATCH 0/2] i2c-hid: Save power by reducing i2c
xfers with block reads". That original patchset did not have enough fixes for
the designware i2c adapter's I2C_M_RECV_LEN feature, which is documented
extensively in the original email thread.

Here is the original cover letter, which still applies:
"I noticed on my Dell Precision 15 5540 with an i9-9880H that simply putting my
finger on the touchpad would increase my system's power consumption by 4W, which
is quite considerable. Resting my finger on the touchpad would generate roughly
4000 i2c irqs per second, or roughly 20 i2c irqs per touchpad irq.

Upon closer inspection, I noticed that the i2c-hid driver would always transfer
the maximum report size over i2c (which is 60 bytes for my touchpad), but all of
my touchpad's normal touch events are only 32 bytes long according to the length
byte contained in the buffer sequence.

Therefore, I was able to save about 2W of power by passing the I2C_M_RECV_LEN
flag in i2c-hid, which says to look for the payload length in the first byte of
the transfer buffer and adjust the i2c transaction accordingly. The only problem
though is that my i2c controller's driver allows bytes other than the first one
to be used to retrieve the payload length, which is incorrect according to the
SMBus spec, and would break my i2c-hid change since not *all* of the reports
from my touchpad are conforming SMBus block reads.

This patchset fixes the I2C_M_RECV_LEN behavior in the designware i2c driver and
modifies i2c-hid to use I2C_M_RECV_LEN to save quite a bit of power. Even if the
peripheral controlled by i2c-hid doesn't support block reads, the i2c controller
drivers should cope with this and proceed with the i2c transfer using the
original requested length."

Sultan

Sultan Alsawaf (4):
  i2c: designware: Fix transfer failures for invalid SMBus block reads
  i2c: designware: Ensure tx_buf_len is nonzero for SMBus block reads
  i2c: designware: Allow SMBus block reads up to 255 bytes in length
  HID: i2c-hid: Use block reads when possible to save power

 drivers/hid/i2c-hid/i2c-hid-core.c |  5 -
 drivers/i2c/busses/i2c-designware-master.c | 15 +--
 2 files changed, 13 insertions(+), 7 deletions(-)

-- 
2.28.0



[PATCH v2 4/4] HID: i2c-hid: Use block reads when possible to save power

2020-09-16 Thread Sultan Alsawaf
From: Sultan Alsawaf 

We have no way of knowing how large an incoming payload is going to be,
so the only strategy available up until now has been to always retrieve
the maximum possible report length over i2c, which can be quite
inefficient. For devices that send reports in block read format, the i2c
controller driver can read the payload length on the fly and terminate
the i2c transaction early, resulting in considerable power savings.

On a Dell Precision 15 5540 with an i9-9880H, resting my finger on the
touchpad causes psys power readings to go up by about 4W and hover there
until I remove my finger. With this patch, my psys readings go from 4.7W
down to 3.1W, yielding about 1.6W in savings. This is because my
touchpad's max report length is 60 bytes, but all of the regular reports
it sends for touch events are only 32 bytes, so the i2c transfer is
roughly halved for the common case.

Acked-by: Jiri Kosina 
Signed-off-by: Sultan Alsawaf 
---
 drivers/hid/i2c-hid/i2c-hid-core.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/hid/i2c-hid/i2c-hid-core.c 
b/drivers/hid/i2c-hid/i2c-hid-core.c
index dbd04492825d..66950f472122 100644
--- a/drivers/hid/i2c-hid/i2c-hid-core.c
+++ b/drivers/hid/i2c-hid/i2c-hid-core.c
@@ -476,11 +476,14 @@ static void i2c_hid_get_input(struct i2c_hid *ihid)
int ret;
u32 ret_size;
int size = le16_to_cpu(ihid->hdesc.wMaxInputLength);
+   u16 flags;
 
if (size > ihid->bufsize)
size = ihid->bufsize;
 
-   ret = i2c_master_recv(ihid->client, ihid->inbuf, size);
+   /* Try to do a block read if the size fits in one byte */
+   flags = size > 255 ? I2C_M_RD : I2C_M_RD | I2C_M_RECV_LEN;
+   ret = i2c_transfer_buffer_flags(ihid->client, ihid->inbuf, size, flags);
if (ret != size) {
if (ret < 0)
return;
-- 
2.28.0



Re: [PATCH v3] i2c: Squash of SMBus block read patchset to save power

2020-09-15 Thread Sultan Alsawaf
On Tue, Sep 15, 2020 at 02:55:48PM +0300, Jarkko Nikula wrote:
> Hi
> 
> On 9/14/20 3:15 AM, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf 
> > 
> > This is a squash of the following:
> > 
> > i2c: designware: Fix transfer failures for invalid SMBus block reads
> > 
> > SMBus block reads can be broken because the read function will just skip
> > over bytes it doesn't like until reaching a byte that conforms to the
> > length restrictions for block reads. This is problematic when it isn't
> > known if the incoming payload is indeed a conforming block read.
> > 
> > According to the SMBus specification, block reads will only send the
> > payload length in the first byte, so we can fix this by only considering
> > the first byte in a sequence for block read length purposes.
> > 
> > In addition, when the length byte is invalid, the original transfer
> > length still needs to be adjusted to avoid a controller timeout.
> > 
> > Fixes: c3ae106050b9 ("i2c: designware: Implement support for SMBus block 
> > read and write")
> > Signed-off-by: Sultan Alsawaf 
> > 
> > i2c: designware: Ensure tx_buf_len is nonzero for SMBus block reads
> > 
> > The point of adding a byte to len in i2c_dw_recv_len() is to make sure
> > that tx_buf_len is nonzero, so that i2c_dw_xfer_msg() can let the i2c
> > controller know that the i2c transaction can end. Otherwise, the i2c
> > controller will think that the transaction can never end for block
> > reads, which results in the stop-detection bit never being set and thus
> > the transaction timing out.
> > 
> > Adding a byte to len is not a reliable way to do this though; sometimes
> > it lets tx_buf_len become zero, which results in the scenario described
> > above. Therefore, just directly ensure tx_buf_len cannot be zero to fix
> > the issue.
> > 
> > Fixes: c3ae106050b9 ("i2c: designware: Implement support for SMBus block 
> > read and write")
> > Signed-off-by: Sultan Alsawaf 
> > 
> > i2c: designware: Allow SMBus block reads up to 255 bytes in length
> > 
> > According to the SMBus 3.0 protocol specification, block transfer limits
> > were increased from 32 bytes to 255 bytes. Remove the obsolete 32-byte
> > limitation.
> > 
> > Signed-off-by: Sultan Alsawaf 
> > 
> > HID: i2c-hid: Use block reads when possible to save power
> > 
> > We have no way of knowing how large an incoming payload is going to be,
> > so the only strategy available up until now has been to always retrieve
> > the maximum possible report length over i2c, which can be quite
> > inefficient. For devices that send reports in block read format, the i2c
> > controller driver can read the payload length on the fly and terminate
> > the i2c transaction early, resulting in considerable power savings.
> > 
> > On a Dell Precision 15 5540 with an i9-9880H, resting my finger on the
> > touchpad causes psys power readings to go up by about 4W and hover there
> > until I remove my finger. With this patch, my psys readings go from 4.7W
> > down to 3.1W, yielding about 1.6W in savings. This is because my
> > touchpad's max report length is 60 bytes, but all of the regular reports
> > it sends for touch events are only 32 bytes, so the i2c transfer is
> > roughly halved for the common case.
> > 
> > Signed-off-by: Sultan Alsawaf 
> > ---
> > Hi Jarkko,
> > 
> > Sorry for the delayed response. Life gets in the way of the things that 
> > really
> > matter, like kernel hacking ;)
> > 
> > I fixed the issue with the i2c block reads on 5.8. I've squashed all 4 of 
> > my i2c
> > commits into this email for simplicity; please apply this patch on either 
> > 5.8 or
> > 5.9 (it applies cleanly to both) and let me know if it works with your 
> > i2c-hid
> > touchscreen. If all is well, I will resubmit these patches individually in 
> > one
> > patchset, in a new thread.
> > 
> I tested this on top of fc4f28bb3daf ("Merge tag 'for-5.9-rc5-tag' of
> git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux") and seems to be
> working fine. What was the key change compared to previous version that was
> regressing for me?

This change fixed your issue (and my issue with 5.8):
--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -395,8 +395,9 @@ i2c_dw_recv_len(struct dw_i2c_dev *dev, u8 len)
 * Adjust the buffer length and mask the flag
 * after receiving the first byte.
 */
-   len += (flags & I2C_CLIEN

Re: [PATCH 1/2] locking/mutex: Don't hog RCU read lock while optimistically spinning

2020-09-13 Thread Sultan Alsawaf
On Mon, Sep 07, 2020 at 05:20:31PM +0100, Will Deacon wrote:
> On Fri, Aug 07, 2020 at 12:16:35PM -0700, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf 
> > 
> > There's no reason to hold an RCU read lock the entire time while
> > optimistically spinning for a mutex lock. This can needlessly lengthen
> > RCU grace periods and slow down synchronize_rcu() when it doesn't brute
> > force the RCU grace period via rcupdate.rcu_expedited=1.
> 
> Would be good to demonstrate this with numbers if you can.

I could simulate the worst possible case, which would stall synchronize_rcu() by
one jiffy, which could be 10ms with CONFIG_HZ=100. The way that would happen is
when the mutex owner does a lot of non-sleeping work while the lock is held, and
while another CPU tries to acquire the lock.

This is a dummy example of the scenario I have in mind:
  CPU0 CPU1
--
mutex_lock(locky)
mdelay(100)  mutex_lock(locky)
mutex_unlock(locky)

In this case, CPU1 could spin in mutex_lock() for up to a jiffy (until CPU0
releases locky, which won't happen for 100ms, or until CPU1's task needs to
reschedule). While the spinning occurs, the RCU read lock will be held the whole
time, and then synchronize_rcu() will be stalled.

One could argue that most mutex-locked critical sections probably wouldn't spend
so long working on something without scheduling (at least, not intentionally),
but on slower SMP CPUs I suspect that this is common.

> > Signed-off-by: Sultan Alsawaf 
> > ---
> >  kernel/locking/mutex.c | 25 +
> >  1 file changed, 17 insertions(+), 8 deletions(-)
> > 
> > diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> > index 5352ce50a97e..cc5676712458 100644
> > --- a/kernel/locking/mutex.c
> > +++ b/kernel/locking/mutex.c
> > @@ -552,21 +552,31 @@ bool mutex_spin_on_owner(struct mutex *lock, struct 
> > task_struct *owner,
> >  {
> > bool ret = true;
> >  
> > -   rcu_read_lock();
> > -   while (__mutex_owner(lock) == owner) {
> > +   for (;;) {
> > +   unsigned int cpu;
> > +   bool same_owner;
> > +
> > /*
> > -* Ensure we emit the owner->on_cpu, dereference _after_
> > -* checking lock->owner still matches owner. If that fails,
> > +* Ensure lock->owner still matches owner. If that fails,
> >  * owner might point to freed memory. If it still matches,
> >  * the rcu_read_lock() ensures the memory stays valid.
> >  */
> > -   barrier();
> > +   rcu_read_lock();
> > +   same_owner = __mutex_owner(lock) == owner;
> > +   if (same_owner) {
> > +   ret = owner->on_cpu;
> > +   if (ret)
> > +   cpu = task_cpu(owner);
> > +   }
> > +   rcu_read_unlock();
> 
> Are you sure this doesn't break the ww mutex spinning? That thing also goes
> and looks at the owner, but now it's called outside of the read-side
> critical section.

Yes, it's safe because it's not dereferencing the owner pointer.

Sultan


[PATCH v3] i2c: Squash of SMBus block read patchset to save power

2020-09-13 Thread Sultan Alsawaf
From: Sultan Alsawaf 

This is a squash of the following:

i2c: designware: Fix transfer failures for invalid SMBus block reads

SMBus block reads can be broken because the read function will just skip
over bytes it doesn't like until reaching a byte that conforms to the
length restrictions for block reads. This is problematic when it isn't
known if the incoming payload is indeed a conforming block read.

According to the SMBus specification, block reads will only send the
payload length in the first byte, so we can fix this by only considering
the first byte in a sequence for block read length purposes.

In addition, when the length byte is invalid, the original transfer
length still needs to be adjusted to avoid a controller timeout.

Fixes: c3ae106050b9 ("i2c: designware: Implement support for SMBus block read 
and write")
Signed-off-by: Sultan Alsawaf 

i2c: designware: Ensure tx_buf_len is nonzero for SMBus block reads

The point of adding a byte to len in i2c_dw_recv_len() is to make sure
that tx_buf_len is nonzero, so that i2c_dw_xfer_msg() can let the i2c
controller know that the i2c transaction can end. Otherwise, the i2c
controller will think that the transaction can never end for block
reads, which results in the stop-detection bit never being set and thus
the transaction timing out.

Adding a byte to len is not a reliable way to do this though; sometimes
it lets tx_buf_len become zero, which results in the scenario described
above. Therefore, just directly ensure tx_buf_len cannot be zero to fix
the issue.

Fixes: c3ae106050b9 ("i2c: designware: Implement support for SMBus block read 
and write")
Signed-off-by: Sultan Alsawaf 

i2c: designware: Allow SMBus block reads up to 255 bytes in length

According to the SMBus 3.0 protocol specification, block transfer limits
were increased from 32 bytes to 255 bytes. Remove the obsolete 32-byte
limitation.

Signed-off-by: Sultan Alsawaf 

HID: i2c-hid: Use block reads when possible to save power

We have no way of knowing how large an incoming payload is going to be,
so the only strategy available up until now has been to always retrieve
the maximum possible report length over i2c, which can be quite
inefficient. For devices that send reports in block read format, the i2c
controller driver can read the payload length on the fly and terminate
the i2c transaction early, resulting in considerable power savings.

On a Dell Precision 15 5540 with an i9-9880H, resting my finger on the
touchpad causes psys power readings to go up by about 4W and hover there
until I remove my finger. With this patch, my psys readings go from 4.7W
down to 3.1W, yielding about 1.6W in savings. This is because my
touchpad's max report length is 60 bytes, but all of the regular reports
it sends for touch events are only 32 bytes, so the i2c transfer is
roughly halved for the common case.

Signed-off-by: Sultan Alsawaf 
---
Hi Jarkko,

Sorry for the delayed response. Life gets in the way of the things that really
matter, like kernel hacking ;)

I fixed the issue with the i2c block reads on 5.8. I've squashed all 4 of my i2c
commits into this email for simplicity; please apply this patch on either 5.8 or
5.9 (it applies cleanly to both) and let me know if it works with your i2c-hid
touchscreen. If all is well, I will resubmit these patches individually in one
patchset, in a new thread.

Thanks,
Sultan
 drivers/hid/i2c-hid/i2c-hid-core.c |  5 -
 drivers/i2c/busses/i2c-designware-master.c | 15 +--
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/hid/i2c-hid/i2c-hid-core.c 
b/drivers/hid/i2c-hid/i2c-hid-core.c
index dbd04492825d..66950f472122 100644
--- a/drivers/hid/i2c-hid/i2c-hid-core.c
+++ b/drivers/hid/i2c-hid/i2c-hid-core.c
@@ -476,11 +476,14 @@ static void i2c_hid_get_input(struct i2c_hid *ihid)
int ret;
u32 ret_size;
int size = le16_to_cpu(ihid->hdesc.wMaxInputLength);
+   u16 flags;
 
if (size > ihid->bufsize)
size = ihid->bufsize;
 
-   ret = i2c_master_recv(ihid->client, ihid->inbuf, size);
+   /* Try to do a block read if the size fits in one byte */
+   flags = size > 255 ? I2C_M_RD : I2C_M_RD | I2C_M_RECV_LEN;
+   ret = i2c_transfer_buffer_flags(ihid->client, ihid->inbuf, size, flags);
if (ret != size) {
if (ret < 0)
return;
diff --git a/drivers/i2c/busses/i2c-designware-master.c 
b/drivers/i2c/busses/i2c-designware-master.c
index d6425ad6e6a3..5bd64bd17d94 100644
--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -395,8 +395,9 @@ i2c_dw_recv_len(struct dw_i2c_dev *dev, u8 len)
 * Adjust the buffer length and mask the flag
 * after receiving the first byte.
 */
-   len += (flags & I2C_CLIENT_PEC) ? 2 : 1;
-   dev->tx_buf_len = len - min_t(u8, len, dev->rx_outstandi

Re: [PATCH] x86/msr: do not warn on writes to OC_MAILBOX

2020-09-08 Thread Sultan Alsawaf
On Tue, Sep 08, 2020 at 08:01:12PM +0200, Borislav Petkov wrote:
> On Tue, Sep 08, 2020 at 07:42:12PM +0200, Jason A. Donenfeld wrote:
> > Are you prepared to track down all the MSRs that might maybe do
> > something naughty?
> 
> I'm not prepared - that's why this MSR filtering. To block *all* direct
> MSR accesses from userspace in the future.
> 
> > Does `dd` warn when you run `dd if=/dev/zero of=/dev/sda`?
> 
> Yah, because that's the same as bricking your hardware. Geez.
> 
> > Probably not possible. Optimal values are related to the "silicon
> > lottery" that occurs when you buy a new CPU. Different optimal values
> > for different individual chips.
> 
> Let's wait for what Srinivas finds out. I'd let Intel decide what they
> wanna do.
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

I'd like to point out that on Intel's recent 14nm parts, undervolting is not so
much for squeezing every last drop of performance out of the SoC as it is for
necessity.

I have an i9-9880H laptop which, at 4000 MHz on all CPUs while compiling a
kernel, uses 107W with the stock configuration. Upon undervolting, this figure
goes down to 88W, a ~22% reduction in power consumption at the maximum pstate.
The point of this is not primarily to reduce the power footprint of my machine,
but to get the thermals of this CPU under control. Without undervolting, my
i9-9880H is quite useless, and quickly reaches 100 degC with the fans at max
speed when placed under load.

The 22% reduction in power consumption also makes it clear that these parts use
very heavy-handed voltages out-of-the-box. I consider this to be a defect in the
part; I've never encountered any other CPUs that run, by default, at a voltage
so high that I could achieve a 22% reduction in power consumption by
undervolting.

I've tested multiple 14nm Intel CPUs, 8th gen and above, and they all exhibit
this behavior. On my i5-8265U laptop, compiling a kernel causes the CPU to hit
its 35W power limit and run at about 3300 MHz instead of its maximum all-core
frequency of 3700 MHz. After undervolting, when compiling a kernel, my i5-8265U
runs at 3700 MHz and uses 27W; not only was power consumption reduced by ~30%,
but also performance was improved. Not to mention, my i5-8265U laptop's thermal
design cannot dissipate 35W for very long, but it can sustain 27W without
reaching 100 degC quite well.

If Intel's recent 14nm CPUs didn't guzzle power, then we would have no need to
undervolt. But they do, and here we are.

Sultan


[PATCH 1/2] locking/mutex: Don't hog RCU read lock while optimistically spinning

2020-08-07 Thread Sultan Alsawaf
From: Sultan Alsawaf 

There's no reason to hold an RCU read lock the entire time while
optimistically spinning for a mutex lock. This can needlessly lengthen
RCU grace periods and slow down synchronize_rcu() when it doesn't brute
force the RCU grace period via rcupdate.rcu_expedited=1.

Signed-off-by: Sultan Alsawaf 
---
 kernel/locking/mutex.c | 25 +
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 5352ce50a97e..cc5676712458 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -552,21 +552,31 @@ bool mutex_spin_on_owner(struct mutex *lock, struct 
task_struct *owner,
 {
bool ret = true;
 
-   rcu_read_lock();
-   while (__mutex_owner(lock) == owner) {
+   for (;;) {
+   unsigned int cpu;
+   bool same_owner;
+
/*
-* Ensure we emit the owner->on_cpu, dereference _after_
-* checking lock->owner still matches owner. If that fails,
+* Ensure lock->owner still matches owner. If that fails,
 * owner might point to freed memory. If it still matches,
 * the rcu_read_lock() ensures the memory stays valid.
 */
-   barrier();
+   rcu_read_lock();
+   same_owner = __mutex_owner(lock) == owner;
+   if (same_owner) {
+   ret = owner->on_cpu;
+   if (ret)
+   cpu = task_cpu(owner);
+   }
+   rcu_read_unlock();
+
+   if (!ret || !same_owner)
+   break;
 
/*
 * Use vcpu_is_preempted to detect lock holder preemption issue.
 */
-   if (!owner->on_cpu || need_resched() ||
-   vcpu_is_preempted(task_cpu(owner))) {
+   if (need_resched() || vcpu_is_preempted(cpu)) {
ret = false;
break;
}
@@ -578,7 +588,6 @@ bool mutex_spin_on_owner(struct mutex *lock, struct 
task_struct *owner,
 
cpu_relax();
}
-   rcu_read_unlock();
 
return ret;
 }
-- 
2.28.0



[PATCH 2/2] locking/rwsem: Don't hog RCU read lock while optimistically spinning

2020-08-07 Thread Sultan Alsawaf
From: Sultan Alsawaf 

There's no reason to hold an RCU read lock the entire time while
optimistically spinning for a rwsem. This can needlessly lengthen RCU
grace periods and slow down synchronize_rcu() when it doesn't brute
force the RCU grace period via rcupdate.rcu_expedited=1.

Signed-off-by: Sultan Alsawaf 
---
 kernel/locking/rwsem.c | 27 +--
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index f11b9bd3431d..a1e3ceb254d1 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -723,8 +723,10 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned 
long nonspinnable)
if (state != OWNER_WRITER)
return state;
 
-   rcu_read_lock();
for (;;) {
+   bool same_owner;
+
+   rcu_read_lock();
/*
 * When a waiting writer set the handoff flag, it may spin
 * on the owner as well. Once that writer acquires the lock,
@@ -732,27 +734,32 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned 
long nonspinnable)
 * handoff bit is set.
 */
new = rwsem_owner_flags(sem, _flags);
-   if ((new != owner) || (new_flags != flags)) {
-   state = rwsem_owner_state(new, new_flags, nonspinnable);
-   break;
-   }
 
/*
-* Ensure we emit the owner->on_cpu, dereference _after_
-* checking sem->owner still matches owner, if that fails,
+* Ensure sem->owner still matches owner. If that fails,
 * owner might point to free()d memory, if it still matches,
 * the rcu_read_lock() ensures the memory stays valid.
 */
-   barrier();
+   same_owner = new == owner && new_flags == flags;
+   if (same_owner && !owner_on_cpu(owner))
+   state = OWNER_NONSPINNABLE;
+   rcu_read_unlock();
 
-   if (need_resched() || !owner_on_cpu(owner)) {
+   if (!same_owner) {
+   state = rwsem_owner_state(new, new_flags, nonspinnable);
+   break;
+   }
+
+   if (state == OWNER_NONSPINNABLE)
+   break;
+
+   if (need_resched()) {
state = OWNER_NONSPINNABLE;
break;
}
 
cpu_relax();
}
-   rcu_read_unlock();
 
return state;
 }
-- 
2.28.0



Re: [PATCH v2] HID: i2c-hid: Use block reads when possible to save power

2020-07-01 Thread Sultan Alsawaf
On Wed, Jul 01, 2020 at 11:04:01AM +0300, Jarkko Nikula wrote:
> On 6/29/20 8:43 PM, Sultan Alsawaf wrote:
> > Hmm, for some reason in 5.8 I get the same problem, but 5.7 is fine. Could 
> > you
> > try this on 5.7 and see if it works?
> > 
> > In the meantime I'll bisect 5.8 to see why it's causing problems for me...
> > 
> I see the same issue on top of v5.7:

Try reverting my "i2c: designware: Only check the first byte for SMBus block
read length" patch and apply the following change instead:

--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -394,10 +394,12 @@ i2c_dw_read(struct dw_i2c_dev *dev)
u32 flags = msgs[dev->msg_read_idx].flags;
 
*buf = dw_readl(dev, DW_IC_DATA_CMD);
-   /* Ensure length byte is a valid value */
-   if (flags & I2C_M_RECV_LEN &&
-   *buf <= I2C_SMBUS_BLOCK_MAX && *buf > 0) {
-   len = i2c_dw_recv_len(dev, *buf);
+   if (flags & I2C_M_RECV_LEN) {
+   /* Ensure length byte is a valid value */
+   if (*buf <= I2C_SMBUS_BLOCK_MAX && *buf > 0)
+   len = i2c_dw_recv_len(dev, *buf);
+   else
+   len = i2c_dw_recv_len(dev, len);
}
buf++;
dev->rx_outstanding--;


Re: [PATCH v2] HID: i2c-hid: Use block reads when possible to save power

2020-06-29 Thread Sultan Alsawaf
On Wed, Jun 17, 2020 at 02:17:19PM +0300, Jarkko Nikula wrote:
> On 6/16/20 6:49 PM, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf 
> > 
> > We have no way of knowing how large an incoming payload is going to be,
> > so the only strategy available up until now has been to always retrieve
> > the maximum possible report length over i2c, which can be quite
> > inefficient. For devices that send reports in block read format, the i2c
> > controller driver can read the payload length on the fly and terminate
> > the i2c transaction early, resulting in considerable power savings.
> > 
> > On a Dell Precision 15 5540 with an i9-9880H, resting my finger on the
> > touchpad causes psys power readings to go up by about 4W and hover there
> > until I remove my finger. With this patch, my psys readings go from 4.7W
> > down to 3.1W, yielding about 1.6W in savings. This is because my
> > touchpad's max report length is 60 bytes, but all of the regular reports
> > it sends for touch events are only 32 bytes, so the i2c transfer is
> > roughly halved for the common case.
> > 
> > Signed-off-by: Sultan Alsawaf 
> > ---
> > Jarkko, could you try this?
> >   drivers/hid/i2c-hid/i2c-hid-core.c | 5 -
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/hid/i2c-hid/i2c-hid-core.c 
> > b/drivers/hid/i2c-hid/i2c-hid-core.c
> > index 294c84e136d7..739dccfc57e1 100644
> > --- a/drivers/hid/i2c-hid/i2c-hid-core.c
> > +++ b/drivers/hid/i2c-hid/i2c-hid-core.c
> > @@ -472,11 +472,14 @@ static void i2c_hid_get_input(struct i2c_hid *ihid)
> > int ret;
> > u32 ret_size;
> > int size = le16_to_cpu(ihid->hdesc.wMaxInputLength);
> > +   u16 flags;
> > if (size > ihid->bufsize)
> > size = ihid->bufsize;
> > -   ret = i2c_master_recv(ihid->client, ihid->inbuf, size);
> > +   /* Try to do a block read if the size fits in one byte */
> > +   flags = size > 255 ? I2C_M_RD : I2C_M_RD | I2C_M_RECV_LEN;
> > +   ret = i2c_transfer_buffer_flags(ihid->client, ihid->inbuf, size, flags);
> > if (ret != size) {
> > if (ret < 0)
> > return;
> 
> This still causes a regression for me.

Hmm, for some reason in 5.8 I get the same problem, but 5.7 is fine. Could you
try this on 5.7 and see if it works?

In the meantime I'll bisect 5.8 to see why it's causing problems for me...

Sultan


Re: [PATCH v2] HID: i2c-hid: Use block reads when possible to save power

2020-06-16 Thread Sultan Alsawaf
On Tue, Jun 16, 2020 at 09:02:54PM +0300, Andi Shyti wrote:
> Hi Sultan,
> 
> > > > > so the only strategy available up until now has been to always 
> > > > > retrieve
> > > > > the maximum possible report length over i2c, which can be quite
> > > > > inefficient. For devices that send reports in block read format, the 
> > > > > i2c
> > > > > controller driver can read the payload length on the fly and terminate
> > > > > the i2c transaction early, resulting in considerable power savings.
> > > > > 
> > > > > On a Dell Precision 15 5540 with an i9-9880H, resting my finger on the
> > > > > touchpad causes psys power readings to go up by about 4W and hover 
> > > > > there
> > > > > until I remove my finger. With this patch, my psys readings go from 
> > > > > 4.7W
> > > > > down to 3.1W, yielding about 1.6W in savings. This is because my
> > > > > touchpad's max report length is 60 bytes, but all of the regular 
> > > > > reports
> > > > > it sends for touch events are only 32 bytes, so the i2c transfer is
> > > > > roughly halved for the common case.
> > > > 
> > > > > + /* Try to do a block read if the size fits in one byte */
> > > > > + flags = size > 255 ? I2C_M_RD : I2C_M_RD | I2C_M_RECV_LEN;
> > > > 
> > > > AFAIR SMBus specification tells about 256. Why 255?
> > > > 
> > > > Andi, am I correct?
> > > 
> > > Actually the SMBUS 3.0 protocol from 2015[*] says 255:
> > > 
> > > "
> > > D.6 255 Bytes in Process Call
> > > 
> > > The maximum number of bytes allowed in the Block Write-Block Read
> > > Process Call (Section 6.5.8) was increased from 32 to 255.
> > > "
> > > 
> > > But why does it matter... I see the patch is detatching itself
> > > from smbus.
> > > 
> > > And, actually, I wonder if this is the right way to fix it, isn't
> > > it better to fix smbus instead?
> > 
> > I think the best solution would be to modify the i2c api to allow passing 
> > in a
> > function pointer and a payload size length, to specify how to interpret the 
> > size
> > of the incoming payload, so the adapter could handle both the HID over i2c
> > transfer spec and SMBus block reads without needing to read more bytes than
> > needed.
> 
> Can't you do that by specifying the xfer function?
> 
> When you use smbus_read/write in block or byte or whatever, smbus
> always checks if there is an xfer function specified and uses
> that.
> 
> If it's not specified it uses the default smbus functions with
> the limitations that come with it.

The xfer functions are specified on a per-adapter basis. In the case of i2c-hid,
we need to tell the adapter to interpret the payload size in a specific way,
which I *think* is only specific to HID over i2c (i.e., using 16 bits to store
the length and then checking it for device quirks).

> > For example, for an SMBus block read, the payload size is specified in the 
> > first
> > byte and it is limited to 32 bytes. However, for HID over i2c, the payload 
> > size
> > is specified in the first two bytes, and there are also some device quirks
> > involved to reinterpret the reported size.
> 
> which is wrong. The 32 bytes limitation is outdated: in the link
> that I gave before (i.e.  this one [*]), the new SMBUS specifies
> 255 maximum for read/write block.

Oops. But still, for SMBus block reads, the size is limited to 8 bits. For HID
over i2c, it can be 16 bits. I don't see how we can handle this without some api
cooperation to tell the adapter what the caller is expecting to see.

> > A nice solution would be to pass in how many bytes the i2c payload size can
> > contain, as well as a function pointer to evaluate the reported payload 
> > size in
> > a way that the caller wants. This would require modifying every i2c adapter
> > driver to add this functionality, but it would fix the efficiency problem 
> > faced
> > by i2c-hid and perhaps others.
> > 
> > > I have a patch ready that fixes the smbus transfer size, perhaps
> > > I should rebase, test and send it.
> > 
> > For the i2c-hid driver?
> 
> No, sorry, for smbus.
> 
> Now... here you are replacing "i2c_master_recv" with
> "i2c_transfer_buffer_flags". I do not really like this change,
> although I understand it's necessary, because we are bypassing
> the real issue that is that the smbus implementation is outdated.
> 
> I have a patch for that that for a matter of time I never sent.

Can it handle block reads (8 bit size) and HID over i2c (16 bit size)?

Sultan


Re: [PATCH v2] HID: i2c-hid: Use block reads when possible to save power

2020-06-16 Thread Sultan Alsawaf
On Tue, Jun 16, 2020 at 08:18:54PM +0300, Andi Shyti wrote:
> Hi Andy,
> 
> > > so the only strategy available up until now has been to always retrieve
> > > the maximum possible report length over i2c, which can be quite
> > > inefficient. For devices that send reports in block read format, the i2c
> > > controller driver can read the payload length on the fly and terminate
> > > the i2c transaction early, resulting in considerable power savings.
> > > 
> > > On a Dell Precision 15 5540 with an i9-9880H, resting my finger on the
> > > touchpad causes psys power readings to go up by about 4W and hover there
> > > until I remove my finger. With this patch, my psys readings go from 4.7W
> > > down to 3.1W, yielding about 1.6W in savings. This is because my
> > > touchpad's max report length is 60 bytes, but all of the regular reports
> > > it sends for touch events are only 32 bytes, so the i2c transfer is
> > > roughly halved for the common case.
> > 
> > > + /* Try to do a block read if the size fits in one byte */
> > > + flags = size > 255 ? I2C_M_RD : I2C_M_RD | I2C_M_RECV_LEN;
> > 
> > AFAIR SMBus specification tells about 256. Why 255?
> > 
> > Andi, am I correct?
> 
> Actually the SMBUS 3.0 protocol from 2015[*] says 255:
> 
> "
> D.6 255 Bytes in Process Call
> 
> The maximum number of bytes allowed in the Block Write-Block Read
> Process Call (Section 6.5.8) was increased from 32 to 255.
> "
> 
> But why does it matter... I see the patch is detatching itself
> from smbus.
> 
> And, actually, I wonder if this is the right way to fix it, isn't
> it better to fix smbus instead?

I think the best solution would be to modify the i2c api to allow passing in a
function pointer and a payload size length, to specify how to interpret the size
of the incoming payload, so the adapter could handle both the HID over i2c
transfer spec and SMBus block reads without needing to read more bytes than
needed.

For example, for an SMBus block read, the payload size is specified in the first
byte and it is limited to 32 bytes. However, for HID over i2c, the payload size
is specified in the first two bytes, and there are also some device quirks
involved to reinterpret the reported size.

A nice solution would be to pass in how many bytes the i2c payload size can
contain, as well as a function pointer to evaluate the reported payload size in
a way that the caller wants. This would require modifying every i2c adapter
driver to add this functionality, but it would fix the efficiency problem faced
by i2c-hid and perhaps others.

> I have a patch ready that fixes the smbus transfer size, perhaps
> I should rebase, test and send it.

For the i2c-hid driver?

Sultan


[PATCH v2] HID: i2c-hid: Use block reads when possible to save power

2020-06-16 Thread Sultan Alsawaf
From: Sultan Alsawaf 

We have no way of knowing how large an incoming payload is going to be,
so the only strategy available up until now has been to always retrieve
the maximum possible report length over i2c, which can be quite
inefficient. For devices that send reports in block read format, the i2c
controller driver can read the payload length on the fly and terminate
the i2c transaction early, resulting in considerable power savings.

On a Dell Precision 15 5540 with an i9-9880H, resting my finger on the
touchpad causes psys power readings to go up by about 4W and hover there
until I remove my finger. With this patch, my psys readings go from 4.7W
down to 3.1W, yielding about 1.6W in savings. This is because my
touchpad's max report length is 60 bytes, but all of the regular reports
it sends for touch events are only 32 bytes, so the i2c transfer is
roughly halved for the common case.

Signed-off-by: Sultan Alsawaf 
---
Jarkko, could you try this?
 drivers/hid/i2c-hid/i2c-hid-core.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/hid/i2c-hid/i2c-hid-core.c 
b/drivers/hid/i2c-hid/i2c-hid-core.c
index 294c84e136d7..739dccfc57e1 100644
--- a/drivers/hid/i2c-hid/i2c-hid-core.c
+++ b/drivers/hid/i2c-hid/i2c-hid-core.c
@@ -472,11 +472,14 @@ static void i2c_hid_get_input(struct i2c_hid *ihid)
int ret;
u32 ret_size;
int size = le16_to_cpu(ihid->hdesc.wMaxInputLength);
+   u16 flags;
 
if (size > ihid->bufsize)
size = ihid->bufsize;
 
-   ret = i2c_master_recv(ihid->client, ihid->inbuf, size);
+   /* Try to do a block read if the size fits in one byte */
+   flags = size > 255 ? I2C_M_RD : I2C_M_RD | I2C_M_RECV_LEN;
+   ret = i2c_transfer_buffer_flags(ihid->client, ihid->inbuf, size, flags);
if (ret != size) {
if (ret < 0)
return;
-- 
2.27.0



[PATCH v2] i2c: designware: Only check the first byte for SMBus block read length

2020-06-16 Thread Sultan Alsawaf
From: Sultan Alsawaf 

SMBus block reads can be broken because the read function will just skip
over bytes it doesn't like until reaching a byte that conforms to the
length restrictions for block reads. This is problematic when it isn't
known if the incoming payload is indeed a conforming block read.

According to the SMBus specification, block reads will only send the
payload length in the first byte, so we can fix this by only considering
the first byte in a sequence for block read length purposes.

Fixes: c3ae106050b9 ("i2c: designware: Implement support for SMBus block read 
and write")
Signed-off-by: Sultan Alsawaf 
---
 drivers/i2c/busses/i2c-designware-master.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/i2c/busses/i2c-designware-master.c 
b/drivers/i2c/busses/i2c-designware-master.c
index d6425ad6e6a3..d22271438869 100644
--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -391,14 +391,10 @@ i2c_dw_recv_len(struct dw_i2c_dev *dev, u8 len)
struct i2c_msg *msgs = dev->msgs;
u32 flags = msgs[dev->msg_read_idx].flags;
 
-   /*
-* Adjust the buffer length and mask the flag
-* after receiving the first byte.
-*/
+   /* Adjust the buffer length */
len += (flags & I2C_CLIENT_PEC) ? 2 : 1;
dev->tx_buf_len = len - min_t(u8, len, dev->rx_outstanding);
msgs[dev->msg_read_idx].len = len;
-   msgs[dev->msg_read_idx].flags &= ~I2C_M_RECV_LEN;
 
return len;
 }
@@ -430,10 +426,12 @@ i2c_dw_read(struct dw_i2c_dev *dev)
u32 flags = msgs[dev->msg_read_idx].flags;
 
regmap_read(dev->map, DW_IC_DATA_CMD, );
-   /* Ensure length byte is a valid value */
-   if (flags & I2C_M_RECV_LEN &&
-   tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0) {
-   len = i2c_dw_recv_len(dev, tmp);
+   if (flags & I2C_M_RECV_LEN) {
+   /* Ensure length byte is a valid value */
+   if (tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0)
+   len = i2c_dw_recv_len(dev, tmp);
+   /* Mask the flag after receiving the first byte 
*/
+   msgs[dev->msg_read_idx].flags &= 
~I2C_M_RECV_LEN;
}
*buf++ = tmp;
dev->rx_outstanding--;
-- 
2.27.0



Re: [PATCH 1/2] i2c: designware: Only check the first byte for SMBus block read length

2020-06-15 Thread Sultan Alsawaf
On Mon, Jun 15, 2020 at 07:07:42PM +0300, Andy Shevchenko wrote:
> On Mon, Jun 15, 2020 at 7:06 PM Sultan Alsawaf  wrote:
> > On Mon, Jun 15, 2020 at 12:40:19PM +0300, Andy Shevchenko wrote:
> > > On Sun, Jun 14, 2020 at 02:02:54PM -0700, Sultan Alsawaf wrote:
> > > > From: Sultan Alsawaf 
> > > >
> > > > SMBus block reads can be broken because the read function will just skip
> > > > over bytes it doesn't like until reaching a byte that conforms to the
> > > > length restrictions for block reads. This is problematic when it isn't
> > > > known if the incoming payload is indeed a conforming block read.
> > > >
> > > > According to the SMBus specification, block reads will only send the
> > > > payload length in the first byte, so we can fix this by only considering
> > > > the first byte in a sequence for block read length purposes.
> > >
> > > I'm wondering if this overlaps with [1]. AFAIU that one is also makes 
> > > sure that
> > > the length is not a garbage.
> > >
> > > [1]: 
> > > https://lore.kernel.org/linux-i2c/20200613104109.2989-1-m...@mansr.com/T/#u
> >
> > No overlap.
> 
> Thanks for clarifying.
> 
> > That looks like a similar bug for a different driver. In my case,
> > the adapter provides native SMBus support, so emulation is never used. This 
> > is
> > clear to see by looking at i2c_transfer_buffer_flags(), which only uses the
> > master_xfer functions provided by the adapter; it doesn't call the emulation
> > path at all.
> 
> But do we get an advantage if this can be done in the i2c core instead
> (once for all)?

We can't, because the adapter driver needs to know mid-transfer to look for the
payload length in the first byte, and then alter the transfer size on-the-fly.
That can't be done in the i2c core, sadly. The problem is that we don't know if
a transfer is going to be a block read or not beforehand. And altering the
transfer size mid-transfer is definitely a controller specific task.

Sultan


Re: [PATCH 1/2] i2c: designware: Only check the first byte for SMBus block read length

2020-06-15 Thread Sultan Alsawaf
On Mon, Jun 15, 2020 at 12:40:19PM +0300, Andy Shevchenko wrote:
> On Sun, Jun 14, 2020 at 02:02:54PM -0700, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf 
> > 
> > SMBus block reads can be broken because the read function will just skip
> > over bytes it doesn't like until reaching a byte that conforms to the
> > length restrictions for block reads. This is problematic when it isn't
> > known if the incoming payload is indeed a conforming block read.
> > 
> > According to the SMBus specification, block reads will only send the
> > payload length in the first byte, so we can fix this by only considering
> > the first byte in a sequence for block read length purposes.
> 
> I'm wondering if this overlaps with [1]. AFAIU that one is also makes sure 
> that
> the length is not a garbage.
> 
> [1]: 
> https://lore.kernel.org/linux-i2c/20200613104109.2989-1-m...@mansr.com/T/#u

No overlap. That looks like a similar bug for a different driver. In my case,
the adapter provides native SMBus support, so emulation is never used. This is
clear to see by looking at i2c_transfer_buffer_flags(), which only uses the
master_xfer functions provided by the adapter; it doesn't call the emulation
path at all.

Sultan


[PATCH 0/2] i2c-hid: Save power by reducing i2c xfers with block reads

2020-06-14 Thread Sultan Alsawaf
From: Sultan Alsawaf 

Hi,

I noticed on my Dell Precision 15 5540 with an i9-9880H that simply putting my
finger on the touchpad would increase my system's power consumption by 4W, which
is quite considerable. Resting my finger on the touchpad would generate roughly
4000 i2c irqs per second, or roughly 20 i2c irqs per touchpad irq.

Upon closer inspection, I noticed that the i2c-hid driver would always transfer
the maximum report size over i2c (which is 60 bytes for my touchpad), but all of
my touchpad's normal touch events are only 32 bytes long according to the length
byte contained in the buffer sequence.

Therefore, I was able to save about 2W of power by passing the I2C_M_RECV_LEN
flag in i2c-hid, which says to look for the payload length in the first byte of
the transfer buffer and adjust the i2c transaction accordingly. The only problem
though is that my i2c controller's driver allows bytes other than the first one
to be used to retrieve the payload length, which is incorrect according to the
SMBus spec, and would break my i2c-hid change since not *all* of the reports
from my touchpad are conforming SMBus block reads.

This patchset fixes the I2C_M_RECV_LEN behavior in the designware i2c driver and
modifies i2c-hid to use I2C_M_RECV_LEN to save quite a bit of power. Even if the
peripheral controlled by i2c-hid doesn't support block reads, the i2c controller
drivers should cope with this and proceed with the i2c transfer using the
original requested length.

Sultan

Sultan Alsawaf (2):
  i2c: designware: Only check the first byte for SMBus block read length
  HID: i2c-hid: Use block reads when possible to save power

 drivers/hid/i2c-hid/i2c-hid-core.c |  3 ++-
 drivers/i2c/busses/i2c-designware-master.c | 10 +-
 2 files changed, 7 insertions(+), 6 deletions(-)

-- 
2.27.0



[PATCH 2/2] HID: i2c-hid: Use block reads when possible to save power

2020-06-14 Thread Sultan Alsawaf
From: Sultan Alsawaf 

We have no way of knowing how large an incoming payload is going to be,
so the only strategy available up until now has been to always retrieve
the maximum possible report length over i2c, which can be quite
inefficient. For devices that send reports in block read format, the i2c
controller driver can read the payload length on the fly and terminate
the i2c transaction early, resulting in considerable power savings.

On a Dell Precision 15 5540 with an i9-9880H, resting my finger on the
touchpad causes psys power readings to go up by about 4W and hover there
until I remove my finger. With this patch, my psys readings go from 4.7W
down to 3.1W, yielding about 1.6W in savings. This is because my
touchpad's max report length is 60 bytes, but all of the regular reports
it sends for touch events are only 32 bytes, so the i2c transfer is
roughly halved for the common case.

Signed-off-by: Sultan Alsawaf 
---
 drivers/hid/i2c-hid/i2c-hid-core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/hid/i2c-hid/i2c-hid-core.c 
b/drivers/hid/i2c-hid/i2c-hid-core.c
index 294c84e136d7..4b507de48d70 100644
--- a/drivers/hid/i2c-hid/i2c-hid-core.c
+++ b/drivers/hid/i2c-hid/i2c-hid-core.c
@@ -476,7 +476,8 @@ static void i2c_hid_get_input(struct i2c_hid *ihid)
if (size > ihid->bufsize)
size = ihid->bufsize;
 
-   ret = i2c_master_recv(ihid->client, ihid->inbuf, size);
+   ret = i2c_transfer_buffer_flags(ihid->client, ihid->inbuf, size,
+   I2C_M_RD | I2C_M_RECV_LEN);
if (ret != size) {
if (ret < 0)
return;
-- 
2.27.0



[PATCH 1/2] i2c: designware: Only check the first byte for SMBus block read length

2020-06-14 Thread Sultan Alsawaf
From: Sultan Alsawaf 

SMBus block reads can be broken because the read function will just skip
over bytes it doesn't like until reaching a byte that conforms to the
length restrictions for block reads. This is problematic when it isn't
known if the incoming payload is indeed a conforming block read.

According to the SMBus specification, block reads will only send the
payload length in the first byte, so we can fix this by only considering
the first byte in a sequence for block read length purposes.

Fixes: c3ae106050b9 ("i2c: designware: Implement support for SMBus block read 
and write")
Signed-off-by: Sultan Alsawaf 
---
 drivers/i2c/busses/i2c-designware-master.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/i2c/busses/i2c-designware-master.c 
b/drivers/i2c/busses/i2c-designware-master.c
index d6425ad6e6a3..16d38b8fc19a 100644
--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -398,7 +398,6 @@ i2c_dw_recv_len(struct dw_i2c_dev *dev, u8 len)
len += (flags & I2C_CLIENT_PEC) ? 2 : 1;
dev->tx_buf_len = len - min_t(u8, len, dev->rx_outstanding);
msgs[dev->msg_read_idx].len = len;
-   msgs[dev->msg_read_idx].flags &= ~I2C_M_RECV_LEN;
 
return len;
 }
@@ -430,10 +429,11 @@ i2c_dw_read(struct dw_i2c_dev *dev)
u32 flags = msgs[dev->msg_read_idx].flags;
 
regmap_read(dev->map, DW_IC_DATA_CMD, );
-   /* Ensure length byte is a valid value */
-   if (flags & I2C_M_RECV_LEN &&
-   tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0) {
-   len = i2c_dw_recv_len(dev, tmp);
+   if (flags & I2C_M_RECV_LEN) {
+   /* Ensure length byte is a valid value */
+   if (tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0)
+   len = i2c_dw_recv_len(dev, tmp);
+   msgs[dev->msg_read_idx].flags &= 
~I2C_M_RECV_LEN;
}
*buf++ = tmp;
dev->rx_outstanding--;
-- 
2.27.0



[PATCH v2] locking/Kconfig: Don't forcefully uninline spin_unlock for PREEMPT

2020-05-29 Thread Sultan Alsawaf
From: Sultan Alsawaf 

This change was originally done in 2005 without any justification in
commit bda98685b855 ("[PATCH] x86: inline spin_unlock if
!CONFIG_DEBUG_SPINLOCK and !CONFIG_PREEMPT"). Perhaps the reasoning at
the time was that PREEMPT was still considered unstable and needed extra
debugging; however, this is no longer the case, so remove the artificial
limitation.

Signed-off-by: Sultan Alsawaf 
---
 kernel/Kconfig.locks   | 10 +-
 kernel/Kconfig.preempt |  1 -
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 3de8fd11873b..2d992f5c6ec3 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -139,7 +139,7 @@ config INLINE_SPIN_UNLOCK_BH
 
 config INLINE_SPIN_UNLOCK_IRQ
def_bool y
-   depends on !PREEMPTION || ARCH_INLINE_SPIN_UNLOCK_IRQ
+   depends on ARCH_INLINE_SPIN_UNLOCK_IRQ
 
 config INLINE_SPIN_UNLOCK_IRQRESTORE
def_bool y
@@ -168,7 +168,7 @@ config INLINE_READ_LOCK_IRQSAVE
 
 config INLINE_READ_UNLOCK
def_bool y
-   depends on !PREEMPTION || ARCH_INLINE_READ_UNLOCK
+   depends on ARCH_INLINE_READ_UNLOCK
 
 config INLINE_READ_UNLOCK_BH
def_bool y
@@ -176,7 +176,7 @@ config INLINE_READ_UNLOCK_BH
 
 config INLINE_READ_UNLOCK_IRQ
def_bool y
-   depends on !PREEMPTION || ARCH_INLINE_READ_UNLOCK_IRQ
+   depends on ARCH_INLINE_READ_UNLOCK_IRQ
 
 config INLINE_READ_UNLOCK_IRQRESTORE
def_bool y
@@ -205,7 +205,7 @@ config INLINE_WRITE_LOCK_IRQSAVE
 
 config INLINE_WRITE_UNLOCK
def_bool y
-   depends on !PREEMPTION || ARCH_INLINE_WRITE_UNLOCK
+   depends on ARCH_INLINE_WRITE_UNLOCK
 
 config INLINE_WRITE_UNLOCK_BH
def_bool y
@@ -213,7 +213,7 @@ config INLINE_WRITE_UNLOCK_BH
 
 config INLINE_WRITE_UNLOCK_IRQ
def_bool y
-   depends on !PREEMPTION || ARCH_INLINE_WRITE_UNLOCK_IRQ
+   depends on ARCH_INLINE_WRITE_UNLOCK_IRQ
 
 config INLINE_WRITE_UNLOCK_IRQRESTORE
def_bool y
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index bf82259cff96..5a9e0409c844 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -39,7 +39,6 @@ config PREEMPT
bool "Preemptible Kernel (Low-Latency Desktop)"
depends on !ARCH_NO_PREEMPT
select PREEMPTION
-   select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
help
  This option reduces the latency of the kernel by making
  all kernel code (that is not executing in a critical section)
-- 
2.26.2



[PATCH] locking/Kconfig: Don't forcefully uninline spin_unlock for PREEMPT

2020-05-29 Thread Sultan Alsawaf
From: Sultan Alsawaf 

This change was originally done in 2005 without any justification in
commit bda98685b855 ("[PATCH] x86: inline spin_unlock if
!CONFIG_DEBUG_SPINLOCK and !CONFIG_PREEMPT"). Perhaps the reasoning at
the time was that PREEMPT was still considered unstable and needed extra
debugging; however, this is no longer the case, so remove the artificial
limitation.

Signed-off-by: Sultan Alsawaf 
---
 kernel/Kconfig.preempt | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index bf82259cff96..5a9e0409c844 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -39,7 +39,6 @@ config PREEMPT
bool "Preemptible Kernel (Low-Latency Desktop)"
depends on !ARCH_NO_PREEMPT
select PREEMPTION
-   select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
help
  This option reduces the latency of the kernel by making
  all kernel code (that is not executing in a critical section)
-- 
2.26.2



[PATCH] drm/i915: Don't enable WaIncreaseLatencyIPCEnabled when IPC is disabled

2020-04-30 Thread Sultan Alsawaf
From: Sultan Alsawaf 

In commit 5a7d202b1574, a logical AND was erroneously changed to an OR,
causing WaIncreaseLatencyIPCEnabled to be enabled unconditionally for
kabylake and coffeelake, even when IPC is disabled. Fix the logic so
that WaIncreaseLatencyIPCEnabled is only used when IPC is enabled.

Fixes: 5a7d202b1574 ("drm/i915: Drop WaIncreaseLatencyIPCEnabled/1140 for cnl")
Cc: sta...@vger.kernel.org # 5.3.x+
Signed-off-by: Sultan Alsawaf 
---
 drivers/gpu/drm/i915/intel_pm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 8375054ba27d..a52986a9e7a6 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -4992,7 +4992,7 @@ static void skl_compute_plane_wm(const struct 
intel_crtc_state *crtc_state,
 * WaIncreaseLatencyIPCEnabled: kbl,cfl
 * Display WA #1141: kbl,cfl
 */
-   if ((IS_KABYLAKE(dev_priv) || IS_COFFEELAKE(dev_priv)) ||
+   if ((IS_KABYLAKE(dev_priv) || IS_COFFEELAKE(dev_priv)) &&
dev_priv->ipc_enabled)
latency += 4;
 
-- 
2.26.2



Re: [PATCH] rt-tests: backfire: Don't include asm/uaccess.h directly

2019-09-17 Thread Sultan Alsawaf
On Mon, Sep 16, 2019 at 11:57:32PM +0200, John Kacur wrote:
> Signed-off-by: John Kacur 
> But please in the future
> 1. Don't cc lkml on this
> 2. Include the maintainers in your patch

Hi,

Thanks for the sign-off. I was following the instructions listed here:
https://wiki.linuxfoundation.org/realtime/communication/send_rt_patches

I couldn't find any documentation of how to send patches for rt-tests. Is there
a different set of patch submission instructions on a wiki somewhere I missed?

Thanks,
Sultan


[PATCH] rt-tests: backfire: Don't include asm/uaccess.h directly

2019-09-03 Thread Sultan Alsawaf
From: Sultan Alsawaf 

Architecture-specific uaccess.h headers can have dependencies on
linux/uaccess.h (i.e., VERIFY_WRITE), so it cannot be included directly.
Since linux/uaccess.h includes asm/uaccess.h, just do that instead.

This fixes compile errors with certain kernels and architectures.

Signed-off-by: Sultan Alsawaf 
---
 src/backfire/backfire.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backfire/backfire.c b/src/backfire/backfire.c
index aaf9c4a..a8ac9f5 100644
--- a/src/backfire/backfire.c
+++ b/src/backfire/backfire.c
@@ -30,8 +30,8 @@
 #include 
 #include 
 #include 
+#include 
 
-#include 
 #include 
 
 #define BACKFIRE_MINOR MISC_DYNAMIC_MINOR
-- 
2.23.0



Re: [PATCH] mbcache: Speed up cache entry creation

2019-07-23 Thread Sultan Alsawaf
On Tue, Jul 23, 2019 at 10:56:05AM -0600, Andreas Dilger wrote:
> Do you have any kind of performance metrics that show this is an actual
> improvement in performance?  This would be either macro-level benchmarks
> (e.g. fio, but this seems unlikely to show any benefit), or micro-level
> measurements (e.g. flame graph) that show a net reduction in CPU cycles,
> lock contention, etc. in this part of the code.

Hi Andreas,

Here are some basic micro-benchmark results:

Before:
[3.162896] mb_cache_entry_create: AVG cycles: 75
[3.054701] mb_cache_entry_create: AVG cycles: 78
[3.152321] mb_cache_entry_create: AVG cycles: 77

After:
[3.043380] mb_cache_entry_create: AVG cycles: 68
[3.194321] mb_cache_entry_create: AVG cycles: 71
[3.038100] mb_cache_entry_create: AVG cycles: 69

The performance difference is probably more drastic when free memory is low,
since an unnecessary call to kmem_cache_alloc() can result in a long wait for
pages to be freed.

The micro-benchmark code is attached.

Thanks,
Sultan
---
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 289f3664061e..e0f22ff8fab8 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -82,7 +82,7 @@ static inline struct mb_bucket *mb_cache_entry_bucket(struct 
mb_cache *cache,
  * -EBUSY if entry with the same key and value already exists in cache.
  * Otherwise 0 is returned.
  */
-int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
+static int __mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
  u64 value, bool reusable)
 {
struct mb_cache_entry *entry, *dup;
@@ -148,6 +148,29 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t 
mask, u32 key,
 
return 0;
 }
+
+int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
+ u64 value, bool reusable)
+{
+   static unsigned long count, sum;
+   static DEFINE_MUTEX(lock);
+   volatile cycles_t start, delta;
+   int ret;
+
+   mutex_lock();
+   local_irq_disable();
+   start = get_cycles();
+   ret = __mb_cache_entry_create(cache, mask, key, value, reusable);
+   delta = get_cycles() - start;
+   local_irq_enable();
+
+   sum += delta;
+   if (++count == 1000)
+   printk("%s: AVG cycles: %lu\n", __func__, sum / count);
+   mutex_unlock();
+
+   return ret;
+}
 EXPORT_SYMBOL(mb_cache_entry_create);
 
 void __mb_cache_entry_free(struct mb_cache_entry *entry)


[PATCH] mbcache: Speed up cache entry creation

2019-07-22 Thread Sultan Alsawaf
From: Sultan Alsawaf 

In order to prevent redundant entry creation by racing against itself,
mb_cache_entry_create scans through a hash-list of all current entries
in order to see if another allocation for the requested new entry has
been made. Furthermore, it allocates memory for a new entry before
scanning through this hash-list, which results in that allocated memory
being discarded when the requested new entry is already present.

Speed up cache entry creation by keeping a small linked list of
requested new entries in progress, and scanning through that first
instead of the large hash-list. And don't allocate memory for a new
entry until it's known that the allocated memory will be used.

Signed-off-by: Sultan Alsawaf 
---
 fs/mbcache.c | 82 
 1 file changed, 57 insertions(+), 25 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 97c54d3a2227..289f3664061e 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -25,9 +25,14 @@
  * size hash table is used for fast key lookups.
  */
 
+struct mb_bucket {
+   struct hlist_bl_head hash;
+   struct list_head req_list;
+};
+
 struct mb_cache {
/* Hash table of entries */
-   struct hlist_bl_head*c_hash;
+   struct mb_bucket*c_bucket;
/* log2 of hash table size */
int c_bucket_bits;
/* Maximum entries in cache to avoid degrading hash too much */
@@ -42,15 +47,21 @@ struct mb_cache {
struct work_struct  c_shrink_work;
 };
 
+struct mb_cache_req {
+   struct list_head lnode;
+   u32 key;
+   u64 value;
+};
+
 static struct kmem_cache *mb_entry_cache;
 
 static unsigned long mb_cache_shrink(struct mb_cache *cache,
 unsigned long nr_to_scan);
 
-static inline struct hlist_bl_head *mb_cache_entry_head(struct mb_cache *cache,
-   u32 key)
+static inline struct mb_bucket *mb_cache_entry_bucket(struct mb_cache *cache,
+ u32 key)
 {
-   return >c_hash[hash_32(key, cache->c_bucket_bits)];
+   return >c_bucket[hash_32(key, cache->c_bucket_bits)];
 }
 
 /*
@@ -77,6 +88,8 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, 
u32 key,
struct mb_cache_entry *entry, *dup;
struct hlist_bl_node *dup_node;
struct hlist_bl_head *head;
+   struct mb_cache_req *tmp_req, req;
+   struct mb_bucket *bucket;
 
/* Schedule background reclaim if there are too many entries */
if (cache->c_entry_count >= cache->c_max_entries)
@@ -85,9 +98,33 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t 
mask, u32 key,
if (cache->c_entry_count >= 2*cache->c_max_entries)
mb_cache_shrink(cache, SYNC_SHRINK_BATCH);
 
+   bucket = mb_cache_entry_bucket(cache, key);
+   head = >hash;
+   hlist_bl_lock(head);
+   list_for_each_entry(tmp_req, >req_list, lnode) {
+   if (tmp_req->key == key && tmp_req->value == value) {
+   hlist_bl_unlock(head);
+   return -EBUSY;
+   }
+   }
+   hlist_bl_for_each_entry(dup, dup_node, head, e_hash_list) {
+   if (dup->e_key == key && dup->e_value == value) {
+   hlist_bl_unlock(head);
+   return -EBUSY;
+   }
+   }
+   req.key = key;
+   req.value = value;
+   list_add(, >req_list);
+   hlist_bl_unlock(head);
+
entry = kmem_cache_alloc(mb_entry_cache, mask);
-   if (!entry)
+   if (!entry) {
+   hlist_bl_lock(head);
+   list_del();
+   hlist_bl_unlock(head);
return -ENOMEM;
+   }
 
INIT_LIST_HEAD(>e_list);
/* One ref for hash, one ref returned */
@@ -96,15 +133,9 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t 
mask, u32 key,
entry->e_value = value;
entry->e_reusable = reusable;
entry->e_referenced = 0;
-   head = mb_cache_entry_head(cache, key);
+
hlist_bl_lock(head);
-   hlist_bl_for_each_entry(dup, dup_node, head, e_hash_list) {
-   if (dup->e_key == key && dup->e_value == value) {
-   hlist_bl_unlock(head);
-   kmem_cache_free(mb_entry_cache, entry);
-   return -EBUSY;
-   }
-   }
+   list_del();
hlist_bl_add_head(>e_hash_list, head);
hlist_bl_unlock(head);
 
@@ -133,7 +164,7 @@ static struct mb_cache_entry *__entry_find(struct mb_cache 
*cache,
struct hlist_bl_node *node;
struct hlist_bl_head *head;
 
-   head = mb_cache_entry_head(cache, key);
+   head = _cache_entry_bucket(cache, key)->hash;
hlist_bl_lock(head);
if (entry &a

[PATCH] scatterlist: Don't allocate sg lists using __get_free_page

2019-07-12 Thread Sultan Alsawaf
From: Sultan Alsawaf 

Allocating pages with __get_free_page is slower than going through the
slab allocator to grab free pages out from a pool.

These are the results from running the code at the bottom of this
message:
[1.278602] speedtest: __get_free_page: 9 us
[1.278606] speedtest: kmalloc: 4 us
[1.278609] speedtest: kmem_cache_alloc: 4 us
[1.278611] speedtest: vmalloc: 13 us

kmalloc and kmem_cache_alloc (which is what kmalloc uses for common
sizes behind the scenes) are the fastest choices. Use kmalloc to speed
up sg list allocation.

This is the code used to produce the above measurements:
#include 
#include 
#include 

static int speedtest(void *data)
{
static const struct sched_param sched_max_rt_prio = {
.sched_priority = MAX_RT_PRIO - 1
};
volatile s64 ctotal = 0, gtotal = 0, ktotal = 0, vtotal = 0;
struct kmem_cache *page_pool;
int i, j, trials = 1000;
volatile ktime_t start;
void *ptr[100];

sched_setscheduler_nocheck(current, SCHED_FIFO, _max_rt_prio);

page_pool = kmem_cache_create("pages", PAGE_SIZE, PAGE_SIZE, SLAB_PANIC,
  NULL);
for (i = 0; i < trials; i++) {
start = ktime_get();
for (j = 0; j < ARRAY_SIZE(ptr); j++)
while (!(ptr[j] = kmem_cache_alloc(page_pool, 
GFP_KERNEL)));
ctotal += ktime_us_delta(ktime_get(), start);
for (j = 0; j < ARRAY_SIZE(ptr); j++)
kmem_cache_free(page_pool, ptr[j]);

start = ktime_get();
for (j = 0; j < ARRAY_SIZE(ptr); j++)
while (!(ptr[j] = (void *)__get_free_page(GFP_KERNEL)));
gtotal += ktime_us_delta(ktime_get(), start);
for (j = 0; j < ARRAY_SIZE(ptr); j++)
free_page((unsigned long)ptr[j]);

start = ktime_get();
for (j = 0; j < ARRAY_SIZE(ptr); j++)
while (!(ptr[j] = kmalloc(PAGE_SIZE, GFP_KERNEL)));
ktotal += ktime_us_delta(ktime_get(), start);
for (j = 0; j < ARRAY_SIZE(ptr); j++)
kfree(ptr[j]);

start = ktime_get();
*ptr = vmalloc(ARRAY_SIZE(ptr) * PAGE_SIZE);
vtotal += ktime_us_delta(ktime_get(), start);
vfree(*ptr);
}
kmem_cache_destroy(page_pool);

printk("%s: __get_free_page: %lld us\n", __func__, gtotal / trials);
printk("%s: kmalloc: %lld us\n", __func__, ktotal / trials);
printk("%s: kmem_cache_alloc: %lld us\n", __func__, ctotal / trials);
printk("%s: vmalloc: %lld us\n", __func__, vtotal / trials);
complete(data);
return 0;
}

static int __init start_test(void)
{
DECLARE_COMPLETION_ONSTACK(done);

BUG_ON(IS_ERR(kthread_run(speedtest, , "malloc_test")));
wait_for_completion();
return 0;
}
late_initcall(start_test);

Signed-off-by: Sultan Alsawaf 
---
 lib/scatterlist.c | 23 ++-
 1 file changed, 2 insertions(+), 21 deletions(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c2cf2c311b7d..d3093e8f8978 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -148,31 +148,12 @@ EXPORT_SYMBOL(sg_init_one);
  */
 static struct scatterlist *sg_kmalloc(unsigned int nents, gfp_t gfp_mask)
 {
-   if (nents == SG_MAX_SINGLE_ALLOC) {
-   /*
-* Kmemleak doesn't track page allocations as they are not
-* commonly used (in a raw form) for kernel data structures.
-* As we chain together a list of pages and then a normal
-* kmalloc (tracked by kmemleak), in order to for that last
-* allocation not to become decoupled (and thus a
-* false-positive) we need to inform kmemleak of all the
-* intermediate allocations.
-*/
-   void *ptr = (void *) __get_free_page(gfp_mask);
-   kmemleak_alloc(ptr, PAGE_SIZE, 1, gfp_mask);
-   return ptr;
-   } else
-   return kmalloc_array(nents, sizeof(struct scatterlist),
-gfp_mask);
+   return kmalloc_array(nents, sizeof(struct scatterlist), gfp_mask);
 }
 
 static void sg_kfree(struct scatterlist *sg, unsigned int nents)
 {
-   if (nents == SG_MAX_SINGLE_ALLOC) {
-   kmemleak_free(sg);
-   free_page((unsigned long) sg);
-   } else
-   kfree(sg);
+   kfree(sg);
 }
 
 /**
-- 
2.22.0



Re: [PATCH] scatterlist: Allocate a contiguous array instead of chaining

2019-07-12 Thread Sultan Alsawaf
On Fri, Jul 12, 2019 at 09:06:40AM +0200, Thomas Gleixner wrote:
> On Fri, 12 Jul 2019, Ming Lei wrote:
> > vmalloc() may sleep, so it is impossible to be called in atomic context.
> 
> Allocations from atomic context should be avoided wherever possible and you
> really have to have a very convincing argument why an atomic allocation is
> absolutely necessary. I cleaned up quite some GFP_ATOMIC users over the
> last couple of years and all of them were doing it for the very wrong
> reasons and mostly just to silence the warning which is triggered with
> GFP_KERNEL when called from a non-sleepable context.
> 
> So I suggest to audit all call sites first and figure out whether they
> really must use GFP_ATOMIC and if possible clean them up, remove the GFP
> argument and then do the vmalloc thing on top.

Hello Thomas and Ming,

It looks like the following call sites are atomic:
drivers/crypto/qce/ablkcipher.c:92: ret = sg_alloc_table(>dst_tbl, 
rctx->dst_nents, gfp);
drivers/crypto/ccp/ccp-crypto-aes-cmac.c:110:   ret = 
sg_alloc_table(>data_sg, sg_count, gfp);
drivers/crypto/ccp/ccp-crypto-sha.c:103:ret = 
sg_alloc_table(>data_sg, sg_count, gfp);
drivers/spi/spi-pl022.c:1035:   ret = sg_alloc_table(>sgt_rx, pages, 
GFP_ATOMIC);
drivers/spi/spi-pl022.c:1039:   ret = sg_alloc_table(>sgt_tx, pages, 
GFP_ATOMIC);

The crypto ones are conditionally made atomic depending on the presence of
CRYPTO_TFM_REQ_MAY_SLEEP.

Additionally, the following allocation could be problematic with kvmalloc:
net/ceph/crypto.c:180:  ret = sg_alloc_table(sgt, chunk_cnt, GFP_NOFS);

This is a snippet from kvmalloc:
/*
 * vmalloc uses GFP_KERNEL for some internal allocations (e.g page 
tables)
 * so the given set of flags has to be compatible.
 */
if ((flags & GFP_KERNEL) != GFP_KERNEL)
return kmalloc_node(size, flags, node);

Use of GFP_NOFS in net/ceph/crypto.c would cause kvmalloc to fall back to
kmalloc_node, which could cause problems if the allocation size is too large for
kmalloc_node to reasonably accomodate.

Also, it looks like the vmalloc family doesn't have kvmalloc's GFP_KERNEL check.
Is this intentional, or does vmalloc really not require GFP_KERNEL context?

Thanks,
Sultan


[PATCH] scatterlist: Allocate a contiguous array instead of chaining

2019-07-12 Thread Sultan Alsawaf
From: Sultan Alsawaf 

Typically, drivers allocate sg lists of sizes up to a few MiB in size.
The current algorithm deals with large sg lists by splitting them into
several smaller arrays and chaining them together. But if the sg list
allocation is large, and we know the size ahead of time, sg chaining is
both inefficient and unnecessary.

Rather than calling kmalloc hundreds of times in a loop for chaining
tiny arrays, we can simply do it all at once with kvmalloc, which has
the proper tradeoff on when to stop using kmalloc and instead use
vmalloc.

Abusing repeated kmallocs to produce a large allocation puts strain on
the slab allocator, when kvmalloc can be used instead. The single
kvmalloc allocation for all sg lists reduces the burden on the slab and
page allocators, since for large sg list allocations, this commit
replaces numerous kmalloc calls with one kvmalloc call.

The sg chaining is effectively disabled by changing SG_MAX_SINGLE_ALLOC
to UINT_MAX, which causes sg list allocations to never be split into
chains, since no allocation is larger than UINT_MAX. We then plumb
kvmalloc into the allocation functions so that it is used.

Signed-off-by: Sultan Alsawaf 
---
 include/linux/scatterlist.h |  2 +-
 lib/scatterlist.c   | 23 ++-
 2 files changed, 3 insertions(+), 22 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 6eec50fb36c8..e2e26c53c441 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -310,7 +310,7 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int 
nents,
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
  */
-#define SG_MAX_SINGLE_ALLOC(PAGE_SIZE / sizeof(struct scatterlist))
+#define SG_MAX_SINGLE_ALLOC(UINT_MAX)
 
 /*
  * The maximum number of SG segments that we will put inside a
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c2cf2c311b7d..bf76854a34aa 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -148,31 +148,12 @@ EXPORT_SYMBOL(sg_init_one);
  */
 static struct scatterlist *sg_kmalloc(unsigned int nents, gfp_t gfp_mask)
 {
-   if (nents == SG_MAX_SINGLE_ALLOC) {
-   /*
-* Kmemleak doesn't track page allocations as they are not
-* commonly used (in a raw form) for kernel data structures.
-* As we chain together a list of pages and then a normal
-* kmalloc (tracked by kmemleak), in order to for that last
-* allocation not to become decoupled (and thus a
-* false-positive) we need to inform kmemleak of all the
-* intermediate allocations.
-*/
-   void *ptr = (void *) __get_free_page(gfp_mask);
-   kmemleak_alloc(ptr, PAGE_SIZE, 1, gfp_mask);
-   return ptr;
-   } else
-   return kmalloc_array(nents, sizeof(struct scatterlist),
-gfp_mask);
+   return kvmalloc_array(nents, sizeof(struct scatterlist), gfp_mask);
 }
 
 static void sg_kfree(struct scatterlist *sg, unsigned int nents)
 {
-   if (nents == SG_MAX_SINGLE_ALLOC) {
-   kmemleak_free(sg);
-   free_page((unsigned long) sg);
-   } else
-   kfree(sg);
+   kvfree(sg);
 }
 
 /**
-- 
2.22.0



Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-15 Thread Sultan Alsawaf
On Wed, May 15, 2019 at 02:32:48PM -0400, Steven Rostedt wrote:
> I'm confused why you did this?

Oleg said that debug_locks_off() could've been called and thus prevented
lockdep complaints about simple_lmk from appearing. To eliminate any possibility
of that, I disabled debug_locks_off().

Oleg also said that __lock_acquire() could return early if lock debugging were
somehow turned off after lockdep reported one bug. To mitigate any possibility
of that as well, I threw in the BUG_ON() for good measure.

I think at this point it's pretty clear that lockdep truly isn't complaining
about simple_lmk's locking pattern, and that lockdep's lack of complaints isn't
due to it being mysteriously turned off...

Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-14 Thread Sultan Alsawaf
On Tue, May 14, 2019 at 12:44:53PM -0400, Steven Rostedt wrote:
> OK, this has gotten my attention.
> 
> This thread is quite long, do you have a git repo I can look at, and
> also where is the first task_lock() taken before the
> find_lock_task_mm()?
> 
> -- Steve

Hi Steve,

This is the git repo I work on: 
https://github.com/kerneltoast/android_kernel_google_wahoo

With the newest simple_lmk iteration being this commit: 
https://github.com/kerneltoast/android_kernel_google_wahoo/commit/6b145b8c28b39f7047393169117f72ea7387d91c

This repo is based off the 4.4 kernel that Google ships on the Pixel 2/2XL.

simple_lmk iterates through the entire task list more than once and locks
potential victims using find_lock_task_mm(). It keeps these potential victims
locked across the multiple times that the task list is iterated.

The locking pattern that Oleg said should cause lockdep to complain is that
iterating through the entire task list more than once can lead to locking the
same task that was locked earlier with find_lock_task_mm(), and thus deadlock.
But there is a check in simple_lmk that avoids locking potential victims that
were already found, which avoids the deadlock, but lockdep doesn't know about
the check (which is done with vtsk_is_duplicate()) and should therefore
complain.

Lockdep does not complain though.

Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-13 Thread Sultan Alsawaf
On Fri, May 10, 2019 at 05:10:25PM +0200, Oleg Nesterov wrote:
> I am starting to think I am ;)
> 
> If you have task1 != task2 this code
> 
>   task_lock(task1);
>   task_lock(task2);
> 
> should trigger print_deadlock_bug(), task1->alloc_lock and task2->alloc_lock 
> are
> the "same" lock from lockdep pov, held_lock's will have the same 
> hlock_class().

Okay, I've stubbed out debug_locks_off(), and lockdep is now complaining about a
bunch of false positives so it is _really_ enabled this time. I grepped for
lockdep last time to try and find a concise way to show over email that lockdep
didn't complain, but that backfired. Here is better evidence:

taimen:/ # dmesg | grep simple_lmk
[   58.349917] simple_lmk: Killing droid.deskclock with adj 906 to free 47548 
KiB
[   58.354748] simple_lmk: Killing .android.dialer with adj 906 to free 36576 
KiB
[   58.355030] simple_lmk: Killing rbandroid.sleep with adj 904 to free 50016 
KiB
[   58.582833] simple_lmk: Killing oadcastreceiver with adj 904 to free 43044 
KiB
[   58.587731] simple_lmk: Killing .apps.wellbeing with adj 902 to free 48128 
KiB
[   58.588084] simple_lmk: Killing android.carrier with adj 902 to free 43636 
KiB
[   58.671857] simple_lmk: Killing ndroid.keychain with adj 902 to free 39992 
KiB
[   58.675622] simple_lmk: Killing gs.intelligence with adj 900 to free 49572 
KiB
[   58.675770] simple_lmk: Killing le.modemservice with adj 800 to free 41976 
KiB
[   58.762060] simple_lmk: Killing ndroid.settings with adj 700 to free 74708 
KiB
[   58.763238] simple_lmk: Killing roid.apps.turbo with adj 700 to free 54660 
KiB
[   58.873337] simple_lmk: Killing d.process.acore with adj 700 to free 48540 
KiB
[   58.873513] simple_lmk: Killing d.process.media with adj 500 to free 46188 
KiB
[   58.873713] simple_lmk: Killing putmethod.latin with adj 200 to free 67304 
KiB
[   59.014046] simple_lmk: Killing android.vending with adj 201 to free 54900 
KiB
[   59.017623] simple_lmk: Killing rak.mibandtools with adj 201 to free 44552 
KiB
[   59.018423] simple_lmk: Killing eport.trebuchet with adj 100 to free 106208 
KiB
[   59.223633] simple_lmk: Killing id.printspooler with adj 900 to free 39664 
KiB
[   59.223762] simple_lmk: Killing gle.android.gms with adj 100 to free 64176 
KiB
[   70.955204] simple_lmk: Killing .apps.translate with adj 906 to free 47564 
KiB
[   70.955633] simple_lmk: Killing cloudprint:sync with adj 906 to free 31932 
KiB
[   70.955787] simple_lmk: Killing oid.apps.photos with adj 904 to free 50204 
KiB
[   71.060789] simple_lmk: Killing ecamera.android with adj 906 to free 32232 
KiB
[   71.061074] simple_lmk: Killing webview_service with adj 906 to free 26028 
KiB
[   71.061199] simple_lmk: Killing com.whatsapp with adj 904 to free 49484 KiB
[   71.190625] simple_lmk: Killing rbandroid.sleep with adj 906 to free 54724 
KiB
[   71.190775] simple_lmk: Killing android.vending with adj 906 to free 39848 
KiB
[   71.191303] simple_lmk: Killing m.facebook.orca with adj 904 to free 72296 
KiB
[   71.342042] simple_lmk: Killing droid.deskclock with adj 902 to free 49284 
KiB
[   71.342240] simple_lmk: Killing .apps.wellbeing with adj 900 to free 47632 
KiB
[   71.342529] simple_lmk: Killing le.modemservice with adj 800 to free 33648 
KiB
[   71.482391] simple_lmk: Killing d.process.media with adj 800 to free 40676 
KiB
[   71.482511] simple_lmk: Killing rdog.challegram with adj 700 to free 71920 
KiB
taimen:/ #

The first simple_lmk message appears at 58.349917. And based on the timestamps,
it's clear that simple_lmk called task_lock() for multiple different tasks,
which is the pattern you think should cause lockdep to complain. But here is the
full dmesg starting from that point:

[   58.349917] simple_lmk: Killing droid.deskclock with adj 906 to free 47548 
KiB
[   58.354748] simple_lmk: Killing .android.dialer with adj 906 to free 36576 
KiB
[   58.355030] simple_lmk: Killing rbandroid.sleep with adj 904 to free 50016 
KiB
[   58.432654] binder_alloc: 2284: binder_alloc_buf failed to map pages in 
userspace, no vma
[   58.432671] binder: 1206:1218 transaction failed 29189/-3, size 76-0 line 
3189
[   58.582833] simple_lmk: Killing oadcastreceiver with adj 904 to free 43044 
KiB
[   58.587731] simple_lmk: Killing .apps.wellbeing with adj 902 to free 48128 
KiB
[   58.588084] simple_lmk: Killing android.carrier with adj 902 to free 43636 
KiB
[   58.590785] binder: undelivered transaction 58370, process died.
[   58.671857] simple_lmk: Killing ndroid.keychain with adj 902 to free 39992 
KiB
[   58.675622] simple_lmk: Killing gs.intelligence with adj 900 to free 49572 
KiB
[   58.675770] simple_lmk: Killing le.modemservice with adj 800 to free 41976 
KiB
[   58.736678] binder: undelivered transaction 58814, process died.
[   58.736733] binder: release 3099:3128 transaction 57832 in, still active
[   58.736744] binder: send failed reply for transaction 57832 to 1876:3090
[   58.736761] binder: undelivered TRANSACTION_COMPLETE
[   

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-09 Thread Sultan Alsawaf
On Thu, May 09, 2019 at 05:56:46PM +0200, Oleg Nesterov wrote:
> Impossible ;) I bet lockdep should report the deadlock as soon as 
> find_victims()
> calls find_lock_task_mm() when you already have a locked victim.

I hope you're not a betting man ;)

With the following configured:
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
# CONFIG_DEBUG_SPINLOCK_BITE_ON_BUG is not set
CONFIG_DEBUG_SPINLOCK_PANIC_ON_BUG=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set

And a printk added in vtsk_is_duplicate() to print when a duplicate is detected,
and my phone's memory cut in half to make simple_lmk do something, this is what
I observed:
taimen:/ # dmesg | grep lockdep
[0.00] \x09RCU lockdep checking is enabled.
taimen:/ # dmesg | grep simple_lmk
[   23.211091] simple_lmk: Killing android.carrier with adj 906 to free 37420 
kiB
[   23.211160] simple_lmk: Killing oadcastreceiver with adj 906 to free 36784 
kiB
[   23.248457] simple_lmk: Killing .apps.translate with adj 904 to free 45884 
kiB
[   23.248720] simple_lmk: Killing ndroid.settings with adj 904 to free 42868 
kiB
[   23.313417] simple_lmk: DUPLICATE VTSK!
[   23.313440] simple_lmk: Killing ndroid.keychain with adj 906 to free 33680 
kiB
[   23.313513] simple_lmk: Killing com.whatsapp with adj 904 to free 51436 kiB
[   34.646695] simple_lmk: DUPLICATE VTSK!
[   34.646717] simple_lmk: Killing ndroid.apps.gcs with adj 906 to free 37956 
kiB
[   34.646792] simple_lmk: Killing droid.apps.maps with adj 904 to free 63600 
kiB
taimen:/ # dmesg | grep lockdep
[0.00] \x09RCU lockdep checking is enabled.
taimen:/ # 

> As for 
> https://github.com/kerneltoast/android_kernel_google_wahoo/commit/afc8c9bf2dbde95941253c168d1adb64cfa2e3ad
> Well,
> 
>   mmdrop(mm);
>   simple_lmk_mm_freed(mm);
> 
> looks racy because mmdrop(mm) can free this mm_struct. Yes, 
> simple_lmk_mm_freed()
> does not dereference this pointer, but the same memory can be re-allocated as
> another ->mm for the new task which can be found by find_victims(), and _in 
> theory_
> this all can happen in between, so the "victims[i].mm == mm" can be false 
> positive.
> 
> And this also means that simple_lmk_mm_freed() should clear victims[i].mm when
> it detects "victims[i].mm == mm", otherwise we have the same theoretical race,
> victims_to_kill is only cleared when the last victim goes away.

Fair point. Putting simple_lmk_mm_freed(mm) right before mmdrop(mm), and
sprinkling in a cmpxchg in simple_lmk_mm_freed() should fix that up.

> Another nit... you can drop tasklist_lock right after the 1st "find_victims" 
> loop.

True!

> And it seems that you do not really need to walk the "victims" array twice 
> after that,
> you can do everything in a single loop, but this is cosmetic.

Won't this result in potentially holding the task lock way longer than necessary
for multiple tasks that aren't going to be killed?

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Sultan Alsawaf
On Tue, May 07, 2019 at 12:58:27PM +0200, Christian Brauner wrote:
> This is work that is ongoing and requires kernel changes to make it
> feasible. One of the things that I have been working on for quite a
> while is the whole file descriptor for processes thing that is important
> for LMKD (Even though I never thought about this use-case when I started
> pitching this.). Joel and Daniel have joined in and are working on
> making LMKD possible.
> What I find odd is that every couple of weeks different solutions to the
> low memory problem are pitched. There is simple_lkml, there is LMKD, and
> there was a patchset that wanted to speed up memory reclaim at process
> kill-time by adding a new flag to the new pidfd_send_signal() syscall.
> That all seems - though related - rather uncoordinated. Now granted,
> coordinated is usually not how kernel development necessarily works but
> it would probably be good to have some sort of direction and from what I
> have seen LMKD seems to be the most coordinated effort. But that might
> just be my impression.

LMKD is just Android's userspace low-memory-killer daemon. It's been around for
a while.

This patch (simple_lmk) is meant to serve as an immediate solution for the
devices that'll never see a single line of PSI code running on them, which
amounts to... well, most Android devices currently in existence. I'm more of a
cowboy who made this patch after waiting a few years for memory management
improvements on Android that never happened. Though it looks like it's going to
happen soon(ish?) for super new devices that'll have the privilege of shipping
with PSI in use.

On Tue, May 07, 2019 at 01:09:21PM +0200, Greg Kroah-Hartman wrote:
> > It's even more odd that although a userspace solution is touted as the 
> > proper
> > way to go on LKML, almost no Android OEMs are using it, and even in that 
> > commit
> > I linked in the previous message, Google made a rather large set of
> > modifications to the supposedly-defunct lowmemorykiller.c not one month ago.
> > What's going on?
> 
> "almost no"?  Again, Android Go is doing that, right?

I'd check for myself, but I can't seem to find kernel source for an Android Go
device...

This seems more confusing though. Why would the ultra-low-end devices use LMKD
while other devices use the broken lowmemorykiller driver?

> > Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845.
> 
> Qualcomm should never be used as an example of a company that has any
> idea of what to do in their kernel :)

Agreed, but nearly all OEMs that use Qualcomm chipsets roll with Qualcomm's
kernel decisions, so Qualcomm has a bit of influence here.

> > If PSI were backported to 4.4, or even 3.18, would it really be used?
> 
> Why wouldn't it, if it worked properly?

For the same mysterious reason that Qualcomm and others cling to
lowmemorykiller, I presume. This is part of what's been confusing me for quite
some time...

> Please see the work that went into PSI and the patches around it.
> There's also a lwn.net article last week about the further work ongoing
> in this area.  With all of that, you should see how in-kernel memory
> killers are NOT the way to go.
> 
> > Regardless, even if PSI were backported, a full-fledged LMKD using it has 
> > yet to
> > be made, so it wouldn't be of much use now.
> 
> "LMKD"?  Again, PSI is in the 4.9.y android-common tree, so the
> userspace side should be in AOSP, right?

LMKD as in Android's low-memory-killer daemon. It is in AOSP, but it looks like
it's still a work in progress.

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Sultan Alsawaf
On Tue, May 07, 2019 at 09:28:47AM -0700, Suren Baghdasaryan wrote:
> Hi Sultan,
> Looks like you are posting this patch for devices that do not use
> userspace LMKD solution due to them using older kernels or due to
> their vendors sticking to in-kernel solution. If so, I see couple
> logistical issues with this patch. I don't see it being adopted in
> upstream kernel 5.x since it re-implements a deprecated mechanism even
> though vendors still use it. Vendors on the other hand, will not adopt
> it until you show evidence that it works way better than what
> lowmemorykilled driver does now. You would have to provide measurable
> data and explain your tests before they would consider spending time
> on this.

Yes, this is mostly for the devices already produced that are forced to suffer
with poor memory management. I can't even convince vendors to fix kernel
memory leaks, so there's no way I'd be able to convince them of trying this
patch, especially since it seems like you're having trouble convincing vendors
to stop using lowmemorykiller in the first place. And thankfully, convincing
vendors isn't my job :)

> On the implementation side I'm not convinced at all that this would
> work better on all devices and in all circumstances. We had cases when
> a new mechanism would show very good results until one usecase
> completely broke it. Bulk killing of processes that you are doing in
> your patch was a very good example of such a decision which later on
> we had to rethink. That's why baking these policies into kernel is
> very problematic. Another problem I see with the implementation that
> it ties process killing with the reclaim scan depth. It's very similar
> to how vmpressure works and vmpressure in my experience is very
> unpredictable.

Could you elaborate a bit on why bulk killing isn't good?

> > > I linked in the previous message, Google made a rather large set of
> > > modifications to the supposedly-defunct lowmemorykiller.c not one month 
> > > ago.
> > > What's going on?
> 
> If you look into that commit, it adds ability to report kill stats. If
> that was a change in how that driver works it would be rejected.

Fair, though it was quite strange seeing something that was supposedly totally
abandoned receiving a large chunk of code for reporting stats.

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Sultan Alsawaf
On Tue, May 07, 2019 at 05:31:54PM +0200, Oleg Nesterov wrote:
> I am not going to comment the intent, but to be honest I am skeptical too.

The general sentiment has been that this is a really bad idea, but I'm just a
frustrated Android user who wants his phone to not require mountains of zRAM
only to still manage memory poorly. Until I can go out and buy a non-Pixel phone
that uses PSI to make these decisions (and does a good job of it), I'm going to
stick to my hacky little driver for my personal devices. Many others who like to
hack their Android devices to make them last longer will probably find value in
this as well, since there are millions of people who use devices that'll never
seen any of PSI ported to their ancient 3.x kernels.

And yes, I know this would never be accepted to upstream in a million years. I
mostly wanted some code review and guidance, since mm code is pretty tricky :)

> On 05/06, Sultan Alsawaf wrote:
> >
> > +static unsigned long find_victims(struct victim_info *varr, int *vindex,
> > + int vmaxlen, int min_adj, int max_adj)
> > +{
> > +   unsigned long pages_found = 0;
> > +   int old_vindex = *vindex;
> > +   struct task_struct *tsk;
> > +
> > +   for_each_process(tsk) {
> > +   struct task_struct *vtsk;
> > +   unsigned long tasksize;
> > +   short oom_score_adj;
> > +
> > +   /* Make sure there's space left in the victim array */
> > +   if (*vindex == vmaxlen)
> > +   break;
> > +
> > +   /* Don't kill current, kthreads, init, or duplicates */
> > +   if (same_thread_group(tsk, current) ||
> > +   tsk->flags & PF_KTHREAD ||
> > +   is_global_init(tsk) ||
> > +   vtsk_is_duplicate(varr, *vindex, tsk))
> > +   continue;
> > +
> > +   vtsk = find_lock_task_mm(tsk);
> 
> Did you test this patch with lockdep enabled?
> 
> If I read the patch correctly, lockdep should complain. vtsk_is_duplicate()
> ensures that we do not take the same ->alloc_lock twice or more, but lockdep
> can't know this.

Yeah, lockdep is fine with this, at least on 4.4.

> > +static void scan_and_kill(unsigned long pages_needed)
> > +{
> > +   static DECLARE_WAIT_QUEUE_HEAD(victim_waitq);
> > +   struct victim_info victims[MAX_VICTIMS];
> > +   int i, nr_to_kill = 0, nr_victims = 0;
> > +   unsigned long pages_found = 0;
> > +   atomic_t victim_count;
> > +
> > +   /*
> > +* Hold the tasklist lock so tasks don't disappear while scanning. This
> > +* is preferred to holding an RCU read lock so that the list of tasks
> > +* is guaranteed to be up to date. Keep preemption disabled until the
> > +* SIGKILLs are sent so the victim kill process isn't interrupted.
> > +*/
> > +   read_lock(_lock);
> > +   preempt_disable();
> 
> read_lock() disables preemption, every task_lock() too, so this looks
> unnecessary.

Good point.

> > +   for (i = 1; i < ARRAY_SIZE(adj_prio); i++) {
> > +   pages_found += find_victims(victims, _victims, MAX_VICTIMS,
> > +   adj_prio[i], adj_prio[i - 1]);
> > +   if (pages_found >= pages_needed || nr_victims == MAX_VICTIMS)
> > +   break;
> > +   }
> > +
> > +   /*
> > +* Calculate the number of tasks that need to be killed and quickly
> > +* release the references to those that'll live.
> > +*/
> > +   for (i = 0, pages_found = 0; i < nr_victims; i++) {
> > +   struct victim_info *victim = [i];
> > +   struct task_struct *vtsk = victim->tsk;
> > +
> > +   /* The victims' mm lock is taken in find_victims; release it */
> > +   if (pages_found >= pages_needed) {
> > +   task_unlock(vtsk);
> > +   continue;
> > +   }
> > +
> > +   /*
> > +* Grab a reference to the victim so it doesn't disappear after
> > +* the tasklist lock is released.
> > +*/
> > +   get_task_struct(vtsk);
> 
> The comment doesn't look correct. the victim can't dissapear until 
> task_unlock()
> below, it can't pass exit_mm().

I was always unsure about this and decided to hold a reference to the
task_struct to be safe. Thanks for clearing that up.

> > +   pages_found += victim->size;
> > +   nr_to_kill++;
> > +   }
> > +   read_unlock(_lock);
> > +
> > +   /* Kill the victims */
> > +   victim_co

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Sultan Alsawaf
On Tue, May 07, 2019 at 09:43:34AM +0200, Greg Kroah-Hartman wrote:
> Given that any "new" android device that gets shipped "soon" should be
> using 4.9.y or newer, is this a real issue?

It's certainly a real issue for those who can't buy brand new Android devices
without software bugs every six months :)

> And if it is, I'm sure that asking for those patches to be backported to
> 4.4.y would be just fine, have you asked?
>
> Note that I know of Android Go devices, running 3.18.y kernels, do NOT
> use the in-kernel memory killer, but instead use the userspace solution
> today.  So trying to get another in-kernel memory killer solution added
> anywhere seems quite odd.

It's even more odd that although a userspace solution is touted as the proper
way to go on LKML, almost no Android OEMs are using it, and even in that commit
I linked in the previous message, Google made a rather large set of
modifications to the supposedly-defunct lowmemorykiller.c not one month ago.
What's going on?

Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845. If PSI were
backported to 4.4, or even 3.18, would it really be used? I don't really
understand the aversion to an in-kernel memory killer on LKML despite the rest
of the industry's attraction to it. Perhaps there's some inherently great cost
in using the userspace solution that I'm unaware of?

Regardless, even if PSI were backported, a full-fledged LMKD using it has yet to
be made, so it wouldn't be of much use now.

Thanks,
Sultan

[1] 
https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/arch/arm64/configs/sdm845_defconfig?h=LA.UM.7.3.r1-07400-sdm845.0#n492


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Sultan Alsawaf
On Tue, May 07, 2019 at 09:04:30AM +0200, Greg Kroah-Hartman wrote:
> Um, why can't "all" Android devices take the same patches that the Pixel
> phones are using today?  They should all be in the public android-common
> kernel repositories that all Android devices should be syncing with on a
> weekly/monthly basis anyway, right?
> 
> thanks,
> 
> greg k-h

Hi Greg,

I only see PSI present in the android-common kernels for 4.9 and above. The vast
majority of Android devices do not run a 4.9+ kernel. It seems unreasonable to
expect OEMs to toil with backporting PSI themselves to get decent memory
management.

But even if they did backport PSI, it wouldn't help too much because a
PSI-enabled LMKD solution is not ready yet. It looks like a PSI-based LMKD is
still under heavy development and won't be ready for all Android devices for
quite some time.

Additionally, it looks like the supposedly-dead lowmemorykiller.c is still being
actively tweaked by Google [1], which does not instill confidence that a
definitive LMK solution a la PSI is coming any time soon.

Thanks,
Sultan

[1] 
https://android.googlesource.com/kernel/common/+/152bacdd85c46f0c76b00c4acc253e414513634c


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-06 Thread Sultan Alsawaf
This is a complete low memory killer solution for Android that is small
and simple. Processes are killed according to the priorities that
Android gives them, so that the least important processes are always
killed first. Processes are killed until memory deficits are satisfied,
as observed from kswapd struggling to free up pages. Simple LMK stops
killing processes when kswapd finally goes back to sleep.

The only tunables are the desired amount of memory to be freed per
reclaim event and desired frequency of reclaim events. Simple LMK tries
to free at least the desired amount of memory per reclaim and waits
until all of its victims' memory is freed before proceeding to kill more
processes.

Signed-off-by: Sultan Alsawaf 
---
Hello everyone,

I've addressed some of the concerns that were brought up with the first version
of the Simple LMK patch. I understand that a kernel-based solution like this
that contains policy decisions for a specific userspace is not the way to go,
but the Android ecosystem still has a pressing need for a low memory killer that
works well.

Most Android devices still use the ancient and deprecated lowmemorykiller.c
kernel driver; Simple LMK seeks to replace that, at the very least until PSI and
a userspace daemon utilizing PSI are ready for *all* Android devices, and not
just the privileged Pixel phone line.

The feedback I received last time was quite helpful. This Simple LMK patch works
significantly better than the first, improving memory management by a large
margin while being immune to transient spikes in memory usage (since the signal
to start killing processes is determined by how hard kswapd tries to free up
pages, which is something that occurs over a span of time and not a single point
in time).

I'd love to hear some feedback on this new patch. I do encourage those who are
interested to take it for a spin on an Android device. This patch has been
tested successfully on Android 4.4 and 4.9 kernels. For the sake of review here,
I have adapted the patch to 5.1.

Thanks,
Sultan

 drivers/android/Kconfig  |  33 
 drivers/android/Makefile |   1 +
 drivers/android/simple_lmk.c | 315 +++
 include/linux/mm_types.h |   4 +
 include/linux/simple_lmk.h   |  11 ++
 kernel/fork.c|  13 ++
 mm/vmscan.c  |  12 ++
 7 files changed, 389 insertions(+)
 create mode 100644 drivers/android/simple_lmk.c
 create mode 100644 include/linux/simple_lmk.h

diff --git a/drivers/android/Kconfig b/drivers/android/Kconfig
index 6fdf2abe4..bdd429338 100644
--- a/drivers/android/Kconfig
+++ b/drivers/android/Kconfig
@@ -54,6 +54,39 @@ config ANDROID_BINDER_IPC_SELFTEST
  exhaustively with combinations of various buffer sizes and
  alignments.
 
+config ANDROID_SIMPLE_LMK
+   bool "Simple Android Low Memory Killer"
+   depends on !ANDROID_LOW_MEMORY_KILLER && !MEMCG
+   ---help---
+ This is a complete low memory killer solution for Android that is
+ small and simple. Processes are killed according to the priorities
+ that Android gives them, so that the least important processes are
+ always killed first. Processes are killed until memory deficits are
+ satisfied, as observed from kswapd struggling to free up pages. Simple
+ LMK stops killing processes when kswapd finally goes back to sleep.
+
+if ANDROID_SIMPLE_LMK
+
+config ANDROID_SIMPLE_LMK_AGGRESSION
+   int "Reclaim frequency selection"
+   range 1 3
+   default 1
+   help
+ This value determines how frequently Simple LMK will perform memory
+ reclaims. A lower value corresponds to less frequent reclaims, which
+ maximizes memory usage. The range of values has a logarithmic
+ correlation; 2 is twice as aggressive as 1, and 3 is twice as
+ aggressive as 2, which makes 3 four times as aggressive as 1.
+
+config ANDROID_SIMPLE_LMK_MINFREE
+   int "Minimum MiB of memory to free per reclaim"
+   range 8 512
+   default 64
+   help
+ Simple LMK will try to free at least this much memory per reclaim.
+
+endif
+
 endif # if ANDROID
 
 endmenu
diff --git a/drivers/android/Makefile b/drivers/android/Makefile
index c7856e320..7c91293b6 100644
--- a/drivers/android/Makefile
+++ b/drivers/android/Makefile
@@ -3,3 +3,4 @@ ccflags-y += -I$(src)   # needed for trace 
events
 obj-$(CONFIG_ANDROID_BINDERFS) += binderfs.o
 obj-$(CONFIG_ANDROID_BINDER_IPC)   += binder.o binder_alloc.o
 obj-$(CONFIG_ANDROID_BINDER_IPC_SELFTEST) += binder_alloc_selftest.o
+obj-$(CONFIG_ANDROID_SIMPLE_LMK)   += simple_lmk.o
diff --git a/drivers/android/simple_lmk.c b/drivers/android/simple_lmk.c
new file mode 100644
index 0..a2ffb57b5
--- /dev/null
+++ b/drivers/android/simple_lmk.c
@@ -0,0 +1,315 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Sultan Alsawaf .
+

Re: [RFCv2] string: Use faster alternatives when constant arguments are used

2019-04-01 Thread Sultan Alsawaf
On Mon, Apr 01, 2019 at 10:43:13PM +0200, Rasmus Villemoes wrote:
> Consider your patch replacing !strcmp(buf, "123") by !memcmp(buf, "123",
> 4). buf is known to point to a nul-terminated string. But it may point
> at, say, the second-last byte in a page, with the last byte in that page
> being a nul byte, and the following page being MMIO or unmapped or all
> kinds of bad things. On e.g. x86 where unaligned accesses are cheap, and
> seeing that you're only comparing for equality, gcc is likely to compile
> the memcmp version into
> 
>   *(u32*)buf == 0x00333231
> 
> because you've told the compiler that there's no problem accessing four
> bytes starting at buf. Boom. Even without unaligned access being cheap
> this can happen; suppose the length is 8 instead, and gcc somehow knows
> that buf is four-byte aligned (and in this case it happens to point four
> bytes before a page boundary), so it could compile the memcmp(,,8) into
> 
>   *(u32*)(buf+4) == secondword && *(u32*)buf == firstword
> 
> (or do the comparisons in the "natural" order, but it might still do
> both loads first).

Ok, this is the first solid counter I've seen against my patch. I hadn't
considered unaligned word-sized accesses. Thanks.

> > And if there are concerns for some arches but not others, then couldn't 
> > this be
> > a feasible optimization for those which would work well with it?
> 
> No. First, these are concerns for all arches. Second, if you can find
> some particular place where string parsing/matching is in any way
> performance relevant and not just done once during driver init or
> whatnot, maybe the maintainers of that file would take a patch
> hand-optimizing some strcmps to memcmps, or, depending on what the code
> does, perhaps replacing the whole *cmp logic with a custom hash table.

FWIW, I didn't have a specific place in the kernel in mind that heavily relied
on str* operations and needed optimization. I just thought it could be a "free"
optimization, but apparently not.

> But a patch implicitly and silently touching thousands of lines of code,
> without an analysis of why none of the above is a problem for any of
> those lines, for any .config, arch, compiler version? No.

Well, I thought it could be proven to be correct regardless of the context,
which would imply that making such a global change touching thousands lof lines
of code would be safe. But sadly it isn't.

This does bring into question a greater problem though: usage of memcmp
throughout the kernel where the size provided is not the size of the smaller
container being compared. This usage of memcmp (which is quite easy to find all
across the kernel) could run into the unaligned memory access across a page
boundary that you gave as an example, no?

Sultan


Re: [RFCv2] string: Use faster alternatives when constant arguments are used

2019-03-30 Thread Sultan Alsawaf
On Mon, Mar 25, 2019 at 10:24:00PM +0100, Rasmus Villemoes wrote:
> What I'm worried about is your patch changing every single strcmp(,
> "literal") into a memcmp, with absolutely no way of knowing or checking
> anything about the other buffer. And actually, it doesn't have to be a
> BE arch with a word-at-a-time memcmp.
> If (as is usually the case) the strcmp() result is compared to zero, after you
> change
> 
>   !strcmp(buf, "literal")
> 
> into
> 
>   !memcmp(buf, "literal", 8)
> 
> the compiler may (exactly as you want it to) change that into a single
> 8-byte load (or two 4-byte loads) and comparisons to literals, no
> memcmp() involved. And how do you know that _that_ is ok, for every one
> of the hundreds, if not thousands, of instances in the tree?

When would this not be ok though? From what I've always known,

  strcmp(terminated_buf1, terminated_buf2)

is equivalent to

  memcmp(terminated_buf1, terminated_buf2, strlen(terminated_buf1))

  and

  memcmp(terminated_buf1, terminated_buf2, strlen(terminated_buf2))

regardless of whether or not one side is a literal (my patch just leverages the
compiler's ability to recognize strlen called on literals and optimize it out).
The latter memcmp instances would indeed perform worse than the first strcmp
when neither arguments are literals, but I don't see what makes the memcmp usage
"dangerous". How can the memcmps cross a page boundary when memcmp itself will
only read in large buffers of data at word boundaries?

And if there are concerns for some arches but not others, then couldn't this be
a feasible optimization for those which would work well with it?

Sultan


Re: [RFCv2] string: Use faster alternatives when constant arguments are used

2019-03-24 Thread Sultan Alsawaf
On Sun, Mar 24, 2019 at 10:17:49PM +0100, Rasmus Villemoes wrote:
> gcc already knows the semantics of these functions and can optimize
> accordingly. E.g. for strcpy() of a literal to a buffer, gcc readily
> compiles

The example you gave appears to get optimized accordingly, but there are
numerous sane counterexamples that don't get optimized.

On arm64, with GCC 8.3.0, let's look at the simple strcmp usage in
kernel/trace/preemptirq_delay_test.c:

static int preemptirq_delay_run(void *data)
{
unsigned long flags;

if (!strcmp(test_mode, "irq")) {
local_irq_save(flags);
busy_wait(delay);
local_irq_restore(flags);
} else if (!strcmp(test_mode, "preempt")) {
preempt_disable();
busy_wait(delay);
preempt_enable();
}

return 0;
}

This is what happens without my patch:

preemptirq_delay_run:
.LFB2517:
.cfi_startproc
stp x29, x30, [sp, -32]!
.cfi_def_cfa_offset 32
.cfi_offset 29, -32
.cfi_offset 30, -24
adrpx1, .LC0
add x1, x1, :lo12:.LC0
mov x29, sp
stp x19, x20, [sp, 16]
.cfi_offset 19, -16
.cfi_offset 20, -8
adrpx19, .LANCHOR0
add x19, x19, :lo12:.LANCHOR0
mov x0, x19
>   bl  strcmp
cbz w0, .L22
adrpx1, .LC1
mov x0, x19
add x1, x1, :lo12:.LC1
>   bl  strcmp
cbz w0, .L23

The calls to strcmp are not optimized out. Here is what this snippet looks like
after my patch:

preemptirq_delay_run:
.LFB2517:
.cfi_startproc
stp x29, x30, [sp, -32]!
.cfi_def_cfa_offset 32
.cfi_offset 29, -32
.cfi_offset 30, -24
adrpx0, .LANCHOR0
mov w1, 29289
mov x29, sp
ldr w2, [x0, #:lo12:.LANCHOR0]
movkw1, 0x71, lsl 16
add x3, x0, :lo12:.LANCHOR0
cmp w2, w1
beq .L23
ldr x1, [x0, #:lo12:.LANCHOR0]
mov x0, 29296
movkx0, 0x6565, lsl 16
movkx0, 0x706d, lsl 32
movkx0, 0x74, lsl 48
cmp x1, x0
beq .L24

The calls to strcmp were optimized out completely, not even replaced by a call
to memcmp.

I can produce lots of kernel examples that don't seem to follow basic str*
optimization, which is what prompted me to write this in the first place.

> Does this even compile? It's a well-known (or perhaps
> not-so-well-known?) pitfall that __builtin_constant_p() is not
> guaranteed to be usable in __builtin_choose_expr() - the latter only
> accepts bona fide integer constant expressions, while evaluation of
> __builtin_constant_p can be delayed until various optimization phases.

It not only compiles, but also seems to work correctly from what I've observed.

> This seems to be buggy - you don't know that src is at least as long as
> dest. And arguing "if it's shorter, there's a nul byte, which will
> differ from dest at that point, so memcmp will/should stop" means that
> the whole word-at-a-time argument would be out.

You mean reading the last word of a string could read out of bounds of the
string when using memcmp? There are lots of memcmp instances using a literal
string in the kernel; I don't think it would be hard to find one that violates
what you've pointed out, so I'm not really sure why it's a problem.

After a few minutes of grepping, it seems like the memcmp usage in
drivers/md/dm-integrity.c can read out of bounds on its arguments:
} else if (!memcmp(opt_string, "internal_hash:", strlen("internal_hash:"))) {
r = get_alg_and_key(opt_string, >internal_hash_alg, >error,
"Invalid internal_hash argument");
if (r)
goto bad;
} else if (!memcmp(opt_string, "journal_crypt:", strlen("journal_crypt:"))) {
r = get_alg_and_key(opt_string, >journal_crypt_alg, >error,
"Invalid journal_crypt argument");
if (r)
goto bad;
} else if (!memcmp(opt_string, "journal_mac:", strlen("journal_mac:"))) {

Where opt_string is a dynamically-set argument with no specified minimum length.

Sultan


Re: [RFC v3] string: Use faster alternatives when constant arguments are used

2019-03-24 Thread Sultan Alsawaf
On Sat, Mar 23, 2019 at 08:31:53PM -0700, Nathan Chancellor wrote:
> Explicitly cc'ing some folks who have touched include/linux/string.h in
> the past and might want to take a look at this.
> 
> Nathan

Thanks. One last revision with some nitpicks fixed is attached, though I doubt
it'll influence any comments on the concept of this anyway.

---
 include/linux/string.h | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/include/linux/string.h b/include/linux/string.h
index 7927b875f..40a8a9436 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -476,4 +476,34 @@ static __always_inline size_t str_has_prefix(const char 
*str, const char *prefix
return strncmp(str, prefix, len) == 0 ? len : 0;
 }
 
+/*
+ * Replace some common string helpers with faster alternatives when one of the
+ * arguments is a constant (i.e., literal string). This uses strlen instead of
+ * sizeof for calculating the string length in order to silence compiler
+ * warnings that may arise due to what the compiler thinks is incorrect sizeof
+ * usage. The strlen calls on constants are folded into scalar values at 
compile
+ * time, so performance is not reduced by using strlen.
+ */
+#define strcpy(dest, src)  
\
+   __builtin_choose_expr(__builtin_constant_p(src),
\
+   memcpy((dest), (src), strlen(src) + 1), 
\
+   (strcpy)((dest), (src)))
+
+#define strcat(dest, src)  
\
+   __builtin_choose_expr(__builtin_constant_p(src),
\
+   ({  
\
+   memcpy((dest) + strlen(dest), (src), strlen(src) + 1);  
\
+   (dest); 
\
+   }), 
\
+   (strcat)((dest), (src)))
+
+#define strcmp(left, right)
\
+   __builtin_choose_expr(__builtin_constant_p(left),   
\
+   __builtin_choose_expr(__builtin_constant_p(right),  
\
+   (strcmp)((left), (right)),  
\
+   memcmp((left), (right), strlen(left) + 1)), 
\
+   __builtin_choose_expr(__builtin_constant_p(right),  
\
+   memcmp((left), (right), strlen(right) + 1), 
\
+   (strcmp)((left), (right
+
 #endif /* _LINUX_STRING_H_ */
-- 
2.21.0



[RFCv2] string: Use faster alternatives when constant arguments are used

2019-03-23 Thread Sultan Alsawaf
I messed up the return value for strcat in the first patch. Here's a fixed
version, ready for some scathing reviews.

From: Sultan Alsawaf 

When strcpy, strcat, and strcmp are used with a literal string, they can
be optimized to memcpy or memcmp calls. These alternatives are faster
since knowing the length of a string argument beforehand allows
traversal through the string word at a time without being concerned
about looking for the terminating zero character. In some cases, the
replaced calls to memcpy or memcmp can even be optimized out completely
for a significant speed up.

Signed-off-by: Sultan Alsawaf 
---
 include/linux/string.h | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/include/linux/string.h b/include/linux/string.h
index 7927b875f..59c301c0e 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -476,4 +476,34 @@ static __always_inline size_t str_has_prefix(const char 
*str, const char *prefix
return strncmp(str, prefix, len) == 0 ? len : 0;
 }
 
+/*
+ * Replace some common string helpers with faster alternatives when one of the
+ * arguments is a constant (i.e., literal string). This uses strlen instead of
+ * sizeof for calculating the string length in order to silence compiler
+ * warnings that may arise due to what the compiler thinks is incorrect sizeof
+ * usage. The strlen calls on constants are folded into scalar values at 
compile
+ * time, so performance is not reduced by using strlen.
+ */
+#define strcpy(dest, src)  
\
+   __builtin_choose_expr(__builtin_constant_p(src),
\
+   memcpy((dest), (src), strlen(src) + 1), 
\
+   (strcpy)((dest), (src)))
+
+#define strcat(dest, src)  
\
+   __builtin_choose_expr(__builtin_constant_p(src),
\
+   ({  
\
+   memcpy(strchr((dest), '\0'), (src), strlen(src) + 1);   
\
+   (dest); 
\
+   }), 
\
+   (strcat)((dest), (src)))
+
+#define strcmp(dest, src)  
\
+   __builtin_choose_expr(__builtin_constant_p(dest),   
\
+   __builtin_choose_expr(__builtin_constant_p(src),
\
+   (strcmp)((dest), (src)),
\
+   memcmp((dest), (src), strlen(dest) + 1)),   
\
+   __builtin_choose_expr(__builtin_constant_p(src),
\
+   memcmp((dest), (src), strlen(src) + 1), 
\
+   (strcmp)((dest), (src
+
 #endif /* _LINUX_STRING_H_ */
-- 
2.21.0



[RFC] string: Use faster alternatives when constant arguments are used

2019-03-23 Thread Sultan Alsawaf
From: Sultan Alsawaf 

When strcpy, strcat, and strcmp are used with a literal string, they can
be optimized to memcpy or memcmp calls. These alternatives are faster
since knowing the length of a string argument beforehand allows
traversal through the string word at a time without being concerned
about looking for the terminating zero character. In some cases, the
replaced calls to memcpy or memcmp can even be optimized out completely
for a significant speed up.

Signed-off-by: Sultan Alsawaf 
---
 include/linux/string.h | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/include/linux/string.h b/include/linux/string.h
index 7927b875f..65db58342 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -476,4 +476,31 @@ static __always_inline size_t str_has_prefix(const char 
*str, const char *prefix
return strncmp(str, prefix, len) == 0 ? len : 0;
 }
 
+/*
+ * Replace some common string helpers with faster alternatives when one of the
+ * arguments is a constant (i.e., literal string). This uses strlen instead of
+ * sizeof for calculating the string length in order to silence compiler
+ * warnings that may arise due to what the compiler thinks is incorrect sizeof
+ * usage. The strlen calls on constants are folded into scalar values at 
compile
+ * time, so performance is not reduced by using strlen.
+ */
+#define strcpy(dest, src)  \
+   __builtin_choose_expr(__builtin_constant_p(src),\
+   memcpy((dest), (src), strlen(src) + 1), \
+   (strcpy)((dest), (src)))
+
+#define strcat(dest, src)  \
+   __builtin_choose_expr(__builtin_constant_p(src),\
+   memcpy(strchr((dest), '\0'), (src), strlen(src) + 1),   \
+   (strcat)((dest), (src)))
+
+#define strcmp(dest, src)  \
+   __builtin_choose_expr(__builtin_constant_p(dest),   \
+   __builtin_choose_expr(__builtin_constant_p(src),\
+   (strcmp)((dest), (src)),\
+   memcmp((dest), (src), strlen(dest) + 1)),   \
+   __builtin_choose_expr(__builtin_constant_p(src),\
+   memcmp((dest), (src), strlen(src) + 1), \
+   (strcmp)((dest), (src
+
 #endif /* _LINUX_STRING_H_ */
-- 
2.21.0



Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-14 Thread Sultan Alsawaf
On Thu, Mar 14, 2019 at 11:16:41PM -0400, Steven Rostedt wrote:
> How would you implement such a method in userspace? kill() doesn't take
> any parameters but the pid of the process you want to send a signal to,
> and the signal to send. This would require a new system call, and be
> quite a bit of work. If you can solve this with an ebpf program, I
> strongly suggest you do that instead.

This can be done by introducing a new signal number that provides SIGKILL
functionality while blocking (maybe SIGKILLBLOCK?).

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-14 Thread Sultan Alsawaf
On Thu, Mar 14, 2019 at 10:54:48PM -0400, Joel Fernandes wrote:
> I'm not sure if that makes much semantic sense for how the signal handling is
> supposed to work. Imagine a parent sends SIGKILL to its child, and then does
> a wait(2). Because the SIGKILL blocks in your idea, then the wait cannot
> execute, and because the wait cannot execute, the zombie task will not get
> reaped and so the SIGKILL senders never gets unblocked and the whole thing
> just gets locked up. No? I don't know it just feels incorrect.

Block until the victim becomes a zombie instead.

> Further, in your idea adding stuff to task_struct will simply bloat it - when
> this task can easily be handled using eBPF without making any kernel changes.
> Either by probing sched_process_free or sched_process_exit tracepoints.
> Scheduler maintainers generally frown on adding stuff to task_struct
> pointlessly there's a good reason since bloating it effects the performance
> etc, and something like this would probably never be ifdef'd out behind a
> CONFIG.

Adding something to task_struct is just the easiest way to test things for
experimentation. This can be avoided in my suggestion by passing the pointer to
a completion via the relevant functions, and then completing it at the time the
victim transitions to a zombie state. I understand it's possible to use eBPF for
this, but it seems kind of messy since this functionality is something that I
think others would want provided by the kernel (i.e., anyone using PSI to
implement their own OOM killer daemon similar to LMKD).

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-14 Thread Sultan Alsawaf
On Thu, Mar 14, 2019 at 10:47:17AM -0700, Joel Fernandes wrote:
> About the 100ms latency, I wonder whether it is that high because of
> the way Android's lmkd is observing that a process has died. There is
> a gap between when a process memory is freed and when it disappears
> from the process-table.  Once a process is SIGKILLed, it becomes a
> zombie. Its memory is freed instantly during the SIGKILL delivery (I
> traced this so that's how I know), but until it is reaped by its
> parent thread, it will still exist in /proc/ . So if testing the
> existence of /proc/ is how Android is observing that the process
> died, then there can be a large latency where it takes a very long
> time for the parent to actually reap the child way after its memory
> was long freed. A quicker way to know if a process's memory is freed
> before it is reaped could be to read back /proc//maps in
> userspace of the victim , and that file will be empty for zombie
> processes. So then one does not need wait for the parent to reap it. I
> wonder how much of that 100ms you mentioned is actually the "Waiting
> while Parent is reaping the child", than "memory freeing time". So
> yeah for this second problem, the procfds work will help.
>
> By the way another approach that can provide a quick and asynchronous
> notification of when the process memory is freed, is to monitor
> sched_process_exit trace event using eBPF. You can tell eBPF the PID
> that you want to monitor before the SIGKILL. As soon as the process
> dies and its memory is freed, the eBPF program can send a notification
> to user space (using the perf_events polling infra). The
> sched_process_exit fires just after the mmput() happens so it is quite
> close to when the memory is reclaimed. This also doesn't need any
> kernel changes. I could come up with a prototype for this and
> benchmark it on Android, if you want. Just let me know.

Perhaps I'm missing something, but if you want to know when a process has died
after sending a SIGKILL to it, then why not just make the SIGKILL optionally
block until the process has died completely? It'd be rather trivial to just
store a pointer to an onstack completion inside the victim process' task_struct,
and then complete it in free_task().

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-12 Thread Sultan Alsawaf
On Tue, Mar 12, 2019 at 10:17:43AM -0700, Tim Murray wrote:
> Knowing whether a SIGKILL'd process has finished reclaiming is as far
> as I know not possible without something like procfds. That's where
> the 100ms timeout in lmkd comes in. lowmemorykiller and lmkd both
> attempt to wait up to 100ms for reclaim to finish by checking for the
> continued existence of the thread that received the SIGKILL, but this
> really means that they wait up to 100ms for the _thread_ to finish,
> which doesn't tell you anything about the memory used by that process.
> If those threads terminate early and lowmemorykiller/lmkd get a signal
> to kill again, then there may be two processes competing for CPU time
> to reclaim memory. That doesn't reclaim any faster and may be an
> unnecessary kill.
> ...
> - offer a way to wait for process termination so lmkd can tell when
> reclaim has finished and know when killing another process is
> appropriate

Should be pretty easy with something like this:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1549584a1..6ac478af2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1199,6 +1199,7 @@ struct task_struct {
unsigned long   lowest_stack;
unsigned long   prev_lowest_stack;
 #endif
+   ktime_t sigkill_time;
 
/*
 * New fields for task_struct should be added above here, so that
diff --git a/kernel/fork.c b/kernel/fork.c
index 9dcd18aa2..0ae182777 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -435,6 +435,8 @@ void put_task_stack(struct task_struct *tsk)
 
 void free_task(struct task_struct *tsk)
 {
+   ktime_t sigkill_time = tsk->sigkill_time;
+   pid_t pid = tsk->pid;
 #ifndef CONFIG_THREAD_INFO_IN_TASK
/*
 * The task is finally done with both the stack and thread_info,
@@ -455,6 +457,9 @@ void free_task(struct task_struct *tsk)
if (tsk->flags & PF_KTHREAD)
free_kthread_struct(tsk);
free_task_struct(tsk);
+   if (sigkill_time)
+   printk("%d killed after %lld us\n", pid,
+  ktime_us_delta(ktime_get(), sigkill_time));
 }
 EXPORT_SYMBOL(free_task);
 
@@ -1881,6 +1886,7 @@ static __latent_entropy struct task_struct *copy_process(
p->sequential_io= 0;
p->sequential_io_avg= 0;
 #endif
+   p->sigkill_time = 0;
 
/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
diff --git a/kernel/signal.c b/kernel/signal.c
index 5d53183e2..1142c8811 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1168,6 +1168,8 @@ static int __send_signal(int sig, struct kernel_siginfo 
*info, struct task_struc
}
 
 out_set:
+   if (sig == SIGKILL)
+   t->sigkill_time = ktime_get();
signalfd_notify(t, sig);
sigaddset(>signal, sig);


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-12 Thread Sultan Alsawaf
On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> The only way to control the OOM behavior pro-actively is to throttle
> allocation speed. We have memcg high limit for that purpose. Along with
> PSI, I can imagine a reasonably working user space early oom
> notifications and reasonable acting upon that.

The issue with pro-active memory management that prompted me to create this was
poor memory utilization. All of the alternative means of reclaiming pages in the
page allocator's slow path turn out to be very useful for maximizing memory
utilization, which is something that we would have to forgo by relying on a
purely pro-active solution. I have not had a chance to look at PSI yet, but
unless a PSI-enabled solution allows allocations to reach the same point as when
the OOM killer is invoked (which is contradictory to what it sets out to do),
then it cannot take advantage of all of the alternative memory-reclaim means
employed in the slowpath, and will result in killing a process before it is
_really_ necessary.

> If you design is relies on the speed of killing then it is fundamentally
> flawed AFAICT. You cannot assume anything about how quickly a task dies.
> It might be blocked in an uninterruptible sleep or performin an
> operation which takes some time. Sure, oom_reaper might help here but
> still.

In theory we could instantly zap any process that is not trapped in the kernel
at the time that the OOM killer is invoked without any consequences though, no?

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-11 Thread Sultan Alsawaf
On Mon, Mar 11, 2019 at 03:15:35PM -0700, Suren Baghdasaryan wrote:
> This what LMKD currently is - a userspace RT process.
> My point was that this page allocation queue that you implemented
> can't be implemented in userspace, at least not without extensive
> communication with kernel.

Oh, that's easy to address. My page allocation queue and the decision on when to
kill a process are orthogonal. In fact, the page allocation queue could be
touched up a bit to factor in the issues Michal mentioned, and it can be
implemented as an improvement to the existing OOM killer. The point of it is
just to ensure that page allocation requests that have gone OOM are given
priority over other allocation requests when free pages start to trickle in.

Userspace doesn't need to know about the page allocation queue, and the queue is
not necessary to implement the method of determining when to kill processes that
I've proposed. It's an optimization, not a necessity.

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-11 Thread Sultan Alsawaf
On Mon, Mar 11, 2019 at 05:11:25PM -0400, Joel Fernandes wrote:
> But the point is that a transient temporary memory spike should not be a
> signal to kill _any_ process.  The reaction to kill shouldn't be so
> spontaneous that unwanted tasks are killed because the system went into
> panic mode. It should be averaged out which I believe is what PSI does.

In my patch from the first email, I implemented the decision to kill a process
at the same time that the existing kernel OOM killer decides to kill a process.
If the kernel's OOM killer were susceptible to killing processes due to
transient memory spikes, then I think there would have been several complaints
about this behavior regardless of which userspace or architecture is in use.
I think the existing OOM killer has this done right.

The decision to kill a process occurs after the page allocator has tried _very_
hard to satisfy a page allocation via alternative means, such as utilizing
compaction, flushing file-backed pages to disk via kswapd, and direct reclaim.
Once all of those means have failed, it is quite reasonable to kill a process to
free memory. Trying to wait out the memory starvation at this point would be
futile.

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-11 Thread Sultan Alsawaf
On Mon, Mar 11, 2019 at 01:10:36PM -0700, Suren Baghdasaryan wrote:
> The idea seems interesting although I need to think about this a bit
> more. Killing processes based on failed page allocation might backfire
> during transient spikes in memory usage.

This issue could be alleviated if tasks could be killed and have their pages
reaped faster. Currently, Linux takes a _very_ long time to free a task's memory
after an initial privileged SIGKILL is sent to a task, even with the task's
priority being set to the highest possible (so unwanted scheduler preemption
starving dying tasks of CPU time is not the issue at play here). I've
frequently measured the difference in time between when a SIGKILL is sent for a
task and when free_task() is called for that task to be hundreds of
milliseconds, which is incredibly long. AFAIK, this is a problem that LMKD
suffers from as well, and perhaps any OOM killer implementation in Linux, since
you cannot evaluate effect you've had on memory pressure by killing a process
for at least several tens of milliseconds.

> AFAIKT the biggest issue with using this approach in userspace is that
> it's not practically implementable without heavy in-kernel support.
> How to implement such interaction between kernel and userspace would
> be an interesting discussion which I would be happy to participate in.

You could signal a lightweight userspace process that has maximum scheduler
priority and have it kill the tasks it'd like.

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-11 Thread Sultan Alsawaf
On Mon, Mar 11, 2019 at 06:43:20PM +0100, Michal Hocko wrote:
> I am sorry but we are not going to maintain two different OOM
> implementations in the kernel. From a quick look the implementation is
> quite a hack which is not really suitable for anything but a very
> specific usecase. E.g. reusing a freed page for a waiting allocation
> sounds like an interesting idea but it doesn't really work for many
> reasons. E.g. any NUMA affinity is broken, zone protection doesn't work
> either. Not to mention how the code hooks into the allocator hot paths.
> This is simply no no.
> 
> Last but not least people have worked really hard to provide means (PSI)
> to do what you need in the userspace.

Hi Michal,

Thanks for the feedback. I had no doubt that this would be vehemently rejected
on the mailing list, but I wanted feedback/opinions on it and thus sent it as 
anRFC. At best I thought perhaps the mechanisms I've employed might serve as
inspiration for LMKD improvements in Android, since this hacky OOM killer I've
devised does work quite well for the very specific usecase it is set out to
address. The NUMA affinity and zone protection bits are helpful insights too.

I'll take a look at PSI which Joel mentioned as well.

Thanks,
Sultan Alsawaf


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-10 Thread Sultan Alsawaf
On Sun, Mar 10, 2019 at 10:03:35PM +0100, Greg Kroah-Hartman wrote:
> On Sun, Mar 10, 2019 at 01:34:03PM -0700, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf 
> > 
> > This is a complete low memory killer solution for Android that is small
> > and simple. It kills the largest, least-important processes it can find
> > whenever a page allocation has completely failed (right after direct
> > reclaim). Processes are killed according to the priorities that Android
> > gives them, so that the least important processes are always killed
> > first. Killing larger processes is preferred in order to free the most
> > memory possible in one go.
> > 
> > Simple LMK is integrated deeply into the page allocator in order to
> > catch exactly when a page allocation fails and exactly when a page is
> > freed. Failed page allocations that have invoked Simple LMK are placed
> > on a queue and wait for Simple LMK to satisfy them. When a page is about
> > to be freed, the failed page allocations are given priority over normal
> > page allocations by Simple LMK to see if they can immediately use the
> > freed page.
> > 
> > Additionally, processes are continuously killed by failed small-order
> > page allocations until they are satisfied.
> > 
> > Signed-off-by: Sultan Alsawaf 
> 
> Wait, why?  We just removed the in-kernel android memory killer, we
> don't want to add another one back again, right?  Android Go devices
> work just fine with the userspace memory killer code, and those are "low
> memory" by design.
> 
> Why do we need kernel code here at all?
> 
> thanks,
> 
> greg k-h

Hi Greg,

Thanks for replying. It has not been my experience and the experience of many
others that Android's userspace low memory memory killer works "just fine." On
my Pixel 3 XL with a meager 4GB of memory, the userspace killer has had issues
with killing too many processes, which has resulted in a noticeably poor user
experience for all Pixel owners. From the looks of lmkd on the master branch,
there still isn't really any definitive solution for this, aside from a 100ms
delay in between process kills.

I think that the userspace low memory killer is more complex than necessary,
especially since in the kernel we can detect exactly when we run out of memory
and react far more quickly than any userspace daemon.

The original reasoning behind why the old kernel low memory killer was removed
is also a bit vague to me. It just seemed to be abandonware, and all of a sudden
a userspace daemon was touted as the solution.

This driver is like an Android-flavored version of the kernel's oom killer, and
has proven quite effective for me on my Pixel. Processes are killed exactly when
a page allocation fails, so memory use is maximized. There is no complexity to
try and estimate how full memory is either.

Thanks,
Sultan


[RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-10 Thread Sultan Alsawaf
From: Sultan Alsawaf 

This is a complete low memory killer solution for Android that is small
and simple. It kills the largest, least-important processes it can find
whenever a page allocation has completely failed (right after direct
reclaim). Processes are killed according to the priorities that Android
gives them, so that the least important processes are always killed
first. Killing larger processes is preferred in order to free the most
memory possible in one go.

Simple LMK is integrated deeply into the page allocator in order to
catch exactly when a page allocation fails and exactly when a page is
freed. Failed page allocations that have invoked Simple LMK are placed
on a queue and wait for Simple LMK to satisfy them. When a page is about
to be freed, the failed page allocations are given priority over normal
page allocations by Simple LMK to see if they can immediately use the
freed page.

Additionally, processes are continuously killed by failed small-order
page allocations until they are satisfied.

Signed-off-by: Sultan Alsawaf 
---
 drivers/android/Kconfig  |  28 
 drivers/android/Makefile |   1 +
 drivers/android/simple_lmk.c | 301 +++
 include/linux/sched.h|   3 +
 include/linux/simple_lmk.h   |  11 ++
 kernel/fork.c|   3 +
 mm/page_alloc.c  |  13 ++
 7 files changed, 360 insertions(+)
 create mode 100644 drivers/android/simple_lmk.c
 create mode 100644 include/linux/simple_lmk.h

diff --git a/drivers/android/Kconfig b/drivers/android/Kconfig
index 6fdf2abe4..7469d049d 100644
--- a/drivers/android/Kconfig
+++ b/drivers/android/Kconfig
@@ -54,6 +54,34 @@ config ANDROID_BINDER_IPC_SELFTEST
  exhaustively with combinations of various buffer sizes and
  alignments.
 
+config ANDROID_SIMPLE_LMK
+   bool "Simple Android Low Memory Killer"
+   depends on !MEMCG
+   ---help---
+ This is a complete low memory killer solution for Android that is
+ small and simple. It is integrated deeply into the page allocator to
+ know exactly when a page allocation hits OOM and exactly when a page
+ is freed. Processes are killed according to the priorities that
+ Android gives them, so that the least important processes are always
+ killed first.
+
+if ANDROID_SIMPLE_LMK
+
+config ANDROID_SIMPLE_LMK_MINFREE
+   int "Minimum MiB of memory to free per reclaim"
+   default "64"
+   help
+ Simple LMK will free at least this many MiB of memory per reclaim.
+
+config ANDROID_SIMPLE_LMK_KILL_TIMEOUT
+   int "Kill timeout in milliseconds"
+   default "50"
+   help
+ Simple LMK will only perform memory reclaim at most once per this
+ amount of time.
+
+endif # if ANDROID_SIMPLE_LMK
+
 endif # if ANDROID
 
 endmenu
diff --git a/drivers/android/Makefile b/drivers/android/Makefile
index c7856e320..7c91293b6 100644
--- a/drivers/android/Makefile
+++ b/drivers/android/Makefile
@@ -3,3 +3,4 @@ ccflags-y += -I$(src)   # needed for trace 
events
 obj-$(CONFIG_ANDROID_BINDERFS) += binderfs.o
 obj-$(CONFIG_ANDROID_BINDER_IPC)   += binder.o binder_alloc.o
 obj-$(CONFIG_ANDROID_BINDER_IPC_SELFTEST) += binder_alloc_selftest.o
+obj-$(CONFIG_ANDROID_SIMPLE_LMK)   += simple_lmk.o
diff --git a/drivers/android/simple_lmk.c b/drivers/android/simple_lmk.c
new file mode 100644
index 0..8a441650a
--- /dev/null
+++ b/drivers/android/simple_lmk.c
@@ -0,0 +1,301 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Sultan Alsawaf .
+ */
+
+#define pr_fmt(fmt) "simple_lmk: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define MIN_FREE_PAGES (CONFIG_ANDROID_SIMPLE_LMK_MINFREE * SZ_1M / PAGE_SIZE)
+
+struct oom_alloc_req {
+   struct page *page;
+   struct completion done;
+   struct list_head lh;
+   unsigned int order;
+   int migratetype;
+};
+
+struct victim_info {
+   struct task_struct *tsk;
+   unsigned long size;
+};
+
+enum {
+   DISABLED,
+   STARTING,
+   READY,
+   KILLING
+};
+
+/* Pulled from the Android framework */
+static const short int adj_prio[] = {
+   906, /* CACHED_APP_MAX_ADJ */
+   905, /* Cached app */
+   904, /* Cached app */
+   903, /* Cached app */
+   902, /* Cached app */
+   901, /* Cached app */
+   900, /* CACHED_APP_MIN_ADJ */
+   800, /* SERVICE_B_ADJ */
+   700, /* PREVIOUS_APP_ADJ */
+   600, /* HOME_APP_ADJ */
+   500, /* SERVICE_ADJ */
+   400, /* HEAVY_WEIGHT_APP_ADJ */
+   300, /* BACKUP_APP_ADJ */
+   200, /* PERCEPTIBLE_APP_ADJ */
+   100, /* VISIBLE_APP_ADJ */
+   0/* FOREGROUND_APP_ADJ */
+};
+
+/* Make sure that PID_MAX_DEFAULT isn't too big, or these arrays will be huge 
*/
+static struct victim_info victim_array[PID_MAX_DEFAULT];
+static 

[PATCH] random: fix inconsistent spinlock usage

2019-02-15 Thread Sultan Alsawaf
All users of the struct entropy_store spinlock use the irqsave spinlock variant.
Spinlock users of the same lock should use be consistent in their use of a
certain spinlock primitive, which makes add_interrupt_randomness()'s spinlock
usage incorrect.

Fix the inconsistency by converting add_interrupt_randomness()'s spinlocks to
use the irqsave primitive.

Signed-off-by: Sultan Alsawaf 
---
 drivers/char/random.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 38c6d1af6..1365017a7 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1239,6 +1239,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
  __u64 ip;
  unsigned long seed;
  int credit = 0;
+ unsigned long flags;

  if (cycles == 0)
  cycles = get_reg(fast_pool, regs);
@@ -1269,7 +1270,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
  return;

  r = _pool;
- if (!spin_trylock(>lock))
+ if (!spin_trylock_irqsave(>lock, flags))
  return;

  fast_pool->last = now;
@@ -1285,7 +1286,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
  __mix_pool_bytes(r, , sizeof(seed));
  credit = 1;
  }
- spin_unlock(>lock);
+ spin_unlock_irqrestore(>lock, flags);

  fast_pool->count = 0;

-- 
2.20.1


random: remove no-op BUG_ON

2019-02-15 Thread Sultan Alsawaf
container_of simply does pointer arithmetic; it's not going to spit out NULL, so
this BUG_ON is unneeded.

Signed-off-by: Sultan Alsawaf 
---
 drivers/char/random.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 38c6d1af6..5ea9cf3d3 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1357,7 +1357,7 @@ static void push_to_pool(struct work_struct *work)
 {
  struct entropy_store *r = container_of(work, struct entropy_store,
push_work);
- BUG_ON(!r);
+
  _xfer_secondary_pool(r, random_read_wakeup_bits/8);
  trace_push_to_pool(r->name, r->entropy_count >> ENTROPY_SHIFT,
 r->pull->entropy_count >> ENTROPY_SHIFT);
-- 
2.20.1


Re: [PATCH] random: remove unused argument from add_interrupt_randomness()

2019-02-15 Thread Sultan Alsawaf
This is still an issue in the latest kernel version.

Sultan

On Mon, Apr 30, 2018 at 8:12 AM Sultan Alsawaf  wrote:
>
> From a872cf4f0bb57a7bb3c95ea557082fb7733344f8 Mon Sep 17 00:00:00 2001
> From: Sultan Alsawaf 
> Date: Sun, 29 Apr 2018 20:04:35 -0700
> Subject: [PATCH] random: remove unused argument from
>  add_interrupt_randomness()
>
> The irq_flags parameter is not used. Remove it.
>
> Signed-off-by: Sultan Alsawaf 
> ---
>  drivers/char/random.c  | 4 ++--
>  drivers/hv/hv.c| 2 +-
>  drivers/hv/vmbus_drv.c | 2 +-
>  include/linux/random.h | 2 +-
>  kernel/irq/handle.c| 2 +-
>  5 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index 3cd3aae24d6d..125c1f545277 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -131,7 +131,7 @@
>   * void add_device_randomness(const void *buf, unsigned int size);
>   * void add_input_randomness(unsigned int type, unsigned int code,
>   *unsigned int value);
> - * void add_interrupt_randomness(int irq, int irq_flags);
> + * void add_interrupt_randomness(int irq);
>   * void add_disk_randomness(struct gendisk *disk);
>   *
>   * add_device_randomness() is for adding data to the random pool that
> @@ -1189,7 +1189,7 @@ static __u32 get_reg(struct fast_pool *f, struct 
> pt_regs *regs)
> return *ptr;
>  }
>
> -void add_interrupt_randomness(int irq, int irq_flags)
> +void add_interrupt_randomness(int irq)
>  {
> struct entropy_store*r;
> struct fast_pool*fast_pool = this_cpu_ptr(_randomness);
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index 9b82549cbbc8..4b76ba23d666 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -115,7 +115,7 @@ static void hv_stimer0_isr(void)
>
> hv_cpu = this_cpu_ptr(hv_context.cpu_context);
> hv_cpu->clk_evt->event_handler(hv_cpu->clk_evt);
> -   add_interrupt_randomness(stimer0_vector, 0);
> +   add_interrupt_randomness(stimer0_vector);
>  }
>
>  static int hv_ce_set_next_event(unsigned long delta,
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index b10fe26c4891..41f4116f8c31 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1015,7 +1015,7 @@ static void vmbus_isr(void)
> tasklet_schedule(_cpu->msg_dpc);
> }
>
> -   add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR, 0);
> +   add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR);
>  }
>
>
> diff --git a/include/linux/random.h b/include/linux/random.h
> index 2ddf13b4281e..210400258fef 100644
> --- a/include/linux/random.h
> +++ b/include/linux/random.h
> @@ -32,7 +32,7 @@ static inline void add_latent_entropy(void) {}
>
>  extern void add_input_randomness(unsigned int type, unsigned int code,
>  unsigned int value) __latent_entropy;
> -extern void add_interrupt_randomness(int irq, int irq_flags) 
> __latent_entropy;
> +extern void add_interrupt_randomness(int irq) __latent_entropy;
>
>  extern void get_random_bytes(void *buf, int nbytes);
>  extern int wait_for_random_bytes(void);
> diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
> index 38554bc35375..e2f7afcb1ae6 100644
> --- a/kernel/irq/handle.c
> +++ b/kernel/irq/handle.c
> @@ -188,7 +188,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc)
>
> retval = __handle_irq_event_percpu(desc, );
>
> -   add_interrupt_randomness(desc->irq_data.irq, flags);
> +   add_interrupt_randomness(desc->irq_data.irq);
>
> if (!noirqdebug)
> note_interrupt(desc, retval);
> --
> 2.14.1


Re: Linux messages full of `random: get_random_u32 called from`

2018-05-01 Thread Sultan Alsawaf
On Tue, May 01, 2018 at 08:56:04PM -0400, Theodore Y. Ts'o wrote:
> On Tue, May 01, 2018 at 05:43:17PM -0700, Sultan Alsawaf wrote:
> > 
> > I've attached what I think is a reasonable stopgap solution until this is
> > actually fixed. If you're willing to revert the CVE-2018-1108 patches
> > completely, then I don't think you'll mind using this patch in the meantime.
> 
> I would put it slightly differently; reverting the CVE-2018-1108
> patches is less dangerous than what you are proposing in your attached
> patch.
> 
> Again, I think the right answer is to fix userspace to not require
> cryptographic grade entropy during early system startup, and for
> people to *think* about what they are doing.  I've looked at the
> systemd's use of hmac in journal-authenticate, and as near as I can
> tell, there isn't any kind of explanation about why it was necessary,
> or what threat it was trying to protect against.
> 
>   - Ted

Why is /dev/urandom so much more dangerous than /dev/random? The
more I search, the more I see that many sources consider /dev/urandom
to be cryptographically secure... and since I hold down a single key on
the keyboard to make my computer boot without any kernel workarounds,
I'm sure the NSA would eventually notice my predictable behavior and get
their hands on my Richard Stallman photos.

Fixing all the "broken" userspace instances of entropy usage during early
system startup is a tall order. What about barebone machines used as
remote servers? I feel like just "fixing userspace" isn't going to cover
all of the usecases that the CVE-2018-1108 patches broke.

Sultan


Re: Linux messages full of `random: get_random_u32 called from`

2018-05-01 Thread Sultan Alsawaf
On Tue, May 01, 2018 at 08:56:04PM -0400, Theodore Y. Ts'o wrote:
> On Tue, May 01, 2018 at 05:43:17PM -0700, Sultan Alsawaf wrote:
> > 
> > I've attached what I think is a reasonable stopgap solution until this is
> > actually fixed. If you're willing to revert the CVE-2018-1108 patches
> > completely, then I don't think you'll mind using this patch in the meantime.
> 
> I would put it slightly differently; reverting the CVE-2018-1108
> patches is less dangerous than what you are proposing in your attached
> patch.
> 
> Again, I think the right answer is to fix userspace to not require
> cryptographic grade entropy during early system startup, and for
> people to *think* about what they are doing.  I've looked at the
> systemd's use of hmac in journal-authenticate, and as near as I can
> tell, there isn't any kind of explanation about why it was necessary,
> or what threat it was trying to protect against.
> 
>   - Ted

Why is /dev/urandom so much more dangerous than /dev/random? The
more I search, the more I see that many sources consider /dev/urandom
to be cryptographically secure... and since I hold down a single key on
the keyboard to make my computer boot without any kernel workarounds,
I'm sure the NSA would eventually notice my predictable behavior and get
their hands on my Richard Stallman photos.

Fixing all the "broken" userspace instances of entropy usage during early
system startup is a tall order. What about barebone machines used as
remote servers? I feel like just "fixing userspace" isn't going to cover
all of the usecases that the CVE-2018-1108 patches broke.

Sultan


Re: Linux messages full of `random: get_random_u32 called from`

2018-05-01 Thread Sultan Alsawaf
On Tue, May 01, 2018 at 05:35:56PM -0500, Justin Forbes wrote:
> 
> I have not reproduced in GCE myself.  We did get some confirmation
> that removing dracut-fips does make the problem less dire (but I
> wouldn't call a 4 minute boot a win, but booting in 4 minutes is
> better than not booting at all).  Specifically systemd calls libgcrypt
> before it even opens the log with fips there, and this is before
> virtio-rng modules could even load. Right now though, we are looking
> at pretty much any possible options as the majority of people are
> calling for me to backout the patches completely from rawhide.

I've attached what I think is a reasonable stopgap solution until this is
actually fixed. If you're willing to revert the CVE-2018-1108 patches
completely, then I don't think you'll mind using this patch in the meantime.

Sultan

>From 5be2efdde744d3c55db3df81c0493fc67dc35620 Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf <sultan...@gmail.com>
Date: Tue, 1 May 2018 17:36:17 -0700
Subject: [PATCH] random: use urandom instead of random for now and speed up
 crng init

With the fixes for CVE-2018-1108, /dev/random now requires user-provided
entropy on quite a few machines lacking high levels of boot entropy
in order to complete its initialization. This causes issues on environments
where userspace depends on /dev/random in order to finish booting
completely (i.e., userspace will remain stuck, unable to boot, waiting for
entropy more-or-less indefinitely until the user provides it via something
like keystrokes or mouse movements).

As a temporary workaround, redirect /dev/random to /dev/urandom instead,
and speed up the initialization process by slightly relaxing the
threshold for interrupts to go towards adding one bit of entropy credit
(only until initialization is complete).

Signed-off-by: Sultan Alsawaf <sultan...@gmail.com>
---
 drivers/char/mem.c| 3 ++-
 drivers/char/random.c | 9 ++---
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index ffeb60d3434c..cc9507f01c79 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -870,7 +870,8 @@ static const struct memdev {
 #endif
 [5] = { "zero", 0666, _fops, 0 },
 [7] = { "full", 0666, _fops, 0 },
-[8] = { "random", 0666, _fops, 0 },
+/* Redirect /dev/random to /dev/urandom until /dev/random is fixed */
+[8] = { "random", 0666, _fops, 0 },
 [9] = { "urandom", 0666, _fops, 0 },
 #ifdef CONFIG_PRINTK
[11] = { "kmsg", 0644, _fops, 0 },
diff --git a/drivers/char/random.c b/drivers/char/random.c
index d9e38523b383..bce3b43cdd3b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1200,9 +1200,12 @@ void add_interrupt_randomness(int irq)
return;
}
 
-   if ((fast_pool->count < 64) &&
-   !time_after(now, fast_pool->last + HZ))
-   return;
+   if (fast_pool->count < 64) {
+   unsigned long timeout = crng_ready() ? HZ : HZ / 4;
+
+   if (!time_after(now, fast_pool->last + timeout))
+   return;
+   }
 
r = _pool;
if (!spin_trylock(>lock))
-- 
2.14.1


Re: Linux messages full of `random: get_random_u32 called from`

2018-05-01 Thread Sultan Alsawaf
On Tue, May 01, 2018 at 05:35:56PM -0500, Justin Forbes wrote:
> 
> I have not reproduced in GCE myself.  We did get some confirmation
> that removing dracut-fips does make the problem less dire (but I
> wouldn't call a 4 minute boot a win, but booting in 4 minutes is
> better than not booting at all).  Specifically systemd calls libgcrypt
> before it even opens the log with fips there, and this is before
> virtio-rng modules could even load. Right now though, we are looking
> at pretty much any possible options as the majority of people are
> calling for me to backout the patches completely from rawhide.

I've attached what I think is a reasonable stopgap solution until this is
actually fixed. If you're willing to revert the CVE-2018-1108 patches
completely, then I don't think you'll mind using this patch in the meantime.

Sultan

>From 5be2efdde744d3c55db3df81c0493fc67dc35620 Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf 
Date: Tue, 1 May 2018 17:36:17 -0700
Subject: [PATCH] random: use urandom instead of random for now and speed up
 crng init

With the fixes for CVE-2018-1108, /dev/random now requires user-provided
entropy on quite a few machines lacking high levels of boot entropy
in order to complete its initialization. This causes issues on environments
where userspace depends on /dev/random in order to finish booting
completely (i.e., userspace will remain stuck, unable to boot, waiting for
entropy more-or-less indefinitely until the user provides it via something
like keystrokes or mouse movements).

As a temporary workaround, redirect /dev/random to /dev/urandom instead,
and speed up the initialization process by slightly relaxing the
threshold for interrupts to go towards adding one bit of entropy credit
(only until initialization is complete).

Signed-off-by: Sultan Alsawaf 
---
 drivers/char/mem.c| 3 ++-
 drivers/char/random.c | 9 ++---
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index ffeb60d3434c..cc9507f01c79 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -870,7 +870,8 @@ static const struct memdev {
 #endif
 [5] = { "zero", 0666, _fops, 0 },
 [7] = { "full", 0666, _fops, 0 },
-[8] = { "random", 0666, _fops, 0 },
+/* Redirect /dev/random to /dev/urandom until /dev/random is fixed */
+[8] = { "random", 0666, _fops, 0 },
 [9] = { "urandom", 0666, _fops, 0 },
 #ifdef CONFIG_PRINTK
[11] = { "kmsg", 0644, _fops, 0 },
diff --git a/drivers/char/random.c b/drivers/char/random.c
index d9e38523b383..bce3b43cdd3b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1200,9 +1200,12 @@ void add_interrupt_randomness(int irq)
return;
}
 
-   if ((fast_pool->count < 64) &&
-   !time_after(now, fast_pool->last + HZ))
-   return;
+   if (fast_pool->count < 64) {
+   unsigned long timeout = crng_ready() ? HZ : HZ / 4;
+
+   if (!time_after(now, fast_pool->last + timeout))
+   return;
+   }
 
r = _pool;
if (!spin_trylock(>lock))
-- 
2.14.1


Re: [PATCH] random: remove unused argument from add_interrupt_randomness()

2018-04-30 Thread Sultan Alsawaf
>From a872cf4f0bb57a7bb3c95ea557082fb7733344f8 Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf <sultan...@gmail.com>
Date: Sun, 29 Apr 2018 20:04:35 -0700
Subject: [PATCH] random: remove unused argument from
 add_interrupt_randomness()

The irq_flags parameter is not used. Remove it.

Signed-off-by: Sultan Alsawaf <sultan...@gmail.com>
---
 drivers/char/random.c  | 4 ++--
 drivers/hv/hv.c| 2 +-
 drivers/hv/vmbus_drv.c | 2 +-
 include/linux/random.h | 2 +-
 kernel/irq/handle.c| 2 +-
 5 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 3cd3aae24d6d..125c1f545277 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -131,7 +131,7 @@
  * void add_device_randomness(const void *buf, unsigned int size);
  * void add_input_randomness(unsigned int type, unsigned int code,
  *unsigned int value);
- * void add_interrupt_randomness(int irq, int irq_flags);
+ * void add_interrupt_randomness(int irq);
  * void add_disk_randomness(struct gendisk *disk);
  *
  * add_device_randomness() is for adding data to the random pool that
@@ -1189,7 +1189,7 @@ static __u32 get_reg(struct fast_pool *f, struct pt_regs 
*regs)
return *ptr;
 }
 
-void add_interrupt_randomness(int irq, int irq_flags)
+void add_interrupt_randomness(int irq)
 {
struct entropy_store*r;
struct fast_pool*fast_pool = this_cpu_ptr(_randomness);
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index 9b82549cbbc8..4b76ba23d666 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -115,7 +115,7 @@ static void hv_stimer0_isr(void)
 
hv_cpu = this_cpu_ptr(hv_context.cpu_context);
hv_cpu->clk_evt->event_handler(hv_cpu->clk_evt);
-   add_interrupt_randomness(stimer0_vector, 0);
+   add_interrupt_randomness(stimer0_vector);
 }
 
 static int hv_ce_set_next_event(unsigned long delta,
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index b10fe26c4891..41f4116f8c31 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1015,7 +1015,7 @@ static void vmbus_isr(void)
tasklet_schedule(_cpu->msg_dpc);
}
 
-   add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR, 0);
+   add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR);
 }
 
 
diff --git a/include/linux/random.h b/include/linux/random.h
index 2ddf13b4281e..210400258fef 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -32,7 +32,7 @@ static inline void add_latent_entropy(void) {}
 
 extern void add_input_randomness(unsigned int type, unsigned int code,
 unsigned int value) __latent_entropy;
-extern void add_interrupt_randomness(int irq, int irq_flags) __latent_entropy;
+extern void add_interrupt_randomness(int irq) __latent_entropy;
 
 extern void get_random_bytes(void *buf, int nbytes);
 extern int wait_for_random_bytes(void);
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index 38554bc35375..e2f7afcb1ae6 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -188,7 +188,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc)
 
retval = __handle_irq_event_percpu(desc, );
 
-   add_interrupt_randomness(desc->irq_data.irq, flags);
+   add_interrupt_randomness(desc->irq_data.irq);
 
if (!noirqdebug)
note_interrupt(desc, retval);
-- 
2.14.1


Re: [PATCH] random: remove unused argument from add_interrupt_randomness()

2018-04-30 Thread Sultan Alsawaf
>From a872cf4f0bb57a7bb3c95ea557082fb7733344f8 Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf 
Date: Sun, 29 Apr 2018 20:04:35 -0700
Subject: [PATCH] random: remove unused argument from
 add_interrupt_randomness()

The irq_flags parameter is not used. Remove it.

Signed-off-by: Sultan Alsawaf 
---
 drivers/char/random.c  | 4 ++--
 drivers/hv/hv.c| 2 +-
 drivers/hv/vmbus_drv.c | 2 +-
 include/linux/random.h | 2 +-
 kernel/irq/handle.c| 2 +-
 5 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 3cd3aae24d6d..125c1f545277 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -131,7 +131,7 @@
  * void add_device_randomness(const void *buf, unsigned int size);
  * void add_input_randomness(unsigned int type, unsigned int code,
  *unsigned int value);
- * void add_interrupt_randomness(int irq, int irq_flags);
+ * void add_interrupt_randomness(int irq);
  * void add_disk_randomness(struct gendisk *disk);
  *
  * add_device_randomness() is for adding data to the random pool that
@@ -1189,7 +1189,7 @@ static __u32 get_reg(struct fast_pool *f, struct pt_regs 
*regs)
return *ptr;
 }
 
-void add_interrupt_randomness(int irq, int irq_flags)
+void add_interrupt_randomness(int irq)
 {
struct entropy_store*r;
struct fast_pool*fast_pool = this_cpu_ptr(_randomness);
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index 9b82549cbbc8..4b76ba23d666 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -115,7 +115,7 @@ static void hv_stimer0_isr(void)
 
hv_cpu = this_cpu_ptr(hv_context.cpu_context);
hv_cpu->clk_evt->event_handler(hv_cpu->clk_evt);
-   add_interrupt_randomness(stimer0_vector, 0);
+   add_interrupt_randomness(stimer0_vector);
 }
 
 static int hv_ce_set_next_event(unsigned long delta,
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index b10fe26c4891..41f4116f8c31 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1015,7 +1015,7 @@ static void vmbus_isr(void)
tasklet_schedule(_cpu->msg_dpc);
}
 
-   add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR, 0);
+   add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR);
 }
 
 
diff --git a/include/linux/random.h b/include/linux/random.h
index 2ddf13b4281e..210400258fef 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -32,7 +32,7 @@ static inline void add_latent_entropy(void) {}
 
 extern void add_input_randomness(unsigned int type, unsigned int code,
 unsigned int value) __latent_entropy;
-extern void add_interrupt_randomness(int irq, int irq_flags) __latent_entropy;
+extern void add_interrupt_randomness(int irq) __latent_entropy;
 
 extern void get_random_bytes(void *buf, int nbytes);
 extern int wait_for_random_bytes(void);
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index 38554bc35375..e2f7afcb1ae6 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -188,7 +188,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc)
 
retval = __handle_irq_event_percpu(desc, );
 
-   add_interrupt_randomness(desc->irq_data.irq, flags);
+   add_interrupt_randomness(desc->irq_data.irq);
 
if (!noirqdebug)
note_interrupt(desc, retval);
-- 
2.14.1


Re: Linux messages full of `random: get_random_u32 called from`

2018-04-29 Thread Sultan Alsawaf
On Sun, Apr 29, 2018 at 08:11:07PM -0400, Theodore Y. Ts'o wrote:
>
> What your patch does is assume that there is a full bit of uncertainty
> that can be obtained from the information gathered from each
> interrupt.  I *might* be willing to assume that to be valid on x86
> systems that have a high resolution cycle counter.  But on ARM
> platforms, especially during system bootup when the user isn't typing
> anything and SSD's and flash storage tend to have very predictable
> timing patterns?  Not a bet I'd be willing to take.  Even with a cycle
> counter, there's a reason why we assumed that we need to mix in timing
> results from 64 interrupts or one second's worth before we would give
> a single bit's worth of entropy credit.
> 
>   - Ted

What about abusing high-resolution timers to get entropy? Since hrtimers can't
make guarantees down to the nanosecond, there's always a skew between the
requested expiry time and the actual expiry time.

Please see the attached patch and let me know just how horrible it is.

Sultan

>From b0d21c38558c661531d4cb46816fbb36b874a169 Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf <sultan...@gmail.com>
Date: Sun, 29 Apr 2018 21:28:08 -0700
Subject: [PATCH] random: use high-res timers to generate entropy until crng
 init is done

---
 drivers/char/random.c | 47 +++
 1 file changed, 47 insertions(+)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d9e38523b383..af2d60bbcec3 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -286,6 +286,7 @@
 #define OUTPUT_POOL_WORDS  (1 << (OUTPUT_POOL_SHIFT-5))
 #define SEC_XFER_SIZE  512
 #define EXTRACT_SIZE   10
+#define ENTROPY_GEN_INTVL_NS   (1 * NSEC_PER_MSEC)
 
 
 #define LONGS(x) (((x) + sizeof(unsigned long) - 1)/sizeof(unsigned long))
@@ -408,6 +409,8 @@ static struct fasync_struct *fasync;
 static DEFINE_SPINLOCK(random_ready_list_lock);
 static LIST_HEAD(random_ready_list);
 
+static struct hrtimer entropy_gen_hrtimer;
+
 struct crng_state {
__u32   state[16];
unsigned long   init_time;
@@ -2287,3 +2290,47 @@ void add_hwgenerator_randomness(const char *buffer, 
size_t count,
credit_entropy_bits(poolp, entropy);
 }
 EXPORT_SYMBOL_GPL(add_hwgenerator_randomness);
+
+/*
+ * Generate entropy on init using high-res timers. Although high-res timers
+ * provide nanosecond precision, they don't actually honor requests to the
+ * nanosecond. The skew between the expected time difference in nanoseconds and
+ * the actual time difference can be used as a way to generate entropy on boot
+ * for machines that lack sufficient boot-time entropy.
+ */
+static enum hrtimer_restart entropy_timer_cb(struct hrtimer *timer)
+{
+   static u64 prev_ns;
+   u64 curr_ns, delta;
+
+   if (crng_ready())
+   return HRTIMER_NORESTART;
+
+   curr_ns = ktime_get_mono_fast_ns();
+   delta = curr_ns - prev_ns;
+
+   add_interrupt_randomness(delta);
+
+   /* Use the hrtimer skew to make the next interval more unpredictable */
+   if (likely(prev_ns))
+   hrtimer_add_expires_ns(timer, delta);
+   else
+   hrtimer_add_expires_ns(timer, ENTROPY_GEN_INTVL_NS);
+
+   prev_ns = curr_ns;
+   return HRTIMER_RESTART;
+}
+
+static int entropy_gen_hrtimer_init(void)
+{
+   if (!IS_ENABLED(CONFIG_HIGH_RES_TIMERS))
+   return 0;
+
+   hrtimer_init(_gen_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+
+   entropy_gen_hrtimer.function = entropy_timer_cb;
+   hrtimer_start(_gen_hrtimer, ns_to_ktime(ENTROPY_GEN_INTVL_NS),
+   HRTIMER_MODE_REL);
+   return 0;
+}
+core_initcall(entropy_gen_hrtimer_init);
-- 
2.14.1


Re: Linux messages full of `random: get_random_u32 called from`

2018-04-29 Thread Sultan Alsawaf
On Sun, Apr 29, 2018 at 08:11:07PM -0400, Theodore Y. Ts'o wrote:
>
> What your patch does is assume that there is a full bit of uncertainty
> that can be obtained from the information gathered from each
> interrupt.  I *might* be willing to assume that to be valid on x86
> systems that have a high resolution cycle counter.  But on ARM
> platforms, especially during system bootup when the user isn't typing
> anything and SSD's and flash storage tend to have very predictable
> timing patterns?  Not a bet I'd be willing to take.  Even with a cycle
> counter, there's a reason why we assumed that we need to mix in timing
> results from 64 interrupts or one second's worth before we would give
> a single bit's worth of entropy credit.
> 
>   - Ted

What about abusing high-resolution timers to get entropy? Since hrtimers can't
make guarantees down to the nanosecond, there's always a skew between the
requested expiry time and the actual expiry time.

Please see the attached patch and let me know just how horrible it is.

Sultan

>From b0d21c38558c661531d4cb46816fbb36b874a169 Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf 
Date: Sun, 29 Apr 2018 21:28:08 -0700
Subject: [PATCH] random: use high-res timers to generate entropy until crng
 init is done

---
 drivers/char/random.c | 47 +++
 1 file changed, 47 insertions(+)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d9e38523b383..af2d60bbcec3 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -286,6 +286,7 @@
 #define OUTPUT_POOL_WORDS  (1 << (OUTPUT_POOL_SHIFT-5))
 #define SEC_XFER_SIZE  512
 #define EXTRACT_SIZE   10
+#define ENTROPY_GEN_INTVL_NS   (1 * NSEC_PER_MSEC)
 
 
 #define LONGS(x) (((x) + sizeof(unsigned long) - 1)/sizeof(unsigned long))
@@ -408,6 +409,8 @@ static struct fasync_struct *fasync;
 static DEFINE_SPINLOCK(random_ready_list_lock);
 static LIST_HEAD(random_ready_list);
 
+static struct hrtimer entropy_gen_hrtimer;
+
 struct crng_state {
__u32   state[16];
unsigned long   init_time;
@@ -2287,3 +2290,47 @@ void add_hwgenerator_randomness(const char *buffer, 
size_t count,
credit_entropy_bits(poolp, entropy);
 }
 EXPORT_SYMBOL_GPL(add_hwgenerator_randomness);
+
+/*
+ * Generate entropy on init using high-res timers. Although high-res timers
+ * provide nanosecond precision, they don't actually honor requests to the
+ * nanosecond. The skew between the expected time difference in nanoseconds and
+ * the actual time difference can be used as a way to generate entropy on boot
+ * for machines that lack sufficient boot-time entropy.
+ */
+static enum hrtimer_restart entropy_timer_cb(struct hrtimer *timer)
+{
+   static u64 prev_ns;
+   u64 curr_ns, delta;
+
+   if (crng_ready())
+   return HRTIMER_NORESTART;
+
+   curr_ns = ktime_get_mono_fast_ns();
+   delta = curr_ns - prev_ns;
+
+   add_interrupt_randomness(delta);
+
+   /* Use the hrtimer skew to make the next interval more unpredictable */
+   if (likely(prev_ns))
+   hrtimer_add_expires_ns(timer, delta);
+   else
+   hrtimer_add_expires_ns(timer, ENTROPY_GEN_INTVL_NS);
+
+   prev_ns = curr_ns;
+   return HRTIMER_RESTART;
+}
+
+static int entropy_gen_hrtimer_init(void)
+{
+   if (!IS_ENABLED(CONFIG_HIGH_RES_TIMERS))
+   return 0;
+
+   hrtimer_init(_gen_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+
+   entropy_gen_hrtimer.function = entropy_timer_cb;
+   hrtimer_start(_gen_hrtimer, ns_to_ktime(ENTROPY_GEN_INTVL_NS),
+   HRTIMER_MODE_REL);
+   return 0;
+}
+core_initcall(entropy_gen_hrtimer_init);
-- 
2.14.1


[PATCH] random: remove unused argument from add_interrupt_randomness()

2018-04-29 Thread Sultan Alsawaf
>From cdc2a03f93fdec88ad040a212605e20ab97c3e19 Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf <sultan...@gmail.com>
Date: Sun, 29 Apr 2018 20:04:35 -0700
Subject: [PATCH] random: remove unused argument from add_interrupt_randomness()

The irq_flags parameter is not used. Remove it.

Signed-off-by: Sultan Alsawaf <sultan...@gmail.com>
---
 drivers/char/random.c  | 4 ++--
 drivers/hv/vmbus_drv.c | 2 +-
 include/linux/random.h | 2 +-
 kernel/irq/handle.c| 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 38729baed6ee..d9e38523b383 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -131,7 +131,7 @@
  * void add_device_randomness(const void *buf, unsigned int size);
  * void add_input_randomness(unsigned int type, unsigned int code,
  *unsigned int value);
- * void add_interrupt_randomness(int irq, int irq_flags);
+ * void add_interrupt_randomness(int irq);
  * void add_disk_randomness(struct gendisk *disk);
  *
  * add_device_randomness() is for adding data to the random pool that
@@ -1164,7 +1164,7 @@ static __u32 get_reg(struct fast_pool *f, struct pt_regs 
*regs)
return *ptr;
 }
 
-void add_interrupt_randomness(int irq, int irq_flags)
+void add_interrupt_randomness(int irq)
 {
struct entropy_store*r;
struct fast_pool*fast_pool = this_cpu_ptr(_randomness);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index bc65c4d79c1f..777096d560cb 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1016,7 +1016,7 @@ static void vmbus_isr(void)
tasklet_schedule(_cpu->msg_dpc);
}
 
-   add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR, 0);
+   add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR);
 }
 
 
diff --git a/include/linux/random.h b/include/linux/random.h
index 4024f7d9c77d..b71f87dbf7cd 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -32,7 +32,7 @@ static inline void add_latent_entropy(void) {}
 
 extern void add_input_randomness(unsigned int type, unsigned int code,
 unsigned int value) __latent_entropy;
-extern void add_interrupt_randomness(int irq, int irq_flags) __latent_entropy;
+extern void add_interrupt_randomness(int irq) __latent_entropy;
 
 extern void get_random_bytes(void *buf, int nbytes);
 extern int wait_for_random_bytes(void);
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index 79f987b942b8..1bc4dcc489d0 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -186,7 +186,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc)
 
retval = __handle_irq_event_percpu(desc, );
 
-   add_interrupt_randomness(desc->irq_data.irq, flags);
+   add_interrupt_randomness(desc->irq_data.irq);
 
if (!noirqdebug)
note_interrupt(desc, retval);
-- 
2.14.1


[PATCH] random: remove unused argument from add_interrupt_randomness()

2018-04-29 Thread Sultan Alsawaf
>From cdc2a03f93fdec88ad040a212605e20ab97c3e19 Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf 
Date: Sun, 29 Apr 2018 20:04:35 -0700
Subject: [PATCH] random: remove unused argument from add_interrupt_randomness()

The irq_flags parameter is not used. Remove it.

Signed-off-by: Sultan Alsawaf 
---
 drivers/char/random.c  | 4 ++--
 drivers/hv/vmbus_drv.c | 2 +-
 include/linux/random.h | 2 +-
 kernel/irq/handle.c| 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 38729baed6ee..d9e38523b383 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -131,7 +131,7 @@
  * void add_device_randomness(const void *buf, unsigned int size);
  * void add_input_randomness(unsigned int type, unsigned int code,
  *unsigned int value);
- * void add_interrupt_randomness(int irq, int irq_flags);
+ * void add_interrupt_randomness(int irq);
  * void add_disk_randomness(struct gendisk *disk);
  *
  * add_device_randomness() is for adding data to the random pool that
@@ -1164,7 +1164,7 @@ static __u32 get_reg(struct fast_pool *f, struct pt_regs 
*regs)
return *ptr;
 }
 
-void add_interrupt_randomness(int irq, int irq_flags)
+void add_interrupt_randomness(int irq)
 {
struct entropy_store*r;
struct fast_pool*fast_pool = this_cpu_ptr(_randomness);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index bc65c4d79c1f..777096d560cb 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1016,7 +1016,7 @@ static void vmbus_isr(void)
tasklet_schedule(_cpu->msg_dpc);
}
 
-   add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR, 0);
+   add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR);
 }
 
 
diff --git a/include/linux/random.h b/include/linux/random.h
index 4024f7d9c77d..b71f87dbf7cd 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -32,7 +32,7 @@ static inline void add_latent_entropy(void) {}
 
 extern void add_input_randomness(unsigned int type, unsigned int code,
 unsigned int value) __latent_entropy;
-extern void add_interrupt_randomness(int irq, int irq_flags) __latent_entropy;
+extern void add_interrupt_randomness(int irq) __latent_entropy;
 
 extern void get_random_bytes(void *buf, int nbytes);
 extern int wait_for_random_bytes(void);
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index 79f987b942b8..1bc4dcc489d0 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -186,7 +186,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc)
 
retval = __handle_irq_event_percpu(desc, );
 
-   add_interrupt_randomness(desc->irq_data.irq, flags);
+   add_interrupt_randomness(desc->irq_data.irq);
 
if (!noirqdebug)
note_interrupt(desc, retval);
-- 
2.14.1


Re: Linux messages full of `random: get_random_u32 called from`

2018-04-29 Thread Sultan Alsawaf
On Mon, Apr 30, 2018 at 12:43:48AM +0200, Jason A. Donenfeld wrote:
> > -   if ((fast_pool->count < 64) &&
> > -   !time_after(now, fast_pool->last + HZ))
> > -   return;
> > -
> 
> I suspect you still want the rate-limiting in place. But if you _do_
> want to cheat like this, you could instead just modify the condition
> to only relax the rate limiting when !crng_init().

Good idea. Attached a new patch that's less intrusive. It still fixes my issue,
of course.

Sultan

>From 6870b0383b88438d842599aa8608a260e6fb0ed2 Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf <sultan...@gmail.com>
Date: Sun, 29 Apr 2018 15:44:27 -0700
Subject: [PATCH] random: don't ratelimit add_interrupt_randomness() until crng
 is ready

---
 drivers/char/random.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 38729baed6ee..8c00c008e797 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1201,7 +1201,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
}
 
if ((fast_pool->count < 64) &&
-   !time_after(now, fast_pool->last + HZ))
+   !time_after(now, fast_pool->last + HZ) && crng_ready())
return;
 
r = _pool;
-- 
2.14.1



Re: Linux messages full of `random: get_random_u32 called from`

2018-04-29 Thread Sultan Alsawaf
On Mon, Apr 30, 2018 at 12:43:48AM +0200, Jason A. Donenfeld wrote:
> > -   if ((fast_pool->count < 64) &&
> > -   !time_after(now, fast_pool->last + HZ))
> > -   return;
> > -
> 
> I suspect you still want the rate-limiting in place. But if you _do_
> want to cheat like this, you could instead just modify the condition
> to only relax the rate limiting when !crng_init().

Good idea. Attached a new patch that's less intrusive. It still fixes my issue,
of course.

Sultan

>From 6870b0383b88438d842599aa8608a260e6fb0ed2 Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf 
Date: Sun, 29 Apr 2018 15:44:27 -0700
Subject: [PATCH] random: don't ratelimit add_interrupt_randomness() until crng
 is ready

---
 drivers/char/random.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 38729baed6ee..8c00c008e797 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1201,7 +1201,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
}
 
if ((fast_pool->count < 64) &&
-   !time_after(now, fast_pool->last + HZ))
+   !time_after(now, fast_pool->last + HZ) && crng_ready())
return;
 
r = _pool;
-- 
2.14.1



Re: Linux messages full of `random: get_random_u32 called from`

2018-04-29 Thread Sultan Alsawaf
On Sun, Apr 29, 2018 at 06:05:19PM -0400, Theodore Y. Ts'o wrote:
> It's more accurate to say that using /dev/urandom is no worse than
> before (from a few years ago).  There are, alas, plenty of
> distributions and user space application programmers that basically
> got lazy using /dev/urandom, and assumed that there would be plenty of
> entropy during early system startup.
> 
> When they switched over the getrandom(2), the most egregious examples
> of this caused pain (and they got fixed), but due to a bug in
> drivers/char/random.c, if getrandom(2) was called after the entropy
> pool was "half initialized", it would not block, but proceed.
> 
> Is that exploitable?  Well, Jann and I didn't find an _obvious_ way to
> exploit the short coming, which is this wasn't treated like an
> emergency situation ala the embarassing situation we had five years
> ago[1].
> 
> [1] https://factorable.net/paper.html
> 
> However, it was enough to make us be uncomfortable, which is why I
> pushed the changes that I did.  At least on the devices we had at
> hand, using the distributions that we typically use, the impact seemed
> minimal.  Unfortuantely, there is no way to know for sure without
> rolling out change and seeing who screams.  In the ideal world,
> software would not require cryptographic randomness immediately after
> boot, before the user logs in.  And ***really***, as in [1], softwaret
> should not be generating long-term public keys that are essential to
> the security of the box a few seconds immediately after the device is
> first unboxed and plugged in.i
> 
> What would be useful is if people gave reports that listed exactly
> what laptop and distributions they are using.  Just "a high spec x86
> laptop" isn't terribly useful, because *my* brand-new Dell XPS 13
> running Debian testing is working just fine.  The year, model, make,
> and CPU type plus what distribution (and distro version number) you
> are running is useful, so I can assess how wide spread the unhappiness
> is going to be, and what mitigation steps make sense.
> 
> 
> What mitigations steps can be taken?
> 
> If you believe in security-through-complexity (the cache architecture
> of x86 is *so* complicated no one can understand it, so
> Jitterentropy / Haveged *must* be secure), or security-through-secrecy
> (the cache architecture of x86 is only avilable to internal architects
> inside Intel, so Jitterentropy / Haveged *must* be secure, never mind
> that the Intel CPU architects who were asked about it were "nervous"),
> then wiring up CONFIG_JITTERENTROPY or using haveged might be one
> approach.
> 
> If you believe that Intel hasn't backdoored RDRAND, then installing
> rng-tools and running rngd with --enable-drng will enable RDRAND.
> That seems to be popular with various defense contractors, perhaps on
> the assumption that if it _was_ backdoored (no one knows for sure), it
> was probably with the connivance or request of the US government, who
> doesn't need to worry about spying on itself.
> 
> Or you can use some kind of open hardware design RNG, such as
> ChoasKey[2] from Altus Metrum.  But that requires using specially
> ordered hardware plugged into a USB slot, and it's probably not a mass
> solution.
> 
> [2] https://altusmetrum.org/ChaosKey/
> 
> 
> Personally, I prefer fixing the software to simply not require
> cryptographic grade entropy before the user has logged in.  Because
> it's better than the alternatives.
> 
>   - Ted
> 

The attached patch fixes my crng init woes. With it, crng init completes 0.86
seconds into boot, but I can't help but feel like a solution this obvious would
just expose my Richard Stallman photo collection to prying eyes at the NSA.

Thoughts on the patch?

Sultan

>From 597b0f2b3c986f853bf1d30a7fb9d76869e47fe8 Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf <sultan...@gmail.com>
Date: Sun, 29 Apr 2018 15:22:59 -0700
Subject: [PATCH] random: remove ratelimiting from add_interrupt_randomness()

---
 drivers/char/random.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 38729baed6ee..5b38277b104a 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -574,7 +574,6 @@ static void mix_pool_bytes(struct entropy_store *r, const 
void *in,
 
 struct fast_pool {
__u32   pool[4];
-   unsigned long   last;
unsigned short  reg_idx;
unsigned char   count;
 };
@@ -1195,20 +1194,14 @@ void add_interrupt_randomness(int irq, int irq_flags)
crng_fast_load((char *) fast_pool->pool,
   sizeof(fast_pool->pool))) {
fast_pool->count = 0;
-   

Re: Linux messages full of `random: get_random_u32 called from`

2018-04-29 Thread Sultan Alsawaf
On Sun, Apr 29, 2018 at 06:05:19PM -0400, Theodore Y. Ts'o wrote:
> It's more accurate to say that using /dev/urandom is no worse than
> before (from a few years ago).  There are, alas, plenty of
> distributions and user space application programmers that basically
> got lazy using /dev/urandom, and assumed that there would be plenty of
> entropy during early system startup.
> 
> When they switched over the getrandom(2), the most egregious examples
> of this caused pain (and they got fixed), but due to a bug in
> drivers/char/random.c, if getrandom(2) was called after the entropy
> pool was "half initialized", it would not block, but proceed.
> 
> Is that exploitable?  Well, Jann and I didn't find an _obvious_ way to
> exploit the short coming, which is this wasn't treated like an
> emergency situation ala the embarassing situation we had five years
> ago[1].
> 
> [1] https://factorable.net/paper.html
> 
> However, it was enough to make us be uncomfortable, which is why I
> pushed the changes that I did.  At least on the devices we had at
> hand, using the distributions that we typically use, the impact seemed
> minimal.  Unfortuantely, there is no way to know for sure without
> rolling out change and seeing who screams.  In the ideal world,
> software would not require cryptographic randomness immediately after
> boot, before the user logs in.  And ***really***, as in [1], softwaret
> should not be generating long-term public keys that are essential to
> the security of the box a few seconds immediately after the device is
> first unboxed and plugged in.i
> 
> What would be useful is if people gave reports that listed exactly
> what laptop and distributions they are using.  Just "a high spec x86
> laptop" isn't terribly useful, because *my* brand-new Dell XPS 13
> running Debian testing is working just fine.  The year, model, make,
> and CPU type plus what distribution (and distro version number) you
> are running is useful, so I can assess how wide spread the unhappiness
> is going to be, and what mitigation steps make sense.
> 
> 
> What mitigations steps can be taken?
> 
> If you believe in security-through-complexity (the cache architecture
> of x86 is *so* complicated no one can understand it, so
> Jitterentropy / Haveged *must* be secure), or security-through-secrecy
> (the cache architecture of x86 is only avilable to internal architects
> inside Intel, so Jitterentropy / Haveged *must* be secure, never mind
> that the Intel CPU architects who were asked about it were "nervous"),
> then wiring up CONFIG_JITTERENTROPY or using haveged might be one
> approach.
> 
> If you believe that Intel hasn't backdoored RDRAND, then installing
> rng-tools and running rngd with --enable-drng will enable RDRAND.
> That seems to be popular with various defense contractors, perhaps on
> the assumption that if it _was_ backdoored (no one knows for sure), it
> was probably with the connivance or request of the US government, who
> doesn't need to worry about spying on itself.
> 
> Or you can use some kind of open hardware design RNG, such as
> ChoasKey[2] from Altus Metrum.  But that requires using specially
> ordered hardware plugged into a USB slot, and it's probably not a mass
> solution.
> 
> [2] https://altusmetrum.org/ChaosKey/
> 
> 
> Personally, I prefer fixing the software to simply not require
> cryptographic grade entropy before the user has logged in.  Because
> it's better than the alternatives.
> 
>   - Ted
> 

The attached patch fixes my crng init woes. With it, crng init completes 0.86
seconds into boot, but I can't help but feel like a solution this obvious would
just expose my Richard Stallman photo collection to prying eyes at the NSA.

Thoughts on the patch?

Sultan

>From 597b0f2b3c986f853bf1d30a7fb9d76869e47fe8 Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf 
Date: Sun, 29 Apr 2018 15:22:59 -0700
Subject: [PATCH] random: remove ratelimiting from add_interrupt_randomness()

---
 drivers/char/random.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 38729baed6ee..5b38277b104a 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -574,7 +574,6 @@ static void mix_pool_bytes(struct entropy_store *r, const 
void *in,
 
 struct fast_pool {
__u32   pool[4];
-   unsigned long   last;
unsigned short  reg_idx;
unsigned char   count;
 };
@@ -1195,20 +1194,14 @@ void add_interrupt_randomness(int irq, int irq_flags)
crng_fast_load((char *) fast_pool->pool,
   sizeof(fast_pool->pool))) {
fast_pool->count = 0;
-   

Re: Linux messages full of `random: get_random_u32 called from`

2018-04-29 Thread Sultan Alsawaf
On Sun, Apr 29, 2018 at 11:18:55PM +0200, Pavel Machek wrote:
> So -- I'm pretty sure systemd and friends should be using
> /dev/urandom. Maybe gpg wants to use /dev/random. _Maybe_.
> 
> [2.948192] random: systemd: uninitialized urandom read (16 bytes
> read)
> [2.953526] systemd[1]: systemd 215 running in system mode. (+PAM
> +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ
> -SECCOMP -APPARMOR)
> [2.980278] systemd[1]: Detected architecture 'x86'.
> [3.115072] usb 5-2: New USB device found, idVendor=0483,
> idProduct=2016, bcdDevice= 0.01
> [3.119633] usb 5-2: New USB device strings: Mfr=1, Product=2,
> SerialNumber=0
> [3.124147] usb 5-2: Product: Biometric Coprocessor
> [3.128621] usb 5-2: Manufacturer: STMicroelectronics
> [3.163839] systemd[1]: Failed to insert module 'ipv6'
> [3.181266] systemd[1]: Set hostname to .
> [3.267243] random: systemd-sysv-ge: uninitialized urandom read (16
> bytes read)
> [3.669590] random: systemd-sysv-ge: uninitialized urandom read (16
> bytes read)
> [3.696242] random: systemd: uninitialized urandom read (16 bytes
> read)
> [3.700066] random: systemd: uninitialized urandom read (16 bytes
> read)
> [3.703716] random: systemd: uninitialized urandom read (16 bytes
> read)
> 
> Anyway, urandom should need to be seeded once, and then provide random
> data forever... which is not impression I get from the dmesg output
> above. Boot clearly proceeds... somehow. So now I'm confused.

Hmm... Well, the attached patch (which redirects /dev/random to /dev/urandom)
didn't fix my boot issue, so I'm at a loss as well.

Sultan

>From 15f54e2756866956d8713fdec92b54c6c69eb1bb Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf <sultan...@gmail.com>
Date: Sun, 29 Apr 2018 12:53:44 -0700
Subject: [PATCH] char: mem: Link /dev/random to /dev/urandom

---
 drivers/char/mem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index ffeb60d3434c..0cd22e6100ad 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -870,7 +870,7 @@ static const struct memdev {
 #endif
 [5] = { "zero", 0666, _fops, 0 },
 [7] = { "full", 0666, _fops, 0 },
-[8] = { "random", 0666, _fops, 0 },
+[8] = { "random", 0666, _fops, 0 },
 [9] = { "urandom", 0666, _fops, 0 },
 #ifdef CONFIG_PRINTK
[11] = { "kmsg", 0644, _fops, 0 },
-- 
2.14.1


Re: Linux messages full of `random: get_random_u32 called from`

2018-04-29 Thread Sultan Alsawaf
On Sun, Apr 29, 2018 at 11:18:55PM +0200, Pavel Machek wrote:
> So -- I'm pretty sure systemd and friends should be using
> /dev/urandom. Maybe gpg wants to use /dev/random. _Maybe_.
> 
> [2.948192] random: systemd: uninitialized urandom read (16 bytes
> read)
> [2.953526] systemd[1]: systemd 215 running in system mode. (+PAM
> +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ
> -SECCOMP -APPARMOR)
> [2.980278] systemd[1]: Detected architecture 'x86'.
> [3.115072] usb 5-2: New USB device found, idVendor=0483,
> idProduct=2016, bcdDevice= 0.01
> [3.119633] usb 5-2: New USB device strings: Mfr=1, Product=2,
> SerialNumber=0
> [3.124147] usb 5-2: Product: Biometric Coprocessor
> [3.128621] usb 5-2: Manufacturer: STMicroelectronics
> [3.163839] systemd[1]: Failed to insert module 'ipv6'
> [3.181266] systemd[1]: Set hostname to .
> [3.267243] random: systemd-sysv-ge: uninitialized urandom read (16
> bytes read)
> [3.669590] random: systemd-sysv-ge: uninitialized urandom read (16
> bytes read)
> [3.696242] random: systemd: uninitialized urandom read (16 bytes
> read)
> [3.700066] random: systemd: uninitialized urandom read (16 bytes
> read)
> [3.703716] random: systemd: uninitialized urandom read (16 bytes
> read)
> 
> Anyway, urandom should need to be seeded once, and then provide random
> data forever... which is not impression I get from the dmesg output
> above. Boot clearly proceeds... somehow. So now I'm confused.

Hmm... Well, the attached patch (which redirects /dev/random to /dev/urandom)
didn't fix my boot issue, so I'm at a loss as well.

Sultan

>From 15f54e2756866956d8713fdec92b54c6c69eb1bb Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf 
Date: Sun, 29 Apr 2018 12:53:44 -0700
Subject: [PATCH] char: mem: Link /dev/random to /dev/urandom

---
 drivers/char/mem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index ffeb60d3434c..0cd22e6100ad 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -870,7 +870,7 @@ static const struct memdev {
 #endif
 [5] = { "zero", 0666, _fops, 0 },
 [7] = { "full", 0666, _fops, 0 },
-[8] = { "random", 0666, _fops, 0 },
+[8] = { "random", 0666, _fops, 0 },
 [9] = { "urandom", 0666, _fops, 0 },
 #ifdef CONFIG_PRINTK
[11] = { "kmsg", 0644, _fops, 0 },
-- 
2.14.1


Re: Linux messages full of `random: get_random_u32 called from`

2018-04-29 Thread Sultan Alsawaf
On Sun, Apr 29, 2018 at 08:41:01PM +0200, Pavel Machek wrote:
> Umm. No. https://www.youtube.com/watch?v=xneBjc8z0DE

Okay, but /dev/urandom isn't a solution to this problem because it isn't usable
until crng init is complete, so it suffers from the same init lag as
/dev/random.

Sultan


Re: Linux messages full of `random: get_random_u32 called from`

2018-04-29 Thread Sultan Alsawaf
On Sun, Apr 29, 2018 at 08:41:01PM +0200, Pavel Machek wrote:
> Umm. No. https://www.youtube.com/watch?v=xneBjc8z0DE

Okay, but /dev/urandom isn't a solution to this problem because it isn't usable
until crng init is complete, so it suffers from the same init lag as
/dev/random.

Sultan


Re: Linux messages full of `random: get_random_u32 called from`

2018-04-29 Thread Sultan Alsawaf
I'd also like to add that my high-spec x86 laptop exhibits the same issue as
my Edgar Chromebook.

Here's my dmesg: https://hastebin.com/dofejolobi.go

The most interesting line:
[   90.811633] random: crng init done

I waited 90 seconds after boot to provide entropy myself, at which point crng
init completed. In other words, crng init only completed because I provided
the entropy by smashing the keyboard. I could've waited longer and crng init
wouldn't have completed without my input.

Mind you, this laptop has a 45W CPU, so power savings were definitely not
considered in its design. Do you have any machines that can provide enough
boot entropy to satisfy crng init without requiring user-provided entropy?

Sultan


Re: Linux messages full of `random: get_random_u32 called from`

2018-04-29 Thread Sultan Alsawaf
I'd also like to add that my high-spec x86 laptop exhibits the same issue as
my Edgar Chromebook.

Here's my dmesg: https://hastebin.com/dofejolobi.go

The most interesting line:
[   90.811633] random: crng init done

I waited 90 seconds after boot to provide entropy myself, at which point crng
init completed. In other words, crng init only completed because I provided
the entropy by smashing the keyboard. I could've waited longer and crng init
wouldn't have completed without my input.

Mind you, this laptop has a 45W CPU, so power savings were definitely not
considered in its design. Do you have any machines that can provide enough
boot entropy to satisfy crng init without requiring user-provided entropy?

Sultan


Re: Linux messages full of `random: get_random_u32 called from`

2018-04-29 Thread Sultan Alsawaf
On Sun, Apr 29, 2018 at 04:32:05PM +0200, Pavel Machek wrote:
> Hi!
> 
> > This is why ultimately, we do need to attack this problem from both
> > ends, which means teaching userspace programs to only request
> > cryptographic-grade randomness when it is really needed --- and most
> > of the time, if the user has not logged in yet, you probably don't
> > need cryptographic-grade randomness
> 
> IOW moving them from /dev/random to /dev/urandom?
>   Pavel
> 
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) 
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

/dev/urandom isn't cryptographically secure, so that's not an option.


Re: Linux messages full of `random: get_random_u32 called from`

2018-04-29 Thread Sultan Alsawaf
On Sun, Apr 29, 2018 at 04:32:05PM +0200, Pavel Machek wrote:
> Hi!
> 
> > This is why ultimately, we do need to attack this problem from both
> > ends, which means teaching userspace programs to only request
> > cryptographic-grade randomness when it is really needed --- and most
> > of the time, if the user has not logged in yet, you probably don't
> > need cryptographic-grade randomness
> 
> IOW moving them from /dev/random to /dev/urandom?
>   Pavel
> 
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) 
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

/dev/urandom isn't cryptographically secure, so that's not an option.


Re: Linux messages full of `random: get_random_u32 called from`

2018-04-27 Thread Sultan Alsawaf
> On Thu, Apr 26, 2018 at 10:20:44PM -0700, Sultan Alsawaf wrote:
>> I noted at least 20,000 mmc interrupts before I intervened in the boot 
>> process to provide entropy
>> myself. That's just for mmc, so I'm sure there were even more interrupts 
>> elsewhere. Is 20k+ interrupts
>> really not sufficient?
> How did you determine that there were 20,000 mmc interrupts before you
> had logged in?  Did you have access to the machine w/o having access
> to the login prompt?
>
> I can send a patch (see attached) that will spew large amounts of logs
> as each interrupt comes in and the entropy pool is getting intialized.
> That's how I test things on QEMU, and Jann did something similar on a
> (physical) test machine, so I'm pretty confident that if you were
> getting interrupts, it would result in them contributing into the
> pool.
>
> You will need a serial console, or build a kernel with a much larger
> dmesg buffer, since if you really are getting that many interrupts it
> will cause a lot of log spew.  
>> There are lots of other sources of entropy available as well, like
>> the ever-changing CPU frequencies reported by any recent Intel chip
>> (i.e., they report precision down to 1 kHz).
> That's something we could look at, but the problem is if there is some
> systemd unit during early boot that blocks waiting for the entropy
> pool to be initalized, the system will come to a dead halt, and even
> the CPU frequency shifts will probably not move much --- just as there
> weren't any interrupts while some system startup on the boot path
> wedges the system startup waiting for entropy.
>
> This is why ultimately, we do need to attack this problem from both
> ends, which means teaching userspace programs to only request
> cryptographic-grade randomness when it is really needed --- and most
> of the time, if the user has not logged in yet, you probably don't
> need cryptographic-grade randomness
>
>   - Ted
>
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index cd888d4ee605..69bd29f039e7 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -916,6 +916,10 @@ static void crng_reseed(struct crng_state *crng, struct 
> entropy_store *r)
>   __u32   key[8];
>   } buf;
>  
> + if (crng == _crng)
> + pr_notice("random: crng_reseed primary from %px\n", r);
> + else
> + pr_notice("random: crng_reseed crng %px from %px\n", crng, r);
>   if (r) {
>   num = extract_entropy(r, , 32, 16, 0);
>   if (num == 0)
> @@ -1241,6 +1245,10 @@ void add_interrupt_randomness(int irq, int irq_flags)
>   fast_pool->pool[2] ^= ip;
>   fast_pool->pool[3] ^= (sizeof(ip) > 4) ? ip >> 32 :
>   get_reg(fast_pool, regs);
> + if (crng_init < 2)
> + pr_notice("random: add_interrupt(cycles=0x%08llx, now=%ld, "
> +   "irq=%d, ip=0x%08lx)\n",
> +   cycles, now, irq, _RET_IP_);
>  
>   fast_mix(fast_pool);
>   add_interrupt_bench(cycles);
> @@ -1282,6 +1290,9 @@ void add_interrupt_randomness(int irq, int irq_flags)
>  
>   /* award one bit for the contents of the fast pool */
>   credit_entropy_bits(r, credit + 1);
> + if (crng_init < 2)
> + pr_notice("random: batched into pool in stage %d, bits now %d",
> +   crng_init, ENTROPY_BITS(r));
>  }
>  EXPORT_SYMBOL_GPL(add_interrupt_randomness);

I dumped the contents of /proc/interrupts to dmesg using the attached patch I 
threw together,
and then waited a sufficient amount of time before introducing entropy myself 
in order to ensure
that the interrupt readings were not contaminated by user-provided interrupts.

Here is the interesting snippet from my dmesg:
[   30.689076] /proc/interrupts dump: 
   |CPU0   CPU1   CPU2   CPU3   
  0:  6  0  0  0   IO-APIC
2-edge  timer
  8:  0  0  1  0   IO-APIC
8-edge  rtc0
  9:  0533  0  0   IO-APIC
9-fasteoi   acpi
 10:  0  0  0  0   IO-APIC   
10-edge  tpm0
 29:  0  0  0  0   IO-APIC   
29-fasteoi   intel_sst_driver
 36:203  0  0  0   IO-APIC   
36-fasteoi   808622C1:04
 37:  0264  0  0   IO-APIC   
37-fasteoi   808622C1:05
 42:  0  0  0  

Re: Linux messages full of `random: get_random_u32 called from`

2018-04-27 Thread Sultan Alsawaf
> On Thu, Apr 26, 2018 at 10:20:44PM -0700, Sultan Alsawaf wrote:
>> I noted at least 20,000 mmc interrupts before I intervened in the boot 
>> process to provide entropy
>> myself. That's just for mmc, so I'm sure there were even more interrupts 
>> elsewhere. Is 20k+ interrupts
>> really not sufficient?
> How did you determine that there were 20,000 mmc interrupts before you
> had logged in?  Did you have access to the machine w/o having access
> to the login prompt?
>
> I can send a patch (see attached) that will spew large amounts of logs
> as each interrupt comes in and the entropy pool is getting intialized.
> That's how I test things on QEMU, and Jann did something similar on a
> (physical) test machine, so I'm pretty confident that if you were
> getting interrupts, it would result in them contributing into the
> pool.
>
> You will need a serial console, or build a kernel with a much larger
> dmesg buffer, since if you really are getting that many interrupts it
> will cause a lot of log spew.  
>> There are lots of other sources of entropy available as well, like
>> the ever-changing CPU frequencies reported by any recent Intel chip
>> (i.e., they report precision down to 1 kHz).
> That's something we could look at, but the problem is if there is some
> systemd unit during early boot that blocks waiting for the entropy
> pool to be initalized, the system will come to a dead halt, and even
> the CPU frequency shifts will probably not move much --- just as there
> weren't any interrupts while some system startup on the boot path
> wedges the system startup waiting for entropy.
>
> This is why ultimately, we do need to attack this problem from both
> ends, which means teaching userspace programs to only request
> cryptographic-grade randomness when it is really needed --- and most
> of the time, if the user has not logged in yet, you probably don't
> need cryptographic-grade randomness
>
>   - Ted
>
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index cd888d4ee605..69bd29f039e7 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -916,6 +916,10 @@ static void crng_reseed(struct crng_state *crng, struct 
> entropy_store *r)
>   __u32   key[8];
>   } buf;
>  
> + if (crng == _crng)
> + pr_notice("random: crng_reseed primary from %px\n", r);
> + else
> + pr_notice("random: crng_reseed crng %px from %px\n", crng, r);
>   if (r) {
>   num = extract_entropy(r, , 32, 16, 0);
>   if (num == 0)
> @@ -1241,6 +1245,10 @@ void add_interrupt_randomness(int irq, int irq_flags)
>   fast_pool->pool[2] ^= ip;
>   fast_pool->pool[3] ^= (sizeof(ip) > 4) ? ip >> 32 :
>   get_reg(fast_pool, regs);
> + if (crng_init < 2)
> + pr_notice("random: add_interrupt(cycles=0x%08llx, now=%ld, "
> +   "irq=%d, ip=0x%08lx)\n",
> +   cycles, now, irq, _RET_IP_);
>  
>   fast_mix(fast_pool);
>   add_interrupt_bench(cycles);
> @@ -1282,6 +1290,9 @@ void add_interrupt_randomness(int irq, int irq_flags)
>  
>   /* award one bit for the contents of the fast pool */
>   credit_entropy_bits(r, credit + 1);
> + if (crng_init < 2)
> + pr_notice("random: batched into pool in stage %d, bits now %d",
> +   crng_init, ENTROPY_BITS(r));
>  }
>  EXPORT_SYMBOL_GPL(add_interrupt_randomness);

I dumped the contents of /proc/interrupts to dmesg using the attached patch I 
threw together,
and then waited a sufficient amount of time before introducing entropy myself 
in order to ensure
that the interrupt readings were not contaminated by user-provided interrupts.

Here is the interesting snippet from my dmesg:
[   30.689076] /proc/interrupts dump: 
   |CPU0   CPU1   CPU2   CPU3   
  0:  6  0  0  0   IO-APIC
2-edge  timer
  8:  0  0  1  0   IO-APIC
8-edge  rtc0
  9:  0533  0  0   IO-APIC
9-fasteoi   acpi
 10:  0  0  0  0   IO-APIC   
10-edge  tpm0
 29:  0  0  0  0   IO-APIC   
29-fasteoi   intel_sst_driver
 36:203  0  0  0   IO-APIC   
36-fasteoi   808622C1:04
 37:  0264  0  0   IO-APIC   
37-fasteoi   808622C1:05
 42:  0  0  0  

Re: Linux messages full of `random: get_random_u32 called from`

2018-04-26 Thread Sultan Alsawaf
> The CRNG changes were needed because were erroneously saying that the
> entropy pool was securely initialized before it really was.  Saying
> that CRNG should be able to init on its own is much like saying, "Ted
> should be able to fly wherever he wants in his own personal Gulfstream
> V."  It would certainly be _nice_ if I could afford my personal jet.
> I certainly wish I were that rich.  But the problem is that dollars
> (or Euro's) are like entropy, they don't just magically drop out of
> the sky.
>
> If there isn't user-provided entropy, and the hardware isn't providing
> sufficient entropy, where did you think the kernel is supposed to get
> the entropy from?  Should it dial 1-800-TRUST-NSA?
>
> From the dmesg log, you have a Chromebook Acer 14.  I'm guessing the
> problem is that Chromebooks have hardware tries *very* hard not to
> issue interrupts, since that helps with power savings.  The following
> from your dmesg is very interesting:
>
> [0.526786] tpm tpm0: [Firmware Bug]: TPM interrupt not working, polling 
> instead
>
> I suspect this isn't a firmware bug; it's the hardware working as
> intended / working as designed, for power savings reasons.
>
> So there are two ways to fix this that I can see.  One is to try to
> adjust userspace so that it allows the boot to proceed.  As there is
> more activity, the disk completion interrupts, the user typing their
> username/password into the login screen, etc., there will be timing
> events which can be used to harvest entropy.
>
> The other approach would be to compile the kernel with
> CONFIG_HW_RANDOM_TPM and to modify drivers/char/tpm/tpm-chip.c tot
> initalize chip->hwrng.quality = 500.  We've historically made this
> something that the system administrator must set via sysfs.  This is
> because we wanted system adminisrators to explicitly say that they
> trust the any hardware manufacturer that (a) they haven't been paid by
> your choice of the Chinese MSS or the US NSA to introduce a backdoor,i
> and (b) they are competent to actually implemnt a _secure_ hardware
> random number generator.  Sadly, this has not always been the case.
> Please see:
>
>   https://www.chromium.org/chromium-os/tpm_firmware_update
>
> And note that your Edgar Chromebook is one the list of devices that
> have a TPM with the buggy firmware.  Fortunately this particular TPM
> bug only affects RSA prime generation, so as far as I know there is no
> _known_ vulerability in your TPM's hardware random number generator.
> B ut we want it to be _your_ responsibility to decide you are willing
> to truste it.  I certainly don't want to be legally liable --- or even
> have the moral responsibility --- of guaranteeing that every single
> TPM out there is bug-free(tm).
>
>   - Ted

Why don't we tell users that they need to smash their keyboards to make their 
computers boot
then? And if they question it, we can tell them that it certainly would be 
_nice_ to not have
to smash their keyboards to make their computers boot, but alas, a part of me 
has a feeling that
users would not take kindly to that :)

I noted at least 20,000 mmc interrupts before I intervened in the boot process 
to provide entropy
myself. That's just for mmc, so I'm sure there were even more interrupts 
elsewhere. Is 20k+ interrupts
really not sufficient?

There are lots of other sources of entropy available as well, like the 
ever-changing CPU frequencies reported
by any recent Intel chip (i.e., they report precision down to 1 kHz). Why are 
we so limited to h/w interrupts?

Sultan



Re: Linux messages full of `random: get_random_u32 called from`

2018-04-26 Thread Sultan Alsawaf
> The CRNG changes were needed because were erroneously saying that the
> entropy pool was securely initialized before it really was.  Saying
> that CRNG should be able to init on its own is much like saying, "Ted
> should be able to fly wherever he wants in his own personal Gulfstream
> V."  It would certainly be _nice_ if I could afford my personal jet.
> I certainly wish I were that rich.  But the problem is that dollars
> (or Euro's) are like entropy, they don't just magically drop out of
> the sky.
>
> If there isn't user-provided entropy, and the hardware isn't providing
> sufficient entropy, where did you think the kernel is supposed to get
> the entropy from?  Should it dial 1-800-TRUST-NSA?
>
> From the dmesg log, you have a Chromebook Acer 14.  I'm guessing the
> problem is that Chromebooks have hardware tries *very* hard not to
> issue interrupts, since that helps with power savings.  The following
> from your dmesg is very interesting:
>
> [0.526786] tpm tpm0: [Firmware Bug]: TPM interrupt not working, polling 
> instead
>
> I suspect this isn't a firmware bug; it's the hardware working as
> intended / working as designed, for power savings reasons.
>
> So there are two ways to fix this that I can see.  One is to try to
> adjust userspace so that it allows the boot to proceed.  As there is
> more activity, the disk completion interrupts, the user typing their
> username/password into the login screen, etc., there will be timing
> events which can be used to harvest entropy.
>
> The other approach would be to compile the kernel with
> CONFIG_HW_RANDOM_TPM and to modify drivers/char/tpm/tpm-chip.c tot
> initalize chip->hwrng.quality = 500.  We've historically made this
> something that the system administrator must set via sysfs.  This is
> because we wanted system adminisrators to explicitly say that they
> trust the any hardware manufacturer that (a) they haven't been paid by
> your choice of the Chinese MSS or the US NSA to introduce a backdoor,i
> and (b) they are competent to actually implemnt a _secure_ hardware
> random number generator.  Sadly, this has not always been the case.
> Please see:
>
>   https://www.chromium.org/chromium-os/tpm_firmware_update
>
> And note that your Edgar Chromebook is one the list of devices that
> have a TPM with the buggy firmware.  Fortunately this particular TPM
> bug only affects RSA prime generation, so as far as I know there is no
> _known_ vulerability in your TPM's hardware random number generator.
> B ut we want it to be _your_ responsibility to decide you are willing
> to truste it.  I certainly don't want to be legally liable --- or even
> have the moral responsibility --- of guaranteeing that every single
> TPM out there is bug-free(tm).
>
>   - Ted

Why don't we tell users that they need to smash their keyboards to make their 
computers boot
then? And if they question it, we can tell them that it certainly would be 
_nice_ to not have
to smash their keyboards to make their computers boot, but alas, a part of me 
has a feeling that
users would not take kindly to that :)

I noted at least 20,000 mmc interrupts before I intervened in the boot process 
to provide entropy
myself. That's just for mmc, so I'm sure there were even more interrupts 
elsewhere. Is 20k+ interrupts
really not sufficient?

There are lots of other sources of entropy available as well, like the 
ever-changing CPU frequencies reported
by any recent Intel chip (i.e., they report precision down to 1 kHz). Why are 
we so limited to h/w interrupts?

Sultan



Re: Linux messages full of `random: get_random_u32 called from`

2018-04-26 Thread Sultan Alsawaf
> Hmm, it looks like the multiuser startup is getting blocked on snapd:
>
>  29.060s snapd.service
>
> graphical.target @1min 32.145s
> └─multi-user.target @1min 32.145s
>   └─hddtemp.service @6.512s +28ms
> └─network-online.target @6.508s
>   └─NetworkManager-wait-online.service @2.428s +4.079s
> └─NetworkManager.service @2.016s +404ms
>   └─dbus.service @1.869s
> └─basic.target @1.824s
>   └─sockets.target @1.824s
> └─snapd.socket @1.821s +1ms
>   └─sysinit.target @1.812s
> └─apparmor.service @587ms +1.224s
>   └─local-fs.target @585ms
> └─local-fs-pre.target @585ms
>   └─keyboard-setup.service @235ms +346ms
> └─systemd-journald.socket @226ms
>   └─system.slice @225ms
> └─-.slice @220ms
>
> This appears to be some kind of new package management system for
> Ubuntu:
>
> Description-en: Tool to interact with Ubuntu Core Snappy.
>  Install, configure, refresh and remove snap packages. Snaps are
>  'universal' packages that work across many different Linux systems,
>  enabling secure distribution of the latest apps and utilities for
>  cloud, servers, desktops and the internet of things.
>
> Why it the Ubuntu package believes it needs to be fully started before
> the login screen can display is unclear to me.  It might be worth
> using systemctl to disable snapd.serivce and see if that makes things
> work better for you.
>
>   - Ted

I removed snapd completely which did nothing.

Here are new logs:
systemd-analyze blame: https://hastebin.com/edehikuyeb.css
systemd-analyze critical-chain: https://hastebin.com/vedufafema.pl
dmesg: https://hastebin.com/zuwuwoxadu.vbs

I should also note that leaving the system untouched does not result in it 
booting: I must
provide a source of entropy, otherwise it just stays stuck. In both of the 
dmesgs I've given, I
manually provided entropy to the system after about 5 minutes of waiting.

Also, regardless of what's hanging on CRNG init, CRNG should be able to init on 
its own in a timely
manner without the need for user-provided entropy. Userspace was working fine 
before the recent CRNG
kernel changes, so I don't think this is a userspace bug.

-Sultan



Re: Linux messages full of `random: get_random_u32 called from`

2018-04-26 Thread Sultan Alsawaf
> Hmm, it looks like the multiuser startup is getting blocked on snapd:
>
>  29.060s snapd.service
>
> graphical.target @1min 32.145s
> └─multi-user.target @1min 32.145s
>   └─hddtemp.service @6.512s +28ms
> └─network-online.target @6.508s
>   └─NetworkManager-wait-online.service @2.428s +4.079s
> └─NetworkManager.service @2.016s +404ms
>   └─dbus.service @1.869s
> └─basic.target @1.824s
>   └─sockets.target @1.824s
> └─snapd.socket @1.821s +1ms
>   └─sysinit.target @1.812s
> └─apparmor.service @587ms +1.224s
>   └─local-fs.target @585ms
> └─local-fs-pre.target @585ms
>   └─keyboard-setup.service @235ms +346ms
> └─systemd-journald.socket @226ms
>   └─system.slice @225ms
> └─-.slice @220ms
>
> This appears to be some kind of new package management system for
> Ubuntu:
>
> Description-en: Tool to interact with Ubuntu Core Snappy.
>  Install, configure, refresh and remove snap packages. Snaps are
>  'universal' packages that work across many different Linux systems,
>  enabling secure distribution of the latest apps and utilities for
>  cloud, servers, desktops and the internet of things.
>
> Why it the Ubuntu package believes it needs to be fully started before
> the login screen can display is unclear to me.  It might be worth
> using systemctl to disable snapd.serivce and see if that makes things
> work better for you.
>
>   - Ted

I removed snapd completely which did nothing.

Here are new logs:
systemd-analyze blame: https://hastebin.com/edehikuyeb.css
systemd-analyze critical-chain: https://hastebin.com/vedufafema.pl
dmesg: https://hastebin.com/zuwuwoxadu.vbs

I should also note that leaving the system untouched does not result in it 
booting: I must
provide a source of entropy, otherwise it just stays stuck. In both of the 
dmesgs I've given, I
manually provided entropy to the system after about 5 minutes of waiting.

Also, regardless of what's hanging on CRNG init, CRNG should be able to init on 
its own in a timely
manner without the need for user-provided entropy. Userspace was working fine 
before the recent CRNG
kernel changes, so I don't think this is a userspace bug.

-Sultan



Re: Linux messages full of `random: get_random_u32 called from`

2018-04-26 Thread Sultan Alsawaf
> Hmm, can you let the boot hang for a while?  It should continue after
> a few minutes if you wait long enough, but wait a minute or two, then
> give it entropy so the boot can continue.  Then can you use
> "systemd-analyze blame" or "systemd-analyize critical-chain" and we
> can see what process was trying to get randomness during the boot
> startup and blocking waiting for the CRNG to be fully initialized.
>
>- Ted

systemd-analyze blame: https://hastebin.com/ikipavevew.css
systemd-analyze critical-chain: https://hastebin.com/odoyuqeges.pl
dmesg: https://hastebin.com/waracebeja.vbs



Re: Linux messages full of `random: get_random_u32 called from`

2018-04-26 Thread Sultan Alsawaf
> Hmm, can you let the boot hang for a while?  It should continue after
> a few minutes if you wait long enough, but wait a minute or two, then
> give it entropy so the boot can continue.  Then can you use
> "systemd-analyze blame" or "systemd-analyize critical-chain" and we
> can see what process was trying to get randomness during the boot
> startup and blocking waiting for the CRNG to be fully initialized.
>
>- Ted

systemd-analyze blame: https://hastebin.com/ikipavevew.css
systemd-analyze critical-chain: https://hastebin.com/odoyuqeges.pl
dmesg: https://hastebin.com/waracebeja.vbs



Re: Linux messages full of `random: get_random_u32 called from`

2018-04-25 Thread Sultan Alsawaf
> Thanks for the report!
>
> I assume since you're upgrading your own kernel, you must not be
> running Chrome OS on your Acer CB3-431 Chromebook (Edgar).  Are you
> running Chromium --- or some Linux distribution on it?
>
> Thanks,
>
>   - Ted

Correct, I'm running Xubuntu 18.04 with my own kernel based off linux-stable.



Re: Linux messages full of `random: get_random_u32 called from`

2018-04-25 Thread Sultan Alsawaf
> Thanks for the report!
>
> I assume since you're upgrading your own kernel, you must not be
> running Chrome OS on your Acer CB3-431 Chromebook (Edgar).  Are you
> running Chromium --- or some Linux distribution on it?
>
> Thanks,
>
>   - Ted

Correct, I'm running Xubuntu 18.04 with my own kernel based off linux-stable.



  1   2   >