Re: [PATCH] fs: ufs: Remove switch statement from ufs_set_de_type function
On Sun, Oct 21, 2018 at 1:27 AM Matthew Wilcox wrote:
>
> On Sat, Oct 20, 2018 at 11:09:57PM +0100, Phillip Potter wrote:
> > Remove switch statement from ufs_set_de_type function in fs/ufs/util.h
> > header and replace with simple assignment. For each case, S_IFx >> 12
> > is equal to DT_x, so in valid cases (mode & S_IFMT) >> 12 should give
> > us the correct file type. It is expected that for *nix compatibility
> > reasons, the relation between S_IFx and DT_x will not change. For
> > cases where the mode is invalid, upper layer validation catches this
> > anyway, so this improves readability and arguably performance by
> > assigning (mode & S_IFMT) >> 12 directly without any jump table
> > or conditional logic.
>
> Shouldn't we also do this for other filesystems? A quick scan suggests
> someone should at least look at xfs, ubifs, squashfs, romfs, ocfs2,
> nilfs2, hfsplus, f2fs, ext4, ext2, exofs, coda, cifs & btrfs.

I tried to do that 2 years ago, not even for readability, but to fix a
lurking bug in the conversion table implementation that was copied from
ext2 to many other filesystems:

https://marc.info/?l=linux-fsdevel&m=148217829301701&w=2
https://marc.info/?l=linux-fsdevel&w=2&r=1&s=amir73il+file+type+conversion&q=b

The push back from the ext4/xfs maintainers was that while all
filesystems use common values to encode d_type in their on-disk format,
those on-disk format values are private to each filesystem.

Eventually, the fix that got merged (only to xfs) did the opposite: it
converted the conversion table lookup to a switch statement:

https://marc.info/?l=linux-fsdevel&m=14824386620&w=2

Some fs maintainers were happy about the initial version, but I did not
pursue pushing that cleanup:

https://marc.info/?l=linux-fsdevel&m=14824386620&w=2

Phillip, you are welcome to re-spin those patches if you like.
Note that the generic conversion helpers do not rely on the upper layer
to catch invalid mode values.

Cheers,
Amir.
Re: [PATCH 0/6] Tracing register accesses with pstore and dynamic debug
On Sun, Oct 21, 2018 at 09:16:59AM +0530, Sai Prakash Ranjan wrote:
> On 10/20/2018 9:57 PM, Joel Fernandes wrote:
> > On Sat, Oct 20, 2018 at 12:02:37PM +0530, Sai Prakash Ranjan wrote:
> > > On 10/20/2018 10:55 AM, Joel Fernandes wrote:
> > > > On Sun, Sep 09, 2018 at 01:57:01AM +0530, Sai Prakash Ranjan wrote:
> > > > > Hi,
> > > > >
> > > > > This patch series adds Event tracing support to pstore and is a
> > > > > continuation of the RFC patch introduced to add a new tracing
> > > > > facility for register accesses called Register Trace Buffer (RTB).
> > > > > Since we decided not to introduce a separate framework to trace
> > > > > register accesses and to use an existing framework like
> > > > > tracepoints, I have moved on from the RFC. Details of the RFC in
> > > > > the link below:
> > > > >
> > > > > Link: https://lore.kernel.org/lkml/cover.1535119710.git.saiprakash.ran...@codeaurora.org/
> > > > >
> > > > > The MSR tracing example given by Steven was helpful in using
> > > > > tracepoints for register accesses instead of a separate trace.
> > > > > But just having these IO traces would not help much unless we
> > > > > could have them in some persistent ram buffer for debugging
> > > > > unclocked access, some kind of bus hang, or an unexpected reset
> > > > > caused by some buggy driver, which happens a lot during initial
> > > > > development stages. By analyzing the last few entries of this
> > > > > buffer, we could identify the register access which is causing
> > > > > the issue.
> > > >
> > > > Hi Sai,
> > > >
> > > > I wanted to see if I could make some time to get your patches
> > > > working. We are hitting usecases that need something like this as
> > > > well. Basically devices hanging and then the ramdump does not tell
> > > > us much, so in this case pstore events can be really helpful. This
> > > > usecase came up last year as well.
> > > >
> > > > Anyway, while I was going through your patches, I cleaned up some
> > > > pstore code as well and I have 3 more patches on top of yours for
> > > > this clean up. I prefer we submit the patches together and sync
> > > > our work so that there is least conflict.
> > > >
> > > > Here's my latest tree:
> > > > https://github.com/joelagnel/linux-kernel/commits/pstore-events
> > > > (note that I have only build tested the patches since I just wrote
> > > > them and it's quite late in the night here ;-))
> > >
> > > Hi Joel,
> > >
> > > Thanks for looking into this. Sure, I will be happy to sync up with you on

Thanks. And added a fourth patch in the tree too.

While at it, I was thinking about the problem we are trying to solve in
another way. If ftrace itself could use pages from the persistent ram
store, instead of the kernel's buddy allocator, then the ftrace ring
buffer itself could persist across a system reboot.

The clear advantage of that for Sai's pstore events work is not having
to duplicate a lot of the ring buffer code into pstore (for the pstore
events, for example, I wanted time stamps as well, and ftrace's ring
buffer has some nice time management code to deal with time deltas). We
already maintain ring buffers in other parts of the kernel for tracing,
right? So I'm a bit averse to duplicating that into pstore as well.

The other advantage of persisting the ftrace ring buffer is that data
from other tracers such as the function-graph tracer and the irqsoff
tracer would also persist, and then we could probably also get rid of
the ftrace-in-pstore stuff, which is kind of incomplete anyway since it
does not have time stamps for recorded functions.

Steven and Kees: what do you think, is persisting the ftrace ring
buffer across reboots a worthwhile idea? Any thoughts on how feasible
something like that could be, code wise?

Off the top, I think the ring buffer state that ftrace needs, other
than the trace data itself, will also have to be persisted.

thanks,

- Joel
Re: [PATCH 0/6] Tracing register accesses with pstore and dynamic debug
On 10/21/2018 9:16 AM, Sai Prakash Ranjan wrote:

On 10/20/2018 9:57 PM, Joel Fernandes wrote:

On Sat, Oct 20, 2018 at 12:02:37PM +0530, Sai Prakash Ranjan wrote:

On 10/20/2018 10:55 AM, Joel Fernandes wrote:

On Sun, Sep 09, 2018 at 01:57:01AM +0530, Sai Prakash Ranjan wrote:

Hi,

This patch series adds Event tracing support to pstore and is a
continuation of the RFC patch introduced to add a new tracing facility
for register accesses called Register Trace Buffer (RTB). Since we
decided not to introduce a separate framework to trace register
accesses and to use an existing framework like tracepoints, I have
moved on from the RFC. Details of the RFC in the link below:

Link: https://lore.kernel.org/lkml/cover.1535119710.git.saiprakash.ran...@codeaurora.org/

The MSR tracing example given by Steven was helpful in using
tracepoints for register accesses instead of a separate trace. But just
having these IO traces would not help much unless we could have them in
some persistent ram buffer for debugging unclocked access, some kind of
bus hang, or an unexpected reset caused by some buggy driver, which
happens a lot during initial development stages. By analyzing the last
few entries of this buffer, we could identify the register access which
is causing the issue.

Hi Sai,

I wanted to see if I could make some time to get your patches working.
We are hitting usecases that need something like this as well.
Basically devices hanging and then the ramdump does not tell us much,
so in this case pstore events can be really helpful. This usecase came
up last year as well.

Anyway, while I was going through your patches, I cleaned up some
pstore code as well and I have 3 more patches on top of yours for this
clean up. I prefer we submit the patches together and sync our work so
that there is least conflict.

Here's my latest tree:
https://github.com/joelagnel/linux-kernel/commits/pstore-events
(note that I have only build tested the patches since I just wrote them
and it's quite late in the night here ;-))

Hi Joel,

Thanks for looking into this. Sure, I will be happy to sync up with you
on this. I can test your additional patches on top of my pstore
patches. BTW, I'm still stuck at copying the binary record into pstore
and then extracting it at read time. Seems like I'm missing something.

Thanks. And added a fourth patch in the tree too.

Sure, push your latest somewhere and let me know. I'll try to get you
unstuck.

Thanks Joel, I will push my changes and let you know in some time.

Hi Joel,

Here's my tree:
https://github.com/saiprakash-ranjan/linux/commits/pstore-events

I have one patch extra on top of your patches. Nothing much on binary
record storage in this patch, only removed kmalloc in pstore event call
to avoid loop.

Thanks,
Sai

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH 3/5] Drivers: hv: kvp: Fix the recent regression caused by incorrect clean-up
Hi Dan,

On Sat, Oct 20, 2018 at 9:22 PM Dan Carpenter wrote:
>
> On Sat, Oct 20, 2018 at 04:42:07PM +0200, Miguel Ojeda wrote:
> > Using an attribute is indeed better whenever possible. In C++17 it is
> > a standard attribute, and there have been proposals to include some of
> > them for C as well since 2016 at least, e.g. the latest for
> > fallthrough at:
> >
> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2268.pdf
> >
> > I have taken a quick look into supporting it (typing it here to save
> > it on the mailing list :-), and we have:
> >
> > * gcc >= 7.1 supports -Wimplicit-fallthrough with
> >   __attribute__((fallthrough)), the comment syntax and the C++
> >   [[fallthrough]] syntax.
> > * gcc < 7.1 complains about empty declarations (it does not know
> >   about attributes for null statements) and also
> >   -Wdeclaration-after-statement.
>
> I'm not sure I understand about empty declarations. The idea is that
> we would do:

That paragraph tried to explain that gcc < 7.1 did not know about
__attribute__((fallthrough)); i.e. that everything was introduced in
gcc 7.1. Anyway, the conclusion was a neuron misfiring of mine -- in my
mind I was thinking clang supported the comment syntax (when I clearly
wrote that it didn't). Never mind -- thanks for pointing it out!

>	case 3:
>		frob();
>		__fall_through();
>	case 4:
>		frob();
>
> #if GCC > 7.1
> #define __fall_through() __attribute__((fallthrough))
> #else
> #define __fall_through()
> #endif

Yes, of course, that is what we do for other attributes -- actually in
-next we have pending a better way of checking, using __has_attribute:

#if __has_attribute(fallthrough)
#define __fallthrough __attribute__((fallthrough))
#else
#define __fallthrough
#endif

> So long as GCC and static analysis tools understand the attribute,
> that's enough to catch the bugs. No one cares what clang and icc do.

Not so sure about that -- there are quite some people looking forward
to building Linux with clang, even if only to have another compiler to
check for warnings and to use the clang/llvm-related tools (and to
write new ones).

> We would just disable the fall through warnings on those until they are
> updated to support the fallthrough attribute.

You mean if they start supporting the warning but not the attribute? I
don't think that would be likely (i.e. if clang enables the warning in
C mode, they will have to introduce a way to suppress it, which should
be the attribute (at least), since they typically like to be compatible
with gcc and since they already support it in C++ >= 11), but
everything is possible.

> We wouldn't update all the fall through comments until later, but going
> forward people could use the __fall_through() macro if they want.

Agreed. I will introduce it in the compiler-attributes tree -- should
be there for -rc1 if no one complains. CC'ing all of you in the patch.

Cheers,
Miguel
Re: [RFC 00/20] ns: Introduce Time Namespace
On Sat, Oct 20, 2018 at 06:41:23PM -0700, Andrei Vagin wrote:
> On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> > Thomas Gleixner writes:
> >
> > > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> > >> Reading the code, the calling sequence there is:
> > >>
> > >>   tick_sched_do_timer
> > >>     tick_do_update_jiffies64
> > >>       update_wall_time
> > >>         timekeeping_advance
> > >>           timekeeping_update
> > >>
> > >> If I read that properly, under the right nohz circumstances that update
> > >> can be delayed indefinitely.
> > >>
> > >> So I think we could prototype a time namespace that was per
> > >> timekeeping_update and just had update_wall_time iterate through
> > >> all of the time namespaces.
> > >
> > > Please don't go there. timekeeping_update() is already heavy and walking
> > > through a gazillion of namespaces will just make it horrible.
> > >
> > >> I don't think the naive version would scale to very many time
> > >> namespaces.
> > >
> > > :)
> > >
> > >> At the same time, using the techniques from the nohz work and a little
> > >> smarts, I expect we could get the code to scale.
> > >
> > > You'd need to invoke the update when the namespace is switched in and
> > > hasn't been updated since the last tick happened. That might be doable,
> > > but you also need to take the wraparound constraints of the underlying
> > > clocksources into account, which again can cause walking all namespaces
> > > when they are all idle long enough.
> >
> > The wraparound constraints being how long before the time sources wrap
> > around, so you have to read them once per wraparound? I have not dug
> > deeply enough into the code to see that yet.
> >
> > > From there it becomes hairy, because it's not only timekeeping,
> > > i.e. reading time; this also affects all timers which are armed from
> > > a namespace.
> > >
> > > That gets really ugly, because when you do settimeofday() or adjtimex()
> > > for a particular namespace, then you have to search for all armed timers
> > > of that namespace and adjust them.
> > >
> > > The original posix timer code had the same issue because it mapped the
> > > clock realtime timers to the timer wheel, so any setting of the clock
> > > caused a full walk of all armed timers, disarming, adjusting and
> > > requeueing them. That's horrible not only performance wise, it's also
> > > a locking nightmare of all sorts.
> > >
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > I haven't looked through Dmitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> >
> > Then it sounds like this will take some more digging.
> >
> > Please pardon me for thinking out loud.
> >
> > There are one or more time sources that we use to compute the time,
> > and for each time source we have a conversion from ticks of the
> > time source to nanoseconds.
> >
> > Each time source needs to be sampled at least once per wrap-around,
> > and something incremented so that we don't lose time when looking
> > at that time source.
> >
> > There are several clocks presented to userspace; they all share the
> > same length of second and are all fundamentally offsets from
> > CLOCK_MONOTONIC.
> >
> > I see two fundamental driving cases for a time namespace.
> >
> > 1) Migration from one node to another node in a cluster in almost
> >    real time.
> >
> >    The problem is that CLOCK_MONOTONIC between nodes in the cluster
> >    has no relationship to each other (except a synchronized length of
> >    the second). So applications that migrate can see CLOCK_MONOTONIC
> >    and CLOCK_BOOTTIME go backwards.
> >
> >    This is the truly pressing problem, and adding some kind of offset
> >    sounds like it would be the solution. Possibly by allowing a boot
> >    time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> >
> > 2) Dealing with two separate time management domains. Say a machine
> >    that needs to deal with both something inside of Google, where they
> >    slew time to avoid leap seconds, and something in the outside world,
> >    where proper UTC time is kept as an offset from TAI with the
> >    occasional leap seconds.
> >
> >    In the latter case it would fundamentally require having seconds of
> >    different length.
>
> I want to add that the second case should be optional.
>
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
>
> Before starting this series, I was thinking about this, and I decided
> that these cases can be solved independently. Probably, the full
> isolation of the time sub-system will have much higher overhead than
> just offsets for a few clocks. And the idea
Re: [RFC 00/20] ns: Introduce Time Namespace
On Sat, Oct 20, 2018 at 06:41:23PM -0700, Andrei Vagin wrote:
> On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> > Thomas Gleixner writes:
> >
> > > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> > >> Reading the code the calling sequence there is:
> > >> tick_sched_do_timer
> > >>   tick_do_update_jiffies64
> > >>     update_wall_time
> > >>       timekeeping_advance
> > >>         timekeeping_update
> > >>
> > >> If I read that properly, under the right nohz circumstances that update
> > >> can be delayed indefinitely.
> > >>
> > >> So I think we could prototype a time namespace that was per
> > >> timekeeping_update and just had update_wall_time iterate through
> > >> all of the time namespaces.
> > >
> > > Please don't go there. timekeeping_update() is already heavy and walking
> > > through a gazillion of namespaces will just make it horrible.
> > >
> > >> I don't think the naive version would scale to very many time
> > >> namespaces.
> > >
> > > :)
> > >
> > >> At the same time, using the techniques from the nohz work and a little
> > >> smarts I expect we could get the code to scale.
> > >
> > > You'd need to invoke the update when the namespace is switched in and
> > > hasn't been updated since the last tick happened. That might be doable, but
> > > you also need to take the wraparound constraints of the underlying
> > > clocksources into account, which again can cause walking all namespaces
> > > when they are all idle long enough.
> >
> > The wraparound constraints being how long before the time sources wrap
> > around, so you have to read them once per wraparound? I have not dug
> > deeply enough into the code to see that yet.
> >
> > > From there it becomes hairy, because it's not only timekeeping,
> > > i.e. reading time; this is also affecting all timers which are armed from
> > > a namespace.
> > >
> > > That gets really ugly, because when you do settimeofday() or adjtimex() for
> > > a particular namespace, then you have to search for all armed timers of
> > > that namespace and adjust them.
> > >
> > > The original posix timer code had the same issue because it mapped the
> > > clock realtime timers to the timer wheel, so any setting of the clock
> > > caused a full walk of all armed timers, disarming, adjusting and
> > > requeueing them. That's horrible not only performance-wise, it's also a
> > > locking nightmare of all sorts.
> > >
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > I haven't looked through Dmitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> >
> > Then it sounds like this will take some more digging.
> >
> > Please pardon me for thinking out loud.
> >
> > There are one or more time sources that we use to compute the time,
> > and for each time source we have a conversion from ticks of the
> > time source to nanoseconds.
> >
> > Each time source needs to be sampled at least once per wrap-around,
> > and something incremented so that we don't lose time when looking
> > at that time source.
> >
> > There are several clocks presented to userspace; they all share the
> > same length of second and are all fundamentally offsets from
> > CLOCK_MONOTONIC.
> >
> > I see two fundamental driving cases for a time namespace.
> > 1) Migration from one node to another node in a cluster in almost
> >    real time.
> >
> >    The problem is that CLOCK_MONOTONIC between nodes in the cluster
> >    has no relationship to each other (except a synchronized length of
> >    the second). So applications that migrate can see CLOCK_MONOTONIC
> >    and CLOCK_BOOTTIME go backwards.
> >
> >    This is the truly pressing problem and adding some kind of offset
> >    sounds like it would be the solution. Possibly by allowing a boot
> >    time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> >
> > 2) Dealing with two separate time management domains. Say a machine
> >    that needs to deal with both something inside of Google, where they
> >    slew time to avoid leap seconds, and something in the outside
> >    world, where proper UTC time is kept as an offset from TAI with the
> >    occasional leap second.
> >
> >    In the latter case it would fundamentally require having seconds of
> >    different length.
>
> I want to add that the second case should be optional.
>
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
>
> Before starting this series, I was thinking about this, and I decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea
Re: [PATCH 0/6] Tracing register accesses with pstore and dynamic debug
On 10/20/2018 9:57 PM, Joel Fernandes wrote:
> On Sat, Oct 20, 2018 at 12:02:37PM +0530, Sai Prakash Ranjan wrote:
> > On 10/20/2018 10:55 AM, Joel Fernandes wrote:
> > > On Sun, Sep 09, 2018 at 01:57:01AM +0530, Sai Prakash Ranjan wrote:
> > > > Hi,
> > > >
> > > > This patch series adds event tracing support to pstore and is a
> > > > continuation of the RFC patch introduced to add a new tracing
> > > > facility for register accesses called Register Trace Buffer (RTB).
> > > > Since we decided not to introduce a separate framework to trace
> > > > register accesses and to use an existing framework like tracepoints,
> > > > I have moved on from the RFC. Details of the RFC in the link below:
> > > >
> > > > Link: https://lore.kernel.org/lkml/cover.1535119710.git.saiprakash.ran...@codeaurora.org/
> > > >
> > > > The MSR tracing example given by Steven was helpful in using
> > > > tracepoints for register accesses instead of a separate trace
> > > > facility. But just having these IO traces would not help much unless
> > > > we could have them in some persistent RAM buffer for debugging
> > > > unclocked access, or some kind of bus hang, or an unexpected reset
> > > > caused by some buggy driver, which happens a lot during initial
> > > > development stages. By analyzing the last few entries of this
> > > > buffer, we could identify the register access which is causing the
> > > > issue.
> > >
> > > Hi Sai,
> > >
> > > I wanted to see if I could make some time to get your patches working.
> > > We are hitting usecases that need something like this as well.
> > > Basically devices hanging and then the ramdump does not tell us much,
> > > so in this case pstore events can be really helpful. This usecase came
> > > up last year as well.
> > >
> > > Anyway, while I was going through your patches, I cleaned up some
> > > pstore code as well and I have 3 more patches on top of yours for this
> > > clean up. I prefer we submit the patches together and sync our work so
> > > that there is least conflict. Here's my latest tree:
> > > https://github.com/joelagnel/linux-kernel/commits/pstore-events
> > > (note that I have only build tested the patches since I just wrote
> > > them and it's quite late in the night here ;-))
> >
> > Hi Joel,
> >
> > Thanks for looking into this. Sure, I will be happy to sync up with you
> > on this. I can test your additional patches on top of my pstore patches.
> >
> > BTW, I'm still stuck at copying the binary record into pstore and then
> > extracting it at read time. Seems like I'm missing something.
>
> Thanks. And added a fourth patch in the tree too.
>
> Sure, push your latest somewhere and let me know. I'll try to get you
> unstuck.

Thanks Joel, I will push my changes and let you know in some time.

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation
Re: [RFC 00/20] ns: Introduce Time Namespace
On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> Thomas Gleixner writes:
>
> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> >> Reading the code the calling sequence there is:
> >> tick_sched_do_timer
> >>   tick_do_update_jiffies64
> >>     update_wall_time
> >>       timekeeping_advance
> >>         timekeeping_update
> >>
> >> If I read that properly, under the right nohz circumstances that update
> >> can be delayed indefinitely.
> >>
> >> So I think we could prototype a time namespace that was per
> >> timekeeping_update and just had update_wall_time iterate through
> >> all of the time namespaces.
> >
> > Please don't go there. timekeeping_update() is already heavy and walking
> > through a gazillion of namespaces will just make it horrible.
> >
> >> I don't think the naive version would scale to very many time
> >> namespaces.
> >
> > :)
> >
> >> At the same time, using the techniques from the nohz work and a little
> >> smarts I expect we could get the code to scale.
> >
> > You'd need to invoke the update when the namespace is switched in and
> > hasn't been updated since the last tick happened. That might be doable, but
> > you also need to take the wraparound constraints of the underlying
> > clocksources into account, which again can cause walking all namespaces
> > when they are all idle long enough.
>
> The wraparound constraints being how long before the time sources wrap
> around, so you have to read them once per wraparound? I have not dug
> deeply enough into the code to see that yet.
>
> > From there it becomes hairy, because it's not only timekeeping,
> > i.e. reading time; this is also affecting all timers which are armed from
> > a namespace.
> >
> > That gets really ugly, because when you do settimeofday() or adjtimex() for
> > a particular namespace, then you have to search for all armed timers of
> > that namespace and adjust them.
> >
> > The original posix timer code had the same issue because it mapped the
> > clock realtime timers to the timer wheel, so any setting of the clock
> > caused a full walk of all armed timers, disarming, adjusting and
> > requeueing them. That's horrible not only performance-wise, it's also a
> > locking nightmare of all sorts.
> >
> > Add time skew via NTP/PTP into the picture and you might have to adjust
> > timers as well, because you need to guarantee that they are not expiring
> > early.
> >
> > I haven't looked through Dmitry's patches yet, but I don't see how this
> > can work at all without introducing subtle issues all over the place.
>
> Then it sounds like this will take some more digging.
>
> Please pardon me for thinking out loud.
>
> There are one or more time sources that we use to compute the time,
> and for each time source we have a conversion from ticks of the
> time source to nanoseconds.
>
> Each time source needs to be sampled at least once per wrap-around,
> and something incremented so that we don't lose time when looking
> at that time source.
>
> There are several clocks presented to userspace; they all share the
> same length of second and are all fundamentally offsets from
> CLOCK_MONOTONIC.
>
> I see two fundamental driving cases for a time namespace.
> 1) Migration from one node to another node in a cluster in almost
>    real time.
>
>    The problem is that CLOCK_MONOTONIC between nodes in the cluster
>    has no relationship to each other (except a synchronized length of
>    the second). So applications that migrate can see CLOCK_MONOTONIC
>    and CLOCK_BOOTTIME go backwards.
>
>    This is the truly pressing problem and adding some kind of offset
>    sounds like it would be the solution. Possibly by allowing a boot
>    time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
>
> 2) Dealing with two separate time management domains. Say a machine
>    that needs to deal with both something inside of Google, where they
>    slew time to avoid leap seconds, and something in the outside
>    world, where proper UTC time is kept as an offset from TAI with the
>    occasional leap second.
>
>    In the latter case it would fundamentally require having seconds of
>    different length.

I want to add that the second case should be optional.

When a container is migrated to another host, we have to restore its
monotonic and boottime clocks, but we still expect that the container
will continue using the host real-time clock.

Before starting this series, I was thinking about this, and I decided that
these cases can be solved independently. Probably, the full isolation of
the time sub-system will have much higher overhead than just offsets for
a few clocks. And the idea that isolation of the real-time clock should
be optional gives us another hint that offsets for monotonic and
boot-time clocks can be implemented independently.

Eric and Thomas, what do you think about this? If you agree that these
two cases can be implemented separately, what should we do with this
[ANNOUNCE] Call for Papers - linux.conf.au Kernel Miniconf, Christchurch NZ, 21-25 Jan 2019
The linux.conf.au Kernel Miniconf is happening once again, this time in
Christchurch on 22 Jan 2019.

*** Submissions close on 2018-12-16, 23:59 AoE, with early submissions
(before 2018-11-16, 23:59 AoE) given priority. ***

*** Submission details: http://lca-kernel.ozlabs.org/2019-cfp.html ***

The Kernel Miniconf is a one-day stream alongside the main LCA conference
to talk about kernel stuff. We invite submissions on anything related to
kernel and low-level systems programming. We welcome submissions from
developers of all levels of experience in the kernel community, covering a
broad range of topics.

Past Kernel Miniconfs have included technical talks on topics such as
memory management, RCU, scheduling and filesystems, as well as talks on
Linux kernel community topics such as licensing and the Linux kernel
development process.

We strongly encourage both first-time and seasoned speakers from all
backgrounds, ages, genders, nationalities, ethnicities, religions and
abilities. Like the main LCA conference itself, we respect and encourage
diversity at our miniconf.

Speakers will need to purchase an LCA ticket to attend. See
http://lca-kernel.ozlabs.org/2019-cfp.html for full details and the
submission form.

--
Andrew Donnellan OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com IBM Australia Limited
Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]
Alan Jenkins wrote:
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 4dfe7e23b7ee..e8d61d5f581d 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1763,7 +1763,7 @@ void dissolve_on_fput(struct vfsmount *mnt)
>  {
>  	namespace_lock();
>  	lock_mount_hash();
> -	if (!real_mount(mnt)->mnt_ns) {
> +	if (!real_mount(mnt)->mnt_ns && !(mnt->mnt_flags & MNT_UMOUNT)) {
>  		mntget(mnt);
>  		umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
>  	}
> @@ -2469,7 +2469,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
>  	if (old->mnt_ns && !attached)
>  		goto out1;
>
> -	if (old->mnt.mnt_flags & MNT_LOCKED)
> +	if (old->mnt.mnt_flags & (MNT_LOCKED | MNT_UMOUNT))
> +		goto out1;
>
>  	if (old_path->dentry != old_path->mnt->mnt_root)

I've already got one of these; I'll fold in the other also.

David
RE: [PATCH] iw_cxgb4: fix a missing-check bug
> -----Original Message-----
> From: Wenwen Wang
> Sent: Saturday, October 20, 2018 6:56 PM
> To: sw...@opengridcomputing.com
> Cc: Kangjie Lu ; sw...@chelsio.com; dledf...@redhat.com; j...@ziepe.ca;
> linux-r...@vger.kernel.org; open list ker...@vger.kernel.org>; Wenwen Wang
> Subject: Re: [PATCH] iw_cxgb4: fix a missing-check bug
>
> On Sat, Oct 20, 2018 at 6:41 PM Steve Wise wrote:
> >
> > Hey Wenwen,
> >
> > > Subject: [PATCH] iw_cxgb4: fix a missing-check bug
> > >
> > > In c4iw_flush_hw_cq, the next CQE is acquired through t4_next_hw_cqe(). In
> > > t4_next_hw_cqe(), the CQE, i.e., 'cq->queue[cq->cidx]', is checked to see
> > > whether it is valid through t4_valid_cqe(). If it is valid, the address of
> > > the CQE is then saved to 'hw_cqe'. Later on, the CQE is copied to the local
> > > memory in create_read_req_cqe(). The problem here is that the CQE is
> > > actually in a DMA region allocated by dma_alloc_coherent() in create_cq().
> > > Given that the device also has the permission to access the DMA region, a
> > > malicious device controlled by an attacker can modify the CQE in the DMA
> > > region after the check in t4_next_hw_cqe() but before the copy in
> > > create_read_req_cqe(). By doing so, the attacker can supply invalid CQE,
> > > which can cause undefined behavior of the kernel and introduce potential
> > > security risks.
> >
> > If the dma device is malicious, couldn't it just dma some incorrect CQE but
> > still valid in the first place? I don't think this patch actually solves
> > the issue, and it forces a copy of a 64B CQE in a critical data io path.
>
> Thanks for your response! If the malicious dma device just dma some
> incorrect CQE, it will not be able to pass the verification in
> t4_valid_cqe().

As long as the gen bit is correct, the CQE is considered valid. You cannot
protect against a malicious dma device. Or at least not with the current
driver/device contract.

Steve.
Re: [Ksummit-discuss] [PATCH 6/7] Code of Conduct: Change the contact email address
> > Data protection law, reporting laws in some
> > countries and the like mean that anyone expecting an incident to remain
> > confidential from the person it was reported against is living in
> > dreamland and is going to get a nasty shock.
>
> OK - you seem to be talking about keeping the incident and reporter
> confidential from the person reported against.
>
> Certainly the person reported against has to have the incident
> identified to them, so that part is not confidential. Many legal
> jurisdictions require that the accused can know their accuser.
> But these things come into play mostly when items have veered
> into legal territory. Most violation reports are not in that territory.
>
> There's no legal requirement that the Code of Conduct committee
> tell someone who it was that said they were rude on the mailing list.

The 'who said' is generally safe (except for the it's-in-court case, which
is fine: when that happens the legal process deals with it). The other
details are not, so while someone accused of something might not know who
said it, they can ask for what personal data is held (which would include
that email, with names etc. scrubbed). You can possibly fight that in
court, of course, if you've got lots of money and nothing better to do for
six weeks.

> > You should also reserve the right to report serious incidents directly
> > to law enforcement. Unless of course you want to be forced to sit on
> > multiple reports of physical abuse from different people about
> > someone - unable to tell them about each others' reports, unable to
> > prove anything, and in twenty years' time having to explain to the
> > media why nothing was done.
>
> The scope of the code of conduct basically means that it covers
> online interactions (communication via mailing list, git commits
> and Bugzilla).

I don't see it specifically stating that. 'If someone is offensive at a
kernel summit we are going to refuse to listen'? Seriously?

Alan
Re: [PATCH] iw_cxgb4: fix a missing-check bug
On Sat, Oct 20, 2018 at 6:41 PM Steve Wise wrote: > > Hey Wenwen, > > > Subject: [PATCH] iw_cxgb4: fix a missing-check bug > > > > In c4iw_flush_hw_cq, the next CQE is acquired through t4_next_hw_cqe(). In > > t4_next_hw_cqe(), the CQE, i.e., 'cq->queue[cq->cidx]', is checked to see > > whether it is valid through t4_valid_cqe(). If it is valid, the address of > > the CQE is then saved to 'hw_cqe'. Later on, the CQE is copied to the > local > > memory in create_read_req_cqe(). The problem here is that the CQE is > > actually in a DMA region allocated by dma_alloc_coherent() in create_cq(). > > Given that the device also has the permission to access the DMA region, a > > malicious device controlled by an attacker can modify the CQE in the DMA > > region after the check in t4_next_hw_cqe() but before the copy in > > create_read_req_cqe(). By doing so, the attacker can supply invalid CQE, > > which can cause undefined behavior of the kernel and introduce potential > > security risks. > > > > If the dma device is malicious, couldn't it just dma some incorrect CQE but > still valid in the first place? I don't think this patch actually solves > the issue, and it forces a copy of a 64B CQE in a critical data io path. Thanks for your response! If the malicious dma device just dma some incorrect CQE, it will not be able to pass the verification in t4_valid_cqe(). Wenwen
Editors 4
We are one image studio who is able to process 300+ photos a day. If you need any image editing, please let us know. We can do it for you such as: Image cut out for photos and clipping path, masking for your photos, They are mostly used for ecommerce photos, jewelry photos retouching, beauty and skin images and wedding photos. We do also different kind of beauty retouching, portraits retouching. We can send editing for your photos if you send us one or two photos. Thanks, Linda
RE: [PATCH] iw_cxgb4: fix a missing-check bug
Hey Wenwen, > Subject: [PATCH] iw_cxgb4: fix a missing-check bug > > In c4iw_flush_hw_cq, the next CQE is acquired through t4_next_hw_cqe(). In > t4_next_hw_cqe(), the CQE, i.e., 'cq->queue[cq->cidx]', is checked to see > whether it is valid through t4_valid_cqe(). If it is valid, the address of > the CQE is then saved to 'hw_cqe'. Later on, the CQE is copied to the local > memory in create_read_req_cqe(). The problem here is that the CQE is > actually in a DMA region allocated by dma_alloc_coherent() in create_cq(). > Given that the device also has the permission to access the DMA region, a > malicious device controlled by an attacker can modify the CQE in the DMA > region after the check in t4_next_hw_cqe() but before the copy in > create_read_req_cqe(). By doing so, the attacker can supply invalid CQE, > which can cause undefined behavior of the kernel and introduce potential > security risks. > If the dma device is malicious, couldn't it just dma some incorrect CQE but still valid in the first place? I don't think this patch actually solves the issue, and it forces a copy of a 64B CQE in a critical data io path. So I must NACK this. Thanks, Steve.
Re: [PATCH] fs: ufs: Remove switch statement from ufs_set_de_type function
On Sat, Oct 20, 2018 at 03:26:37PM -0700, Matthew Wilcox wrote: > On Sat, Oct 20, 2018 at 11:09:57PM +0100, Phillip Potter wrote: > > Remove switch statement from ufs_set_de_type function in fs/ufs/util.h > > header and replace with simple assignment. For each case, S_IFx >> 12 > > is equal to DT_x, so in valid cases (mode & S_IFMT) >> 12 should give > > us the correct file type. It is expected that for *nix compatibility > > reasons, the relation between S_IFx and DT_x will not change. For > > cases where the mode is invalid, upper layer validation catches this > > anyway, so this improves readability and arguably performance by > > assigning (mode & S_IFMT) >> 12 directly without any jump table > > or conditional logic. > > Shouldn't we also do this for other filesystems? A quick scan suggests > someone should at least look at xfs, ubifs, squashfs, romfs, ocfs2, > nilfs2, hfsplus, f2fs, ext4, ext2, exofs, coda, cifs & btrfs. Do what for other filesystems? E.g. ext* has the type cached in directory entry - as a straight enum, so encoding is needed. Which we do. ocfs2 is pretty certain to have it in ext*-compatible layout. ubifs stores them in another enum of its own (also handled). XFS - another enum (present on some layout variants), etc. And... CIFS? Seriously? This stuff is about encoding used to cache the file type in directory entry; how the bleeding hell would CIFS _client_ get anywhere near that?
[PATCH v2] KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned
The spec only requires the posted interrupt descriptor address to be 64-byte aligned (i.e. bits[0:5] == 0). Using page_address_valid also forces the address to be page aligned. Only validate that the address does not cross the maximum physical address without enforcing a page alignment. v1 -> v2: - Add a missing parenthesis (dropped while merging!) Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: H. Peter Anvin Cc: x...@kernel.org Cc: k...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Fixes: 6de84e581c0 ("nVMX x86: check posted-interrupt descriptor addresss on vmentry of L2") Signed-off-by: KarimAllah Ahmed --- arch/x86/kvm/vmx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 38f1a16..bb0fcdb 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -11667,7 +11667,7 @@ static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu, !nested_exit_intr_ack_set(vcpu) || (vmcs12->posted_intr_nv & 0xff00) || (vmcs12->posted_intr_desc_addr & 0x3f) || - (!page_address_valid(vcpu, vmcs12->posted_intr_desc_addr))) + (vmcs12->posted_intr_desc_addr >> cpuid_maxphyaddr(vcpu))) return -EINVAL; /* tpr shadow is needed by all apicv features. */ -- 2.7.4
RE: [PATCH] tpm: tpm_try_transmit() refactor error flow.
> > On Thu, 18 Oct 2018, Winkler, Tomas wrote: > >> -Original Message- > >> From: Jarkko Sakkinen [mailto:jarkko.sakki...@linux.intel.com] > >> Sent: Thursday, October 18, 2018 03:15 > >> To: Winkler, Tomas > >> Cc: Jarkko Sakkinen ; Jason > >> Gunthorpe ; Nayna Jain ; > >> Usyskin, Alexander ; Struk, Tadeusz > >> ; linux-integr...@vger.kernel.org; > >> linux-security- mod...@vger.kernel.org; linux-kernel@vger.kernel.org; > >> sta...@vger.kernel.org > >> Subject: Re: [PATCH] tpm: tpm_try_transmit() refactor error flow. > >> > >> On Tue, 16 Oct 2018, Tomas Winkler wrote: > >>> First, rename out_no_locality to out_locality for bailing out on > >>> both > >>> tpm_cmd_ready() and tpm_request_locality() failure. > >> > >> This is unnecessary change and technically it is not a rename: the > >> commit message text and the code change do not match. Rename is just > >> a rename (i.e. change a variable name foo to bar). > > > > I'm renaming the label because it doesn't match the code flow anymore, > > I can change the commit message, but you please review the code. > > Tomas > > The flow change is unnecessary and does not really have anything to do with > the bug fix. What I see in the original code is that when tpm_cmd_ready() fails it jumps to the 'out' label and tries to do tpm_go_idle(), but instead it should just undo the locality, so both the tpm_cmd_ready() and tpm_go_idle() paths had a wrong jump. I see both should be fixed. > Earlier version was better than this and would have been fine when taking account the remark from Jason.
Broken dwarf unwinding - wrong stack pointer register value?
Hey all, I'm on the quest to figure out why perf regularly fails to unwind (some) samples. I am seeing very strange behavior, where an apparently wrong stack pointer value is read from the register - see below for more information and the end of this (long) mail for my open questions. Any help would be greatly appreciated. I am currently using this trivial C++ code to reproduce the issue: ``` #include #include #include #include using namespace std; int main() { uniform_real_distribution uniform(-1E5, 1E5); default_random_engine engine; double s = 0; for (int i = 0; i < 1000; ++i) { s += norm(complex(uniform(engine), uniform(engine))); } cout << s << '\n'; return 0; } ``` I compile it with `g++ -O2 -g` and then record it with `perf record --call- graph dwarf`. Using perf script, I then see e.g.: ``` $ perf script -v --no-inline --time 90229.12668,90229.127158 --ns ... # first frame (working unwinding from __hypot_finite): unwind: reg 16, val 7faf7dca2696 unwind: reg 7, val 7ffc80811ca0 unwind: find_proc_info dso /usr/lib/libm-2.28.so unwind: access_mem addr 0x7ffc80811ce8 val 7faf7dc88af9, offset 72 unwind: find_proc_info dso /usr/lib/libm-2.28.so unwind: access_mem addr 0x7ffc80811d08 val 56382b0fc129, offset 104 unwind: find_proc_info dso /home/milian/projects/kdab/rnd/hotspot/build/tests/ test-clients/cpp-inlining/cpp-inlining unwind: access_mem addr 0x7ffc80811d58 val 7faf7dabf223, offset 184 unwind: find_proc_info dso /usr/lib/libc-2.28.so unwind: access_mem addr 0x7ffc80811e18 val 56382b0fc1ee, offset 376 unwind: find_proc_info dso /home/milian/projects/kdab/rnd/hotspot/build/tests/ test-clients/cpp-inlining/cpp-inlining unwind: __hypot_finite:ip = 0x7faf7dca2696 (0x29696) unwind: hypotf32x:ip = 0x7faf7dc88af8 (0xfaf8) unwind: main:ip = 0x56382b0fc128 (0x1128) unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222) unwind: _start:ip = 0x56382b0fc1ed (0x11ed) # second frame (unrelated) unwind: reg 16, val 56382b0fc114 unwind: reg 7, val 7ffc80811d10 unwind: 
find_proc_info dso /home/milian/projects/kdab/rnd/hotspot/build/tests/ test-clients/cpp-inlining/cpp-inlining unwind: access_mem addr 0x7ffc80811d58 val 7faf7dabf223, offset 72 unwind: access_mem addr 0x7ffc80811e18 val 56382b0fc1ee, offset 264 unwind: main:ip = 0x56382b0fc114 (0x1114) unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222) unwind: _start:ip = 0x56382b0fc1ed (0x11ed) # third frame (broken unwinding from __hypot_finite) unwind: reg 16, val 7faf7dca2688 unwind: reg 7, val 7ffc80811ca0 unwind: find_proc_info dso /usr/lib/libm-2.28.so unwind: access_mem addr 0x7ffc80811cc0 val 0, offset 32 unwind: __hypot_finite:ip = 0x7faf7dca2688 (0x29688) unwind: '':ip = 0x (0x0) cpp-inlining 24617 90229.126685606: 711026 cycles:uppp: 7faf7dca2696 __hypot_finite+0x36 (/usr/lib/libm-2.28.so) 7faf7dc88af8 hypotf32x+0x18 (/usr/lib/libm-2.28.so) 56382b0fc128 main+0x88 (/home/milian/projects/kdab/rnd/hotspot/ build/tests/test-clients/cpp-inlining/cpp-inlining) 7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so) 56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/ build/tests/test-clients/cpp-inlining/cpp-inlining) cpp-inlining 24617 90229.126921551: 714657 cycles:uppp: 56382b0fc114 main+0x74 (/home/milian/projects/kdab/rnd/hotspot/ build/tests/test-clients/cpp-inlining/cpp-inlining) 7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so) 56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/ build/tests/test-clients/cpp-inlining/cpp-inlining) cpp-inlining 24617 90229.127157818: 719976 cycles:uppp: 7faf7dca2688 __hypot_finite+0x28 (/usr/lib/libm-2.28.so) [unknown] ([unknown]) ... ``` Now I'm trying to figure out why one __hypot_finite sample works but the other one breaks for no apparent reason. One thing I noticed so far is the following discrepancy in memory accesses, i.e. 
for the working sample we get: ``` unwind: reg 16, val 7faf7dca2696 unwind: reg 7, val 7ffc80811ca0 unwind: access_mem addr 0x7ffc80811ce8 val 7faf7dc88af9, offset 72 ... ``` But the broken sample only produces: ``` unwind: reg 16, val 7faf7dca2688 unwind: reg 7, val 7ffc80811ca0 unwind: find_proc_info dso /usr/lib/libm-2.28.so unwind: access_mem addr 0x7ffc80811cc0 val 0, offset 32 ``` The user stack of the working sample starts with (in hex): 9c830441254a9800301d8180fc7fc0c10f2b385630 The user stack of the broken sample starts with (in hex): 2f64fd4073bb9700301d8180fc7fc0c10f2b385630 This is a 64bit machine, so word width is 8. So both user stacks start with `24 * 2 = 48` words of zeros. Hence offset 32 cannot possibly produce a meaningful
Re: [PATCH v3 1/3] code-of-conduct: Fix the ambiguity about collecting email addresses
James, and our other friends, On Tue, Oct 16, 2018 at 2:59 PM James Bottomley wrote: > > The current code of conduct has an ambiguity More than one ambiguity. This whole file needs to go. >* Trolling, Who decides what is trolling, and what is a technique for raising awareness or sparking discussion on an issue? > * Other conduct which could reasonably be considered inappropriate in a >professional setting Why should this last bit remain? Any literate person with access to a dictionary should know how ambiguous the word professional is. As an amateur contributor to the FOSS ecosystem I am more than a bit offended by the decision to use such divisive, politically charged, and financially discriminatory language in a project of such massive technical importance. This entire file should be expunged from the repository and replaced by well defined minimalistic guidelines for maintaining order on the mailing lists, rather than a set of ambiguous codes that force maintainers to take politically motivated actions against contributors for undefined reasons. Using words like professional is a distressing red flag because it doesn't add any clarification on the issue (what was the issue again?), it only raises more questions. I can't think of any reason that word would be needed unless you're trying to push out unpaid contributors. Why should someone's employment status be held against them when contributing ideas or code to a technical project that has benefited greatly from amateur contributions? I fear for the kernel's future now that irrational politics are beginning to creep in.
Re: [PATCH] fs: ufs: Remove switch statement from ufs_set_de_type function
On Sat, Oct 20, 2018 at 11:09:57PM +0100, Phillip Potter wrote: > Remove switch statement from ufs_set_de_type function in fs/ufs/util.h > header and replace with simple assignment. For each case, S_IFx >> 12 > is equal to DT_x, so in valid cases (mode & S_IFMT) >> 12 should give > us the correct file type. It is expected that for *nix compatibility > reasons, the relation between S_IFx and DT_x will not change. For > cases where the mode is invalid, upper layer validation catches this > anyway, so this improves readability and arguably performance by > assigning (mode & S_IFMT) >> 12 directly without any jump table > or conditional logic. Shouldn't we also do this for other filesystems? A quick scan suggests someone should at least look at xfs, ubifs, squashfs, romfs, ocfs2, nilfs2, hfsplus, f2fs, ext4, ext2, exofs, coda, cifs & btrfs. > Signed-off-by: Phillip Potter > --- > fs/ufs/util.h | 30 ++ > 1 file changed, 2 insertions(+), 28 deletions(-) > > diff --git a/fs/ufs/util.h b/fs/ufs/util.h > index 1fd3011ea623..7e0c0878b9f9 100644 > --- a/fs/ufs/util.h > +++ b/fs/ufs/util.h > @@ -16,6 +16,7 @@ > * some useful macros > */ > #define in_range(b,first,len)((b)>=(first)&&(b)<(first)+(len)) > +#define S_SHIFT 12 > > /* > * functions used for retyping > @@ -158,34 +159,7 @@ ufs_set_de_type(struct super_block *sb, struct > ufs_dir_entry *de, int mode) > if ((UFS_SB(sb)->s_flags & UFS_DE_MASK) != UFS_DE_44BSD) > return; > > - /* > - * TODO turn this into a table lookup > - */ > - switch (mode & S_IFMT) { > - case S_IFSOCK: > - de->d_u.d_44.d_type = DT_SOCK; > - break; > - case S_IFLNK: > - de->d_u.d_44.d_type = DT_LNK; > - break; > - case S_IFREG: > - de->d_u.d_44.d_type = DT_REG; > - break; > - case S_IFBLK: > - de->d_u.d_44.d_type = DT_BLK; > - break; > - case S_IFDIR: > - de->d_u.d_44.d_type = DT_DIR; > - break; > - case S_IFCHR: > - de->d_u.d_44.d_type = DT_CHR; > - break; > - case S_IFIFO: > - de->d_u.d_44.d_type = DT_FIFO; > - break; > - default: > - 
de->d_u.d_44.d_type = DT_UNKNOWN; > - } > + de->d_u.d_44.d_type = (mode & S_IFMT) >> S_SHIFT; > } > > static inline u32 > -- > 2.17.2 >
[PATCH v3 06/13] KVM/nVMX: Use kvm_vcpu_map when mapping the L1 MSR bitmap
Use kvm_vcpu_map when mapping the L1 MSR bitmap since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

Signed-off-by: KarimAllah Ahmed
---
v1 -> v2:
- Do not change the lifecycle of the mapping (pbonzini)
---
 arch/x86/kvm/vmx.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d857401..5b15ca2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -847,6 +847,9 @@ struct nested_vmx {
 	struct page *apic_access_page;
 	struct page *virtual_apic_page;
 	struct page *pi_desc_page;
+
+	struct kvm_host_map msr_bitmap_map;
+
 	struct pi_desc *pi_desc;
 	bool pi_pending;
 	u16 posted_intr_nv;
@@ -11546,9 +11549,10 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 						 struct vmcs12 *vmcs12)
 {
 	int msr;
-	struct page *page;
 	unsigned long *msr_bitmap_l1;
 	unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.vmcs02.msr_bitmap;
+	struct kvm_host_map *map = &to_vmx(vcpu)->nested.msr_bitmap_map;
+
 	/*
 	 * pred_cmd & spec_ctrl are trying to verify two things:
 	 *
@@ -11574,11 +11578,10 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 	    !pred_cmd && !spec_ctrl)
 		return false;

-	page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->msr_bitmap);
-	if (is_error_page(page))
+	if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->msr_bitmap), map))
 		return false;

-	msr_bitmap_l1 = (unsigned long *)kmap(page);
+	msr_bitmap_l1 = (unsigned long *)map->hva;
 	if (nested_cpu_has_apic_reg_virt(vmcs12)) {
 		/*
 		 * L0 need not intercept reads for MSRs between 0x800 and 0x8ff, it
@@ -11626,8 +11629,7 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 					MSR_IA32_PRED_CMD, MSR_TYPE_W);

-	kunmap(page);
-	kvm_release_page_clean(page);
+	kvm_vcpu_unmap(&to_vmx(vcpu)->nested.msr_bitmap_map);

 	return true;
 }
--
2.7.4
[PATCH v3 01/13] X86/nVMX: handle_vmon: Read 4 bytes from guest memory
Read the data directly from guest memory instead of the map->read->unmap
sequence. This also avoids using kvm_vcpu_gpa_to_page() and kmap() which
assumes that there is a "struct page" for guest memory.

Suggested-by: Jim Mattson
Signed-off-by: KarimAllah Ahmed
---
v1 -> v2:
- Massage commit message a bit.
---
 arch/x86/kvm/vmx.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4f5d4bd..358759a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8321,7 +8321,7 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
 {
 	int ret;
 	gpa_t vmptr;
-	struct page *page;
+	uint32_t revision;
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	const u64 VMXON_NEEDED_FEATURES = FEATURE_CONTROL_LOCKED
 		| FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;
@@ -8373,19 +8373,11 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
 		return kvm_skip_emulated_instruction(vcpu);
 	}

-	page = kvm_vcpu_gpa_to_page(vcpu, vmptr);
-	if (is_error_page(page)) {
+	if (kvm_read_guest(vcpu->kvm, vmptr, &revision, sizeof(revision)) ||
+	    revision != VMCS12_REVISION) {
 		nested_vmx_failInvalid(vcpu);
 		return kvm_skip_emulated_instruction(vcpu);
 	}
-	if (*(u32 *)kmap(page) != VMCS12_REVISION) {
-		kunmap(page);
-		kvm_release_page_clean(page);
-		nested_vmx_failInvalid(vcpu);
-		return kvm_skip_emulated_instruction(vcpu);
-	}
-	kunmap(page);
-	kvm_release_page_clean(page);

 	vmx->nested.vmxon_ptr = vmptr;
 	ret = enter_vmx_operation(vcpu);
--
2.7.4
[PATCH v3 05/13] KVM: Introduce a new guest mapping API
In KVM, especially for nested guests, there is a dominant pattern of:

	=> map guest memory -> do_something -> unmap guest memory

In addition to all the unnecessary noise this boilerplate adds to the
code, most of the time the mapping function does not properly handle
memory that is not backed by a "struct page". This new guest mapping API
encapsulates most of that boilerplate and also handles guest memory that
is not backed by a "struct page".

Keep in mind that memremap is horribly slow, so this mapping API should
not be used for high-frequency mapping operations, but rather for
low-frequency mappings.

Signed-off-by: KarimAllah Ahmed
---
v1 -> v2:
- Drop the caching optimization (pbonzini)
- Use 'hva' instead of 'kaddr' (pbonzini)
- Return 0/-EINVAL/-EFAULT instead of true/false. -EFAULT will be used for
  AMD patch (pbonzini)
- Introduce __kvm_map_gfn which accepts a memory slot and use it (pbonzini)
- Only clear map->hva instead of memsetting the whole structure.
- Drop kvm_vcpu_map_valid since it is no longer used.
- Fix EXPORT_MODULE naming.
---
 include/linux/kvm_host.h |  9 +++++++
 virt/kvm/kvm_main.c      | 50 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c926698..59e56b8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -205,6 +205,13 @@ enum {
 	READING_SHADOW_PAGE_TABLES,
 };

+struct kvm_host_map {
+	struct page *page;
+	void *hva;
+	kvm_pfn_t pfn;
+	kvm_pfn_t gfn;
+};
+
 /*
  * Sometimes a large or cross-page mmio needs to be broken up into separate
  * exits for userspace servicing.
@@ -708,6 +715,8 @@ struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
 struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
 kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
 kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+int kvm_vcpu_map(struct kvm_vcpu *vcpu, gpa_t gpa, struct kvm_host_map *map);
+void kvm_vcpu_unmap(struct kvm_host_map *map);
 struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn);
 unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn);
 unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *writable);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 94e931f..a79a3c4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1652,6 +1652,56 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(gfn_to_page);

+static int __kvm_map_gfn(struct kvm_memory_slot *slot, gfn_t gfn,
+			 struct kvm_host_map *map)
+{
+	kvm_pfn_t pfn;
+	void *hva = NULL;
+	struct page *page = NULL;
+
+	pfn = gfn_to_pfn_memslot(slot, gfn);
+	if (is_error_noslot_pfn(pfn))
+		return -EINVAL;
+
+	if (pfn_valid(pfn)) {
+		page = pfn_to_page(pfn);
+		hva = kmap(page);
+	} else {
+		hva = memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
+	}
+
+	if (!hva)
+		return -EFAULT;
+
+	map->page = page;
+	map->hva = hva;
+	map->pfn = pfn;
+	map->gfn = gfn;
+
+	return 0;
+}
+
+int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
+{
+	return __kvm_map_gfn(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn, map);
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_map);
+
+void kvm_vcpu_unmap(struct kvm_host_map *map)
+{
+	if (!map->hva)
+		return;
+
+	if (map->page)
+		kunmap(map->page);
+	else
+		memunmap(map->hva);
+
+	kvm_release_pfn_dirty(map->pfn);
+	map->hva = NULL;
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
+
 struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	kvm_pfn_t pfn;
--
2.7.4
[PATCH v3 03/13] X86/nVMX: Update the PML table without mapping and unmapping the page
Update the PML table without mapping and unmapping the page. This also
avoids using kvm_vcpu_gpa_to_page(..) which assumes that there is a
"struct page" for guest memory.

Signed-off-by: KarimAllah Ahmed
---
v1 -> v2:
- Use kvm_write_guest_page instead of kvm_write_guest (pbonzini)
- Do not use pointer arithmetic for pml_address (pbonzini)
---
 arch/x86/kvm/vmx.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bc45347..d857401 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -13571,9 +13571,7 @@ static int vmx_write_pml_buffer(struct kvm_vcpu *vcpu)
 {
 	struct vmcs12 *vmcs12;
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	gpa_t gpa;
-	struct page *page = NULL;
-	u64 *pml_address;
+	gpa_t gpa, dst;

 	if (is_guest_mode(vcpu)) {
 		WARN_ON_ONCE(vmx->nested.pml_full);
@@ -13593,15 +13591,13 @@ static int vmx_write_pml_buffer(struct kvm_vcpu *vcpu)
 		}

 		gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS) & ~0xFFFull;
+		dst = vmcs12->pml_address + sizeof(u64) * vmcs12->guest_pml_index;

-		page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->pml_address);
-		if (is_error_page(page))
+		if (kvm_write_guest_page(vcpu->kvm, gpa_to_gfn(dst), &gpa,
+					 offset_in_page(dst), sizeof(gpa)))
 			return 0;

-		pml_address = kmap(page);
-		pml_address[vmcs12->guest_pml_index--] = gpa;
-		kunmap(page);
-		kvm_release_page_clean(page);
+		vmcs12->guest_pml_index--;
 	}

 	return 0;
--
2.7.4
[PATCH v3 11/13] KVM/X86: hyperv: Use kvm_vcpu_map in synic_deliver_msg
Use kvm_vcpu_map in synic_deliver_msg since using kvm_vcpu_gpa_to_page()
and kmap() will only work for guest memory that has a "struct page".

Signed-off-by: KarimAllah Ahmed
---
v1 -> v2:
- Update to match the new API return codes
---
 arch/x86/kvm/hyperv.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 8bdc78d..5310b8b 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -581,7 +581,7 @@ static int synic_deliver_msg(struct kvm_vcpu_hv_synic *synic, u32 sint,
 			     struct hv_message *src_msg)
 {
 	struct kvm_vcpu *vcpu = synic_to_vcpu(synic);
-	struct page *page;
+	struct kvm_host_map map;
 	gpa_t gpa;
 	struct hv_message *dst_msg;
 	int r;
@@ -591,11 +591,11 @@ static int synic_deliver_msg(struct kvm_vcpu_hv_synic *synic, u32 sint,
 		return -ENOENT;

 	gpa = synic->msg_page & PAGE_MASK;
-	page = kvm_vcpu_gfn_to_page(vcpu, gpa >> PAGE_SHIFT);
-	if (is_error_page(page))
+
+	if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map))
 		return -EFAULT;

-	msg_page = kmap_atomic(page);
+	msg_page = map.hva;
 	dst_msg = &msg_page->sint_message[sint];
 	if (sync_cmpxchg(&dst_msg->header.message_type, HVMSG_NONE,
 			 src_msg->header.message_type) != HVMSG_NONE) {
@@ -612,8 +612,8 @@ static int synic_deliver_msg(struct kvm_vcpu_hv_synic *synic, u32 sint,
 		else if (r == 0)
 			r = -EFAULT;
 	}
-	kunmap_atomic(msg_page);
-	kvm_release_page_dirty(page);
+
+	kvm_vcpu_unmap(&map);
 	kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
 	return r;
 }
--
2.7.4
[PATCH v3 10/13] KVM/X86: hyperv: Use kvm_vcpu_map in synic_clear_sint_msg_pending
Use kvm_vcpu_map in synic_clear_sint_msg_pending since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

Signed-off-by: KarimAllah Ahmed
---
v1 -> v2:
- Update to match the new API return codes
---
 arch/x86/kvm/hyperv.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 01d209a..8bdc78d 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -158,26 +158,22 @@ static void synic_clear_sint_msg_pending(struct kvm_vcpu_hv_synic *synic,
 					 u32 sint)
 {
 	struct kvm_vcpu *vcpu = synic_to_vcpu(synic);
-	struct page *page;
-	gpa_t gpa;
+	struct kvm_host_map map;
 	struct hv_message *msg;
 	struct hv_message_page *msg_page;

-	gpa = synic->msg_page & PAGE_MASK;
-	page = kvm_vcpu_gfn_to_page(vcpu, gpa >> PAGE_SHIFT);
-	if (is_error_page(page)) {
+	if (kvm_vcpu_map(vcpu, gpa_to_gfn(synic->msg_page), &map)) {
 		vcpu_err(vcpu, "Hyper-V SynIC can't get msg page, gpa 0x%llx\n",
-			 gpa);
+			 synic->msg_page);
 		return;
 	}
-	msg_page = kmap_atomic(page);
+	msg_page = map.hva;

 	msg = &msg_page->sint_message[sint];
 	msg->header.message_flags.msg_pending = 0;

-	kunmap_atomic(msg_page);
-	kvm_release_page_dirty(page);
-	kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
+	kvm_vcpu_unmap(&map);
+	kvm_vcpu_mark_page_dirty(vcpu, gpa_to_gfn(synic->msg_page));
 }

 static void kvm_hv_notify_acked_sint(struct kvm_vcpu *vcpu, u32 sint)
--
2.7.4
[PATCH v3 07/13] KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page
Use kvm_vcpu_map when mapping the virtual APIC page since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

One additional semantic change is that the virtual host mapping lifecycle
has changed a bit. It now has the same lifetime as the pinning of the
virtual APIC page on the host side.

Signed-off-by: KarimAllah Ahmed
---
v1 -> v2:
- Do not change the lifecycle of the mapping (pbonzini)
- Use pfn_to_hpa instead of gfn_to_gpa
---
 arch/x86/kvm/vmx.c | 34 +++++++++-----------------------
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5b15ca2..83a5e95 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -845,9 +845,8 @@ struct nested_vmx {
 	 * pointers, so we must keep them pinned while L2 runs.
 	 */
 	struct page *apic_access_page;
-	struct page *virtual_apic_page;
+	struct kvm_host_map virtual_apic_map;
 	struct page *pi_desc_page;
-
 	struct kvm_host_map msr_bitmap_map;

 	struct pi_desc *pi_desc;
@@ -6152,11 +6151,12 @@ static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)

 	max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256);
 	if (max_irr != 256) {
-		vapic_page = kmap(vmx->nested.virtual_apic_page);
+		vapic_page = vmx->nested.virtual_apic_map.hva;
+		if (!vapic_page)
+			return;
+
 		__kvm_apic_update_irr(vmx->nested.pi_desc->pir,
 			vapic_page, &max_irr);
-		kunmap(vmx->nested.virtual_apic_page);
-
 		status = vmcs_read16(GUEST_INTR_STATUS);
 		if ((u8)max_irr > ((u8)status & 0xff)) {
 			status &= ~0xff;
@@ -8468,10 +8468,7 @@ static void free_nested(struct vcpu_vmx *vmx)
 		kvm_release_page_dirty(vmx->nested.apic_access_page);
 		vmx->nested.apic_access_page = NULL;
 	}
-	if (vmx->nested.virtual_apic_page) {
-		kvm_release_page_dirty(vmx->nested.virtual_apic_page);
-		vmx->nested.virtual_apic_page = NULL;
-	}
+	kvm_vcpu_unmap(&vmx->nested.virtual_apic_map);
 	if (vmx->nested.pi_desc_page) {
 		kunmap(vmx->nested.pi_desc_page);
 		kvm_release_page_dirty(vmx->nested.pi_desc_page);
@@ -11394,6 +11391,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct kvm_host_map *map;
 	struct page *page;
 	u64 hpa;

@@ -11426,11 +11424,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
 	}

 	if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) {
-		if (vmx->nested.virtual_apic_page) { /* shouldn't happen */
-			kvm_release_page_dirty(vmx->nested.virtual_apic_page);
-			vmx->nested.virtual_apic_page = NULL;
-		}
-		page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->virtual_apic_page_addr);
+		map = &vmx->nested.virtual_apic_map;

 		/*
 		 * If translation failed, VM entry will fail because
@@ -11445,11 +11439,8 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
 		 * control. But such a configuration is useless, so
 		 * let's keep the code simple.
 		 */
-		if (!is_error_page(page)) {
-			vmx->nested.virtual_apic_page = page;
-			hpa = page_to_phys(vmx->nested.virtual_apic_page);
-			vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, hpa);
-		}
+		if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->virtual_apic_page_addr), map))
+			vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, pfn_to_hpa(map->pfn));
 	}

 	if (nested_cpu_has_posted_intr(vmcs12)) {
@@ -13353,10 +13344,7 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
 		kvm_release_page_dirty(vmx->nested.apic_access_page);
 		vmx->nested.apic_access_page = NULL;
 	}
-	if (vmx->nested.virtual_apic_page) {
-		kvm_release_page_dirty(vmx->nested.virtual_apic_page);
-		vmx->nested.virtual_apic_page = NULL;
-	}
+	kvm_vcpu_unmap(&vmx->nested.virtual_apic_map);
 	if (vmx->nested.pi_desc_page) {
 		kunmap(vmx->nested.pi_desc_page);
 		kvm_release_page_dirty(vmx->nested.pi_desc_page);
--
2.7.4
[PATCH v3 09/13] KVM/X86: Use kvm_vcpu_map in emulator_cmpxchg_emulated
Use kvm_vcpu_map in emulator_cmpxchg_emulated since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

Signed-off-by: KarimAllah Ahmed
---
v1 -> v2:
- Update to match the new API return codes
---
 arch/x86/kvm/x86.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ca71773..f083e53 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5280,9 +5280,9 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 				     unsigned int bytes,
 				     struct x86_exception *exception)
 {
+	struct kvm_host_map map;
 	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
 	gpa_t gpa;
-	struct page *page;
 	char *kaddr;
 	bool exchanged;

@@ -5299,12 +5299,11 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	if (((gpa + bytes - 1) & PAGE_MASK) != (gpa & PAGE_MASK))
 		goto emul_write;

-	page = kvm_vcpu_gfn_to_page(vcpu, gpa >> PAGE_SHIFT);
-	if (is_error_page(page))
+	if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map))
 		goto emul_write;

-	kaddr = kmap_atomic(page);
-	kaddr += offset_in_page(gpa);
+	kaddr = map.hva + offset_in_page(gpa);
+
 	switch (bytes) {
 	case 1:
 		exchanged = CMPXCHG_TYPE(u8, kaddr, old, new);
@@ -5321,8 +5320,8 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	default:
 		BUG();
 	}
-	kunmap_atomic(kaddr);
-	kvm_release_page_dirty(page);
+
+	kvm_vcpu_unmap(&map);

 	if (!exchanged)
 		return X86EMUL_CMPXCHG_FAILED;
--
2.7.4
[PATCH v3 07/13] KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page
Use kvm_vcpu_map when mapping the virtual APIC page since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". One additional semantic change is that the virtual host mapping lifecycle has changed a bit. It now has the same lifetime of the pinning of the virtual APIC page on the host side. Signed-off-by: KarimAllah Ahmed --- v1 -> v2: - Do not change the lifecycle of the mapping (pbonzini) - Use pfn_to_hpa instead of gfn_to_gpa --- arch/x86/kvm/vmx.c | 34 +++--- 1 file changed, 11 insertions(+), 23 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 5b15ca2..83a5e95 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -845,9 +845,8 @@ struct nested_vmx { * pointers, so we must keep them pinned while L2 runs. */ struct page *apic_access_page; - struct page *virtual_apic_page; + struct kvm_host_map virtual_apic_map; struct page *pi_desc_page; - struct kvm_host_map msr_bitmap_map; struct pi_desc *pi_desc; @@ -6152,11 +6151,12 @@ static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256); if (max_irr != 256) { - vapic_page = kmap(vmx->nested.virtual_apic_page); + vapic_page = vmx->nested.virtual_apic_map.hva; + if (!vapic_page) + return; + __kvm_apic_update_irr(vmx->nested.pi_desc->pir, vapic_page, _irr); - kunmap(vmx->nested.virtual_apic_page); - status = vmcs_read16(GUEST_INTR_STATUS); if ((u8)max_irr > ((u8)status & 0xff)) { status &= ~0xff; @@ -8468,10 +8468,7 @@ static void free_nested(struct vcpu_vmx *vmx) kvm_release_page_dirty(vmx->nested.apic_access_page); vmx->nested.apic_access_page = NULL; } - if (vmx->nested.virtual_apic_page) { - kvm_release_page_dirty(vmx->nested.virtual_apic_page); - vmx->nested.virtual_apic_page = NULL; - } + kvm_vcpu_unmap(>nested.virtual_apic_map); if (vmx->nested.pi_desc_page) { kunmap(vmx->nested.pi_desc_page); kvm_release_page_dirty(vmx->nested.pi_desc_page); @@ -11394,6 
+11391,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) { struct vmcs12 *vmcs12 = get_vmcs12(vcpu); struct vcpu_vmx *vmx = to_vmx(vcpu); + struct kvm_host_map *map; struct page *page; u64 hpa; @@ -11426,11 +11424,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) } if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) { - if (vmx->nested.virtual_apic_page) { /* shouldn't happen */ - kvm_release_page_dirty(vmx->nested.virtual_apic_page); - vmx->nested.virtual_apic_page = NULL; - } - page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->virtual_apic_page_addr); + map = >nested.virtual_apic_map; /* * If translation failed, VM entry will fail because @@ -11445,11 +11439,8 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) * control. But such a configuration is useless, so * let's keep the code simple. */ - if (!is_error_page(page)) { - vmx->nested.virtual_apic_page = page; - hpa = page_to_phys(vmx->nested.virtual_apic_page); - vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, hpa); - } + if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->virtual_apic_page_addr), map)) + vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, pfn_to_hpa(map->pfn)); } if (nested_cpu_has_posted_intr(vmcs12)) { @@ -13353,10 +13344,7 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason, kvm_release_page_dirty(vmx->nested.apic_access_page); vmx->nested.apic_access_page = NULL; } - if (vmx->nested.virtual_apic_page) { - kvm_release_page_dirty(vmx->nested.virtual_apic_page); - vmx->nested.virtual_apic_page = NULL; - } + kvm_vcpu_unmap(>nested.virtual_apic_map); if (vmx->nested.pi_desc_page) { kunmap(vmx->nested.pi_desc_page); kvm_release_page_dirty(vmx->nested.pi_desc_page); -- 2.7.4
[PATCH v3 09/13] KVM/X86: Use kvm_vcpu_map in emulator_cmpxchg_emulated
Use kvm_vcpu_map in emulator_cmpxchg_emulated since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". Signed-off-by: KarimAllah Ahmed --- v1 -> v2: - Update to match the new API return codes --- arch/x86/kvm/x86.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ca71773..f083e53 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5280,9 +5280,9 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt, unsigned int bytes, struct x86_exception *exception) { + struct kvm_host_map map; struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); gpa_t gpa; - struct page *page; char *kaddr; bool exchanged; @@ -5299,12 +5299,11 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt, if (((gpa + bytes - 1) & PAGE_MASK) != (gpa & PAGE_MASK)) goto emul_write; - page = kvm_vcpu_gfn_to_page(vcpu, gpa >> PAGE_SHIFT); - if (is_error_page(page)) + if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map)) goto emul_write; - kaddr = kmap_atomic(page); - kaddr += offset_in_page(gpa); + kaddr = map.hva + offset_in_page(gpa); + switch (bytes) { case 1: exchanged = CMPXCHG_TYPE(u8, kaddr, old, new); @@ -5321,8 +5320,8 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt, default: BUG(); } - kunmap_atomic(kaddr); - kvm_release_page_dirty(page); + + kvm_vcpu_unmap(&map); if (!exchanged) return X86EMUL_CMPXCHG_FAILED; -- 2.7.4
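The width-dispatched compare-and-exchange that emulator_cmpxchg_emulated performs on the mapped hva can be sketched in plain C. The kernel's CMPXCHG_TYPE macro ultimately does an atomic cmpxchg of the given width; here the GCC `__sync` builtin stands in for it, and the function name is a hypothetical one for illustration:

```c
#include <stdint.h>

/* Sketch of the bytes-dispatched cmpxchg on a mapped host address.
 * Returns 1 if the exchange happened, 0 otherwise (the kernel BUG()s
 * on an unsupported width; here we just fail). */
int cmpxchg_emulated(void *kaddr, const void *old, const void *new,
		     unsigned int bytes)
{
	switch (bytes) {
	case 1:
		return __sync_bool_compare_and_swap((uint8_t *)kaddr,
						    *(const uint8_t *)old,
						    *(const uint8_t *)new);
	case 2:
		return __sync_bool_compare_and_swap((uint16_t *)kaddr,
						    *(const uint16_t *)old,
						    *(const uint16_t *)new);
	case 4:
		return __sync_bool_compare_and_swap((uint32_t *)kaddr,
						    *(const uint32_t *)old,
						    *(const uint32_t *)new);
	case 8:
		return __sync_bool_compare_and_swap((uint64_t *)kaddr,
						    *(const uint64_t *)old,
						    *(const uint64_t *)new);
	default:
		return 0;
	}
}
```

Note the emulator's precondition (visible in the hunk above): the operand must not cross a page boundary, since only one page is mapped.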
[PATCH v3 00/13] KVM/X86: Introduce a new guest mapping interface
Guest memory can either be directly managed by the kernel (i.e. have a "struct page") or it can simply live outside kernel control (i.e. not have a "struct page"). KVM mostly supports these two modes, except in a few places where the code seems to assume that guest memory must have a "struct page". This patchset introduces a new mapping interface to map guest memory into host kernel memory which also supports PFN-based memory (i.e. memory without a 'struct page'). It also converts all offending code to this interface or simply reads/writes directly from guest memory. As far as I can see, all offending code is now fixed except the APIC-access page, which I will handle in a separate series along with dropping kvm_vcpu_gfn_to_page and kvm_vcpu_gpa_to_page from the internal KVM API. v2 -> v3: - rebase - Add a new patch to also fix the newly introduced shadow VMCS support for nested. Filippo Sironi (1): X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs KarimAllah Ahmed (12): X86/nVMX: handle_vmon: Read 4 bytes from guest memory X86/nVMX: handle_vmptrld: Copy the VMCS12 directly from guest memory X86/nVMX: Update the PML table without mapping and unmapping the page KVM: Introduce a new guest mapping API KVM/nVMX: Use kvm_vcpu_map when mapping the L1 MSR bitmap KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page KVM/nVMX: Use kvm_vcpu_map when mapping the posted interrupt descriptor table KVM/X86: Use kvm_vcpu_map in emulator_cmpxchg_emulated KVM/X86: hyperv: Use kvm_vcpu_map in synic_clear_sint_msg_pending KVM/X86: hyperv: Use kvm_vcpu_map in synic_deliver_msg KVM/nSVM: Use the new mapping API for mapping guest memory KVM/nVMX: Use kvm_vcpu_map for accessing the shadow VMCS arch/x86/kvm/hyperv.c | 28 arch/x86/kvm/paging_tmpl.h | 38 --- arch/x86/kvm/svm.c | 97 +- arch/x86/kvm/vmx.c | 167 + arch/x86/kvm/x86.c | 13 ++-- include/linux/kvm_host.h | 9 +++ virt/kvm/kvm_main.c | 50 ++ 7 files changed, 217 insertions(+), 185 deletions(-) -- 2.7.4
[PATCH v3 13/13] KVM/nVMX: Use kvm_vcpu_map for accessing the shadow VMCS
Use kvm_vcpu_map for accessing the shadow VMCS since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". Signed-off-by: KarimAllah Ahmed --- arch/x86/kvm/vmx.c | 25 - 1 file changed, 12 insertions(+), 13 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 0ae25c3..2892770 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -11616,20 +11616,20 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu, static void nested_cache_shadow_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { + struct kvm_host_map map; struct vmcs12 *shadow; - struct page *page; if (!nested_cpu_has_shadow_vmcs(vmcs12) || vmcs12->vmcs_link_pointer == -1ull) return; shadow = get_shadow_vmcs12(vcpu); - page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->vmcs_link_pointer); - memcpy(shadow, kmap(page), VMCS12_SIZE); + if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->vmcs_link_pointer), &map)) + return; - kunmap(page); - kvm_release_page_clean(page); + memcpy(shadow, map.hva, VMCS12_SIZE); + kvm_vcpu_unmap(&map); } static void nested_flush_cached_shadow_vmcs12(struct kvm_vcpu *vcpu, @@ -12494,9 +12494,9 @@ static int check_vmentry_prereqs(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) static int nested_vmx_check_vmcs_link_ptr(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { - int r; - struct page *page; + int r = 0; struct vmcs12 *shadow; + struct kvm_host_map map; if (vmcs12->vmcs_link_pointer == -1ull) return 0; @@ -12504,17 +12504,16 @@ static int nested_vmx_check_vmcs_link_ptr(struct kvm_vcpu *vcpu, if (!page_address_valid(vcpu, vmcs12->vmcs_link_pointer)) return -EINVAL; - page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->vmcs_link_pointer); - if (is_error_page(page)) + if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->vmcs_link_pointer), &map)) return -EINVAL; - r = 0; - shadow = kmap(page); + shadow = map.hva; + if (shadow->hdr.revision_id != VMCS12_REVISION || shadow->hdr.shadow_vmcs != nested_cpu_has_shadow_vmcs(vmcs12)) r = -EINVAL; -
kunmap(page); - kvm_release_page_clean(page); + + kvm_vcpu_unmap(&map); return r; } -- 2.7.4
[PATCH v3 12/13] KVM/nSVM: Use the new mapping API for mapping guest memory
Use the new mapping API for mapping guest memory to avoid depending on "struct page". Signed-off-by: KarimAllah Ahmed --- arch/x86/kvm/svm.c | 97 +++--- 1 file changed, 49 insertions(+), 48 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index d96092b..911d853 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3036,32 +3036,6 @@ static inline bool nested_svm_nmi(struct vcpu_svm *svm) return false; } -static void *nested_svm_map(struct vcpu_svm *svm, u64 gpa, struct page **_page) -{ - struct page *page; - - might_sleep(); - - page = kvm_vcpu_gfn_to_page(&svm->vcpu, gpa >> PAGE_SHIFT); - if (is_error_page(page)) - goto error; - - *_page = page; - - return kmap(page); - -error: - kvm_inject_gp(&svm->vcpu, 0); - - return NULL; -} - -static void nested_svm_unmap(struct page *page) -{ - kunmap(page); - kvm_release_page_dirty(page); -} - static int nested_svm_intercept_ioio(struct vcpu_svm *svm) { unsigned port, size, iopm_len; @@ -3262,10 +3236,11 @@ static inline void copy_vmcb_control_area(struct vmcb *dst_vmcb, struct vmcb *fr static int nested_svm_vmexit(struct vcpu_svm *svm) { + int rc; struct vmcb *nested_vmcb; struct vmcb *hsave = svm->nested.hsave; struct vmcb *vmcb = svm->vmcb; - struct page *page; + struct kvm_host_map map; trace_kvm_nested_vmexit_inject(vmcb->control.exit_code, vmcb->control.exit_info_1, @@ -3274,9 +3249,14 @@ static int nested_svm_vmexit(struct vcpu_svm *svm) vmcb->control.exit_int_info_err, KVM_ISA_SVM); - nested_vmcb = nested_svm_map(svm, svm->nested.vmcb, &page); - if (!nested_vmcb) + rc = kvm_vcpu_map(&svm->vcpu, gfn_to_gpa(svm->nested.vmcb), &map); + if (rc) { + if (rc == -EINVAL) + kvm_inject_gp(&svm->vcpu, 0); return 1; + } + + nested_vmcb = map.hva; /* Exit Guest-Mode */ leave_guest_mode(&svm->vcpu); @@ -3375,7 +3355,7 @@ static int nested_svm_vmexit(struct vcpu_svm *svm) mark_all_dirty(svm->vmcb); - nested_svm_unmap(page); + kvm_vcpu_unmap(&map); nested_svm_uninit_mmu_context(&svm->vcpu); kvm_mmu_reset_context(&svm->vcpu); @@ -3433,7 +3413,7 @@ static
bool nested_vmcb_checks(struct vmcb *vmcb) } static void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, -struct vmcb *nested_vmcb, struct page *page) +struct vmcb *nested_vmcb, struct kvm_host_map *map) { if (kvm_get_rflags(&svm->vcpu) & X86_EFLAGS_IF) svm->vcpu.arch.hflags |= HF_HIF_MASK; @@ -3513,7 +3493,7 @@ static void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, svm->vmcb->control.event_inj = nested_vmcb->control.event_inj; svm->vmcb->control.event_inj_err = nested_vmcb->control.event_inj_err; - nested_svm_unmap(page); + kvm_vcpu_unmap(map); /* Enter Guest-Mode */ enter_guest_mode(&svm->vcpu); @@ -3533,17 +3513,23 @@ static void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, static bool nested_svm_vmrun(struct vcpu_svm *svm) { + int rc; struct vmcb *nested_vmcb; struct vmcb *hsave = svm->nested.hsave; struct vmcb *vmcb = svm->vmcb; - struct page *page; + struct kvm_host_map map; u64 vmcb_gpa; vmcb_gpa = svm->vmcb->save.rax; - nested_vmcb = nested_svm_map(svm, svm->vmcb->save.rax, &page); - if (!nested_vmcb) + rc = kvm_vcpu_map(&svm->vcpu, gfn_to_gpa(vmcb_gpa), &map); + if (rc) { + if (rc == -EINVAL) + kvm_inject_gp(&svm->vcpu, 0); return false; + } + + nested_vmcb = map.hva; if (!nested_vmcb_checks(nested_vmcb)) { nested_vmcb->control.exit_code = SVM_EXIT_ERR; @@ -3551,7 +3537,7 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm) nested_vmcb->control.exit_info_1 = 0; nested_vmcb->control.exit_info_2 = 0; - nested_svm_unmap(page); + kvm_vcpu_unmap(&map); return false; } @@ -3595,7 +3581,7 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm) copy_vmcb_control_area(hsave, vmcb); - enter_svm_guest_mode(svm, vmcb_gpa, nested_vmcb, page); + enter_svm_guest_mode(svm, vmcb_gpa, nested_vmcb, &map); return true; } @@ -3619,21 +3605,26 @@ static void nested_svm_vmloadsave(struct vmcb *from_vmcb, struct vmcb *to_vmcb) static int vmload_interception(struct vcpu_svm *svm) { struct vmcb *nested_vmcb; - struct page *page; + struct kvm_host_map map; int ret; if
(nested_svm_check_permissions(svm))
[PATCH v3 02/13] X86/nVMX: handle_vmptrld: Copy the VMCS12 directly from guest memory
Copy the VMCS12 directly from guest memory instead of the map->copy->unmap sequence. This also avoids using kvm_vcpu_gpa_to_page() and kmap() which assumes that there is a "struct page" for guest memory. Signed-off-by: KarimAllah Ahmed --- v1 -> v2: - Massage commit message a bit. --- arch/x86/kvm/vmx.c | 23 +-- 1 file changed, 9 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 358759a..bc45347 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -8879,33 +8879,28 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu) } if (vmx->nested.current_vmptr != vmptr) { - struct vmcs12 *new_vmcs12; - struct page *page; - page = kvm_vcpu_gpa_to_page(vcpu, vmptr); - if (is_error_page(page)) { + struct vmcs12 *new_vmcs12 = (struct vmcs12 *)__get_free_page(GFP_KERNEL); + + if (!new_vmcs12 || + kvm_read_guest(vcpu->kvm, vmptr, new_vmcs12, + sizeof(*new_vmcs12))) { + free_page((unsigned long)new_vmcs12); nested_vmx_failInvalid(vcpu); return kvm_skip_emulated_instruction(vcpu); } - new_vmcs12 = kmap(page); + if (new_vmcs12->hdr.revision_id != VMCS12_REVISION || (new_vmcs12->hdr.shadow_vmcs && !nested_cpu_has_vmx_shadow_vmcs(vcpu))) { - kunmap(page); - kvm_release_page_clean(page); + free_page((unsigned long)new_vmcs12); nested_vmx_failValid(vcpu, VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID); return kvm_skip_emulated_instruction(vcpu); } nested_release_vmcs12(vmx); - /* -* Load VMCS12 from guest memory since it is not already -* cached. -*/ memcpy(vmx->nested.cached_vmcs12, new_vmcs12, VMCS12_SIZE); - kunmap(page); - kvm_release_page_clean(page); - + free_page((unsigned long)new_vmcs12); set_current_vmptr(vmx, vmptr); } -- 2.7.4
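The new handle_vmptrld flow — read the candidate VMCS12 into a temporary buffer straight from guest memory, validate its header, and only then install it into the cache — can be sketched in userspace C. The struct layout, the revision constant, and all function names here are illustrative, not the real VMCS12 definitions:

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define FAKE_VMCS12_REVISION 0x12u	/* illustrative, not VMCS12_REVISION */

struct fake_vmcs12 {
	uint32_t revision_id;
	uint32_t shadow_vmcs;
	uint8_t data[56];
};

/* Stand-in for kvm_read_guest(): copy from "guest" memory, 0 on success. */
static int read_guest(const void *guest_mem, void *dst, size_t len)
{
	if (!guest_mem)
		return -1;
	memcpy(dst, guest_mem, len);
	return 0;
}

/* Read, validate, then cache — freeing the temporary buffer on every
 * path, as the patch does with free_page(). Returns 0 on success. */
int load_vmcs12(const void *guest_mem, struct fake_vmcs12 *cache)
{
	struct fake_vmcs12 *tmp = malloc(sizeof(*tmp));

	if (!tmp || read_guest(guest_mem, tmp, sizeof(*tmp))) {
		free(tmp);
		return -1;
	}
	if (tmp->revision_id != FAKE_VMCS12_REVISION) {
		free(tmp);
		return -1;
	}
	memcpy(cache, tmp, sizeof(*tmp));
	free(tmp);
	return 0;
}
```

The key property the patch gains is that no host mapping of the guest page is ever created, so the path works for memory without a "struct page".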
[PATCH V4 3/5] KVM: X86: Adding skeleton for Memory ROE
This patch introduces a hypercall implemented for X86 that can assist against a subset of kernel rootkits; it works by placing read-only protection in the shadow PTEs. The resulting protection is also kept in a bitmap for each kvm_memory_slot and is used as a reference when updating SPTEs. The whole goal is to protect the guest kernel static data from modification even if the attacker is running in guest ring 0; for this reason there is no hypercall to revert the effect of the Memory ROE hypercall. This patch doesn't implement an integrity check on the guest TLB, so an obvious attack on the current implementation will involve guest virtual address -> guest physical address remapping, but there are plans to fix that. Signed-off-by: Ahmed Abd El Mawgood --- arch/x86/include/asm/kvm_host.h | 11 ++- arch/x86/kvm/Kconfig| 7 ++ arch/x86/kvm/mmu.c | 72 +--- arch/x86/kvm/x86.c | 143 +++- include/linux/kvm_host.h| 3 + include/uapi/linux/kvm_para.h | 4 + virt/kvm/kvm_main.c | 34 +++- 7 files changed, 255 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 09b2e3e2cf1b..aa080c3e302e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -238,6 +238,15 @@ struct kvm_mmu_memory_cache { void *objects[KVM_NR_MEM_OBJS]; }; +/* + * This is an internal structure used to be able to access the kvm memory slot + * and keep track of the current PTE number when doing a shadow PTE walk + */ +struct kvm_write_access_data { + int i; + struct kvm_memory_slot *memslot; +}; + /* * the pages used as guest page table on soft mmu are tracked by * kvm_memory_slot.arch.gfn_track which is 16 bits, so the role bits used @@ -1178,7 +1187,7 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask, u64 acc_track_mask, u64 me_mask); void kvm_mmu_reset_context(struct kvm_vcpu *vcpu); -void kvm_mmu_slot_remove_write_access(struct kvm *kvm, +void kvm_mmu_slot_apply_write_access(struct kvm *kvm, struct kvm_memory_slot *memslot); void
kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, const struct kvm_memory_slot *memslot); diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 1bbec387d289..2fcbb1788a24 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -96,6 +96,13 @@ config KVM_MMU_AUDIT This option adds a R/W kVM module parameter 'mmu_audit', which allows auditing of KVM MMU events at runtime. +config KVM_ROE + bool "Hypercall Memory Read-Only Enforcement" + depends on KVM && X86 + help + This option adds KVM_HC_ROE hypercall to kvm as a hardening + mechanism to protect memory pages from being edited. + # OK, it's a little counter-intuitive to do this, but it puts it neatly under # the virtualization menu. source drivers/vhost/Kconfig diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index cc36abe1ee44..c54aa5287e14 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1484,9 +1484,8 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect) return mmu_spte_update(sptep, spte); } -static bool __rmap_write_protect(struct kvm *kvm, -struct kvm_rmap_head *rmap_head, -bool pt_protect, void *data) +static bool __rmap_write_protection(struct kvm *kvm, + struct kvm_rmap_head *rmap_head, bool pt_protect) { u64 *sptep; struct rmap_iterator iter; @@ -1498,6 +1497,38 @@ static bool __rmap_write_protect(struct kvm *kvm, return flush; } +#ifdef CONFIG_KVM_ROE +static bool __rmap_write_protect_roe(struct kvm *kvm, + struct kvm_rmap_head *rmap_head, + bool pt_protect, + struct kvm_write_access_data *d) +{ + u64 *sptep; + struct rmap_iterator iter; + bool prot; + bool flush = false; + + for_each_rmap_spte(rmap_head, &iter, sptep) { + prot = !test_bit(d->i, d->memslot->roe_bitmap) && pt_protect; + flush |= spte_write_protect(sptep, prot); + d->i++; + } + return flush; +} +#endif + +static bool __rmap_write_protect(struct kvm *kvm, + struct kvm_rmap_head *rmap_head, + bool pt_protect, + struct kvm_write_access_data *d) +{ +#ifdef CONFIG_KVM_ROE + if (d != NULL) + return
__rmap_write_protect_roe(kvm, rmap_head, pt_protect, d); +#endif + return __rmap_write_protection(kvm, rmap_head, pt_protect); +} + static bool spte_clear_dirty(u64 *sptep) { u64 spte = *sptep; @@ -1585,7 +1616,7 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, while (mask) { rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
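The per-page predicate that __rmap_write_protect_roe computes — `prot = !test_bit(d->i, d->memslot->roe_bitmap) && pt_protect;` — can be reproduced standalone. A plain uint64_t stands in for the kernel bitmap, and both function names below are hypothetical; this sketch only mirrors that one line of logic, not the SPTE update around it:

```c
#include <stdint.h>

/* Stand-in for the kernel's test_bit() on a 64-bit-wide bitmap. */
static int test_bit64(int nr, uint64_t bitmap)
{
	return (int)((bitmap >> nr) & 1u);
}

/* Mirrors the patch's predicate: regular write protection is applied
 * only when requested (pt_protect) and the page's roe_bitmap bit is
 * clear; pages with the bit set are handled by the ROE path instead. */
int should_write_protect(int page_index, uint64_t roe_bitmap, int pt_protect)
{
	return !test_bit64(page_index, roe_bitmap) && pt_protect;
}
```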
[PATCH v3 08/13] KVM/nVMX: Use kvm_vcpu_map when mapping the posted interrupt descriptor table
Use kvm_vcpu_map when mapping the posted interrupt descriptor table since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". One additional semantic change is that the virtual host mapping lifecycle has changed a bit. It now has the same lifetime as the pinning of the interrupt descriptor table page on the host side. Signed-off-by: KarimAllah Ahmed --- v1 -> v2: - Do not change the lifecycle of the mapping (pbonzini) --- arch/x86/kvm/vmx.c | 45 +++-- 1 file changed, 15 insertions(+), 30 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 83a5e95..0ae25c3 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -846,7 +846,7 @@ struct nested_vmx { */ struct page *apic_access_page; struct kvm_host_map virtual_apic_map; - struct page *pi_desc_page; + struct kvm_host_map pi_desc_map; struct kvm_host_map msr_bitmap_map; struct pi_desc *pi_desc; @@ -8469,12 +8469,8 @@ static void free_nested(struct vcpu_vmx *vmx) vmx->nested.apic_access_page = NULL; } kvm_vcpu_unmap(&vmx->nested.virtual_apic_map); - if (vmx->nested.pi_desc_page) { - kunmap(vmx->nested.pi_desc_page); - kvm_release_page_dirty(vmx->nested.pi_desc_page); - vmx->nested.pi_desc_page = NULL; - vmx->nested.pi_desc = NULL; - } + kvm_vcpu_unmap(&vmx->nested.pi_desc_map); + vmx->nested.pi_desc = NULL; free_loaded_vmcs(&vmx->nested.vmcs02); } @@ -11444,24 +11440,16 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) } if (nested_cpu_has_posted_intr(vmcs12)) { - if (vmx->nested.pi_desc_page) { /* shouldn't happen */ - kunmap(vmx->nested.pi_desc_page); - kvm_release_page_dirty(vmx->nested.pi_desc_page); - vmx->nested.pi_desc_page = NULL; + map = &vmx->nested.pi_desc_map; + + if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->posted_intr_desc_addr), map)) { + vmx->nested.pi_desc = + (struct pi_desc *)(((void *)map->hva) + + offset_in_page(vmcs12->posted_intr_desc_addr)); + vmcs_write64(POSTED_INTR_DESC_ADDR, pfn_to_hpa(map->pfn) +
+ offset_in_page(vmcs12->posted_intr_desc_addr)); } - page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->posted_intr_desc_addr); - if (is_error_page(page)) - return; - vmx->nested.pi_desc_page = page; - vmx->nested.pi_desc = kmap(vmx->nested.pi_desc_page); - vmx->nested.pi_desc = - (struct pi_desc *)((void *)vmx->nested.pi_desc + - (unsigned long)(vmcs12->posted_intr_desc_addr & - (PAGE_SIZE - 1))); - vmcs_write64(POSTED_INTR_DESC_ADDR, - page_to_phys(vmx->nested.pi_desc_page) + - (unsigned long)(vmcs12->posted_intr_desc_addr & - (PAGE_SIZE - 1))); + } if (nested_vmx_prepare_msr_bitmap(vcpu, vmcs12)) vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL, @@ -13344,13 +13332,10 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason, kvm_release_page_dirty(vmx->nested.apic_access_page); vmx->nested.apic_access_page = NULL; } + kvm_vcpu_unmap(&vmx->nested.virtual_apic_map); - if (vmx->nested.pi_desc_page) { - kunmap(vmx->nested.pi_desc_page); - kvm_release_page_dirty(vmx->nested.pi_desc_page); - vmx->nested.pi_desc_page = NULL; - vmx->nested.pi_desc = NULL; - } + kvm_vcpu_unmap(&vmx->nested.pi_desc_map); + vmx->nested.pi_desc = NULL; /* * We are now running in L2, mmu_notifier will force to reload the -- 2.7.4
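The posted-interrupt descriptor need not be page-aligned, so both the host pointer and the physical address written to the VMCS add the descriptor's offset within its page (`offset_in_page(...)`, i.e. `addr & (PAGE_SIZE - 1)`). A sketch of that arithmetic, assuming a 4 KiB page and hypothetical function names:

```c
#include <stdint.h>

#define PAGE_SIZE_SK 4096ull	/* illustrative 4 KiB page */

/* Equivalent of the kernel's offset_in_page(): low bits of the address. */
static uint64_t offset_in_page_sk(uint64_t addr)
{
	return addr & (PAGE_SIZE_SK - 1);
}

/* Physical address of the descriptor, given the mapping's page-aligned
 * host physical address (what pfn_to_hpa(map->pfn) yields) and the
 * guest-physical descriptor address. */
uint64_t desc_hpa(uint64_t page_hpa, uint64_t desc_gpa)
{
	return page_hpa + offset_in_page_sk(desc_gpa);
}
```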
[PATCH v3 04/13] X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs
From: Filippo Sironi cmpxchg_gpte() calls get_user_pages_fast() to retrieve the number of pages and the respective struct page to map in the kernel virtual address space. This doesn't work if get_user_pages_fast() is invoked with a userspace virtual address that's backed by PFNs outside of kernel reach (e.g., when limiting the kernel memory with mem= in the command line and using /dev/mem to map memory). If get_user_pages_fast() fails, look up the VMA that backs the userspace virtual address, compute the PFN and the physical address, and map it in the kernel virtual address space with memremap(). Signed-off-by: Filippo Sironi Signed-off-by: KarimAllah Ahmed --- arch/x86/kvm/paging_tmpl.h | 38 +- 1 file changed, 29 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 14ffd97..8f7bc8f 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -141,15 +141,35 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, struct page *page; npages = get_user_pages_fast((unsigned long)ptep_user, 1, 1, &page); - /* Check if the user is doing something meaningless. */ - if (unlikely(npages != 1)) - return -EFAULT; - - table = kmap_atomic(page); - ret = CMPXCHG(&table[index], orig_pte, new_pte); - kunmap_atomic(table); - - kvm_release_page_dirty(page); + if (likely(npages == 1)) { + table = kmap_atomic(page); + ret = CMPXCHG(&table[index], orig_pte, new_pte); + kunmap_atomic(table); + + kvm_release_page_dirty(page); + } else { + struct vm_area_struct *vma; + unsigned long vaddr = (unsigned long)ptep_user & PAGE_MASK; + unsigned long pfn; + unsigned long paddr; + + down_read(&current->mm->mmap_sem); + vma = find_vma_intersection(current->mm, vaddr, vaddr + PAGE_SIZE); + if (!vma || !(vma->vm_flags & VM_PFNMAP)) { + up_read(&current->mm->mmap_sem); + return -EFAULT; + } + pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; + paddr = pfn << PAGE_SHIFT; + table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB); + if (!table) { + up_read(&current->mm->mmap_sem); + return -EFAULT; + } + ret = CMPXCHG(&table[index], orig_pte, new_pte); + memunmap(table); + up_read(&current->mm->mmap_sem); + } return (ret != orig_pte); } -- 2.7.4
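The fallback path's address arithmetic — for a VM_PFNMAP VMA, the pfn backing a user virtual address is the page offset of vaddr within the VMA plus the VMA's starting pfn (vm_pgoff), and the physical address is that pfn shifted back up — can be checked in isolation. Function names and the 4 KiB page constant here are illustrative:

```c
#include <stdint.h>

#define PAGE_SHIFT_SK 12	/* illustrative: 4 KiB pages */

/* pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; */
uint64_t vma_pfn(uint64_t vaddr, uint64_t vm_start, uint64_t vm_pgoff)
{
	return ((vaddr - vm_start) >> PAGE_SHIFT_SK) + vm_pgoff;
}

/* paddr = pfn << PAGE_SHIFT; — the address handed to memremap(). */
uint64_t vma_paddr(uint64_t vaddr, uint64_t vm_start, uint64_t vm_pgoff)
{
	return vma_pfn(vaddr, vm_start, vm_pgoff) << PAGE_SHIFT_SK;
}
```

This is exactly why the patch masks ptep_user with PAGE_MASK first: the arithmetic is per-page, and the in-page offset is re-applied via `table[index]`.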
[PATCH v3 02/13] X86/nVMX: handle_vmptrld: Copy the VMCS12 directly from guest memory
Copy the VMCS12 directly from guest memory instead of the map->copy->unmap sequence. This also avoids using kvm_vcpu_gpa_to_page() and kmap() which assumes that there is a "struct page" for guest memory. Signed-off-by: KarimAllah Ahmed --- v1 -> v2: - Massage commit message a bit. --- arch/x86/kvm/vmx.c | 23 +-- 1 file changed, 9 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 358759a..bc45347 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -8879,33 +8879,28 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu) } if (vmx->nested.current_vmptr != vmptr) { - struct vmcs12 *new_vmcs12; - struct page *page; - page = kvm_vcpu_gpa_to_page(vcpu, vmptr); - if (is_error_page(page)) { + struct vmcs12 *new_vmcs12 = (struct vmcs12 *)__get_free_page(GFP_KERNEL); + + if (!new_vmcs12 || + kvm_read_guest(vcpu->kvm, vmptr, new_vmcs12, + sizeof(*new_vmcs12))) { + free_page((unsigned long)new_vmcs12); nested_vmx_failInvalid(vcpu); return kvm_skip_emulated_instruction(vcpu); } - new_vmcs12 = kmap(page); + if (new_vmcs12->hdr.revision_id != VMCS12_REVISION || (new_vmcs12->hdr.shadow_vmcs && !nested_cpu_has_vmx_shadow_vmcs(vcpu))) { - kunmap(page); - kvm_release_page_clean(page); + free_page((unsigned long)new_vmcs12); nested_vmx_failValid(vcpu, VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID); return kvm_skip_emulated_instruction(vcpu); } nested_release_vmcs12(vmx); - /* -* Load VMCS12 from guest memory since it is not already -* cached. -*/ memcpy(vmx->nested.cached_vmcs12, new_vmcs12, VMCS12_SIZE); - kunmap(page); - kvm_release_page_clean(page); - + free_page((unsigned long)new_vmcs12); set_current_vmptr(vmx, vmptr); } -- 2.7.4
[PATCH V4 3/5] KVM: X86: Adding skeleton for Memory ROE
This patch introduces a hypercall implemented for X86 that can assist against subset of kernel rootkits, it works by place readonly protection in shadow PTE. The end result protection is also kept in a bitmap for each kvm_memory_slot and is used as reference when updating SPTEs. The whole goal is to protect the guest kernel static data from modification if attacker is running from guest ring 0, for this reason there is no hypercall to revert effect of Memory ROE hypercall. This patch doesn't implement integrity check on guest TLB so obvious attack on the current implementation will involve guest virtual address -> guest physical address remapping, but there are plans to fix that. Signed-off-by: Ahmed Abd El Mawgood --- arch/x86/include/asm/kvm_host.h | 11 ++- arch/x86/kvm/Kconfig| 7 ++ arch/x86/kvm/mmu.c | 72 +--- arch/x86/kvm/x86.c | 143 +++- include/linux/kvm_host.h| 3 + include/uapi/linux/kvm_para.h | 4 + virt/kvm/kvm_main.c | 34 +++- 7 files changed, 255 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 09b2e3e2cf1b..aa080c3e302e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -238,6 +238,15 @@ struct kvm_mmu_memory_cache { void *objects[KVM_NR_MEM_OBJS]; }; +/* + * This is internal structure used to be be able to access kvm memory slot and + * have track of the number of current PTE when doing shadow PTE walk + */ +struct kvm_write_access_data { + int i; + struct kvm_memory_slot *memslot; +}; + /* * the pages used as guest page table on soft mmu are tracked by * kvm_memory_slot.arch.gfn_track which is 16 bits, so the role bits used @@ -1178,7 +1187,7 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask, u64 acc_track_mask, u64 me_mask); void kvm_mmu_reset_context(struct kvm_vcpu *vcpu); -void kvm_mmu_slot_remove_write_access(struct kvm *kvm, +void kvm_mmu_slot_apply_write_access(struct kvm *kvm, struct kvm_memory_slot *memslot); void 
kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, const struct kvm_memory_slot *memslot); diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 1bbec387d289..2fcbb1788a24 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -96,6 +96,13 @@ config KVM_MMU_AUDIT This option adds a R/W kVM module parameter 'mmu_audit', which allows auditing of KVM MMU events at runtime. +config KVM_ROE + bool "Hypercall Memory Read-Only Enforcement" + depends on KVM && X86 + help + This option adds KVM_HC_ROE hypercall to kvm as a hardening + mechanism to protect memory pages from being edited. + # OK, it's a little counter-intuitive to do this, but it puts it neatly under # the virtualization menu. source drivers/vhost/Kconfig diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index cc36abe1ee44..c54aa5287e14 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1484,9 +1484,8 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect) return mmu_spte_update(sptep, spte); } -static bool __rmap_write_protect(struct kvm *kvm, -struct kvm_rmap_head *rmap_head, -bool pt_protect, void *data) +static bool __rmap_write_protection(struct kvm *kvm, + struct kvm_rmap_head *rmap_head, bool pt_protect) { u64 *sptep; struct rmap_iterator iter; @@ -1498,6 +1497,38 @@ static bool __rmap_write_protect(struct kvm *kvm, return flush; } +#ifdef CONFIG_KVM_ROE +static bool __rmap_write_protect_roe(struct kvm *kvm, + struct kvm_rmap_head *rmap_head, + bool pt_protect, + struct kvm_write_access_data *d) +{ + u64 *sptep; + struct rmap_iterator iter; + bool prot; + bool flush = false; + + for_each_rmap_spte(rmap_head, &iter, sptep) { + prot = !test_bit(d->i, d->memslot->roe_bitmap) && pt_protect; + flush |= spte_write_protect(sptep, prot); + d->i++; + } + return flush; +} +#endif + +static bool __rmap_write_protect(struct kvm *kvm, + struct kvm_rmap_head *rmap_head, + bool pt_protect, + struct kvm_write_access_data *d) +{ +#ifdef CONFIG_KVM_ROE + if (d != NULL) + return
__rmap_write_protect_roe(kvm, rmap_head, pt_protect, d); +#endif + return __rmap_write_protection(kvm, rmap_head, pt_protect); +} + static bool spte_clear_dirty(u64 *sptep) { u64 spte = *sptep; @@ -1585,7 +1616,7 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, while (mask) { rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
[PATCH v3 08/13] KVM/nVMX: Use kvm_vcpu_map when mapping the posted interrupt descriptor table
Use kvm_vcpu_map when mapping the posted interrupt descriptor table since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". One additional semantic change is that the virtual host mapping lifecycle has changed a bit. It now has the same lifetime as the pinning of the interrupt descriptor table page on the host side. Signed-off-by: KarimAllah Ahmed --- v1 -> v2: - Do not change the lifecycle of the mapping (pbonzini) --- arch/x86/kvm/vmx.c | 45 +++-- 1 file changed, 15 insertions(+), 30 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 83a5e95..0ae25c3 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -846,7 +846,7 @@ struct nested_vmx { */ struct page *apic_access_page; struct kvm_host_map virtual_apic_map; - struct page *pi_desc_page; + struct kvm_host_map pi_desc_map; struct kvm_host_map msr_bitmap_map; struct pi_desc *pi_desc; @@ -8469,12 +8469,8 @@ static void free_nested(struct vcpu_vmx *vmx) vmx->nested.apic_access_page = NULL; } kvm_vcpu_unmap(&vmx->nested.virtual_apic_map); - if (vmx->nested.pi_desc_page) { - kunmap(vmx->nested.pi_desc_page); - kvm_release_page_dirty(vmx->nested.pi_desc_page); - vmx->nested.pi_desc_page = NULL; - vmx->nested.pi_desc = NULL; - } + kvm_vcpu_unmap(&vmx->nested.pi_desc_map); + vmx->nested.pi_desc = NULL; free_loaded_vmcs(&vmx->nested.vmcs02); } @@ -11444,24 +11440,16 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) } if (nested_cpu_has_posted_intr(vmcs12)) { - if (vmx->nested.pi_desc_page) { /* shouldn't happen */ - kunmap(vmx->nested.pi_desc_page); - kvm_release_page_dirty(vmx->nested.pi_desc_page); - vmx->nested.pi_desc_page = NULL; + map = &vmx->nested.pi_desc_map; + + if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->posted_intr_desc_addr), map)) { + vmx->nested.pi_desc = + (struct pi_desc *)(((void *)map->hva) + + offset_in_page(vmcs12->posted_intr_desc_addr)); + vmcs_write64(POSTED_INTR_DESC_ADDR, pfn_to_hpa(map->pfn) + + offset_in_page(vmcs12->posted_intr_desc_addr)); } - page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->posted_intr_desc_addr); - if (is_error_page(page)) - return; - vmx->nested.pi_desc_page = page; - vmx->nested.pi_desc = kmap(vmx->nested.pi_desc_page); - vmx->nested.pi_desc = - (struct pi_desc *)((void *)vmx->nested.pi_desc + - (unsigned long)(vmcs12->posted_intr_desc_addr & - (PAGE_SIZE - 1))); - vmcs_write64(POSTED_INTR_DESC_ADDR, - page_to_phys(vmx->nested.pi_desc_page) + - (unsigned long)(vmcs12->posted_intr_desc_addr & - (PAGE_SIZE - 1))); + } if (nested_vmx_prepare_msr_bitmap(vcpu, vmcs12)) vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL, @@ -13344,13 +13332,10 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason, kvm_release_page_dirty(vmx->nested.apic_access_page); vmx->nested.apic_access_page = NULL; } + kvm_vcpu_unmap(&vmx->nested.virtual_apic_map); - if (vmx->nested.pi_desc_page) { - kunmap(vmx->nested.pi_desc_page); - kvm_release_page_dirty(vmx->nested.pi_desc_page); - vmx->nested.pi_desc_page = NULL; - vmx->nested.pi_desc = NULL; - } + kvm_vcpu_unmap(&vmx->nested.pi_desc_map); + vmx->nested.pi_desc = NULL; /* * We are now running in L2, mmu_notifier will force to reload the -- 2.7.4
[PATCH v3 04/13] X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs
From: Filippo Sironi cmpxchg_gpte() calls get_user_pages_fast() to retrieve the number of pages and the respective struct page to map in the kernel virtual address space. This doesn't work if get_user_pages_fast() is invoked with a userspace virtual address that's backed by PFNs outside of kernel reach (e.g., when limiting the kernel memory with mem= in the command line and using /dev/mem to map memory). If get_user_pages_fast() fails, look up the VMA that backs the userspace virtual address, compute the PFN and the physical address, and map it in the kernel virtual address space with memremap(). Signed-off-by: Filippo Sironi Signed-off-by: KarimAllah Ahmed --- arch/x86/kvm/paging_tmpl.h | 38 +- 1 file changed, 29 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 14ffd97..8f7bc8f 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -141,15 +141,35 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, struct page *page; npages = get_user_pages_fast((unsigned long)ptep_user, 1, 1, &page); - /* Check if the user is doing something meaningless. */ - if (unlikely(npages != 1)) - return -EFAULT; - - table = kmap_atomic(page); - ret = CMPXCHG(&table[index], orig_pte, new_pte); - kunmap_atomic(table); - - kvm_release_page_dirty(page); + if (likely(npages == 1)) { + table = kmap_atomic(page); + ret = CMPXCHG(&table[index], orig_pte, new_pte); + kunmap_atomic(table); + + kvm_release_page_dirty(page); + } else { + struct vm_area_struct *vma; + unsigned long vaddr = (unsigned long)ptep_user & PAGE_MASK; + unsigned long pfn; + unsigned long paddr; + + down_read(&current->mm->mmap_sem); + vma = find_vma_intersection(current->mm, vaddr, vaddr + PAGE_SIZE); + if (!vma || !(vma->vm_flags & VM_PFNMAP)) { + up_read(&current->mm->mmap_sem); + return -EFAULT; + } + pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; + paddr = pfn << PAGE_SHIFT; + table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB); + if (!table) { + up_read(&current->mm->mmap_sem); + return -EFAULT; + } + ret = CMPXCHG(&table[index], orig_pte, new_pte); + memunmap(table); + up_read(&current->mm->mmap_sem); + } return (ret != orig_pte); } -- 2.7.4
[PATCH] fs: ufs: Remove switch statement from ufs_set_de_type function
Remove switch statement from ufs_set_de_type function in fs/ufs/util.h header and replace with simple assignment. For each case, S_IFx >> 12 is equal to DT_x, so in valid cases (mode & S_IFMT) >> 12 should give us the correct file type. It is expected that for *nix compatibility reasons, the relation between S_IFx and DT_x will not change. For cases where the mode is invalid, upper layer validation catches this anyway, so this improves readability and arguably performance by assigning (mode & S_IFMT) >> 12 directly without any jump table or conditional logic. Signed-off-by: Phillip Potter --- fs/ufs/util.h | 30 ++ 1 file changed, 2 insertions(+), 28 deletions(-) diff --git a/fs/ufs/util.h b/fs/ufs/util.h index 1fd3011ea623..7e0c0878b9f9 100644 --- a/fs/ufs/util.h +++ b/fs/ufs/util.h @@ -16,6 +16,7 @@ * some useful macros */ #define in_range(b,first,len) ((b)>=(first)&&(b)<(first)+(len)) +#define S_SHIFT 12 /* * functions used for retyping @@ -158,34 +159,7 @@ ufs_set_de_type(struct super_block *sb, struct ufs_dir_entry *de, int mode) if ((UFS_SB(sb)->s_flags & UFS_DE_MASK) != UFS_DE_44BSD) return; - /* -* TODO turn this into a table lookup -*/ - switch (mode & S_IFMT) { - case S_IFSOCK: - de->d_u.d_44.d_type = DT_SOCK; - break; - case S_IFLNK: - de->d_u.d_44.d_type = DT_LNK; - break; - case S_IFREG: - de->d_u.d_44.d_type = DT_REG; - break; - case S_IFBLK: - de->d_u.d_44.d_type = DT_BLK; - break; - case S_IFDIR: - de->d_u.d_44.d_type = DT_DIR; - break; - case S_IFCHR: - de->d_u.d_44.d_type = DT_CHR; - break; - case S_IFIFO: - de->d_u.d_44.d_type = DT_FIFO; - break; - default: - de->d_u.d_44.d_type = DT_UNKNOWN; - } + de->d_u.d_44.d_type = (mode & S_IFMT) >> S_SHIFT; } static inline u32 -- 2.17.2
[PATCH] iw_cxgb4: fix a missing-check bug
In c4iw_flush_hw_cq(), the next CQE is acquired through t4_next_hw_cqe(). In t4_next_hw_cqe(), the CQE, i.e., 'cq->queue[cq->cidx]', is checked for validity through t4_valid_cqe(). If it is valid, the address of the CQE is saved to 'hw_cqe'. Later on, the CQE is copied to local memory in create_read_req_cqe(). The problem here is that the CQE actually lives in a DMA region allocated by dma_alloc_coherent() in create_cq(). Given that the device also has permission to access the DMA region, a malicious device controlled by an attacker can modify the CQE in the DMA region after the check in t4_next_hw_cqe() but before the copy in create_read_req_cqe(). By doing so, the attacker can supply an invalid CQE, which can cause undefined behavior in the kernel and introduce potential security risks. This patch avoids the above issue by saving the CQE to a local variable once it is verified to be valid in t4_next_hw_cqe(). That local variable is then used for the copy in create_read_req_cqe().
Signed-off-by: Wenwen Wang --- drivers/infiniband/hw/cxgb4/cq.c | 8 +--- drivers/infiniband/hw/cxgb4/t4.h | 4 ++-- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c index 6d30427..09918ca 100644 --- a/drivers/infiniband/hw/cxgb4/cq.c +++ b/drivers/infiniband/hw/cxgb4/cq.c @@ -335,13 +335,15 @@ static void advance_oldest_read(struct t4_wq *wq) */ void c4iw_flush_hw_cq(struct c4iw_cq *chp, struct c4iw_qp *flush_qhp) { - struct t4_cqe *hw_cqe, *swcqe, read_cqe; + struct t4_cqe hw_cqe_obj; + struct t4_cqe *hw_cqe = &hw_cqe_obj; + struct t4_cqe *swcqe, read_cqe; struct c4iw_qp *qhp; struct t4_swsqe *swsqe; int ret; pr_debug("cqid 0x%x\n", chp->cq.cqid); - ret = t4_next_hw_cqe(&chp->cq, &hw_cqe); + ret = t4_next_hw_cqe(&chp->cq, hw_cqe); /* * This logic is similar to poll_cq(), but not quite the same @@ -414,7 +416,7 @@ void c4iw_flush_hw_cq(struct c4iw_cq *chp, struct c4iw_qp *flush_qhp) } next_cqe: t4_hwcq_consume(&chp->cq); - ret = t4_next_hw_cqe(&chp->cq, &hw_cqe); + ret = t4_next_hw_cqe(&chp->cq, hw_cqe); if (qhp && flush_qhp != qhp) spin_unlock(&qhp->lock); } diff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h index e42021f..9a9eea5 100644 --- a/drivers/infiniband/hw/cxgb4/t4.h +++ b/drivers/infiniband/hw/cxgb4/t4.h @@ -791,7 +791,7 @@ static inline int t4_cq_notempty(struct t4_cq *cq) return cq->sw_in_use || t4_valid_cqe(cq, &cq->queue[cq->cidx]); } -static inline int t4_next_hw_cqe(struct t4_cq *cq, struct t4_cqe **cqe) +static inline int t4_next_hw_cqe(struct t4_cq *cq, struct t4_cqe *cqe) { int ret; u16 prev_cidx; @@ -809,7 +809,7 @@ static inline int t4_next_hw_cqe(struct t4_cq *cq, struct t4_cqe *cqe) /* Ensure CQE is flushed to memory */ rmb(); - *cqe = &cq->queue[cq->cidx]; + *cqe = cq->queue[cq->cidx]; ret = 0; } else ret = -ENODATA; -- 2.7.4
[PATCH] KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned
The spec only requires the posted interrupt descriptor address to be 64-byte aligned (i.e. bits[5:0] == 0). Using page_address_valid also forces the address to be page aligned. Only validate that the address does not cross the maximum physical address, without enforcing page alignment. Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: H. Peter Anvin Cc: x...@kernel.org Cc: k...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Fixes: 6de84e581c0 ("nVMX x86: check posted-interrupt descriptor addresss on vmentry of L2") Signed-off-by: KarimAllah Ahmed --- arch/x86/kvm/vmx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 30bf860..47962f2 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -11668,7 +11668,7 @@ static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu, !nested_exit_intr_ack_set(vcpu) || (vmcs12->posted_intr_nv & 0xff00) || (vmcs12->posted_intr_desc_addr & 0x3f) || - (!page_address_valid(vcpu, vmcs12->posted_intr_desc_addr)) + (vmcs12->posted_intr_desc_addr >> cpuid_maxphyaddr(vcpu))) return -EINVAL; /* tpr shadow is needed by all apicv features. */ -- 2.7.4
Re: [PATCH] mfd: tps6586x: Handle interrupts on suspend
On 10/19/18 4:22 PM, Jon Hunter wrote: > From: Jonathan Hunter > > The tps6586x driver creates an irqchip that is used by its various child > devices for managing interrupts. The tps6586x-rtc device is one of its > children that uses the tps6586x irqchip. When using the tps6586x-rtc as > a wake-up device from suspend, the following is seen: > > PM: Syncing filesystems ... done. > Freezing user space processes ... (elapsed 0.001 seconds) done. > OOM killer disabled. > Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done. > Disabling non-boot CPUs ... > Entering suspend state LP1 > Enabling non-boot CPUs ... > CPU1 is up > tps6586x 3-0034: failed to read interrupt status > tps6586x 3-0034: failed to read interrupt status > > The reason why the tps6586x interrupt status cannot be read is because > the tps6586x interrupt is not masked during suspend and when the > tps6586x-rtc interrupt occurs, to wake up the device, the interrupt is > seen before the i2c controller has been resumed in order to read the > tps6586x interrupt status. > > The tps6586x-rtc driver sets its interrupt as a wake-up source during > suspend, which gets propagated to the parent tps6586x interrupt. > However, the tps6586x-rtc driver cannot disable its interrupt during > suspend otherwise we would never be woken up and so the tps6586x must > disable its interrupt instead. > > Prevent the tps6586x interrupt handler from executing on exiting suspend > before the i2c controller has been resumed by disabling the tps6586x > interrupt on entering suspend and re-enabling it on resuming from > suspend. > > Cc: sta...@vger.kernel.org > > Signed-off-by: Jon Hunter > --- Tested patch on linux-next and v4.14, works fine. Thank you! Reviewed-by: Dmitry Osipenko Tested-by: Dmitry Osipenko
Re: Interrupts, smp_load_acquire(), smp_store_release(), etc.
On Sat, Oct 20, 2018 at 10:22:29PM +0200, Andrea Parri wrote: > [...] > > > The second (informal) litmus test has a more interesting Linux-kernel > > counterpart: > > > > void t1_interrupt(void) > > { > > r0 = READ_ONCE(y); > > smp_store_release(&x, 1); > > } > > > > void t1(void) > > { > > smp_store_release(&y, 1); > > } > > > > void t2(void) > > { > > r1 = smp_load_acquire(&x); > > r2 = smp_load_acquire(&y); > > } > > > > On store-reordering architectures that implement smp_store_release() > > as a memory-barrier instruction followed by a store, the interrupt could > > arrive betweentimes in t1(), so that there would be no ordering between > > t1_interrupt()'s store to x and t1()'s store to y. This could (again, > > in paranoid theory) result in the outcome r0==0 && r1==0 && r2==1. > > FWIW, I'd rather call "paranoid" the act of excluding such outcome ;-) > but I admit that I've only run this test in *my mind*: in an SC world, > > CPU1 CPU2 > > t1() > t1_interrupt() > r0 = READ_ONCE(y); // =0 > t2() > r1 = smp_load_acquire(&x); // =0 > smp_store_release(&x, 1); > smp_store_release(&y, 1); > r2 = smp_load_acquire(&y); // =1 OK, so did I get the outcome messed up again? :-/ Thanx, Paul > > So how paranoid should we be with respect to interrupt handlers for > > smp_store_release(), smp_load_acquire(), and the various RMW atomic > > operations that are sometimes implemented with separate memory-barrier > > instructions? ;-) > > Good question! ;-) > > Andrea > > > > > > Thanx, Paul > > >
Re: Interrupts, smp_load_acquire(), smp_store_release(), etc.
On Sat, Oct 20, 2018 at 04:18:37PM -0400, Alan Stern wrote: > On Sat, 20 Oct 2018, Paul E. McKenney wrote: > > > The second (informal) litmus test has a more interesting Linux-kernel > > counterpart: > > > > void t1_interrupt(void) > > { > > r0 = READ_ONCE(y); > > smp_store_release(&x, 1); > > } > > > > void t1(void) > > { > > smp_store_release(&y, 1); > > } > > > > void t2(void) > > { > > r1 = smp_load_acquire(&x); > > r2 = smp_load_acquire(&y); > > } > > > > On store-reordering architectures that implement smp_store_release() > > as a memory-barrier instruction followed by a store, the interrupt could > > arrive betweentimes in t1(), so that there would be no ordering between > > t1_interrupt()'s store to x and t1()'s store to y. This could (again, > > in paranoid theory) result in the outcome r0==0 && r1==0 && r2==1. > > This is disconcerting only if we assume that t1_interrupt() has to be > executed by the same CPU as t1(). If the interrupt could be fielded by > a different CPU then the paranoid outcome is perfectly understandable, > even in an SC context. > > So the question really should be limited to situations where a handler > is forced to execute in the context of a particular thread. While > POSIX does allow such restrictions for user programs, I'm not aware of > any similar mechanism in the kernel. Good point, and I was in fact assuming that t1() and t1_interrupt() were executing on the same CPU. This sort of thing happens naturally in the kernel when both t1() and t1_interrupt() are accessing per-CPU variables. Thanx, Paul
Re: Interrupts, smp_load_acquire(), smp_store_release(), etc.
[...] > The second (informal) litmus test has a more interesting Linux-kernel > counterpart: > > void t1_interrupt(void) > { > r0 = READ_ONCE(y); > smp_store_release(&x, 1); > } > > void t1(void) > { > smp_store_release(&y, 1); > } > > void t2(void) > { > r1 = smp_load_acquire(&x); > r2 = smp_load_acquire(&y); > } > > On store-reordering architectures that implement smp_store_release() > as a memory-barrier instruction followed by a store, the interrupt could > arrive betweentimes in t1(), so that there would be no ordering between > t1_interrupt()'s store to x and t1()'s store to y. This could (again, > in paranoid theory) result in the outcome r0==0 && r1==0 && r2==1. FWIW, I'd rather call "paranoid" the act of excluding such outcome ;-) but I admit that I've only run this test in *my mind*: in an SC world, CPU1 CPU2 t1() t1_interrupt() r0 = READ_ONCE(y); // =0 t2() r1 = smp_load_acquire(&x); // =0 smp_store_release(&x, 1); smp_store_release(&y, 1); r2 = smp_load_acquire(&y); // =1 > So how paranoid should we be with respect to interrupt handlers for > smp_store_release(), smp_load_acquire(), and the various RMW atomic > operations that are sometimes implemented with separate memory-barrier > instructions? ;-) Good question! ;-) Andrea > > Thanx, Paul >
Re: Interrupts, smp_load_acquire(), smp_store_release(), etc.
On Sat, 20 Oct 2018, Paul E. McKenney wrote:

> The second (informal) litmus test has a more interesting Linux-kernel
> counterpart:
>
> 	void t1_interrupt(void)
> 	{
> 		r0 = READ_ONCE(y);
> 		smp_store_release(&x, 1);
> 	}
>
> 	void t1(void)
> 	{
> 		smp_store_release(&y, 1);
> 	}
>
> 	void t2(void)
> 	{
> 		r1 = smp_load_acquire();
> 		r2 = smp_load_acquire();
> 	}
>
> On store-reordering architectures that implement smp_store_release()
> as a memory-barrier instruction followed by a store, the interrupt could
> arrive betweentimes in t1(), so that there would be no ordering between
> t1_interrupt()'s store to x and t1()'s store to y.  This could (again,
> in paranoid theory) result in the outcome r0==0 && r1==0 && r2==1.

This is disconcerting only if we assume that t1_interrupt() has to be
executed by the same CPU as t1().  If the interrupt could be fielded by
a different CPU then the paranoid outcome is perfectly understandable,
even in an SC context.

So the question really should be limited to situations where a handler
is forced to execute in the context of a particular thread.  While POSIX
does allow such restrictions for user programs, I'm not aware of any
similar mechanism in the kernel.

Alan
[PATCH] thunderbolt: fix a missing-check bug
In tb_ring_poll(), the flag of the frame, i.e.,
'ring->descriptors[ring->tail].flags', is checked to see whether the
frame is completed. If so, the frame, including the flag, is read from
the ring and returned to the caller.

The problem here is that the flag actually lives in a DMA region, which
is allocated through dma_alloc_coherent() in tb_ring_alloc(). Given that
the device can also access this DMA region, a malicious device
controlled by an attacker could modify the flag between the check and
the copy. By doing so, the attacker can bypass the check and supply an
uncompleted frame, which can cause undefined kernel behavior and
introduce a potential security risk.

This patch first copies the flag into a local variable 'desc_flags' and
then performs both the check and the copy using 'desc_flags'. In this
way, the above issue is avoided.

Signed-off-by: Wenwen Wang
---
 drivers/thunderbolt/nhi.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
index 5cd6bdf..481b1f2 100644
--- a/drivers/thunderbolt/nhi.c
+++ b/drivers/thunderbolt/nhi.c
@@ -289,6 +289,7 @@ struct ring_frame *tb_ring_poll(struct tb_ring *ring)
 {
 	struct ring_frame *frame = NULL;
 	unsigned long flags;
+	enum ring_desc_flags desc_flags;
 
 	spin_lock_irqsave(&ring->lock, flags);
 	if (!ring->running)
@@ -296,7 +297,8 @@ struct ring_frame *tb_ring_poll(struct tb_ring *ring)
 	if (ring_empty(ring))
 		goto unlock;
 
-	if (ring->descriptors[ring->tail].flags & RING_DESC_COMPLETED) {
+	desc_flags = ring->descriptors[ring->tail].flags;
+	if (desc_flags & RING_DESC_COMPLETED) {
 		frame = list_first_entry(&ring->in_flight, typeof(*frame),
 					 list);
 		list_del_init(&frame->list);
@@ -305,7 +307,7 @@ struct ring_frame *tb_ring_poll(struct tb_ring *ring)
 		frame->size = ring->descriptors[ring->tail].length;
 		frame->eof = ring->descriptors[ring->tail].eof;
 		frame->sof = ring->descriptors[ring->tail].sof;
-		frame->flags = ring->descriptors[ring->tail].flags;
+		frame->flags = desc_flags;
 	}
 	ring->tail = (ring->tail + 1) % ring->size;
-- 
2.7.4
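The time-of-check-to-time-of-use window the patch closes can be modelled outside the kernel. The Python sketch below is purely illustrative (the class and flag value are hypothetical stand-ins, not the real driver structures): a "device" that rewrites the flags between the two reads defeats the original check-then-copy code, while the patched single-read-into-a-local pattern is immune.

```python
RING_DESC_COMPLETED = 0x1  # illustrative value

class Descriptor:
    """Stand-in for a DMA descriptor whose flags a hostile device can rewrite."""
    def __init__(self):
        self._reads = 0

    @property
    def flags(self):
        # Malicious device: present COMPLETED for the first read (the check),
        # then clear it before the second read (the copy).
        self._reads += 1
        return RING_DESC_COMPLETED if self._reads == 1 else 0

def poll_racy(desc):
    """Mirrors the old code: check and copy are two separate reads."""
    if desc.flags & RING_DESC_COMPLETED:
        return desc.flags  # second read -- the device already changed it
    return None

def poll_fixed(desc):
    """Mirrors the patched code: one read into a local, used for both."""
    desc_flags = desc.flags
    if desc_flags & RING_DESC_COMPLETED:
        return desc_flags
    return None

print(poll_racy(Descriptor()))   # 0: check passed, but stale flags were copied
print(poll_fixed(Descriptor()))  # 1: check and copy agree
```

The same pattern (snapshot a device-writable field once, then validate and use only the snapshot) applies to any field in a coherent DMA buffer that the hardware may update concurrently.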
[GIT PULL] code of conduct fixes for 4.19-rc8
This is the series of patches which has been discussed on both
ksummit-discuss and linux-kernel for the past few weeks. As Shuah said
when kicking off the process, it's designed as a starting point for the
next phase of the discussion, not as the end point, so it's only really
a set of minor updates to further that goal.

The merger of the three patches to show the combined effect is attached
below. However, Greg recently posted the next phase of the discussion,
so people will be asking what the merger of the series looks like.
Ignoring the non-CoC documents, I think it looks like this

---
diff --git a/Documentation/process/code-of-conduct.rst b/Documentation/process/code-of-conduct.rst
index eec768471a4d..8913851dab89 100644
--- a/Documentation/process/code-of-conduct.rst
+++ b/Documentation/process/code-of-conduct.rst
@@ -59,19 +59,27 @@ address, posting via an official social media account, or acting as an
 appointed representative at an online or offline event. Representation of a
 project may be further defined and clarified by project maintainers.
 
-Reporting
-=========
+Enforcement
+===========
 
 Instances of abusive, harassing, or otherwise unacceptable behavior may be
-reported by contacting the Technical Advisory Board (TAB) at
-. All complaints will be reviewed and
-investigated and will result in a response that is deemed necessary and
-appropriate to the circumstances. The TAB is obligated to maintain
-confidentiality with regard to the reporter of an incident (except where
-required by law).
+reported by contacting the Code of Conduct Committee at
+. All complaints will be reviewed and investigated
+and will result in a response that is deemed necessary and appropriate
+to the circumstances. The Code of Conduct Committee is obligated to
+maintain confidentiality with regard to the reporter of an incident
+(except where required by law). Further details of specific enforcement
+policies may be posted separately.
 
 Attribution
 ===========
 
 This Code of Conduct is adapted from the Contributor Covenant, version 1.4,
 available at
 https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+Interpretation
+==============
+
+See the :ref:`code_of_conduct_interpretation` document for how the Linux
+kernel community will be interpreting this document.
---

And I'm sure it can be rebased to this without disturbing the currently
gathered tags.

The patch can be pulled from

  git://git.kernel.org/pub/scm/linux/kernel/git/jejb/linux-coc.git coc-fixes

With the diffstat:

 Documentation/process/code-of-conduct.rst | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

With full diff below.

James

---
diff --git a/Documentation/process/code-of-conduct.rst b/Documentation/process/code-of-conduct.rst
index ab7c24b5478c..eec768471a4d 100644
--- a/Documentation/process/code-of-conduct.rst
+++ b/Documentation/process/code-of-conduct.rst
@@ -31,7 +31,7 @@ Examples of unacceptable behavior by participants include:
 * Trolling, insulting/derogatory comments, and personal or political attacks
 * Public or private harassment
 * Publishing others’ private information, such as a physical or electronic
-  address, without explicit permission
+  address not ordinarily collected by the project, without explicit permission
 * Other conduct which could reasonably be considered inappropriate in a
   professional setting
@@ -59,20 +59,16 @@ address, posting via an official social media account, or acting as an
 appointed representative at an online or offline event. Representation of a
 project may be further defined and clarified by project maintainers.
 
-Enforcement
-===========
+Reporting
+=========
 
 Instances of abusive, harassing, or otherwise unacceptable behavior may be
 reported by contacting the Technical Advisory Board (TAB) at
 . All complaints will be reviewed and investigated and will result in a
 response that is deemed necessary and appropriate to the circumstances.
 The TAB is obligated to maintain
-confidentiality with regard to the reporter of an incident. Further details of
-specific enforcement policies may be posted separately.
-
-Maintainers who do not follow or enforce the Code of Conduct in good faith may
-face temporary or permanent repercussions as determined by other members of the
-project’s leadership.
+confidentiality with regard to the reporter of an incident (except where
+required by law).
 
 Attribution
 ===========
Re: [Ksummit-discuss] [PATCH 6/7] Code of Conduct: Change the contact email address
On Sat, 2018-10-20 at 19:28 +0100, Alan Cox wrote:
> > +to the circumstances. The Code of Conduct Committee is obligated to
> > +maintain confidentiality with regard to the reporter of an incident.
> > +Further details of specific enforcement policies may be posted
> > +separately.
>
> Unfortunately by ignoring the other suggestions on this you've left
> this bit broken.
>
> The committee can't keep most stuff confidential so it's misleading
> and wrong to imply they can. Data protection law, reporting laws in
> some countries and the like mean that anyone expecting an incident to
> remain confidential from the person it was reported against is living
> in dreamland and is going to get a nasty shock.
>
> At the very least it should say '(except where required by law)'.

I've got a solution for this: the patches I've been curating also
modify the section, so the merger will look like what I have below.
The intent of the series I'm curating was only to show, in 4.19, the
beginning of a desire to change and to correct the obvious defect
before we started the debate; after suitable discussion, this one can
be the final set.

> There is a separate issue that serious things should always go to law
> enforcement - you are setting up a policy akin to the one that got
> the catholic church and many others in trouble.
>
> You should also reserve the right to report serious incidents
> directly to law enforcement. Unless of course you want to be forced
> to sit on multiple reports of physical abuse from different people
> about someone - unable to tell them about each other's reports, unable
> to prove anything, and in twenty years time having to explain to the
> media why nothing was done.

I think we should debate that. Most legal systems provide significant
deference to victims wishing for confidentiality, and we should both
respect that and remember that an automatic crime report is a
significant deterrent to vulnerable people in a lot of places.

James

---
diff --git a/Documentation/process/code-of-conduct.rst b/Documentation/process/code-of-conduct.rst
index eec768471a4d..8913851dab89 100644
--- a/Documentation/process/code-of-conduct.rst
+++ b/Documentation/process/code-of-conduct.rst
@@ -59,19 +59,27 @@ address, posting via an official social media account, or acting as an
 appointed representative at an online or offline event. Representation of a
 project may be further defined and clarified by project maintainers.
 
-Reporting
-=========
+Enforcement
+===========
 
 Instances of abusive, harassing, or otherwise unacceptable behavior may be
-reported by contacting the Technical Advisory Board (TAB) at
-. All complaints will be reviewed and
-investigated and will result in a response that is deemed necessary and
-appropriate to the circumstances. The TAB is obligated to maintain
-confidentiality with regard to the reporter of an incident (except where
-required by law).
+reported by contacting the Code of Conduct Committee at
+. All complaints will be reviewed and investigated
+and will result in a response that is deemed necessary and appropriate
+to the circumstances. The Code of Conduct Committee is obligated to
+maintain confidentiality with regard to the reporter of an incident
+(except where required by law). Further details of specific enforcement
+policies may be posted separately.
+
 Attribution
 ===========
 
 This Code of Conduct is adapted from the Contributor Covenant, version 1.4,
 available at
 https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+Interpretation
+==============
+
+See the :ref:`code_of_conduct_interpretation` document for how the Linux
+kernel community will be interpreting this document.
Re: [Ksummit-discuss] [PATCH 6/7] Code of Conduct: Change the contact email address
On Sat, 2018-10-20 at 19:24, tim.b...@sony.com wrote:
> The scope of the code of conduct basically means that it covers
> online interactions (communication via mailing list, git commits
> and Bugzilla). Not to be flippant, but those are hardly mediums
> that are susceptible to executing physical abuse. Also, they are
> all mediums that leave a persistent, public trail. So I don't think
> the comparison is very apt here.
>  -- Tim

If that is the case, then why does this need to go into the Linux
kernel in the first place? The mailing lists, the kernel.org git
repository, and bugzilla presumably all have "terms of use" pages that
could specify the expected behaviour very explicitly, and could specify
how arbitration works as part of those terms of use (and if enforcement
is required, then they could specify legal venues etc).

IOW: if the scope is just communication online, then I would think
there are better tools for that. Putting a code of conduct into the
kernel code itself wants to be justified by more than just regulating
online behaviour.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com