Re: [PATCH] fs: ufs: Remove switch statement from ufs_set_de_type function
On Sun, Oct 21, 2018 at 1:27 AM Matthew Wilcox wrote:
>
> On Sat, Oct 20, 2018 at 11:09:57PM +0100, Phillip Potter wrote:
> > Remove switch statement from ufs_set_de_type function in fs/ufs/util.h
> > header and replace with simple assignment. For each case, S_IFx >> 12
> > is equal to DT_x, so in valid cases (mode & S_IFMT) >> 12 should give
> > us the correct file type. It is expected that for *nix compatibility
> > reasons, the relation between S_IFx and DT_x will not change. For
> > cases where the mode is invalid, upper layer validation catches this
> > anyway, so this improves readability and arguably performance by
> > assigning (mode & S_IFMT) >> 12 directly without any jump table
> > or conditional logic.
>
> Shouldn't we also do this for other filesystems? A quick scan suggests
> someone should at least look at xfs, ubifs, squashfs, romfs, ocfs2,
> nilfs2, hfsplus, f2fs, ext4, ext2, exofs, coda, cifs & btrfs.

I tried to do that 2 years ago, not even for readability, but to fix a
lurking bug in the conversion table implementation that was copied from
ext2 to many other filesystems:

https://marc.info/?l=linux-fsdevel&m=148217829301701&w=2
https://marc.info/?l=linux-fsdevel&w=2&r=1&s=amir73il+file+type+conversion&q=b

The push back from the ext4/xfs maintainers was that while all
filesystems use common values to encode d_type in their on-disk format,
those on-disk format values are private to each filesystem.

Eventually, the fix that got merged (only to xfs) did the opposite: it
converted the conversion table lookup to a switch statement:

https://marc.info/?l=linux-fsdevel&m=14824386620&w=2

Some fs maintainers were happy about the initial version, but I did not
pursue pushing that cleanup:

https://marc.info/?l=linux-fsdevel&m=14824386620&w=2

Phillip, you are welcome to re-spin those patches if you like.
Note that the generic conversion helpers do not rely on the upper layer
to catch invalid mode values.

Cheers,
Amir.
Re: [PATCH 0/6] Tracing register accesses with pstore and dynamic debug
On Sun, Oct 21, 2018 at 09:16:59AM +0530, Sai Prakash Ranjan wrote:
> On 10/20/2018 9:57 PM, Joel Fernandes wrote:
> > On Sat, Oct 20, 2018 at 12:02:37PM +0530, Sai Prakash Ranjan wrote:
> > > On 10/20/2018 10:55 AM, Joel Fernandes wrote:
> > > > On Sun, Sep 09, 2018 at 01:57:01AM +0530, Sai Prakash Ranjan wrote:
> > > > > Hi,
> > > > >
> > > > > This patch series adds Event tracing support to pstore and is a
> > > > > continuation of the RFC patch introduced to add a new tracing
> > > > > facility for register accesses called Register Trace Buffer (RTB).
> > > > > Since we decided not to introduce a separate framework to trace
> > > > > register accesses and to use an existing framework like
> > > > > tracepoints, I have moved on from the RFC. Details of the RFC in
> > > > > the link below:
> > > > >
> > > > > Link: https://lore.kernel.org/lkml/cover.1535119710.git.saiprakash.ran...@codeaurora.org/
> > > > >
> > > > > The MSR tracing example given by Steven was helpful in using
> > > > > tracepoints for register accesses instead of a separate trace.
> > > > > But just having these IO traces would not help much unless we
> > > > > could have them in some persistent ram buffer for debugging
> > > > > unclocked access, some kind of bus hang, or an unexpected reset
> > > > > caused by some buggy driver, which happens a lot during initial
> > > > > development stages. By analyzing the last few entries of this
> > > > > buffer, we could identify the register access which is causing
> > > > > the issue.
> > > >
> > > > Hi Sai,
> > > >
> > > > I wanted to see if I could make some time to get your patches
> > > > working. We are hitting usecases that need something like this as
> > > > well. Basically devices hanging and then the ramdump does not tell
> > > > us much, so in this case pstore events can be really helpful. This
> > > > usecase came up last year as well.
> > > >
> > > > Anyway, while I was going through your patches, I cleaned up some
> > > > pstore code as well and I have 3 more patches on top of yours for
> > > > this clean up. I prefer we submit the patches together and sync
> > > > our work so that there is least conflict.
> > > >
> > > > Here's my latest tree:
> > > > https://github.com/joelagnel/linux-kernel/commits/pstore-events
> > > > (note that I have only build tested the patches since I just wrote
> > > > them and it's quite late in the night here ;-))
> > >
> > > Hi Joel,
> > >
> > > Thanks for looking into this. Sure, I will be happy to sync up with you on

Thanks. And added a fourth patch in the tree too.

While at it, I was thinking about the problem we are trying to solve in
another way. If ftrace itself could use pages from the persistent ram
store, instead of the kernel's buddy allocator, then the ftrace ring
buffer itself could persist across a system reboot.

The clear advantage of that for Sai's pstore events work is not having
to duplicate a lot of the ring buffer code into pstore (for the pstore
events, for example, I wanted time stamps as well, and ftrace's ring
buffer has some nice time management code to deal with time deltas). We
already maintain ring buffers in other parts of the kernel for tracing,
right? So I'm a bit averse to duplicating that into pstore as well.

The other advantage of persisting the ftrace ring buffer is that data
from other tracers such as the function-graph tracer and the irqsoff
tracer would also persist, and then we could probably also get rid of
the ftrace-in-pstore stuff, which is kind of incomplete anyway since it
does not have time stamps for recorded functions.

Steven and Kees: what do you think, is persisting the ftrace ring
buffer across reboots a worthwhile idea? Any thoughts on how feasible
something like that could be, code wise?

Off the top, I think the ring buffer state that ftrace needs, other
than the trace data itself, will also have to be persisted.

thanks,

- Joel
Re: [PATCH 0/6] Tracing register accesses with pstore and dynamic debug
On 10/21/2018 9:16 AM, Sai Prakash Ranjan wrote:

On 10/20/2018 9:57 PM, Joel Fernandes wrote:

On Sat, Oct 20, 2018 at 12:02:37PM +0530, Sai Prakash Ranjan wrote:

On 10/20/2018 10:55 AM, Joel Fernandes wrote:

On Sun, Sep 09, 2018 at 01:57:01AM +0530, Sai Prakash Ranjan wrote:

Hi,

This patch series adds Event tracing support to pstore and is a
continuation of the RFC patch introduced to add a new tracing facility
for register accesses called Register Trace Buffer (RTB). Since we
decided not to introduce a separate framework to trace register
accesses and to use an existing framework like tracepoints, I have
moved on from the RFC. Details of the RFC in the link below:

Link: https://lore.kernel.org/lkml/cover.1535119710.git.saiprakash.ran...@codeaurora.org/

The MSR tracing example given by Steven was helpful in using
tracepoints for register accesses instead of a separate trace. But just
having these IO traces would not help much unless we could have them in
some persistent ram buffer for debugging unclocked access, some kind of
bus hang, or an unexpected reset caused by some buggy driver, which
happens a lot during initial development stages. By analyzing the last
few entries of this buffer, we could identify the register access which
is causing the issue.

Hi Sai,

I wanted to see if I could make some time to get your patches working.
We are hitting usecases that need something like this as well.
Basically devices hanging and then the ramdump does not tell us much,
so in this case pstore events can be really helpful. This usecase came
up last year as well.

Anyway, while I was going through your patches, I cleaned up some
pstore code as well and I have 3 more patches on top of yours for this
clean up. I prefer we submit the patches together and sync our work so
that there is least conflict.

Here's my latest tree:
https://github.com/joelagnel/linux-kernel/commits/pstore-events
(note that I have only build tested the patches since I just wrote them
and it's quite late in the night here ;-))

Hi Joel,

Thanks for looking into this. Sure, I will be happy to sync up with you
on this. I can test your additional patches on top of my pstore
patches. BTW, I'm still stuck at copying the binary record into pstore
and then extracting it at read time. Seems like I'm missing something.

Thanks. And added a fourth patch in the tree too.

Sure, push your latest somewhere and let me know. I'll try to get you
unstuck.

Thanks Joel, I will push my changes and let you know in some time.

Hi Joel,

Here's my tree:
https://github.com/saiprakash-ranjan/linux/commits/pstore-events

I have one patch extra on top of your patches. Nothing much on binary
record storage in this patch, only removed kmalloc in pstore event call
to avoid loop.

Thanks,
Sai

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH 3/5] Drivers: hv: kvp: Fix the recent regression caused by incorrect clean-up
Hi Dan,

On Sat, Oct 20, 2018 at 9:22 PM Dan Carpenter wrote:
>
> On Sat, Oct 20, 2018 at 04:42:07PM +0200, Miguel Ojeda wrote:
> > Using an attribute is indeed better whenever possible. In C++17 it is
> > a standard attribute, and there have been proposals to include some of
> > them for C as well since 2016 at least, e.g. the latest for
> > fallthrough at:
> >
> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2268.pdf
> >
> > I have taken a quick look into supporting it (typing it here to save
> > it on the mailing list :-), and we have:
> >
> > * gcc >= 7.1 supports -Wimplicit-fallthrough with
> >   __attribute__((fallthrough)), the comment syntax and the C++
> >   [[fallthrough]] syntax.
> > * gcc < 7.1 complains about empty declarations (it does not know
> >   about attributes for null statements) and also
> >   -Wdeclaration-after-statement.
>
> I'm not sure I understand about empty declarations. The idea is that
> we would do:

That paragraph tried to explain that gcc < 7.1 did not know about
__attribute__((fallthrough)); i.e. that everything was introduced in
gcc 7.1. Anyway, the conclusion was a neuron misfiring of mine -- in my
mind I was thinking clang supported the comment syntax (when I clearly
wrote that it didn't). Never mind -- thanks for pointing it out!

>	case 3:
>		frob();
>		__fall_through();
>	case 4:
>		frob();
>
> #if GCC > 7.1
> #define __fall_through() __attribute__((fallthrough))
> #else
> #define __fall_through()
> #endif

Yes, of course, that is what we do for other attributes -- actually in
-next we have pending a better way of checking, using __has_attribute:

#if __has_attribute(fallthrough)
#define __fallthrough __attribute__((fallthrough))
#else
#define __fallthrough
#endif

> So long as GCC and static analysis tools understand the attribute,
> that's enough to catch the bugs. No one cares what clang and icc do.

Not so sure about that -- there are quite some people looking forward
to building Linux with clang, even if only to have another compiler to
check for warnings and to use the clang/llvm-related tools (and to
write new ones).

> We would just disable the fall through warnings on those until they are
> updated to support the fallthrough attribute.

You mean if they start supporting the warning but not the attribute? I
don't think that would be likely (i.e. if clang enables the warning in
C mode, they will have to introduce a way to suppress it, which should
be the attribute (at least), since they typically like to be compatible
with gcc and since they already support it in C++ >= 11), but
everything is possible.

> We wouldn't update all the fall through comments until later, but going
> forward people could use the __fall_through() macro if they want.

Agreed. I will introduce it in the compiler-attributes tree -- should
be there for -rc1 if no one complains. CC'ing all of you in the patch.

Cheers,
Miguel
Re: [RFC 00/20] ns: Introduce Time Namespace
On Sat, Oct 20, 2018 at 06:41:23PM -0700, Andrei Vagin wrote:
> On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> > Thomas Gleixner writes:
> >
> > > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> > >> Reading the code, the calling sequence there is:
> > >>
> > >>   tick_sched_do_timer
> > >>     tick_do_update_jiffies64
> > >>       update_wall_time
> > >>         timekeeping_advance
> > >>           timekeeping_update
> > >>
> > >> If I read that properly, under the right nohz circumstances that update
> > >> can be delayed indefinitely.
> > >>
> > >> So I think we could prototype a time namespace that was per
> > >> timekeeping_update and just had update_wall_time iterate through
> > >> all of the time namespaces.
> > >
> > > Please don't go there. timekeeping_update() is already heavy and walking
> > > through a gazillion of namespaces will just make it horrible.
> > >
> > >> I don't think the naive version would scale to very many time
> > >> namespaces.
> > >
> > > :)
> > >
> > >> At the same time, using the techniques from the nohz work and a little
> > >> smarts, I expect we could get the code to scale.
> > >
> > > You'd need to invoke the update when the namespace is switched in and
> > > hasn't been updated since the last tick happened. That might be doable,
> > > but you also need to take the wraparound constraints of the underlying
> > > clocksources into account, which again can cause walking all namespaces
> > > when they are all idle long enough.
> >
> > The wraparound constraints being how long before the time sources wrap
> > around, so you have to read them once per wraparound? I have not dug
> > deeply enough into the code to see that yet.
> >
> > > From there it becomes hairy, because it's not only timekeeping,
> > > i.e. reading time; this also affects all timers which are armed from
> > > a namespace.
> > >
> > > That gets really ugly, because when you do settimeofday() or adjtimex()
> > > for a particular namespace, then you have to search for all armed timers
> > > of that namespace and adjust them.
> > >
> > > The original posix timer code had the same issue because it mapped the
> > > clock realtime timers to the timer wheel, so any setting of the clock
> > > caused a full walk of all armed timers, disarming, adjusting and
> > > requeueing them. That's horrible not only performance wise, it's also
> > > a locking nightmare of all sorts.
> > >
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > I haven't looked through Dmitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> >
> > Then it sounds like this will take some more digging.
> >
> > Please pardon me for thinking out loud.
> >
> > There are one or more time sources that we use to compute the time,
> > and for each time source we have a conversion from ticks of the
> > time source to nanoseconds.
> >
> > Each time source needs to be sampled at least once per wrap-around,
> > and something incremented so that we don't lose time when looking
> > at that time source.
> >
> > There are several clocks presented to userspace; they all share the
> > same length of second and are all fundamentally offsets from
> > CLOCK_MONOTONIC.
> >
> > I see two fundamental driving cases for a time namespace.
> >
> > 1) Migration from one node to another node in a cluster in almost
> >    real time.
> >
> >    The problem is that CLOCK_MONOTONIC between nodes in the cluster
> >    has no relationship to each other (except a synchronized length of
> >    the second). So applications that migrate can see CLOCK_MONOTONIC
> >    and CLOCK_BOOTTIME go backwards.
> >
> >    This is the truly pressing problem, and adding some kind of offset
> >    sounds like it would be the solution. Possibly by allowing a boot
> >    time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> >
> > 2) Dealing with two separate time management domains. Say a machine
> >    that needs to deal with both something inside of Google, where they
> >    slew time to avoid leap seconds, and something in the outside world,
> >    where proper UTC time is kept as an offset from TAI with the
> >    occasional leap seconds.
> >
> >    In the latter case it would fundamentally require having seconds of
> >    different length.
>
> I want to add that the second case should be optional.
>
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
>
> Before starting this series, I was thinking about this, and I decided
> that these cases can be solved independently. Probably, the full
> isolation of the time sub-system will have much higher overhead than
> just offsets for a few clocks. And the idea
Re: [RFC 00/20] ns: Introduce Time Namespace
On Sat, Oct 20, 2018 at 06:41:23PM -0700, Andrei Vagin wrote:
> On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> > Thomas Gleixner writes:
> >
> > > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> > >> Reading the code the calling sequence there is:
> > >> tick_sched_do_timer
> > >>   tick_do_update_jiffies64
> > >>     update_wall_time
> > >>       timekeeping_advance
> > >>         timekeeping_update
> > >>
> > >> If I read that properly, under the right nohz circumstances that update
> > >> can be delayed indefinitely.
> > >>
> > >> So I think we could prototype a time namespace that was per
> > >> timekeeping_update and just had update_wall_time iterate through
> > >> all of the time namespaces.
> > >
> > > Please don't go there. timekeeping_update() is already heavy and walking
> > > through a gazillion of namespaces will just make it horrible.
> > >
> > >> I don't think the naive version would scale to very many time
> > >> namespaces.
> > >
> > > :)
> > >
> > >> At the same time, using the techniques from the nohz work and a little
> > >> smarts I expect we could get the code to scale.
> > >
> > > You'd need to invoke the update when the namespace is switched in and
> > > hasn't been updated since the last tick happened. That might be doable, but
> > > you also need to take the wraparound constraints of the underlying
> > > clocksources into account, which again can cause walking all namespaces
> > > when they are all idle long enough.
> >
> > The wraparound constraints being how long before the time sources wrap
> > around, so you have to read them once per wraparound? I have not dug
> > deeply enough into the code to see that yet.
> >
> > > From there it becomes hairy, because it's not only timekeeping,
> > > i.e. reading time; this is also affecting all timers which are armed from
> > > a namespace.
> > >
> > > That gets really ugly, because when you do settimeofday() or adjtimex() for
> > > a particular namespace, then you have to search for all armed timers of
> > > that namespace and adjust them.
> > >
> > > The original posix timer code had the same issue because it mapped the
> > > clock realtime timers to the timer wheel, so any setting of the clock
> > > caused a full walk of all armed timers, disarming, adjusting and
> > > requeueing them. That's horrible not only performance-wise, it's also a
> > > locking nightmare of all sorts.
> > >
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > I haven't looked through Dmitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> >
> > Then it sounds like this will take some more digging.
> >
> > Please pardon me for thinking out loud.
> >
> > There are one or more time sources that we use to compute the time,
> > and for each time source we have a conversion from ticks of the
> > time source to nanoseconds.
> >
> > Each time source needs to be sampled at least once per wrap-around,
> > and something incremented so that we don't lose time when looking
> > at that time source.
> >
> > There are several clocks presented to userspace; they all share the
> > same length of second and are all fundamentally offsets from
> > CLOCK_MONOTONIC.
> >
> > I see two fundamental driving cases for a time namespace.
> > 1) Migration from one node to another node in a cluster in almost
> >    real time.
> >
> >    The problem is that CLOCK_MONOTONIC between nodes in the cluster
> >    has no relationship to each other (except a synchronized length of
> >    the second). So applications that migrate can see CLOCK_MONOTONIC
> >    and CLOCK_BOOTTIME go backwards.
> >
> >    This is the truly pressing problem and adding some kind of offset
> >    sounds like it would be the solution. Possibly by allowing a boot
> >    time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> >
> > 2) Dealing with two separate time management domains. Say a machine
> >    that needs to deal with both something inside of Google, where they
> >    slew time to avoid leap seconds, and something in the outside
> >    world, where proper UTC time is kept as an offset from TAI with the
> >    occasional leap second.
> >
> >    In the latter case it would fundamentally require having seconds of
> >    different length.
>
> I want to add that the second case should be optional.
>
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
>
> Before starting this series, I was thinking about this, and I decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea
Re: [PATCH 0/6] Tracing register accesses with pstore and dynamic debug
On 10/20/2018 9:57 PM, Joel Fernandes wrote:
> On Sat, Oct 20, 2018 at 12:02:37PM +0530, Sai Prakash Ranjan wrote:
> > On 10/20/2018 10:55 AM, Joel Fernandes wrote:
> > > On Sun, Sep 09, 2018 at 01:57:01AM +0530, Sai Prakash Ranjan wrote:
> > > > Hi,
> > > >
> > > > This patch series adds event tracing support to pstore and is a
> > > > continuation of the RFC patch introduced to add a new tracing
> > > > facility for register accesses called Register Trace Buffer (RTB).
> > > > Since we decided not to introduce a separate framework to trace
> > > > register accesses and to use an existing framework like tracepoints,
> > > > I have moved on from the RFC. Details of the RFC in the link below:
> > > >
> > > > Link: https://lore.kernel.org/lkml/cover.1535119710.git.saiprakash.ran...@codeaurora.org/
> > > >
> > > > The MSR tracing example given by Steven was helpful in using
> > > > tracepoints for register accesses instead of a separate trace
> > > > facility. But just having these IO traces would not help much unless
> > > > we could have them in some persistent RAM buffer for debugging
> > > > unclocked access, or some kind of bus hang, or an unexpected reset
> > > > caused by some buggy driver, which happens a lot during initial
> > > > development stages. By analyzing the last few entries of this
> > > > buffer, we could identify the register access which is causing the
> > > > issue.
> > >
> > > Hi Sai,
> > >
> > > I wanted to see if I could make some time to get your patches working.
> > > We are hitting usecases that need something like this as well.
> > > Basically devices hanging and then the ramdump does not tell us much,
> > > so in this case pstore events can be really helpful. This usecase came
> > > up last year as well.
> > >
> > > Anyway, while I was going through your patches, I cleaned up some
> > > pstore code as well and I have 3 more patches on top of yours for this
> > > clean up. I prefer we submit the patches together and sync our work so
> > > that there is least conflict. Here's my latest tree:
> > > https://github.com/joelagnel/linux-kernel/commits/pstore-events
> > > (note that I have only build tested the patches since I just wrote
> > > them and it's quite late in the night here ;-))
> >
> > Hi Joel,
> >
> > Thanks for looking into this. Sure, I will be happy to sync up with you
> > on this. I can test your additional patches on top of my pstore patches.
> >
> > BTW, I'm still stuck at copying the binary record into pstore and then
> > extracting it at read time. Seems like I'm missing something.
>
> Thanks. And added a fourth patch in the tree too.
>
> Sure, push your latest somewhere and let me know. I'll try to get you
> unstuck.

Thanks Joel, I will push my changes and let you know in some time.

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation
Re: [RFC 00/20] ns: Introduce Time Namespace
On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> Thomas Gleixner writes:
>
> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> >> Reading the code the calling sequence there is:
> >> tick_sched_do_timer
> >>   tick_do_update_jiffies64
> >>     update_wall_time
> >>       timekeeping_advance
> >>         timekeeping_update
> >>
> >> If I read that properly, under the right nohz circumstances that update
> >> can be delayed indefinitely.
> >>
> >> So I think we could prototype a time namespace that was per
> >> timekeeping_update and just had update_wall_time iterate through
> >> all of the time namespaces.
> >
> > Please don't go there. timekeeping_update() is already heavy and walking
> > through a gazillion of namespaces will just make it horrible.
> >
> >> I don't think the naive version would scale to very many time
> >> namespaces.
> >
> > :)
> >
> >> At the same time, using the techniques from the nohz work and a little
> >> smarts I expect we could get the code to scale.
> >
> > You'd need to invoke the update when the namespace is switched in and
> > hasn't been updated since the last tick happened. That might be doable, but
> > you also need to take the wraparound constraints of the underlying
> > clocksources into account, which again can cause walking all namespaces
> > when they are all idle long enough.
>
> The wraparound constraints being how long before the time sources wrap
> around, so you have to read them once per wraparound? I have not dug
> deeply enough into the code to see that yet.
>
> > From there it becomes hairy, because it's not only timekeeping,
> > i.e. reading time; this is also affecting all timers which are armed from
> > a namespace.
> >
> > That gets really ugly, because when you do settimeofday() or adjtimex() for
> > a particular namespace, then you have to search for all armed timers of
> > that namespace and adjust them.
> >
> > The original posix timer code had the same issue because it mapped the
> > clock realtime timers to the timer wheel, so any setting of the clock
> > caused a full walk of all armed timers, disarming, adjusting and
> > requeueing them. That's horrible not only performance-wise, it's also a
> > locking nightmare of all sorts.
> >
> > Add time skew via NTP/PTP into the picture and you might have to adjust
> > timers as well, because you need to guarantee that they are not expiring
> > early.
> >
> > I haven't looked through Dmitry's patches yet, but I don't see how this
> > can work at all without introducing subtle issues all over the place.
>
> Then it sounds like this will take some more digging.
>
> Please pardon me for thinking out loud.
>
> There are one or more time sources that we use to compute the time,
> and for each time source we have a conversion from ticks of the
> time source to nanoseconds.
>
> Each time source needs to be sampled at least once per wrap-around,
> and something incremented so that we don't lose time when looking
> at that time source.
>
> There are several clocks presented to userspace; they all share the
> same length of second and are all fundamentally offsets from
> CLOCK_MONOTONIC.
>
> I see two fundamental driving cases for a time namespace.
> 1) Migration from one node to another node in a cluster in almost
>    real time.
>
>    The problem is that CLOCK_MONOTONIC between nodes in the cluster
>    has no relationship to each other (except a synchronized length of
>    the second). So applications that migrate can see CLOCK_MONOTONIC
>    and CLOCK_BOOTTIME go backwards.
>
>    This is the truly pressing problem and adding some kind of offset
>    sounds like it would be the solution. Possibly by allowing a boot
>    time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
>
> 2) Dealing with two separate time management domains. Say a machine
>    that needs to deal with both something inside of Google, where they
>    slew time to avoid leap seconds, and something in the outside
>    world, where proper UTC time is kept as an offset from TAI with the
>    occasional leap second.
>
>    In the latter case it would fundamentally require having seconds of
>    different length.

I want to add that the second case should be optional.

When a container is migrated to another host, we have to restore its
monotonic and boottime clocks, but we still expect that the container
will continue using the host real-time clock.

Before starting this series, I was thinking about this, and I decided that
these cases can be solved independently. Probably, the full isolation of
the time sub-system will have much higher overhead than just offsets for
a few clocks. And the idea that isolation of the real-time clock should
be optional gives us another hint that offsets for monotonic and
boot-time clocks can be implemented independently.

Eric and Thomas, what do you think about this? If you agree that these
two cases can be implemented separately, what should we do with this
[ANNOUNCE] Call for Papers - linux.conf.au Kernel Miniconf, Christchurch NZ, 21-25 Jan 2019
The linux.conf.au Kernel Miniconf is happening once again, this time in
Christchurch on 22 Jan 2019.

*** Submissions close on 2018-12-16, 23:59 AoE, with early submissions
(before 2018-11-16, 23:59 AoE) given priority. ***

*** Submission details: http://lca-kernel.ozlabs.org/2019-cfp.html ***

The Kernel Miniconf is a one-day stream alongside the main LCA conference
to talk about kernel stuff. We invite submissions on anything related to
kernel and low-level systems programming. We welcome submissions from
developers of all levels of experience in the kernel community, covering a
broad range of topics.

Past Kernel Miniconfs have included technical talks on topics such as
memory management, RCU, scheduling and filesystems, as well as talks on
Linux kernel community topics such as licensing and the Linux kernel
development process.

We strongly encourage both first-time and seasoned speakers from all
backgrounds, ages, genders, nationalities, ethnicities, religions and
abilities. Like the main LCA conference itself, we respect and encourage
diversity at our miniconf.

Speakers will need to purchase an LCA ticket to attend. See
http://lca-kernel.ozlabs.org/2019-cfp.html for full details and the
submission form.

--
Andrew Donnellan OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com IBM Australia Limited
Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]
Alan Jenkins wrote:
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 4dfe7e23b7ee..e8d61d5f581d 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1763,7 +1763,7 @@ void dissolve_on_fput(struct vfsmount *mnt)
>  {
>  	namespace_lock();
>  	lock_mount_hash();
> -	if (!real_mount(mnt)->mnt_ns) {
> +	if (!real_mount(mnt)->mnt_ns && !(mnt->mnt_flags & MNT_UMOUNT)) {
>  		mntget(mnt);
>  		umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
>  	}
> @@ -2469,7 +2469,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
>  	if (old->mnt_ns && !attached)
>  		goto out1;
>
> -	if (old->mnt.mnt_flags & MNT_LOCKED)
> +	if (old->mnt.mnt_flags & (MNT_LOCKED | MNT_UMOUNT))
> +		goto out1;
>
>  	if (old_path->dentry != old_path->mnt->mnt_root)

I've already got one of these; I'll fold in the other also.

David
RE: [PATCH] iw_cxgb4: fix a missing-check bug
> -----Original Message-----
> From: Wenwen Wang
> Sent: Saturday, October 20, 2018 6:56 PM
> To: sw...@opengridcomputing.com
> Cc: Kangjie Lu ; sw...@chelsio.com; dledf...@redhat.com; j...@ziepe.ca;
> linux-r...@vger.kernel.org; open list ker...@vger.kernel.org>; Wenwen Wang
> Subject: Re: [PATCH] iw_cxgb4: fix a missing-check bug
>
> On Sat, Oct 20, 2018 at 6:41 PM Steve Wise wrote:
> >
> > Hey Wenwen,
> >
> > > Subject: [PATCH] iw_cxgb4: fix a missing-check bug
> > >
> > > In c4iw_flush_hw_cq, the next CQE is acquired through t4_next_hw_cqe(). In
> > > t4_next_hw_cqe(), the CQE, i.e., 'cq->queue[cq->cidx]', is checked to see
> > > whether it is valid through t4_valid_cqe(). If it is valid, the address of
> > > the CQE is then saved to 'hw_cqe'. Later on, the CQE is copied to the local
> > > memory in create_read_req_cqe(). The problem here is that the CQE is
> > > actually in a DMA region allocated by dma_alloc_coherent() in create_cq().
> > > Given that the device also has the permission to access the DMA region, a
> > > malicious device controlled by an attacker can modify the CQE in the DMA
> > > region after the check in t4_next_hw_cqe() but before the copy in
> > > create_read_req_cqe(). By doing so, the attacker can supply invalid CQE,
> > > which can cause undefined behavior of the kernel and introduce potential
> > > security risks.
> >
> > If the dma device is malicious, couldn't it just dma some incorrect CQE but
> > still valid in the first place? I don't think this patch actually solves
> > the issue, and it forces a copy of a 64B CQE in a critical data io path.
>
> Thanks for your response! If the malicious dma device just dma some
> incorrect CQE, it will not be able to pass the verification in
> t4_valid_cqe().

As long as the gen bit is correct, the CQE is considered valid. You cannot
protect against a malicious dma device. Or at least not with the current
driver/device contract.

Steve.
Re: [Ksummit-discuss] [PATCH 6/7] Code of Conduct: Change the contact email address
> > Data protection law, reporting laws in some
> > countries and the like mean that anyone expecting an incident to remain
> > confidential from the person it was reported against is living in
> > dreamland and is going to get a nasty shock.
>
> OK - you seem to be talking about keeping the incident and reporter
> confidential from the person reported against.
>
> Certainly the person reported against has to have the incident
> identified to them, so that part is not confidential. Many legal
> jurisdictions require that the accused can know their accuser.
> But these things come into play mostly when items have veered
> into legal territory. Most violation reports are not in that territory.
>
> There's no legal requirement that the Code of Conduct committee
> tell someone who it was that said they were rude on the mailing list.

The 'who said' is generally safe (except for the it's-in-court case, which
is fine: when that happens the legal process deals with it). The other
details are not, so while someone accused of something might not know who
said it, they can ask for what personal data is held (which would include
that email, with names etc. scrubbed). You can possibly fight that in
court, of course, if you've got lots of money and nothing better to do for
six weeks.

> > You should also reserve the right to report serious incidents directly
> > to law enforcement. Unless of course you want to be forced to sit on
> > multiple reports of physical abuse from different people about
> > someone - unable to tell them about each others' reports, unable to
> > prove anything, and in twenty years' time having to explain to the
> > media why nothing was done.
>
> The scope of the code of conduct basically means that it covers
> online interactions (communication via mailing list, git commits
> and Bugzilla).

I don't see it specifically stating that. 'If someone is offensive at a
kernel summit we are going to refuse to listen'? Seriously?

Alan
Re: [PATCH] iw_cxgb4: fix a missing-check bug
On Sat, Oct 20, 2018 at 6:41 PM Steve Wise wrote: > > Hey Wenwen, > > > Subject: [PATCH] iw_cxgb4: fix a missing-check bug > > > > In c4iw_flush_hw_cq, the next CQE is acquired through t4_next_hw_cqe(). In > > t4_next_hw_cqe(), the CQE, i.e., 'cq->queue[cq->cidx]', is checked to see > > whether it is valid through t4_valid_cqe(). If it is valid, the address of > > the CQE is then saved to 'hw_cqe'. Later on, the CQE is copied to the > local > > memory in create_read_req_cqe(). The problem here is that the CQE is > > actually in a DMA region allocated by dma_alloc_coherent() in create_cq(). > > Given that the device also has the permission to access the DMA region, a > > malicious device controlled by an attacker can modify the CQE in the DMA > > region after the check in t4_next_hw_cqe() but before the copy in > > create_read_req_cqe(). By doing so, the attacker can supply invalid CQE, > > which can cause undefined behavior of the kernel and introduce potential > > security risks. > > > > If the dma device is malicious, couldn't it just dma some incorrect CQE but > still valid in the first place? I don't think this patch actually solves > the issue, and it forces a copy of a 64B CQE in a critical data io path. Thanks for your response! If the malicious dma device just dma some incorrect CQE, it will not be able to pass the verification in t4_valid_cqe(). Wenwen
Editors 4
We are one image studio who is able to process 300+ photos a day. If you need any image editing, please let us know. We can do it for you such as: Image cut out for photos and clipping path, masking for your photos, They are mostly used for ecommerce photos, jewelry photos retouching, beauty and skin images and wedding photos. We do also different kind of beauty retouching, portraits retouching. We can send editing for your photos if you send us one or two photos. Thanks, Linda
RE: [PATCH] iw_cxgb4: fix a missing-check bug
Hey Wenwen, > Subject: [PATCH] iw_cxgb4: fix a missing-check bug > > In c4iw_flush_hw_cq, the next CQE is acquired through t4_next_hw_cqe(). In > t4_next_hw_cqe(), the CQE, i.e., 'cq->queue[cq->cidx]', is checked to see > whether it is valid through t4_valid_cqe(). If it is valid, the address of > the CQE is then saved to 'hw_cqe'. Later on, the CQE is copied to the local > memory in create_read_req_cqe(). The problem here is that the CQE is > actually in a DMA region allocated by dma_alloc_coherent() in create_cq(). > Given that the device also has the permission to access the DMA region, a > malicious device controlled by an attacker can modify the CQE in the DMA > region after the check in t4_next_hw_cqe() but before the copy in > create_read_req_cqe(). By doing so, the attacker can supply invalid CQE, > which can cause undefined behavior of the kernel and introduce potential > security risks. > If the dma device is malicious, couldn't it just dma some incorrect CQE but still valid in the first place? I don't think this patch actually solves the issue, and it forces a copy of a 64B CQE in a critical data io path. So I must NACK this. Thanks, Steve.
Re: [PATCH] fs: ufs: Remove switch statement from ufs_set_de_type function
On Sat, Oct 20, 2018 at 03:26:37PM -0700, Matthew Wilcox wrote: > On Sat, Oct 20, 2018 at 11:09:57PM +0100, Phillip Potter wrote: > > Remove switch statement from ufs_set_de_type function in fs/ufs/util.h > > header and replace with simple assignment. For each case, S_IFx >> 12 > > is equal to DT_x, so in valid cases (mode & S_IFMT) >> 12 should give > > us the correct file type. It is expected that for *nix compatibility > > reasons, the relation between S_IFx and DT_x will not change. For > > cases where the mode is invalid, upper layer validation catches this > > anyway, so this improves readability and arguably performance by > > assigning (mode & S_IFMT) >> 12 directly without any jump table > > or conditional logic. > > Shouldn't we also do this for other filesystems? A quick scan suggests > someone should at least look at xfs, ubifs, squashfs, romfs, ocfs2, > nilfs2, hfsplus, f2fs, ext4, ext2, exofs, coda, cifs & btrfs. Do what for other filesystems? E.g. ext* has the type cached in directory entry - as a straight enum, so encoding is needed. Which we do. ocfs2 is pretty certain to have it in ext*-compatible layout. ubifs stores them in another enum of its own (also handled). XFS - another enum (present on some layout variants), etc. And... CIFS? Seriously? This stuff is about encoding used to cache the file type in directory entry; how the bleeding hell would CIFS _client_ get anywhere near that?
[PATCH v2] KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned
The spec only requires the posted interrupt descriptor address to be 64-byte aligned (i.e. bits[0:5] == 0). Using page_address_valid also forces the address to be page aligned. Only validate that the address does not cross the maximum physical address without enforcing a page alignment. v1 -> v2: - Add a missing parenthesis (dropped while merging!) Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: H. Peter Anvin Cc: x...@kernel.org Cc: k...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Fixes: 6de84e581c0 ("nVMX x86: check posted-interrupt descriptor addresss on vmentry of L2") Signed-off-by: KarimAllah Ahmed --- arch/x86/kvm/vmx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 38f1a16..bb0fcdb 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -11667,7 +11667,7 @@ static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu, !nested_exit_intr_ack_set(vcpu) || (vmcs12->posted_intr_nv & 0xff00) || (vmcs12->posted_intr_desc_addr & 0x3f) || - (!page_address_valid(vcpu, vmcs12->posted_intr_desc_addr))) + (vmcs12->posted_intr_desc_addr >> cpuid_maxphyaddr(vcpu))) return -EINVAL; /* tpr shadow is needed by all apicv features. */ -- 2.7.4
RE: [PATCH] tpm: tpm_try_transmit() refactor error flow.
> > On Thu, 18 Oct 2018, Winkler, Tomas wrote: > >> -Original Message- > >> From: Jarkko Sakkinen [mailto:jarkko.sakki...@linux.intel.com] > >> Sent: Thursday, October 18, 2018 03:15 > >> To: Winkler, Tomas > >> Cc: Jarkko Sakkinen ; Jason > >> Gunthorpe ; Nayna Jain ; > >> Usyskin, Alexander ; Struk, Tadeusz > >> ; linux-integr...@vger.kernel.org; > >> linux-security- mod...@vger.kernel.org; linux-kernel@vger.kernel.org; > >> sta...@vger.kernel.org > >> Subject: Re: [PATCH] tpm: tpm_try_transmit() refactor error flow. > >> > >> On Tue, 16 Oct 2018, Tomas Winkler wrote: > >>> First, rename out_no_locality to out_locality for bailing out on > >>> both > >>> tpm_cmd_ready() and tpm_request_locality() failure. > >> > >> This is unnecessary change and technically it is not a rename: the > >> commit message text and the code change do not match. Rename is just > >> a rename (i.e. change a variable name foo to bar). > > > > I'm renaming the label because it doesn't match the code flow anymore, > > I can change the commit message, but you please review the code. > > Tomas > > The flow change is unnecessary and does not really have anything to do with > the bug fix. What I see in the original code is that when tpm_cmd_ready() fails it jumps to the 'out' label and tries to do tpm_go_idle(), but instead it should just undo the locality, so both the tpm_cmd_ready() and tpm_go_idle() paths had a wrong jump. I see both should be fixed. > Earlier version was better than this and would have been fine when taking account the remark from Jason.
Broken dwarf unwinding - wrong stack pointer register value?
Hey all, I'm on the quest to figure out why perf regularly fails to unwind (some) samples. I am seeing very strange behavior, where an apparently wrong stack pointer value is read from the register - see below for more information and the end of this (long) mail for my open questions. Any help would be greatly appreciated. I am currently using this trivial C++ code to reproduce the issue: ``` #include #include #include #include using namespace std; int main() { uniform_real_distribution uniform(-1E5, 1E5); default_random_engine engine; double s = 0; for (int i = 0; i < 1000; ++i) { s += norm(complex(uniform(engine), uniform(engine))); } cout << s << '\n'; return 0; } ``` I compile it with `g++ -O2 -g` and then record it with `perf record --call- graph dwarf`. Using perf script, I then see e.g.: ``` $ perf script -v --no-inline --time 90229.12668,90229.127158 --ns ... # first frame (working unwinding from __hypot_finite): unwind: reg 16, val 7faf7dca2696 unwind: reg 7, val 7ffc80811ca0 unwind: find_proc_info dso /usr/lib/libm-2.28.so unwind: access_mem addr 0x7ffc80811ce8 val 7faf7dc88af9, offset 72 unwind: find_proc_info dso /usr/lib/libm-2.28.so unwind: access_mem addr 0x7ffc80811d08 val 56382b0fc129, offset 104 unwind: find_proc_info dso /home/milian/projects/kdab/rnd/hotspot/build/tests/ test-clients/cpp-inlining/cpp-inlining unwind: access_mem addr 0x7ffc80811d58 val 7faf7dabf223, offset 184 unwind: find_proc_info dso /usr/lib/libc-2.28.so unwind: access_mem addr 0x7ffc80811e18 val 56382b0fc1ee, offset 376 unwind: find_proc_info dso /home/milian/projects/kdab/rnd/hotspot/build/tests/ test-clients/cpp-inlining/cpp-inlining unwind: __hypot_finite:ip = 0x7faf7dca2696 (0x29696) unwind: hypotf32x:ip = 0x7faf7dc88af8 (0xfaf8) unwind: main:ip = 0x56382b0fc128 (0x1128) unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222) unwind: _start:ip = 0x56382b0fc1ed (0x11ed) # second frame (unrelated) unwind: reg 16, val 56382b0fc114 unwind: reg 7, val 7ffc80811d10 unwind: 
find_proc_info dso /home/milian/projects/kdab/rnd/hotspot/build/tests/ test-clients/cpp-inlining/cpp-inlining unwind: access_mem addr 0x7ffc80811d58 val 7faf7dabf223, offset 72 unwind: access_mem addr 0x7ffc80811e18 val 56382b0fc1ee, offset 264 unwind: main:ip = 0x56382b0fc114 (0x1114) unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222) unwind: _start:ip = 0x56382b0fc1ed (0x11ed) # third frame (broken unwinding from __hypot_finite) unwind: reg 16, val 7faf7dca2688 unwind: reg 7, val 7ffc80811ca0 unwind: find_proc_info dso /usr/lib/libm-2.28.so unwind: access_mem addr 0x7ffc80811cc0 val 0, offset 32 unwind: __hypot_finite:ip = 0x7faf7dca2688 (0x29688) unwind: '':ip = 0x (0x0) cpp-inlining 24617 90229.126685606: 711026 cycles:uppp: 7faf7dca2696 __hypot_finite+0x36 (/usr/lib/libm-2.28.so) 7faf7dc88af8 hypotf32x+0x18 (/usr/lib/libm-2.28.so) 56382b0fc128 main+0x88 (/home/milian/projects/kdab/rnd/hotspot/ build/tests/test-clients/cpp-inlining/cpp-inlining) 7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so) 56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/ build/tests/test-clients/cpp-inlining/cpp-inlining) cpp-inlining 24617 90229.126921551: 714657 cycles:uppp: 56382b0fc114 main+0x74 (/home/milian/projects/kdab/rnd/hotspot/ build/tests/test-clients/cpp-inlining/cpp-inlining) 7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so) 56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/ build/tests/test-clients/cpp-inlining/cpp-inlining) cpp-inlining 24617 90229.127157818: 719976 cycles:uppp: 7faf7dca2688 __hypot_finite+0x28 (/usr/lib/libm-2.28.so) [unknown] ([unknown]) ... ``` Now I'm trying to figure out why one __hypot_finite sample works but the other one breaks for no apparent reason. One thing I noticed so far is the following discrepancy in memory accesses, i.e. 
for the working sample we get: ``` unwind: reg 16, val 7faf7dca2696 unwind: reg 7, val 7ffc80811ca0 unwind: access_mem addr 0x7ffc80811ce8 val 7faf7dc88af9, offset 72 ... ``` But the broken sample only produces: ``` unwind: reg 16, val 7faf7dca2688 unwind: reg 7, val 7ffc80811ca0 unwind: find_proc_info dso /usr/lib/libm-2.28.so unwind: access_mem addr 0x7ffc80811cc0 val 0, offset 32 ``` The user stack of the working sample starts with (in hex): 9c830441254a9800301d8180fc7fc0c10f2b385630 The user stack of the broken sample starts with (in hex): 2f64fd4073bb9700301d8180fc7fc0c10f2b385630 This is a 64bit machine, so word width is 8. So both user stacks start with `24 * 2 = 48` words of zeros. Hence offset 32 cannot possibly produce a meaningful
Re: [PATCH v3 1/3] code-of-conduct: Fix the ambiguity about collecting email addresses
James, and our other friends, On Tue, Oct 16, 2018 at 2:59 PM James Bottomley wrote: > > The current code of conduct has an ambiguity More than one ambiguity. This whole file needs to go. >* Trolling, Who decides what is trolling, and what is a technique for raising awareness or sparking discussion on an issue? > * Other conduct which could reasonably be considered inappropriate in a >professional setting Why should this last bit remain? Any literate person with access to a dictionary should know how ambiguous the word professional is. As an amateur contributor to the FOSS ecosystem I am more than a bit offended by the decision to use such divisive, politically charged, and financially discriminatory language in a project of such massive technical importance. This entire file should be expunged from the repository and replaced by well defined minimalistic guidelines for maintaining order on the mailing lists, rather than a set of ambiguous codes that force maintainers to take politically motivated actions against contributors for undefined reasons. Using words like professional is a distressing red flag because it doesn't add any clarification on the issue (what was the issue again?), it only raises more questions. I can't think of any reason that word would be needed unless you're trying to push out unpaid contributors. Why should someone's employment status be held against them when contributing ideas or code to a technical project that has benefited greatly from amateur contributions? I fear for the kernel's future now that irrational politics are beginning to creep in.
Re: [PATCH] fs: ufs: Remove switch statement from ufs_set_de_type function
On Sat, Oct 20, 2018 at 11:09:57PM +0100, Phillip Potter wrote: > Remove switch statement from ufs_set_de_type function in fs/ufs/util.h > header and replace with simple assignment. For each case, S_IFx >> 12 > is equal to DT_x, so in valid cases (mode & S_IFMT) >> 12 should give > us the correct file type. It is expected that for *nix compatibility > reasons, the relation between S_IFx and DT_x will not change. For > cases where the mode is invalid, upper layer validation catches this > anyway, so this improves readability and arguably performance by > assigning (mode & S_IFMT) >> 12 directly without any jump table > or conditional logic. Shouldn't we also do this for other filesystems? A quick scan suggests someone should at least look at xfs, ubifs, squashfs, romfs, ocfs2, nilfs2, hfsplus, f2fs, ext4, ext2, exofs, coda, cifs & btrfs. > Signed-off-by: Phillip Potter > --- > fs/ufs/util.h | 30 ++ > 1 file changed, 2 insertions(+), 28 deletions(-) > > diff --git a/fs/ufs/util.h b/fs/ufs/util.h > index 1fd3011ea623..7e0c0878b9f9 100644 > --- a/fs/ufs/util.h > +++ b/fs/ufs/util.h > @@ -16,6 +16,7 @@ > * some useful macros > */ > #define in_range(b,first,len)((b)>=(first)&&(b)<(first)+(len)) > +#define S_SHIFT 12 > > /* > * functions used for retyping > @@ -158,34 +159,7 @@ ufs_set_de_type(struct super_block *sb, struct > ufs_dir_entry *de, int mode) > if ((UFS_SB(sb)->s_flags & UFS_DE_MASK) != UFS_DE_44BSD) > return; > > - /* > - * TODO turn this into a table lookup > - */ > - switch (mode & S_IFMT) { > - case S_IFSOCK: > - de->d_u.d_44.d_type = DT_SOCK; > - break; > - case S_IFLNK: > - de->d_u.d_44.d_type = DT_LNK; > - break; > - case S_IFREG: > - de->d_u.d_44.d_type = DT_REG; > - break; > - case S_IFBLK: > - de->d_u.d_44.d_type = DT_BLK; > - break; > - case S_IFDIR: > - de->d_u.d_44.d_type = DT_DIR; > - break; > - case S_IFCHR: > - de->d_u.d_44.d_type = DT_CHR; > - break; > - case S_IFIFO: > - de->d_u.d_44.d_type = DT_FIFO; > - break; > - default: > - 
de->d_u.d_44.d_type = DT_UNKNOWN; > - } > + de->d_u.d_44.d_type = (mode & S_IFMT) >> S_SHIFT; > } > > static inline u32 > -- > 2.17.2 >
[PATCH v3 06/13] KVM/nVMX: Use kvm_vcpu_map when mapping the L1 MSR bitmap
Use kvm_vcpu_map when mapping the L1 MSR bitmap since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

Signed-off-by: KarimAllah Ahmed
---
v1 -> v2:
- Do not change the lifecycle of the mapping (pbonzini)
---
 arch/x86/kvm/vmx.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d857401..5b15ca2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -847,6 +847,9 @@ struct nested_vmx {
 	struct page *apic_access_page;
 	struct page *virtual_apic_page;
 	struct page *pi_desc_page;
+
+	struct kvm_host_map msr_bitmap_map;
+
 	struct pi_desc *pi_desc;
 	bool pi_pending;
 	u16 posted_intr_nv;
@@ -11546,9 +11549,10 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 						 struct vmcs12 *vmcs12)
 {
 	int msr;
-	struct page *page;
 	unsigned long *msr_bitmap_l1;
 	unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.vmcs02.msr_bitmap;
+	struct kvm_host_map *map = &to_vmx(vcpu)->nested.msr_bitmap_map;
+
 	/*
 	 * pred_cmd & spec_ctrl are trying to verify two things:
 	 *
@@ -11574,11 +11578,10 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 	    !pred_cmd && !spec_ctrl)
 		return false;

-	page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->msr_bitmap);
-	if (is_error_page(page))
+	if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->msr_bitmap), map))
 		return false;

-	msr_bitmap_l1 = (unsigned long *)kmap(page);
+	msr_bitmap_l1 = (unsigned long *)map->hva;
 	if (nested_cpu_has_apic_reg_virt(vmcs12)) {
 		/*
 		 * L0 need not intercept reads for MSRs between 0x800 and 0x8ff, it
@@ -11626,8 +11629,7 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 					MSR_IA32_PRED_CMD, MSR_TYPE_W);

-	kunmap(page);
-	kvm_release_page_clean(page);
+	kvm_vcpu_unmap(&to_vmx(vcpu)->nested.msr_bitmap_map);

 	return true;
 }
--
2.7.4
[PATCH v3 01/13] X86/nVMX: handle_vmon: Read 4 bytes from guest memory
Read the data directly from guest memory instead of the map->read->unmap
sequence. This also avoids using kvm_vcpu_gpa_to_page() and kmap() which
assumes that there is a "struct page" for guest memory.

Suggested-by: Jim Mattson
Signed-off-by: KarimAllah Ahmed
---
v1 -> v2:
- Massage commit message a bit.
---
 arch/x86/kvm/vmx.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4f5d4bd..358759a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8321,7 +8321,7 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
 {
 	int ret;
 	gpa_t vmptr;
-	struct page *page;
+	uint32_t revision;
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	const u64 VMXON_NEEDED_FEATURES = FEATURE_CONTROL_LOCKED
 		| FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;
@@ -8373,19 +8373,11 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
 		return kvm_skip_emulated_instruction(vcpu);
 	}

-	page = kvm_vcpu_gpa_to_page(vcpu, vmptr);
-	if (is_error_page(page)) {
+	if (kvm_read_guest(vcpu->kvm, vmptr, &revision, sizeof(revision)) ||
+	    revision != VMCS12_REVISION) {
 		nested_vmx_failInvalid(vcpu);
 		return kvm_skip_emulated_instruction(vcpu);
 	}
-	if (*(u32 *)kmap(page) != VMCS12_REVISION) {
-		kunmap(page);
-		kvm_release_page_clean(page);
-		nested_vmx_failInvalid(vcpu);
-		return kvm_skip_emulated_instruction(vcpu);
-	}
-	kunmap(page);
-	kvm_release_page_clean(page);

 	vmx->nested.vmxon_ptr = vmptr;
 	ret = enter_vmx_operation(vcpu);
--
2.7.4
[PATCH v3 05/13] KVM: Introduce a new guest mapping API
In KVM, especially for nested guests, there is a dominant pattern of:

	=> map guest memory -> do_something -> unmap guest memory

In addition to all the unnecessary noise this boilerplate adds to the
code, most of the time the mapping function does not properly handle
memory that is not backed by a "struct page". This new guest mapping API
encapsulates most of that boilerplate and also handles guest memory that
is not backed by a "struct page".

Keep in mind that memremap is horribly slow, so this mapping API should
not be used for high-frequency mapping operations, but rather for
low-frequency mappings.

Signed-off-by: KarimAllah Ahmed
---
v1 -> v2:
- Drop the caching optimization (pbonzini)
- Use 'hva' instead of 'kaddr' (pbonzini)
- Return 0/-EINVAL/-EFAULT instead of true/false. -EFAULT will be used for
  AMD patch (pbonzini)
- Introduce __kvm_map_gfn which accepts a memory slot and use it (pbonzini)
- Only clear map->hva instead of memsetting the whole structure.
- Drop kvm_vcpu_map_valid since it is no longer used.
- Fix EXPORT_MODULE naming.
---
 include/linux/kvm_host.h |  9 +++++++
 virt/kvm/kvm_main.c      | 50 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c926698..59e56b8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -205,6 +205,13 @@ enum {
 	READING_SHADOW_PAGE_TABLES,
 };

+struct kvm_host_map {
+	struct page *page;
+	void *hva;
+	kvm_pfn_t pfn;
+	kvm_pfn_t gfn;
+};
+
 /*
  * Sometimes a large or cross-page mmio needs to be broken up into separate
  * exits for userspace servicing.
@@ -708,6 +715,8 @@ struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
 struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
 kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
 kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+int kvm_vcpu_map(struct kvm_vcpu *vcpu, gpa_t gpa, struct kvm_host_map *map);
+void kvm_vcpu_unmap(struct kvm_host_map *map);
 struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn);
 unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn);
 unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *writable);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 94e931f..a79a3c4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1652,6 +1652,56 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(gfn_to_page);

+static int __kvm_map_gfn(struct kvm_memory_slot *slot, gfn_t gfn,
+			 struct kvm_host_map *map)
+{
+	kvm_pfn_t pfn;
+	void *hva = NULL;
+	struct page *page = NULL;
+
+	pfn = gfn_to_pfn_memslot(slot, gfn);
+	if (is_error_noslot_pfn(pfn))
+		return -EINVAL;
+
+	if (pfn_valid(pfn)) {
+		page = pfn_to_page(pfn);
+		hva = kmap(page);
+	} else {
+		hva = memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
+	}
+
+	if (!hva)
+		return -EFAULT;
+
+	map->page = page;
+	map->hva = hva;
+	map->pfn = pfn;
+	map->gfn = gfn;
+
+	return 0;
+}
+
+int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
+{
+	return __kvm_map_gfn(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn, map);
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_map);
+
+void kvm_vcpu_unmap(struct kvm_host_map *map)
+{
+	if (!map->hva)
+		return;
+
+	if (map->page)
+		kunmap(map->page);
+	else
+		memunmap(map->hva);
+
+	kvm_release_pfn_dirty(map->pfn);
+	map->hva = NULL;
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
+
 struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	kvm_pfn_t pfn;
--
2.7.4
[PATCH v3 03/13] X86/nVMX: Update the PML table without mapping and unmapping the page
Update the PML table without mapping and unmapping the page. This also
avoids using kvm_vcpu_gpa_to_page(..) which assumes that there is a
"struct page" for guest memory.

Signed-off-by: KarimAllah Ahmed
---
v1 -> v2:
- Use kvm_write_guest_page instead of kvm_write_guest (pbonzini)
- Do not use pointer arithmetic for pml_address (pbonzini)
---
 arch/x86/kvm/vmx.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bc45347..d857401 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -13571,9 +13571,7 @@ static int vmx_write_pml_buffer(struct kvm_vcpu *vcpu)
 {
 	struct vmcs12 *vmcs12;
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	gpa_t gpa;
-	struct page *page = NULL;
-	u64 *pml_address;
+	gpa_t gpa, dst;

 	if (is_guest_mode(vcpu)) {
 		WARN_ON_ONCE(vmx->nested.pml_full);
@@ -13593,15 +13591,13 @@ static int vmx_write_pml_buffer(struct kvm_vcpu *vcpu)
 		}

 		gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS) & ~0xFFFull;
+		dst = vmcs12->pml_address + sizeof(u64) * vmcs12->guest_pml_index;

-		page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->pml_address);
-		if (is_error_page(page))
+		if (kvm_write_guest_page(vcpu->kvm, gpa_to_gfn(dst), &gpa,
+					 offset_in_page(dst), sizeof(gpa)))
 			return 0;

-		pml_address = kmap(page);
-		pml_address[vmcs12->guest_pml_index--] = gpa;
-		kunmap(page);
-		kvm_release_page_clean(page);
+		vmcs12->guest_pml_index--;
 	}

 	return 0;
--
2.7.4
[PATCH v3 11/13] KVM/X86: hyperv: Use kvm_vcpu_map in synic_deliver_msg
Use kvm_vcpu_map in synic_deliver_msg since using kvm_vcpu_gpa_to_page()
and kmap() will only work for guest memory that has a "struct page".

Signed-off-by: KarimAllah Ahmed
---
v1 -> v2:
- Update to match the new API return codes
---
 arch/x86/kvm/hyperv.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 8bdc78d..5310b8b 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -581,7 +581,7 @@ static int synic_deliver_msg(struct kvm_vcpu_hv_synic *synic, u32 sint,
 			     struct hv_message *src_msg)
 {
 	struct kvm_vcpu *vcpu = synic_to_vcpu(synic);
-	struct page *page;
+	struct kvm_host_map map;
 	gpa_t gpa;
 	struct hv_message *dst_msg;
 	int r;
@@ -591,11 +591,11 @@ static int synic_deliver_msg(struct kvm_vcpu_hv_synic *synic, u32 sint,
 		return -ENOENT;

 	gpa = synic->msg_page & PAGE_MASK;
-	page = kvm_vcpu_gfn_to_page(vcpu, gpa >> PAGE_SHIFT);
-	if (is_error_page(page))
+
+	if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map))
 		return -EFAULT;

-	msg_page = kmap_atomic(page);
+	msg_page = map.hva;
 	dst_msg = &msg_page->sint_message[sint];
 	if (sync_cmpxchg(&dst_msg->header.message_type, HVMSG_NONE,
 			 src_msg->header.message_type) != HVMSG_NONE) {
@@ -612,8 +612,8 @@ static int synic_deliver_msg(struct kvm_vcpu_hv_synic *synic, u32 sint,
 		else if (r == 0)
 			r = -EFAULT;
 	}
-	kunmap_atomic(msg_page);
-	kvm_release_page_dirty(page);
+
+	kvm_vcpu_unmap(&map);
 	kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
 	return r;
 }
--
2.7.4
[PATCH v3 10/13] KVM/X86: hyperv: Use kvm_vcpu_map in synic_clear_sint_msg_pending
Use kvm_vcpu_map in synic_clear_sint_msg_pending since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

Signed-off-by: KarimAllah Ahmed
---
v1 -> v2:
- Update to match the new API return codes
---
 arch/x86/kvm/hyperv.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 01d209a..8bdc78d 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -158,26 +158,22 @@ static void synic_clear_sint_msg_pending(struct kvm_vcpu_hv_synic *synic,
 					 u32 sint)
 {
 	struct kvm_vcpu *vcpu = synic_to_vcpu(synic);
-	struct page *page;
-	gpa_t gpa;
+	struct kvm_host_map map;
 	struct hv_message *msg;
 	struct hv_message_page *msg_page;

-	gpa = synic->msg_page & PAGE_MASK;
-	page = kvm_vcpu_gfn_to_page(vcpu, gpa >> PAGE_SHIFT);
-	if (is_error_page(page)) {
+	if (kvm_vcpu_map(vcpu, gpa_to_gfn(synic->msg_page), &map)) {
 		vcpu_err(vcpu, "Hyper-V SynIC can't get msg page, gpa 0x%llx\n",
-			 gpa);
+			 synic->msg_page);
 		return;
 	}
-	msg_page = kmap_atomic(page);
+	msg_page = map.hva;

 	msg = &msg_page->sint_message[sint];
 	msg->header.message_flags.msg_pending = 0;

-	kunmap_atomic(msg_page);
-	kvm_release_page_dirty(page);
-	kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
+	kvm_vcpu_unmap(&map);
+	kvm_vcpu_mark_page_dirty(vcpu, gpa_to_gfn(synic->msg_page));
 }

 static void kvm_hv_notify_acked_sint(struct kvm_vcpu *vcpu, u32 sint)
--
2.7.4
[PATCH v3 07/13] KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page
Use kvm_vcpu_map when mapping the virtual APIC page since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

One additional semantic change is that the virtual host mapping lifecycle
has changed a bit. It now has the same lifetime as the pinning of the
virtual APIC page on the host side.

Signed-off-by: KarimAllah Ahmed
---
v1 -> v2:
- Do not change the lifecycle of the mapping (pbonzini)
- Use pfn_to_hpa instead of gfn_to_gpa
---
 arch/x86/kvm/vmx.c | 34 +++++++++-----------------------
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5b15ca2..83a5e95 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -845,9 +845,8 @@ struct nested_vmx {
 	 * pointers, so we must keep them pinned while L2 runs.
 	 */
 	struct page *apic_access_page;
-	struct page *virtual_apic_page;
+	struct kvm_host_map virtual_apic_map;
 	struct page *pi_desc_page;
-
 	struct kvm_host_map msr_bitmap_map;

 	struct pi_desc *pi_desc;
@@ -6152,11 +6151,12 @@ static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)

 	max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256);
 	if (max_irr != 256) {
-		vapic_page = kmap(vmx->nested.virtual_apic_page);
+		vapic_page = vmx->nested.virtual_apic_map.hva;
+		if (!vapic_page)
+			return;
+
 		__kvm_apic_update_irr(vmx->nested.pi_desc->pir,
 			vapic_page, &max_irr);
-		kunmap(vmx->nested.virtual_apic_page);
-
 		status = vmcs_read16(GUEST_INTR_STATUS);
 		if ((u8)max_irr > ((u8)status & 0xff)) {
 			status &= ~0xff;
@@ -8468,10 +8468,7 @@ static void free_nested(struct vcpu_vmx *vmx)
 		kvm_release_page_dirty(vmx->nested.apic_access_page);
 		vmx->nested.apic_access_page = NULL;
 	}
-	if (vmx->nested.virtual_apic_page) {
-		kvm_release_page_dirty(vmx->nested.virtual_apic_page);
-		vmx->nested.virtual_apic_page = NULL;
-	}
+	kvm_vcpu_unmap(&vmx->nested.virtual_apic_map);
 	if (vmx->nested.pi_desc_page) {
 		kunmap(vmx->nested.pi_desc_page);
 		kvm_release_page_dirty(vmx->nested.pi_desc_page);
@@ -11394,6 +11391,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct kvm_host_map *map;
 	struct page *page;
 	u64 hpa;

@@ -11426,11 +11424,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
 	}

 	if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) {
-		if (vmx->nested.virtual_apic_page) { /* shouldn't happen */
-			kvm_release_page_dirty(vmx->nested.virtual_apic_page);
-			vmx->nested.virtual_apic_page = NULL;
-		}
-		page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->virtual_apic_page_addr);
+		map = &vmx->nested.virtual_apic_map;

 		/*
 		 * If translation failed, VM entry will fail because
@@ -11445,11 +11439,8 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
 		 * control. But such a configuration is useless, so
 		 * let's keep the code simple.
 		 */
-		if (!is_error_page(page)) {
-			vmx->nested.virtual_apic_page = page;
-			hpa = page_to_phys(vmx->nested.virtual_apic_page);
-			vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, hpa);
-		}
+		if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->virtual_apic_page_addr), map))
+			vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, pfn_to_hpa(map->pfn));
 	}

 	if (nested_cpu_has_posted_intr(vmcs12)) {
@@ -13353,10 +13344,7 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
 		kvm_release_page_dirty(vmx->nested.apic_access_page);
 		vmx->nested.apic_access_page = NULL;
 	}
-	if (vmx->nested.virtual_apic_page) {
-		kvm_release_page_dirty(vmx->nested.virtual_apic_page);
-		vmx->nested.virtual_apic_page = NULL;
-	}
+	kvm_vcpu_unmap(&vmx->nested.virtual_apic_map);
 	if (vmx->nested.pi_desc_page) {
 		kunmap(vmx->nested.pi_desc_page);
 		kvm_release_page_dirty(vmx->nested.pi_desc_page);
--
2.7.4
[PATCH v3 09/13] KVM/X86: Use kvm_vcpu_map in emulator_cmpxchg_emulated
Use kvm_vcpu_map in emulator_cmpxchg_emulated since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

Signed-off-by: KarimAllah Ahmed
---
v1 -> v2:
- Update to match the new API return codes
---
 arch/x86/kvm/x86.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ca71773..f083e53 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5280,9 +5280,9 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 				     unsigned int bytes,
 				     struct x86_exception *exception)
 {
+	struct kvm_host_map map;
 	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
 	gpa_t gpa;
-	struct page *page;
 	char *kaddr;
 	bool exchanged;

@@ -5299,12 +5299,11 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	if (((gpa + bytes - 1) & PAGE_MASK) != (gpa & PAGE_MASK))
 		goto emul_write;

-	page = kvm_vcpu_gfn_to_page(vcpu, gpa >> PAGE_SHIFT);
-	if (is_error_page(page))
+	if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map))
 		goto emul_write;

-	kaddr = kmap_atomic(page);
-	kaddr += offset_in_page(gpa);
+	kaddr = map.hva + offset_in_page(gpa);
+
 	switch (bytes) {
 	case 1:
 		exchanged = CMPXCHG_TYPE(u8, kaddr, old, new);
@@ -5321,8 +5320,8 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	default:
 		BUG();
 	}
-	kunmap_atomic(kaddr);
-	kvm_release_page_dirty(page);
+
+	kvm_vcpu_unmap(&map);

 	if (!exchanged)
 		return X86EMUL_CMPXCHG_FAILED;
--
2.7.4
[PATCH v3 07/13] KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page
Use kvm_vcpu_map when mapping the virtual APIC page since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". One additional semantic change is that the virtual host mapping lifecycle has changed a bit. It now has the same lifetime of the pinning of the virtual APIC page on the host side. Signed-off-by: KarimAllah Ahmed --- v1 -> v2: - Do not change the lifecycle of the mapping (pbonzini) - Use pfn_to_hpa instead of gfn_to_gpa --- arch/x86/kvm/vmx.c | 34 +++--- 1 file changed, 11 insertions(+), 23 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 5b15ca2..83a5e95 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -845,9 +845,8 @@ struct nested_vmx { * pointers, so we must keep them pinned while L2 runs. */ struct page *apic_access_page; - struct page *virtual_apic_page; + struct kvm_host_map virtual_apic_map; struct page *pi_desc_page; - struct kvm_host_map msr_bitmap_map; struct pi_desc *pi_desc; @@ -6152,11 +6151,12 @@ static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256); if (max_irr != 256) { - vapic_page = kmap(vmx->nested.virtual_apic_page); + vapic_page = vmx->nested.virtual_apic_map.hva; + if (!vapic_page) + return; + __kvm_apic_update_irr(vmx->nested.pi_desc->pir, vapic_page, _irr); - kunmap(vmx->nested.virtual_apic_page); - status = vmcs_read16(GUEST_INTR_STATUS); if ((u8)max_irr > ((u8)status & 0xff)) { status &= ~0xff; @@ -8468,10 +8468,7 @@ static void free_nested(struct vcpu_vmx *vmx) kvm_release_page_dirty(vmx->nested.apic_access_page); vmx->nested.apic_access_page = NULL; } - if (vmx->nested.virtual_apic_page) { - kvm_release_page_dirty(vmx->nested.virtual_apic_page); - vmx->nested.virtual_apic_page = NULL; - } + kvm_vcpu_unmap(>nested.virtual_apic_map); if (vmx->nested.pi_desc_page) { kunmap(vmx->nested.pi_desc_page); kvm_release_page_dirty(vmx->nested.pi_desc_page); @@ -11394,6 
+11391,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) { struct vmcs12 *vmcs12 = get_vmcs12(vcpu); struct vcpu_vmx *vmx = to_vmx(vcpu); + struct kvm_host_map *map; struct page *page; u64 hpa; @@ -11426,11 +11424,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) } if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) { - if (vmx->nested.virtual_apic_page) { /* shouldn't happen */ - kvm_release_page_dirty(vmx->nested.virtual_apic_page); - vmx->nested.virtual_apic_page = NULL; - } - page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->virtual_apic_page_addr); + map = >nested.virtual_apic_map; /* * If translation failed, VM entry will fail because @@ -11445,11 +11439,8 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) * control. But such a configuration is useless, so * let's keep the code simple. */ - if (!is_error_page(page)) { - vmx->nested.virtual_apic_page = page; - hpa = page_to_phys(vmx->nested.virtual_apic_page); - vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, hpa); - } + if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->virtual_apic_page_addr), map)) + vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, pfn_to_hpa(map->pfn)); } if (nested_cpu_has_posted_intr(vmcs12)) { @@ -13353,10 +13344,7 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason, kvm_release_page_dirty(vmx->nested.apic_access_page); vmx->nested.apic_access_page = NULL; } - if (vmx->nested.virtual_apic_page) { - kvm_release_page_dirty(vmx->nested.virtual_apic_page); - vmx->nested.virtual_apic_page = NULL; - } + kvm_vcpu_unmap(>nested.virtual_apic_map); if (vmx->nested.pi_desc_page) { kunmap(vmx->nested.pi_desc_page); kvm_release_page_dirty(vmx->nested.pi_desc_page); -- 2.7.4
[PATCH v3 09/13] KVM/X86: Use kvm_vcpu_map in emulator_cmpxchg_emulated
Use kvm_vcpu_map in emulator_cmpxchg_emulated since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". Signed-off-by: KarimAllah Ahmed --- v1 -> v2: - Update to match the new API return codes --- arch/x86/kvm/x86.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ca71773..f083e53 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5280,9 +5280,9 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt, unsigned int bytes, struct x86_exception *exception) { + struct kvm_host_map map; struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); gpa_t gpa; - struct page *page; char *kaddr; bool exchanged; @@ -5299,12 +5299,11 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt, if (((gpa + bytes - 1) & PAGE_MASK) != (gpa & PAGE_MASK)) goto emul_write; - page = kvm_vcpu_gfn_to_page(vcpu, gpa >> PAGE_SHIFT); - if (is_error_page(page)) + if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map)) goto emul_write; - kaddr = kmap_atomic(page); - kaddr += offset_in_page(gpa); + kaddr = map.hva + offset_in_page(gpa); + switch (bytes) { case 1: exchanged = CMPXCHG_TYPE(u8, kaddr, old, new); @@ -5321,8 +5320,8 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt, default: BUG(); } - kunmap_atomic(kaddr); - kvm_release_page_dirty(page); + + kvm_vcpu_unmap(&map); if (!exchanged) return X86EMUL_CMPXCHG_FAILED; -- 2.7.4
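The width-dispatched compare-and-exchange that emulator_cmpxchg_emulated performs on the mapped hva can be sketched in plain C. The kernel's CMPXCHG_TYPE macro ultimately does an atomic cmpxchg of the given width; here the GCC `__sync` builtin stands in for it, and the function name is a hypothetical one for illustration:

```c
#include <stdint.h>

/* Sketch of the bytes-dispatched cmpxchg on a mapped host address.
 * Returns 1 if the exchange happened, 0 otherwise (the kernel BUG()s
 * on an unsupported width; here we just fail). */
int cmpxchg_emulated(void *kaddr, const void *old, const void *new,
		     unsigned int bytes)
{
	switch (bytes) {
	case 1:
		return __sync_bool_compare_and_swap((uint8_t *)kaddr,
						    *(const uint8_t *)old,
						    *(const uint8_t *)new);
	case 2:
		return __sync_bool_compare_and_swap((uint16_t *)kaddr,
						    *(const uint16_t *)old,
						    *(const uint16_t *)new);
	case 4:
		return __sync_bool_compare_and_swap((uint32_t *)kaddr,
						    *(const uint32_t *)old,
						    *(const uint32_t *)new);
	case 8:
		return __sync_bool_compare_and_swap((uint64_t *)kaddr,
						    *(const uint64_t *)old,
						    *(const uint64_t *)new);
	default:
		return 0;
	}
}
```

Note the emulator's precondition (visible in the hunk above): the operand must not cross a page boundary, since only one page is mapped.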
[PATCH v3 00/13] KVM/X86: Introduce a new guest mapping interface
Guest memory can either be directly managed by the kernel (i.e. have a "struct page") or it can simply live outside kernel control (i.e. not have a "struct page"). KVM mostly supports these two modes, except in a few places where the code seems to assume that guest memory must have a "struct page". This patchset introduces a new mapping interface to map guest memory into host kernel memory which also supports PFN-based memory (i.e. memory without a 'struct page'). It also converts all offending code to this interface or simply reads/writes directly from guest memory. As far as I can see, all offending code is now fixed except the APIC-access page, which I will handle in a separate series along with dropping kvm_vcpu_gfn_to_page and kvm_vcpu_gpa_to_page from the internal KVM API. v2 -> v3: - rebase - Add a new patch to also fix the newly introduced shadow VMCS support for nested. Filippo Sironi (1): X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs KarimAllah Ahmed (12): X86/nVMX: handle_vmon: Read 4 bytes from guest memory X86/nVMX: handle_vmptrld: Copy the VMCS12 directly from guest memory X86/nVMX: Update the PML table without mapping and unmapping the page KVM: Introduce a new guest mapping API KVM/nVMX: Use kvm_vcpu_map when mapping the L1 MSR bitmap KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page KVM/nVMX: Use kvm_vcpu_map when mapping the posted interrupt descriptor table KVM/X86: Use kvm_vcpu_map in emulator_cmpxchg_emulated KVM/X86: hyperv: Use kvm_vcpu_map in synic_clear_sint_msg_pending KVM/X86: hyperv: Use kvm_vcpu_map in synic_deliver_msg KVM/nSVM: Use the new mapping API for mapping guest memory KVM/nVMX: Use kvm_vcpu_map for accessing the shadow VMCS arch/x86/kvm/hyperv.c | 28 arch/x86/kvm/paging_tmpl.h | 38 --- arch/x86/kvm/svm.c | 97 +- arch/x86/kvm/vmx.c | 167 + arch/x86/kvm/x86.c | 13 ++-- include/linux/kvm_host.h | 9 +++ virt/kvm/kvm_main.c | 50 ++ 7 files changed, 217 insertions(+), 185 deletions(-) -- 2.7.4
[PATCH v3 13/13] KVM/nVMX: Use kvm_vcpu_map for accessing the shadow VMCS
Use kvm_vcpu_map for accessing the shadow VMCS since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". Signed-off-by: KarimAllah Ahmed --- arch/x86/kvm/vmx.c | 25 - 1 file changed, 12 insertions(+), 13 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 0ae25c3..2892770 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -11616,20 +11616,20 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu, static void nested_cache_shadow_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { + struct kvm_host_map map; struct vmcs12 *shadow; - struct page *page; if (!nested_cpu_has_shadow_vmcs(vmcs12) || vmcs12->vmcs_link_pointer == -1ull) return; shadow = get_shadow_vmcs12(vcpu); - page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->vmcs_link_pointer); - memcpy(shadow, kmap(page), VMCS12_SIZE); + if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->vmcs_link_pointer), &map)) + return; - kunmap(page); - kvm_release_page_clean(page); + memcpy(shadow, map.hva, VMCS12_SIZE); + kvm_vcpu_unmap(&map); } static void nested_flush_cached_shadow_vmcs12(struct kvm_vcpu *vcpu, @@ -12494,9 +12494,9 @@ static int check_vmentry_prereqs(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) static int nested_vmx_check_vmcs_link_ptr(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { - int r; - struct page *page; + int r = 0; struct vmcs12 *shadow; + struct kvm_host_map map; if (vmcs12->vmcs_link_pointer == -1ull) return 0; @@ -12504,17 +12504,16 @@ static int nested_vmx_check_vmcs_link_ptr(struct kvm_vcpu *vcpu, if (!page_address_valid(vcpu, vmcs12->vmcs_link_pointer)) return -EINVAL; - page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->vmcs_link_pointer); - if (is_error_page(page)) + if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->vmcs_link_pointer), &map)) return -EINVAL; - r = 0; - shadow = kmap(page); + shadow = map.hva; + if (shadow->hdr.revision_id != VMCS12_REVISION || shadow->hdr.shadow_vmcs != nested_cpu_has_shadow_vmcs(vmcs12)) r = -EINVAL; -
kunmap(page); - kvm_release_page_clean(page); + + kvm_vcpu_unmap(&map); return r; } -- 2.7.4
[PATCH v3 12/13] KVM/nSVM: Use the new mapping API for mapping guest memory
Use the new mapping API for mapping guest memory to avoid depending on "struct page". Signed-off-by: KarimAllah Ahmed --- arch/x86/kvm/svm.c | 97 +++--- 1 file changed, 49 insertions(+), 48 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index d96092b..911d853 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3036,32 +3036,6 @@ static inline bool nested_svm_nmi(struct vcpu_svm *svm) return false; } -static void *nested_svm_map(struct vcpu_svm *svm, u64 gpa, struct page **_page) -{ - struct page *page; - - might_sleep(); - - page = kvm_vcpu_gfn_to_page(&svm->vcpu, gpa >> PAGE_SHIFT); - if (is_error_page(page)) - goto error; - - *_page = page; - - return kmap(page); - -error: - kvm_inject_gp(&svm->vcpu, 0); - - return NULL; -} - -static void nested_svm_unmap(struct page *page) -{ - kunmap(page); - kvm_release_page_dirty(page); -} - static int nested_svm_intercept_ioio(struct vcpu_svm *svm) { unsigned port, size, iopm_len; @@ -3262,10 +3236,11 @@ static inline void copy_vmcb_control_area(struct vmcb *dst_vmcb, struct vmcb *fr static int nested_svm_vmexit(struct vcpu_svm *svm) { + int rc; struct vmcb *nested_vmcb; struct vmcb *hsave = svm->nested.hsave; struct vmcb *vmcb = svm->vmcb; - struct page *page; + struct kvm_host_map map; trace_kvm_nested_vmexit_inject(vmcb->control.exit_code, vmcb->control.exit_info_1, @@ -3274,9 +3249,14 @@ static int nested_svm_vmexit(struct vcpu_svm *svm) vmcb->control.exit_int_info_err, KVM_ISA_SVM); - nested_vmcb = nested_svm_map(svm, svm->nested.vmcb, &page); - if (!nested_vmcb) + rc = kvm_vcpu_map(&svm->vcpu, gfn_to_gpa(svm->nested.vmcb), &map); + if (rc) { + if (rc == -EINVAL) + kvm_inject_gp(&svm->vcpu, 0); return 1; + } + + nested_vmcb = map.hva; /* Exit Guest-Mode */ leave_guest_mode(&svm->vcpu); @@ -3375,7 +3355,7 @@ static int nested_svm_vmexit(struct vcpu_svm *svm) mark_all_dirty(svm->vmcb); - nested_svm_unmap(page); + kvm_vcpu_unmap(&map); nested_svm_uninit_mmu_context(&svm->vcpu); kvm_mmu_reset_context(&svm->vcpu); @@ -3433,7 +3413,7 @@ static
bool nested_vmcb_checks(struct vmcb *vmcb) } static void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, -struct vmcb *nested_vmcb, struct page *page) +struct vmcb *nested_vmcb, struct kvm_host_map *map) { if (kvm_get_rflags(&svm->vcpu) & X86_EFLAGS_IF) svm->vcpu.arch.hflags |= HF_HIF_MASK; @@ -3513,7 +3493,7 @@ static void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, svm->vmcb->control.event_inj = nested_vmcb->control.event_inj; svm->vmcb->control.event_inj_err = nested_vmcb->control.event_inj_err; - nested_svm_unmap(page); + kvm_vcpu_unmap(map); /* Enter Guest-Mode */ enter_guest_mode(&svm->vcpu); @@ -3533,17 +3513,23 @@ static void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, static bool nested_svm_vmrun(struct vcpu_svm *svm) { + int rc; struct vmcb *nested_vmcb; struct vmcb *hsave = svm->nested.hsave; struct vmcb *vmcb = svm->vmcb; - struct page *page; + struct kvm_host_map map; u64 vmcb_gpa; vmcb_gpa = svm->vmcb->save.rax; - nested_vmcb = nested_svm_map(svm, svm->vmcb->save.rax, &page); - if (!nested_vmcb) + rc = kvm_vcpu_map(&svm->vcpu, gfn_to_gpa(vmcb_gpa), &map); + if (rc) { + if (rc == -EINVAL) + kvm_inject_gp(&svm->vcpu, 0); return false; + } + + nested_vmcb = map.hva; if (!nested_vmcb_checks(nested_vmcb)) { nested_vmcb->control.exit_code = SVM_EXIT_ERR; @@ -3551,7 +3537,7 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm) nested_vmcb->control.exit_info_1 = 0; nested_vmcb->control.exit_info_2 = 0; - nested_svm_unmap(page); + kvm_vcpu_unmap(&map); return false; } @@ -3595,7 +3581,7 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm) copy_vmcb_control_area(hsave, vmcb); - enter_svm_guest_mode(svm, vmcb_gpa, nested_vmcb, page); + enter_svm_guest_mode(svm, vmcb_gpa, nested_vmcb, &map); return true; } @@ -3619,21 +3605,26 @@ static void nested_svm_vmloadsave(struct vmcb *from_vmcb, struct vmcb *to_vmcb) static int vmload_interception(struct vcpu_svm *svm) { struct vmcb *nested_vmcb; - struct page *page; + struct kvm_host_map map; int ret; if
(nested_svm_check_permissions(svm))
[PATCH v3 02/13] X86/nVMX: handle_vmptrld: Copy the VMCS12 directly from guest memory
Copy the VMCS12 directly from guest memory instead of the map->copy->unmap sequence. This also avoids using kvm_vcpu_gpa_to_page() and kmap() which assumes that there is a "struct page" for guest memory. Signed-off-by: KarimAllah Ahmed --- v1 -> v2: - Massage commit message a bit. --- arch/x86/kvm/vmx.c | 23 +-- 1 file changed, 9 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 358759a..bc45347 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -8879,33 +8879,28 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu) } if (vmx->nested.current_vmptr != vmptr) { - struct vmcs12 *new_vmcs12; - struct page *page; - page = kvm_vcpu_gpa_to_page(vcpu, vmptr); - if (is_error_page(page)) { + struct vmcs12 *new_vmcs12 = (struct vmcs12 *)__get_free_page(GFP_KERNEL); + + if (!new_vmcs12 || + kvm_read_guest(vcpu->kvm, vmptr, new_vmcs12, + sizeof(*new_vmcs12))) { + free_page((unsigned long)new_vmcs12); nested_vmx_failInvalid(vcpu); return kvm_skip_emulated_instruction(vcpu); } - new_vmcs12 = kmap(page); + if (new_vmcs12->hdr.revision_id != VMCS12_REVISION || (new_vmcs12->hdr.shadow_vmcs && !nested_cpu_has_vmx_shadow_vmcs(vcpu))) { - kunmap(page); - kvm_release_page_clean(page); + free_page((unsigned long)new_vmcs12); nested_vmx_failValid(vcpu, VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID); return kvm_skip_emulated_instruction(vcpu); } nested_release_vmcs12(vmx); - /* -* Load VMCS12 from guest memory since it is not already -* cached. -*/ memcpy(vmx->nested.cached_vmcs12, new_vmcs12, VMCS12_SIZE); - kunmap(page); - kvm_release_page_clean(page); - + free_page((unsigned long)new_vmcs12); set_current_vmptr(vmx, vmptr); } -- 2.7.4
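The new handle_vmptrld flow — read the candidate VMCS12 into a temporary buffer straight from guest memory, validate its header, and only then install it into the cache — can be sketched in userspace C. The struct layout, the revision constant, and all function names here are illustrative, not the real VMCS12 definitions:

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define FAKE_VMCS12_REVISION 0x12u	/* illustrative, not VMCS12_REVISION */

struct fake_vmcs12 {
	uint32_t revision_id;
	uint32_t shadow_vmcs;
	uint8_t data[56];
};

/* Stand-in for kvm_read_guest(): copy from "guest" memory, 0 on success. */
static int read_guest(const void *guest_mem, void *dst, size_t len)
{
	if (!guest_mem)
		return -1;
	memcpy(dst, guest_mem, len);
	return 0;
}

/* Read, validate, then cache — freeing the temporary buffer on every
 * path, as the patch does with free_page(). Returns 0 on success. */
int load_vmcs12(const void *guest_mem, struct fake_vmcs12 *cache)
{
	struct fake_vmcs12 *tmp = malloc(sizeof(*tmp));

	if (!tmp || read_guest(guest_mem, tmp, sizeof(*tmp))) {
		free(tmp);
		return -1;
	}
	if (tmp->revision_id != FAKE_VMCS12_REVISION) {
		free(tmp);
		return -1;
	}
	memcpy(cache, tmp, sizeof(*tmp));
	free(tmp);
	return 0;
}
```

The key property the patch gains is that no host mapping of the guest page is ever created, so the path works for memory without a "struct page".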
[PATCH V4 3/5] KVM: X86: Adding skeleton for Memory ROE
This patch introduces a hypercall implemented for X86 that can assist against a subset of kernel rootkits; it works by placing read-only protection in the shadow PTEs. The resulting protection is also kept in a bitmap for each kvm_memory_slot and is used as a reference when updating SPTEs. The whole goal is to protect the guest kernel static data from modification even if the attacker is running in guest ring 0; for this reason there is no hypercall to revert the effect of the Memory ROE hypercall. This patch doesn't implement an integrity check on the guest TLB, so an obvious attack on the current implementation will involve guest virtual address -> guest physical address remapping, but there are plans to fix that. Signed-off-by: Ahmed Abd El Mawgood --- arch/x86/include/asm/kvm_host.h | 11 ++- arch/x86/kvm/Kconfig| 7 ++ arch/x86/kvm/mmu.c | 72 +--- arch/x86/kvm/x86.c | 143 +++- include/linux/kvm_host.h| 3 + include/uapi/linux/kvm_para.h | 4 + virt/kvm/kvm_main.c | 34 +++- 7 files changed, 255 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 09b2e3e2cf1b..aa080c3e302e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -238,6 +238,15 @@ struct kvm_mmu_memory_cache { void *objects[KVM_NR_MEM_OBJS]; }; +/* + * This is an internal structure used to be able to access the kvm memory slot + * and keep track of the current PTE number when doing a shadow PTE walk + */ +struct kvm_write_access_data { + int i; + struct kvm_memory_slot *memslot; +}; + /* * the pages used as guest page table on soft mmu are tracked by * kvm_memory_slot.arch.gfn_track which is 16 bits, so the role bits used @@ -1178,7 +1187,7 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask, u64 acc_track_mask, u64 me_mask); void kvm_mmu_reset_context(struct kvm_vcpu *vcpu); -void kvm_mmu_slot_remove_write_access(struct kvm *kvm, +void kvm_mmu_slot_apply_write_access(struct kvm *kvm, struct kvm_memory_slot *memslot); void
kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, const struct kvm_memory_slot *memslot); diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 1bbec387d289..2fcbb1788a24 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -96,6 +96,13 @@ config KVM_MMU_AUDIT This option adds a R/W kVM module parameter 'mmu_audit', which allows auditing of KVM MMU events at runtime. +config KVM_ROE + bool "Hypercall Memory Read-Only Enforcement" + depends on KVM && X86 + help + This option adds KVM_HC_ROE hypercall to kvm as a hardening + mechanism to protect memory pages from being edited. + # OK, it's a little counter-intuitive to do this, but it puts it neatly under # the virtualization menu. source drivers/vhost/Kconfig diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index cc36abe1ee44..c54aa5287e14 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1484,9 +1484,8 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect) return mmu_spte_update(sptep, spte); } -static bool __rmap_write_protect(struct kvm *kvm, -struct kvm_rmap_head *rmap_head, -bool pt_protect, void *data) +static bool __rmap_write_protection(struct kvm *kvm, + struct kvm_rmap_head *rmap_head, bool pt_protect) { u64 *sptep; struct rmap_iterator iter; @@ -1498,6 +1497,38 @@ static bool __rmap_write_protect(struct kvm *kvm, return flush; } +#ifdef CONFIG_KVM_ROE +static bool __rmap_write_protect_roe(struct kvm *kvm, + struct kvm_rmap_head *rmap_head, + bool pt_protect, + struct kvm_write_access_data *d) +{ + u64 *sptep; + struct rmap_iterator iter; + bool prot; + bool flush = false; + + for_each_rmap_spte(rmap_head, &iter, sptep) { + prot = !test_bit(d->i, d->memslot->roe_bitmap) && pt_protect; + flush |= spte_write_protect(sptep, prot); + d->i++; + } + return flush; +} +#endif + +static bool __rmap_write_protect(struct kvm *kvm, + struct kvm_rmap_head *rmap_head, + bool pt_protect, + struct kvm_write_access_data *d) +{ +#ifdef CONFIG_KVM_ROE + if (d != NULL) + return
__rmap_write_protect_roe(kvm, rmap_head, pt_protect, d); +#endif + return __rmap_write_protection(kvm, rmap_head, pt_protect); +} + static bool spte_clear_dirty(u64 *sptep) { u64 spte = *sptep; @@ -1585,7 +1616,7 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, while (mask) { rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
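The per-page predicate that __rmap_write_protect_roe computes — `prot = !test_bit(d->i, d->memslot->roe_bitmap) && pt_protect;` — can be reproduced standalone. A plain uint64_t stands in for the kernel bitmap, and both function names below are hypothetical; this sketch only mirrors that one line of logic, not the SPTE update around it:

```c
#include <stdint.h>

/* Stand-in for the kernel's test_bit() on a 64-bit-wide bitmap. */
static int test_bit64(int nr, uint64_t bitmap)
{
	return (int)((bitmap >> nr) & 1u);
}

/* Mirrors the patch's predicate: regular write protection is applied
 * only when requested (pt_protect) and the page's roe_bitmap bit is
 * clear; pages with the bit set are handled by the ROE path instead. */
int should_write_protect(int page_index, uint64_t roe_bitmap, int pt_protect)
{
	return !test_bit64(page_index, roe_bitmap) && pt_protect;
}
```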
[PATCH v3 08/13] KVM/nVMX: Use kvm_vcpu_map when mapping the posted interrupt descriptor table
Use kvm_vcpu_map when mapping the posted interrupt descriptor table since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". One additional semantic change is that the virtual host mapping lifecycle has changed a bit. It now has the same lifetime as the pinning of the interrupt descriptor table page on the host side. Signed-off-by: KarimAllah Ahmed --- v1 -> v2: - Do not change the lifecycle of the mapping (pbonzini) --- arch/x86/kvm/vmx.c | 45 +++-- 1 file changed, 15 insertions(+), 30 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 83a5e95..0ae25c3 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -846,7 +846,7 @@ struct nested_vmx { */ struct page *apic_access_page; struct kvm_host_map virtual_apic_map; - struct page *pi_desc_page; + struct kvm_host_map pi_desc_map; struct kvm_host_map msr_bitmap_map; struct pi_desc *pi_desc; @@ -8469,12 +8469,8 @@ static void free_nested(struct vcpu_vmx *vmx) vmx->nested.apic_access_page = NULL; } kvm_vcpu_unmap(&vmx->nested.virtual_apic_map); - if (vmx->nested.pi_desc_page) { - kunmap(vmx->nested.pi_desc_page); - kvm_release_page_dirty(vmx->nested.pi_desc_page); - vmx->nested.pi_desc_page = NULL; - vmx->nested.pi_desc = NULL; - } + kvm_vcpu_unmap(&vmx->nested.pi_desc_map); + vmx->nested.pi_desc = NULL; free_loaded_vmcs(&vmx->nested.vmcs02); } @@ -11444,24 +11440,16 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) } if (nested_cpu_has_posted_intr(vmcs12)) { - if (vmx->nested.pi_desc_page) { /* shouldn't happen */ - kunmap(vmx->nested.pi_desc_page); - kvm_release_page_dirty(vmx->nested.pi_desc_page); - vmx->nested.pi_desc_page = NULL; + map = &vmx->nested.pi_desc_map; + + if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->posted_intr_desc_addr), map)) { + vmx->nested.pi_desc = + (struct pi_desc *)(((void *)map->hva) + + offset_in_page(vmcs12->posted_intr_desc_addr)); + vmcs_write64(POSTED_INTR_DESC_ADDR, pfn_to_hpa(map->pfn) +
+ offset_in_page(vmcs12->posted_intr_desc_addr)); } - page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->posted_intr_desc_addr); - if (is_error_page(page)) - return; - vmx->nested.pi_desc_page = page; - vmx->nested.pi_desc = kmap(vmx->nested.pi_desc_page); - vmx->nested.pi_desc = - (struct pi_desc *)((void *)vmx->nested.pi_desc + - (unsigned long)(vmcs12->posted_intr_desc_addr & - (PAGE_SIZE - 1))); - vmcs_write64(POSTED_INTR_DESC_ADDR, - page_to_phys(vmx->nested.pi_desc_page) + - (unsigned long)(vmcs12->posted_intr_desc_addr & - (PAGE_SIZE - 1))); + } if (nested_vmx_prepare_msr_bitmap(vcpu, vmcs12)) vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL, @@ -13344,13 +13332,10 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason, kvm_release_page_dirty(vmx->nested.apic_access_page); vmx->nested.apic_access_page = NULL; } + kvm_vcpu_unmap(&vmx->nested.virtual_apic_map); - if (vmx->nested.pi_desc_page) { - kunmap(vmx->nested.pi_desc_page); - kvm_release_page_dirty(vmx->nested.pi_desc_page); - vmx->nested.pi_desc_page = NULL; - vmx->nested.pi_desc = NULL; - } + kvm_vcpu_unmap(&vmx->nested.pi_desc_map); + vmx->nested.pi_desc = NULL; /* * We are now running in L2, mmu_notifier will force to reload the -- 2.7.4
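The posted-interrupt descriptor need not be page-aligned, so both the host pointer and the physical address written to the VMCS add the descriptor's offset within its page (`offset_in_page(...)`, i.e. `addr & (PAGE_SIZE - 1)`). A sketch of that arithmetic, assuming a 4 KiB page and hypothetical function names:

```c
#include <stdint.h>

#define PAGE_SIZE_SK 4096ull	/* illustrative 4 KiB page */

/* Equivalent of the kernel's offset_in_page(): low bits of the address. */
static uint64_t offset_in_page_sk(uint64_t addr)
{
	return addr & (PAGE_SIZE_SK - 1);
}

/* Physical address of the descriptor, given the mapping's page-aligned
 * host physical address (what pfn_to_hpa(map->pfn) yields) and the
 * guest-physical descriptor address. */
uint64_t desc_hpa(uint64_t page_hpa, uint64_t desc_gpa)
{
	return page_hpa + offset_in_page_sk(desc_gpa);
}
```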
[PATCH v3 04/13] X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs
From: Filippo Sironi cmpxchg_gpte() calls get_user_pages_fast() to retrieve the number of pages and the respective struct page to map in the kernel virtual address space. This doesn't work if get_user_pages_fast() is invoked with a userspace virtual address that's backed by PFNs outside of kernel reach (e.g., when limiting the kernel memory with mem= in the command line and using /dev/mem to map memory). If get_user_pages_fast() fails, look up the VMA that backs the userspace virtual address, compute the PFN and the physical address, and map it in the kernel virtual address space with memremap(). Signed-off-by: Filippo Sironi Signed-off-by: KarimAllah Ahmed --- arch/x86/kvm/paging_tmpl.h | 38 +- 1 file changed, 29 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 14ffd97..8f7bc8f 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -141,15 +141,35 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, struct page *page; npages = get_user_pages_fast((unsigned long)ptep_user, 1, 1, &page); - /* Check if the user is doing something meaningless. */ - if (unlikely(npages != 1)) - return -EFAULT; - - table = kmap_atomic(page); - ret = CMPXCHG(&table[index], orig_pte, new_pte); - kunmap_atomic(table); - - kvm_release_page_dirty(page); + if (likely(npages == 1)) { + table = kmap_atomic(page); + ret = CMPXCHG(&table[index], orig_pte, new_pte); + kunmap_atomic(table); + + kvm_release_page_dirty(page); + } else { + struct vm_area_struct *vma; + unsigned long vaddr = (unsigned long)ptep_user & PAGE_MASK; + unsigned long pfn; + unsigned long paddr; + + down_read(&current->mm->mmap_sem); + vma = find_vma_intersection(current->mm, vaddr, vaddr + PAGE_SIZE); + if (!vma || !(vma->vm_flags & VM_PFNMAP)) { + up_read(&current->mm->mmap_sem); + return -EFAULT; + } + pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; + paddr = pfn << PAGE_SHIFT; + table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB); + if (!table) { + up_read(&current->mm->mmap_sem); + return -EFAULT; + } + ret = CMPXCHG(&table[index], orig_pte, new_pte); + memunmap(table); + up_read(&current->mm->mmap_sem); + } return (ret != orig_pte); } -- 2.7.4
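The fallback path's address arithmetic — for a VM_PFNMAP VMA, the pfn backing a user virtual address is the page offset of vaddr within the VMA plus the VMA's starting pfn (vm_pgoff), and the physical address is that pfn shifted back up — can be checked in isolation. Function names and the 4 KiB page constant here are illustrative:

```c
#include <stdint.h>

#define PAGE_SHIFT_SK 12	/* illustrative: 4 KiB pages */

/* pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; */
uint64_t vma_pfn(uint64_t vaddr, uint64_t vm_start, uint64_t vm_pgoff)
{
	return ((vaddr - vm_start) >> PAGE_SHIFT_SK) + vm_pgoff;
}

/* paddr = pfn << PAGE_SHIFT; — the address handed to memremap(). */
uint64_t vma_paddr(uint64_t vaddr, uint64_t vm_start, uint64_t vm_pgoff)
{
	return vma_pfn(vaddr, vm_start, vm_pgoff) << PAGE_SHIFT_SK;
}
```

This is exactly why the patch masks ptep_user with PAGE_MASK first: the arithmetic is per-page, and the in-page offset is re-applied via `table[index]`.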
[PATCH v3 02/13] X86/nVMX: handle_vmptrld: Copy the VMCS12 directly from guest memory
Copy the VMCS12 directly from guest memory instead of the map->copy->unmap sequence. This also avoids using kvm_vcpu_gpa_to_page() and kmap() which assumes that there is a "struct page" for guest memory. Signed-off-by: KarimAllah Ahmed --- v1 -> v2: - Massage commit message a bit. --- arch/x86/kvm/vmx.c | 23 +-- 1 file changed, 9 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 358759a..bc45347 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -8879,33 +8879,28 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu) } if (vmx->nested.current_vmptr != vmptr) { - struct vmcs12 *new_vmcs12; - struct page *page; - page = kvm_vcpu_gpa_to_page(vcpu, vmptr); - if (is_error_page(page)) { + struct vmcs12 *new_vmcs12 = (struct vmcs12 *)__get_free_page(GFP_KERNEL); + + if (!new_vmcs12 || + kvm_read_guest(vcpu->kvm, vmptr, new_vmcs12, + sizeof(*new_vmcs12))) { + free_page((unsigned long)new_vmcs12); nested_vmx_failInvalid(vcpu); return kvm_skip_emulated_instruction(vcpu); } - new_vmcs12 = kmap(page); + if (new_vmcs12->hdr.revision_id != VMCS12_REVISION || (new_vmcs12->hdr.shadow_vmcs && !nested_cpu_has_vmx_shadow_vmcs(vcpu))) { - kunmap(page); - kvm_release_page_clean(page); + free_page((unsigned long)new_vmcs12); nested_vmx_failValid(vcpu, VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID); return kvm_skip_emulated_instruction(vcpu); } nested_release_vmcs12(vmx); - /* -* Load VMCS12 from guest memory since it is not already -* cached. -*/ memcpy(vmx->nested.cached_vmcs12, new_vmcs12, VMCS12_SIZE); - kunmap(page); - kvm_release_page_clean(page); - + free_page((unsigned long)new_vmcs12); set_current_vmptr(vmx, vmptr); } -- 2.7.4
[PATCH V4 3/5] KVM: X86: Adding skeleton for Memory ROE
This patch introduces a hypercall implemented for X86 that can assist against subset of kernel rootkits, it works by place readonly protection in shadow PTE. The end result protection is also kept in a bitmap for each kvm_memory_slot and is used as reference when updating SPTEs. The whole goal is to protect the guest kernel static data from modification if attacker is running from guest ring 0, for this reason there is no hypercall to revert effect of Memory ROE hypercall. This patch doesn't implement integrity check on guest TLB so obvious attack on the current implementation will involve guest virtual address -> guest physical address remapping, but there are plans to fix that. Signed-off-by: Ahmed Abd El Mawgood --- arch/x86/include/asm/kvm_host.h | 11 ++- arch/x86/kvm/Kconfig| 7 ++ arch/x86/kvm/mmu.c | 72 +--- arch/x86/kvm/x86.c | 143 +++- include/linux/kvm_host.h| 3 + include/uapi/linux/kvm_para.h | 4 + virt/kvm/kvm_main.c | 34 +++- 7 files changed, 255 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 09b2e3e2cf1b..aa080c3e302e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -238,6 +238,15 @@ struct kvm_mmu_memory_cache { void *objects[KVM_NR_MEM_OBJS]; }; +/* + * This is internal structure used to be be able to access kvm memory slot and + * have track of the number of current PTE when doing shadow PTE walk + */ +struct kvm_write_access_data { + int i; + struct kvm_memory_slot *memslot; +}; + /* * the pages used as guest page table on soft mmu are tracked by * kvm_memory_slot.arch.gfn_track which is 16 bits, so the role bits used @@ -1178,7 +1187,7 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask, u64 acc_track_mask, u64 me_mask); void kvm_mmu_reset_context(struct kvm_vcpu *vcpu); -void kvm_mmu_slot_remove_write_access(struct kvm *kvm, +void kvm_mmu_slot_apply_write_access(struct kvm *kvm, struct kvm_memory_slot *memslot); void 
kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, const struct kvm_memory_slot *memslot); diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 1bbec387d289..2fcbb1788a24 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -96,6 +96,13 @@ config KVM_MMU_AUDIT This option adds a R/W kVM module parameter 'mmu_audit', which allows auditing of KVM MMU events at runtime. +config KVM_ROE + bool "Hypercall Memory Read-Only Enforcement" + depends on KVM && X86 + help + This option adds KVM_HC_ROE hypercall to kvm as a hardening + mechanism to protect memory pages from being edited. + # OK, it's a little counter-intuitive to do this, but it puts it neatly under # the virtualization menu. source drivers/vhost/Kconfig diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index cc36abe1ee44..c54aa5287e14 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1484,9 +1484,8 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect) return mmu_spte_update(sptep, spte); } -static bool __rmap_write_protect(struct kvm *kvm, -struct kvm_rmap_head *rmap_head, -bool pt_protect, void *data) +static bool __rmap_write_protection(struct kvm *kvm, + struct kvm_rmap_head *rmap_head, bool pt_protect) { u64 *sptep; struct rmap_iterator iter; @@ -1498,6 +1497,38 @@ static bool __rmap_write_protect(struct kvm *kvm, return flush; } +#ifdef CONFIG_KVM_ROE +static bool __rmap_write_protect_roe(struct kvm *kvm, + struct kvm_rmap_head *rmap_head, + bool pt_protect, + struct kvm_write_access_data *d) +{ + u64 *sptep; + struct rmap_iterator iter; + bool prot; + bool flush = false; + + for_each_rmap_spte(rmap_head, &iter, sptep) { + prot = !test_bit(d->i, d->memslot->roe_bitmap) && pt_protect; + flush |= spte_write_protect(sptep, prot); + d->i++; + } + return flush; +} +#endif + +static bool __rmap_write_protect(struct kvm *kvm, + struct kvm_rmap_head *rmap_head, + bool pt_protect, + struct kvm_write_access_data *d) +{ +#ifdef CONFIG_KVM_ROE + if (d != NULL) + return
__rmap_write_protect_roe(kvm, rmap_head, pt_protect, d); +#endif + return __rmap_write_protection(kvm, rmap_head, pt_protect); +} + static bool spte_clear_dirty(u64 *sptep) { u64 spte = *sptep; @@ -1585,7 +1616,7 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, while (mask) { rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
[PATCH v3 08/13] KVM/nVMX: Use kvm_vcpu_map when mapping the posted interrupt descriptor table
Use kvm_vcpu_map when mapping the posted interrupt descriptor table since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". One additional semantic change is that the virtual host mapping lifecycle has changed a bit. It now has the same lifetime as the pinning of the interrupt descriptor table page on the host side. Signed-off-by: KarimAllah Ahmed --- v1 -> v2: - Do not change the lifecycle of the mapping (pbonzini) --- arch/x86/kvm/vmx.c | 45 +++-- 1 file changed, 15 insertions(+), 30 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 83a5e95..0ae25c3 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -846,7 +846,7 @@ struct nested_vmx { */ struct page *apic_access_page; struct kvm_host_map virtual_apic_map; - struct page *pi_desc_page; + struct kvm_host_map pi_desc_map; struct kvm_host_map msr_bitmap_map; struct pi_desc *pi_desc; @@ -8469,12 +8469,8 @@ static void free_nested(struct vcpu_vmx *vmx) vmx->nested.apic_access_page = NULL; } kvm_vcpu_unmap(&vmx->nested.virtual_apic_map); - if (vmx->nested.pi_desc_page) { - kunmap(vmx->nested.pi_desc_page); - kvm_release_page_dirty(vmx->nested.pi_desc_page); - vmx->nested.pi_desc_page = NULL; - vmx->nested.pi_desc = NULL; - } + kvm_vcpu_unmap(&vmx->nested.pi_desc_map); + vmx->nested.pi_desc = NULL; free_loaded_vmcs(&vmx->nested.vmcs02); } @@ -11444,24 +11440,16 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) } if (nested_cpu_has_posted_intr(vmcs12)) { - if (vmx->nested.pi_desc_page) { /* shouldn't happen */ - kunmap(vmx->nested.pi_desc_page); - kvm_release_page_dirty(vmx->nested.pi_desc_page); - vmx->nested.pi_desc_page = NULL; + map = &vmx->nested.pi_desc_map; + + if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->posted_intr_desc_addr), map)) { + vmx->nested.pi_desc = + (struct pi_desc *)(((void *)map->hva) + + offset_in_page(vmcs12->posted_intr_desc_addr)); + vmcs_write64(POSTED_INTR_DESC_ADDR, pfn_to_hpa(map->pfn) + + offset_in_page(vmcs12->posted_intr_desc_addr)); } - page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->posted_intr_desc_addr); - if (is_error_page(page)) - return; - vmx->nested.pi_desc_page = page; - vmx->nested.pi_desc = kmap(vmx->nested.pi_desc_page); - vmx->nested.pi_desc = - (struct pi_desc *)((void *)vmx->nested.pi_desc + - (unsigned long)(vmcs12->posted_intr_desc_addr & - (PAGE_SIZE - 1))); - vmcs_write64(POSTED_INTR_DESC_ADDR, - page_to_phys(vmx->nested.pi_desc_page) + - (unsigned long)(vmcs12->posted_intr_desc_addr & - (PAGE_SIZE - 1))); + } if (nested_vmx_prepare_msr_bitmap(vcpu, vmcs12)) vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL, @@ -13344,13 +13332,10 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason, kvm_release_page_dirty(vmx->nested.apic_access_page); vmx->nested.apic_access_page = NULL; } + kvm_vcpu_unmap(&vmx->nested.virtual_apic_map); - if (vmx->nested.pi_desc_page) { - kunmap(vmx->nested.pi_desc_page); - kvm_release_page_dirty(vmx->nested.pi_desc_page); - vmx->nested.pi_desc_page = NULL; - vmx->nested.pi_desc = NULL; - } + kvm_vcpu_unmap(&vmx->nested.pi_desc_map); + vmx->nested.pi_desc = NULL; /* * We are now running in L2, mmu_notifier will force to reload the -- 2.7.4
[PATCH v3 04/13] X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs
From: Filippo Sironi cmpxchg_gpte() calls get_user_pages_fast() to retrieve the number of pages and the respective struct page to map in the kernel virtual address space. This doesn't work if get_user_pages_fast() is invoked with a userspace virtual address that's backed by PFNs outside of kernel reach (e.g., when limiting the kernel memory with mem= in the command line and using /dev/mem to map memory). If get_user_pages_fast() fails, look up the VMA that backs the userspace virtual address, compute the PFN and the physical address, and map it in the kernel virtual address space with memremap(). Signed-off-by: Filippo Sironi Signed-off-by: KarimAllah Ahmed --- arch/x86/kvm/paging_tmpl.h | 38 +- 1 file changed, 29 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 14ffd97..8f7bc8f 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -141,15 +141,35 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, struct page *page; npages = get_user_pages_fast((unsigned long)ptep_user, 1, 1, &page); - /* Check if the user is doing something meaningless. */ - if (unlikely(npages != 1)) - return -EFAULT; - - table = kmap_atomic(page); - ret = CMPXCHG(&table[index], orig_pte, new_pte); - kunmap_atomic(table); - - kvm_release_page_dirty(page); + if (likely(npages == 1)) { + table = kmap_atomic(page); + ret = CMPXCHG(&table[index], orig_pte, new_pte); + kunmap_atomic(table); + + kvm_release_page_dirty(page); + } else { + struct vm_area_struct *vma; + unsigned long vaddr = (unsigned long)ptep_user & PAGE_MASK; + unsigned long pfn; + unsigned long paddr; + + down_read(&current->mm->mmap_sem); + vma = find_vma_intersection(current->mm, vaddr, vaddr + PAGE_SIZE); + if (!vma || !(vma->vm_flags & VM_PFNMAP)) { + up_read(&current->mm->mmap_sem); + return -EFAULT; + } + pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; + paddr = pfn << PAGE_SHIFT; + table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB); + if (!table) { + up_read(&current->mm->mmap_sem); + return -EFAULT; + } + ret = CMPXCHG(&table[index], orig_pte, new_pte); + memunmap(table); + up_read(&current->mm->mmap_sem); + } return (ret != orig_pte); } -- 2.7.4
[PATCH] fs: ufs: Remove switch statement from ufs_set_de_type function
Remove switch statement from ufs_set_de_type function in fs/ufs/util.h header and replace with simple assignment. For each case, S_IFx >> 12 is equal to DT_x, so in valid cases (mode & S_IFMT) >> 12 should give us the correct file type. It is expected that for *nix compatibility reasons, the relation between S_IFx and DT_x will not change. For cases where the mode is invalid, upper layer validation catches this anyway, so this improves readability and arguably performance by assigning (mode & S_IFMT) >> 12 directly without any jump table or conditional logic. Signed-off-by: Phillip Potter --- fs/ufs/util.h | 30 ++ 1 file changed, 2 insertions(+), 28 deletions(-) diff --git a/fs/ufs/util.h b/fs/ufs/util.h index 1fd3011ea623..7e0c0878b9f9 100644 --- a/fs/ufs/util.h +++ b/fs/ufs/util.h @@ -16,6 +16,7 @@ * some useful macros */ #define in_range(b,first,len) ((b)>=(first)&&(b)<(first)+(len)) +#define S_SHIFT 12 /* * functions used for retyping @@ -158,34 +159,7 @@ ufs_set_de_type(struct super_block *sb, struct ufs_dir_entry *de, int mode) if ((UFS_SB(sb)->s_flags & UFS_DE_MASK) != UFS_DE_44BSD) return; - /* -* TODO turn this into a table lookup -*/ - switch (mode & S_IFMT) { - case S_IFSOCK: - de->d_u.d_44.d_type = DT_SOCK; - break; - case S_IFLNK: - de->d_u.d_44.d_type = DT_LNK; - break; - case S_IFREG: - de->d_u.d_44.d_type = DT_REG; - break; - case S_IFBLK: - de->d_u.d_44.d_type = DT_BLK; - break; - case S_IFDIR: - de->d_u.d_44.d_type = DT_DIR; - break; - case S_IFCHR: - de->d_u.d_44.d_type = DT_CHR; - break; - case S_IFIFO: - de->d_u.d_44.d_type = DT_FIFO; - break; - default: - de->d_u.d_44.d_type = DT_UNKNOWN; - } + de->d_u.d_44.d_type = (mode & S_IFMT) >> S_SHIFT; } static inline u32 -- 2.17.2
[PATCH] iw_cxgb4: fix a missing-check bug
In c4iw_flush_hw_cq(), the next CQE is acquired through t4_next_hw_cqe(). In t4_next_hw_cqe(), the CQE, i.e., 'cq->queue[cq->cidx]', is checked for validity through t4_valid_cqe(). If it is valid, the address of the CQE is saved to 'hw_cqe'. Later on, the CQE is copied to local memory in create_read_req_cqe(). The problem here is that the CQE actually lives in a DMA region allocated by dma_alloc_coherent() in create_cq(). Given that the device also has permission to access the DMA region, a malicious device controlled by an attacker can modify the CQE in the DMA region after the check in t4_next_hw_cqe() but before the copy in create_read_req_cqe(). By doing so, the attacker can supply an invalid CQE, which can cause undefined behavior in the kernel and introduce potential security risks. This patch avoids the above issue by saving the CQE to a local variable once it is verified to be valid in t4_next_hw_cqe(). That local variable is then used for the copy in create_read_req_cqe().
Signed-off-by: Wenwen Wang --- drivers/infiniband/hw/cxgb4/cq.c | 8 +--- drivers/infiniband/hw/cxgb4/t4.h | 4 ++-- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c index 6d30427..09918ca 100644 --- a/drivers/infiniband/hw/cxgb4/cq.c +++ b/drivers/infiniband/hw/cxgb4/cq.c @@ -335,13 +335,15 @@ static void advance_oldest_read(struct t4_wq *wq) */ void c4iw_flush_hw_cq(struct c4iw_cq *chp, struct c4iw_qp *flush_qhp) { - struct t4_cqe *hw_cqe, *swcqe, read_cqe; + struct t4_cqe hw_cqe_obj; + struct t4_cqe *hw_cqe = &hw_cqe_obj; + struct t4_cqe *swcqe, read_cqe; struct c4iw_qp *qhp; struct t4_swsqe *swsqe; int ret; pr_debug("cqid 0x%x\n", chp->cq.cqid); - ret = t4_next_hw_cqe(&chp->cq, &hw_cqe); + ret = t4_next_hw_cqe(&chp->cq, hw_cqe); /* * This logic is similar to poll_cq(), but not quite the same @@ -414,7 +416,7 @@ void c4iw_flush_hw_cq(struct c4iw_cq *chp, struct c4iw_qp *flush_qhp) } next_cqe: t4_hwcq_consume(&chp->cq); - ret = t4_next_hw_cqe(&chp->cq, &hw_cqe); + ret = t4_next_hw_cqe(&chp->cq, hw_cqe); if (qhp && flush_qhp != qhp) spin_unlock(&qhp->lock); } diff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h index e42021f..9a9eea5 100644 --- a/drivers/infiniband/hw/cxgb4/t4.h +++ b/drivers/infiniband/hw/cxgb4/t4.h @@ -791,7 +791,7 @@ static inline int t4_cq_notempty(struct t4_cq *cq) return cq->sw_in_use || t4_valid_cqe(cq, &cq->queue[cq->cidx]); } -static inline int t4_next_hw_cqe(struct t4_cq *cq, struct t4_cqe **cqe) +static inline int t4_next_hw_cqe(struct t4_cq *cq, struct t4_cqe *cqe) { int ret; u16 prev_cidx; @@ -809,7 +809,7 @@ static inline int t4_next_hw_cqe(struct t4_cq *cq, struct t4_cqe *cqe) /* Ensure CQE is flushed to memory */ rmb(); - *cqe = &cq->queue[cq->cidx]; + *cqe = cq->queue[cq->cidx]; ret = 0; } else ret = -ENODATA; -- 2.7.4
[PATCH] KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned
The spec only requires the posted interrupt descriptor address to be 64-byte aligned (i.e. bits[5:0] == 0). Using page_address_valid also forces the address to be page aligned. Only validate that the address does not cross the maximum physical address, without enforcing page alignment. Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: H. Peter Anvin Cc: x...@kernel.org Cc: k...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Fixes: 6de84e581c0 ("nVMX x86: check posted-interrupt descriptor addresss on vmentry of L2") Signed-off-by: KarimAllah Ahmed --- arch/x86/kvm/vmx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 30bf860..47962f2 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -11668,7 +11668,7 @@ static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu, !nested_exit_intr_ack_set(vcpu) || (vmcs12->posted_intr_nv & 0xff00) || (vmcs12->posted_intr_desc_addr & 0x3f) || - (!page_address_valid(vcpu, vmcs12->posted_intr_desc_addr)) + (vmcs12->posted_intr_desc_addr >> cpuid_maxphyaddr(vcpu))) return -EINVAL; /* tpr shadow is needed by all apicv features. */ -- 2.7.4
Re: [PATCH] mfd: tps6586x: Handle interrupts on suspend
On 10/19/18 4:22 PM, Jon Hunter wrote: > From: Jonathan Hunter > > The tps6586x driver creates an irqchip that is used by its various child > devices for managing interrupts. The tps6586x-rtc device is one of its > children that uses the tps6586x irqchip. When using the tps6586x-rtc as > a wake-up device from suspend, the following is seen: > > PM: Syncing filesystems ... done. > Freezing user space processes ... (elapsed 0.001 seconds) done. > OOM killer disabled. > Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done. > Disabling non-boot CPUs ... > Entering suspend state LP1 > Enabling non-boot CPUs ... > CPU1 is up > tps6586x 3-0034: failed to read interrupt status > tps6586x 3-0034: failed to read interrupt status > > The reason why the tps6586x interrupt status cannot be read is because > the tps6586x interrupt is not masked during suspend and when the > tps6586x-rtc interrupt occurs, to wake up the device, the interrupt is > seen before the i2c controller has been resumed in order to read the > tps6586x interrupt status. > > The tps6586x-rtc driver sets its interrupt as a wake-up source during > suspend, which gets propagated to the parent tps6586x interrupt. > However, the tps6586x-rtc driver cannot disable its interrupt during > suspend otherwise we would never be woken up and so the tps6586x must > disable its interrupt instead. > > Prevent the tps6586x interrupt handler from executing on exiting suspend > before the i2c controller has been resumed by disabling the tps6586x > interrupt on entering suspend and re-enabling it on resuming from > suspend. > > Cc: sta...@vger.kernel.org > > Signed-off-by: Jon Hunter > --- Tested patch on linux-next and v4.14, works fine. Thank you! Reviewed-by: Dmitry Osipenko Tested-by: Dmitry Osipenko
Re: Interrupts, smp_load_acquire(), smp_store_release(), etc.
On Sat, Oct 20, 2018 at 10:22:29PM +0200, Andrea Parri wrote: > [...] > > > The second (informal) litmus test has a more interesting Linux-kernel > > counterpart: > > > > void t1_interrupt(void) > > { > > r0 = READ_ONCE(y); > > smp_store_release(&x, 1); > > } > > > > void t1(void) > > { > > smp_store_release(&y, 1); > > } > > > > void t2(void) > > { > > r1 = smp_load_acquire(&x); > > r2 = smp_load_acquire(&y); > > } > > > > On store-reordering architectures that implement smp_store_release() > > as a memory-barrier instruction followed by a store, the interrupt could > > arrive betweentimes in t1(), so that there would be no ordering between > > t1_interrupt()'s store to x and t1()'s store to y. This could (again, > > in paranoid theory) result in the outcome r0==0 && r1==0 && r2==1. > > FWIW, I'd rather call "paranoid" the act of excluding such outcome ;-) > but I admit that I've only run this test in *my mind*: in an SC world, > > CPU1 CPU2 > > t1() > t1_interrupt() > r0 = READ_ONCE(y); // =0 > t2() > r1 = smp_load_acquire(&x); // =0 > smp_store_release(&x, 1); > smp_store_release(&y, 1); > r2 = smp_load_acquire(&y); // =1 OK, so did I get the outcome messed up again? :-/ Thanx, Paul > > So how paranoid should we be with respect to interrupt handlers for > > smp_store_release(), smp_load_acquire(), and the various RMW atomic > > operations that are sometimes implemented with separate memory-barrier > > instructions? ;-) > > Good question! ;-) > > Andrea > > > > > > Thanx, Paul > > >
Re: Interrupts, smp_load_acquire(), smp_store_release(), etc.
On Sat, Oct 20, 2018 at 04:18:37PM -0400, Alan Stern wrote: > On Sat, 20 Oct 2018, Paul E. McKenney wrote: > > > The second (informal) litmus test has a more interesting Linux-kernel > > counterpart: > > > > void t1_interrupt(void) > > { > > r0 = READ_ONCE(y); > > smp_store_release(&x, 1); > > } > > > > void t1(void) > > { > > smp_store_release(&y, 1); > > } > > > > void t2(void) > > { > > r1 = smp_load_acquire(&x); > > r2 = smp_load_acquire(&y); > > } > > > > On store-reordering architectures that implement smp_store_release() > > as a memory-barrier instruction followed by a store, the interrupt could > > arrive betweentimes in t1(), so that there would be no ordering between > > t1_interrupt()'s store to x and t1()'s store to y. This could (again, > > in paranoid theory) result in the outcome r0==0 && r1==0 && r2==1. > > This is disconcerting only if we assume that t1_interrupt() has to be > executed by the same CPU as t1(). If the interrupt could be fielded by > a different CPU then the paranoid outcome is perfectly understandable, > even in an SC context. > > So the question really should be limited to situations where a handler > is forced to execute in the context of a particular thread. While > POSIX does allow such restrictions for user programs, I'm not aware of > any similar mechanism in the kernel. Good point, and I was in fact assuming that t1() and t1_interrupt() were executing on the same CPU. This sort of thing happens naturally in the kernel when both t1() and t1_interrupt() are accessing per-CPU variables. Thanx, Paul
Re: Interrupts, smp_load_acquire(), smp_store_release(), etc.
[...] > The second (informal) litmus test has a more interesting Linux-kernel > counterpart: > > void t1_interrupt(void) > { > r0 = READ_ONCE(y); > smp_store_release(&x, 1); > } > > void t1(void) > { > smp_store_release(&y, 1); > } > > void t2(void) > { > r1 = smp_load_acquire(&x); > r2 = smp_load_acquire(&y); > } > > On store-reordering architectures that implement smp_store_release() > as a memory-barrier instruction followed by a store, the interrupt could > arrive betweentimes in t1(), so that there would be no ordering between > t1_interrupt()'s store to x and t1()'s store to y. This could (again, > in paranoid theory) result in the outcome r0==0 && r1==0 && r2==1. FWIW, I'd rather call "paranoid" the act of excluding such outcome ;-) but I admit that I've only run this test in *my mind*: in an SC world, CPU1 CPU2 t1() t1_interrupt() r0 = READ_ONCE(y); // =0 t2() r1 = smp_load_acquire(&x); // =0 smp_store_release(&x, 1); smp_store_release(&y, 1); r2 = smp_load_acquire(&y); // =1 > So how paranoid should we be with respect to interrupt handlers for > smp_store_release(), smp_load_acquire(), and the various RMW atomic > operations that are sometimes implemented with separate memory-barrier > instructions? ;-) Good question! ;-) Andrea > > Thanx, Paul >
Re: Interrupts, smp_load_acquire(), smp_store_release(), etc.
On Sat, 20 Oct 2018, Paul E. McKenney wrote:

> The second (informal) litmus test has a more interesting Linux-kernel
> counterpart:
>
> 	void t1_interrupt(void)
> 	{
> 		r0 = READ_ONCE(y);
> 		smp_store_release(&x, 1);
> 	}
>
> 	void t1(void)
> 	{
> 		smp_store_release(&y, 1);
> 	}
>
> 	void t2(void)
> 	{
> 		r1 = smp_load_acquire();
> 		r2 = smp_load_acquire();
> 	}
>
> On store-reordering architectures that implement smp_store_release()
> as a memory-barrier instruction followed by a store, the interrupt could
> arrive betweentimes in t1(), so that there would be no ordering between
> t1_interrupt()'s store to x and t1()'s store to y.  This could (again,
> in paranoid theory) result in the outcome r0==0 && r1==0 && r2==1.

This is disconcerting only if we assume that t1_interrupt() has to be
executed by the same CPU as t1().  If the interrupt could be fielded by
a different CPU then the paranoid outcome is perfectly understandable,
even in an SC context.

So the question really should be limited to situations where a handler
is forced to execute in the context of a particular thread.  While POSIX
does allow such restrictions for user programs, I'm not aware of any
similar mechanism in the kernel.

Alan
[PATCH] thunderbolt: fix a missing-check bug
In tb_ring_poll(), the flag of the frame, i.e.,
'ring->descriptors[ring->tail].flags', is checked to see whether the
frame is completed. If so, the frame, including the flag, is read from
the ring and returned to the caller.

The problem here is that the flag actually lives in a DMA region, which
is allocated through dma_alloc_coherent() in tb_ring_alloc(). Given that
the device can also access this DMA region, a malicious device
controlled by an attacker could modify the flag between the check and
the copy. By doing so, the attacker can bypass the check and supply an
uncompleted frame, which can cause undefined kernel behavior and
introduce a potential security risk.

This patch first copies the flag into a local variable 'desc_flags' and
then performs both the check and the copy using 'desc_flags'. In this
way, the above issue is avoided.

Signed-off-by: Wenwen Wang
---
 drivers/thunderbolt/nhi.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
index 5cd6bdf..481b1f2 100644
--- a/drivers/thunderbolt/nhi.c
+++ b/drivers/thunderbolt/nhi.c
@@ -289,6 +289,7 @@ struct ring_frame *tb_ring_poll(struct tb_ring *ring)
 {
 	struct ring_frame *frame = NULL;
 	unsigned long flags;
+	enum ring_desc_flags desc_flags;
 
 	spin_lock_irqsave(&ring->lock, flags);
 	if (!ring->running)
@@ -296,7 +297,8 @@ struct ring_frame *tb_ring_poll(struct tb_ring *ring)
 	if (ring_empty(ring))
 		goto unlock;
 
-	if (ring->descriptors[ring->tail].flags & RING_DESC_COMPLETED) {
+	desc_flags = ring->descriptors[ring->tail].flags;
+	if (desc_flags & RING_DESC_COMPLETED) {
 		frame = list_first_entry(&ring->in_flight, typeof(*frame),
 					 list);
 		list_del_init(&frame->list);
@@ -305,7 +307,7 @@ struct ring_frame *tb_ring_poll(struct tb_ring *ring)
 		frame->size = ring->descriptors[ring->tail].length;
 		frame->eof = ring->descriptors[ring->tail].eof;
 		frame->sof = ring->descriptors[ring->tail].sof;
-		frame->flags = ring->descriptors[ring->tail].flags;
+		frame->flags = desc_flags;
 	}
 	ring->tail = (ring->tail + 1) % ring->size;
-- 
2.7.4
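The time-of-check-to-time-of-use window the patch closes can be modelled outside the kernel. The Python sketch below is purely illustrative (the class and flag value are hypothetical stand-ins, not the real driver structures): a "device" that rewrites the flags between the two reads defeats the original check-then-copy code, while the patched single-read-into-a-local pattern is immune.

```python
RING_DESC_COMPLETED = 0x1  # illustrative value

class Descriptor:
    """Stand-in for a DMA descriptor whose flags a hostile device can rewrite."""
    def __init__(self):
        self._reads = 0

    @property
    def flags(self):
        # Malicious device: present COMPLETED for the first read (the check),
        # then clear it before the second read (the copy).
        self._reads += 1
        return RING_DESC_COMPLETED if self._reads == 1 else 0

def poll_racy(desc):
    """Mirrors the old code: check and copy are two separate reads."""
    if desc.flags & RING_DESC_COMPLETED:
        return desc.flags  # second read -- the device already changed it
    return None

def poll_fixed(desc):
    """Mirrors the patched code: one read into a local, used for both."""
    desc_flags = desc.flags
    if desc_flags & RING_DESC_COMPLETED:
        return desc_flags
    return None

print(poll_racy(Descriptor()))   # 0: check passed, but stale flags were copied
print(poll_fixed(Descriptor()))  # 1: check and copy agree
```

The same pattern (snapshot a device-writable field once, then validate and use only the snapshot) applies to any field in a coherent DMA buffer that the hardware may update concurrently.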
[GIT PULL] code of conduct fixes for 4.19-rc8
This is the series of patches which has been discussed on both
ksummit-discuss and linux-kernel for the past few weeks. As Shuah said
when kicking off the process, it's designed as a starting point for the
next phase of the discussion, not as the end point, so it's only really
a set of minor updates to further that goal.

The merger of the three patches to show the combined effect is attached
below. However, Greg recently posted the next phase of the discussion,
so people will be asking what the merger of the series looks like.
Ignoring the non-CoC documents, I think it looks like this

---
diff --git a/Documentation/process/code-of-conduct.rst b/Documentation/process/code-of-conduct.rst
index eec768471a4d..8913851dab89 100644
--- a/Documentation/process/code-of-conduct.rst
+++ b/Documentation/process/code-of-conduct.rst
@@ -59,19 +59,27 @@ address, posting via an official social media account, or acting as an
 appointed representative at an online or offline event. Representation of a
 project may be further defined and clarified by project maintainers.
 
-Reporting
-=========
+Enforcement
+===========
 
 Instances of abusive, harassing, or otherwise unacceptable behavior may be
-reported by contacting the Technical Advisory Board (TAB) at
-. All complaints will be reviewed and
-investigated and will result in a response that is deemed necessary and
-appropriate to the circumstances. The TAB is obligated to maintain
-confidentiality with regard to the reporter of an incident (except where
-required by law).
+reported by contacting the Code of Conduct Committee at
+. All complaints will be reviewed and investigated
+and will result in a response that is deemed necessary and appropriate
+to the circumstances. The Code of Conduct Committee is obligated to
+maintain confidentiality with regard to the reporter of an incident
+(except where required by law). Further details of specific enforcement
+policies may be posted separately.
 
 Attribution
 ===========
 
 This Code of Conduct is adapted from the Contributor Covenant, version 1.4,
 available at
 https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+Interpretation
+==============
+
+See the :ref:`code_of_conduct_interpretation` document for how the Linux
+kernel community will be interpreting this document.
---

And I'm sure it can be rebased to this without disturbing the currently
gathered tags.

The patch can be pulled from

  git://git.kernel.org/pub/scm/linux/kernel/git/jejb/linux-coc.git coc-fixes

With the diffstat:

 Documentation/process/code-of-conduct.rst | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

With full diff below.

James

---
diff --git a/Documentation/process/code-of-conduct.rst b/Documentation/process/code-of-conduct.rst
index ab7c24b5478c..eec768471a4d 100644
--- a/Documentation/process/code-of-conduct.rst
+++ b/Documentation/process/code-of-conduct.rst
@@ -31,7 +31,7 @@ Examples of unacceptable behavior by participants include:
 * Trolling, insulting/derogatory comments, and personal or political attacks
 * Public or private harassment
 * Publishing others’ private information, such as a physical or electronic
-  address, without explicit permission
+  address not ordinarily collected by the project, without explicit permission
 * Other conduct which could reasonably be considered inappropriate in a
   professional setting
@@ -59,20 +59,16 @@ address, posting via an official social media account, or acting as an
 appointed representative at an online or offline event. Representation of a
 project may be further defined and clarified by project maintainers.
 
-Enforcement
-===========
+Reporting
+=========
 
 Instances of abusive, harassing, or otherwise unacceptable behavior may be
 reported by contacting the Technical Advisory Board (TAB) at
 . All complaints will be reviewed and investigated and will result in a
 response that is deemed necessary and appropriate to the circumstances.
 The TAB is obligated to maintain
-confidentiality with regard to the reporter of an incident. Further details of
-specific enforcement policies may be posted separately.
-
-Maintainers who do not follow or enforce the Code of Conduct in good faith may
-face temporary or permanent repercussions as determined by other members of the
-project’s leadership.
+confidentiality with regard to the reporter of an incident (except where
+required by law).
 
 Attribution
 ===========
Re: [Ksummit-discuss] [PATCH 6/7] Code of Conduct: Change the contact email address
On Sat, 2018-10-20 at 19:28 +0100, Alan Cox wrote:
> > +to the circumstances. The Code of Conduct Committee is obligated to
> > +maintain confidentiality with regard to the reporter of an incident.
> > +Further details of specific enforcement policies may be posted
> > +separately.
>
> Unfortunately by ignoring the other suggestions on this you've left
> this bit broken.
>
> The committee can't keep most stuff confidential so it's misleading
> and wrong to imply they can. Data protection law, reporting laws in
> some countries and the like mean that anyone expecting an incident to
> remain confidential from the person it was reported against is living
> in dreamland and is going to get a nasty shock.
>
> At the very least it should say '(except where required by law)'.

I've got a solution for this: the patches I've been curating also
modify the section, so the merger will look like what I have below.
The intent of the series I'm curating was only to show, in 4.19, the
beginning of a desire to change and to correct the obvious defect
before we started the debate; after suitable discussion, this one can
be the final set.

> There is a separate issue that serious things should always go to law
> enforcement - you are setting up a policy akin to the one that got
> the catholic church and many others in trouble.
>
> You should also reserve the right to report serious incidents
> directly to law enforcement. Unless of course you want to be forced
> to sit on multiple reports of physical abuse from different people
> about someone - unable to tell them about each other's reports, unable
> to prove anything, and in twenty years time having to explain to the
> media why nothing was done.

I think we should debate that. Most legal systems provide significant
deference to victims wishing for confidentiality, and we should both
respect that and remember that an automatic crime report is a
significant deterrent to vulnerable people in a lot of places.

James

---
diff --git a/Documentation/process/code-of-conduct.rst b/Documentation/process/code-of-conduct.rst
index eec768471a4d..8913851dab89 100644
--- a/Documentation/process/code-of-conduct.rst
+++ b/Documentation/process/code-of-conduct.rst
@@ -59,19 +59,27 @@ address, posting via an official social media account, or acting as an
 appointed representative at an online or offline event. Representation of a
 project may be further defined and clarified by project maintainers.
 
-Reporting
-=========
+Enforcement
+===========
 
 Instances of abusive, harassing, or otherwise unacceptable behavior may be
-reported by contacting the Technical Advisory Board (TAB) at
-. All complaints will be reviewed and
-investigated and will result in a response that is deemed necessary and
-appropriate to the circumstances. The TAB is obligated to maintain
-confidentiality with regard to the reporter of an incident (except where
-required by law).
+reported by contacting the Code of Conduct Committee at
+. All complaints will be reviewed and investigated
+and will result in a response that is deemed necessary and appropriate
+to the circumstances. The Code of Conduct Committee is obligated to
+maintain confidentiality with regard to the reporter of an incident
+(except where required by law). Further details of specific enforcement
+policies may be posted separately.
+
 Attribution
 ===========
 
 This Code of Conduct is adapted from the Contributor Covenant, version 1.4,
 available at
 https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+Interpretation
+==============
+
+See the :ref:`code_of_conduct_interpretation` document for how the Linux
+kernel community will be interpreting this document.
Re: [Ksummit-discuss] [PATCH 6/7] Code of Conduct: Change the contact email address
On Sat, 2018-10-20 at 19:24, tim.b...@sony.com wrote:
> The scope of the code of conduct basically means that it covers
> online interactions (communication via mailing list, git commits
> and Bugzilla). Not to be flippant, but those are hardly mediums
> that are susceptible to executing physical abuse. Also, they are
> all mediums that leave a persistent, public trail. So I don't think
> the comparison is very apt here.
>  -- Tim

If that is the case, then why does this need to go into the Linux
kernel in the first place? The mailing lists, the kernel.org git
repository, and bugzilla presumably all have "terms of use" pages that
could specify the expected behaviour very explicitly, and could specify
how arbitration works as part of those terms of use (and if enforcement
is required, then they could specify legal venues etc).

IOW: if the scope is just communication online, then I would think
there are better tools for that. Putting a code of conduct into the
kernel code itself wants to be justified by more than just regulating
online behaviour.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com