On 12/15/2016 1:33 PM, Alex Williamson wrote:
> On Thu, 15 Dec 2016 12:05:35 +0530
> Kirti Wankhede <[email protected]> wrote:
> 
>> On 12/14/2016 2:28 AM, Alex Williamson wrote:
>>> As part of the mdev support, type1 now gets a task reference per
>>> vfio_dma and uses that to get an mm reference for the task while
>>> working on accounting.  That's the correct thing to do for paths
>>> where we can't rely on using current, but there are still hot paths
>>> where we can optimize because we know we're invoked by the user.
>>>
>>> Specifically, vfio_pin_pages_remote() is only called when the user
>>> does DMA mapping (vfio_dma_do_map) or if an IOMMU group is added to
>>> a container with existing mappings (vfio_iommu_replay).  We can
>>> therefore use current->mm as well as rlimit() and capable() directly
>>> rather than going through the high overhead path via the stored
>>> task_struct.  We also know that vfio_dma_do_unmap() is only called
>>> via user ioctl, so we can also tune that path to be more lightweight.
>>>
>>> In a synthetic guest mapping test emulating a 1TB VM backed by a
>>> single 4GB range remapped multiple times across the address space,
>>> the mdev changes to the type1 backend introduced a roughly 25% hit
>>> in runtime of this test.  These changes restore it to nearly the
>>> previous performance for the interfaces exercised here,
>>> VFIO_IOMMU_MAP_DMA and release on close.
>>>
>>> Signed-off-by: Alex Williamson <[email protected]>
>>> ---
>>>  drivers/vfio/vfio_iommu_type1.c |  145 +++++++++++++++++++++------------------
>>>  1 file changed, 79 insertions(+), 66 deletions(-)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index 9815e45..8dfeafb 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -103,6 +103,10 @@ struct vfio_pfn {
>>>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)    \
>>>                                     (!list_empty(&iommu->domain_list))
>>>  
>>> +/* Make function bool options readable */
>>> +#define IS_CURRENT (true)
>>> +#define DO_ACCOUNTING      (true)
>>> +
>>>  static int put_pfn(unsigned long pfn, int prot);
>>>  
>>>  /*
>>> @@ -264,7 +268,8 @@ static void vfio_lock_acct_bg(struct work_struct *work)
>>>     kfree(vwork);
>>>  }
>>>  
>>> -static void vfio_lock_acct(struct task_struct *task, long npage)
>>> +static void vfio_lock_acct(struct task_struct *task,
>>> +                      long npage, bool is_current)
>>>  {
>>>     struct vwork *vwork;
>>>     struct mm_struct *mm;
>>> @@ -272,24 +277,31 @@ static void vfio_lock_acct(struct task_struct *task, long npage)
>>>     if (!npage)
>>>             return;
>>>  
>>> -   mm = get_task_mm(task);
>>> +   mm = is_current ? task->mm : get_task_mm(task);
>>>     if (!mm)
>>> -           return; /* process exited or nothing to do */
>>> +           return; /* process exited */
>>>  
>>>     if (down_write_trylock(&mm->mmap_sem)) {
>>>             mm->locked_vm += npage;
>>>             up_write(&mm->mmap_sem);
>>> -           mmput(mm);
>>> +           if (!is_current)
>>> +                   mmput(mm);
>>>             return;
>>>     }
>>>  
>>> +   if (is_current) {
>>> +           mm = get_task_mm(task);
>>> +           if (!mm)
>>> +                   return;
>>> +   }
>>> +
>>>     /*
>>>      * Couldn't get mmap_sem lock, so must setup to update
>>>      * mm->locked_vm later. If locked_vm were atomic, we
>>>      * wouldn't need this silliness
>>>      */
>>>     vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
>>> -   if (!vwork) {
>>> +   if (WARN_ON(!vwork)) {
>>>             mmput(mm);
>>>             return;
>>>     }
>>> @@ -345,13 +357,13 @@ static int put_pfn(unsigned long pfn, int prot)
>>>  }
>>>  
>>>  static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>>> -                    int prot, unsigned long *pfn)
>>> +                    int prot, unsigned long *pfn, bool is_current)
>>>  {
>>>     struct page *page[1];
>>>     struct vm_area_struct *vma;
>>>     int ret;
>>>  
>>> -   if (mm == current->mm) {
>>> +   if (is_current) {  
>>
>> With this change, if vfio_pin_page_external() gets called from QEMU
>> process context, for example in response to some BAR0 register access,
>> it will still fall back to the slow path, get_user_pages_remote(). We
>> don't have to change this function; it already takes the best possible
>> path.
>>
>> That also makes me think: vfio_pin_page_external() uses the task
>> structure to get the mlock limit and capability. The expectation is that
>> an mdev vendor driver shouldn't pin all system memory, but if any mdev
>> driver does, it might see a similar performance impact. Should we
>> optimize this path if (dma->task == current)?
> 
> Hi Kirti,
> 
> I was actually trying to avoid the (task == current) test with this
> change because I wasn't sure how reliable it is.  Is there a
> possibility that this test generates a false positive if current
> coincidentally matches our task?  And does a match give us the same
> opportunities for making use of current that we have when we know we
> are in a process-context execution path?  The above change makes this a
> more direct association.  Can you show that inferring the process
> context is correct?  Thanks,
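The reference handling that the patch adds to vfio_lock_acct() can be modelled in userspace. This is a minimal sketch, assuming simplified stand-ins: mm_model, task_model, vwork_model and the *_model helpers are hypothetical names, not kernel APIs, and mmap_sem is reduced to a flag. It shows only the counting: with is_current the caller's mm is borrowed on the fast path, and a counted reference is taken only when the update has to be deferred to a worker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel's structures. */
struct mm_model {
    int users;          /* models mm->mm_users */
    long locked_vm;     /* models mm->locked_vm */
    bool write_locked;  /* models mmap_sem held for write */
};

struct task_model {
    struct mm_model *mm;
};

/* Deferred update, modelling the struct vwork handed to a workqueue. */
struct vwork_model {
    struct mm_model *mm;
    long npage;
};

/* Models get_task_mm(): returns the mm with an extra counted reference. */
static struct mm_model *get_task_mm_model(struct task_model *task)
{
    if (task->mm)
        task->mm->users++;
    return task->mm;
}

/* Models mmput(): drops a counted reference. */
static void mmput_model(struct mm_model *mm)
{
    assert(mm->users > 0);
    mm->users--;
}

static bool down_write_trylock_model(struct mm_model *mm)
{
    if (mm->write_locked)
        return false;
    mm->write_locked = true;
    return true;
}

static void up_write_model(struct mm_model *mm)
{
    mm->write_locked = false;
}

/*
 * Mirrors the control flow of the patched vfio_lock_acct(). With
 * is_current the caller's mm is borrowed, so the fast path neither takes
 * nor drops a reference. A reference is taken only when the update must
 * be deferred, because the deferred work outlives this call.
 * Returns true if locked_vm was updated synchronously.
 */
static bool lock_acct_model(struct task_model *task, long npage,
                            bool is_current, struct vwork_model *work)
{
    struct mm_model *mm;

    work->mm = NULL;
    if (!npage)
        return false;

    mm = is_current ? task->mm : get_task_mm_model(task);
    if (!mm)
        return false;                   /* process exited */

    if (down_write_trylock_model(mm)) {
        mm->locked_vm += npage;
        up_write_model(mm);
        if (!is_current)
            mmput_model(mm);
        return true;
    }

    if (is_current) {
        mm = get_task_mm_model(task);
        if (!mm)
            return false;
    }
    work->mm = mm;                      /* the work item now owns this reference */
    work->npage = npage;
    return false;
}

/* Models vfio_lock_acct_bg(): apply the update, then drop the work's reference. */
static void lock_acct_bg_model(struct vwork_model *work)
{
    struct mm_model *mm = work->mm;

    assert(!mm->write_locked);          /* the real worker would block in down_write() */
    mm->write_locked = true;
    mm->locked_vm += work->npage;
    up_write_model(mm);
    mmput_model(mm);
    work->mm = NULL;
}
```

In this model an uncontended is_current call leaves the mm's user count untouched, while a contended one takes exactly one reference that the deferred worker later drops.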

We take a reference on the task structure, get_task_struct(current),
before saving it in dma->task, and that reference is released with
put_task_struct() in vfio_remove_dma(). That guarantees a valid reference
to the task structure until we remove and free that dma structure. Why
would the check (dma->task == current) give a false positive? A vendor
driver can call vfio_pin_pages() on access to some emulated register from
the same task that mapped the DMA range; in that case this check would be
true.
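Why the held reference rules out a false positive from address reuse can be shown with a small userspace model. This is a sketch under stated assumptions: task_slot, dma_model, task_alloc(), get_task(), put_task() and the other helpers are hypothetical, and a fixed pool whose freed slots are handed out again stands in for task_struct memory being freed and reallocated. It illustrates only the lifetime argument, not whether a match with current grants the same guarantees as a call path known to run in process context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 4

/* Hypothetical stand-in for task_struct; usage models its usage count. */
struct task_slot {
    int pid;
    int usage;
    bool in_use;
};

/* Fixed pool: a freed slot is handed out again, like freed task_struct memory. */
static struct task_slot pool[POOL_SIZE];

static struct task_slot *task_alloc(int pid)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!pool[i].in_use) {
            pool[i] = (struct task_slot){ .pid = pid, .usage = 1, .in_use = true };
            return &pool[i];
        }
    }
    return NULL;
}

/* Models get_task_struct(). */
static void get_task(struct task_slot *t)
{
    t->usage++;
}

/* Models put_task_struct(): the slot becomes reusable when usage reaches 0. */
static void put_task(struct task_slot *t)
{
    assert(t->usage > 0);
    if (--t->usage == 0)
        t->in_use = false;
}

/* Models the task pointer saved in a vfio_dma. */
struct dma_model {
    struct task_slot *task;
};

/* take_ref selects whether the mapping pins the task, as vfio does. */
static void dma_map(struct dma_model *dma, struct task_slot *cur, bool take_ref)
{
    if (take_ref)
        get_task(cur);
    dma->task = cur;
}

/* The pointer comparison under discussion: dma->task == current. */
static bool dma_is_current(const struct dma_model *dma, const struct task_slot *cur)
{
    return dma->task == cur;
}
```

Without the reference, a task allocated after the original exits can occupy the same slot and pass the comparison; with the reference held, the slot stays allocated until the final put, so a different task cannot appear at that address.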

Thanks,
Kirti
