Re: [PATCH 0/3] block: blk_interposer - Block Layer Interposer

2020-12-14 Thread Bob Liu
Hi Folks,

On 12/12/20 12:56 AM, Hannes Reinecke wrote:
> On 12/11/20 5:33 PM, Jens Axboe wrote:
>> On 12/11/20 9:30 AM, Mike Snitzer wrote:
>>> While I still think there needs to be a proper _upstream_ consumer of
>>> blk_interposer as a condition of it going in.. I'll let others make the
>>> call.
>>
>> That's an unequivocal rule.
>>
>>> As such, I'll defer to Jens, Christoph and others on whether your
>>> minimalist blk_interposer hook is acceptable in the near-term.
>>
>> I don't think so, we don't do short term bandaids just to plan on
>> ripping that out when the real functionality is there. IMHO, the dm
>> approach is the way to go - it provides exactly the functionality that
>> is needed in an appropriate way, instead of hacking some "interposer"
>> into the core block layer.
>>
> Which is my plan, too.
> 
> I'll be working with the Veeam folks to present a joint patchset (including 
> the DM bits) for the next round.
> 

Besides the dm approach, do you think Veeam's original requirement is a good
use case for "block/bpf: add eBPF based block layer IO filtering"?
https://lwn.net/ml/bpf/20200812163305.545447-1-leah.ruman...@gmail.com/

Thanks,
Bob


Re: [PATCH v3] dm crypt: add flags to optionally bypass dm-crypt workqueues

2020-07-08 Thread Bob Liu
   return;
> + }
> +
>   INIT_WORK(&io->work, kcryptd_crypt);
>   queue_work(cc->crypt_queue, &io->work);
>  }
> @@ -2838,7 +2862,7 @@ static int crypt_ctr_optional(struct dm_target *ti, 
> unsigned int argc, char **ar
>   struct crypt_config *cc = ti->private;
>   struct dm_arg_set as;
>   static const struct dm_arg _args[] = {
> - {0, 6, "Invalid number of feature args"},
> + {0, 8, "Invalid number of feature args"},
>   };
>   unsigned int opt_params, val;
>   const char *opt_string, *sval;
> @@ -2868,6 +2892,10 @@ static int crypt_ctr_optional(struct dm_target *ti, 
> unsigned int argc, char **ar
>  
>   else if (!strcasecmp(opt_string, "submit_from_crypt_cpus"))
>   set_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags);
> + else if (!strcasecmp(opt_string, "no_read_workqueue"))
> + set_bit(DM_CRYPT_NO_READ_WORKQUEUE, &cc->flags);
> + else if (!strcasecmp(opt_string, "no_write_workqueue"))
> + set_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, &cc->flags);
>   else if (sscanf(opt_string, "integrity:%u:", &val) == 1) {
>   if (val == 0 || val > MAX_TAG_SIZE) {
>   ti->error = "Invalid integrity arguments";
> @@ -3196,6 +3224,8 @@ static void crypt_status(struct dm_target *ti, 
> status_type_t type,
>   num_feature_args += !!ti->num_discard_bios;
>   num_feature_args += test_bit(DM_CRYPT_SAME_CPU, &cc->flags);
>   num_feature_args += test_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags);
> + num_feature_args += test_bit(DM_CRYPT_NO_READ_WORKQUEUE, 
> &cc->flags);
> + num_feature_args += test_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, 
> &cc->flags);
>   num_feature_args += cc->sector_size != (1 << SECTOR_SHIFT);
>   num_feature_args += test_bit(CRYPT_IV_LARGE_SECTORS, 
> &cc->cipher_flags);
>   if (cc->on_disk_tag_size)
> @@ -3208,6 +3238,10 @@ static void crypt_status(struct dm_target *ti, 
> status_type_t type,
>       DMEMIT(" same_cpu_crypt");
>   if (test_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags))
>   DMEMIT(" submit_from_crypt_cpus");
> + if (test_bit(DM_CRYPT_NO_READ_WORKQUEUE, &cc->flags))
> + DMEMIT(" no_read_workqueue");
> + if (test_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, &cc->flags))
> + DMEMIT(" no_write_workqueue");
>   if (cc->on_disk_tag_size)
>   DMEMIT(" integrity:%u:%s", 
> cc->on_disk_tag_size, cc->cipher_auth);
>   if (cc->sector_size != (1 << SECTOR_SHIFT))
> @@ -3320,7 +3354,7 @@ static void crypt_io_hints(struct dm_target *ti, struct 
> queue_limits *limits)
>  
>  static struct target_type crypt_target = {
>   .name   = "crypt",
> - .version = {1, 21, 0},
> + .version = {1, 22, 0},
>   .module = THIS_MODULE,
>   .ctr= crypt_ctr,
>   .dtr= crypt_dtr,
> 

This patch looks good to me. I tested with null_blk and got a similar
improvement. Thanks!

Reviewed-by: Bob Liu 



Re: [dm-devel] [PATCH v2] dm crypt: add flags to optionally bypass dm-crypt workqueues

2020-07-06 Thread Bob Liu
Hi Ignat,

On 6/27/20 5:03 AM, Ignat Korchagin wrote:
> This is a follow up from [1]. Consider the following script:
> 
> sudo modprobe brd rd_nr=1 rd_size=4194304
> 

Did you test with a null_blk device? I didn't get the expected result using null_blk.

1.
# fio --filename=/dev/nullb0 --readwrite=readwrite --bs=4k --direct=1 --loops=100 --name=plain
Run status group 0 (all jobs):
  READ: bw=390MiB/s (409MB/s), 390MiB/s-390MiB/s (409MB/s-409MB/s), io=10.6GiB 
(11.3GB), run=27744-27744msec
  WRITE: bw=390MiB/s (409MB/s), 390MiB/s-390MiB/s (409MB/s-409MB/s), io=10.6GiB 
(11.3GB), run=27744-27744msec

2.
Create encrypted-ram0 based on null_blk (without this patch):
# cryptsetup open --header crypthdr.img /dev/nullb0 encrypted-ram0
# fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k 
--direct=1 --loops=100 --name=crypt
Run status group 0 (all jobs):
  READ: bw=180MiB/s (188MB/s), 180MiB/s-180MiB/s (188MB/s-188MB/s), io=4534MiB 
(4754MB), run=25246-25246msec
  WRITE: bw=179MiB/s (188MB/s), 179MiB/s-179MiB/s (188MB/s-188MB/s), io=4528MiB 
(4748MB), run=25246-25246msec

3.
Create encrypted-ram0 based on null_blk (with this patch):
# cryptsetup open --header crypthdr.img /dev/nullb0 encrypted-ram0
# fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k  
--direct=1 --loops=100 --name=crypt.patched
Run status group 0 (all jobs):
  READ: bw=149MiB/s (156MB/s), 149MiB/s-149MiB/s (156MB/s-156MB/s), io=4128MiB 
(4329MB), run=27753-27753msec
  WRITE: bw=149MiB/s (156MB/s), 149MiB/s-149MiB/s (156MB/s-156MB/s), io=4124MiB 
(4324MB), run=27753-27753msec

Looks like the result is worse after this patch, or I may have missed something...

Regards,
Bob


> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | \
> sudo dmsetup create eram0
> 
> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 
> no_write_workqueue' | \
> sudo dmsetup create eram0-inline-write
> 
> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 
> no_read_workqueue' | \
> sudo dmsetup create eram0-inline-read
> 
> devices="/dev/ram0 /dev/mapper/eram0 /dev/mapper/eram0-inline-read "
> devices+="/dev/mapper/eram0-inline-write"
> 
> for dev in $devices; do
>   echo "reading from $dev"
>   sudo fio --filename=$dev --readwrite=read --bs=4k --direct=1 \
>   --loops=100 --runtime=3m --name=plain | grep READ
> done
> 
> for dev in $devices; do
>   echo "writing to $dev"
>   sudo fio --filename=$dev --readwrite=write --bs=4k --direct=1 \
>   --loops=100 --runtime=3m --name=plain | grep WRITE
> done
> 
> This script creates a ramdisk (to eliminate hardware bias in the benchmark) 
> and
> three dm-crypt instances on top. All dm-crypt instances use the NULL cipher
> to eliminate potentially expensive crypto bias (the NULL cipher just uses 
> memcpy
> for "encyption"). The first instance is the current dm-crypt implementation 
> from
> 5.8-rc2, the two others have new optional flags enabled, which bypass kcryptd
> workqueues for reads and writes respectively and write sorting for writes. On
> my VM (Debian in VirtualBox with 4 cores on 2.8 GHz Quad-Core Intel Core i7) I
> get the following output (formatted for better readability):
> 
> reading from /dev/ram0
>READ: bw=508MiB/s (533MB/s), 508MiB/s-508MiB/s (533MB/s-533MB/s), 
> io=89.3GiB (95.9GB), run=18-18msec
> 
> reading from /dev/mapper/eram0
>READ: bw=80.6MiB/s (84.5MB/s), 80.6MiB/s-80.6MiB/s (84.5MB/s-84.5MB/s), 
> io=14.2GiB (15.2GB), run=18-18msec
> 
> reading from /dev/mapper/eram0-inline-read
>READ: bw=295MiB/s (309MB/s), 295MiB/s-295MiB/s (309MB/s-309MB/s), 
> io=51.8GiB (55.6GB), run=18-18msec
> 
> reading from /dev/mapper/eram0-inline-write
>READ: bw=114MiB/s (120MB/s), 114MiB/s-114MiB/s (120MB/s-120MB/s), 
> io=20.1GiB (21.5GB), run=18-18msec
> 
> writing to /dev/ram0
>   WRITE: bw=516MiB/s (541MB/s), 516MiB/s-516MiB/s (541MB/s-541MB/s), 
> io=90.7GiB (97.4GB), run=180001-180001msec
> 
> writing to /dev/mapper/eram0
>   WRITE: bw=40.4MiB/s (42.4MB/s), 40.4MiB/s-40.4MiB/s (42.4MB/s-42.4MB/s), 
> io=7271MiB (7624MB), run=180001-180001msec
> 
> writing to /dev/mapper/eram0-inline-read
>   WRITE: bw=38.9MiB/s (40.8MB/s), 38.9MiB/s-38.9MiB/s (40.8MB/s-40.8MB/s), 
> io=7000MiB (7340MB), run=180001-180001msec
> 
> writing to /dev/mapper/eram0-inline-write
>   WRITE: bw=277MiB/s (290MB/s), 277MiB/s-277MiB/s (290MB/s-290MB/s), 
> io=48.6GiB (52.2GB), run=18-18msec
> 
> Current dm-crypt implementation creates a significant IO performance overhead
> (at least on small IO block sizes) for both latency and throughput. We suspect
> offloading IO request processing into workqueues and async threads is more
> harmful these days with the modern fast storage. I also did some digging into
> the dm-crypt git history and much of this async processing is not needed
> anymore, because the reasons it was added are mostly gone from the kernel. 
> More
> details can be found in [2] (see "Git 

Re: [PATCH 1/2] workqueue: don't always set __WQ_ORDERED implicitly

2020-06-30 Thread Bob Liu
On 6/29/20 8:37 AM, Lai Jiangshan wrote:
> On Mon, Jun 29, 2020 at 8:13 AM Bob Liu  wrote:
>>
>> On 6/28/20 11:54 PM, Lai Jiangshan wrote:
>>> On Thu, Jun 11, 2020 at 6:29 PM Bob Liu  wrote:
>>>>
>>>> Current code always set 'Unbound && max_active == 1' workqueues to ordered
>>>> implicitly, while this may be not an expected behaviour for some use cases.
>>>>
>>>> E.g some scsi and iscsi workqueues(unbound && max_active = 1) want to be 
>>>> bind
>>>> to different cpu so as to get better isolation, but their cpumask can't be
>>>> changed because WQ_ORDERED is set implicitly.
>>>
>>> Hello
>>>
>>> If I read the code correctly, the reason why their cpumask can't
>>> be changed is because __WQ_ORDERED_EXPLICIT, not __WQ_ORDERED.
>>>
>>>>
>>>> This patch adds a flag __WQ_ORDERED_DISABLE and also
>>>> create_singlethread_workqueue_noorder() to offer an new option.
>>>>
>>>> Signed-off-by: Bob Liu 
>>>> ---
>>>>  include/linux/workqueue.h | 4 
>>>>  kernel/workqueue.c| 4 +++-
>>>>  2 files changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
>>>> index e48554e..4c86913 100644
>>>> --- a/include/linux/workqueue.h
>>>> +++ b/include/linux/workqueue.h
>>>> @@ -344,6 +344,7 @@ enum {
>>>> __WQ_ORDERED= 1 << 17, /* internal: workqueue is 
>>>> ordered */
>>>> __WQ_LEGACY = 1 << 18, /* internal: 
>>>> create*_workqueue() */
>>>> __WQ_ORDERED_EXPLICIT   = 1 << 19, /* internal: 
>>>> alloc_ordered_workqueue() */
>>>> +   __WQ_ORDERED_DISABLE= 1 << 20, /* internal: don't set 
>>>> __WQ_ORDERED implicitly */
>>>>
>>>> WQ_MAX_ACTIVE   = 512,/* I like 512, better ideas? */
>>>> WQ_MAX_UNBOUND_PER_CPU  = 4,  /* 4 * #cpus for unbound wq */
>>>> @@ -433,6 +434,9 @@ struct workqueue_struct *alloc_workqueue(const char 
>>>> *fmt,
>>>>  #define create_singlethread_workqueue(name)\
>>>> alloc_ordered_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, name)
>>>>
>>>> +#define create_singlethread_workqueue_noorder(name)\
>>>> +   alloc_workqueue("%s", WQ_SYSFS | __WQ_LEGACY | WQ_MEM_RECLAIM | \
>>>> +   WQ_UNBOUND | __WQ_ORDERED_DISABLE, 1, (name))
>>>
>>> I think using __WQ_ORDERED without __WQ_ORDERED_EXPLICIT is what you
>>> need, in which case cpumask is allowed to be changed.
>>>
>>
>> I don't think so, see function workqueue_apply_unbound_cpumask():
>>
>> wq_unbound_cpumask_store()
>>  > workqueue_set_unbound_cpumask()
>>> workqueue_apply_unbound_cpumask() {
>>  ...
>> 5276 /* creating multiple pwqs breaks ordering guarantee */
>> 5277 if (wq->flags & __WQ_ORDERED)
>> 5278 continue;
>>   
>>   Here will skip apply cpumask if only __WQ_ORDERED 
>> is set.
> 
> wq_unbound_cpumask_store() is for changing the cpumask of
> *all* workqueues. I don't think it can be used to make
> scsi and iscsi workqueues bound to different cpu.
> 
> apply_workqueue_attrs() is for changing the cpumask of the specific
> workqueue, which can change the cpumask of __WQ_ORDERED workqueue
> (but without __WQ_ORDERED_EXPLICIT).
> 

Yes, you are right. I made a mistake.
Sorry for the noise.

Regards,
Bob

>>
>> 5280 ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs);
>>
>>  }
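
A rough, untested sketch of what Lai describes above -- changing the cpumask of
one specific unbound workqueue via apply_workqueue_attrs() instead of the global
sysfs knob. The workqueue name and the CPU number here are made up, and
apply_workqueue_attrs() is not exported, so this would only work from built-in code:

#include <linux/workqueue.h>
#include <linux/cpumask.h>

static int example_bind_wq_to_cpu2(void)
{
        struct workqueue_struct *wq;
        struct workqueue_attrs *attrs;
        int ret = -ENOMEM;

        /* unbound && max_active == 1: __WQ_ORDERED is set implicitly,
         * but __WQ_ORDERED_EXPLICIT is not, so the attrs can still change */
        wq = alloc_workqueue("example_wq", WQ_UNBOUND | WQ_MEM_RECLAIM, 1);
        attrs = alloc_workqueue_attrs();
        if (wq && attrs) {
                cpumask_copy(attrs->cpumask, cpumask_of(2));
                ret = apply_workqueue_attrs(wq, attrs);
        }
        free_workqueue_attrs(attrs);
        return ret;
}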


Re: [PATCH 1/2] workqueue: don't always set __WQ_ORDERED implicitly

2020-06-28 Thread Bob Liu
On 6/29/20 8:37 AM, Lai Jiangshan wrote:
> On Mon, Jun 29, 2020 at 8:13 AM Bob Liu  wrote:
>>
>> On 6/28/20 11:54 PM, Lai Jiangshan wrote:
>>> On Thu, Jun 11, 2020 at 6:29 PM Bob Liu  wrote:
>>>>
>>>> Current code always set 'Unbound && max_active == 1' workqueues to ordered
>>>> implicitly, while this may be not an expected behaviour for some use cases.
>>>>
>>>> E.g some scsi and iscsi workqueues(unbound && max_active = 1) want to be 
>>>> bind
>>>> to different cpu so as to get better isolation, but their cpumask can't be
>>>> changed because WQ_ORDERED is set implicitly.
>>>
>>> Hello
>>>
>>> If I read the code correctly, the reason why their cpumask can't
>>> be changed is because __WQ_ORDERED_EXPLICIT, not __WQ_ORDERED.
>>>
>>>>
>>>> This patch adds a flag __WQ_ORDERED_DISABLE and also
>>>> create_singlethread_workqueue_noorder() to offer an new option.
>>>>
>>>> Signed-off-by: Bob Liu 
>>>> ---
>>>>  include/linux/workqueue.h | 4 
>>>>  kernel/workqueue.c| 4 +++-
>>>>  2 files changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
>>>> index e48554e..4c86913 100644
>>>> --- a/include/linux/workqueue.h
>>>> +++ b/include/linux/workqueue.h
>>>> @@ -344,6 +344,7 @@ enum {
>>>> __WQ_ORDERED= 1 << 17, /* internal: workqueue is 
>>>> ordered */
>>>> __WQ_LEGACY = 1 << 18, /* internal: 
>>>> create*_workqueue() */
>>>> __WQ_ORDERED_EXPLICIT   = 1 << 19, /* internal: 
>>>> alloc_ordered_workqueue() */
>>>> +   __WQ_ORDERED_DISABLE= 1 << 20, /* internal: don't set 
>>>> __WQ_ORDERED implicitly */
>>>>
>>>> WQ_MAX_ACTIVE   = 512,/* I like 512, better ideas? */
>>>> WQ_MAX_UNBOUND_PER_CPU  = 4,  /* 4 * #cpus for unbound wq */
>>>> @@ -433,6 +434,9 @@ struct workqueue_struct *alloc_workqueue(const char 
>>>> *fmt,
>>>>  #define create_singlethread_workqueue(name)\
>>>> alloc_ordered_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, name)
>>>>
>>>> +#define create_singlethread_workqueue_noorder(name)\
>>>> +   alloc_workqueue("%s", WQ_SYSFS | __WQ_LEGACY | WQ_MEM_RECLAIM | \
>>>> +   WQ_UNBOUND | __WQ_ORDERED_DISABLE, 1, (name))
>>>
>>> I think using __WQ_ORDERED without __WQ_ORDERED_EXPLICIT is what you
>>> need, in which case cpumask is allowed to be changed.
>>>
>>
>> I don't think so, see function workqueue_apply_unbound_cpumask():
>>
>> wq_unbound_cpumask_store()
>>  > workqueue_set_unbound_cpumask()
>>> workqueue_apply_unbound_cpumask() {
>>  ...
>> 5276 /* creating multiple pwqs breaks ordering guarantee */
>> 5277 if (wq->flags & __WQ_ORDERED)
>> 5278 continue;
>>   
>>   Here will skip apply cpumask if only __WQ_ORDERED 
>> is set.
> 
> wq_unbound_cpumask_store() is for changing the cpumask of
> *all* workqueues. 

Isn't '/sys/bus/workqueue/devices//cpumask' using the same function to
change the cpumask of a specific workqueue?
Am I missing something?

> I don't think it can be used to make
> scsi and iscsi workqueues bound to different cpu.
> 

The idea is to register the scsi/iscsi workqueues with WQ_SYSFS, and then they can
be bound to different cpus by writing a cpu number to
"/sys/bus/workqueue/devices//cpumask".

> apply_workqueue_attrs() is for changing the cpumask of the specific
> workqueue, which can change the cpumask of __WQ_ORDERED workqueue
> (but without __WQ_ORDERED_EXPLICIT).
> 
>>
>> 5280 ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs);
>>
>>  }
>>
>> Thanks for your review.
>> Bob
>>
>>> Just use alloc_workqueue() with __WQ_ORDERED and max_active=1. It can
>>> be wrapped as a new function or macro, but I don't think> 
>>> create_singlethread_workqueue_noorder() is a good name for it.
>>>
>>>>  extern void destroy_workqueue(struct workqueue_struct *wq);
>>>>
>>>>  struct workqueue_attrs *alloc_workqueue_attrs(void);
>>>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>>>> index 4e01c44..2167013 100644
>>>> --- a/kernel/workqueue.c
>>>> +++ b/kernel/workqueue.c
>>>> @@ -4237,7 +4237,9 @@ struct workqueue_struct *alloc_workqueue(const char 
>>>> *fmt,
>>>>  * on NUMA.
>>>>  */
>>>> if ((flags & WQ_UNBOUND) && max_active == 1)
>>>> -   flags |= __WQ_ORDERED;
>>>> +   /* the caller may don't want __WQ_ORDERED to be set 
>>>> implicitly. */
>>>> +   if (!(flags & __WQ_ORDERED_DISABLE))
>>>> +   flags |= __WQ_ORDERED;
>>>>
>>>> /* see the comment above the definition of WQ_POWER_EFFICIENT */
>>>> if ((flags & WQ_POWER_EFFICIENT) && wq_power_efficient)
>>>> --
>>>> 2.9.5
>>>>
>>



Re: [PATCH 1/2] workqueue: don't always set __WQ_ORDERED implicitly

2020-06-28 Thread Bob Liu
On 6/28/20 11:54 PM, Lai Jiangshan wrote:
> On Thu, Jun 11, 2020 at 6:29 PM Bob Liu  wrote:
>>
>> Current code always set 'Unbound && max_active == 1' workqueues to ordered
>> implicitly, while this may be not an expected behaviour for some use cases.
>>
>> E.g some scsi and iscsi workqueues(unbound && max_active = 1) want to be bind
>> to different cpu so as to get better isolation, but their cpumask can't be
>> changed because WQ_ORDERED is set implicitly.
> 
> Hello
> 
> If I read the code correctly, the reason why their cpumask can't
> be changed is because __WQ_ORDERED_EXPLICIT, not __WQ_ORDERED.
> 
>>
>> This patch adds a flag __WQ_ORDERED_DISABLE and also
>> create_singlethread_workqueue_noorder() to offer an new option.
>>
>> Signed-off-by: Bob Liu 
>> ---
>>  include/linux/workqueue.h | 4 
>>  kernel/workqueue.c| 4 +++-
>>  2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
>> index e48554e..4c86913 100644
>> --- a/include/linux/workqueue.h
>> +++ b/include/linux/workqueue.h
>> @@ -344,6 +344,7 @@ enum {
>> __WQ_ORDERED= 1 << 17, /* internal: workqueue is ordered 
>> */
>> __WQ_LEGACY = 1 << 18, /* internal: create*_workqueue() 
>> */
>> __WQ_ORDERED_EXPLICIT   = 1 << 19, /* internal: 
>> alloc_ordered_workqueue() */
>> +   __WQ_ORDERED_DISABLE= 1 << 20, /* internal: don't set 
>> __WQ_ORDERED implicitly */
>>
>> WQ_MAX_ACTIVE   = 512,/* I like 512, better ideas? */
>> WQ_MAX_UNBOUND_PER_CPU  = 4,  /* 4 * #cpus for unbound wq */
>> @@ -433,6 +434,9 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
>>  #define create_singlethread_workqueue(name)\
>> alloc_ordered_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, name)
>>
>> +#define create_singlethread_workqueue_noorder(name)\
>> +   alloc_workqueue("%s", WQ_SYSFS | __WQ_LEGACY | WQ_MEM_RECLAIM | \
>> +   WQ_UNBOUND | __WQ_ORDERED_DISABLE, 1, (name))
> 
> I think using __WQ_ORDERED without __WQ_ORDERED_EXPLICIT is what you
> need, in which case cpumask is allowed to be changed.
> 

I don't think so, see function workqueue_apply_unbound_cpumask():

wq_unbound_cpumask_store()
 > workqueue_set_unbound_cpumask()
   > workqueue_apply_unbound_cpumask() {
 ...
5276 /* creating multiple pwqs breaks ordering guarantee */
5277 if (wq->flags & __WQ_ORDERED)
5278 continue;
  
  Here it will skip applying the cpumask if only __WQ_ORDERED is set.

5280 ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs);

 }

Thanks for your review.
Bob

> Just use alloc_workqueue() with __WQ_ORDERED and max_active=1. It can
> be wrapped as a new function or macro, but I don't think> 
> create_singlethread_workqueue_noorder() is a good name for it.
> 
>>  extern void destroy_workqueue(struct workqueue_struct *wq);
>>
>>  struct workqueue_attrs *alloc_workqueue_attrs(void);
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index 4e01c44..2167013 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -4237,7 +4237,9 @@ struct workqueue_struct *alloc_workqueue(const char 
>> *fmt,
>>  * on NUMA.
>>  */
>> if ((flags & WQ_UNBOUND) && max_active == 1)
>> -   flags |= __WQ_ORDERED;
>> +   /* the caller may don't want __WQ_ORDERED to be set 
>> implicitly. */
>> +   if (!(flags & __WQ_ORDERED_DISABLE))
>> +   flags |= __WQ_ORDERED;
>>
>> /* see the comment above the definition of WQ_POWER_EFFICIENT */
>> if ((flags & WQ_POWER_EFFICIENT) && wq_power_efficient)
>> --
>> 2.9.5
>>



Re: [PATCH 1/2] workqueue: don't always set __WQ_ORDERED implicitly

2020-06-21 Thread Bob Liu
ping..

On 6/11/20 6:07 PM, Bob Liu wrote:
> Current code always set 'Unbound && max_active == 1' workqueues to ordered
> implicitly, while this may be not an expected behaviour for some use cases.
> 
> E.g some scsi and iscsi workqueues(unbound && max_active = 1) want to be bind
> to different cpu so as to get better isolation, but their cpumask can't be
> changed because WQ_ORDERED is set implicitly.
> 
> This patch adds a flag __WQ_ORDERED_DISABLE and also
> create_singlethread_workqueue_noorder() to offer an new option.
> 
> Signed-off-by: Bob Liu 
> ---
>  include/linux/workqueue.h | 4 
>  kernel/workqueue.c| 4 +++-
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index e48554e..4c86913 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -344,6 +344,7 @@ enum {
>   __WQ_ORDERED= 1 << 17, /* internal: workqueue is ordered */
>   __WQ_LEGACY = 1 << 18, /* internal: create*_workqueue() */
>   __WQ_ORDERED_EXPLICIT   = 1 << 19, /* internal: 
> alloc_ordered_workqueue() */
> + __WQ_ORDERED_DISABLE= 1 << 20, /* internal: don't set __WQ_ORDERED 
> implicitly */
>  
>   WQ_MAX_ACTIVE   = 512,/* I like 512, better ideas? */
>   WQ_MAX_UNBOUND_PER_CPU  = 4,  /* 4 * #cpus for unbound wq */
> @@ -433,6 +434,9 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
>  #define create_singlethread_workqueue(name)  \
>   alloc_ordered_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, name)
>  
> +#define create_singlethread_workqueue_noorder(name)  \
> + alloc_workqueue("%s", WQ_SYSFS | __WQ_LEGACY | WQ_MEM_RECLAIM | \
> + WQ_UNBOUND | __WQ_ORDERED_DISABLE, 1, (name))
>  extern void destroy_workqueue(struct workqueue_struct *wq);
>  
>  struct workqueue_attrs *alloc_workqueue_attrs(void);
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 4e01c44..2167013 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -4237,7 +4237,9 @@ struct workqueue_struct *alloc_workqueue(const char 
> *fmt,
>* on NUMA.
>*/
>   if ((flags & WQ_UNBOUND) && max_active == 1)
> - flags |= __WQ_ORDERED;
> + /* the caller may don't want __WQ_ORDERED to be set implicitly. 
> */
> + if (!(flags & __WQ_ORDERED_DISABLE))
> + flags |= __WQ_ORDERED;
>  
>   /* see the comment above the definition of WQ_POWER_EFFICIENT */
>   if ((flags & WQ_POWER_EFFICIENT) && wq_power_efficient)
> 


[PATCH 1/2] workqueue: don't always set __WQ_ORDERED implicitly

2020-06-11 Thread Bob Liu
Current code always sets 'Unbound && max_active == 1' workqueues to ordered
implicitly, while this may not be the expected behaviour for some use cases.

E.g. some scsi and iscsi workqueues (unbound && max_active == 1) want to be bound
to different cpus so as to get better isolation, but their cpumask can't be
changed because __WQ_ORDERED is set implicitly.

This patch adds a flag __WQ_ORDERED_DISABLE and also
create_singlethread_workqueue_noorder() to offer a new option.

Signed-off-by: Bob Liu 
---
 include/linux/workqueue.h | 4 
 kernel/workqueue.c| 4 +++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index e48554e..4c86913 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -344,6 +344,7 @@ enum {
__WQ_ORDERED= 1 << 17, /* internal: workqueue is ordered */
__WQ_LEGACY = 1 << 18, /* internal: create*_workqueue() */
__WQ_ORDERED_EXPLICIT   = 1 << 19, /* internal: 
alloc_ordered_workqueue() */
+   __WQ_ORDERED_DISABLE= 1 << 20, /* internal: don't set __WQ_ORDERED 
implicitly */
 
WQ_MAX_ACTIVE   = 512,/* I like 512, better ideas? */
WQ_MAX_UNBOUND_PER_CPU  = 4,  /* 4 * #cpus for unbound wq */
@@ -433,6 +434,9 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
 #define create_singlethread_workqueue(name)\
alloc_ordered_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, name)
 
+#define create_singlethread_workqueue_noorder(name)\
+   alloc_workqueue("%s", WQ_SYSFS | __WQ_LEGACY | WQ_MEM_RECLAIM | \
+   WQ_UNBOUND | __WQ_ORDERED_DISABLE, 1, (name))
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
 struct workqueue_attrs *alloc_workqueue_attrs(void);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4e01c44..2167013 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4237,7 +4237,9 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
 * on NUMA.
 */
if ((flags & WQ_UNBOUND) && max_active == 1)
-   flags |= __WQ_ORDERED;
+   /* the caller may don't want __WQ_ORDERED to be set implicitly. 
*/
+   if (!(flags & __WQ_ORDERED_DISABLE))
+   flags |= __WQ_ORDERED;
 
/* see the comment above the definition of WQ_POWER_EFFICIENT */
if ((flags & WQ_POWER_EFFICIENT) && wq_power_efficient)
-- 
2.9.5



[PATCH 2/2] scsi: register sysfs for scsi/iscsi workqueues

2020-06-11 Thread Bob Liu
This patch enables setting cpu affinity through "cpumask" for the following
scsi/iscsi workqueues, so as to get better isolation.
- scsi_wq_*
- scsi_tmf_*
- iscsi_q_xx
- iscsi_eh

Signed-off-by: Bob Liu 
---
 drivers/scsi/hosts.c| 4 ++--
 drivers/scsi/libiscsi.c | 2 +-
 drivers/scsi/scsi_transport_iscsi.c | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 1d669e4..4b9f80d 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -272,7 +272,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct 
device *dev,
if (shost->transportt->create_work_queue) {
snprintf(shost->work_q_name, sizeof(shost->work_q_name),
 "scsi_wq_%d", shost->host_no);
-   shost->work_q = create_singlethread_workqueue(
+   shost->work_q = create_singlethread_workqueue_noorder(
shost->work_q_name);
if (!shost->work_q) {
error = -EINVAL;
@@ -487,7 +487,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template 
*sht, int privsize)
}
 
shost->tmf_work_q = alloc_workqueue("scsi_tmf_%d",
-   WQ_UNBOUND | WQ_MEM_RECLAIM,
+   WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS | 
__WQ_ORDERED_DISABLE,
   1, shost->host_no);
if (!shost->tmf_work_q) {
shost_printk(KERN_WARNING, shost,
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index 70b99c0..6808cf3 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -2627,7 +2627,7 @@ struct Scsi_Host *iscsi_host_alloc(struct 
scsi_host_template *sht,
if (xmit_can_sleep) {
snprintf(ihost->workq_name, sizeof(ihost->workq_name),
"iscsi_q_%d", shost->host_no);
-   ihost->workq = create_singlethread_workqueue(ihost->workq_name);
+   ihost->workq = 
create_singlethread_workqueue_noorder(ihost->workq_name);
if (!ihost->workq)
goto free_host;
}
diff --git a/drivers/scsi/scsi_transport_iscsi.c 
b/drivers/scsi/scsi_transport_iscsi.c
index dfc726f..d07a0e4 100644
--- a/drivers/scsi/scsi_transport_iscsi.c
+++ b/drivers/scsi/scsi_transport_iscsi.c
@@ -4602,7 +4602,7 @@ static __init int iscsi_transport_init(void)
goto unregister_flashnode_bus;
}
 
-   iscsi_eh_timer_workq = create_singlethread_workqueue("iscsi_eh");
+   iscsi_eh_timer_workq = 
create_singlethread_workqueue_noorder("iscsi_eh");
if (!iscsi_eh_timer_workq) {
err = -ENOMEM;
goto release_nls;
-- 
2.9.5



Re: [PATCH 1/1] blk-mq: get ctx in order to handle BLK_MQ_S_INACTIVE in blk_mq_get_tag()

2020-06-02 Thread Bob Liu
On 6/2/20 2:17 PM, Dongli Zhang wrote:
> When a scheduler is set, we hit the below page fault when we offline a cpu.
> 
> [ 1061.007725] BUG: kernel NULL pointer dereference, address: 0040
> [ 1061.008710] #PF: supervisor read access in kernel mode
> [ 1061.009492] #PF: error_code(0x) - not-present page
> [ 1061.010241] PGD 0 P4D 0
> [ 1061.010614] Oops:  [#1] SMP PTI
> [ 1061.011130] CPU: 0 PID: 122 Comm: kworker/0:1H Not tainted 5.7.0-rc7+ #2'
> ... ...
> [ 1061.013760] Workqueue: kblockd blk_mq_run_work_fn
> [ 1061.014446] RIP: 0010:blk_mq_put_tag+0xf/0x30
> ... ...
> [ 1061.017726] RSP: 0018:a5c18037fc70 EFLAGS: 00010287
> [ 1061.018475] RAX:  RBX: a5c18037fcf0 RCX: 
> 0004
> [ 1061.019507] RDX:  RSI:  RDI: 
> 911535dc1180
> ... ...
> [ 1061.028454] Call Trace:
> [ 1061.029307]  blk_mq_get_tag+0x26e/0x280
> [ 1061.029866]  ? wait_woken+0x80/0x80
> [ 1061.030378]  blk_mq_get_driver_tag+0x99/0x110
> [ 1061.031009]  blk_mq_dispatch_rq_list+0x107/0x5e0
> [ 1061.031672]  ? elv_rb_del+0x1a/0x30
> [ 1061.032178]  blk_mq_do_dispatch_sched+0xe2/0x130
> [ 1061.032844]  __blk_mq_sched_dispatch_requests+0xcc/0x150
> [ 1061.033638]  blk_mq_sched_dispatch_requests+0x2b/0x50
> [ 1061.034239]  __blk_mq_run_hw_queue+0x75/0x110
> [ 1061.034867]  process_one_work+0x15c/0x370
> [ 1061.035450]  worker_thread+0x44/0x3d0
> [ 1061.035980]  kthread+0xf3/0x130
> [ 1061.036440]  ? max_active_store+0x80/0x80
> [ 1061.037018]  ? kthread_bind+0x10/0x10
> [ 1061.037554]  ret_from_fork+0x35/0x40
> [ 1061.038073] Modules linked in:
> [ 1061.038543] CR2: 0040
> [ 1061.038962] ---[ end trace d20e1df7d028e69f ]---
> 
> This is because blk_mq_get_driver_tag() would be used to allocate tag once
> scheduler (e.g., mq-deadline) is set. However, in order to handle
> BLK_MQ_S_INACTIVE in blk_mq_get_tag(), we need to set data->ctx for
> blk_mq_put_tag().
> 
> Fixes: bf0beec0607db3c6 ("blk-mq: drain I/O when all CPUs in a hctx are 
> offline")
> Signed-off-by: Dongli Zhang 
> ---
> This is based on for-next because currently the pull request for v5.8 is
> not picked by mainline.
> 
>  block/blk-mq.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 9a36ac1c1fa1..8bf6c06a86c1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1056,6 +1056,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
>  {
>   struct blk_mq_alloc_data data = {
>   .q = rq->q,
> + .ctx = rq->mq_ctx,
>   .hctx = rq->mq_hctx,
>   .flags = BLK_MQ_REQ_NOWAIT,
>   .cmd_flags = rq->cmd_flags,
> 

Nice catch!
Reviewed-by: Bob Liu 



Re: [PATCH] block/bio-integrity: don't free 'buf' if bio_integrity_add_page() failed

2020-06-02 Thread Bob Liu
On 6/1/20 8:38 PM, yu kuai wrote:
> commit e7bf90e5afe3 ("block/bio-integrity: fix a memory leak bug") added a
> kfree() for 'buf' if bio_integrity_add_page() returns '0'. However, the
> object will be freed in bio_integrity_free() since 'bio->bi_opf' and
> 'bio->bi_integrity' were set previously in bio_integrity_alloc().
> 
> Fixes: commit e7bf90e5afe3 ("block/bio-integrity: fix a memory leak bug")
> Signed-off-by: yu kuai 
> ---
>  block/bio-integrity.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index bf62c25cde8f..ae07dd78e951 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -278,7 +278,6 @@ bool bio_integrity_prep(struct bio *bio)
>  
>   if (ret == 0) {
>   printk(KERN_ERR "could not attach integrity payload\n");
> - kfree(buf);
>   status = BLK_STS_RESOURCE;
>   goto err_end_io;
>   }
> 

Looks good to me.
Reviewed-by: Bob Liu 



Re: [RFC PATCH v1 0/7] Block/XFS: Support alternative mirror device retry

2018-12-08 Thread Bob Liu
On 11/28/18 3:45 PM, Christoph Hellwig wrote:
> On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
>>  - how does propagation through stacked layers work?
> 
> The only way it works is by each layering driving it.  Thus my
> recommendation above bilding on your earlier one to use an index
> that is filled by the driver at I/O completion time.
> 
> E.g.
> 
>   bio_init:   bi_leg = -1
> 
>   raid1:  submit bio to lower driver
>   raid 1 completion:  set bi_leg to 0 or 1
> 
> Now if we want to allow stacking we need to save/restore bi_leg
> before submitting to the underlying device.  Which is possible,
> but quite a bit of work in the drivers.
> 

I found it's still very challenging while writing the code.
Saving/restoring bi_leg may not be enough because the drivers don't know how to do
fs-metadata verification.

E.g two layer raid1 stacking

fs:                md0 (copies:2)
                   /            \
layer1/raid1   md1 (copies:2)   md2 (copies:2)
                /    \           /    \
layer2/raid1  dev0   dev1      dev2   dev3

Assume dev2 is corrupted:
 => md2: doesn't know how to do fs-metadata verification.
   => md0: fs verification fails, so it retries md1 (preserving md2).
Then md2 will never be retried, even though dev3 may also have the right copy.
Unless the upper layer device (md0) can know that the number of copies is 4 instead of
2? And we would need a way to handle the mapping.
Did I miss something? Thanks!

-Bob

>>  - is it generic/abstract enough to be able to work with
>>RAID5/6 to trigger verification/recovery from the parity
>>information in the stripe?
> 
> If we get the non -1 bi_leg for paritity raid this is an inidicator
> that parity rebuild needs to happen.  For multi-parity setups we could
> also use different levels there.
> 
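
To make the save/restore idea being discussed a bit more concrete, here is a tiny
userspace toy model (plain C, not kernel code; "bi_leg" is only a *proposed* bio
field from this thread, and the layer names are made up). It only shows the
mechanics of saving and restoring a per-bio leg index across two stacked raid1
layers, not the fs-metadata verification problem described above:

#include <stdio.h>

struct toy_bio { int bi_leg; };          /* -1 = not filled in yet */

/* lower raid1 layer: its completion records which of *its* legs was used */
static void lower_raid1_complete(struct toy_bio *b) { b->bi_leg = 1; }

/* upper raid1 layer: save its own leg index, let the lower layer fill in
 * its value, then restore before completing the bio to the filesystem */
static void upper_raid1_submit(struct toy_bio *b)
{
        int saved = b->bi_leg;

        b->bi_leg = -1;
        lower_raid1_complete(b);
        printf("lower layer serviced the I/O with leg %d\n", b->bi_leg);
        b->bi_leg = saved;
}

int main(void)
{
        struct toy_bio b = { .bi_leg = 0 };   /* upper layer already chose leg 0 */

        upper_raid1_submit(&b);
        printf("filesystem sees leg %d\n", b.bi_leg);
        return 0;
}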



Re: [PATCH 1/2] mm: Add kernel MMU notifier to manage IOTLB/DEVTLB

2017-12-13 Thread Bob Liu
On 2017/12/14 11:38, Lu Baolu wrote:
> Hi,
> 
> On 12/14/2017 11:10 AM, Bob Liu wrote:
>> On 2017/12/14 9:02, Lu Baolu wrote:
>>>> From: Huang Ying <ying.hu...@intel.com>
>>>>
>>>> Shared Virtual Memory (SVM) allows a kernel memory mapping to be
>>>> shared between CPU and a device which requested a supervisor
>>>> PASID. Both devices and IOMMU units have TLBs that cache entries
>>>> from CPU's page tables. We need to get a chance to flush them at
>>>> the same time when we flush the CPU TLBs.
>>>>
>>>> We already have an existing MMU notifiers for userspace updates,
>>>> however we lack the same thing for kernel page table updates. To
>> Sorry, I didn't get which situation need this notification.
>> Could you please describe the full scenario?
> 
> Okay.
> 
> 1. When an SVM capable driver calls intel_svm_bind_mm() with
> SVM_FLAG_SUPERVISOR_MODE set in the @flags, the kernel
> memory page mappings will be shared between CPUs and
> the DMA remapping agent (a.k.a. IOMMU). The page table
> entries will also be cached in both IOTLB (located in IOMMU)
> and the DEVTLB (located in device).
> 

But who/what kind of real device has the requirement to access a kernel VA?
Looks like SVM_FLAG_SUPERVISOR_MODE is used by nobody?

Cheers,
Liubo

> 2. When vmalloc/vfree interfaces are called, the page mappings
> for kernel memory might get changed. And current code calls
> flush_tlb_kernel_range() to flush CPU TLBs only. The IOTLB or
> DevTLB will be stale compared to that on the cpu for kernel
> mappings.
> 
> We need a kernel mmu notification to flush TLBs in IOMMU and
> devices as well.
> 
> Best regards,
> Lu Baolu
> 
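
For reference, point 1. above refers to something along the lines of the sketch
below (illustration only; it assumes the intel-svm.h interface of that era, and
the device pointer is whatever the SVM-capable driver owns):

#include <linux/device.h>
#include <linux/intel-svm.h>

/* bind the *kernel* page tables to the device: supervisor PASID */
static int example_bind_kernel_mm(struct device *dev)
{
        int pasid;
        int ret;

        ret = intel_svm_bind_mm(dev, &pasid, SVM_FLAG_SUPERVISOR_MODE, NULL);
        if (ret)
                return ret;

        /* from here on the device may DMA using kernel virtual addresses,
         * which is why vmalloc/vfree updates must also reach the IOTLB/DevTLB */
        return 0;
}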




Re: [PATCH 1/2] mm: Add kernel MMU notifier to manage IOTLB/DEVTLB

2017-12-13 Thread Bob Liu
On 2017/12/14 9:02, Lu Baolu wrote:
> From: Huang Ying 
> 
> Shared Virtual Memory (SVM) allows a kernel memory mapping to be
> shared between CPU and a device which requested a supervisor
> PASID. Both devices and IOMMU units have TLBs that cache entries
> from CPU's page tables. We need to get a chance to flush them at
> the same time when we flush the CPU TLBs.
> 
> We already have an existing MMU notifiers for userspace updates,
> however we lack the same thing for kernel page table updates. To

Sorry, I didn't get which situation needs this notification.
Could you please describe the full scenario?

Thanks,
Liubo

> implement the MMU notification mechanism for the kernel address
> space, a kernel MMU notifier chain is defined and will be called
> whenever the CPU TLB is flushed for the kernel address space.
> 
> As consumer of this notifier, the IOMMU SVM implementations will
> register callbacks on this notifier and manage the cache entries
> in both IOTLB and DevTLB.
> 
> Cc: Ashok Raj 
> Cc: Dave Hansen 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: Andy Lutomirski 
> Cc: Rik van Riel 
> Cc: Kees Cook 
> Cc: Andrew Morton 
> Cc: Kirill A. Shutemov 
> Cc: Matthew Wilcox 
> Cc: Dave Jiang 
> Cc: Michal Hocko 
> Cc: Paul E. McKenney 
> Cc: Vegard Nossum 
> Cc: x...@kernel.org
> Cc: linux...@kvack.org
> 
> Tested-by: CQ Tang 
> Signed-off-by: Huang Ying 
> Signed-off-by: Lu Baolu 
> ---
>  arch/x86/mm/tlb.c|  2 ++
>  include/linux/mmu_notifier.h | 33 +
>  mm/mmu_notifier.c| 27 +++
>  3 files changed, 62 insertions(+)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 3118392cd..5ff104f 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -6,6 +6,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -567,6 +568,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned 
> long end)
>   info.end = end;
>   on_each_cpu(do_kernel_range_flush, &info, 1);
>   }
> + kernel_mmu_notifier_invalidate_range(start, end);
>  }
>  
>  void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index b25dc9d..44d7c06 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -408,6 +408,25 @@ extern void mmu_notifier_call_srcu(struct rcu_head *rcu,
>  void (*func)(struct rcu_head *rcu));
>  extern void mmu_notifier_synchronize(void);
>  
> +struct kernel_mmu_address_range {
> + unsigned long start;
> + unsigned long end;
> +};
> +
> +/*
> + * Before the virtual address range managed by kernel (vmalloc/kmap)
> + * is reused, That is, remapped to the new physical addresses, the
> + * kernel MMU notifier will be called with KERNEL_MMU_INVALIDATE_RANGE
> + * and struct kernel_mmu_address_range as parameters.  This is used to
> + * manage the remote TLB.
> + */
> +#define KERNEL_MMU_INVALIDATE_RANGE  1
> +extern int kernel_mmu_notifier_register(struct notifier_block *nb);
> +extern int kernel_mmu_notifier_unregister(struct notifier_block *nb);
> +
> +extern int kernel_mmu_notifier_invalidate_range(unsigned long start,
> + unsigned long end);
> +
>  #else /* CONFIG_MMU_NOTIFIER */
>  
>  static inline int mm_has_notifiers(struct mm_struct *mm)
> @@ -474,6 +493,20 @@ static inline void mmu_notifier_mm_destroy(struct 
> mm_struct *mm)
>  #define pudp_huge_clear_flush_notify pudp_huge_clear_flush
>  #define set_pte_at_notify set_pte_at
>  
> +static inline int kernel_mmu_notifier_register(struct notifier_block *nb)
> +{
> + return 0;
> +}
> +
> +static inline int kernel_mmu_notifier_unregister(struct notifier_block *nb)
> +{
> + return 0;
> +}
> +
> +static inline void kernel_mmu_notifier_invalidate_range(unsigned long start,
> + unsigned long end)
> +{
> +}
>  #endif /* CONFIG_MMU_NOTIFIER */
>  
>  #endif /* _LINUX_MMU_NOTIFIER_H */
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 96edb33..52f816a 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -393,3 +393,30 @@ void mmu_notifier_unregister_no_release(struct 
> mmu_notifier *mn,
>   mmdrop(mm);
>  }
>  EXPORT_SYMBOL_GPL(mmu_notifier_unregister_no_release);
> +
> +static ATOMIC_NOTIFIER_HEAD(kernel_mmu_notifier_list);
> +
> +int kernel_mmu_notifier_register(struct notifier_block *nb)
> +{
> + return atomic_notifier_chain_register(&kernel_mmu_notifier_list, nb);
> +}
> +EXPORT_SYMBOL_GPL(kernel_mmu_notifier_register);
> +
> +int kernel_mmu_notifier_unregister(struct notifier_block *nb)
> +{
> + return atomic_notifier_chain_unregister(&kernel_mmu_notifier_list, nb);
> +}
> +EXPORT_SYMBOL_GPL(kernel_mmu_notifier_unregister);
> +
> +int kernel_mmu_notifier_invalidate_range(unsigned long start,
> +  
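
As a consumer of the proposed notifier, an IOMMU SVM implementation would register
roughly like the sketch below. This is illustration against the patch in this
thread only (the kernel_mmu_* interface is not an upstream API), and
example_flush_iotlb_range() is a placeholder for the real IOTLB/DevTLB flush:

#include <linux/init.h>
#include <linux/notifier.h>
#include <linux/mmu_notifier.h>        /* as modified by this patch */

static void example_flush_iotlb_range(unsigned long start, unsigned long end)
{
        /* placeholder: flush IOTLB/DevTLB entries covering [start, end) */
}

static int example_kernel_mmu_notify(struct notifier_block *nb,
                                     unsigned long action, void *data)
{
        struct kernel_mmu_address_range *range = data;

        if (action == KERNEL_MMU_INVALIDATE_RANGE)
                example_flush_iotlb_range(range->start, range->end);

        return NOTIFY_OK;
}

static struct notifier_block example_kernel_mmu_nb = {
        .notifier_call = example_kernel_mmu_notify,
};

static int __init example_svm_init(void)
{
        return kernel_mmu_notifier_register(&example_kernel_mmu_nb);
}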

Re: [RFC PATCH] mm, oom_reaper: gather each vma to prevent leaking TLB entry

2017-11-05 Thread Bob Liu
On Mon, Nov 6, 2017 at 11:36 AM, Wang Nan <wangn...@huawei.com> wrote:
> tlb_gather_mmu(&tlb, mm, 0, -1) means gathering all virtual memory space.
> In this case, tlb->fullmm is true. Some archs like arm64 don't flush
> TLB when tlb->fullmm is true:
>
>   commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1").
>

CC'ed Will Deacon.

> Which leaks TLB entries. For example, when oom_reaper
> selects a task and reaps its virtual memory space, another thread
> in this task group may still be running on another core and access
> this already-freed memory through TLB entries.
>
> This patch gathers each vma instead of gathering the full vm space,
> so tlb->fullmm is not true. The behavior of the oom reaper becomes similar
> to munmapping before do_exit, which should be safe for all archs.
>
> Signed-off-by: Wang Nan <wangn...@huawei.com>
> Cc: Bob Liu <liub...@huawei.com>
> Cc: Michal Hocko <mho...@suse.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Michal Hocko <mho...@suse.com>
> Cc: David Rientjes <rient...@google.com>
> Cc: Ingo Molnar <mi...@kernel.org>
> Cc: Roman Gushchin <g...@fb.com>
> Cc: Konstantin Khlebnikov <khlebni...@yandex-team.ru>
> Cc: Andrea Arcangeli <aarca...@redhat.com>
> ---
>  mm/oom_kill.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index dee0f75..18c5b35 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -532,7 +532,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, 
> struct mm_struct *mm)
>  */
> set_bit(MMF_UNSTABLE, &mm->flags);
>
> -   tlb_gather_mmu(&tlb, mm, 0, -1);
> for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> if (!can_madv_dontneed_vma(vma))
> continue;
> @@ -547,11 +546,13 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, 
> struct mm_struct *mm)
>  * we do not want to block exit_mmap by keeping mm ref
>  * count elevated without a good reason.
>  */
> -   if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> +   if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
> +   tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
> unmap_page_range(&tlb, vma, vma->vm_start, 
> vma->vm_end,
>  NULL);
> +   tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);
> +   }
> }
> -   tlb_finish_mmu(&tlb, 0, -1);
> pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, 
> file-rss:%lukB, shmem-rss:%lukB\n",
> task_pid_nr(tsk), tsk->comm,
> K(get_mm_counter(mm, MM_ANONPAGES)),



Re: [PATCH v2 03/16] iommu: introduce iommu invalidate API function

2017-10-12 Thread Bob Liu
On 2017/10/12 17:50, Liu, Yi L wrote:
> 
> 
>> -Original Message-----
>> From: Bob Liu [mailto:liub...@huawei.com]
>> Sent: Thursday, October 12, 2017 5:39 PM
>> To: Jean-Philippe Brucker <jean-philippe.bruc...@arm.com>; Joerg Roedel
>> <j...@8bytes.org>; Liu, Yi L <yi.l@intel.com>
>> Cc: Lan, Tianyu <tianyu@intel.com>; Liu, Yi L 
>> <yi.l@linux.intel.com>; Greg
>> Kroah-Hartman <gre...@linuxfoundation.org>; Wysocki, Rafael J
>> <rafael.j.wyso...@intel.com>; LKML <linux-kernel@vger.kernel.org>;
>> io...@lists.linux-foundation.org; David Woodhouse <dw...@infradead.org>
>> Subject: Re: [PATCH v2 03/16] iommu: introduce iommu invalidate API function
>>
>> On 2017/10/11 20:48, Jean-Philippe Brucker wrote:
>>> On 11/10/17 13:15, Joerg Roedel wrote:
>>>> On Wed, Oct 11, 2017 at 11:54:52AM +, Liu, Yi L wrote:
>>>>> I didn't quite get 'iovm' mean. Can you explain a bit about the idea?
>>>>
>>>> It's short for IO Virtual Memory, basically a replacement term for 'svm'
>>>> that is not ambiguous (afaik) and not specific to Intel.
>>>
>>> I wonder if SVM originated in OpenCL first, rather than intel? That's
>>> why I'm using it, but it is ambiguous. I'm not sure IOVM is precise
>>> enough though, since the name could as well be used without shared
>>> tables, for classical map/unmap and IOVAs. Kevin Tian suggested SVA
>>> "Shared Virtual Addressing" last time, which is a little more clear
>>> than SVM and isn't used elsewhere in the kernel either.
>>>
>>
>> The process "vaddr" can be the same as "IOVA" by using the classical 
>> map/unmap
>> way.
>> This is also a kind of share virtual memory/address(except have to pin 
>> physical
>> memory).
>> How to distinguish these two different implementation of "share virtual
>> memory/address"?
>>
> [Liu, Yi L] Not sure if I get your idea well. Process "vaddr" is owned by 
> process and
> maintained by mmu, while "IOVA" is maintained by iommu. So they are different 
> in the
> way they are maintained. Since process "vaddr" is maintained by mmu and then 
> used by
> iommu, so we call it shared virtual memory/address. This is how "shared" term 
> comes.

I think from the application's point of view, shared virtual memory/address (or
Nvidia-CUDA unified virtual address) works like this:

1. vaddr = malloc(); e.g vaddr=0x1
2. the device can get the same data (accessing the same physical memory) through
the same address, e.g. 0x1, and doesn't care whether it's a vaddr or an IOVA.
(Actually in the Nvidia-CUDA case the data will be migrated between system DDR and
GPU memory, but the vaddr is always the same for CPU and GPU.)

So there are two ways (besides the Nvidia way) to implement this requirement:
1)
get the physical memory backing vaddr;
dma_map that paddr to an iova;
if we appoint iova = vaddr (e.g. the iova can be controlled by the user space driver
through the vfio DMA_MAP ioctl),
this can also be called sharing a virtual address between the CPU process and the device.

2)
The second way is what this RFC did.
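
A minimal userspace sketch of way 1) above -- choosing the IOVA equal to the
process vaddr through the VFIO type1 map path. It assumes 'container' is an
already opened and configured VFIO container fd; everything else is standard
VFIO_IOMMU_MAP_DMA usage:

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <stddef.h>
#include <string.h>

static int map_iova_equal_to_vaddr(int container, void *buf, size_t size)
{
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        map.vaddr = (unsigned long)buf;   /* process virtual address */
        map.iova  = (unsigned long)buf;   /* IOVA chosen equal to the vaddr */
        map.size  = size;

        /* pins the backing memory and programs the IOMMU with iova == vaddr */
        return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}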




Re: [PATCH v2 03/16] iommu: introduce iommu invalidate API function

2017-10-12 Thread Bob Liu
On 2017/10/11 20:48, Jean-Philippe Brucker wrote:
> On 11/10/17 13:15, Joerg Roedel wrote:
>> On Wed, Oct 11, 2017 at 11:54:52AM +, Liu, Yi L wrote:
>>> I didn't quite get 'iovm' mean. Can you explain a bit about the idea?
>>
>> It's short for IO Virtual Memory, basically a replacement term for 'svm'
>> that is not ambiguous (afaik) and not specific to Intel.
> 
> I wonder if SVM originated in OpenCL first, rather than intel? That's why
> I'm using it, but it is ambiguous. I'm not sure IOVM is precise enough
> though, since the name could as well be used without shared tables, for
> classical map/unmap and IOVAs. Kevin Tian suggested SVA "Shared Virtual
> Addressing" last time, which is a little more clear than SVM and isn't
> used elsewhere in the kernel either.
> 

The process "vaddr" can be the same as "IOVA" by using the classical map/unmap 
way.
This is also a kind of share virtual memory/address(except have to pin physical 
memory).
How to distinguish these two different implementation of "share virtual 
memory/address"?

--
Regards,
Liubo



Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5

2017-10-11 Thread Bob Liu
On Sun, Oct 1, 2017 at 6:49 AM, Jerome Glisse <jgli...@redhat.com> wrote:
> On Sat, Sep 30, 2017 at 10:57:38AM +0800, Bob Liu wrote:
>> On 2017/9/27 0:16, Jerome Glisse wrote:
>> > On Tue, Sep 26, 2017 at 05:56:26PM +0800, Bob Liu wrote:
>> >> On Tue, Sep 12, 2017 at 7:36 AM, Jerome Glisse <jgli...@redhat.com> wrote:
>> >>> On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
>> >>>> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <jgli...@redhat.com> 
>> >>>> wrote:
>> >>>>> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
>> >>>>>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jgli...@redhat.com> 
>> >>>>>> wrote:
>> [...]
>> >>>>> So i pushed a branch with WIP for nouveau to use HMM:
>> >>>>>
>> >>>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
>> >>>>>
>> >>>>
>> >>>> Nice to see that.
>> >>>> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
>> >>>> Device memory directly without extra copy.
>> >>>
>> >>> Yes nouveau CDM support on PPC (which is the only CDM platform 
>> >>> commercialy
>> >>> available today) is on the TODO list. Note that the driver changes for 
>> >>> CDM
>> >>> are minimal (probably less than 100 lines of code). From the driver point
>> >>> of view this is memory and it doesn't matter if it is CDM or not.
>> >>>
>> >>
>> >> It seems have to migrate/copy memory between system-memory and
>> >> device-memory even in HMM-CDM solution.
>> >> Because device-memory is not added into buddy system, the page fault
>> >> for normal malloc() always allocate memory from system-memory!!
>> >> If the device then access the same virtual address, the data is copied
>> >> to device-memory.
>> >>
>> >> Correct me if I misunderstand something.
>> >> @Balbir, how do you plan to make zero-copy work if using HMM-CDM?
>> >
>> > Device can access system memory so copy to device is _not_ mandatory. 
>> > Copying
>> > data to device is for performance only ie the device driver take hint from
>> > userspace and monitor device activity to decide which memory should be 
>> > migrated
>> > to device memory to maximize performance.
>> >
>> > Moreover in some previous version of the HMM patchset we had an helper that
>>
>> Could you point in which version? I'd like to have a look.
>
> I will need to dig in.
>

Thank you.

>>
>> > allowed to directly allocate device memory on device page fault. I intend 
>> > to
>> > post this helper again. With that helper you can have zero copy when device
>> > is the first to access the memory.
>> >
>> > Plan is to get what we have today work properly with the open source driver
>> > and make it perform well. Once we get some experience with real workload we
>> > might look into allowing CPU page fault to be directed to device memory but
>> > at this time i don't think we need this.
>> >
>>
>> For us, we need this feature that CPU page fault can be direct to device 
>> memory.
>> So that don't need to copy data from system memory to device memory.
>> Do you have any suggestion on the implementation? I'll try to make a 
>> prototype patch.
>
> Why do you need that ? What is the device and what are the requirement ?
>

You may think of it as a CCIX or CAPI device.
The requirement is to eliminate any extra copy.
A typical use case/requirement is that malloc() and madvise() allocate from
device memory, then the CPU writes data to device memory directly and
triggers the device to read the data / do the calculation.
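
As a rough illustration of this use case, a minimal user-space sketch follows.
MADV_DEVICE_MEM and dev_submit_job() are hypothetical names used only to show
the intended pattern; no such madvise() advice value exists upstream:

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define MADV_DEVICE_MEM 100		/* hypothetical advice value */

int main(void)
{
	size_t len = 1UL << 20;
	char *buf = malloc(len);	/* plain anonymous memory */

	if (!buf)
		return 1;

	/* Hint that this range should be backed by (cache coherent) device
	 * memory, so the CPU writes below land there directly, no extra copy. */
	madvise(buf, len, MADV_DEVICE_MEM);

	memset(buf, 0xab, len);		/* CPU writes straight to device memory */
	/* dev_submit_job(buf, len);	   hypothetical: device reads the data */

	free(buf);
	return 0;
}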

-- 
Regards,
--Bob


Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5

2017-09-29 Thread Bob Liu
On 2017/9/27 0:16, Jerome Glisse wrote:
> On Tue, Sep 26, 2017 at 05:56:26PM +0800, Bob Liu wrote:
>> On Tue, Sep 12, 2017 at 7:36 AM, Jerome Glisse <jgli...@redhat.com> wrote:
>>> On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
>>>> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <jgli...@redhat.com> wrote:
>>>>> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
>>>>>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jgli...@redhat.com> 
>>>>>> wrote:
[...]
>>>>> So i pushed a branch with WIP for nouveau to use HMM:
>>>>>
>>>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
>>>>>
>>>>
>>>> Nice to see that.
>>>> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
>>>> Device memory directly without extra copy.
>>>
>>> Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
>>> available today) is on the TODO list. Note that the driver changes for CDM
>>> are minimal (probably less than 100 lines of code). From the driver point
>>> of view this is memory and it doesn't matter if it is CDM or not.
>>>
>>
>> It seems have to migrate/copy memory between system-memory and
>> device-memory even in HMM-CDM solution.
>> Because device-memory is not added into buddy system, the page fault
>> for normal malloc() always allocate memory from system-memory!!
>> If the device then access the same virtual address, the data is copied
>> to device-memory.
>>
>> Correct me if I misunderstand something.
>> @Balbir, how do you plan to make zero-copy work if using HMM-CDM?
> 
> Device can access system memory so copy to device is _not_ mandatory. Copying
> data to device is for performance only ie the device driver take hint from
> userspace and monitor device activity to decide which memory should be 
> migrated
> to device memory to maximize performance.
> 
> Moreover in some previous version of the HMM patchset we had an helper that

Could you point out which version? I'd like to have a look.

> allowed to directly allocate device memory on device page fault. I intend to
> post this helper again. With that helper you can have zero copy when device
> is the first to access the memory.
> 
> Plan is to get what we have today work properly with the open source driver
> and make it perform well. Once we get some experience with real workload we
> might look into allowing CPU page fault to be directed to device memory but
> at this time i don't think we need this.
> 

For us, we need this feature so that a CPU page fault can be directed to device
memory; then there is no need to copy data from system memory to device memory.
Do you have any suggestion on the implementation? I'll try to make a prototype
patch.

--
Thanks,
Bob



Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5

2017-09-26 Thread Bob Liu
On Tue, Sep 12, 2017 at 7:36 AM, Jerome Glisse <jgli...@redhat.com> wrote:
> On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
>> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <jgli...@redhat.com> wrote:
>> > On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
>> >> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jgli...@redhat.com> wrote:
>> >> > On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
>> >> >> On 2017/7/20 23:03, Jerome Glisse wrote:
>> >> >> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>> >> >> >> On 2017/7/19 10:25, Jerome Glisse wrote:
>> >> >> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>> >> >> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>> >> >> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>> >> >> >>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>> >
>> > [...]
>> >
>> >> >> > Second device driver are not integrated that closely within mm and 
>> >> >> > the
>> >> >> > scheduler kernel code to allow to efficiently plug in device access
>> >> >> > notification to page (ie to update struct page so that numa worker
>> >> >> > thread can migrate memory base on accurate informations).
>> >> >> >
>> >> >> > Third it can be hard to decide who win between CPU and device access
>> >> >> > when it comes to updating thing like last CPU id.
>> >> >> >
>> >> >> > Fourth there is no such thing like device id ie equivalent of CPU id.
>> >> >> > If we were to add something the CPU id field in flags of struct page
>> >> >> > would not be big enough so this can have repercusion on struct page
>> >> >> > size. This is not an easy sell.
>> >> >> >
>> >> >> > They are other issues i can't think of right now. I think for now it
>> >> >>
>> >> >> My opinion is most of the issues are the same no matter use CDM or 
>> >> >> HMM-CDM.
>> >> >> I just care about a more complete solution no matter CDM,HMM-CDM or 
>> >> >> other ways.
>> >> >> HMM or HMM-CDM depends on device driver, but haven't see a public/full 
>> >> >> driver to
>> >> >> demonstrate the whole solution works fine.
>> >> >
>> >> > I am working with NVidia close source driver team to make sure that it 
>> >> > works
>> >> > well for them. I am also working on nouveau open source driver for same 
>> >> > NVidia
>> >> > hardware thought it will be of less use as what is missing there is a 
>> >> > solid
>> >> > open source userspace to leverage this. Nonetheless open source driver 
>> >> > are in
>> >> > the work.
>> >>
>> >> Can you point to the nouveau patches? I still find these HMM patches
>> >> un-reviewable without an upstream consumer.
>> >
>> > So i pushed a branch with WIP for nouveau to use HMM:
>> >
>> > https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
>> >
>>
>> Nice to see that.
>> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
>> Device memory directly without extra copy.
>
> Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
> available today) is on the TODO list. Note that the driver changes for CDM
> are minimal (probably less than 100 lines of code). From the driver point
> of view this is memory and it doesn't matter if it is CDM or not.
>

It seems memory still has to be migrated/copied between system memory and
device memory even in the HMM-CDM solution.
Because device memory is not added to the buddy system, the page fault
for a normal malloc() always allocates memory from system memory.
If the device then accesses the same virtual address, the data is copied
to device memory.

Correct me if I misunderstand something.
@Balbir, how do you plan to make zero-copy work when using HMM-CDM?

--
Thanks,
Bob


Re: [RFC PATCH 0/6] Add platform device SVM support for ARM SMMUv3

2017-09-12 Thread Bob Liu
On 2017/9/6 17:57, Jean-Philippe Brucker wrote:
> On 06/09/17 02:02, Bob Liu wrote:
>> On 2017/9/5 20:56, Jean-Philippe Brucker wrote:
>>> On 31/08/17 09:20, Yisheng Xie wrote:
>>>> Jean-Philippe has post a patchset for Adding PCIe SVM support to ARM 
>>>> SMMUv3:
>>>> https://www.spinics.net/lists/arm-kernel/msg565155.html
>>>>
>>>> But for some platform devices(aka on-chip integrated devices), there is 
>>>> also
>>>> SVM requirement, which works based on the SMMU stall mode.
>>>> Jean-Philippe has prepared a prototype patchset to support it:
>>>> git://linux-arm.org/linux-jpb.git svm/stall
>>>
>>> Only meant for testing at that point, and unfit even for an RFC.
>>>
>>
>> Sorry for the misunderstanding.
>> The PRI mode patches is in RFC even no hardware for testing, so I thought 
>> it's fine for "Stall mode" patches sent as RFC.
>> We have tested the Stall mode on our platform.
>> Anyway, I should confirm with you in advance.
>>
>> Btw, Would you consider the "stall mode" upstream at first? Since there is 
>> no hardware for testing the PRI mode.
>> (We can provide you the hardware which support SMMU stall mode if necessary.)
> 
> Yes. What's blocking the ATS, PRI and PASID patches at the moment is the
> lack of endpoints for testing. There has been lots of discussion on the
> API side since my first RFC and I'd like to resubmit the API changes soon.
> It is the same API for ATS+PRI+PASID and SSID+Stall, so the backend
> doesn't matter.
> 
> I'm considering upstreaming SSID+Stall first if it can be tested on
> hardware (having direct access to it would certainly speed things up).
> This would require some work in moving the PCI bits at the end of the
> series. I can reserve some time in the coming months to do it, but I need
> to know what to focus on. Are you able to test SSID as well?
> 

Update:
Our current platform device has only one SSID register, so we have to switch it
manually (write a different SSID to that register) if it is to be used by
different processes.

But we're going to have a new platform whose platform device can support
multiple SSIDs.
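
For illustration, a rough sketch of that manual switch follows. The structure,
register offset and accessor are hypothetical; only the single-SSID limitation
reflects our current platform:

#include <linux/io.h>
#include <linux/types.h>

#define MY_PDEV_SSID_REG	0x40	/* hypothetical register offset */

struct my_pdev {
	void __iomem *regs;
};

/* Reprogram the single SSID slot before the device works on behalf of a new
 * process; with multiple SSID slots this manual switch would not be needed. */
static void my_pdev_switch_ssid(struct my_pdev *dev, u32 ssid)
{
	writel(ssid, dev->regs + MY_PDEV_SSID_REG);
}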

Regards,
Bob

>>>> We tested this patchset with some fixes on a on-chip integrated device. The
>>>> basic function is ok, so I just send them out for review, although this
>>>> patchset heavily depends on the former patchset (PCIe SVM support for ARM
>>>> SMMUv3), which is still under discussion.
>>>>
>>>> Patch Overview:
>>>> *1 to 3 prepare for device tree or acpi get the device stall ability and 
>>>> pasid bits
>>>> *4 is to realise the SVM function for platform device
>>>> *5 is fix a bug when test SVM function while SMMU donnot support this 
>>>> feature
>>>> *6 avoid ILLEGAL setting of STE and CD entry about stall
>>>>
>>>> Acctually here, I also have a question about SVM on SMMUv3:
>>>>
>>>> 1. Why the SVM feature on SMMUv3 depends on BTM feature? when bind a task 
>>>> to device,
>>>>it will register a mmu_notify. Therefore, when a page range is invalid, 
>>>> we can
>>>>send TLBI or ATC invalid without BTM?
>>>
>>> We could, but the end goal for SVM is to perfectly mirror the CPU page
>>> tables. So for platform SVM we would like to get rid of MMU notifiers
>>> entirely.
>>>
>>>> 2. According to ACPI IORT spec, named component specific data has a node 
>>>> flags field
>>>>whoes bit0 is for Stall support. However, it do not have any field for 
>>>> pasid bit.
>>>>Can we use other 5 bits[5:1] for pasid bit numbers, so we can have 32 
>>>> pasid bit for
>>>>a single platform device which should be enough, because SMMU only 
>>>> support 20 bit pasid
>>>>
>>
>> Any comment on this?
>> The ACPI IORT spec may need be updated?
> 
> I suppose that the Named Component Node could be used for SSID and stall
> capability bits. Can't the ACPI namespace entries be extended to host
> these capabilities in a more generic way? Platforms with different IOMMUs
> might also need this information some day.
> 




Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5

2017-09-11 Thread Bob Liu
On 2017/9/12 7:36, Jerome Glisse wrote:
> On Sun, Sep 10, 2017 at 07:22:58AM +0800, Bob Liu wrote:
>> On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <jgli...@redhat.com> wrote:
>>> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
>>>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jgli...@redhat.com> wrote:
>>>>> On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
>>>>>> On 2017/7/20 23:03, Jerome Glisse wrote:
>>>>>>> On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>>>>>>>> On 2017/7/19 10:25, Jerome Glisse wrote:
>>>>>>>>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>>>>>>>>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>>>>>>>>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>>>>>>>>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>>>
>>> [...]
>>>
>>>>>>> Second device driver are not integrated that closely within mm and the
>>>>>>> scheduler kernel code to allow to efficiently plug in device access
>>>>>>> notification to page (ie to update struct page so that numa worker
>>>>>>> thread can migrate memory base on accurate informations).
>>>>>>>
>>>>>>> Third it can be hard to decide who win between CPU and device access
>>>>>>> when it comes to updating thing like last CPU id.
>>>>>>>
>>>>>>> Fourth there is no such thing like device id ie equivalent of CPU id.
>>>>>>> If we were to add something the CPU id field in flags of struct page
>>>>>>> would not be big enough so this can have repercusion on struct page
>>>>>>> size. This is not an easy sell.
>>>>>>>
>>>>>>> They are other issues i can't think of right now. I think for now it
>>>>>>
>>>>>> My opinion is most of the issues are the same no matter use CDM or 
>>>>>> HMM-CDM.
>>>>>> I just care about a more complete solution no matter CDM,HMM-CDM or 
>>>>>> other ways.
>>>>>> HMM or HMM-CDM depends on device driver, but haven't see a public/full 
>>>>>> driver to
>>>>>> demonstrate the whole solution works fine.
>>>>>
>>>>> I am working with NVidia close source driver team to make sure that it 
>>>>> works
>>>>> well for them. I am also working on nouveau open source driver for same 
>>>>> NVidia
>>>>> hardware thought it will be of less use as what is missing there is a 
>>>>> solid
>>>>> open source userspace to leverage this. Nonetheless open source driver 
>>>>> are in
>>>>> the work.
>>>>
>>>> Can you point to the nouveau patches? I still find these HMM patches
>>>> un-reviewable without an upstream consumer.
>>>
>>> So i pushed a branch with WIP for nouveau to use HMM:
>>>
>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
>>>
>>
>> Nice to see that.
>> Btw, do you have any plan for a CDM-HMM driver? CPU can write to
>> Device memory directly without extra copy.
> 
> Yes nouveau CDM support on PPC (which is the only CDM platform commercialy
> available today) is on the TODO list. Note that the driver changes for CDM
> are minimal (probably less than 100 lines of code). From the driver point
> of view this is memory and it doesn't matter if it is CDM or not.
> 
> The real burden is on the application developpers who need to update their
> code to leverage this.
> 

Why is it not transparent to the application?
The application just uses the system malloc() and doesn't care whether the data
is copied or not.

> 
> Also as a data point you want to avoid CPU access to CDM device memory as
> much as possible. The overhead for single cache line access are high (this
> is PCIE or derivative protocol and it is a packet protocol).
> 

Thank you for the hint; we are going to follow CDM-HMM since HMM is already
merged upstream.

--
Thanks,
Bob





Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5

2017-09-09 Thread Bob Liu
On Wed, Sep 6, 2017 at 3:36 AM, Jerome Glisse <jgli...@redhat.com> wrote:
> On Thu, Jul 20, 2017 at 08:48:20PM -0700, Dan Williams wrote:
>> On Thu, Jul 20, 2017 at 6:41 PM, Jerome Glisse <jgli...@redhat.com> wrote:
>> > On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
>> >> On 2017/7/20 23:03, Jerome Glisse wrote:
>> >> > On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>> >> >> On 2017/7/19 10:25, Jerome Glisse wrote:
>> >> >>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>> >> >>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>> >> >>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>> >> >>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>
> [...]
>
>> >> > Second device driver are not integrated that closely within mm and the
>> >> > scheduler kernel code to allow to efficiently plug in device access
>> >> > notification to page (ie to update struct page so that numa worker
>> >> > thread can migrate memory base on accurate informations).
>> >> >
>> >> > Third it can be hard to decide who win between CPU and device access
>> >> > when it comes to updating thing like last CPU id.
>> >> >
>> >> > Fourth there is no such thing like device id ie equivalent of CPU id.
>> >> > If we were to add something the CPU id field in flags of struct page
>> >> > would not be big enough so this can have repercusion on struct page
>> >> > size. This is not an easy sell.
>> >> >
>> >> > They are other issues i can't think of right now. I think for now it
>> >>
>> >> My opinion is most of the issues are the same no matter use CDM or 
>> >> HMM-CDM.
>> >> I just care about a more complete solution no matter CDM,HMM-CDM or other 
>> >> ways.
>> >> HMM or HMM-CDM depends on device driver, but haven't see a public/full 
>> >> driver to
>> >> demonstrate the whole solution works fine.
>> >
>> > I am working with NVidia close source driver team to make sure that it 
>> > works
>> > well for them. I am also working on nouveau open source driver for same 
>> > NVidia
>> > hardware thought it will be of less use as what is missing there is a solid
>> > open source userspace to leverage this. Nonetheless open source driver are 
>> > in
>> > the work.
>>
>> Can you point to the nouveau patches? I still find these HMM patches
>> un-reviewable without an upstream consumer.
>
> So i pushed a branch with WIP for nouveau to use HMM:
>
> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
>

Nice to see that.
Btw, do you have any plan for a CDM-HMM driver, where the CPU can write to
device memory directly without an extra copy?

--
Thanks,
Bob Liu

> Top 16 patches are HMM related (implementic logic inside the driver to use
> HMM). The next 16 patches are hardware specific patches and some nouveau
> changes needed to allow page fault.
>
> It is enough to have simple malloc test case working:
>
> https://cgit.freedesktop.org/~glisse/compote
>
> There is 2 program here the old one is existing way you use GPU for compute
> task while the new one is what HMM allow to achieve ie use malloc memory
> directly.
>
>
> I haven't added yet the device memory support it is in work and i will push
> update to this branch and repo for that. Probably next week if no pressing
> bug preempt my time.
>
>
> So there is a lot of ugliness in all this and i don't expect this to be what
> end up upstream. Right now there is a large rework of nouveau vm (virtual
> memory) code happening to rework completely how we do address space management
> within nouveau. This work is prerequisite for a clean implementation for HMM
> inside nouveau (it will also lift the 40bits address space limitation that
> exist today inside nouveau driver). Once that work land i will work on clean
> upstreamable implementation for nouveau to use HMM as well as userspace to
> leverage it (this is requirement for upstream GPU driver to have open source
> userspace that make use of features). All this is a lot of work and there is
> not many people working on this.
>
>
> They are other initiatives under way related to this that i can not talk about
> publicly but if they bare fruit they might help to speedup all this.
>
> Jérôme
>


Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-07 Thread Bob Liu
On 2017/9/8 1:27, Jerome Glisse wrote:
>> On 2017/9/6 10:12, Jerome Glisse wrote:
>>> On Wed, Sep 06, 2017 at 09:25:36AM +0800, Bob Liu wrote:
>>>> On 2017/9/6 2:54, Ross Zwisler wrote:
>>>>> On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
>>>>>> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
>>>>>>> On 2017/9/4 23:51, Jerome Glisse wrote:
>>>>>>>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
>>>>>>>>> On 2017/8/17 8:05, Jérôme Glisse wrote:
> 
> [...]
> 
>>> For HMM each process give hint (somewhat similar to mbind) for range of
>>> virtual address to the device kernel driver (through some API like OpenCL
>>> or CUDA for GPU for instance). All this being device driver specific ioctl.
>>>
>>> The kernel device driver have an overall view of all the process that use
>>> the device and each of the memory advise they gave. From that informations
>>> the kernel device driver decide what part of each process address space to
>>> migrate to device memory.
>>
>> Oh, I mean CDM-HMM.  I'm fine with HMM.
> 
> They are one and the same really. In both cases HMM is just a set of helpers
> for device driver.
> 
>>> This obviously dynamic and likely to change over the process lifetime.
>>>
>>> My understanding is that HMAT want similar API to allow process to give
>>> direction on
>>> where each range of virtual address should be allocated. It is expected
>>> that most
>>
>> Right, but not clear who should manage the physical memory allocation and
>> setup the pagetable mapping. An new driver or the kernel?
> 
> Physical device memory is manage by the kernel device driver as it is today
> and has it will be tomorrow. HMM does not change that, nor does it requires
> any change to that.
> 

Can someone from Intel give more information about the plan for managing
HMAT-reported memory?

> Migrating process memory to or from device is done by the kernel through
> the regular page migration. HMM provides new helper for device driver to
> initiate such migration. There is no mechanisms like auto numa migration
> for the reasons i explain previously.
> 
> Kernel device driver use all knowledge it has to decide what to migrate to
> device memory. Nothing new here either, it is what happens today for special
> allocated device object and it will just happen all the same for regular
> mmap memory (private anonymous or mmap of a regular file of a filesystem).
> 
> 
> So every low level thing happen in the kernel. Userspace only provides
> directive to the kernel device driver through device specific API. But the
> kernel device driver can ignore or override those directive.
> 
> 
>>> software can easily infer what part of its address will need more
>>> bandwidth, smaller
>>> latency versus what part is sparsely accessed ...
>>>
>>> For HMAT i think first target is HBM and persistent memory and device
>>> memory might
>>> be added latter if that make sense.
>>>
>>
>> Okay, so there are two potential ways for CPU-addressable cache-coherent
>> device memory
>> (or cpu-less numa memory or "target domain" memory in ACPI spec )?
>> 1. CDM-HMM
>> 2. HMAT
> 
> No this are 2 orthogonal thing, they do not conflict with each others quite
> the contrary. HMM (the CDM part is no different) is a set of helpers, see
> it as a toolbox, for device driver.
> 
> HMAT is a way for firmware to report memory resources with more informations
> that just range of physical address. HMAT is specific to platform that rely
> on ACPI. HMAT does not provide any helpers to manage these memory.
> 
> So a device driver can get informations about device memory from HMAT and then
> use HMM to help in managing and using this memory.
> 

Yes, but as Balbir mentioned, that requires:
1. Not onlining the memory as a NUMA node
2. Using the HMM-CDM APIs to map the memory to ZONE_DEVICE via the driver
   (a rough sketch of this follows below)

And I'm not sure whether Intel is going to use this HMM-CDM based method for
their "target domain" memory, or whether they prefer the NUMA approach.
Ross? Dan?
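
For reference, a rough sketch of step 2, built around the
hmm_devmem_add_resource() helper this patch adds (names as in the v25 posting;
the ops callback is stubbed and the exact prototypes may differ from the final
API):

#include <linux/device.h>
#include <linux/err.h>
#include <linux/hmm.h>
#include <linux/ioport.h>

static void cdm_page_free(struct hmm_devmem *devmem, struct page *page)
{
	/* return the page to the driver's device-memory allocator */
}

static const struct hmm_devmem_ops cdm_devmem_ops = {
	.free = cdm_page_free,
	/* no .fault: the CPU can address coherent device memory directly */
};

static int cdm_hotplug(struct device *dev, struct resource *cdm_res)
{
	struct hmm_devmem *devmem;

	devmem = hmm_devmem_add_resource(&cdm_devmem_ops, dev, cdm_res);
	if (IS_ERR(devmem))
		return PTR_ERR(devmem);

	/* CDM pages now have struct pages in ZONE_DEVICE and can be used by
	 * the HMM migration helpers, but stay invisible to the normal page
	 * allocator (i.e. the memory is not onlined as a NUMA node). */
	return 0;
}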

--
Thanks,
Bob Liu




Re: [RFC PATCH 0/6] Add platform device SVM support for ARM SMMUv3

2017-09-06 Thread Bob Liu
On 2017/9/6 17:59, Jean-Philippe Brucker wrote:
> On 06/09/17 02:16, Yisheng Xie wrote:
>> Hi Jean-Philippe,
>>
>> On 2017/9/5 20:56, Jean-Philippe Brucker wrote:
>>> On 31/08/17 09:20, Yisheng Xie wrote:
 Jean-Philippe has post a patchset for Adding PCIe SVM support to ARM 
 SMMUv3:
 https://www.spinics.net/lists/arm-kernel/msg565155.html

 But for some platform devices(aka on-chip integrated devices), there is 
 also
 SVM requirement, which works based on the SMMU stall mode.
 Jean-Philippe has prepared a prototype patchset to support it:
 git://linux-arm.org/linux-jpb.git svm/stall
>>>
>>> Only meant for testing at that point, and unfit even for an RFC.
>>
>> Sorry about that, I should ask you before send it out. It's my mistake. For 
>> I also
>> have some question about this patchset.
>>
>> We have related device, and would like to do some help about it. Do you have
>> any plan about upstream ?
>>
>>>
 We tested this patchset with some fixes on a on-chip integrated device. The
 basic function is ok, so I just send them out for review, although this
 patchset heavily depends on the former patchset (PCIe SVM support for ARM
 SMMUv3), which is still under discussion.

 Patch Overview:
 *1 to 3 prepare for device tree or acpi get the device stall ability and 
 pasid bits
 *4 is to realise the SVM function for platform device
 *5 is fix a bug when test SVM function while SMMU donnot support this 
 feature
 *6 avoid ILLEGAL setting of STE and CD entry about stall

 Acctually here, I also have some questions about SVM on SMMUv3:

 1. Why the SVM feature on SMMUv3 depends on BTM feature? when bind a task 
 to device,
it will register a mmu_notify. Therefore, when a page range is invalid, 
 we can
send TLBI or ATC invalid without BTM?
>>>
>>> We could, but the end goal for SVM is to perfectly mirror the CPU page
>>> tables. So for platform SVM we would like to get rid of MMU notifiers
>>> entirely.
>>
>> I see, but for some SMMU which do not support BTM, it cannot benefit from 
>> SVM.
>>
>> Meanwhile, do you mean even with BTM feature, the PCI-e device also need to 
>> send a
>> ATC invalid by MMU notify? It seems not fair, why not hardware do the 
>> entirely work
>> in this case? It may costly for send ATC invalid and sync.
> 
> It will certainly be costly. But there are major problems with
> transforming broadcast TLB maintenance into ATC invalidations in HW:
> 
> * VMID:ASID to SID:SSID conversion. TLBIs use VMID:ASID, while ATCIs use
> SID:SSID.
> 
> * Most importantly, ATC invalidations accounting. Each endpoint has a
> limited number of in-flight ATC invalidate requests. The conversion module
> would have to buffer incoming invalidations and wait for in-flight ATC
> invalidation to complete before sending the next ones. In case of
> overflow, either we lose invalidation (which opens security holes) or we
> somehow put back-pressure on the interconnect (no idea how feasible this
> is, I suspect really hard).
> 
> Solving the last one is also quite difficult in software, but at least we
> can still invalidate a range. In hardware we would invalidate the ATC
> page-by-page and quickly jam the bus.
> 

Speaking of invalidation, I have one more question.

There is a time window between 1) modifying the page table and 2) the TLB
invalidation:

ARM-CPU                         Device

1. modify page table
                                ^
                                Can still write data through a stale SMMU TLB
                                entry even though the page table was already
                                modified.
                                (At this point the same virtual address may not
                                point to the same thing for the CPU and the
                                device!!! I'm afraid there may be data loss or
                                other problems if this situation happens.)

2. tlb invalidate range
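
For reference on where step 2 happens in software: with the MMU-notifier
approach mentioned earlier in this thread, the SMMU driver only learns about
the change from a callback that core mm invokes after the page table has
already been rewritten, so the window above is inherent to that scheme. A
minimal sketch, assuming a hypothetical smmu_tlb_inv_range_asid() helper in
place of the real SMMUv3 command-queue code:

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>
#include <linux/types.h>

/* Hypothetical stand-in for the real SMMUv3 TLBI/ATC invalidation commands. */
extern void smmu_tlb_inv_range_asid(u16 asid, unsigned long iova, size_t size);

struct smmu_sva_ctx {
	struct mmu_notifier	mn;
	u16			asid;	/* ASID shared with the CPU's mm */
};

/*
 * Core mm calls this *after* the page-table entries have been changed,
 * i.e. inside the window shown in the diagram above.
 */
static void smmu_sva_invalidate_range(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start, unsigned long end)
{
	struct smmu_sva_ctx *ctx = container_of(mn, struct smmu_sva_ctx, mn);

	/* Step 2: TLBI for the range (plus ATC invalidation for PCIe ATS). */
	smmu_tlb_inv_range_asid(ctx->asid, start, end - start);
}

static const struct mmu_notifier_ops smmu_sva_mn_ops = {
	.invalidate_range = smmu_sva_invalidate_range,
};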

--
Thanks,
Bob



Re: [RFC PATCH 0/6] Add platform device SVM support for ARM SMMUv3

2017-09-06 Thread Bob Liu
On 2017/9/6 17:57, Jean-Philippe Brucker wrote:
> On 06/09/17 02:02, Bob Liu wrote:
>> On 2017/9/5 20:56, Jean-Philippe Brucker wrote:
>>> On 31/08/17 09:20, Yisheng Xie wrote:
>>>> Jean-Philippe has post a patchset for Adding PCIe SVM support to ARM 
>>>> SMMUv3:
>>>> https://www.spinics.net/lists/arm-kernel/msg565155.html
>>>>
>>>> But for some platform devices(aka on-chip integrated devices), there is 
>>>> also
>>>> SVM requirement, which works based on the SMMU stall mode.
>>>> Jean-Philippe has prepared a prototype patchset to support it:
>>>> git://linux-arm.org/linux-jpb.git svm/stall
>>>
>>> Only meant for testing at that point, and unfit even for an RFC.
>>>
>>
>> Sorry for the misunderstanding.
>> The PRI mode patches is in RFC even no hardware for testing, so I thought 
>> it's fine for "Stall mode" patches sent as RFC.
>> We have tested the Stall mode on our platform.
>> Anyway, I should confirm with you in advance.
>>
>> Btw, Would you consider the "stall mode" upstream at first? Since there is 
>> no hardware for testing the PRI mode.
>> (We can provide you the hardware which support SMMU stall mode if necessary.)
> 
> Yes. What's blocking the ATS, PRI and PASID patches at the moment is the
> lack of endpoints for testing. There has been lots of discussion on the
> API side since my first RFC and I'd like to resubmit the API changes soon.
> It is the same API for ATS+PRI+PASID and SSID+Stall, so the backend
> doesn't matter.
> 

Indeed!

> I'm considering upstreaming SSID+Stall first if it can be tested on
> hardware (having direct access to it would certainly speed things up).

Glad to hear that.

> This would require some work in moving the PCI bits at the end of the
> series. I can reserve some time in the coming months to do it, but I need
> to know what to focus on. Are you able to test SSID as well?
> 

Yes, but the difficulty is that our devices are on-chip integrated hardware
accelerators which require a complicated driver.
You may need a lot of time to understand the driver.
That's the same situation as Intel/AMD SVM, where the current user is their GPU :-(

Btw, what kind of device/method do you think is ideal for testing arm-SVM?

>>>> We tested this patchset with some fixes on a on-chip integrated device. The
>>>> basic function is ok, so I just send them out for review, although this
>>>> patchset heavily depends on the former patchset (PCIe SVM support for ARM
>>>> SMMUv3), which is still under discussion.
>>>>
>>>> Patch Overview:
>>>> *1 to 3 prepare for device tree or acpi get the device stall ability and 
>>>> pasid bits
>>>> *4 is to realise the SVM function for platform device
>>>> *5 is fix a bug when test SVM function while SMMU donnot support this 
>>>> feature
>>>> *6 avoid ILLEGAL setting of STE and CD entry about stall
>>>>
>>>> Acctually here, I also have a question about SVM on SMMUv3:
>>>>
>>>> 1. Why the SVM feature on SMMUv3 depends on BTM feature? when bind a task 
>>>> to device,
>>>>it will register a mmu_notify. Therefore, when a page range is invalid, 
>>>> we can
>>>>send TLBI or ATC invalid without BTM?
>>>
>>> We could, but the end goal for SVM is to perfectly mirror the CPU page
>>> tables. So for platform SVM we would like to get rid of MMU notifiers
>>> entirely.
>>>
>>>> 2. According to ACPI IORT spec, named component specific data has a node 
>>>> flags field
>>>>whoes bit0 is for Stall support. However, it do not have any field for 
>>>> pasid bit.
>>>>Can we use other 5 bits[5:1] for pasid bit numbers, so we can have 32 
>>>> pasid bit for
>>>>a single platform device which should be enough, because SMMU only 
>>>> support 20 bit pasid
>>>>
>>
>> Any comment on this?
>> The ACPI IORT spec may need be updated?
> 
> I suppose that the Named Component Node could be used for SSID and stall
> capability bits. Can't the ACPI namespace entries be extended to host
> these capabilities in a more generic way? Platforms with different IOMMUs
> might also need this information some day.
> 

Hmm, that would be better.
But in any case, it depends on whether the ACPI IORT spec gets extended in the
next version.
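
To make the layout proposed above concrete, it would amount to something like
the following; these masks are purely illustrative and not in the current IORT
spec:

/* Hypothetical encoding of the proposal above - NOT in the ACPI IORT spec. */
#define IORT_NC_FLAGS_STALL		(1u << 0)	/* bit 0: stall supported */
#define IORT_NC_FLAGS_PASID_BITS_MASK	(0x1fu << 1)	/* bits [5:1]: proposed PASID bit count */

static unsigned int iort_nc_pasid_bits(unsigned int node_flags)
{
	/* 5 bits encode up to 31, which covers the SMMU's 20-bit PASID limit. */
	return (node_flags & IORT_NC_FLAGS_PASID_BITS_MASK) >> 1;
}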

--
Thanks,
Bob Liu





Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-05 Thread Bob Liu
On 2017/9/6 2:54, Ross Zwisler wrote:
> On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
>> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
>>> On 2017/9/4 23:51, Jerome Glisse wrote:
>>>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
>>>>> On 2017/8/17 8:05, Jérôme Glisse wrote:
>>>>>> Unlike unaddressable memory, coherent device memory has a real
>>>>>> resource associated with it on the system (as CPU can address
>>>>>> it). Add a new helper to hotplug such memory within the HMM
>>>>>> framework.
>>>>>>
>>>>>
>>>>> Got an new question, coherent device( e.g CCIX) memory are likely 
>>>>> reported to OS 
>>>>> through ACPI and recognized as NUMA memory node.
>>>>> Then how can their memory be captured and managed by HMM framework?
>>>>>
>>>>
>>>> Only platform that has such memory today is powerpc and it is not reported
>>>> as regular memory by the firmware hence why they need this helper.
>>>>
>>>> I don't think anyone has defined anything yet for x86 and acpi. As this is
>>>
>>> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute
>>> Table (HMAT) table defined in ACPI 6.2.
>>> The HMAT can cover CPU-addressable memory types(though not non-cache
>>> coherent on-device memory).
>>>
>>> Ross from Intel already done some work on this, see:
>>> https://lwn.net/Articles/724562/
>>>
>>> arm64 supports APCI also, there is likely more this kind of device when CCIX
>>> is out (should be very soon if on schedule).
>>
>> HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory ie
>> when you have several kind of memory each with different characteristics:
>>   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
>> small (ie few giga bytes)
>>   - Persistent memory, slower (both latency and bandwidth) big (tera bytes)
>>   - DDR (good old memory) well characteristics are between HBM and persistent
>>
>> So AFAICT this has nothing to do with what HMM is for, ie device memory. Note
>> that device memory can have a hierarchy of memory themself (HBM, GDDR and in
>> maybe even persistent memory).
>>
>>>> memory on PCIE like interface then i don't expect it to be reported as NUMA
>>>> memory node but as io range like any regular PCIE resources. Device driver
>>>> through capabilities flags would then figure out if the link between the
>>>> device and CPU is CCIX capable if so it can use this helper to hotplug it
>>>> as device memory.
>>>>
>>>
>>> From my point of view,  Cache coherent device memory will popular soon and
>>> reported through ACPI/UEFI. Extending NUMA policy still sounds more 
>>> reasonable
>>> to me.
>>
>> Cache coherent device will be reported through standard mecanisms defined by
>> the bus standard they are using. To my knowledge all the standard are either
>> on top of PCIE or are similar to PCIE.
>>
>> It is true that on many platform PCIE resource is manage/initialize by the
>> bios (UEFI) but it is platform specific. In some case we reprogram what the
>> bios pick.
>>
>> So like i was saying i don't expect the BIOS/UEFI to report device memory as
>> regular memory. It will be reported as a regular PCIE resources and then the
>> device driver will be able to determine through some flags if the link 
>> between
>> the CPU(s) and the device is cache coherent or not. At that point the device
>> driver can use register it with HMM helper.
>>
>>
>> The whole NUMA discussion happen several time in the past i suggest looking
>> on mm list archive for them. But it was rule out for several reasons. Top of
>> my head:
>>   - people hate CPU less node and device memory is inherently CPU less
> 
> With the introduction of the HMAT in ACPI 6.2 one of the things that was added
> was the ability to have an ACPI proximity domain that isn't associated with a
> CPU.  This can be seen in the changes in the text of the "Proximity Domain"
> field in table 5-73 which describes the "Memory Affinity Structure".  One of
> the major features of the HMAT was the separation of "Initiator" proximity
> domains (CPUs, devices that initiate memory transfers), and "target" proximity
> domains (memory regions, be they attached to a CPU or some other device).
> 
> ACPI 

Re: [RFC PATCH 0/6] Add platform device SVM support for ARM SMMUv3

2017-09-05 Thread Bob Liu
On 2017/9/5 20:56, Jean-Philippe Brucker wrote:
> On 31/08/17 09:20, Yisheng Xie wrote:
>> Jean-Philippe has post a patchset for Adding PCIe SVM support to ARM SMMUv3:
>> https://www.spinics.net/lists/arm-kernel/msg565155.html
>>
>> But for some platform devices(aka on-chip integrated devices), there is also
>> SVM requirement, which works based on the SMMU stall mode.
>> Jean-Philippe has prepared a prototype patchset to support it:
>> git://linux-arm.org/linux-jpb.git svm/stall
> 
> Only meant for testing at that point, and unfit even for an RFC.
> 

Sorry for the misunderstanding.
The PRI mode patches are an RFC even though there is no hardware for testing,
so I thought it was fine to send the "stall mode" patches as an RFC too.
We have tested the stall mode on our platform.
Anyway, I should have confirmed with you in advance.

Btw, would you consider upstreaming the "stall mode" first, since there is no
hardware for testing the PRI mode?
(We can provide you with hardware that supports SMMU stall mode if necessary.)

>> We tested this patchset with some fixes on a on-chip integrated device. The
>> basic function is ok, so I just send them out for review, although this
>> patchset heavily depends on the former patchset (PCIe SVM support for ARM
>> SMMUv3), which is still under discussion.
>>
>> Patch Overview:
>> *1 to 3 prepare for device tree or acpi get the device stall ability and 
>> pasid bits
>> *4 is to realise the SVM function for platform device
>> *5 is fix a bug when test SVM function while SMMU donnot support this feature
>> *6 avoid ILLEGAL setting of STE and CD entry about stall
>>
>> Acctually here, I also have a question about SVM on SMMUv3:
>>
>> 1. Why the SVM feature on SMMUv3 depends on BTM feature? when bind a task to 
>> device,
>>it will register a mmu_notify. Therefore, when a page range is invalid, 
>> we can
>>send TLBI or ATC invalid without BTM?
> 
> We could, but the end goal for SVM is to perfectly mirror the CPU page
> tables. So for platform SVM we would like to get rid of MMU notifiers
> entirely.
> 
>> 2. According to ACPI IORT spec, named component specific data has a node 
>> flags field
>>whoes bit0 is for Stall support. However, it do not have any field for 
>> pasid bit.
>>Can we use other 5 bits[5:1] for pasid bit numbers, so we can have 32 
>> pasid bit for
>>a single platform device which should be enough, because SMMU only 
>> support 20 bit pasid
>>

Any comment on this?
The ACPI IORT spec may need to be updated?

Regards,
Liubo

>> 3. Presently, the pasid is allocate for a task but not for a context, if a 
>> task is trying
>>to bind to 2 device A and B:
>>  a) A support 5 pasid bits
>>  b) B support 2 pasid bits
>>  c) when the task bind to device A, it allocate pasid = 16
>>  d) then it must be fail when trying to bind to task B, for its highest 
>> pasid is 4.
>>So it should allocate a single pasid for a context to avoid this?
> 
> Ideally yes, but the model chosen for the IOMMU API was one PASID per
> task, so I implemented this model (the PASID allocator will be common to
> IOMMU core in the future).
> 
> Therefore the PASID allocation will fail in your example, and there is no
> way around it. If you do (d) then (c), the task will have PASID 4.
> 
> Thanks,
> Jean
> 
> .
> 
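
To spell out the arithmetic in Jean's example above: 5 PASID bits on device A
give 32 PASIDs, 2 bits on device B give only 4, so a PASID of 16 allocated
while bound only to A cannot be expressed on B. A toy sketch in plain C
(nothing here is from the IOMMU API):

#include <stdio.h>

/* A PASID is usable on a device only if it fits that device's PASID space. */
static int pasid_fits(unsigned int pasid, unsigned int pasid_bits)
{
	return pasid < (1u << pasid_bits);
}

int main(void)
{
	unsigned int dev_a_bits = 5;	/* 2^5 = 32 PASIDs */
	unsigned int dev_b_bits = 2;	/* 2^2 = 4 PASIDs  */
	unsigned int pasid = 16;	/* allocated when the task bound to A first */

	printf("PASID %u on device B: %s\n", pasid,
	       pasid_fits(pasid, dev_b_bits) ? "ok" : "bind fails");
	/* Binding B first keeps the task's single PASID below 4, which
	 * device A can then reuse as well. */
	return 0;
}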




Re: [RFC PATCH 4/6] iommu/arm-smmu-v3: Add SVM support for platform devices

2017-09-05 Thread Bob Liu
On 2017/9/5 20:53, Jean-Philippe Brucker wrote:
> On 31/08/17 09:20, Yisheng Xie wrote:
>> From: Jean-Philippe Brucker 
>>
>> Platform device can realise SVM function by using the stall mode. That
>> is to say, when device access a memory via iova which is not populated,
>> it will stalled and when SMMU try to translate this iova, it also will
>> stall and meanwhile send an event to CPU via MSI.
>>
>> After SMMU driver handle the event and populated the iova, it will send
>> a RESUME command to SMMU to exit the stall mode, therefore the platform
>> device can contiue access the memory.
>>
>> Signed-off-by: Jean-Philippe Brucker 
> 
> No. Please don't forge a signed-off-by under a commit message you wrote,

Really sorry about that.
We sent out the wrong version; I should have reviewed it more carefully.

Regards,
Liubo

> it's rude. I didn't sign it, didn't consider it fit for mainline or even
> as an RFC, and wanted to have another read before sending. My mistake,
> I'll think twice before sharing prototypes in the future.
> 
> Thanks,
> Jean
> 





Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-04 Thread Bob Liu
On 2017/9/5 10:38, Jerome Glisse wrote:
> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
>> On 2017/9/4 23:51, Jerome Glisse wrote:
>>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
>>>> On 2017/8/17 8:05, Jérôme Glisse wrote:
>>>>> Unlike unaddressable memory, coherent device memory has a real
>>>>> resource associated with it on the system (as CPU can address
>>>>> it). Add a new helper to hotplug such memory within the HMM
>>>>> framework.
>>>>>
>>>>
>>>> Got an new question, coherent device( e.g CCIX) memory are likely reported 
>>>> to OS 
>>>> through ACPI and recognized as NUMA memory node.
>>>> Then how can their memory be captured and managed by HMM framework?
>>>>
>>>
>>> Only platform that has such memory today is powerpc and it is not reported
>>> as regular memory by the firmware hence why they need this helper.
>>>
>>> I don't think anyone has defined anything yet for x86 and acpi. As this is
>>
>> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute
>> Table (HMAT) table defined in ACPI 6.2.
>> The HMAT can cover CPU-addressable memory types(though not non-cache
>> coherent on-device memory).
>>
>> Ross from Intel already done some work on this, see:
>> https://lwn.net/Articles/724562/
>>
>> arm64 supports APCI also, there is likely more this kind of device when CCIX
>> is out (should be very soon if on schedule).
> 
> HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory ie
> when you have several kind of memory each with different characteristics:
>   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
> small (ie few giga bytes)
>   - Persistent memory, slower (both latency and bandwidth) big (tera bytes)
>   - DDR (good old memory) well characteristics are between HBM and persistent
> 

Okay, then how should the kernel handle the situation of "several kinds of
memory, each with different characteristics"?
Does anyone have a suggestion?  I thought HMM could do this.
NUMA policy/node distance is good but perhaps needs some extending, e.g. an
HBM node can't be used as swap and can't accept DDR fallback allocation.
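
For comparison, this is roughly what the existing NUMA policy interface already
gives userspace. The node number is an assumption (say the HBM shows up as
node 1), and the "no swap, no DDR fallback onto the node" behaviour asked for
above is exactly what it does not provide:

/* Build with: gcc -o bind_hbm bind_hbm.c -lnuma */
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
	size_t len = 1UL << 20;
	unsigned long nodemask = 1UL << 1;	/* node 1 only (assumed HBM node) */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/*
	 * MPOL_BIND restricts where *this* range may allocate pages from.
	 * It does not stop other allocations from falling back onto the HBM
	 * node, nor does it make that node unswappable.
	 */
	if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
		perror("mbind");

	return 0;
}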

> So AFAICT this has nothing to do with what HMM is for, ie device memory. Note
> that device memory can have a hierarchy of memory themself (HBM, GDDR and in
> maybe even persistent memory).
> 

This looks like a subset of HMAT for the case where the CPU can address device
memory directly in a cache-coherent way.


>>> memory on PCIE like interface then i don't expect it to be reported as NUMA
>>> memory node but as io range like any regular PCIE resources. Device driver
>>> through capabilities flags would then figure out if the link between the
>>> device and CPU is CCIX capable if so it can use this helper to hotplug it
>>> as device memory.
>>>
>>
>> From my point of view,  Cache coherent device memory will popular soon and
>> reported through ACPI/UEFI. Extending NUMA policy still sounds more 
>> reasonable
>> to me.
> 
> Cache coherent device will be reported through standard mecanisms defined by
> the bus standard they are using. To my knowledge all the standard are either
> on top of PCIE or are similar to PCIE.
> 
> It is true that on many platform PCIE resource is manage/initialize by the
> bios (UEFI) but it is platform specific. In some case we reprogram what the
> bios pick.
> 
> So like i was saying i don't expect the BIOS/UEFI to report device memory as

But it's happening.
In my understanding, that's why HMAT was introduced: to report device memory as
regular memory (with different characteristics).

--
Regards,
Bob Liu

> regular memory. It will be reported as a regular PCIE resources and then the
> device driver will be able to determine through some flags if the link between
> the CPU(s) and the device is cache coherent or not. At that point the device
> driver can use register it with HMM helper.
> 
> 
> The whole NUMA discussion happen several time in the past i suggest looking
> on mm list archive for them. But it was rule out for several reasons. Top of
> my head:
>   - people hate CPU less node and device memory is inherently CPU less
>   - device driver want total control over memory and thus to be isolated from
> mm mecanism and doing all those special cases was not welcome
>   - existing NUMA migration mecanism are ill suited for this memory as
> access by the device to the memory is unknown to core mm and there
> is no easy way to report it or track it (this kind of depends on the
> platform and hardware)
> 
> I am likely missing other big points.
> 
> Cheers,
> Jérôme
> 
> .
> 




Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-04 Thread Bob Liu
On 2017/9/4 23:51, Jerome Glisse wrote:
> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
>> On 2017/8/17 8:05, Jérôme Glisse wrote:
>>> Unlike unaddressable memory, coherent device memory has a real
>>> resource associated with it on the system (as CPU can address
>>> it). Add a new helper to hotplug such memory within the HMM
>>> framework.
>>>
>>
>> Got an new question, coherent device( e.g CCIX) memory are likely reported 
>> to OS 
>> through ACPI and recognized as NUMA memory node.
>> Then how can their memory be captured and managed by HMM framework?
>>
> 
> Only platform that has such memory today is powerpc and it is not reported
> as regular memory by the firmware hence why they need this helper.
> 
> I don't think anyone has defined anything yet for x86 and acpi. As this is

Not yet, but the ACPI spec now has the Heterogeneous Memory Attribute
Table (HMAT) defined in ACPI 6.2.
The HMAT can cover CPU-addressable memory types (though not non-cache-coherent
on-device memory).

Ross from Intel has already done some work on this, see:
https://lwn.net/Articles/724562/

arm64 supports ACPI too, and there will likely be more devices of this kind
when CCIX is out (which should be very soon if it stays on schedule).

> memory on PCIE like interface then i don't expect it to be reported as NUMA
> memory node but as io range like any regular PCIE resources. Device driver
> through capabilities flags would then figure out if the link between the
> device and CPU is CCIX capable if so it can use this helper to hotplug it
> as device memory.
> 

From my point of view, cache-coherent device memory will become popular soon
and will be reported through ACPI/UEFI.
Extending NUMA policy still sounds more reasonable to me.

--
Thanks,
Bob Liu



Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-03 Thread Bob Liu
On 2017/8/17 8:05, Jérôme Glisse wrote:
> Unlike unaddressable memory, coherent device memory has a real
> resource associated with it on the system (as CPU can address
> it). Add a new helper to hotplug such memory within the HMM
> framework.
> 

Got a new question: coherent device (e.g. CCIX) memory is likely reported to
the OS through ACPI and recognized as a NUMA memory node.
Then how can that memory be captured and managed by the HMM framework?

--
Regards,
Bob Liu

> Changed since v2:
>   - s/host/public
> Changed since v1:
>   - s/public/host
> 
> Signed-off-by: Jérôme Glisse <jgli...@redhat.com>
> Reviewed-by: Balbir Singh <bsinghar...@gmail.com>
> ---
>  include/linux/hmm.h |  3 ++
>  mm/hmm.c| 88 
> ++---
>  2 files changed, 86 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 79e63178fd87..5866f3194c26 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -443,6 +443,9 @@ struct hmm_devmem {
>  struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
> struct device *device,
> unsigned long size);
> +struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
> +struct device *device,
> +struct resource *res);
>  void hmm_devmem_remove(struct hmm_devmem *devmem);
>  
>  /*
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 1a1e79d390c1..3faa4d40295e 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -854,7 +854,11 @@ static void hmm_devmem_release(struct device *dev, void 
> *data)
>   zone = page_zone(page);
>  
>   mem_hotplug_begin();
> - __remove_pages(zone, start_pfn, npages);
> + if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
> + __remove_pages(zone, start_pfn, npages);
> + else
> + arch_remove_memory(start_pfn << PAGE_SHIFT,
> +npages << PAGE_SHIFT);
>   mem_hotplug_done();
>  
>   hmm_devmem_radix_release(resource);
> @@ -890,7 +894,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem 
> *devmem)
>   if (is_ram == REGION_INTERSECTS)
>   return -ENXIO;
>  
> - devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
> + if (devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY)
> + devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
> + else
> + devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
> +
>   devmem->pagemap.res = devmem->resource;
>   devmem->pagemap.page_fault = hmm_devmem_fault;
>   devmem->pagemap.page_free = hmm_devmem_free;
> @@ -935,9 +943,15 @@ static int hmm_devmem_pages_create(struct hmm_devmem 
> *devmem)
>* over the device memory is un-accessible thus we do not want to
>* create a linear mapping for the memory like arch_add_memory()
>* would do.
> +  *
> +  * For device public memory, which is accesible by the CPU, we do
> +  * want the linear mapping and thus use arch_add_memory().
>*/
> - ret = add_pages(nid, align_start >> PAGE_SHIFT,
> - align_size >> PAGE_SHIFT, false);
> + if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
> + ret = arch_add_memory(nid, align_start, align_size, false);
> + else
> + ret = add_pages(nid, align_start >> PAGE_SHIFT,
> + align_size >> PAGE_SHIFT, false);
>   if (ret) {
>   mem_hotplug_done();
>   goto error_add_memory;
> @@ -1084,6 +1098,67 @@ struct hmm_devmem *hmm_devmem_add(const struct 
> hmm_devmem_ops *ops,
>  }
>  EXPORT_SYMBOL(hmm_devmem_add);
>  
> +struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
> +struct device *device,
> +struct resource *res)
> +{
> + struct hmm_devmem *devmem;
> + int ret;
> +
> + if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
> + return ERR_PTR(-EINVAL);
> +
> + static_branch_enable(&device_private_key);
> +
> + devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
> +GFP_KERNEL, dev_to_node(device));
> + if (!devmem)
> + return ERR_PTR(-ENOMEM);
> +
> + init_completion(&devmem->completion);
> + devmem->pfn_first = -1UL;
> + devmem->pfn_last = -1UL;
> + devmem->resource = res;
> + devmem->device = device;

Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-03 Thread Bob Liu
On 2017/8/17 8:05, Jérôme Glisse wrote:
> Unlike unaddressable memory, coherent device memory has a real
> resource associated with it on the system (as CPU can address
> it). Add a new helper to hotplug such memory within the HMM
> framework.
> 

Got an new question, coherent device( e.g CCIX) memory are likely reported to 
OS 
through ACPI and recognized as NUMA memory node.
Then how can their memory be captured and managed by HMM framework?

--
Regards,
Bob Liu

> Changed since v2:
>   - s/host/public
> Changed since v1:
>   - s/public/host
> 
> Signed-off-by: Jérôme Glisse 
> Reviewed-by: Balbir Singh 
> ---
>  include/linux/hmm.h |  3 ++
>  mm/hmm.c| 88 
> ++---
>  2 files changed, 86 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 79e63178fd87..5866f3194c26 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -443,6 +443,9 @@ struct hmm_devmem {
>  struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
> struct device *device,
> unsigned long size);
> +struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
> +struct device *device,
> +struct resource *res);
>  void hmm_devmem_remove(struct hmm_devmem *devmem);
>  
>  /*
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 1a1e79d390c1..3faa4d40295e 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -854,7 +854,11 @@ static void hmm_devmem_release(struct device *dev, void 
> *data)
>   zone = page_zone(page);
>  
>   mem_hotplug_begin();
> - __remove_pages(zone, start_pfn, npages);
> + if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
> + __remove_pages(zone, start_pfn, npages);
> + else
> + arch_remove_memory(start_pfn << PAGE_SHIFT,
> +npages << PAGE_SHIFT);
>   mem_hotplug_done();
>  
>   hmm_devmem_radix_release(resource);
> @@ -890,7 +894,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem 
> *devmem)
>   if (is_ram == REGION_INTERSECTS)
>   return -ENXIO;
>  
> - devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
> + if (devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY)
> + devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
> + else
> + devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
> +
>   devmem->pagemap.res = devmem->resource;
>   devmem->pagemap.page_fault = hmm_devmem_fault;
>   devmem->pagemap.page_free = hmm_devmem_free;
> @@ -935,9 +943,15 @@ static int hmm_devmem_pages_create(struct hmm_devmem 
> *devmem)
>* over the device memory is un-accessible thus we do not want to
>* create a linear mapping for the memory like arch_add_memory()
>* would do.
> +  *
> +  * For device public memory, which is accesible by the CPU, we do
> +  * want the linear mapping and thus use arch_add_memory().
>*/
> - ret = add_pages(nid, align_start >> PAGE_SHIFT,
> - align_size >> PAGE_SHIFT, false);
> + if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
> + ret = arch_add_memory(nid, align_start, align_size, false);
> + else
> + ret = add_pages(nid, align_start >> PAGE_SHIFT,
> + align_size >> PAGE_SHIFT, false);
>   if (ret) {
>   mem_hotplug_done();
>   goto error_add_memory;
> @@ -1084,6 +1098,67 @@ struct hmm_devmem *hmm_devmem_add(const struct 
> hmm_devmem_ops *ops,
>  }
>  EXPORT_SYMBOL(hmm_devmem_add);
>  
> +struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
> +struct device *device,
> +struct resource *res)
> +{
> + struct hmm_devmem *devmem;
> + int ret;
> +
> + if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
> + return ERR_PTR(-EINVAL);
> +
> + static_branch_enable(_private_key);
> +
> + devmem = devres_alloc_node(_devmem_release, sizeof(*devmem),
> +GFP_KERNEL, dev_to_node(device));
> + if (!devmem)
> + return ERR_PTR(-ENOMEM);
> +
> + init_completion(>completion);
> + devmem->pfn_first = -1UL;
> + devmem->pfn_last = -1UL;
> + devmem->resource = res;
> + devmem->device = device;
> + devmem->o
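
For illustration only, a minimal sketch of how a driver for CCIX/CAPI-style
coherent device memory might use the helper quoted above. Everything apart
from hmm_devmem_add_resource(), IORES_DESC_DEVICE_PUBLIC_MEMORY and the
hmm_devmem fields shown in the patch (the ops table, the function names, the
error handling) is a made-up assumption, not part of the patchset:

static const struct hmm_devmem_ops my_cdm_ops;  /* .fault/.free filled in by the driver */

/* Hypothetical driver hook: register a cache coherent (CDM) memory range. */
static int my_cdm_register(struct device *dev, struct resource *res)
{
        struct hmm_devmem *devmem;

        /* the new helper only accepts public (CPU addressable) memory */
        if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
                return -EINVAL;

        devmem = hmm_devmem_add_resource(&my_cdm_ops, dev, res);
        if (IS_ERR(devmem))
                return PTR_ERR(devmem);

        /* devmem->pfn_first..devmem->pfn_last now cover the CDM struct pages */
        return 0;
}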

Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5

2017-07-21 Thread Bob Liu
On Fri, Jul 21, 2017 at 10:10 AM, Bob Liu <liub...@huawei.com> wrote:
> On 2017/7/21 9:41, Jerome Glisse wrote:
>> On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
>>> On 2017/7/20 23:03, Jerome Glisse wrote:
>>>> On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>>>>> On 2017/7/19 10:25, Jerome Glisse wrote:
>>>>>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>>>>>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>>>>>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>>>>>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>>
>> [...]
>>
>>>>> Then it's more like replace the numa node solution(CDM) with ZONE_DEVICE
>>>>> (type MEMORY_DEVICE_PUBLIC). But the problem is the same, e.g how to make
>>>>> sure the device memory say HBM won't be occupied by normal CPU allocation.
>>>>> Things will be more complex if there are multi GPU connected by nvlink
>>>>> (also cache coherent) in a system, each GPU has their own HBM.
>>>>>
>>>>> How to decide allocate physical memory from local HBM/DDR or remote HBM/
>>>>> DDR?
>>>>>
>>>>> If using numa(CDM) approach there are NUMA mempolicy and autonuma 
>>>>> mechanism
>>>>> at least.
>>>>
>>>> NUMA is not as easy as you think. First like i said we want the device
>>>> memory to be isolated from most existing mm mechanism. Because memory
>>>> is unreliable and also because device might need to be able to evict
>>>> memory to make contiguous physical memory allocation for graphics.
>>>>
>>>
>>> Right, but we need isolation any way.
>>> For hmm-cdm, the isolation is not adding device memory to lru list, and many
>>> if (is_device_public_page(page)) ...
>>>
>>> But how to evict device memory?
>>
>> What you mean by evict ? Device driver can evict whenever they see the need
>> to do so. CPU page fault will evict too. Process exit or munmap() will free
>> the device memory.
>>
>> Are you refering to evict in the sense of memory reclaim under pressure ?
>>
>> So the way it flows for memory pressure is that if device driver want to
>> make room it can evict stuff to system memory and if there is not enough
>
> Yes, I mean this.
> So every driver have to maintain their own LRU-similar list instead of reuse 
> what already in linux kernel.
>

And how can HMM-CDM handle multiple devices, or a device with multiple
device memories (possibly with different properties)?
This kind of hardware platform will be very common once CCIX is out.

Thanks,
Bob Liu



>> system memory than thing get reclaim as usual before device driver can
>> make progress on device memory reclaim.
>>
>>
>>>> Second device driver are not integrated that closely within mm and the
>>>> scheduler kernel code to allow to efficiently plug in device access
>>>> notification to page (ie to update struct page so that numa worker
>>>> thread can migrate memory base on accurate informations).
>>>>
>>>> Third it can be hard to decide who win between CPU and device access
>>>> when it comes to updating thing like last CPU id.
>>>>
>>>> Fourth there is no such thing like device id ie equivalent of CPU id.
>>>> If we were to add something the CPU id field in flags of struct page
>>>> would not be big enough so this can have repercusion on struct page
>>>> size. This is not an easy sell.
>>>>
>>>> They are other issues i can't think of right now. I think for now it
>>>
>>> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
>>> I just care about a more complete solution no matter CDM,HMM-CDM or other 
>>> ways.
>>> HMM or HMM-CDM depends on device driver, but haven't see a public/full 
>>> driver to
>>> demonstrate the whole solution works fine.
>>
>> I am working with NVidia close source driver team to make sure that it works
>> well for them. I am also working on nouveau open source driver for same 
>> NVidia
>> hardware thought it will be of less use as what is missing there is a solid
>> open source userspace to leverage this. Nonetheless open source driver are in
>> the work.
>>
>
> Looking forward to see these drivers be public.
>
>> The way i see it is start with HMM-CDM which isolate most of the changes in
>> hmm code. Once we get more experience with real workload and not with device
>> driver test suite then we can start revisiting NUMA and deeper integration
>> with the linux kernel. I rather grow organicaly toward that than trying to
>> design something that would make major changes all over the kernel without
>> knowing for sure that we are going in the right direction. I hope that this
>> make sense to others too.
>>
>
> Make sense.
>
> Thanks,
> Bob Liu
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org



-- 
Regards,
--Bob


Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5

2017-07-20 Thread Bob Liu
On 2017/7/21 9:41, Jerome Glisse wrote:
> On Fri, Jul 21, 2017 at 09:15:29AM +0800, Bob Liu wrote:
>> On 2017/7/20 23:03, Jerome Glisse wrote:
>>> On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>>>> On 2017/7/19 10:25, Jerome Glisse wrote:
>>>>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>>>>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>>>>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>>>>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
> 
> [...]
> 
>>>> Then it's more like replace the numa node solution(CDM) with ZONE_DEVICE
>>>> (type MEMORY_DEVICE_PUBLIC). But the problem is the same, e.g how to make
>>>> sure the device memory say HBM won't be occupied by normal CPU allocation.
>>>> Things will be more complex if there are multi GPU connected by nvlink
>>>> (also cache coherent) in a system, each GPU has their own HBM.
>>>>
>>>> How to decide allocate physical memory from local HBM/DDR or remote HBM/
>>>> DDR? 
>>>>
>>>> If using numa(CDM) approach there are NUMA mempolicy and autonuma mechanism
>>>> at least.
>>>
>>> NUMA is not as easy as you think. First like i said we want the device
>>> memory to be isolated from most existing mm mechanism. Because memory
>>> is unreliable and also because device might need to be able to evict
>>> memory to make contiguous physical memory allocation for graphics.
>>>
>>
>> Right, but we need isolation any way.
>> For hmm-cdm, the isolation is not adding device memory to lru list, and many
>> if (is_device_public_page(page)) ...
>>
>> But how to evict device memory?
> 
> What you mean by evict ? Device driver can evict whenever they see the need
> to do so. CPU page fault will evict too. Process exit or munmap() will free
> the device memory.
> 
> Are you refering to evict in the sense of memory reclaim under pressure ?
> 
> So the way it flows for memory pressure is that if device driver want to
> make room it can evict stuff to system memory and if there is not enough

Yes, I mean this.
So every driver has to maintain its own LRU-like list instead of reusing what
is already in the Linux kernel.

> system memory than thing get reclaim as usual before device driver can
> make progress on device memory reclaim.
> 
> 
>>> Second device driver are not integrated that closely within mm and the
>>> scheduler kernel code to allow to efficiently plug in device access
>>> notification to page (ie to update struct page so that numa worker
>>> thread can migrate memory base on accurate informations).
>>>
>>> Third it can be hard to decide who win between CPU and device access
>>> when it comes to updating thing like last CPU id.
>>>
>>> Fourth there is no such thing like device id ie equivalent of CPU id.
>>> If we were to add something the CPU id field in flags of struct page
>>> would not be big enough so this can have repercusion on struct page
>>> size. This is not an easy sell.
>>>
>>> They are other issues i can't think of right now. I think for now it
>>
>> My opinion is most of the issues are the same no matter use CDM or HMM-CDM.
>> I just care about a more complete solution no matter CDM,HMM-CDM or other 
>> ways.
>> HMM or HMM-CDM depends on device driver, but haven't see a public/full 
>> driver to 
>> demonstrate the whole solution works fine.
> 
> I am working with NVidia close source driver team to make sure that it works
> well for them. I am also working on nouveau open source driver for same NVidia
> hardware thought it will be of less use as what is missing there is a solid
> open source userspace to leverage this. Nonetheless open source driver are in
> the work.
> 

Looking forward to see these drivers be public.

> The way i see it is start with HMM-CDM which isolate most of the changes in
> hmm code. Once we get more experience with real workload and not with device
> driver test suite then we can start revisiting NUMA and deeper integration
> with the linux kernel. I rather grow organicaly toward that than trying to
> design something that would make major changes all over the kernel without
> knowing for sure that we are going in the right direction. I hope that this
> make sense to others too.
> 

Makes sense.

Thanks,
Bob Liu




Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5

2017-07-20 Thread Bob Liu
On 2017/7/20 23:03, Jerome Glisse wrote:
> On Wed, Jul 19, 2017 at 05:09:04PM +0800, Bob Liu wrote:
>> On 2017/7/19 10:25, Jerome Glisse wrote:
>>> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>>>> On 2017/7/18 23:38, Jerome Glisse wrote:
>>>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>>>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>>>>>>> Sorry i made horrible mistake on names in v4, i completly miss-
>>>>>>> understood the suggestion. So here i repost with proper naming.
>>>>>>> This is the only change since v3. Again sorry about the noise
>>>>>>> with v4.
>>>>>>>
>>>>>>> Changes since v4:
>>>>>>>   - s/DEVICE_HOST/DEVICE_PUBLIC
>>>>>>>
>>>>>>> Git tree:
>>>>>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v5
>>>>>>>
>>>>>>>
>>>>>>> Cache coherent device memory apply to architecture with system bus
>>>>>>> like CAPI or CCIX. Device connected to such system bus can expose
>>>>>>> their memory to the system and allow cache coherent access to it
>>>>>>> from the CPU.
>>>>>>>
>>>>>>> Even if for all intent and purposes device memory behave like regular
>>>>>>> memory, we still want to manage it in isolation from regular memory.
>>>>>>> Several reasons for that, first and foremost this memory is less
>>>>>>> reliable than regular memory if the device hangs because of invalid
>>>>>>> commands we can loose access to device memory. Second CPU access to
>>>>>>> this memory is expected to be slower than to regular memory. Third
>>>>>>> having random memory into device means that some of the bus bandwith
>>>>>>> wouldn't be available to the device but would be use by CPU access.
>>>>>>>
>>>>>>> This is why we want to manage such memory in isolation from regular
>>>>>>> memory. Kernel should not try to use this memory even as last resort
>>>>>>> when running out of memory, at least for now.
>>>>>>>
>>>>>>
>>>>>> I think set a very large node distance for "Cache Coherent Device Memory"
>>>>>> may be a easier way to address these concerns.
>>>>>
>>>>> Such approach was discuss at length in the past see links below. Outcome
>>>>> of discussion:
>>>>>   - CPU less node are bad
>>>>>   - device memory can be unreliable (device hang) no way for application
>>>>> to understand that
>>>>
>>>> Device memory can also be more reliable if using high quality and 
>>>> expensive memory.
>>>
>>> Even ECC memory does not compensate for device hang. When your GPU lockups
>>> you might need to re-init GPU from scratch after which the content of the
>>> device memory is unreliable. During init the device memory might not get
>>> proper clock or proper refresh cycle and thus is susceptible to corruption.
>>>
>>>>
>>>>>   - application and driver NUMA madvise/mbind/mempolicy ... can conflict
>>>>> with each other and no way the kernel can figure out which should
>>>>> apply
>>>>>   - NUMA as it is now would not work as we need further isolation that
>>>>> what a large node distance would provide
>>>>>
>>>>
>>>> Agree, that's where we need spend time on.
>>>>
>>>> One drawback of HMM-CDM I'm worry about is one more extra copy.
>>>> In the cache coherent case, CPU can write data to device memory
>>>> directly then start fpga/GPU/other accelerators.
>>>
>>> There is not necessarily an extra copy. Device driver can pre-allocate
>>> virtual address range of a process with device memory. Device page fault
>>
>> Okay, I get your point. But the typical use case is CPU allocate a memory
>> and prepare/write data then launch GPU "cuda kernel".
> 
> I don't think we should make to many assumption on what is typical case.
> GPU compute is fast evolving and they are new domains where it is apply
> for instance some folks use it to process network stream and the network
> adapter directly write into GPU memory so there is never 


Re: [RFC v2 0/5] surface heterogeneous memory performance information

2017-07-19 Thread Bob Liu
On 2017/7/7 5:52, Ross Zwisler wrote:
>  Quick Summary 
> 
> Platforms in the very near future will have multiple types of memory
> attached to a single CPU.  These disparate memory ranges will have some
> characteristics in common, such as CPU cache coherence, but they can have
> wide ranges of performance both in terms of latency and bandwidth.
> 
> For example, consider a system that contains persistent memory, standard
> DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU.
> There could potentially be an order of magnitude or more difference in
> performance between the slowest and fastest memory attached to that CPU.
> 
> With the current Linux code NUMA nodes are CPU-centric, so all the memory
> attached to a given CPU will be lumped into the same NUMA node.  This makes
> it very difficult for userspace applications to understand the performance
> of different memory ranges on a given CPU.
> 
> We solve this issue by providing userspace with performance information on
> individual memory ranges.  This performance information is exposed via
> sysfs:
> 
>   # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null
>   mem_tgt2/firmware_id:1
>   mem_tgt2/is_cached:0
>   mem_tgt2/is_enabled:1
>   mem_tgt2/is_isolated:0
>   mem_tgt2/phys_addr_base:0x0
>   mem_tgt2/phys_length_bytes:0x8
>   mem_tgt2/local_init/read_bw_MBps:30720
>   mem_tgt2/local_init/read_lat_nsec:100
>   mem_tgt2/local_init/write_bw_MBps:30720
>   mem_tgt2/local_init/write_lat_nsec:100
> 
> This allows applications to easily find the memory that they want to use.
> We expect that the existing NUMA APIs will be enhanced to use this new
> information so that applications can continue to use them to select their
> desired memory.
> 
> This series is built upon acpica-1705:
> 
> https://github.com/zetalog/linux/commits/acpica-1705
> 
> And you can find a working tree here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/zwisler/linux.git/log/?h=hmem_sysfs
> 
>  Lots of Details 
> 
> This patch set is only concerned with CPU-addressable memory types, not
> on-device memory like what we have with Jerome Glisse's HMM series:
> 
> https://lwn.net/Articles/726691/
> 
> This patch set works by enabling the new Heterogeneous Memory Attribute
> Table (HMAT) table, newly defined in ACPI 6.2. One major conceptual change
> in ACPI 6.2 related to this work is that proximity domains no longer need
> to contain a processor.  We can now have memory-only proximity domains,
> which means that we can now have memory-only Linux NUMA nodes.
> 
> Here is an example configuration where we have a single processor, one
> range of regular memory and one range of HBM:
> 
>   +---------------+     +----------------+
>   | Processor     |     | Memory         |
>   | prox domain 0 +-----+ prox domain 1  |
>   | NUMA node 1   |     | NUMA node 2    |
>   +-------+-------+     +----------------+
>           |
>   +-------+----------+
>   | HBM              |
>   | prox domain 2    |
>   | NUMA node 0      |
>   +------------------+
> 
> This gives us one initiator (the processor) and two targets (the two memory
> ranges).  Each of these three has its own ACPI proximity domain and
> associated Linux NUMA node.  Note also that while there is a 1:1 mapping
> from each proximity domain to each NUMA node, the numbers don't necessarily
> match up.  Additionally we can have extra NUMA nodes that don't map back to
> ACPI proximity domains.
> 
> The above configuration could also have the processor and one of the two
> memory ranges sharing a proximity domain and NUMA node, but for the
> purposes of the HMAT the two memory ranges will always need to be
> separated.
> 
> The overall goal of this series and of the HMAT is to allow users to
> identify memory using its performance characteristics.  This can broadly be
> done in one of two ways:
> 
> Option 1: Provide the user with a way to map between proximity domains and
> NUMA nodes and a way to access the HMAT directly (probably via
> /sys/firmware/acpi/tables).  Then, through possibly a library and a daemon,
> provide an API so that applications can either request information about
> memory ranges, or request memory allocations that meet a given set of
> performance characteristics.
> 
> Option 2: Provide the user with HMAT performance data directly in sysfs,
> allowing applications to directly access it without the need for the
> library and daemon.
> 

Is it possible for the kernel to do the memory allocation automatically and
transparently to users?
It sounds unreasonable that most users should have to be aware of this
detailed memory topology.

--
Thanks,
Bob Liu

> The kernel work for option 1 is started by patches 1-3.  These just surface
> the minimal amount of information in sysfs to allow userspace to map
> between proximity domains and NUMA nodes so that the raw data in the HMAT
> table can be understood.
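
To make "Option 2" concrete, here is a small user-space sketch that consumes
one of the per-target attributes from the example above. Only the relative
mem_tgt2/... layout comes from the quoted cover letter; the absolute sysfs
directory and the rest of the program are assumptions for illustration:

#include <stdio.h>

/* HMEM_DIR is a guess at where the mem_tgtN directories would live in sysfs;
 * only the relative layout below is taken from the quoted example. */
#define HMEM_DIR "/sys/devices/system/hmem"

int main(void)
{
        unsigned long lat_nsec = 0;
        FILE *f = fopen(HMEM_DIR "/mem_tgt2/local_init/read_lat_nsec", "r");

        if (!f || fscanf(f, "%lu", &lat_nsec) != 1) {
                if (f)
                        fclose(f);
                return 1;
        }
        fclose(f);
        printf("mem_tgt2 read latency: %lu nsec\n", lat_nsec);
        return 0;
}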



Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5

2017-07-19 Thread Bob Liu
On 2017/7/19 10:25, Jerome Glisse wrote:
> On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
>> On 2017/7/18 23:38, Jerome Glisse wrote:
>>> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>>>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>>>>> Sorry i made horrible mistake on names in v4, i completly miss-
>>>>> understood the suggestion. So here i repost with proper naming.
>>>>> This is the only change since v3. Again sorry about the noise
>>>>> with v4.
>>>>>
>>>>> Changes since v4:
>>>>>   - s/DEVICE_HOST/DEVICE_PUBLIC
>>>>>
>>>>> Git tree:
>>>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v5
>>>>>
>>>>>
>>>>> Cache coherent device memory apply to architecture with system bus
>>>>> like CAPI or CCIX. Device connected to such system bus can expose
>>>>> their memory to the system and allow cache coherent access to it
>>>>> from the CPU.
>>>>>
>>>>> Even if for all intent and purposes device memory behave like regular
>>>>> memory, we still want to manage it in isolation from regular memory.
>>>>> Several reasons for that, first and foremost this memory is less
>>>>> reliable than regular memory if the device hangs because of invalid
>>>>> commands we can loose access to device memory. Second CPU access to
>>>>> this memory is expected to be slower than to regular memory. Third
>>>>> having random memory into device means that some of the bus bandwith
>>>>> wouldn't be available to the device but would be use by CPU access.
>>>>>
>>>>> This is why we want to manage such memory in isolation from regular
>>>>> memory. Kernel should not try to use this memory even as last resort
>>>>> when running out of memory, at least for now.
>>>>>
>>>>
>>>> I think set a very large node distance for "Cache Coherent Device Memory"
>>>> may be a easier way to address these concerns.
>>>
>>> Such approach was discuss at length in the past see links below. Outcome
>>> of discussion:
>>>   - CPU less node are bad
>>>   - device memory can be unreliable (device hang) no way for application
>>> to understand that
>>
>> Device memory can also be more reliable if using high quality and expensive 
>> memory.
> 
> Even ECC memory does not compensate for device hang. When your GPU lockups
> you might need to re-init GPU from scratch after which the content of the
> device memory is unreliable. During init the device memory might not get
> proper clock or proper refresh cycle and thus is susceptible to corruption.
> 
>>
>>>   - application and driver NUMA madvise/mbind/mempolicy ... can conflict
>>> with each other and no way the kernel can figure out which should
>>> apply
>>>   - NUMA as it is now would not work as we need further isolation that
>>> what a large node distance would provide
>>>
>>
>> Agree, that's where we need spend time on.
>>
>> One drawback of HMM-CDM I'm worry about is one more extra copy.
>> In the cache coherent case, CPU can write data to device memory
>> directly then start fpga/GPU/other accelerators.
> 
> There is not necessarily an extra copy. Device driver can pre-allocate
> virtual address range of a process with device memory. Device page fault

Okay, I get your point.
But the typical use case is that the CPU allocates a buffer and prepares/writes
the data, then launches a GPU "cuda kernel".
How do we control whether the allocation goes to device memory (e.g. HBM) or to
system DDR at the beginning, without explicit advice from the user?
If it goes to DDR by default, there is an extra copy. If it goes to HBM by
default, the HBM may be wasted.

> can directly allocate device memory. Once allocated CPU access will use
> the device memory.
> 

Then it's more like replacing the NUMA node solution (CDM) with ZONE_DEVICE
(type MEMORY_DEVICE_PUBLIC).
But the problem is the same, e.g. how to make sure that device memory, say HBM,
won't be occupied by normal CPU allocations.
Things get more complex if there are multiple GPUs connected by nvlink (also
cache coherent) in a system, each GPU with its own HBM.
How do we decide whether to allocate physical memory from local HBM/DDR or from
remote HBM/DDR?
If the NUMA (CDM) approach is used, there are at least the NUMA mempolicy and
autonuma mechanisms (see the sketch after this message).

Thanks,
Bob
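
As a rough sketch of what the existing NUMA mempolicy machinery gives us,
assuming a CDM/HBM range that shows up as an ordinary NUMA node: the node id
and buffer size below are purely hypothetical and this is not taken from any
of the patches being discussed (build with -lnuma):

#include <numa.h>
#include <string.h>

#define HBM_NODE 2              /* hypothetical node id of the HBM/CDM memory */

int main(void)
{
        size_t sz = 64UL << 20; /* 64MB buffer, size chosen arbitrarily */
        void *buf;

        if (numa_available() < 0)
                return 1;

        /* ask libnuma to place the pages on the chosen node */
        buf = numa_alloc_onnode(sz, HBM_NODE);
        if (!buf)
                return 1;

        memset(buf, 0, sz);     /* CPU writes go straight to that node's memory */
        /* ... hand the buffer to the accelerator here ... */
        numa_free(buf, sz);
        return 0;
}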




Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5

2017-07-18 Thread Bob Liu
On 2017/7/18 23:38, Jerome Glisse wrote:
> On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
>> On 2017/7/14 5:15, Jérôme Glisse wrote:
>>> Sorry i made horrible mistake on names in v4, i completly miss-
>>> understood the suggestion. So here i repost with proper naming.
>>> This is the only change since v3. Again sorry about the noise
>>> with v4.
>>>
>>> Changes since v4:
>>>   - s/DEVICE_HOST/DEVICE_PUBLIC
>>>
>>> Git tree:
>>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v5
>>>
>>>
>>> Cache coherent device memory apply to architecture with system bus
>>> like CAPI or CCIX. Device connected to such system bus can expose
>>> their memory to the system and allow cache coherent access to it
>>> from the CPU.
>>>
>>> Even if for all intent and purposes device memory behave like regular
>>> memory, we still want to manage it in isolation from regular memory.
>>> Several reasons for that, first and foremost this memory is less
>>> reliable than regular memory if the device hangs because of invalid
>>> commands we can loose access to device memory. Second CPU access to
>>> this memory is expected to be slower than to regular memory. Third
>>> having random memory into device means that some of the bus bandwith
>>> wouldn't be available to the device but would be use by CPU access.
>>>
>>> This is why we want to manage such memory in isolation from regular
>>> memory. Kernel should not try to use this memory even as last resort
>>> when running out of memory, at least for now.
>>>
>>
>> I think set a very large node distance for "Cache Coherent Device Memory"
>> may be a easier way to address these concerns.
> 
> Such approach was discuss at length in the past see links below. Outcome
> of discussion:
>   - CPU less node are bad
>   - device memory can be unreliable (device hang) no way for application
> to understand that

Device memory can also be more reliable if high quality and expensive memory
is used.

>   - application and driver NUMA madvise/mbind/mempolicy ... can conflict
> with each other and no way the kernel can figure out which should
> apply
>   - NUMA as it is now would not work as we need further isolation that
> what a large node distance would provide
> 

Agreed, that's where we need to spend time.

One drawback of HMM-CDM that I'm worried about is an extra copy.
In the cache coherent case, the CPU can write data to device memory directly
and then start the FPGA/GPU/other accelerators.

Thanks,
Bob Liu




Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5

2017-07-17 Thread Bob Liu
On 2017/7/14 5:15, Jérôme Glisse wrote:
> Sorry i made horrible mistake on names in v4, i completly miss-
> understood the suggestion. So here i repost with proper naming.
> This is the only change since v3. Again sorry about the noise
> with v4.
> 
> Changes since v4:
>   - s/DEVICE_HOST/DEVICE_PUBLIC
> 
> Git tree:
> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v5
> 
> 
> Cache coherent device memory apply to architecture with system bus
> like CAPI or CCIX. Device connected to such system bus can expose
> their memory to the system and allow cache coherent access to it
> from the CPU.
> 
> Even if for all intent and purposes device memory behave like regular
> memory, we still want to manage it in isolation from regular memory.
> Several reasons for that, first and foremost this memory is less
> reliable than regular memory if the device hangs because of invalid
> commands we can loose access to device memory. Second CPU access to
> this memory is expected to be slower than to regular memory. Third
> having random memory into device means that some of the bus bandwith
> wouldn't be available to the device but would be use by CPU access.
> 
> This is why we want to manage such memory in isolation from regular
> memory. Kernel should not try to use this memory even as last resort
> when running out of memory, at least for now.
>

I think setting a very large node distance for "Cache Coherent Device Memory"
may be an easier way to address these concerns.

--
Regards,
Bob Liu


 
> This patchset add a new type of ZONE_DEVICE memory (DEVICE_HOST)
> that is use to represent CDM memory. This patchset build on top of
> the HMM patchset that already introduce a new type of ZONE_DEVICE
> memory for private device memory (see HMM patchset).
> 
> The end result is that with this patchset if a device is in use in
> a process you might have private anonymous memory or file back
> page memory using ZONE_DEVICE (DEVICE_HOST). Thus care must be
> taken to not overwritte lru fields of such pages.
> 
> Hence all core mm changes are done to address assumption that any
> process memory is back by a regular struct page that is part of
> the lru. ZONE_DEVICE page are not on the lru and the lru pointer
> of struct page are use to store device specific informations.
> 
> Thus this patchset update all code path that would make assumptions
> about lruness of a process page.
> 
> patch 01 - rename DEVICE_PUBLIC to DEVICE_HOST to free DEVICE_PUBLIC name
> patch 02 - add DEVICE_PUBLIC type to ZONE_DEVICE (all core mm changes)
> patch 03 - add an helper to HMM for hotplug of CDM memory
> patch 04 - preparatory patch for memory controller changes (memch)
> patch 05 - update memory controller to properly handle
>ZONE_DEVICE pages when uncharging
> patch 06 - documentation patch
> 
> Previous posting:
> v1 https://lkml.org/lkml/2017/4/7/638
> v2 https://lwn.net/Articles/725412/
> v3 https://lwn.net/Articles/727114/
> v4 https://lwn.net/Articles/727692/
> 
> Jérôme Glisse (6):
>   mm/zone-device: rename DEVICE_PUBLIC to DEVICE_HOST
>   mm/device-public-memory: device memory cache coherent with CPU v4
>   mm/hmm: add new helper to hotplug CDM memory region v3
>   mm/memcontrol: allow to uncharge page without using page->lru field
>   mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC
> v3
>   mm/hmm: documents how device memory is accounted in rss and memcg
> 
>  Documentation/vm/hmm.txt |  40 
>  fs/proc/task_mmu.c   |   2 +-
>  include/linux/hmm.h  |   7 +-
>  include/linux/ioport.h   |   1 +
>  include/linux/memremap.h |  25 -
>  include/linux/mm.h   |  20 ++--
>  kernel/memremap.c|  19 ++--
>  mm/Kconfig   |  11 +++
>  mm/gup.c |   7 ++
>  mm/hmm.c |  89 --
>  mm/madvise.c |   2 +-
>  mm/memcontrol.c  | 231 
> ++-
>  mm/memory.c  |  46 +-
>  mm/migrate.c |  57 +++-
>  mm/swap.c|  11 +++
>  15 files changed, 434 insertions(+), 134 deletions(-)
> 
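
As a rough illustration of the "lruness" assumption the quoted cover letter
talks about, a hypothetical helper of the kind such code paths would need.
is_device_public_page() is the check referenced elsewhere in this thread; the
wrapper itself is made up:

/* ZONE_DEVICE public pages are never on the LRU; their lru fields carry
 * device specific data, so generic code must not treat them as LRU pages. */
static bool page_may_use_lru(struct page *page)
{
        if (is_device_public_page(page))
                return false;
        return true;
}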




Re: [HMM 00/15] HMM (Heterogeneous Memory Management) v23

2017-06-23 Thread Bob Liu
Hi,

On Thu, May 25, 2017 at 1:20 AM, Jérôme Glisse <jgli...@redhat.com> wrote:
> Patchset is on top of git://git.cmpxchg.org/linux-mmotm.git so i
> test same kernel as kbuild system, git branch:
>
> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v23
>
> Change since v22 is use of static key for special ZONE_DEVICE case in
> put_page() and build fix for architecture with no mmu.
>
> Everything else is the same. Below is the long description of what HMM
> is about and why. At the end of this email i describe briefly each patch
> and suggest reviewers for each of them.
>
>
> Heterogeneous Memory Management (HMM) (description and justification)
>
> Today device driver expose dedicated memory allocation API through their
> device file, often relying on a combination of IOCTL and mmap calls. The
> device can only access and use memory allocated through this API. This
> effectively split the program address space into object allocated for the
> device and useable by the device and other regular memory (malloc, mmap
> of a file, share memory, ...) only accessible by CPU (or in a very limited
> way by a device by pinning memory).
>
> Allowing different isolated component of a program to use a device thus
> require duplication of the input data structure using device memory
> allocator. This is reasonable for simple data structure (array, grid,
> image, ...) but this get extremely complex with advance data structure
> (list, tree, graph, ...) that rely on a web of memory pointers. This is
> becoming a serious limitation on the kind of work load that can be
> offloaded to device like GPU.
>
> New industry standard like C++, OpenCL or CUDA are pushing to remove this
> barrier. This require a shared address space between GPU device and CPU so
> that GPU can access any memory of a process (while still obeying memory
> protection like read only). This kind of feature is also appearing in
> various other operating systems.
>
> HMM is a set of helpers to facilitate several aspects of address space
> sharing and device memory management. Unlike existing sharing mechanism

It looks like the address space sharing and the device memory management
are two different things. They don't depend on each other, and HMM has
helpers for both.

Is it possible to separate these two things into two patchsets?
That would make them easier to review and would also follow the "Do one
thing, and do it well" philosophy.
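
To make the distinction concrete for other readers, here is a rough userspace
sketch (the DEV_IOC_ALLOC ioctl and /dev/accel0 device are made-up names, not
any real driver ABI). The address-space-sharing half alone already removes the
allocate-and-copy step described in the cover letter, independently of how the
device memory itself is managed:

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define DEV_IOC_ALLOC 0x4000			/* hypothetical ioctl */

struct node { struct node *next; int payload; };

/* Today: device memory comes from a driver-specific allocator, so a
 * pointer-based structure has to be duplicated before the device can
 * walk it. */
static void *alloc_device_buffer(int fd, size_t size)
{
	if (ioctl(fd, DEV_IOC_ALLOC, &size) < 0)
		return NULL;
	return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(void)
{
	struct node *n = calloc(1, sizeof(*n));
	int fd = open("/dev/accel0", O_RDWR);	/* hypothetical device */

	if (!n)
		return 1;
	n->payload = 42;

	if (fd >= 0) {
		void *buf = alloc_device_buffer(fd, 4096);

		/* The duplication step that address space sharing avoids:
		 * with a shared address space the device could simply be
		 * handed 'n' and chase n->next itself. */
		if (buf && buf != MAP_FAILED)
			memcpy(buf, n, sizeof(*n));
		close(fd);
	}
	free(n);
	return 0;
}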

Thanks,
Bob Liu


Re: [PATCH v7 0/7] Introduce ZONE_CMA

2017-04-23 Thread Bob Liu
On 2017/4/11 11:17, js1...@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo@lge.com>
> 
> Changed from v6
> o Rebase on next-20170405
> o Add a fix for lowmem mapping on ARM (last patch)
> o Re-organize the cover letter
> 
> Changes from v5
> o Rebase on next-20161013
> o Cosmetic change on patch 1
> o Optimize span of ZONE_CMA on multiple node system
> 
> Changes from v4
> o Rebase on next-20160825
> o Add general fix patch for lowmem reserve
> o Fix lowmem reserve ratio
> o Fix zone span optimizaion per Vlastimil
> o Fix pageset initialization
> o Change invocation timing on cma_init_reserved_areas()
> 
> Changes from v3
> o Rebase on next-20160805
> o Split first patch per Vlastimil
> o Remove useless function parameter per Vlastimil
> o Add code comment per Vlastimil
> o Add following description on cover-letter
> 
> Changes from v2
> o Rebase on next-20160525
> o No other changes except following description
> 
> Changes from v1
> o Separate some patches which deserve to submit independently
> o Modify description to reflect current kernel state
> (e.g. high-order watermark problem disappeared by Mel's work)
> o Don't increase SECTION_SIZE_BITS to make a room in page flags
> (detailed reason is on the patch that adds ZONE_CMA)
> o Adjust ZONE_CMA population code
> 
> 
> Hello,
> 
> This is the 7th version of ZONE_CMA patchset. One patch is added
> to fix potential problem on ARM. Other changes are just due to rebase.
> 
> This patchset has long history and got some reviews before. This
> cover-letter has the summary and my opinion on those reviews. Content
> order is so confusing so I make a simple index. If anyone want to
> understand the history properly, please read them by reverse order.
> 
> PART 1. Strong points of the zone approach
> PART 2. Summary in LSF/MM 2016 discussion
> PART 3. Original motivation of this patchset
> 
> * PART 1 *
> 
> CMA has many problems and I mentioned them on the bottom of the
> cover letter. These problems comes from limitation of CMA memory that
> should be always migratable for device usage. I think that introducing
> a new zone is the best approach to solve them. Here are the reasons.
> 
> Zone is introduced to solve some issues due to H/W addressing limitation.
> MM subsystem is implemented to work efficiently with these zones.
> Allocation/reclaim logic in MM consider this limitation very much.
> What I did in this patchset is introducing a new zone and extending zone's
> concept slightly. New concept is that zone can have not only H/W addressing
> limitation but also S/W limitation to guarantee page migration.
> This concept is originated from ZONE_MOVABLE and it works well
> for a long time. So, ZONE_CMA should not be special at this moment.
> 
> There is a major concern from Mel that ZONE_MOVABLE which has
> S/W limitation causes highmem/lowmem problem. Highmem/lowmem problem is
> that some of memory cannot be usable for kernel memory due to limitation
> of the zone. It causes to break LRU ordering and makes hard to find kernel
> usable memory when memory pressure.
> 
> However, important point is that this problem doesn't come from
> implementation detail (ZONE_MOVABLE/MIGRATETYPE). Even if we implement it
> by MIGRATETYPE instead of by ZONE_MOVABLE, we cannot use that type of
> memory for kernel allocation because it isn't migratable. So, it will cause
> to break LRU ordering, too. We cannot avoid the problem in any case.
> Therefore, we should focus on which solution is better for maintenance
> and not intrusive for MM subsystem.
> 
> In this viewpoint, I think that zone approach is better. As mentioned
> earlier, MM subsystem already have many infrastructures to deal with
> zone's H/W addressing limitation. Adding S/W limitation on zone concept
> and adding a new zone doesn't change anything. It will work by itself.
> My patchset can remove many hooks related to CMA area management in MM
> while solving the problems. More hooks are required to solve the problems
> if we choose MIGRATETYPE approach.
> 

Agreed, there are already too many hooks, and they are a pain to maintain and bugfix.
The ZONE_CMA approach looks better.

--
Regards,
Bob Liu




Re: [HMM 00/16] HMM (Heterogeneous Memory Management) v18

2017-03-17 Thread Bob Liu
On 2017/3/17 7:49, Jerome Glisse wrote:
> On Thu, Mar 16, 2017 at 01:43:21PM -0700, Andrew Morton wrote:
>> On Thu, 16 Mar 2017 12:05:19 -0400 Jérôme Glisse 
>> wrote:
>>
>>> Cliff note:
>>
>> "Cliff's notes" isn't appropriate for a large feature such as this. 
>> Where's the long-form description?  One which permits readers to fully
>> understand the requirements, design, alternative designs, the
>> implementation, the interface(s), etc?
>>
>> Have you ever spoken about HMM at a conference?  If so, the supporting
>> presentation documents might help here.  That's the level of detail
>> which should be presented here.
> 
> Longer description of patchset rational, motivation and design choices
> were given in the first few posting of the patchset to which i included
> a link in my cover letter. Also given that i presented that for last 3
> or 4 years to mm summit and kernel summit i thought that by now peoples
> were familiar about the topic and wanted to spare them the long version.
> My bad.
> 
> I attach a patch that is a first stab at a Documentation/hmm.txt that
> explain the motivation and rational behind HMM. I can probably add a
> section about how to use HMM from device driver point of view.
> 

And a simple example program or pseudo-code making use of the device memory
would also be very useful for people who don't have GPU programming experience :)
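
Something along these lines is the level of detail I have in mind -- purely
illustrative pseudo-code, where launch_on_device()/wait_for_device() are
made-up names standing in for whatever interface the driver exposes:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	size_t n = 1 << 20;
	float *data = malloc(n * sizeof(*data));	/* plain anonymous memory */

	if (!data)
		return 1;

	memset(data, 0, n * sizeof(*data));	/* CPU touch: pages live in system RAM */

	/*
	 * launch_on_device(data, n);
	 *	the driver may migrate these pages to device memory
	 *	(ZONE_DEVICE) while the device works on them
	 * wait_for_device();
	 */

	/*
	 * CPU reads the result: touching a page that is still in device
	 * memory faults and migrates it back to system RAM, so the program
	 * never needs to know where the pages currently are.
	 */
	float sum = 0;
	for (size_t i = 0; i < n; i++)
		sum += data[i];
	printf("%f\n", sum);

	free(data);
	return 0;
}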

Regards,
Bob




Re: [HMM 00/16] HMM (Heterogeneous Memory Management) v18

2017-03-17 Thread Bob Liu
On 2017/3/17 7:49, Jerome Glisse wrote:
> On Thu, Mar 16, 2017 at 01:43:21PM -0700, Andrew Morton wrote:
>> On Thu, 16 Mar 2017 12:05:19 -0400 Jérôme Glisse 
>> wrote:
>>
>>> Cliff note:
>>
>> "Cliff's notes" isn't appropriate for a large feature such as this. 
>> Where's the long-form description?  One which permits readers to fully
>> understand the requirements, design, alternative designs, the
>> implementation, the interface(s), etc?
>>
>> Have you ever spoken about HMM at a conference?  If so, the supporting
>> presentation documents might help here.  That's the level of detail
>> which should be presented here.
> 
> Longer description of patchset rational, motivation and design choices
> were given in the first few posting of the patchset to which i included
> a link in my cover letter. Also given that i presented that for last 3
> or 4 years to mm summit and kernel summit i thought that by now peoples
> were familiar about the topic and wanted to spare them the long version.
> My bad.
> 
> I attach a patch that is a first stab at a Documentation/hmm.txt that
> explain the motivation and rational behind HMM. I can probably add a
> section about how to use HMM from device driver point of view.
> 

Please, that would be very helpful!

> +3) Share address space and migration
> +
> +HMM intends to provide two main features. First one is to share the address
> +space by duplication the CPU page table into the device page table so same
> +address point to same memory and this for any valid main memory address in
> +the process address space.

Is this an optional feature?
I mean, the device doesn't have to duplicate the CPU page table,
but can make use of only the second (migration) feature.

> +The second mechanism HMM provide is a new kind of ZONE_DEVICE memory that 
> does
> +allow to allocate a struct page for each page of the device memory. Those 
> page
> +are special because the CPU can not map them. They however allow to migrate
> +main memory to device memory using exhisting migration mechanism and 
> everything
> +looks like if page was swap out to disk from CPU point of view. Using a 
> struct
> +page gives the easiest and cleanest integration with existing mm mechanisms.
> +Again here HMM only provide helpers, first to hotplug new ZONE_DEVICE memory
> +for the device memory and second to perform migration. Policy decision of 
> what
> +and when to migrate things is left to the device driver.
> +
> +Note that any CPU acess to a device page trigger a page fault which initiate 
> a
> +migration back to system memory so that CPU can access it.

I'm a bit confused here: do you mean a CPU access to a main memory page that
has already been migrated to device memory?
Then a page fault will be triggered and will initiate a migration back.

Thanks,
Bob




Re: [HMM 16/16] mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v2

2017-03-17 Thread Bob Liu
Hi Jérôme,

On 2017/3/17 0:05, Jérôme Glisse wrote:
> This introduce a dummy HMM device class so device driver can use it to
> create hmm_device for the sole purpose of registering device memory.

May I ask where the latest dummy HMM device driver is?
I could only find this one: https://patchwork.kernel.org/patch/4352061/
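
From the code in this patch, my reading of the intended usage in a driver is
roughly the sketch below (just my assumption from this patch alone, not a
tested driver, error handling trimmed):

#include <linux/err.h>
#include <linux/hmm.h>
#include <linux/module.h>

static struct hmm_device *dummy_hmm_device;

static int __init dummy_init(void)
{
	/* one fake struct device to hang all of our device memory on */
	dummy_hmm_device = hmm_device_new(NULL /* drvdata */);
	if (IS_ERR_OR_NULL(dummy_hmm_device))
		return -ENOMEM;

	/*
	 * ... then hotplug the device memory (ZONE_DEVICE) ranges against
	 * &dummy_hmm_device->device.
	 */
	return 0;
}

static void __exit dummy_exit(void)
{
	hmm_device_put(dummy_hmm_device);
}

module_init(dummy_init);
module_exit(dummy_exit);
MODULE_LICENSE("GPL");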

Thanks,
Bob

> It is usefull to device driver that want to manage multiple physical
> device memory under same struct device umbrella.
> 
> Changed since v1:
>   - Improve commit message
>   - Add drvdata parameter to set on struct device
> 
> Signed-off-by: Jérôme Glisse 
> Signed-off-by: Evgeny Baskakov 
> Signed-off-by: John Hubbard 
> Signed-off-by: Mark Hairgrove 
> Signed-off-by: Sherry Cheung 
> Signed-off-by: Subhash Gutti 
> ---
>  include/linux/hmm.h | 22 +++-
>  mm/hmm.c| 96 
> +
>  2 files changed, 117 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 3054ce7..e4e6b36 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -79,11 +79,11 @@
>  
>  #if IS_ENABLED(CONFIG_HMM)
>  
> +#include 
>  #include 
>  #include 
>  #include 
>  
> -
>  struct hmm;
>  
>  /*
> @@ -433,6 +433,26 @@ static inline unsigned long 
> hmm_devmem_page_get_drvdata(struct page *page)
>  
>   return drvdata[1];
>  }
> +
> +
> +/*
> + * struct hmm_device - fake device to hang device memory onto
> + *
> + * @device: device struct
> + * @minor: device minor number
> + */
> +struct hmm_device {
> + struct device   device;
> + unsignedminor;
> +};
> +
> +/*
> + * Device driver that wants to handle multiple devices memory through a 
> single
> + * fake device can use hmm_device to do so. This is purely a helper and it
> + * is not needed to make use of any HMM functionality.
> + */
> +struct hmm_device *hmm_device_new(void *drvdata);
> +void hmm_device_put(struct hmm_device *hmm_device);
>  #endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
>  
>  
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 019f379..c477bd1 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -24,6 +24,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1132,4 +1133,99 @@ int hmm_devmem_fault_range(struct hmm_devmem *devmem,
>   return 0;
>  }
>  EXPORT_SYMBOL(hmm_devmem_fault_range);
> +
> +/*
> + * A device driver that wants to handle multiple devices memory through a
> + * single fake device can use hmm_device to do so. This is purely a helper
> + * and it is not needed to make use of any HMM functionality.
> + */
> +#define HMM_DEVICE_MAX 256
> +
> +static DECLARE_BITMAP(hmm_device_mask, HMM_DEVICE_MAX);
> +static DEFINE_SPINLOCK(hmm_device_lock);
> +static struct class *hmm_device_class;
> +static dev_t hmm_device_devt;
> +
> +static void hmm_device_release(struct device *device)
> +{
> + struct hmm_device *hmm_device;
> +
> + hmm_device = container_of(device, struct hmm_device, device);
> + spin_lock(&hmm_device_lock);
> + clear_bit(hmm_device->minor, hmm_device_mask);
> + spin_unlock(&hmm_device_lock);
> +
> + kfree(hmm_device);
> +}
> +
> +struct hmm_device *hmm_device_new(void *drvdata)
> +{
> + struct hmm_device *hmm_device;
> + int ret;
> +
> + hmm_device = kzalloc(sizeof(*hmm_device), GFP_KERNEL);
> + if (!hmm_device)
> + return ERR_PTR(-ENOMEM);
> +
> + ret = alloc_chrdev_region(&hmm_device->device.devt, 0, 1, "hmm_device");
> + if (ret < 0) {
> + kfree(hmm_device);
> + return NULL;
> + }
> +
> + spin_lock(&hmm_device_lock);
> + hmm_device->minor = find_first_zero_bit(hmm_device_mask, HMM_DEVICE_MAX);
> + if (hmm_device->minor >= HMM_DEVICE_MAX) {
> + spin_unlock(&hmm_device_lock);
> + kfree(hmm_device);
> + return NULL;
> + }
> + set_bit(hmm_device->minor, hmm_device_mask);
> + spin_unlock(&hmm_device_lock);
> +
> + dev_set_name(&hmm_device->device, "hmm_device%d", hmm_device->minor);
> + hmm_device->device.devt = MKDEV(MAJOR(hmm_device_devt),
> + hmm_device->minor);
> + hmm_device->device.release = hmm_device_release;
> + dev_set_drvdata(&hmm_device->device, drvdata);
> + hmm_device->device.class = hmm_device_class;
> + device_initialize(&hmm_device->device);
> +
> + return hmm_device;
> +}
> +EXPORT_SYMBOL(hmm_device_new);
> +
> +void hmm_device_put(struct hmm_device *hmm_device)
> +{
> + put_device(&hmm_device->device);
> +}
> +EXPORT_SYMBOL(hmm_device_put);
> +
> +static int __init hmm_init(void)
> +{
> + int ret;
> +
> + ret = alloc_chrdev_region(&hmm_device_devt, 0,
> +   HMM_DEVICE_MAX,
> +   "hmm_device");
> + if (ret)
> + return ret;
> +
> + 

Re: mm allocation failure and hang when running xfstests generic/269 on xfs

2017-03-01 Thread Bob Liu
On 2017/3/2 13:19, Xiong Zhou wrote:
> On Wed, Mar 01, 2017 at 04:37:31PM -0800, Christoph Hellwig wrote:
>> On Wed, Mar 01, 2017 at 12:46:34PM +0800, Xiong Zhou wrote:
>>> Hi,
>>>
>>> It's reproduciable, not everytime though. Ext4 works fine.
>>
>> On ext4 fsstress won't run bulkstat because it doesn't exist.  Either
>> way this smells like a MM issue to me as there were not XFS changes
>> in that area recently.
> 
> Yap.
> 
> First bad commit:
> 

It looks like this is not a bug.
From the commit below, the allocation failure print was due to the current
process having received a SIGKILL signal.
You may need to confirm whether that's the case.
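
For reference, the effect of that commit on the vmalloc page allocation loop
is roughly the following (a paraphrase, not the exact hunk; the
fatal_signal_pending() check is the point here):

	for (i = 0; i < area->nr_pages; i++) {
		struct page *page;

		/*
		 * Back off instead of retrying once the allocating task has
		 * a fatal signal pending (e.g. fsstress being killed at the
		 * end of generic/269); the vmalloc then fails and the
		 * allocation failure message is printed for a task that is
		 * exiting anyway.
		 */
		if (fatal_signal_pending(current)) {
			area->nr_pages = i;
			goto fail;
		}

		page = alloc_page(gfp_mask);
		if (!page) {
			area->nr_pages = i;
			goto fail;
		}
		area->pages[i] = page;
	}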

Regards,
Bob

> commit 5d17a73a2ebeb8d1c6924b91e53ab2650fe86ffb
> Author: Michal Hocko 
> Date:   Fri Feb 24 14:58:53 2017 -0800
> 
> vmalloc: back off when the current task is killed
> 
> Reverting this commit on top of
>   e5d56ef Merge tag 'watchdog-for-linus-v4.11'
> survives the tests.
> 




Re: [PATCH V3 0/4] Define coherent device memory node

2017-02-26 Thread Bob Liu
On 2017/2/24 12:53, Jerome Glisse wrote:
> On Fri, Feb 24, 2017 at 09:06:19AM +0800, Bob Liu wrote:
>> On 2017/2/21 21:39, Anshuman Khandual wrote:
>>> On 02/21/2017 04:41 PM, Michal Hocko wrote:
>>>> On Fri 17-02-17 17:11:57, Anshuman Khandual wrote:
>>>> [...]
>>>>> * User space using mbind() to get CDM memory is an additional benefit
>>>>>   we get by making the CDM plug in as a node and be part of the buddy
>>>>>   allocator. But the over all idea from the user space point of view
>>>>>   is that the application can allocate any generic buffer and try to
>>>>>   use the buffer either from the CPU side or from the device without
>>>>>   knowing about where the buffer is really mapped physically. That
>>>>>   gives a seamless and transparent view to the user space where CPU
>>>>>   compute and possible device based compute can work together. This
>>>>>   is not possible through a driver allocated buffer.
>>>>
>>>> But how are you going to define any policy around that. Who is allowed
>>>
>>> The user space VMA can define the policy with a mbind(MPOL_BIND) call
>>> with CDM/CDMs in the nodemask.
>>>
>>>> to allocate and how much of this "special memory". Is it possible that
>>>
>>> Any user space application with mbind(MPOL_BIND) call with CDM/CDMs in
>>> the nodemask can allocate from the CDM memory. "How much" gets controlled
>>> by how we fault from CPU and the default behavior of the buddy allocator.
>>>
>>>> we will eventually need some access control mechanism? If yes then mbind
>>>
>>> No access control mechanism is needed. If an application wants to use
>>> CDM memory by specifying in the mbind() it can. Nothing prevents it
>>> from using the CDM memory.
>>>
>>>> is really not suitable interface to (ab)use. Also what should happen if
>>>> the mbind mentions only CDM memory and that is depleted?
>>>
>>> IIUC *only CDM* cannot be requested from user space as there are no user
>>> visible interface which can translate to __GFP_THISNODE. MPOL_BIND with
>>> CDM in the nodemask will eventually pick a FALLBACK zonelist which will
>>> have zones of the system including CDM ones. If the resultant CDM zones
>>> run out of memory, we fail the allocation request as usual.
>>>
>>>>
>>>> Could you also explain why the transparent view is really better than
>>>> using a device specific mmap (aka CDM awareness)?
>>>
>>> Okay with a transparent view, we can achieve a control flow of application
>>> like the following.
>>>
>>> (1) Allocate a buffer:  alloc_buffer(buf, size)
>>> (2) CPU compute on buffer:  cpu_compute(buf, size)
>>> (3) Device compute on buffer:   device_compute(buf, size)
>>> (4) CPU compute on buffer:  cpu_compute(buf, size)
>>> (5) Release the buffer: release_buffer(buf, size)
>>>
>>> With assistance from a device specific driver, the actual page mapping of
>>> the buffer can change between system RAM and device memory depending on
>>> which side is accessing at a given point. This will be achieved through
>>> driver initiated migrations.
>>>
>>
>> Sorry, I'm a bit confused here.
>> What's the difference with the Heterogeneous memory management?
>> Which also "allows to use device memory transparently inside any process
>> without any modifications to process program code."
> 
> HMM is first and foremost for platform (like Intel) where CPU can not
> access device memory in cache coherent way or at all. CDM is for more
> advance platform with a system bus that allow the CPU to access device
> memory in cache coherent way.
> 
> Hence CDM was design to integrate more closely in existing concept like
> NUMA. From my point of view it is like another level in the memory
> hierarchy. Nowaday you have local node memory and other node memory.
> In not too distant future you will have fast CPU on die memory, local
> memory (you beloved DDR3/DDR4), slightly slower but gigantic persistant
> memory and also device memory (all those local to a node).
> 
> On top of that you will still have the regular NUMA hierarchy between
> nodes. But each node will have its own local hierarchy of memory.
> 
> CDM wants to integrate with existing memory hinting API and i believe
> this is needed to get some experience with how end user might want to
> use this to fine tune their application.
> 
> Some bit of HMM are generic and will be reuse by CDM, for instance the
> DMA capable memory migration helpers. Wether they can also share HMM
> approach of using ZONE_DEVICE is yet to be proven but it comes with
> limitations (can't be on lru or have device lru) that might hinder a
> closer integration of CDM memory with many aspect of kernel mm.
> 
> 
> This is my own view and it likely differ in some way from the view of
> the people behind CDM :)
> 

Got it, thank you for the kind explanation.
And thank you too, John.

Regards,
Bob



Re: [PATCH V3 0/4] Define coherent device memory node

2017-02-23 Thread Bob Liu
On 2017/2/21 21:39, Anshuman Khandual wrote:
> On 02/21/2017 04:41 PM, Michal Hocko wrote:
>> On Fri 17-02-17 17:11:57, Anshuman Khandual wrote:
>> [...]
>>> * User space using mbind() to get CDM memory is an additional benefit
>>>   we get by making the CDM plug in as a node and be part of the buddy
>>>   allocator. But the over all idea from the user space point of view
>>>   is that the application can allocate any generic buffer and try to
>>>   use the buffer either from the CPU side or from the device without
>>>   knowing about where the buffer is really mapped physically. That
>>>   gives a seamless and transparent view to the user space where CPU
>>>   compute and possible device based compute can work together. This
>>>   is not possible through a driver allocated buffer.
>>
>> But how are you going to define any policy around that. Who is allowed
> 
> The user space VMA can define the policy with a mbind(MPOL_BIND) call
> with CDM/CDMs in the nodemask.
> 
>> to allocate and how much of this "special memory". Is it possible that
> 
> Any user space application with mbind(MPOL_BIND) call with CDM/CDMs in
> the nodemask can allocate from the CDM memory. "How much" gets controlled
> by how we fault from CPU and the default behavior of the buddy allocator.
> 
>> we will eventually need some access control mechanism? If yes then mbind
> 
> No access control mechanism is needed. If an application wants to use
> CDM memory by specifying in the mbind() it can. Nothing prevents it
> from using the CDM memory.
> 
>> is really not suitable interface to (ab)use. Also what should happen if
>> the mbind mentions only CDM memory and that is depleted?
> 
> IIUC *only CDM* cannot be requested from user space as there are no user
> visible interface which can translate to __GFP_THISNODE. MPOL_BIND with
> CDM in the nodemask will eventually pick a FALLBACK zonelist which will
> have zones of the system including CDM ones. If the resultant CDM zones
> run out of memory, we fail the allocation request as usual.
> 
>>
>> Could you also explain why the transparent view is really better than
>> using a device specific mmap (aka CDM awareness)?
> 
> Okay with a transparent view, we can achieve a control flow of application
> like the following.
> 
> (1) Allocate a buffer:        alloc_buffer(buf, size)
> (2) CPU compute on buffer:    cpu_compute(buf, size)
> (3) Device compute on buffer: device_compute(buf, size)
> (4) CPU compute on buffer:    cpu_compute(buf, size)
> (5) Release the buffer:       release_buffer(buf, size)
> 
> With assistance from a device specific driver, the actual page mapping of
> the buffer can change between system RAM and device memory depending on
> which side is accessing at a given point. This will be achieved through
> driver initiated migrations.
> 

Sorry, I'm a bit confused here.
What's the difference from Heterogeneous Memory Management (HMM),
which also "allows to use device memory transparently inside any process
without any modifications to process program code"?
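
(For my own understanding, the userspace side of the policy described above
would be something like the sketch below: plain mbind(2) via libnuma's
numaif.h, with node 2 just standing in for the CDM node id; link with -lnuma.)

#include <numaif.h>
#include <string.h>
#include <sys/mman.h>

#define CDM_NODE 2	/* example node id of the coherent device memory */

static void *alloc_buffer_on_cdm(size_t size)
{
	unsigned long nodemask = 1UL << CDM_NODE;
	void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;

	/* bind the VMA so faults on it are satisfied from the CDM node */
	if (mbind(buf, size, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0) != 0) {
		/* no such node or binding refused: pages stay in system RAM */
	}

	memset(buf, 0, size);	/* first touch actually allocates the pages */
	return buf;
}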

Thanks,
-Bob

>>  
>>> * The placement of the memory on the buffer can happen on system memory
>>>   when the CPU faults while accessing it. But a driver can manage the
>>>   migration between system RAM and CDM memory once the buffer is being
>>>   used from CPU and the device interchangeably. As you have mentioned
>>>   driver will have more information about where which part of the buffer
>>>   should be placed at any point of time and it can make it happen with
>>>   migration. So both allocation and placement are decided by the driver
>>>   during runtime. CDM provides the framework for this can kind device
>>>   assisted compute and driver managed memory placements.
>>>
>>> * If any application is not using CDM memory for along time placed on
>>>   its buffer and another application is forced to fallback on system
>>>   RAM when it really wanted is CDM, the driver can detect these kind
>>>   of situations through memory access patterns on the device HW and
>>>   take necessary migration decisions.




Re: [PATCH V3 1/4] mm: Define coherent device memory (CDM) node

2017-02-17 Thread Bob Liu
Hi Anshuman,

I have a few questions about coherent device memory.

On Wed, Feb 15, 2017 at 8:07 PM, Anshuman Khandual
 wrote:
> There are certain devices like specialized accelerator, GPU cards, network
> cards, FPGA cards etc which might contain onboard memory which is coherent
> along with the existing system RAM while being accessed either from the CPU
> or from the device. They share some similar properties with that of normal

What's the general size of this kind of memory?

> system RAM but at the same time can also be different with respect to
> system RAM.
>
> User applications might be interested in using this kind of coherent device

What kind of applications?

> memory explicitly or implicitly along side the system RAM utilizing all
> possible core memory functions like anon mapping (LRU), file mapping (LRU),
> page cache (LRU), driver managed (non LRU), HW poisoning, NUMA migrations

I don't see the benefit of managing the onboard memory the same way as system RAM.
Why not just map this kind of onboard memory to userspace directly?
Then only those specific applications would manage/access/use it.

It doesn't seem like a good idea to complicate the core memory framework a lot
for the sake of a few not widely used devices and uncertain applications.
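
Something along the lines of this minimal sketch is what I mean -- a
driver-private char device that hands the onboard memory straight to the
interested application via mmap(). cdm_dev_base and cdm_dev_size are
hypothetical driver fields; this is only an illustration, not code from any
existing driver:

static int cdm_dev_mmap(struct file *filp, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	if (size > cdm_dev_size)
		return -EINVAL;

	/* Map the device's coherent onboard memory (physical base
	 * cdm_dev_base) directly into the caller's address space. */
	return remap_pfn_range(vma, vma->vm_start,
			       cdm_dev_base >> PAGE_SHIFT,
			       size, vma->vm_page_prot);
}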

-
Regards,
Bob

> etc. To achieve this kind of tight integration with core memory subsystem,
> the device onboard coherent memory must be represented as a memory only
> NUMA node. At the same time the arch must export some kind of function to
> identify this node as coherent device memory, as opposed to any other
> regular CPU-less memory-only NUMA node.
>
> After achieving the integration with core memory subsystem coherent device
> memory might still need some special consideration inside the kernel. There
> can be a variety of coherent memory nodes with different expectations from
> the core kernel memory. But right now only one kind of special treatment is
> considered which requires certain isolation.
>
> Now consider the case of a coherent device memory node type which requires
> isolation. This kind of coherent memory is onboard an external device
> attached to the system through a link where there is always a chance of a
> link failure taking down the entire memory node with it. Moreover, the
> memory might also have a higher chance of ECC failure compared to the
> system RAM. Hence allocation into this kind of coherent memory node should
> be regulated. Kernel allocations must not come here. Normal user space
> allocations too should not come here implicitly (without user application
> knowing about it). This summarizes isolation requirement of certain kind of
> coherent device memory node as an example. There can be different kinds of
> isolation requirement also.
>
> Some coherent memory devices might not require isolation altogether after
> all. Then there might be other coherent memory devices which might require
> some other special treatment after being part of the core memory
> representation. For now, we will look into the isolation-seeking coherent
> device memory node, not the other ones.
>
> To implement the integration as well as isolation, the coherent memory node
> must be present in N_MEMORY and a new N_COHERENT_DEVICE node mask inside
> the node_states[] array. During memory hotplug operations, the new nodemask
> N_COHERENT_DEVICE is updated along with N_MEMORY for these coherent device
> memory nodes. This also creates the following new sysfs based interface to
> list down all the coherent memory nodes of the system.
>
> /sys/devices/system/node/is_coherent_node
>
> Architectures must export function arch_check_node_cdm() which identifies
> any coherent device memory node in case they enable CONFIG_COHERENT_DEVICE.
>
> Signed-off-by: Anshuman Khandual 
> ---
>  Documentation/ABI/stable/sysfs-devices-node |  7 
>  arch/powerpc/Kconfig|  1 +
>  arch/powerpc/mm/numa.c  |  7 
>  drivers/base/node.c |  6 +++
>  include/linux/nodemask.h| 58 
> -
>  mm/Kconfig  |  4 ++
>  mm/memory_hotplug.c |  3 ++
>  mm/page_alloc.c |  8 +++-
>  8 files changed, 91 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node 
> b/Documentation/ABI/stable/sysfs-devices-node
> index 5b2d0f0..5df18f7 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -29,6 +29,13 @@ Description:
> Nodes that have regular or high memory.
> Depends on CONFIG_HIGHMEM.
>
> +What:  /sys/devices/system/node/is_cdm_node
> +Date:  January 2017
> +Contact:   Linux Memory Management list 
> +Description:
> +   Lists the nodemask of nodes that have coherent device memory.
> +   Depends on CONFIG_COHERENT_DEVICE.
> +
>  What:  
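
For what it's worth, the kind of check the new nodemask seems meant to enable
would look roughly like the sketch below -- an illustration of skipping CDM
nodes for ordinary allocations, not code from this series:

/* Illustration only: pick the first node usable for ordinary (implicit)
 * allocations, skipping coherent device memory nodes. */
static int pick_regular_node(void)
{
	int nid;

	for_each_node_state(nid, N_MEMORY) {
		if (node_state(nid, N_COHERENT_DEVICE))
			continue;
		return nid;
	}
	return NUMA_NO_NODE;
}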

[PATCH] block: xen-blkback: don't get/put blkif ref for each queue

2016-09-26 Thread Bob Liu
Calling xen_blkif_get/put() for each queue is unnecessary, and it introduces a bug:

If there is I/O in flight when the device is removed, xen_blkif_disconnect() will
return -EBUSY and xen_blkif_put() will not be called.
This means the references are leaked, so even after the I/O completes,
xen_blkif_put() will never call xen_blkif_deferred_free() to free the resources.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkback/xenbus.c |2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index 3cc6d1d..2e1bb6d 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -159,7 +159,6 @@ static int xen_blkif_alloc_rings(struct xen_blkif *blkif)
init_waitqueue_head(&ring->shutdown_wq);
ring->blkif = blkif;
ring->st_print = jiffies;
-   xen_blkif_get(blkif);
}
 
return 0;
@@ -296,7 +295,6 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif)
BUG_ON(ring->free_pages_num != 0);
BUG_ON(ring->persistent_gnt_c != 0);
WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));
-   xen_blkif_put(blkif);
}
blkif->nr_ring_pages = 0;
/*
-- 
1.7.10.4
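
To make the lifetime problem concrete, here is a tiny standalone model of the
old behaviour (plain userspace C with made-up names, not kernel code): one
reference is taken per queue, the early -EBUSY return skips the matching puts,
and the count can never reach zero again:

#include <errno.h>
#include <stdio.h>

#define NR_QUEUES 4

static int refcnt;               /* stands in for the blkif refcount */
static int inflight = 1;         /* pretend I/O is still pending */

static void blkif_get(void) { refcnt++; }

static void blkif_put(void)
{
	if (--refcnt == 0)
		printf("deferred free runs\n");
}

static int disconnect(void)
{
	int i;

	if (inflight)
		return -EBUSY;   /* early return: the puts below are skipped */

	for (i = 0; i < NR_QUEUES; i++)
		blkif_put();
	return 0;
}

int main(void)
{
	int i;

	blkif_get();                     /* reference held for the device itself */
	for (i = 0; i < NR_QUEUES; i++)
		blkif_get();             /* the per-queue gets this patch removes */

	if (disconnect() == -EBUSY)
		printf("disconnect busy, refcnt=%d -> references leaked\n", refcnt);
	return 0;
}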




Re: [PATCH 1/3] xen-blkfront: fix places not updated after introducing 64KB page granularity

2016-07-28 Thread Bob Liu

On 07/28/2016 09:19 AM, Konrad Rzeszutek Wilk wrote:
> On Tue, Jul 26, 2016 at 01:19:35PM +0800, Bob Liu wrote:
>> Two places didn't get updated when 64KB page granularity was introduced; this
>> patch fixes them.
>>
>> Signed-off-by: Bob Liu <bob@oracle.com>
>> Acked-by: Roger Pau Monné <roger@citrix.com>
> 
> Could you rebase this on xen-tip/for-linus-4.8 pls?

Done, sent the v2 for you to pick up.

> 
>> ---
>>  drivers/block/xen-blkfront.c |4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index fcc5b4e..032fc94 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -1321,7 +1321,7 @@ free_shadow:
>>  rinfo->ring_ref[i] = GRANT_INVALID_REF;
>>  }
>>  }
>> -free_pages((unsigned long)rinfo->ring.sring, 
>> get_order(info->nr_ring_pages * PAGE_SIZE));
>> +free_pages((unsigned long)rinfo->ring.sring, 
>> get_order(info->nr_ring_pages * XEN_PAGE_SIZE));
>>  rinfo->ring.sring = NULL;
>>  
>>  if (rinfo->irq)
>> @@ -2013,7 +2013,7 @@ static int blkif_recover(struct blkfront_info *info)
>>  
>>  blkfront_gather_backend_features(info);
>>  segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> -blk_queue_max_segments(info->rq, segs);
>> +blk_queue_max_segments(info->rq, segs / GRANTS_PER_PSEG);
>>  
>>  for (r_index = 0; r_index < info->nr_rings; r_index++) {
>>  struct blkfront_ring_info *rinfo = >rinfo[r_index];
>> -- 
>> 1.7.10.4
>>



[PATCH v2 1/3] xen-blkfront: fix places not updated after introducing 64KB page granularity

2016-07-28 Thread Bob Liu
Two places didn't get updated when 64KB page granularity was introduced;
this patch fixes them.

Signed-off-by: Bob Liu <bob@oracle.com>
Acked-by: Roger Pau Monné <roger@citrix.com>
---
 drivers/block/xen-blkfront.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index ca0536e..36d9a0d 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1318,7 +1318,7 @@ free_shadow:
rinfo->ring_ref[i] = GRANT_INVALID_REF;
}
}
-   free_pages((unsigned long)rinfo->ring.sring, 
get_order(info->nr_ring_pages * PAGE_SIZE));
+   free_pages((unsigned long)rinfo->ring.sring, 
get_order(info->nr_ring_pages * XEN_PAGE_SIZE));
rinfo->ring.sring = NULL;
 
if (rinfo->irq)
@@ -2013,7 +2013,7 @@ static int blkif_recover(struct blkfront_info *info)
 
blkfront_gather_backend_features(info);
segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
-   blk_queue_max_segments(info->rq, segs);
+   blk_queue_max_segments(info->rq, segs / GRANTS_PER_PSEG);
bio_list_init(_list);
INIT_LIST_HEAD();
 
-- 
1.7.10.4
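
The arithmetic behind both hunks, as a small standalone illustration (the
constants below are assumptions for the example -- 64KB kernel pages with 4KB
Xen grants -- not values read from the driver headers):

#include <stdio.h>

#define XEN_PAGE_SIZE_EX    4096UL      /* grant granularity */
#define KERNEL_PAGE_SIZE_EX 65536UL     /* e.g. arm64 with 64KB pages */
#define GRANTS_PER_PSEG_EX  (KERNEL_PAGE_SIZE_EX / XEN_PAGE_SIZE_EX)

int main(void)
{
	unsigned long nr_ring_pages = 4;    /* example ring size */
	unsigned long grants = 32;          /* example per-request grant budget */

	/* The ring is allocated in Xen-page units, so freeing it must use
	 * XEN_PAGE_SIZE, not the (possibly 64KB) kernel PAGE_SIZE. */
	printf("ring bytes: %lu\n", nr_ring_pages * XEN_PAGE_SIZE_EX);

	/* Each kernel-page segment consumes GRANTS_PER_PSEG grants, so the
	 * limit handed to blk_queue_max_segments() is the grant budget
	 * divided by that factor. */
	printf("max segments: %lu\n", grants / GRANTS_PER_PSEG_EX);
	return 0;
}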




[PATCH 3/3] xen-blkfront: free resources if xlvbd_alloc_gendisk fails

2016-07-28 Thread Bob Liu
The current code forgets to free resources in the failure path of
xlvbd_alloc_gendisk(); this patch fixes it.

Signed-off-by: Bob Liu <bob@oracle.com>
---
 drivers/block/xen-blkfront.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index d5ed60b..d8429d4 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -2446,7 +2446,7 @@ static void blkfront_connect(struct blkfront_info *info)
if (err) {
xenbus_dev_fatal(info->xbdev, err, "xlvbd_add at %s",
 info->xbdev->otherend);
-   return;
+   goto fail;
}
 
xenbus_switch_state(info->xbdev, XenbusStateConnected);
@@ -2459,6 +2459,11 @@ static void blkfront_connect(struct blkfront_info *info)
add_disk(info->gd);
 
info->is_ready = 1;
+   return;
+
+fail:
+   blkif_free(info, 0);
+   return;
 }
 
 /**
-- 
1.7.10.4




[PATCH v2 2/3] xen-blkfront: introduce blkif_set_queue_limits()

2016-07-28 Thread Bob Liu
blk_mq_update_nr_hw_queues() resets all queue limits to their defaults, which
is not what xen-blkfront expects; introduce blkif_set_queue_limits() to restore
the limits to their correct initial values.

Signed-off-by: Bob Liu <bob@oracle.com>
Acked-by: Roger Pau Monné <roger@citrix.com>
---
 drivers/block/xen-blkfront.c |   87 +++---
 1 file changed, 48 insertions(+), 39 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 36d9a0d..d5ed60b 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -189,6 +189,8 @@ struct blkfront_info
struct mutex mutex;
struct xenbus_device *xbdev;
struct gendisk *gd;
+   u16 sector_size;
+   unsigned int physical_sector_size;
int vdevice;
blkif_vdev_t handle;
enum blkif_state connected;
@@ -910,9 +912,45 @@ static struct blk_mq_ops blkfront_mq_ops = {
.map_queue = blk_mq_map_queue,
 };
 
+static void blkif_set_queue_limits(struct blkfront_info *info)
+{
+   struct request_queue *rq = info->rq;
+   struct gendisk *gd = info->gd;
+   unsigned int segments = info->max_indirect_segments ? :
+   BLKIF_MAX_SEGMENTS_PER_REQUEST;
+
+   queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
+
+   if (info->feature_discard) {
+   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rq);
+   blk_queue_max_discard_sectors(rq, get_capacity(gd));
+   rq->limits.discard_granularity = info->discard_granularity;
+   rq->limits.discard_alignment = info->discard_alignment;
+   if (info->feature_secdiscard)
+   queue_flag_set_unlocked(QUEUE_FLAG_SECDISCARD, rq);
+   }
+
+   /* Hard sector size and max sectors impersonate the equiv. hardware. */
+   blk_queue_logical_block_size(rq, info->sector_size);
+   blk_queue_physical_block_size(rq, info->physical_sector_size);
+   blk_queue_max_hw_sectors(rq, (segments * XEN_PAGE_SIZE) / 512);
+
+   /* Each segment in a request is up to an aligned page in size. */
+   blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
+   blk_queue_max_segment_size(rq, PAGE_SIZE);
+
+   /* Ensure a merged request will fit in a single I/O ring slot. */
+   blk_queue_max_segments(rq, segments / GRANTS_PER_PSEG);
+
+   /* Make sure buffer addresses are sector-aligned. */
+   blk_queue_dma_alignment(rq, 511);
+
+   /* Make sure we don't use bounce buffers. */
+   blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY);
+}
+
 static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
-   unsigned int physical_sector_size,
-   unsigned int segments)
+   unsigned int physical_sector_size)
 {
struct request_queue *rq;
struct blkfront_info *info = gd->private_data;
@@ -944,37 +982,11 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
}
 
rq->queuedata = info;
-   queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
-
-   if (info->feature_discard) {
-   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rq);
-   blk_queue_max_discard_sectors(rq, get_capacity(gd));
-   rq->limits.discard_granularity = info->discard_granularity;
-   rq->limits.discard_alignment = info->discard_alignment;
-   if (info->feature_secdiscard)
-   queue_flag_set_unlocked(QUEUE_FLAG_SECDISCARD, rq);
-   }
-
-   /* Hard sector size and max sectors impersonate the equiv. hardware. */
-   blk_queue_logical_block_size(rq, sector_size);
-   blk_queue_physical_block_size(rq, physical_sector_size);
-   blk_queue_max_hw_sectors(rq, (segments * XEN_PAGE_SIZE) / 512);
-
-   /* Each segment in a request is up to an aligned page in size. */
-   blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
-   blk_queue_max_segment_size(rq, PAGE_SIZE);
-
-   /* Ensure a merged request will fit in a single I/O ring slot. */
-   blk_queue_max_segments(rq, segments / GRANTS_PER_PSEG);
-
-   /* Make sure buffer addresses are sector-aligned. */
-   blk_queue_dma_alignment(rq, 511);
-
-   /* Make sure we don't use bounce buffers. */
-   blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY);
-
-   gd->queue = rq;
-
+   info->rq = gd->queue = rq;
+   info->gd = gd;
+   info->sector_size = sector_size;
+   info->physical_sector_size = physical_sector_size;
+   blkif_set_queue_limits(info);
return 0;
 }
 
@@ -1139,16 +1151,11 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
gd->driverfs_dev = &(info->xbdev->dev);
set_capacity(gd, capacity);
 
-   if (xlvbd_init_blk_queue

Re: [PATCH] drivers: virtio_blk: notify blk-core when hw-queue number changes

2016-07-28 Thread Bob Liu

On 06/19/2016 06:10 AM, Paolo Bonzini wrote:
> 
> 
> On 13/06/2016 11:58, Bob Liu wrote:
>> A guest might be migrated to other hosts with a different num_queues; the
>> blk-core should be aware of that, else the reference to vblk->vqs[qid] may
>> be wrong.
>>
>> Signed-off-by: Bob Liu <bob@oracle.com>
>> ---
>>  drivers/block/virtio_blk.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>> index 42758b5..c169238 100644
>> --- a/drivers/block/virtio_blk.c
>> +++ b/drivers/block/virtio_blk.c
>> @@ -819,6 +819,9 @@ static int virtblk_restore(struct virtio_device *vdev)
>>  if (ret)
>>  return ret;
>>  
>> +if (vblk->num_vqs != vblk->tag_set.nr_hw_queues)
>> +blk_mq_update_nr_hw_queues(&vblk->tag_set, vblk->num_vqs);
>> +
>>  virtio_device_ready(vdev);
>>  
>>  blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
>>
> 
> This should never happen; it'd be a configuration problem.
> 

Do you mean all hosts have to be configured with the same number of ->num_vqs?
What about cases like migrating a guest from HostA to HostB, where HostB is much
more powerful and would like to run more hardware queues to get better
performance?

Thanks,
Bob Liu

 



Re: [PATCH v3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-27 Thread Bob Liu

On 07/27/2016 10:24 PM, Roger Pau Monné wrote:
> On Wed, Jul 27, 2016 at 07:21:05PM +0800, Bob Liu wrote:
>>
>> On 07/27/2016 06:59 PM, Roger Pau Monné wrote:
>>> On Wed, Jul 27, 2016 at 11:21:25AM +0800, Bob Liu wrote:
>>> [...]
>>>> +static ssize_t dynamic_reconfig_device(struct blkfront_info *info, 
>>>> ssize_t count)
>>>> +{
>>>> +  /*
>>>> +   * Prevent new requests even to software request queue.
>>>> +   */
>>>> +  blk_mq_freeze_queue(info->rq);
>>>> +
>>>> +  /*
>>>> +   * Guarantee no uncompleted reqs.
>>>> +   */
>>>
>>> I'm also wondering, why do you need to guarantee that there are no 
>>> uncompleted requests? The resume procedure is going to call blkif_recover 
>>> that will take care of requeuing any unfinished requests that are on the 
>>> ring.
>>>
>>
>> Because there may be requests in the software request queue with more
>> segments than we can handle (if info->max_indirect_segments is reduced).
>>
>> The blkif_recover() can't handle this since blk-mq was introduced,
>> because there is no way to iterate the sw-request queues(blk_fetch_request() 
>> can't be used by blk-mq).
>>
>> So there is a bug in blkif_recover(); I was thinking of implementing the
>> suspend function of blkfront_driver like:
> 
> Hm, this is a regression and should be fixed ASAP. I'm still not sure I 
> follow, don't blk_queue_max_segments change the number of segments the 
> requests on the queue are going to have? So that you will only have to 
> re-queue the requests already on the ring?
> 

That's not enough; request queues were split into software queues and hardware
queues when blk-mq was introduced.
We need to consider two more things:
 * Stop new requests from being added to the software queues before
   blk_queue_max_segments() is called (they would still be using the old
   'max-indirect-segments').
   I didn't see any other way to do this except calling blk_mq_freeze_queue().

 * Requests already sitting in the software queues with the old
   'max-indirect-segments' also have to be re-queued based on the new
   'max-indirect-segments'.
   There is no blk-mq API that can do this either.

> Waiting for the whole queue to be flushed before suspending is IMHO not 
> acceptable, it introduces an unbounded delay during migration if the backend 
> is slow for some reason.
> 

Right, I also hope there is a better solution.
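
For reference, the ordering I keep coming back to is sketched below, using the
xen-blkfront names from this thread; this only illustrates the
freeze/update/unfreeze sequence and is not a final patch:

static void blkfront_update_segments(struct blkfront_info *info,
				     unsigned int max_indirect_segments)
{
	unsigned int segs = max_indirect_segments ? :
			    BLKIF_MAX_SEGMENTS_PER_REQUEST;

	/* Stop new requests from entering the software queues and wait
	 * until everything already submitted has completed. */
	blk_mq_freeze_queue(info->rq);

	/* Safe now to shrink (or grow) the per-request limits. */
	blk_queue_max_segments(info->rq, segs / GRANTS_PER_PSEG);

	blk_mq_unfreeze_queue(info->rq);
}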

-- 
Regards,
-Bob


Re: [PATCH v3] xen-blkfront: dynamic configuration of per-vbd resources

2016-07-27 Thread Bob Liu

On 07/27/2016 10:24 PM, Roger Pau Monné wrote:
> On Wed, Jul 27, 2016 at 07:21:05PM +0800, Bob Liu wrote:
>>
>> On 07/27/2016 06:59 PM, Roger Pau Monné wrote:
>>> On Wed, Jul 27, 2016 at 11:21:25AM +0800, Bob Liu wrote:
>>> [...]
>>>> +static ssize_t dynamic_reconfig_device(struct blkfront_info *info, 
>>>> ssize_t count)
>>>> +{
>>>> +  /*
>>>> +   * Prevent new requests even to software request queue.
>>>> +   */
>>>> +  blk_mq_freeze_queue(info->rq);
>>>> +
>>>> +  /*
>>>> +   * Guarantee no uncompleted reqs.
>>>> +   */
>>>
>>> I'm also wondering, why do you need to guarantee that there are no 
>>> uncompleted requests? The resume procedure is going to call blkif_recover 
>>> that will take care of requeuing any unfinished requests that are on the 
>>> ring.
>>>
>>
>> Because there may have requests in the software request queue with more 
>> segments than
>> we can handle(if info->max_indirect_segments is reduced).
>>
>> The blkif_recover() can't handle this since blk-mq was introduced,
>> because there is no way to iterate the sw-request queues(blk_fetch_request() 
>> can't be used by blk-mq).
>>
>> So there is a bug in blkif_recover(), I was thinking implement the suspend 
>> function of blkfront_driver like:
> 
> Hm, this is a regression and should be fixed ASAP. I'm still not sure I 
> follow, don't blk_queue_max_segments change the number of segments the 
> requests on the queue are going to have? So that you will only have to 
> re-queue the requests already on the ring?
> 

That's not enough, request queues were split to software queues and hardware 
queues since blk-mq was introduced.
We need to consider two more things:
 * Stop new requests be added to software queues before 
blk_queue_max_segments() is called(still using old 'max-indirect-segments').
   I didn't see other way except call blk_mq_freeze_queue().

 * Requests already in software queues but with old 'max-indirect-segments' 
also have to be re-queued based on new 'max-indirect-segments'.
   Neither blk-mq API can do this.

> Waiting for the whole queue to be flushed before suspending is IMHO not 
> acceptable, it introduces an unbounded delay during migration if the backend 
> is slow for some reason.
> 

Right, I also hope there is better solution.

-- 
Regards,
-Bob


  1   2   3   4   5   6   7   8   9   >