Re: [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler

2023-12-13 Thread Vincent Guittot
On Thu, 14 Dec 2023 at 06:43, Viresh Kumar  wrote:
>
> On 12-12-23, 15:27, Vincent Guittot wrote:
> > @@ -2618,6 +2663,9 @@ static int cpufreq_set_policy(struct cpufreq_policy 
> > *policy,
> >   policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
> >   trace_cpu_frequency_limits(policy);
> >
> > + cpus = policy->related_cpus;
> > + cpufreq_update_pressure(cpus, policy->max);
> > +
> >   policy->cached_target_freq = UINT_MAX;
>
> One more question, why are you doing this from cpufreq_set_policy ? If
> due to cpufreq cooling or from userspace, we end up limiting the
> maximum possible frequency, will this routine always get called ?

Yes, any update of a FREQ_QOS_MAX ends up calling cpufreq_set_policy()
to update the policy->max
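For reference, the path being described is roughly (summarized from
drivers/cpufreq/cpufreq.c; treat the intermediate helpers as approximate,
they are not part of this patch):

    freq_qos_update_request(req, new_max)       /* thermal, userspace, ... */
      -> FREQ_QOS_MAX notifier (cpufreq_notifier_max)
        -> schedule_work(&policy->update)       /* handle_update() */
          -> refresh_frequency_limits(policy)
            -> cpufreq_set_policy(policy, ...)  /* re-resolves policy->max */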


>
> --
> viresh



[PATCH v5 3/4] mm/memory_hotplug: export mhp_supports_memmap_on_memory()

2023-12-13 Thread Vishal Verma
In preparation for adding sysfs ABI to toggle memmap_on_memory semantics
for drivers adding memory, export the mhp_supports_memmap_on_memory()
helper. This allows drivers to check if memmap_on_memory support is
available before trying to request it, and display an appropriate
message if it isn't available. As part of this, remove the size argument
to this helper - with recent updates to allow memmap_on_memory for larger
ranges, and the internal splitting of altmaps into respective memory
blocks, the size argument is meaningless.
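
A minimal sketch of the intended call pattern in a hotplugging driver
(illustrative only - nid, start, size and the want_memmap flag are
placeholders, not code from this series):

    mhp_t mhp_flags = MHP_NID_IS_MGID;

    if (want_memmap) {
            if (mhp_supports_memmap_on_memory())
                    mhp_flags |= MHP_MEMMAP_ON_MEMORY;
            else
                    pr_info("memmap_on_memory requested but not supported\n");
    }

    rc = add_memory_driver_managed(nid, start, size,
                                   "System RAM (example)", mhp_flags);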

Cc: Andrew Morton 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Dave Hansen 
Cc: Huang Ying 
Suggested-by: David Hildenbrand 
Signed-off-by: Vishal Verma 
---
 include/linux/memory_hotplug.h |  6 ++
 mm/memory_hotplug.c| 17 ++---
 2 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7d2076583494..ebc9d528f00c 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -121,6 +121,7 @@ struct mhp_params {
 
 bool mhp_range_allowed(u64 start, u64 size, bool need_mapping);
 struct range mhp_get_pluggable_range(bool need_mapping);
+bool mhp_supports_memmap_on_memory(void);
 
 /*
  * Zone resizing functions
@@ -262,6 +263,11 @@ static inline bool movable_node_is_enabled(void)
return false;
 }
 
+static inline bool mhp_supports_memmap_on_memory(void)
+{
+   return false;
+}
+
 static inline void pgdat_kswapd_lock(pg_data_t *pgdat) {}
 static inline void pgdat_kswapd_unlock(pg_data_t *pgdat) {}
 static inline void pgdat_kswapd_lock_init(pg_data_t *pgdat) {}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 926e1cfb10e9..751664c519f7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1325,7 +1325,7 @@ static inline bool 
arch_supports_memmap_on_memory(unsigned long vmemmap_size)
 }
 #endif
 
-static bool mhp_supports_memmap_on_memory(unsigned long size)
+bool mhp_supports_memmap_on_memory(void)
 {
unsigned long vmemmap_size = memory_block_memmap_size();
unsigned long memmap_pages = memory_block_memmap_on_memory_pages();
@@ -1334,17 +1334,11 @@ static bool mhp_supports_memmap_on_memory(unsigned long 
size)
 * Besides having arch support and the feature enabled at runtime, we
 * need a few more assumptions to hold true:
 *
-* a) We span a single memory block: memory onlining/offlining happens
-*in memory block granularity. We don't want the vmemmap of online
-*memory blocks to reside on offline memory blocks. In the future,
-*we might want to support variable-sized memory blocks to make the
-*feature more versatile.
-*
-* b) The vmemmap pages span complete PMDs: We don't want vmemmap code
+* a) The vmemmap pages span complete PMDs: We don't want vmemmap code
 *to populate memory from the altmap for unrelated parts (i.e.,
 *other memory blocks)
 *
-* c) The vmemmap pages (and thereby the pages that will be exposed to
+* b) The vmemmap pages (and thereby the pages that will be exposed to
 *the buddy) have to cover full pageblocks: memory 
onlining/offlining
 *code requires applicable ranges to be page-aligned, for example, 
to
 *set the migratetypes properly.
@@ -1356,7 +1350,7 @@ static bool mhp_supports_memmap_on_memory(unsigned long 
size)
 *   altmap as an alternative source of memory, and we do not 
exactly
 *   populate a single PMD.
 */
-   if (!mhp_memmap_on_memory() || size != memory_block_size_bytes())
+   if (!mhp_memmap_on_memory())
return false;
 
/*
@@ -1379,6 +1373,7 @@ static bool mhp_supports_memmap_on_memory(unsigned long 
size)
 
return arch_supports_memmap_on_memory(vmemmap_size);
 }
+EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
 
 static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size)
 {
@@ -1512,7 +1507,7 @@ int __ref add_memory_resource(int nid, struct resource 
*res, mhp_t mhp_flags)
 * Self hosted memmap array
 */
if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
-   mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
+   mhp_supports_memmap_on_memory()) {
ret = create_altmaps_and_memory_blocks(nid, group, start, size);
if (ret)
goto error;

-- 
2.41.0




[PATCH v5 4/4] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-13 Thread Vishal Verma
Add a sysfs knob for dax devices to control the memmap_on_memory setting
if the dax device were to be hotplugged as system memory.

The default memmap_on_memory setting for dax devices originating via
pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
preserve legacy behavior. For dax devices via CXL, the default is on.
The sysfs control allows the administrator to override the above
defaults if needed.

Cc: David Hildenbrand 
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Dave Hansen 
Cc: Huang Ying 
Tested-by: Li Zhijian 
Reviewed-by: Jonathan Cameron 
Reviewed-by: David Hildenbrand 
Signed-off-by: Vishal Verma 
---
 drivers/dax/bus.c   | 38 +
 Documentation/ABI/testing/sysfs-bus-dax | 17 +++
 2 files changed, 55 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 6226de131d17..f4d3beec507c 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1245,6 +1245,43 @@ static ssize_t numa_node_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(numa_node);
 
+static ssize_t memmap_on_memory_show(struct device *dev,
+struct device_attribute *attr, char *buf)
+{
+   struct dev_dax *dev_dax = to_dev_dax(dev);
+
+   return sprintf(buf, "%d\n", dev_dax->memmap_on_memory);
+}
+
+static ssize_t memmap_on_memory_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+   struct dev_dax *dev_dax = to_dev_dax(dev);
+   struct dax_device_driver *dax_drv;
+   ssize_t rc;
+   bool val;
+
+   rc = kstrtobool(buf, &val);
+   if (rc)
+   return rc;
+
+   if (val == true && !mhp_supports_memmap_on_memory()) {
+   dev_dbg(dev, "memmap_on_memory is not available\n");
+   return -EOPNOTSUPP;
+   }
+
+   guard(device)(dev);
+   dax_drv = to_dax_drv(dev->driver);
+   if (dax_drv && dev_dax->memmap_on_memory != val &&
+   dax_drv->type == DAXDRV_KMEM_TYPE)
+   return -EBUSY;
+   dev_dax->memmap_on_memory = val;
+
+   return len;
+}
+static DEVICE_ATTR_RW(memmap_on_memory);
+
 static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int 
n)
 {
struct device *dev = container_of(kobj, struct device, kobj);
@@ -1271,6 +1308,7 @@ static struct attribute *dev_dax_attributes[] = {
&dev_attr_align.attr,
&dev_attr_resource.attr,
&dev_attr_numa_node.attr,
+   &dev_attr_memmap_on_memory.attr,
NULL,
 };
 
diff --git a/Documentation/ABI/testing/sysfs-bus-dax 
b/Documentation/ABI/testing/sysfs-bus-dax
index 6359f7bc9bf4..40d9965733b2 100644
--- a/Documentation/ABI/testing/sysfs-bus-dax
+++ b/Documentation/ABI/testing/sysfs-bus-dax
@@ -134,3 +134,20 @@ KernelVersion: v5.1
 Contact:   nvd...@lists.linux.dev
 Description:
(RO) The id attribute indicates the region id of a dax region.
+
+What:  /sys/bus/dax/devices/daxX.Y/memmap_on_memory
+Date:  October, 2023
+KernelVersion: v6.8
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RW) Control the memmap_on_memory setting if the dax device
+   were to be hotplugged as system memory. This determines whether
+   the 'altmap' for the hotplugged memory will be placed on the
+   device being hotplugged (memmap_on_memory=1) or if it will be
+   placed on regular memory (memmap_on_memory=0). This attribute
+   must be set before the device is handed over to the 'kmem'
+   driver (i.e.  hotplugged into system-ram). Additionally, this
+   depends on CONFIG_MHP_MEMMAP_ON_MEMORY, and a globally enabled
+   memmap_on_memory parameter for memory_hotplug. This is
+   typically set on the kernel command line -
memory_hotplug.memmap_on_memory set to 'true' or 'force'.

-- 
2.41.0




[PATCH v5 2/4] dax/bus: Use guard(device) in sysfs attribute helpers

2023-12-13 Thread Vishal Verma
Use the guard(device) macro to lock a 'struct device', and unlock it
automatically when going out of scope using Scope Based Resource
Management semantics. A lot of the sysfs attribute writes in
drivers/dax/bus.c benefit from a cleanup using these, so change these
where applicable.
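
For context, a minimal sketch of the pattern being used (the actual
DEFINE_GUARD() for 'device' comes from the dependency patch referenced in
the cover letter; this example is illustrative, not part of the diff below):

    DEFINE_GUARD(device, struct device *, device_lock(_T), device_unlock(_T))

    static ssize_t example_show(struct device *dev,
                                struct device_attribute *attr, char *buf)
    {
            guard(device)(dev);             /* device_lock(dev) taken here */

            return sprintf(buf, "...\n");   /* device_unlock(dev) on return */
    }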

Cc: Joao Martins 
Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 drivers/dax/bus.c | 143 ++
 1 file changed, 59 insertions(+), 84 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1ff1ab5fa105..6226de131d17 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -294,13 +294,10 @@ static ssize_t available_size_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
struct dax_region *dax_region = dev_get_drvdata(dev);
-   unsigned long long size;
 
-   device_lock(dev);
-   size = dax_region_avail_size(dax_region);
-   device_unlock(dev);
+   guard(device)(dev);
 
-   return sprintf(buf, "%llu\n", size);
+   return sprintf(buf, "%llu\n", dax_region_avail_size(dax_region));
 }
 static DEVICE_ATTR_RO(available_size);
 
@@ -309,17 +306,14 @@ static ssize_t seed_show(struct device *dev,
 {
struct dax_region *dax_region = dev_get_drvdata(dev);
struct device *seed;
-   ssize_t rc;
 
if (is_static(dax_region))
return -EINVAL;
 
-   device_lock(dev);
+   guard(device)(dev);
seed = dax_region->seed;
-   rc = sprintf(buf, "%s\n", seed ? dev_name(seed) : "");
-   device_unlock(dev);
 
-   return rc;
+   return sprintf(buf, "%s\n", seed ? dev_name(seed) : "");
 }
 static DEVICE_ATTR_RO(seed);
 
@@ -328,24 +322,28 @@ static ssize_t create_show(struct device *dev,
 {
struct dax_region *dax_region = dev_get_drvdata(dev);
struct device *youngest;
-   ssize_t rc;
 
if (is_static(dax_region))
return -EINVAL;
 
-   device_lock(dev);
+   guard(device)(dev);
youngest = dax_region->youngest;
-   rc = sprintf(buf, "%s\n", youngest ? dev_name(youngest) : "");
-   device_unlock(dev);
 
-   return rc;
+   return sprintf(buf, "%s\n", youngest ? dev_name(youngest) : "");
 }
 
 static ssize_t create_store(struct device *dev, struct device_attribute *attr,
const char *buf, size_t len)
 {
struct dax_region *dax_region = dev_get_drvdata(dev);
+   struct dev_dax_data data = {
+   .dax_region = dax_region,
+   .size = 0,
+   .id = -1,
+   .memmap_on_memory = false,
+   };
unsigned long long avail;
+   struct dev_dax *dev_dax;
ssize_t rc;
int val;
 
@@ -358,38 +356,25 @@ static ssize_t create_store(struct device *dev, struct 
device_attribute *attr,
if (val != 1)
return -EINVAL;
 
-   device_lock(dev);
+   guard(device)(dev);
avail = dax_region_avail_size(dax_region);
if (avail == 0)
-   rc = -ENOSPC;
-   else {
-   struct dev_dax_data data = {
-   .dax_region = dax_region,
-   .size = 0,
-   .id = -1,
-   .memmap_on_memory = false,
-   };
-   struct dev_dax *dev_dax = devm_create_dev_dax(&data);
+   return -ENOSPC;
 
-   if (IS_ERR(dev_dax))
-   rc = PTR_ERR(dev_dax);
-   else {
-   /*
-* In support of crafting multiple new devices
-* simultaneously multiple seeds can be created,
-* but only the first one that has not been
-* successfully bound is tracked as the region
-* seed.
-*/
-   if (!dax_region->seed)
-   dax_region->seed = &dev_dax->dev;
-   dax_region->youngest = &dev_dax->dev;
-   rc = len;
-   }
-   }
-   device_unlock(dev);
+   dev_dax = devm_create_dev_dax(&data);
+   if (IS_ERR(dev_dax))
+   return PTR_ERR(dev_dax);
 
-   return rc;
+   /*
+* In support of crafting multiple new devices simultaneously multiple
+* seeds can be created, but only the first one that has not been
+* successfully bound is tracked as the region seed.
+*/
+   if (!dax_region->seed)
+   dax_region->seed = &dev_dax->dev;
+   dax_region->youngest = &dev_dax->dev;
+
+   return len;
 }
 static DEVICE_ATTR_RW(create);
 
@@ -481,12 +466,9 @@ static int __free_dev_dax_id(struct dev_dax *dev_dax)
 static int free_dev_dax_id(struct dev_dax *dev_dax)
 {
struct device *dev = &dev_dax->dev;
-   int rc;
 
-   device_lock(dev);
-   rc = __free_dev_dax_id(dev_dax);
-   device_unlock(dev);
-   return rc;
+   

[PATCH v5 1/4] Documentation/ABI: Add ABI documentation for sys-bus-dax

2023-12-13 Thread Vishal Verma
Add the missing sysfs ABI documentation for the device DAX subsystem.
Various ABI attributes under this have been present since v5.1, and more
have been added over time. In preparation for adding a new attribute,
add this file with the historical details.

Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 Documentation/ABI/testing/sysfs-bus-dax | 136 
 1 file changed, 136 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-dax 
b/Documentation/ABI/testing/sysfs-bus-dax
new file mode 100644
index ..6359f7bc9bf4
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-dax
@@ -0,0 +1,136 @@
+What:  /sys/bus/dax/devices/daxX.Y/align
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RW) Provides a way to specify an alignment for a dax device.
+   Values allowed are constrained by the physical address ranges
+   that back the dax device, and also by arch requirements.
+
+What:  /sys/bus/dax/devices/daxX.Y/mapping
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (WO) Provides a way to allocate a mapping range under a dax
+   device. Specified in the format -.
+
+What:  /sys/bus/dax/devices/daxX.Y/mapping[0..N]/start
+What:  /sys/bus/dax/devices/daxX.Y/mapping[0..N]/end
+What:  /sys/bus/dax/devices/daxX.Y/mapping[0..N]/page_offset
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) A dax device may have multiple constituent discontiguous
+   address ranges. These are represented by the different
+   'mappingX' subdirectories. The 'start' attribute indicates the
+   start physical address for the given range. The 'end' attribute
+   indicates the end physical address for the given range. The
+   'page_offset' attribute indicates the offset of the current
+   range in the dax device.
+
+What:  /sys/bus/dax/devices/daxX.Y/resource
+Date:  June, 2019
+KernelVersion: v5.3
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) The resource attribute indicates the starting physical
+   address of a dax device. In case of a device with multiple
+   constituent ranges, it indicates the starting address of the
+   first range.
+
+What:  /sys/bus/dax/devices/daxX.Y/size
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RW) The size attribute indicates the total size of a dax
+   device. For creating subdivided dax devices, or for resizing
+   an existing device, the new size can be written to this as
+   part of the reconfiguration process.
+
+What:  /sys/bus/dax/devices/daxX.Y/numa_node
+Date:  November, 2019
+KernelVersion: v5.5
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) If NUMA is enabled and the platform has affinitized the
+   backing device for this dax device, emit the CPU node
+   affinity for this device.
+
+What:  /sys/bus/dax/devices/daxX.Y/target_node
+Date:  February, 2019
+KernelVersion: v5.1
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) The target-node attribute is the Linux numa-node that a
+   device-dax instance may create when it is online. Prior to
+   being online the device's 'numa_node' property reflects the
+   closest online cpu node which is the typical expectation of a
+   device 'numa_node'. Once it is online it becomes its own
+   distinct numa node.
+
What:  $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/available_size
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) The available_size attribute tracks available dax region
+   capacity. This only applies to volatile hmem devices, not pmem
+   devices, since pmem devices are defined by nvdimm namespace
+   boundaries.
+
+What:  $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/size
+Date:  July, 2017
+KernelVersion: v5.1
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) The size attribute indicates the size of a given dax region
+   in bytes.
+
+What:  $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/align
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) The align attribute indicates alignment of the dax region.
+   Changes on align may not always be valid, when say certain
+  

[PATCH v5 0/4] Add DAX ABI for memmap_on_memory

2023-12-13 Thread Vishal Verma
The DAX drivers were missing sysfs ABI documentation entirely.  Add this
missing documentation for the sysfs ABI for DAX regions and Dax devices
in patch 1. Switch to guard(device) semantics for Scope Based Resource
Management for device_{lock,unlock} flows in drivers/dax/bus.c in patch
2. Export mhp_supports_memmap_on_memory() in patch 3. Add a new ABI for
toggling memmap_on_memory semantics in patch 4.

The missing ABI was spotted in [1], this series is a split of the new
ABI additions behind the initial documentation creation.

[1]: 
https://lore.kernel.org/linux-cxl/651f27b728fef_ae7e729...@dwillia2-xfh.jf.intel.com.notmuch/

---
This series depends on [2] which adds the definition for guard(device).
[2]: 
https://lore.kernel.org/r/170250854466.1522182.17555361077409628655.st...@dwillia2-xfh.jf.intel.com

---

Other Logistics -

Andrew, would you prefer patch 3 to go through mm? Or through the dax
tree with an mm ack? The remaining patches are all contained to dax, but
do depend on the memmap_on_memory set that is currently in mm-stable.

---

Changes in v5:
- Export and check mhp_supports_memmap_on_memory() in the DAX sysfs ABI
  (David)
- Obtain dax_drv under the device lock (Ying)
- Check dax_drv for NULL before dereferencing it (Ying)
- Clean up some repetition in sysfs-bus-dax documentation entries
  (Jonathan)
- A few additional cleanups enabled by guard(device) (Jonathan)
- Drop the DEFINE_GUARD() part of patch 2, add dependency on Dan's patch
  above so it can be backported / applied separately (Jonathan, Dan)
- Link to v4: 
https://lore.kernel.org/r/20231212-vv-dax_abi-v4-0-1351758f0...@intel.com

Changes in v4:
- Hold the device lock when checking if the dax_dev is bound to kmem
  (Ying, Dan)
- Remove dax region checks (and locks) as they were unnecessary.
- Introduce guard(device) for device lock/unlock (Dan)
- Convert the rest of drivers/dax/bus.c to guard(device)
- Link to v3: 
https://lore.kernel.org/r/20231211-vv-dax_abi-v3-0-acf6cc1bd...@intel.com

Changes in v3:
- Fix typo in ABI docs (Zhijian Li)
- Add kernel config and module parameter dependencies to the ABI docs
  entry (David Hildenbrand)
- Ensure kmem isn't active when setting the sysfs attribute (Ying
  Huang)
- Simplify returning from memmap_on_memory_store()
- Link to v2: 
https://lore.kernel.org/r/20231206-vv-dax_abi-v2-0-f4f4f2336...@intel.com

Changes in v2:
- Fix CC lists, patch 1/2 didn't get sent correctly in v1
- Link to v1: 
https://lore.kernel.org/r/20231206-vv-dax_abi-v1-0-474eb88e2...@intel.com

Cc: 
Cc: 
Cc: 
Cc: David Hildenbrand 
Cc: Dave Hansen 
Cc: Huang Ying 
Cc: Greg Kroah-Hartman 
Cc: 
To: Dan Williams 
To: Vishal Verma 
To: Dave Jiang 
To: Andrew Morton 
To: Oscar Salvador 

---
Vishal Verma (4):
  Documentation/ABI: Add ABI documentation for sys-bus-dax
  dax/bus: Use guard(device) in sysfs attribute helpers
  mm/memory_hotplug: export mhp_supports_memmap_on_memory()
  dax: add a sysfs knob to control memmap_on_memory behavior

 include/linux/memory_hotplug.h  |   6 ++
 drivers/dax/bus.c   | 181 +---
 mm/memory_hotplug.c |  17 ++-
 Documentation/ABI/testing/sysfs-bus-dax | 153 +++
 4 files changed, 262 insertions(+), 95 deletions(-)
---
base-commit: a6e0c2ca980d75d5ac6b2902c5c0028eaf094db3
change-id: 20231025-vv-dax_abi-17a219c46076

Best regards,
-- 
Vishal Verma 




Re: [PATCH] nvdimm-btt: simplify code with the scope based resource management

2023-12-13 Thread dinghao . liu
> > It's a little strange that we do not check super immediately after 
> > allocation.
> > How about this:
> > 
> >  static int discover_arenas(struct btt *btt)
> >  {
> > int ret = 0;
> > struct arena_info *arena;
> > -   struct btt_sb *super;
> > size_t remaining = btt->rawsize;
> > u64 cur_nlba = 0;
> > size_t cur_off = 0;
> > int num_arenas = 0;
> >  
> > -   super = kzalloc(sizeof(*super), GFP_KERNEL);
> > +   struct btt_sb *super __free(kfree) = 
> > +   kzalloc(sizeof(*super), GFP_KERNEL);
> > if (!super)
> > return -ENOMEM;
> >  
> > while (remaining) {
> >  
> 
> That's fine by me

I will resend a new patch soon, thanks!

Regards,
Dinghao

Re: [PATCH 2/2] ARM: dts: qcom: msm8926-motorola-peregrine: Add initial device tree

2023-12-13 Thread Krzysztof Kozlowski
On 13/12/2023 21:33, André Apitzsch wrote:
> This dts adds support for Motorola Moto G 4G released in 2013.
> 
> Add a device tree with initial support for:
> 
> - GPIO keys
> - Hall sensor
> - SDHCI
> - Vibrator
> 
> Signed-off-by: André Apitzsch 
> ---
>  arch/arm/boot/dts/qcom/Makefile|   1 +
>  .../dts/qcom/qcom-msm8926-motorola-peregrine.dts   | 297 
> +
>  2 files changed, 298 insertions(+)
> 
> diff --git a/arch/arm/boot/dts/qcom/Makefile b/arch/arm/boot/dts/qcom/Makefile
> index 0cb272f4fa45..9cc1e14e6cd0 100644
> --- a/arch/arm/boot/dts/qcom/Makefile
> +++ b/arch/arm/boot/dts/qcom/Makefile
> @@ -35,6 +35,7 @@ dtb-$(CONFIG_ARCH_QCOM) += \
>   qcom-msm8926-htc-memul.dtb \
>   qcom-msm8926-microsoft-superman-lte.dtb \
>   qcom-msm8926-microsoft-tesla.dtb \
> + qcom-msm8926-motorola-peregrine.dtb \
>   qcom-msm8960-cdp.dtb \
>   qcom-msm8960-samsung-expressatt.dtb \
>   qcom-msm8974-lge-nexus5-hammerhead.dtb \
> diff --git a/arch/arm/boot/dts/qcom/qcom-msm8926-motorola-peregrine.dts 
> b/arch/arm/boot/dts/qcom/qcom-msm8926-motorola-peregrine.dts
> new file mode 100644
> index ..3c5256120502
> --- /dev/null
> +++ b/arch/arm/boot/dts/qcom/qcom-msm8926-motorola-peregrine.dts
> @@ -0,0 +1,297 @@
> +// SPDX-License-Identifier: BSD-3-Clause
> +
> +/dts-v1/;
> +
> +#include "qcom-msm8226.dtsi"
> +#include "pm8226.dtsi"
> +
> +/delete-node/ _region;
> +
> +/ {
> + model = "Motorola Moto G 4G";
> + compatible = "motorola,peregrine", "qcom,msm8926", "qcom,msm8226";
> + chassis-type = "handset";
> +
> + aliases {
> + mmc0 = &sdhc_1; /* SDC1 eMMC slot */
> + mmc1 = &sdhc_2; /* SDC2 SD card slot */
> + };
> +
> + chosen {
> + #address-cells = <1>;
> + #size-cells = <1>;
> + ranges;
> +
> + framebuffer0: framebuffer@320 {
> + compatible = "simple-framebuffer";
> + reg = <0x0320 0x80>;
> + width = <720>;
> + height = <1280>;
> + stride = <(720 * 3)>;
> + format = "r8g8b8";
> + };
> + };
> +
> + gpio-hall-sensor {
> + compatible = "gpio-keys";
> +
> + label = "GPIO Hall Effect Sensor";
> +
> + event-hall-sensor {
> + label = "Hall Effect Sensor";
> + gpios = <&tlmm 51 GPIO_ACTIVE_LOW>;
> + linux,input-type = <EV_SW>;
> + linux,code = <SW_LID>;
> + linux,can-disable;
> + };
> + };
> +
> + gpio-keys {

No need to have two nodes for gpio-keys. Combine them.

> + compatible = "gpio-keys";
> +
> + key-volume-up {
> + label = "Volume Up";



Best regards,
Krzysztof




Re: [PATCH 1/2] dt-bindings: arm: qcom: Add Motorola Moto G 4G

2023-12-13 Thread Krzysztof Kozlowski
On 13/12/2023 21:33, André Apitzsch wrote:
> Document the compatible for the MSM8926-based Motorola Moto G 4G smartphone.
> 
> Signed-off-by: André Apitzsch 
> ---

Acked-by: Krzysztof Kozlowski 

Best regards,
Krzysztof




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-13 Thread Michael S. Tsirkin
On Wed, Dec 13, 2023 at 09:55:23AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 13, 2023 at 01:45:35PM +0100, Tobias Huschle wrote:
> > On Wed, Dec 13, 2023 at 07:00:53AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Dec 13, 2023 at 11:37:23AM +0100, Tobias Huschle wrote:
> > > > On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> > > > > On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > > > > > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin 
> > > > > >  wrote:
> > 
> > [...]
> > > 
> > > Apparently schedule is already called?
> > > 
> > 
> > What about this: 
> > 
> > static int vhost_task_fn(void *data)
> > {
> > <...>
> > did_work = vtsk->fn(vtsk->data);  --> this calls vhost_worker if I'm 
> > not mistaken
> > if (!did_work)
> > schedule();
> > <...>
> > }
> > 
> > static bool vhost_worker(void *data)
> > {
> > struct vhost_worker *worker = data;
> > struct vhost_work *work, *work_next;
> > struct llist_node *node;
> > 
> > node = llist_del_all(&worker->work_list);
> > if (node) {
> > <...>
> > llist_for_each_entry_safe(work, work_next, node, node) {
> > <...>
> > }
> > }
> > 
> > return !!node;
> > }
> > 
> > The llist_for_each_entry_safe does not actually change the node value, 
> > doesn't it?
> > 
> > If it does not change it, !!node would return 1.
> > Thereby skipping the schedule.
> > 
> > This was changed recently with:
> > f9010dbdce91 fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
> > 
> > It returned a hardcoded 0 before. The commit message explicitly mentions 
> > this
> > change to make vhost_worker return 1 if it did something.
> > 
> > Seems indeed like a nasty little side effect caused by EEVDF not scheduling
> > the woken up kworker right away.
> 
> 
> So we are actually making an effort to be nice.
> Documentation/kernel-hacking/hacking.rst says:
> 
> If you're doing longer computations: first think userspace. If you
> **really** want to do it in kernel you should regularly check if you need
> to give up the CPU (remember there is cooperative multitasking per CPU).
> Idiom::
> 
> cond_resched(); /* Will sleep */
> 
> 
> and this is what vhost.c does.
> 
> At this point I'm not sure why it's appropriate to call schedule() as opposed 
> to
> cond_resched(). Ideas?
> 

Peter, would appreciate feedback on this. When is cond_resched()
insufficient to give up the CPU? Should Documentation/kernel-hacking/hacking.rst
be updated to require schedule() instead?


> -- 
> MST




Re: [PATCH] ring-buffer: Remove 32bit timestamp logic

2023-12-13 Thread Linus Torvalds
On Wed, 13 Dec 2023 at 18:45, Steven Rostedt  wrote:
>
> tl;dr;  The ring-buffer timestamp requires a 64-bit cmpxchg to keep the
> timestamps in sync (only in the slow paths). I was told that 64-bit cmpxchg
> can be extremely slow on 32-bit architectures. So I created a rb_time_t
> that on 64-bit was a normal local64_t type, and on 32-bit it's represented
> by 3 32-bit words and a counter for synchronization. But this now requires
> three 32-bit cmpxchgs for where one simple 64-bit cmpxchg would do.

It's not that a 64-bit cmpxchg is even slow. It doesn't EXIST AT ALL
on older 32-bit x86 machines.

Which is why we have

arch/x86/lib/cmpxchg8b_emu.S

which emulates it on machines that don't have the CX8 capability
("CX8" being the x86 capability flag name for the cmpxchg8b
instruction, aka 64-bit cmpxchg).

Which only works because those older 32-bit cpu's also don't do SMP,
so there are no SMP cache coherency issues, only interrupt atomicity
issues.

IOW, the way to do an atomic 64-bit cmpxchg on the affected hardware
is to simply disable interrupts.
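
Conceptually, the emulation amounts to the following (a C sketch of what
the cmpxchg8b_emu.S assembly does; it is UP-only, so masking interrupts is
enough to make the read-compare-write sequence atomic):

    static u64 cmpxchg64_emulated(u64 *ptr, u64 old, u64 new)
    {
            unsigned long flags;
            u64 cur;

            local_irq_save(flags);
            cur = *ptr;
            if (cur == old)
                    *ptr = new;
            local_irq_restore(flags);

            return cur;     /* caller compares the result against 'old' */
    }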

In other words - it's not just slow.  It's *really* slow. As in 10x
slower, not "slightly slower".

> We started discussing how much time this is actually saving to be worth the
> complexity, and actually found some hardware to test. One Atom processor.

That atom processor won't actually show the issue. It's much too
recent. So your "test" is actually worthless.

And you probably did this all with a kernel config that had
CONFIG_X86_CMPXCHG64 set anyway, which wouldn't even boot on a i486
machine.

So in fact your test was probably doubly broken, in that not only
didn't you test the slow case, you tested something that wouldn't even
have worked in the environment where the slow case happened.

Now, the real question is if anybody cares about CPUs that don't have
cmpxchg8b support.

Because in practice, it's really just old 486-class machines (and a
couple of clone manufacturers who _claimed_ to be Pentium class, but
weren't - there was also some odd thing with Windows breaking if you
had CPUID claiming to support CX8).

We dropped support for the original 80386 some time ago. I'd actually
be willing to drop support for all pre-cmpxchg8b machines, and get rid
of the emulation.

I also suspect that from a perf angle, none of this matters. The
emulation being slow probably is a non-issue, simply because even if
you run on an old i486 machine, you probably won't be doing perf or
tracing on it.

 Linus



Re: [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler

2023-12-13 Thread Viresh Kumar
On 12-12-23, 15:27, Vincent Guittot wrote:
> @@ -2618,6 +2663,9 @@ static int cpufreq_set_policy(struct cpufreq_policy 
> *policy,
>   policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
>   trace_cpu_frequency_limits(policy);
>  
> + cpus = policy->related_cpus;
> + cpufreq_update_pressure(cpus, policy->max);
> +
>   policy->cached_target_freq = UINT_MAX;

One more question, why are you doing this from cpufreq_set_policy ? If
due to cpufreq cooling or from userspace, we end up limiting the
maximum possible frequency, will this routine always get called ?

-- 
viresh



Re: [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler

2023-12-13 Thread Viresh Kumar
On 13-12-23, 16:41, Tim Chen wrote:
> Seems like the pressure value computed from the first CPU applies to all CPUs.
> Will this be valid for non-homogeneous CPUs that could have different
> max_freq and max_capacity?

They will be part of different cpufreq policies and so it will work
fine.
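
In other words, something along these lines (a hedged sketch of the idea
under discussion, not the actual implementation from this series): the
capped frequency is scaled against the policy's own maximum and the result
is written for every CPU in related_cpus, so heterogeneous CPUs that sit
in different policies each get their own pressure value.

    static DEFINE_PER_CPU(unsigned long, cpufreq_pressure);

    static void cpufreq_update_pressure(const struct cpumask *cpus,
                                        unsigned long capped_freq)
    {
            int first = cpumask_first(cpus), cpu;
            unsigned long max_cap = arch_scale_cpu_capacity(first);
            unsigned long max_freq = cpufreq_get_hw_max_freq(first);
            unsigned long pressure;

            /* every CPU of one policy shares the same cpuinfo.max_freq */
            pressure = max_cap - mult_frac(max_cap, capped_freq, max_freq);

            for_each_cpu(cpu, cpus)
                    WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
    }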

-- 
viresh



[PATCH v4] trace/kprobe: Display the actual notrace function when rejecting a probe

2023-12-13 Thread Naveen N Rao
Trying to probe update_sd_lb_stats() using perf results in the below
message in the kernel log:
trace_kprobe: Could not probe notrace function _text

This is because 'perf probe' specifies the kprobe location as an offset
from '_text':
$ sudo perf probe -D update_sd_lb_stats
p:probe/update_sd_lb_stats _text+1830728

However, the error message is misleading and doesn't help convey the
actual notrace function that is being probed. Fix this by looking up the
actual function name that is being probed. With this fix, we now get the
below message in the kernel log:
trace_kprobe: Could not probe notrace function 
update_sd_lb_stats.constprop.0
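
The difference comes down to the printk specifier (a small illustration of
the two messages above):

    pr_warn("... %s\n",  trace_kprobe_symbol(tk));          /* "_text" */
    pr_warn("... %ps\n", (void *)trace_kprobe_address(tk)); /* "update_sd_lb_stats.constprop.0" */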

Signed-off-by: Naveen N Rao 
---
v4: Use printk format specifier %ps with probe address to lookup the 
symbol, as suggested by Masami.

 kernel/trace/trace_kprobe.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 3d7a180a8427..0017404d6e8d 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -487,8 +487,8 @@ static int __register_trace_kprobe(struct trace_kprobe *tk)
return -EINVAL;
 
if (within_notrace_func(tk)) {
-   pr_warn("Could not probe notrace function %s\n",
-   trace_kprobe_symbol(tk));
+   pr_warn("Could not probe notrace function %ps\n",
+   (void *)trace_kprobe_address(tk));
return -EINVAL;
}
 

base-commit: 4758560fa268cecfa1144f015aa9f2525d164b7e
-- 
2.43.0




Re: [PATCH v3] trace/kprobe: Display the actual notrace function when rejecting a probe

2023-12-13 Thread Naveen N Rao
On Thu, Dec 14, 2023 at 08:02:10AM +0900, Masami Hiramatsu wrote:
> On Wed, 13 Dec 2023 20:09:14 +0530
> Naveen N Rao  wrote:
> 
> > Trying to probe update_sd_lb_stats() using perf results in the below
> > message in the kernel log:
> >   trace_kprobe: Could not probe notrace function _text
> > 
> > This is because 'perf probe' specifies the kprobe location as an offset
> > from '_text':
> >   $ sudo perf probe -D update_sd_lb_stats
> >   p:probe/update_sd_lb_stats _text+1830728
> > 
> > However, the error message is misleading and doesn't help convey the
> > actual notrace function that is being probed. Fix this by looking up the
> > actual function name that is being probed. With this fix, we now get the
> > below message in the kernel log:
> >   trace_kprobe: Could not probe notrace function 
> > update_sd_lb_stats.constprop.0
> > 
> > Signed-off-by: Naveen N Rao 
> > ---
> > v3: Remove tk parameter from within_notrace_func() as suggested by 
> > Masami
> > 
> >  kernel/trace/trace_kprobe.c | 11 ++-
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> > index 3d7a180a8427..dc36c6ed6131 100644
> > --- a/kernel/trace/trace_kprobe.c
> > +++ b/kernel/trace/trace_kprobe.c
> > @@ -449,9 +449,8 @@ static bool __within_notrace_func(unsigned long addr)
> > return !ftrace_location_range(addr, addr + size - 1);
> >  }
> >  
> > -static bool within_notrace_func(struct trace_kprobe *tk)
> > +static bool within_notrace_func(unsigned long addr)
> >  {
> > -   unsigned long addr = trace_kprobe_address(tk);
> > char symname[KSYM_NAME_LEN], *p;
> >  
> > if (!__within_notrace_func(addr))
> > @@ -471,12 +470,14 @@ static bool within_notrace_func(struct trace_kprobe 
> > *tk)
> > return true;
> >  }
> >  #else
> > -#define within_notrace_func(tk)(false)
> > +#define within_notrace_func(addr)  (false)
> >  #endif
> >  
> >  /* Internal register function - just handle k*probes and flags */
> >  static int __register_trace_kprobe(struct trace_kprobe *tk)
> >  {
> > +   unsigned long addr = trace_kprobe_address(tk);
> > +   char symname[KSYM_NAME_LEN];
> > int i, ret;
> >  
> > ret = security_locked_down(LOCKDOWN_KPROBES);
> > @@ -486,9 +487,9 @@ static int __register_trace_kprobe(struct trace_kprobe 
> > *tk)
> > if (trace_kprobe_is_registered(tk))
> > return -EINVAL;
> >  
> > -   if (within_notrace_func(tk)) {
> > +   if (within_notrace_func(addr)) {
> > pr_warn("Could not probe notrace function %s\n",
> > -   trace_kprobe_symbol(tk));
> > +   lookup_symbol_name(addr, symname) ? 
> > trace_kprobe_symbol(tk) : symname);
> 
> Can we just use %ps and (void *)trace_kprobe_address(tk) here?
> 
> That will be simpler.

Indeed - that is much simpler. v4 on its way...

Thanks!
- Naveen



[PATCH v2] ring-buffer: Remove 32bit timestamp logic

2023-12-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Each event has a 27 bit timestamp delta that is used to hold the delta
from the last event. If the time between events is greater than 2^27, then
a timestamp is added that holds a 59 bit absolute timestamp.
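
As a sketch of that decision (illustrative names, not the actual
ring-buffer internals):

    delta = ts - last_ts;
    if (delta >= (1ULL << 27)) {
            /* delta no longer fits in the 27 bit field: emit a
             * time-stamp extend event carrying the absolute time */
            add_timestamp = true;
            delta = 0;
    }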

Until a389d86f7fd09 ("ring-buffer: Have nested events still record running
time stamp"), if an interrupt interrupted an event in progress, all the
events delta would be zero to not deal with the races that need to be
handled. The commit a389d86f7fd09 changed that to handle the races giving
all events, even those that preempt other events, still have an accurate
timestamp.

To handle those races requires performing 64-bit cmpxchg on the
timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered
very slow. To try to deal with this the timestamp logic was broken into
two and then three 32-bit cmpxchgs, with the thought that two (or three)
32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit
architectures.

Part of the problem with this is that I didn't have any 32-bit
architectures to test on. After hitting several subtle bugs in this code,
an effort was made to try and see if three 32-bit cmpxchgs are indeed
faster than a single 64-bit. After a few people brushed off the dust of
their old 32-bit machines, tests were done, and even though 32-bit cmpxchg
was faster than a single 64-bit, it was in the order of 50% at best, not
300%.

This means that this complex code is not only complex but also not even
faster than just using 64-bit cmpxchg.

Nuke it!

This is basically a revert of 10464b4aa605e ("ring-buffer: Add rb_time_t
64 bit operations for speeding up 32 bit").

Cc: sta...@vger.kernel.org
Fixes: 10464b4aa605e ("ring-buffer: Add rb_time_t 64 bit operations for 
speeding up 32 bit")
Acked-by: Mathieu Desnoyers 
Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20231213211126.24f8c...@gandalf.local.home/

- Removed left over debug code

- Rebased on my urgent branch before the updates to this code, thinking
  that I'm going to send this now and have this get backported to the
  stable kernels, instead of the other updates.

 kernel/trace/ring_buffer.c | 221 ++---
 1 file changed, 11 insertions(+), 210 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1d9caee7f542..42f1bb8c794c 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 
+#include 
 #include 
 
 /*
@@ -463,27 +464,9 @@ enum {
RB_CTX_MAX
 };
 
-#if BITS_PER_LONG == 32
-#define RB_TIME_32
-#endif
-
-/* To test on 64 bit machines */
-//#define RB_TIME_32
-
-#ifdef RB_TIME_32
-
-struct rb_time_struct {
-   local_t cnt;
-   local_t top;
-   local_t bottom;
-   local_t msb;
-};
-#else
-#include 
 struct rb_time_struct {
local64_t   time;
 };
-#endif
 typedef struct rb_time_struct rb_time_t;
 
 #define MAX_NEST   5
@@ -573,179 +556,9 @@ struct ring_buffer_iter {
int missed_events;
 };
 
-#ifdef RB_TIME_32
-
-/*
- * On 32 bit machines, local64_t is very expensive. As the ring
- * buffer doesn't need all the features of a true 64 bit atomic,
- * on 32 bit, it uses these functions (64 still uses local64_t).
- *
- * For the ring buffer, 64 bit required operations for the time is
- * the following:
- *
- *  - Reads may fail if it interrupted a modification of the time stamp.
- *  It will succeed if it did not interrupt another write even if
- *  the read itself is interrupted by a write.
- *  It returns whether it was successful or not.
- *
- *  - Writes always succeed and will overwrite other writes and writes
- *  that were done by events interrupting the current write.
- *
- *  - A write followed by a read of the same time stamp will always succeed,
- *  but may not contain the same value.
- *
- *  - A cmpxchg will fail if it interrupted another write or cmpxchg.
- *  Other than that, it acts like a normal cmpxchg.
- *
- * The 60 bit time stamp is broken up by 30 bits in a top and bottom half
- *  (bottom being the least significant 30 bits of the 60 bit time stamp).
- *
- * The two most significant bits of each half holds a 2 bit counter (0-3).
- * Each update will increment this counter by one.
- * When reading the top and bottom, if the two counter bits match then the
- *  top and bottom together make a valid 60 bit number.
- */
-#define RB_TIME_SHIFT  30
-#define RB_TIME_VAL_MASK ((1 << RB_TIME_SHIFT) - 1)
-#define RB_TIME_MSB_SHIFT   60
-
-static inline int rb_time_cnt(unsigned long val)
-{
-   return (val >> RB_TIME_SHIFT) & 3;
-}
-
-static inline u64 rb_time_val(unsigned long top, unsigned long bottom)
-{
-   u64 val;
-
-   val = top & RB_TIME_VAL_MASK;
-   val <<= RB_TIME_SHIFT;
-   val |= bottom & RB_TIME_VAL_MASK;
-
-   return val;

Re: [PATCH] virtio_balloon: stay awake while adjusting balloon

2023-12-13 Thread David Stevens
On Wed, Dec 13, 2023 at 5:44 PM David Hildenbrand  wrote:
>
> On 11.12.23 12:43, David Stevens wrote:
> > From: David Stevens 
> >
>
> Hi David,
>
> > Add a wakeup event for when the balloon is inflating or deflating.
> > Userspace can enable this wakeup event to prevent the system from
> > suspending while the balloon is being adjusted. This allows
> > /sys/power/wakeup_count to be used without breaking virtio_balloon's
> > cooperative memory management.
>
> Can you add/share some more details

I'm working on enabling support for Linux s2Idle in our Android
virtual machine, to restrict apps from running in the background
without holding an Android partial wakelock. With the patch I recently
sent out [1], since crosvm advertises native PCI power management for
virtio devices, the Android guest can properly enter s2idle, and it
can be woken up by incoming IO. However, one of the remaining problems
is that when the host needs to reclaim memory from the guest via the
virtio-balloon, there is nothing preventing the guest from entering
s2idle before the balloon driver finishes returning memory to the
host.

One alternative to this approach would be to add a virtballoon_suspend
callback to abort suspend if the balloon is inflating/adjusting.
However, it seems cleaner to just prevent suspend in the first place.

[1] https://lore.kernel.org/lkml/20231208070754.3132339-1-steve...@chromium.org/
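
For completeness, the rejected alternative would look roughly like this
(a sketch only; 'virtballoon_suspend' is the hypothetical callback named
above and adjustment_in_progress is the field added by this patch):

    static int virtballoon_suspend(struct virtio_device *vdev)
    {
            struct virtio_balloon *vb = vdev->priv;

            /* refuse to suspend while an adjustment is still in flight */
            if (READ_ONCE(vb->adjustment_in_progress))
                    return -EBUSY;

            return 0;
    }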

> >
> > Signed-off-by: David Stevens 
> > ---
> >   drivers/virtio/virtio_balloon.c | 59 +++--
> >   1 file changed, 49 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/virtio/virtio_balloon.c 
> > b/drivers/virtio/virtio_balloon.c
> > index 1fe93e93f5bc..811d8937246a 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -119,6 +119,11 @@ struct virtio_balloon {
> >   /* Free page reporting device */
> >   struct virtqueue *reporting_vq;
> >   struct page_reporting_dev_info pr_dev_info;
> > +
> > + /* State for keeping the wakeup_source active while adjusting the 
> > balloon */
> > + spinlock_t adjustment_lock;
> > + u32 adjustment_seqno;
>
> Using a simple flag that gets set when updating the balloon size and
> test-and-clear when testing for changes should be easier to get.
>
> bool adjustment_balloon_size_changed;
>
> or sth like that.

That's a good simplification, thanks.

> > + bool adjustment_in_progress;
> >   };
> >
> >   static const struct virtio_device_id id_table[] = {
> > @@ -437,6 +442,31 @@ static void virtio_balloon_queue_free_page_work(struct 
> > virtio_balloon *vb)
> >   queue_work(vb->balloon_wq, &vb->report_free_page_work);
> >   }
> >
> > +static void start_update_balloon_size(struct virtio_balloon *vb)
> > +{
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&vb->adjustment_lock, flags);
> > + vb->adjustment_seqno++;
> > + if (!vb->adjustment_in_progress) {
> > + vb->adjustment_in_progress = true;
> > + pm_stay_awake(&vb->vdev->dev);
> > + }
> > + spin_unlock_irqrestore(&vb->adjustment_lock, flags);
> > +
> > + queue_work(system_freezable_wq, &vb->update_balloon_size_work);
> > +}
> > +
> > +static void end_update_balloon_size(struct virtio_balloon *vb, u32 seqno)
> > +{
> > + spin_lock(&vb->adjustment_lock);
> > + if (vb->adjustment_seqno == seqno && vb->adjustment_in_progress) {
> > + vb->adjustment_in_progress = false;
> > + pm_relax(&vb->vdev->dev);
> > + }
> > + spin_unlock(&vb->adjustment_lock);
> > +}
> > +
> >   static void virtballoon_changed(struct virtio_device *vdev)
> >   {
> >   struct virtio_balloon *vb = vdev->priv;
> > @@ -444,8 +474,7 @@ static void virtballoon_changed(struct virtio_device 
> > *vdev)
> >
> >   spin_lock_irqsave(&vb->stop_update_lock, flags);
> >   if (!vb->stop_update) {
> > - queue_work(system_freezable_wq,
> > -   &vb->update_balloon_size_work);
> > + start_update_balloon_size(vb);
> >   virtio_balloon_queue_free_page_work(vb);
> >   }
> >   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
> > @@ -473,22 +502,29 @@ static void update_balloon_size_func(struct 
> > work_struct *work)
> >   {
> >   struct virtio_balloon *vb;
> >   s64 diff;
> > + u32 seqno;
> >
> >   vb = container_of(work, struct virtio_balloon,
> > update_balloon_size_work);
> > - diff = towards_target(vb);
> >
> > - if (!diff)
> > - return;
> > + spin_lock(&vb->adjustment_lock);
> > + seqno = vb->adjustment_seqno;
> > + spin_unlock(&vb->adjustment_lock);
> >
> > - if (diff > 0)
> > - diff -= fill_balloon(vb, diff);
> > - else
> > - diff += leak_balloon(vb, -diff);
> > - update_balloon_size(vb);
> > + diff = towards_target(vb);
> > +
> > + if (diff) {
> > + if (diff > 0)
> > + diff -= fill_balloon(vb, diff);
> > 

Re: [PATCH v6 1/3] scsi: ufs: qcom: dt-bindings: Add SC7280 compatible string

2023-12-13 Thread Martin K. Petersen


Luca,

> Document the compatible string for the UFS found on SC7280.

Applied to 6.8/scsi-staging, thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering



BUG: unable to handle kernel paging request in bpf_probe_read_compat_str

2023-12-13 Thread xingwei lee
Hello, I found a bug in net/bpf in the latest upstream Linux, and
confirmed it in the latest net tree and latest bpf tree, titled: BUG:
unable to handle kernel paging request in bpf_probe_read_compat_str

If you fix this issue, please add the following tag to the commit:
Reported-by: xingwei Lee 

kernel: net 9702817384aa4a3700643d0b26e71deac0172cfd / bpf
2f2fee2bf74a7e31d06fc6cb7ba2bd4dd7753c99
Kernel config: 
https://syzkaller.appspot.com/text?tag=KernelConfig=b50bd31249191be8

In the latest bpf tree, the crash looks like:

TITLE: BUG: unable to handle kernel paging request in bpf_probe_read_compat_str
CORRUPTED: false ()
MAINTAINERS (TO): [a...@linux-foundation.org linux...@kvack.org]
MAINTAINERS (CC): [linux-kernel@vger.kernel.org]

BUG: unable to handle page fault for address: ff0
#PF: supervisor read access in kernel mode
#PF: error_code(0x) - not-present page
PGD cf7a067 P4D cf7a067 PUD cf7c067 PMD cf9f067 0
Oops:  [#1] PREEMPT SMP KASAN
CPU: 1 PID: 8219 Comm: 9de Not tainted 6.7.0-rc41
Hardware name: QEMU Standard PC (i440FX + PIIX, 4
RIP: 0010:strncpy_from_kernel_nofault+0xc4/0x270 mm/maccess.c:91
Code: 83 85 6c 17 00 00 01 48 8b 2c 24 eb 18 e8 0
RSP: 0018:c900114e7ac0 EFLAGS: 00010293
RAX:  RBX: c900114e7b30 RCX:2
RDX: 8880183abcc0 RSI: 81b8c9c4 RDI:c
RBP: ff60 R08: 0001 R09:0
R10: 0001 R11: 0001 R12:8
R13: ff60 R14: 0008 R15:0
FS:  () GS:88823bc0(0
CS:  0010 DS:  ES:  CR0: 80050033
CR2: ff60 CR3: 0cf77000 CR4:0
PKRU: 5554
Call Trace:

bpf_probe_read_kernel_str_common kernel/trace/bpf_trace.c:262 [inline]
bpf_probe_read_compat_str kernel/trace/bpf_trace.c:310 [inline]
bpf_probe_read_compat_str+0x12f/0x170 kernel/trace/bpf_trace.c:303
bpf_prog_f17ebaf3f5f7baf8+0x42/0x44
bpf_dispatcher_nop_func include/linux/bpf.h:1196 [inline]
__bpf_prog_run include/linux/filter.h:651 [inline]
bpf_prog_run include/linux/filter.h:658 [inline]
__bpf_trace_run kernel/trace/bpf_trace.c:2307 [inline]
bpf_trace_run2+0x14e/0x410 kernel/trace/bpf_trace.c:2346
trace_kfree include/trace/events/kmem.h:94 [inline]
kfree+0xec/0x150 mm/slab_common.c:1043
vma_numab_state_free include/linux/mm.h:638 [inline]
__vm_area_free+0x3e/0x140 kernel/fork.c:525
remove_vma+0x128/0x170 mm/mmap.c:146
exit_mmap+0x453/0xa70 mm/mmap.c:3332
__mmput+0x12a/0x4d0 kernel/fork.c:1349
mmput+0x62/0x70 kernel/fork.c:1371
exit_mm kernel/exit.c:567 [inline]
do_exit+0x9aa/0x2ac0 kernel/exit.c:858
do_group_exit+0xd4/0x2a0 kernel/exit.c:1021
__do_sys_exit_group kernel/exit.c:1032 [inline]
__se_sys_exit_group kernel/exit.c:1030 [inline]
__x64_sys_exit_group+0x3e/0x50 kernel/exit.c:1030
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x41/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b


=* repro.c =*
// autogenerated by syzkaller (https://github.com/google/syzkaller)

#define _GNU_SOURCE

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#ifndef __NR_bpf
#define __NR_bpf 321
#endif

#define BITMASK(bf_off, bf_len) (((1ull << (bf_len)) - 1) << (bf_off))
#define STORE_BY_BITMASK(type, htobe, addr, val, bf_off, bf_len) \
 *(type*)(addr) =   \
 htobe((htobe(*(type*)(addr)) & ~BITMASK((bf_off), (bf_len))) | \
   (((type)(val) << (bf_off)) & BITMASK((bf_off), (bf_len))))

uint64_t r[1] = {0x};

int main(void) {
 syscall(__NR_mmap, /*addr=*/0x1000ul, /*len=*/0x1000ul, /*prot=*/0ul,
 /*flags=*/0x32ul, /*fd=*/-1, /*offset=*/0ul);
 syscall(__NR_mmap, /*addr=*/0x2000ul, /*len=*/0x100ul, /*prot=*/7ul,
 /*flags=*/0x32ul, /*fd=*/-1, /*offset=*/0ul);
 syscall(__NR_mmap, /*addr=*/0x2100ul, /*len=*/0x1000ul, /*prot=*/0ul,
 /*flags=*/0x32ul, /*fd=*/-1, /*offset=*/0ul);
 intptr_t res = 0;
 *(uint32_t*)0x20c0 = 0x11;
 *(uint32_t*)0x20c4 = 0xb;
 *(uint64_t*)0x20c8 = 0x2180;
 *(uint8_t*)0x2180 = 0x18;
 STORE_BY_BITMASK(uint8_t, , 0x2181, 0, 0, 4);
 STORE_BY_BITMASK(uint8_t, , 0x2181, 0, 4, 4);
 *(uint16_t*)0x2182 = 0;
 *(uint32_t*)0x2184 = 0;
 *(uint8_t*)0x2188 = 0;
 *(uint8_t*)0x2189 = 0;
 *(uint16_t*)0x218a = 0;
 *(uint32_t*)0x218c = 0;
 *(uint8_t*)0x2190 = 0x18;
 STORE_BY_BITMASK(uint8_t, , 0x2191, 1, 0, 4);
 STORE_BY_BITMASK(uint8_t, , 0x2191, 0, 4, 4);
 *(uint16_t*)0x2192 = 0;
 *(uint32_t*)0x2194 = 0x25702020;
 *(uint8_t*)0x2198 = 0;
 *(uint8_t*)0x2199 = 0;
 *(uint16_t*)0x219a = 0;
 *(uint32_t*)0x219c = 0x20202000;
 STORE_BY_BITMASK(uint8_t, , 0x21a0, 3, 0, 3);
 STORE_BY_BITMASK(uint8_t, , 0x21a0, 3, 3, 2);
 STORE_BY_BITMASK(uint8_t, , 0x21a0, 3, 5, 3);
 STORE_BY_BITMASK(uint8_t, , 0x21a1, 0xa, 0, 4);
 STORE_BY_BITMASK(uint8_t, , 0x21a1, 1, 4, 4);
 *(uint16_t*)0x21a2 = 0xfff8;
 *(uint32_t*)0x21a4 = 

Re: [PATCH] ring-buffer: Remove 32bit timestamp logic

2023-12-13 Thread Mathieu Desnoyers

On 2023-12-13 21:11, Steven Rostedt wrote:

From: "Steven Rostedt (Google)" 

Each event has a 27 bit timestamp delta that is used to hold the delta
from the last event. If the time between events is greater than 2^27, then
a timestamp is added that holds a 59 bit absolute timestamp.

Until a389d86f7fd09 ("ring-buffer: Have nested events still record running
time stamp"), if an interrupt interrupted an event in progress, all the
events delta would be zero to not deal with the races that need to be
handled. The commit a389d86f7fd09 changed that to handle the races giving
all events, even those that preempt other events, still have an accurate
timestamp.

To handle those races requires performing 64-bit cmpxchg on the
timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered
very slow. To try to deal with this the timestamp logic was broken into
two and then three 32-bit cmpxchgs, with the thought that two (or three)
32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit
architectures.

Part of the problem with this is that I didn't have any 32-bit
architectures to test on. After hitting several subtle bugs in this code,
an effort was made to try and see if three 32-bit cmpxchgs are indeed
faster than a single 64-bit. After a few people brushed off the dust of
their old 32-bit machines, tests were done, and even though 32-bit cmpxchg
was faster than a single 64-bit, it was in the order of 50% at best, not
300%.


I literally had to dust off my old Eee PC for this :)



This means that this complex code is not only complex but also not even
faster than just using 64-bit cmpxchg.

Nuke it!



Acked-by: Mathieu Desnoyers 


@@ -3720,6 +3517,7 @@ rb_reserve_next_event(struct trace_buffer *buffer,
struct rb_event_info info;
int nr_loops = 0;
int add_ts_default;
+   static int once;


(as you spotted, should be removed)

Thanks,

Mathieu

  
  	/* ring buffer does cmpxchg, make sure it is safe in NMI context */

if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) &&


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH] ring-buffer: Remove 32bit timestamp logic

2023-12-13 Thread Steven Rostedt
On Wed, 13 Dec 2023 21:11:26 -0500
Steven Rostedt  wrote:

> @@ -3720,6 +3517,7 @@ rb_reserve_next_event(struct trace_buffer *buffer,
>   struct rb_event_info info;
>   int nr_loops = 0;
>   int add_ts_default;
> + static int once;
>  
>   /* ring buffer does cmpxchg, make sure it is safe in NMI context */
>   if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) &&

Oops, some debug code got committed!

-- Steve



Re: [PATCH] ring-buffer: Remove 32bit timestamp logic

2023-12-13 Thread Steven Rostedt
Linus,

Looking for some advice on this.

tl;dr;  The ring-buffer timestamp requires a 64-bit cmpxchg to keep the
timestamps in sync (only in the slow paths). I was told that 64-bit cmpxchg
can be extremely slow on 32-bit architectures. So I created a rb_time_t
that on 64-bit was a normal local64_t type, and on 32-bit it's represented
by 3 32-bit words and a counter for synchronization. But this now requires
three 32-bit cmpxchgs for where one simple 64-bit cmpxchg would do.

Not having any 32-bit hardware to test on, I simply push through this
complex code for the 32-bit case. I tested it on both 32-bit (running on a
x86_64 machine) and 64-bit kernels and it seemed rather robust.

But now that I'm doing some heavy development on the ring buffer again, I
came across a few very subtle bugs in this code (and so has Mathieu Desnoyers).
We started discussing how much time this is actually saving to be worth the
complexity, and actually found some hardware to test. One Atom processor.

For the Atom it was 11.5ns for 32-bit and 16ns for 64-bit.

Granted, this isn't being contended on. But a 30% improvement doesn't
justify 3 to 1 cmpxchgs.
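
Put differently: at those numbers, the slow paths that need three 32-bit
cmpxchgs spend roughly 3 x 11.5ns = 34.5ns where a single 64-bit cmpxchg
would take about 16ns, so the "optimized" 32-bit scheme can end up costing
more than the operation it was meant to avoid.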

I plan on just nuking the whole thing (the patch below), which is basically
a revert of 10464b4aa605e ("ring-buffer: Add rb_time_t 64 bit operations for
speeding up 32 bit").

Now my question to you. Should I bother with pushing to you the subtle
fixes to this code and send you the revert for the next merge window, or do
you think I should just say "screw it" and nuke it now?

Or do you think it's worth keeping for some other architecture where 3
32-bit cmpxchgs are still faster than a single 64-bit one?

Thanks,

-- Steve



On Wed, 13 Dec 2023 21:11:26 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> Each event has a 27 bit timestamp delta that is used to hold the delta
> from the last event. If the time between events is greater than 2^27, then
> a timestamp is added that holds a 59 bit absolute timestamp.
> 
> Until a389d86f7fd09 ("ring-buffer: Have nested events still record running
> time stamp"), if an interrupt interrupted an event in progress, all the
> events delta would be zero to not deal with the races that need to be
> handled. The commit a389d86f7fd09 changed that to handle the races giving
> all events, even those that preempt other events, still have an accurate
> timestamp.
> 
> To handle those races requires performing 64-bit cmpxchg on the
> timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered
> very slow. To try to deal with this the timestamp logic was broken into
> two and then three 32-bit cmpxchgs, with the thought that two (or three)
> 32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit
> architectures.
> 
> Part of the problem with this is that I didn't have any 32-bit
> architectures to test on. After hitting several subtle bugs in this code,
> an effort was made to try and see if three 32-bit cmpxchgs are indeed
> faster than a single 64-bit. After a few people brushed off the dust of
> their old 32-bit machines, tests were done, and even though 32-bit cmpxchg
> was faster than a single 64-bit, it was in the order of 50% at best, not
> 300%.
> 
> This means that this complex code is not only complex but also not even
> faster than just using 64-bit cmpxchg.
> 
> Nuke it!
> 
> Signed-off-by: Steven Rostedt (Google) 
> ---
> [
>Should we just push this now and mark it for stable?
>That is, just get rid of this logic for all kernels.
> ]
>  kernel/trace/ring_buffer.c | 226 ++---
>  1 file changed, 12 insertions(+), 214 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 3fce5e4b4f2b..b05416e14cb6 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  
> +#include 
>  #include 
>  
>  /*
> @@ -463,27 +464,9 @@ enum {
>   RB_CTX_MAX
>  };
>  
> -#if BITS_PER_LONG == 32
> -#define RB_TIME_32
> -#endif
> -
> -/* To test on 64 bit machines */
> -//#define RB_TIME_32
> -
> -#ifdef RB_TIME_32
> -
> -struct rb_time_struct {
> - local_t cnt;
> - local_t top;
> - local_t bottom;
> - local_t msb;
> -};
> -#else
> -#include 
>  struct rb_time_struct {
>   local64_t   time;
>  };
> -#endif
>  typedef struct rb_time_struct rb_time_t;
>  
>  #define MAX_NEST 5
> @@ -573,183 +556,9 @@ struct ring_buffer_iter {
>   int missed_events;
>  };
>  
> -#ifdef RB_TIME_32
> -
> -/*
> - * On 32 bit machines, local64_t is very expensive. As the ring
> - * buffer doesn't need all the features of a true 64 bit atomic,
> - * on 32 bit, it uses these functions (64 still uses local64_t).
> - *
> - * For the ring buffer, 64 bit required operations for the time is
> - * the following:
> - *
> - *  - Reads may fail if it interrupted a modification of the time stamp.
> - *  

Re: [PATCH v2] vsock/virtio: Fix unsigned integer wrap around in virtio_transport_has_space()

2023-12-13 Thread patchwork-bot+netdevbpf
Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski :

On Mon, 11 Dec 2023 19:23:17 +0300 you wrote:
> We need to do signed arithmetic if we expect condition
> `if (bytes < 0)` to be possible
> 
> Found by Linux Verification Center (linuxtesting.org) with SVACE
> 
> Fixes: 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko")
> Signed-off-by: Nikolay Kuratov 
> 
> [...]

Here is the summary with links:
  - [v2] vsock/virtio: Fix unsigned integer wrap around in 
virtio_transport_has_space()
https://git.kernel.org/netdev/net/c/60316d7f10b1

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
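
The bug class fixed here is easy to reproduce in isolation. In the sketch
below (variable names are made up, not taken from the vsock code), the
unsigned subtraction wraps around instead of going negative, so a "< 0"
check can never fire; doing the arithmetic on signed values restores the
intended behaviour.

#include <stdio.h>

int main(void)
{
	unsigned int buf_alloc = 8, sent = 16, acked = 4;
	unsigned int u_bytes = buf_alloc - (sent - acked);
	int s_bytes = (int)buf_alloc - (int)(sent - acked);

	printf("unsigned: %u ('< 0' check is dead code)\n", u_bytes);
	printf("signed:   %d (negative, as the check expects)\n", s_bytes);
	return 0;
}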





[PATCH] ring-buffer: Remove 32bit timestamp logic

2023-12-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Each event has a 27 bit timestamp delta that is used to hold the delta
from the last event. If the time between events is greater than 2^27, then
a timestamp is added that holds a 59 bit absolute timestamp.

Until a389d86f7fd09 ("ring-buffer: Have nested events still record running
time stamp"), if an interrupt interrupted an event in progress, all the
events' deltas would be zero, to avoid dealing with the races that need to
be handled. The commit a389d86f7fd09 changed that to handle the races so
that all events, even those that preempt other events, still have an
accurate timestamp.

To handle those races requires performing 64-bit cmpxchg on the
timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered
very slow. To try to deal with this the timestamp logic was broken into
two and then three 32-bit cmpxchgs, with the thought that two (or three)
32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit
architectures.

Part of the problem with this is that I didn't have any 32-bit
architectures to test on. After hitting several subtle bugs in this code,
an effort was made to try and see if three 32-bit cmpxchgs are indeed
faster than a single 64-bit. After a few people brushed off the dust of
their old 32-bit machines, tests were done, and even though 32-bit cmpxchg
was faster than a single 64-bit, it was in the order of 50% at best, not
300%.

This means that this code is not only complex but also not even
faster than just using a 64-bit cmpxchg.

Nuke it!

Signed-off-by: Steven Rostedt (Google) 
---
[
   Should we just push this now and mark it for stable?
   That is, just get rid of this logic for all kernels.
]
 kernel/trace/ring_buffer.c | 226 ++---
 1 file changed, 12 insertions(+), 214 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 3fce5e4b4f2b..b05416e14cb6 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 
+#include 
 #include 
 
 /*
@@ -463,27 +464,9 @@ enum {
RB_CTX_MAX
 };
 
-#if BITS_PER_LONG == 32
-#define RB_TIME_32
-#endif
-
-/* To test on 64 bit machines */
-//#define RB_TIME_32
-
-#ifdef RB_TIME_32
-
-struct rb_time_struct {
-   local_t cnt;
-   local_t top;
-   local_t bottom;
-   local_t msb;
-};
-#else
-#include 
 struct rb_time_struct {
local64_t   time;
 };
-#endif
 typedef struct rb_time_struct rb_time_t;
 
 #define MAX_NEST   5
@@ -573,183 +556,9 @@ struct ring_buffer_iter {
int missed_events;
 };
 
-#ifdef RB_TIME_32
-
-/*
- * On 32 bit machines, local64_t is very expensive. As the ring
- * buffer doesn't need all the features of a true 64 bit atomic,
- * on 32 bit, it uses these functions (64 still uses local64_t).
- *
- * For the ring buffer, 64 bit required operations for the time is
- * the following:
- *
- *  - Reads may fail if it interrupted a modification of the time stamp.
- *  It will succeed if it did not interrupt another write even if
- *  the read itself is interrupted by a write.
- *  It returns whether it was successful or not.
- *
- *  - Writes always succeed and will overwrite other writes and writes
- *  that were done by events interrupting the current write.
- *
- *  - A write followed by a read of the same time stamp will always succeed,
- *  but may not contain the same value.
- *
- *  - A cmpxchg will fail if it interrupted another write or cmpxchg.
- *  Other than that, it acts like a normal cmpxchg.
- *
- * The 60 bit time stamp is broken up by 30 bits in a top and bottom half
- *  (bottom being the least significant 30 bits of the 60 bit time stamp).
- *
- * The two most significant bits of each half holds a 2 bit counter (0-3).
- * Each update will increment this counter by one.
- * When reading the top and bottom, if the two counter bits match then the
- *  top and bottom together make a valid 60 bit number.
- */
-#define RB_TIME_SHIFT  30
-#define RB_TIME_VAL_MASK ((1 << RB_TIME_SHIFT) - 1)
-#define RB_TIME_MSB_SHIFT   60
-
-static inline int rb_time_cnt(unsigned long val)
-{
-   return (val >> RB_TIME_SHIFT) & 3;
-}
-
-static inline u64 rb_time_val(unsigned long top, unsigned long bottom)
-{
-   u64 val;
-
-   val = top & RB_TIME_VAL_MASK;
-   val <<= RB_TIME_SHIFT;
-   val |= bottom & RB_TIME_VAL_MASK;
-
-   return val;
-}
-
-static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt)
-{
-   unsigned long top, bottom, msb;
-   unsigned long c;
-
-   /*
-* If the read is interrupted by a write, then the cnt will
-* be different. Loop until both top and bottom have been read
-* without interruption.
-*/
-   do {
-	c = local_read(&t->cnt);
-	top = local_read(&t->top);
-   bottom = 

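For readers who have not seen the rb_time code before, here is a standalone
sketch of the scheme the removed comment describes: a 60-bit value split into
two 30-bit halves, each carrying a 2-bit update counter, where a read is only
considered valid when both counters match. It is deliberately simplified (no
msb word, plain loads and stores instead of local_t atomics) and the values
are invented.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define RB_TIME_SHIFT		30
#define RB_TIME_VAL_MASK	((1UL << RB_TIME_SHIFT) - 1)

struct split_time {
	uint32_t top;		/* counter in bits 31:30, payload in bits 29:0 */
	uint32_t bottom;
};

static void split_write(struct split_time *t, uint64_t val, unsigned int cnt)
{
	cnt &= 3;
	t->top    = ((uint32_t)cnt << RB_TIME_SHIFT) |
		    (uint32_t)((val >> RB_TIME_SHIFT) & RB_TIME_VAL_MASK);
	t->bottom = ((uint32_t)cnt << RB_TIME_SHIFT) |
		    (uint32_t)(val & RB_TIME_VAL_MASK);
}

static bool split_read(const struct split_time *t, uint64_t *val)
{
	uint32_t top = t->top, bottom = t->bottom;

	/* Mismatched counters mean a writer was interrupted mid-update. */
	if ((top >> RB_TIME_SHIFT) != (bottom >> RB_TIME_SHIFT))
		return false;
	*val = ((uint64_t)(top & RB_TIME_VAL_MASK) << RB_TIME_SHIFT) |
	       (bottom & RB_TIME_VAL_MASK);
	return true;
}

int main(void)
{
	struct split_time t;
	uint64_t v;

	split_write(&t, 123456789012ULL, 1);
	if (split_read(&t, &v))
		printf("read back %llu\n", (unsigned long long)v);
	return 0;
}
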
[PATCH v3] tracing: Fix uaf issue when open the hist or hist_debug file

2023-12-13 Thread Zheng Yejian
KASAN reports the following issue. The root cause is that the 'hist'
file of an instance is opened and 'trace_event_file' is accessed in
hist_show(), but 'trace_event_file' has already been freed because the
instance was removed. The 'hist_debug' file has the same problem. To fix
it, call tracing_{open,release}_file_tr() in the file_operations
callbacks to take the ref count and avoid 'trace_event_file' being freed.

  BUG: KASAN: slab-use-after-free in hist_show+0x11e0/0x1278
  Read of size 8 at addr 242541e336b8 by task head/190

  CPU: 4 PID: 190 Comm: head Not tainted 6.7.0-rc5-g26aff849438c #133
  Hardware name: linux,dummy-virt (DT)
  Call trace:
   dump_backtrace+0x98/0xf8
   show_stack+0x1c/0x30
   dump_stack_lvl+0x44/0x58
   print_report+0xf0/0x5a0
   kasan_report+0x80/0xc0
   __asan_report_load8_noabort+0x1c/0x28
   hist_show+0x11e0/0x1278
   seq_read_iter+0x344/0xd78
   seq_read+0x128/0x1c0
   vfs_read+0x198/0x6c8
   ksys_read+0xf4/0x1e0
   __arm64_sys_read+0x70/0xa8
   invoke_syscall+0x70/0x260
   el0_svc_common.constprop.0+0xb0/0x280
   do_el0_svc+0x44/0x60
   el0_svc+0x34/0x68
   el0t_64_sync_handler+0xb8/0xc0
   el0t_64_sync+0x168/0x170

  Allocated by task 188:
   kasan_save_stack+0x28/0x50
   kasan_set_track+0x28/0x38
   kasan_save_alloc_info+0x20/0x30
   __kasan_slab_alloc+0x6c/0x80
   kmem_cache_alloc+0x15c/0x4a8
   trace_create_new_event+0x84/0x348
   __trace_add_new_event+0x18/0x88
   event_trace_add_tracer+0xc4/0x1a0
   trace_array_create_dir+0x6c/0x100
   trace_array_create+0x2e8/0x568
   instance_mkdir+0x48/0x80
   tracefs_syscall_mkdir+0x90/0xe8
   vfs_mkdir+0x3c4/0x610
   do_mkdirat+0x144/0x200
   __arm64_sys_mkdirat+0x8c/0xc0
   invoke_syscall+0x70/0x260
   el0_svc_common.constprop.0+0xb0/0x280
   do_el0_svc+0x44/0x60
   el0_svc+0x34/0x68
   el0t_64_sync_handler+0xb8/0xc0
   el0t_64_sync+0x168/0x170

  Freed by task 191:
   kasan_save_stack+0x28/0x50
   kasan_set_track+0x28/0x38
   kasan_save_free_info+0x34/0x58
   __kasan_slab_free+0xe4/0x158
   kmem_cache_free+0x19c/0x508
   event_file_put+0xa0/0x120
   remove_event_file_dir+0x180/0x320
   event_trace_del_tracer+0xb0/0x180
   __remove_instance+0x224/0x508
   instance_rmdir+0x44/0x78
   tracefs_syscall_rmdir+0xbc/0x140
   vfs_rmdir+0x1cc/0x4c8
   do_rmdir+0x220/0x2b8
   __arm64_sys_unlinkat+0xc0/0x100
   invoke_syscall+0x70/0x260
   el0_svc_common.constprop.0+0xb0/0x280
   do_el0_svc+0x44/0x60
   el0_svc+0x34/0x68
   el0t_64_sync_handler+0xb8/0xc0
   el0t_64_sync+0x168/0x170

Suggested-by: Steven Rostedt 
Signed-off-by: Zheng Yejian 
---
 kernel/trace/trace.c |  6 ++
 kernel/trace/trace.h |  1 +
 kernel/trace/trace_events_hist.c | 12 
 3 files changed, 15 insertions(+), 4 deletions(-)

v3:
  - As suggested by Steve, put the tracing_single_release_file_tr() into
trace.c as a non static function, put the prototype in trace.h.
Link: 
https://lore.kernel.org/all/20231213080127.6ef26...@gandalf.local.home/

v2:
  - Introduce tracing_single_release_file_tr() to add the missing call for
single_release() as suggested by Steve;
Link: 
https://lore.kernel.org/all/20231212113546.6a51d...@gandalf.local.home/
  - Slightly modify the commit message and comments.
  - Link: 
https://lore.kernel.org/all/20231213015138.281888-1-zhengyeji...@huawei.com/

v1:
  Link: 
https://lore.kernel.org/all/20231212113317.4159890-1-zhengyeji...@huawei.com/

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index fbcd3bafb93e..fc1b2ada98a5 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4964,6 +4964,12 @@ int tracing_release_file_tr(struct inode *inode, struct 
file *filp)
return 0;
 }
 
+int tracing_single_release_file_tr(struct inode *inode, struct file *filp)
+{
+   tracing_release_file_tr(inode, filp);
+   return single_release(inode, filp);
+}
+
 static int tracing_mark_open(struct inode *inode, struct file *filp)
 {
stream_open(inode, filp);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index b7f4ea25a194..0489e72c8169 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -617,6 +617,7 @@ int tracing_open_generic(struct inode *inode, struct file 
*filp);
 int tracing_open_generic_tr(struct inode *inode, struct file *filp);
 int tracing_open_file_tr(struct inode *inode, struct file *filp);
 int tracing_release_file_tr(struct inode *inode, struct file *filp);
+int tracing_single_release_file_tr(struct inode *inode, struct file *filp);
 bool tracing_is_disabled(void);
 bool tracer_tracing_is_on(struct trace_array *tr);
 void tracer_tracing_on(struct trace_array *tr);
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 1abc07fba1b9..5ecf3c8bde20 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -5623,10 +5623,12 @@ static int event_hist_open(struct inode *inode, struct 
file *file)
 {
int ret;
 
-   ret = security_locked_down(LOCKDOWN_TRACEFS);
+   ret = 

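The principle behind the fix can be modelled outside the kernel: the open()
callback takes a reference on the backing object and release() drops it, so
removing the instance while the file is open cannot free the object out from
under the reader. Below is a minimal userspace sketch of that idea, not the
tracefs code itself.

#include <stdio.h>

struct event_file {
	int refcount;
	int alive;
};

static void get_file(struct event_file *f)
{
	f->refcount++;
}

static void put_file(struct event_file *f)
{
	if (--f->refcount == 0) {
		f->alive = 0;			/* would be kfree()d here */
		printf("object freed\n");
	}
}

static void open_hist(struct event_file *f)    { get_file(f); }
static void release_hist(struct event_file *f) { put_file(f); }

int main(void)
{
	struct event_file f = { .refcount = 1, .alive = 1 };

	open_hist(&f);		/* a reader opens the 'hist' file        */
	put_file(&f);		/* the instance is removed: drop its ref */
	printf("alive while still open: %d\n", f.alive);	/* 1, no UAF */
	release_hist(&f);	/* reader closes: last ref, freed now    */
	return 0;
}
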
Re: [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler

2023-12-13 Thread Tim Chen
On Tue, 2023-12-12 at 15:27 +0100, Vincent Guittot wrote:
> Provide the scheduler with feedback about the temporary max available
> capacity. Unlike arch_update_thermal_pressure(), this doesn't need to be
> filtered as the pressure will last for dozens of ms or more.
> 
> Signed-off-by: Vincent Guittot 
> ---
>  drivers/cpufreq/cpufreq.c | 48 +++
>  include/linux/cpufreq.h   | 10 
>  2 files changed, 58 insertions(+)
> 
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 44db4f59c4cc..7d5f71be8d29 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -2563,6 +2563,50 @@ int cpufreq_get_policy(struct cpufreq_policy *policy, 
> unsigned int cpu)
>  }
>  EXPORT_SYMBOL(cpufreq_get_policy);
>  
> +DEFINE_PER_CPU(unsigned long, cpufreq_pressure);
> +EXPORT_PER_CPU_SYMBOL_GPL(cpufreq_pressure);
> +
> +/**
> + * cpufreq_update_pressure() - Update cpufreq pressure for CPUs
> + * @cpus: The related CPUs for which max capacity has been reduced
> + * @capped_freq : The maximum allowed frequency that CPUs can run at
> + *
> + * Update the value of cpufreq pressure for all @cpus in the mask. The
> + * cpumask should include all (online+offline) affected CPUs, to avoid
> + * operating on stale data when hot-plug is used for some CPUs. The
> + * @capped_freq reflects the currently allowed max CPUs frequency due to
> + * freq_qos capping. It might be also a boost frequency value, which is 
> bigger
> + * than the internal 'capacity_freq_ref' max frequency. In such case the
> + * pressure value should simply be removed, since this is an indication that
> + * there is no capping. The @capped_freq must be provided in kHz.
> + */
> +static void cpufreq_update_pressure(const struct cpumask *cpus,
> +   unsigned long capped_freq)
> +{
> + unsigned long max_capacity, capacity, pressure;
> + u32 max_freq;
> + int cpu;
> +
> + cpu = cpumask_first(cpus);
> + max_capacity = arch_scale_cpu_capacity(cpu);
> + max_freq = arch_scale_freq_ref(cpu);
> +
> + /*
> +  * Handle properly the boost frequencies, which should simply clean
> +  * the thermal pressure value.
> +  */
> + if (max_freq <= capped_freq)
> + capacity = max_capacity;
> + else
> + capacity = mult_frac(max_capacity, capped_freq, max_freq);
> +
> + pressure = max_capacity - capacity;
> +
> +
> + for_each_cpu(cpu, cpus)
> + WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);

Seems like the pressure value computed from the first CPU applies to all CPUs.
Will this be valid for non-homogeneous CPUs that could have different
max_freq and max_capacity?

Tim
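
As a standalone restatement of the capping arithmetic quoted above (same
rounding as the kernel's mult_frac(), but with invented numbers):

#include <stdio.h>

#define mult_frac(x, n, d)	((x) / (d) * (n) + ((x) % (d)) * (n) / (d))

int main(void)
{
	unsigned long max_capacity = 1024;	/* arch_scale_cpu_capacity() */
	unsigned long max_freq = 2000000;	/* arch_scale_freq_ref(), kHz */
	unsigned long capped_freq = 1500000;	/* freq_qos cap, kHz */
	unsigned long capacity, pressure;

	if (max_freq <= capped_freq)
		capacity = max_capacity;	/* boost case: no pressure */
	else
		capacity = mult_frac(max_capacity, capped_freq, max_freq);

	pressure = max_capacity - capacity;
	printf("capacity=%lu pressure=%lu\n", capacity, pressure);	/* 768, 256 */
	return 0;
}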



Re: [PATCH] driver core: Add a guard() definition for the device_lock()

2023-12-13 Thread Verma, Vishal L
On Wed, 2023-12-13 at 15:02 -0800, Dan Williams wrote:
> At present there are ~200 usages of device_lock() in the kernel. Some of
> those usages lead to "goto unlock;" patterns which have proven to be
> error prone. Define a "device" guard() definition to allow for those to

"Define a definition" sounds a bit awkward, perhaps  "Add a .."?

> be cleaned up and prevent new ones from appearing.
> 
> Link: 
> http://lore.kernel.org/r/657897453dda8_269bd29...@dwillia2-mobl3.amr.corp.intel.com.notmuch
> Link: 
> http://lore.kernel.org/r/6577b0c2a02df_a04c529...@dwillia2-xfh.jf.intel.com.notmuch
> Cc: Vishal Verma 
> Cc: Ira Weiny 
> Cc: Peter Zijlstra 
> Cc: Greg Kroah-Hartman 
> Cc: Andrew Morton 
> Signed-off-by: Dan Williams 

Other than that, looks good,

Reviewed-by: Vishal Verma 

> ---
> Hi Greg,
> 
> I wonder if you might include this change in v6.7-rc to ease some patch
> sets alternately going through my tree and Andrew's tree. Those
> discussions are linked above. Alternately I can just take it through
> my tree with your ack and the other use case can circle back to it in
> the v6.9 cycle.
> 
> I considered also defining a __free() helper similar to __free(mutex),
> but I think "__free(device)" would be a surprising name for something
> that drops a lock. Also, I like the syntax of guard(device) over
> something like guard(device_lock) since a 'struct device *' is the
> argument, not a lock type, but I'm open to your or Peter's thoughts on
> the naming.
> 
>  include/linux/device.h |    2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/linux/device.h b/include/linux/device.h
> index d7a72a8749ea..6c83294395ac 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -1007,6 +1007,8 @@ static inline void device_unlock(struct device *dev)
> 	mutex_unlock(&dev->mutex);
>  }
>  
> +DEFINE_GUARD(device, struct device *, device_lock(_T), device_unlock(_T))
> +
>  static inline void device_lock_assert(struct device *dev)
>  {
> 	lockdep_assert_held(&dev->mutex);
> 


Re: [PATCH] driver core: Add a guard() definition for the device_lock()

2023-12-13 Thread Ira Weiny
Dan Williams wrote:
> At present there are ~200 usages of device_lock() in the kernel. Some of
> those usages lead to "goto unlock;" patterns which have proven to be
> error prone. Define a "device" guard() definition to allow for those to
> be cleaned up and prevent new ones from appearing.
> 
> Link: 
> http://lore.kernel.org/r/657897453dda8_269bd29...@dwillia2-mobl3.amr.corp.intel.com.notmuch
> Link: 
> http://lore.kernel.org/r/6577b0c2a02df_a04c529...@dwillia2-xfh.jf.intel.com.notmuch
> Cc: Vishal Verma 
> Cc: Ira Weiny 

Reviewed-by: Ira Weiny 

> Cc: Peter Zijlstra 
> Cc: Greg Kroah-Hartman 
> Cc: Andrew Morton 
> Signed-off-by: Dan Williams 
> ---
> Hi Greg,
> 
> I wonder if you might include this change in v6.7-rc to ease some patch
> sets alternately going through my tree and Andrew's tree. Those
> discussions are linked above. Alternately I can just take it through
> my tree with your ack and the other use case can circle back to it in
> the v6.9 cycle.
> 
> I considered also defining a __free() helper similar to __free(mutex),
> but I think "__free(device)" would be a surprising name for something
> that drops a lock. Also, I like the syntax of guard(device) over
> something like guard(device_lock) since a 'struct device *' is the
> argument, not a lock type, but I'm open to your or Peter's thoughts on
> the naming.
> 
>  include/linux/device.h |2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/linux/device.h b/include/linux/device.h
> index d7a72a8749ea..6c83294395ac 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -1007,6 +1007,8 @@ static inline void device_unlock(struct device *dev)
> 	mutex_unlock(&dev->mutex);
>  }
>  
> +DEFINE_GUARD(device, struct device *, device_lock(_T), device_unlock(_T))
> +
>  static inline void device_lock_assert(struct device *dev)
>  {
> 	lockdep_assert_held(&dev->mutex);
> 
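
For readers unfamiliar with the cleanup helpers, the sketch below re-creates
the effect of guard(device)(dev) in plain userspace C using the compiler's
cleanup attribute. It is only an approximation of what linux/cleanup.h
generates, meant to show why the "goto unlock;" pattern goes away.

#include <stdio.h>

struct device { int locked; };

static void device_lock(struct device *dev)   { dev->locked = 1; puts("lock"); }
static void device_unlock(struct device *dev) { dev->locked = 0; puts("unlock"); }

/* Userspace stand-in for DEFINE_GUARD()/guard(). */
struct device_guard { struct device *dev; };

static struct device_guard device_guard_init(struct device *dev)
{
	device_lock(dev);
	return (struct device_guard){ .dev = dev };
}

static void device_guard_release(struct device_guard *g)
{
	device_unlock(g->dev);
}

#define guard_device(dev) \
	struct device_guard __attribute__((cleanup(device_guard_release))) \
		_guard = device_guard_init(dev)

static int do_work(struct device *dev, int fail)
{
	guard_device(dev);	/* device_unlock() runs on every return path */

	if (fail)
		return -1;	/* early return, no "goto unlock;" needed */
	return 0;
}

int main(void)
{
	struct device dev = { 0 };

	do_work(&dev, 1);
	do_work(&dev, 0);
	return 0;
}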





[PATCH v3 34/34] kmsan: Enable on s390

2023-12-13 Thread Ilya Leoshkevich
Now that everything else is in place, enable KMSAN in Kconfig.

Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 3bec98d20283..160ad2220c53 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -153,6 +153,7 @@ config S390
select HAVE_ARCH_KASAN
select HAVE_ARCH_KASAN_VMALLOC
select HAVE_ARCH_KCSAN
+   select HAVE_ARCH_KMSAN
select HAVE_ARCH_KFENCE
select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
select HAVE_ARCH_SECCOMP_FILTER
-- 
2.43.0




[PATCH v3 20/34] s390: Use a larger stack for KMSAN

2023-12-13 Thread Ilya Leoshkevich
Adjust the stack size for the KMSAN-enabled kernel like it was done
for the KASAN-enabled one in commit 7fef92ccadd7 ("s390/kasan: double
the stack size"). Both tools have similar requirements.

Reviewed-by: Alexander Gordeev 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/Makefile  | 2 +-
 arch/s390/include/asm/thread_info.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/s390/Makefile b/arch/s390/Makefile
index 73873e451686..a7f5386d25ad 100644
--- a/arch/s390/Makefile
+++ b/arch/s390/Makefile
@@ -34,7 +34,7 @@ KBUILD_CFLAGS_DECOMPRESSOR += $(if 
$(CONFIG_DEBUG_INFO_DWARF4), $(call cc-option
 KBUILD_CFLAGS_DECOMPRESSOR += $(if 
$(CONFIG_CC_NO_ARRAY_BOUNDS),-Wno-array-bounds)
 
 UTS_MACHINE:= s390x
-STACK_SIZE := $(if $(CONFIG_KASAN),65536,16384)
+STACK_SIZE := $(if $(CONFIG_KASAN),65536,$(if $(CONFIG_KMSAN),65536,16384))
 CHECKFLAGS += -D__s390__ -D__s390x__
 
 export LD_BFD
diff --git a/arch/s390/include/asm/thread_info.h 
b/arch/s390/include/asm/thread_info.h
index a674c7d25da5..d02a709717b8 100644
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -16,7 +16,7 @@
 /*
  * General size of kernel stacks
  */
-#ifdef CONFIG_KASAN
+#if defined(CONFIG_KASAN) || defined(CONFIG_KMSAN)
 #define THREAD_SIZE_ORDER 4
 #else
 #define THREAD_SIZE_ORDER 2
-- 
2.43.0




[PATCH v3 05/34] kmsan: Fix is_bad_asm_addr() on arches with overlapping address spaces

2023-12-13 Thread Ilya Leoshkevich
Comparing pointers with TASK_SIZE does not make sense when kernel and
userspace overlap. Skip the comparison when this is the case.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/instrumentation.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c
index 470b0b4afcc4..8a1bbbc723ab 100644
--- a/mm/kmsan/instrumentation.c
+++ b/mm/kmsan/instrumentation.c
@@ -20,7 +20,8 @@
 
 static inline bool is_bad_asm_addr(void *addr, uintptr_t size, bool is_store)
 {
-   if ((u64)addr < TASK_SIZE)
+   if (IS_ENABLED(CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE) &&
+   (u64)addr < TASK_SIZE)
return true;
if (!kmsan_get_metadata(addr, KMSAN_META_SHADOW))
return true;
-- 
2.43.0




[PATCH v3 21/34] s390/boot: Add the KMSAN runtime stub

2023-12-13 Thread Ilya Leoshkevich
It should be possible to have inline functions in the s390 header
files that call kmsan_unpoison_memory(). The problem is that these
header files might be included by the decompressor, which does not
contain the KMSAN runtime, causing linker errors.

Not compiling these calls if __SANITIZE_MEMORY__ is not defined -
either by changing kmsan-checks.h or at the call sites - may cause
unintended side effects, since calling these functions from
uninstrumented code that is linked into the kernel is a valid use case.

One might want to explicitly distinguish between the kernel and the
decompressor. Checking for a decompressor-specific #define is quite
heavy-handed, and will have to be done at all call sites.

A more generic approach is to provide a dummy kmsan_unpoison_memory()
definition. This produces some runtime overhead, but only when building
with CONFIG_KMSAN. The benefit is that it does not disturb the existing
KMSAN build logic and call sites don't need to be changed.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/boot/Makefile | 1 +
 arch/s390/boot/kmsan.c  | 6 ++
 2 files changed, 7 insertions(+)
 create mode 100644 arch/s390/boot/kmsan.c

diff --git a/arch/s390/boot/Makefile b/arch/s390/boot/Makefile
index fb10fcd21221..096216a72e98 100644
--- a/arch/s390/boot/Makefile
+++ b/arch/s390/boot/Makefile
@@ -44,6 +44,7 @@ obj-$(findstring y, $(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) 
$(CONFIG_PGSTE)) +=
 obj-$(CONFIG_RANDOMIZE_BASE)   += kaslr.o
 obj-y  += $(if $(CONFIG_KERNEL_UNCOMPRESSED),,decompressor.o) info.o
 obj-$(CONFIG_KERNEL_ZSTD) += clz_ctz.o
+obj-$(CONFIG_KMSAN) += kmsan.o
 obj-all := $(obj-y) piggy.o syms.o
 
 targets:= bzImage section_cmp.boot.data 
section_cmp.boot.preserved.data $(obj-y)
diff --git a/arch/s390/boot/kmsan.c b/arch/s390/boot/kmsan.c
new file mode 100644
index ..e7b3ac48143e
--- /dev/null
+++ b/arch/s390/boot/kmsan.c
@@ -0,0 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+
+void kmsan_unpoison_memory(const void *address, size_t size)
+{
+}
-- 
2.43.0




[PATCH v3 28/34] s390/mm: Define KMSAN metadata for vmalloc and modules

2023-12-13 Thread Ilya Leoshkevich
The pages for the KMSAN metadata associated with most kernel mappings
are taken from memblock by the common code. However, vmalloc and module
metadata needs to be defined by the architectures.

Be a little bit more careful than x86: allocate exactly MODULES_LEN
for the module shadow and origins, and then take 2/3 of vmalloc for
the vmalloc shadow and origins. This ensures that users passing small
vmalloc= values on the command line do not cause module metadata
collisions.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/boot/startup.c|  8 
 arch/s390/include/asm/pgtable.h | 10 ++
 2 files changed, 18 insertions(+)

diff --git a/arch/s390/boot/startup.c b/arch/s390/boot/startup.c
index 8104e0e3d188..e37e7ffda430 100644
--- a/arch/s390/boot/startup.c
+++ b/arch/s390/boot/startup.c
@@ -253,9 +253,17 @@ static unsigned long setup_kernel_memory_layout(void)
MODULES_END = round_down(__abs_lowcore, _SEGMENT_SIZE);
MODULES_VADDR = MODULES_END - MODULES_LEN;
VMALLOC_END = MODULES_VADDR;
+#ifdef CONFIG_KMSAN
+   VMALLOC_END -= MODULES_LEN * 2;
+#endif
 
/* allow vmalloc area to occupy up to about 1/2 of the rest virtual 
space left */
vmalloc_size = min(vmalloc_size, round_down(VMALLOC_END / 2, 
_REGION3_SIZE));
+#ifdef CONFIG_KMSAN
+   /* take 2/3 of vmalloc area for KMSAN shadow and origins */
+   vmalloc_size = round_down(vmalloc_size / 3, _REGION3_SIZE);
+   VMALLOC_END -= vmalloc_size * 2;
+#endif
VMALLOC_START = VMALLOC_END - vmalloc_size;
 
/* split remaining virtual space between 1:1 mapping & vmemmap array */
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 601e87fa8a9a..d764abeb9e6d 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -107,6 +107,16 @@ static inline int is_module_addr(void *addr)
return 1;
 }
 
+#ifdef CONFIG_KMSAN
+#define KMSAN_VMALLOC_SIZE (VMALLOC_END - VMALLOC_START)
+#define KMSAN_VMALLOC_SHADOW_START VMALLOC_END
+#define KMSAN_VMALLOC_ORIGIN_START (KMSAN_VMALLOC_SHADOW_START + \
+   KMSAN_VMALLOC_SIZE)
+#define KMSAN_MODULES_SHADOW_START (KMSAN_VMALLOC_ORIGIN_START + \
+   KMSAN_VMALLOC_SIZE)
+#define KMSAN_MODULES_ORIGIN_START (KMSAN_MODULES_SHADOW_START + MODULES_LEN)
+#endif
+
 /*
  * A 64 bit pagetable entry of S390 has following format:
  * |PFRA |0IPC|  OS  |
-- 
2.43.0
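
A back-of-the-envelope model of the layout set up above: carve the module
shadow and origin areas (MODULES_LEN each) out of the vmalloc area, then give
2/3 of what remains to the vmalloc shadow and origins. The absolute numbers
below are invented and the _REGION3_SIZE rounding is omitted; only the
proportions follow the patch.

#include <stdio.h>

int main(void)
{
	unsigned long modules_len = 1UL << 31;		/* example: 2 GiB */
	unsigned long vmalloc_end = 1UL << 40;		/* example */
	unsigned long vmalloc_size = 1UL << 39;		/* requested size */

	vmalloc_end -= modules_len * 2;		/* module shadow + origins */
	vmalloc_size = vmalloc_size / 3;	/* 1/3 stays usable */
	vmalloc_end -= vmalloc_size * 2;	/* vmalloc shadow + origins */

	printf("usable vmalloc: %lu MiB, ends at %#lx\n",
	       vmalloc_size >> 20, vmalloc_end);
	return 0;
}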




[PATCH v3 13/34] kmsan: Use ALIGN_DOWN() in kmsan_get_metadata()

2023-12-13 Thread Ilya Leoshkevich
Improve the readability by replacing the custom aligning logic with
ALIGN_DOWN(). Unlike other places where a similar sequence is used,
there is no size parameter that needs to be adjusted, so the standard
macro fits.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/shadow.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c
index 2d57408c78ae..9c58f081d84f 100644
--- a/mm/kmsan/shadow.c
+++ b/mm/kmsan/shadow.c
@@ -123,14 +123,12 @@ struct shadow_origin_ptr kmsan_get_shadow_origin_ptr(void 
*address, u64 size,
  */
 void *kmsan_get_metadata(void *address, bool is_origin)
 {
-   u64 addr = (u64)address, pad, off;
+   u64 addr = (u64)address, off;
struct page *page;
void *ret;
 
-   if (is_origin && !IS_ALIGNED(addr, KMSAN_ORIGIN_SIZE)) {
-   pad = addr % KMSAN_ORIGIN_SIZE;
-   addr -= pad;
-   }
+   if (is_origin)
+   addr = ALIGN_DOWN(addr, KMSAN_ORIGIN_SIZE);
address = (void *)addr;
if (kmsan_internal_is_vmalloc_addr(address) ||
kmsan_internal_is_module_addr(address))
-- 
2.43.0
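
A one-screen illustration of what the replacement does, with a userspace
restatement of ALIGN_DOWN() for power-of-two alignments and the usual 4-byte
origin granule assumed for KMSAN_ORIGIN_SIZE:

#include <stdint.h>
#include <stdio.h>

#define ALIGN_DOWN(x, a)	((x) & ~((uintptr_t)(a) - 1))
#define KMSAN_ORIGIN_SIZE	4

int main(void)
{
	uintptr_t addr = 0x1003;

	/* 0x1003 is rounded down to the origin granule at 0x1000. */
	printf("%#lx -> %#lx\n", (unsigned long)addr,
	       (unsigned long)ALIGN_DOWN(addr, KMSAN_ORIGIN_SIZE));
	return 0;
}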




[PATCH v3 31/34] s390/uaccess: Add KMSAN support to put_user() and get_user()

2023-12-13 Thread Ilya Leoshkevich
put_user() uses inline assembly with precise constraints, so Clang is
in principle capable of instrumenting it automatically. Unfortunately,
one of the constraints contains a dereferenced user pointer, and Clang
does not currently distinguish user and kernel pointers. Therefore
KMSAN attempts to access shadow for user pointers, which is not the right
thing to do.

An obvious fix to add __no_sanitize_memory to __put_user_fn() does not
work, since it's __always_inline. And __always_inline cannot be removed
due to the __put_user_bad() trick.

A different obvious fix of using the "a" instead of the "+Q" constraint
degrades the code quality, which is very important here, since it's a
hot path.

Instead, repurpose the __put_user_asm() macro to define
__put_user_{char,short,int,long}_noinstr() functions and mark them with
__no_sanitize_memory. For the non-KMSAN builds make them
__always_inline in order to keep the generated code quality. Also
define __put_user_{char,short,int,long}() functions, which call the
aforementioned ones and which *are* instrumented, because they call
KMSAN hooks, which may be implemented as macros.

The same applies to get_user() as well.

Acked-by: Heiko Carstens 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/include/asm/uaccess.h | 111 +++-
 1 file changed, 79 insertions(+), 32 deletions(-)

diff --git a/arch/s390/include/asm/uaccess.h b/arch/s390/include/asm/uaccess.h
index 81ae8a98e7ec..c3c26dd1fc04 100644
--- a/arch/s390/include/asm/uaccess.h
+++ b/arch/s390/include/asm/uaccess.h
@@ -78,13 +78,24 @@ union oac {
 
 int __noreturn __put_user_bad(void);
 
-#define __put_user_asm(to, from, size) \
-({ \
+#ifdef CONFIG_KMSAN
+#define GET_PUT_USER_NOINSTR_ATTRIBUTES \
+   noinline __maybe_unused __no_sanitize_memory
+#else
+#define GET_PUT_USER_NOINSTR_ATTRIBUTES __always_inline
+#endif
+
+#define DEFINE_PUT_USER(type)  \
+static GET_PUT_USER_NOINSTR_ATTRIBUTES int \
+__put_user_##type##_noinstr(unsigned type __user *to,  \
+   unsigned type *from,\
+   unsigned long size) \
+{  \
union oac __oac_spec = {\
.oac1.as = PSW_BITS_AS_SECONDARY,   \
.oac1.a = 1,\
};  \
-   int __rc;   \
+   int rc; \
\
asm volatile(   \
"   lr  0,%[spec]\n"\
@@ -93,12 +104,28 @@ int __noreturn __put_user_bad(void);
"2:\n"  \
EX_TABLE_UA_STORE(0b, 2b, %[rc])\
EX_TABLE_UA_STORE(1b, 2b, %[rc])\
-	: [rc] "=&d" (__rc), [_to] "+Q" (*(to))	\
+	: [rc] "=&d" (rc), [_to] "+Q" (*(to))	\
: [_size] "d" (size), [_from] "Q" (*(from)),\
  [spec] "d" (__oac_spec.val)   \
: "cc", "0");   \
-   __rc;   \
-})
+   return rc;  \
+}  \
+   \
+static __always_inline int \
+__put_user_##type(unsigned type __user *to, unsigned type *from,   \
+ unsigned long size)   \
+{  \
+   int rc; \
+   \
+   rc = __put_user_##type##_noinstr(to, from, size);   \
+   instrument_put_user(*from, to, size);   \
+   return rc;  \
+}
+
+DEFINE_PUT_USER(char);
+DEFINE_PUT_USER(short);
+DEFINE_PUT_USER(int);
+DEFINE_PUT_USER(long);
 
 static __always_inline int __put_user_fn(void *x, void __user *ptr, unsigned 
long size)
 {
@@ -106,24 +133,24 @@ static __always_inline int __put_user_fn(void *x, void 
__user *ptr, unsigned lon
 
 
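A userspace sketch of the wrapper pattern described above: an inner helper
standing in for the uninstrumented (__no_sanitize_memory) asm-based copy,
plus an instrumented outer function that reports the store to the sanitizer.
The instrument_put_user() stand-in below just prints; it is not the real hook.

#include <stdio.h>
#include <string.h>

static void instrument_put_user(const void *from, void *to, unsigned long size)
{
	printf("tell KMSAN that %lu bytes at %p are now initialized\n",
	       size, to);
	(void)from;
}

/* Would be noinline + __no_sanitize_memory in a KMSAN build. */
static int __put_user_int_noinstr(unsigned int *to, unsigned int *from,
				  unsigned long size)
{
	memcpy(to, from, size);		/* stands in for the MVCOS inline asm */
	return 0;
}

static inline int __put_user_int(unsigned int *to, unsigned int *from,
				 unsigned long size)
{
	int rc = __put_user_int_noinstr(to, from, size);

	instrument_put_user(from, to, size);
	return rc;
}

int main(void)
{
	unsigned int src = 42, dst = 0;

	__put_user_int(&dst, &src, sizeof(src));
	printf("dst = %u\n", dst);
	return 0;
}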

[PATCH v3 33/34] s390: Implement the architecture-specific kmsan functions

2023-12-13 Thread Ilya Leoshkevich
arch_kmsan_get_meta_or_null() finds the lowcore shadow by querying the
prefix and calling kmsan_get_metadata() again.

kmsan_virt_addr_valid() delegates to virt_addr_valid().

Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/include/asm/kmsan.h | 43 +++
 1 file changed, 43 insertions(+)
 create mode 100644 arch/s390/include/asm/kmsan.h

diff --git a/arch/s390/include/asm/kmsan.h b/arch/s390/include/asm/kmsan.h
new file mode 100644
index ..e572686d340c
--- /dev/null
+++ b/arch/s390/include/asm/kmsan.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_S390_KMSAN_H
+#define _ASM_S390_KMSAN_H
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifndef MODULE
+
+static inline bool is_lowcore_addr(void *addr)
+{
+	return addr >= (void *)&S390_lowcore &&
+	       addr < (void *)(&S390_lowcore + 1);
+}
+
+static inline void *arch_kmsan_get_meta_or_null(void *addr, bool is_origin)
+{
+   if (is_lowcore_addr(addr)) {
+   /*
+* Different lowcores accessed via S390_lowcore are described
+* by the same struct page. Resolve the prefix manually in
+* order to get a distinct struct page.
+*/
+		addr += (void *)lowcore_ptr[raw_smp_processor_id()] -
+			(void *)&S390_lowcore;
+   if (WARN_ON_ONCE(is_lowcore_addr(addr)))
+   return NULL;
+   return kmsan_get_metadata(addr, is_origin);
+   }
+   return NULL;
+}
+
+static inline bool kmsan_virt_addr_valid(void *addr)
+{
+   return virt_addr_valid(addr);
+}
+
+#endif /* !MODULE */
+
+#endif /* _ASM_S390_KMSAN_H */
-- 
2.43.0
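
A toy model of the prefixing rule this header handles: every CPU sees its
own lowcore at "virtual address 0", while all lowcores are also reachable
through lowcore_ptr[]. The sizes and names below are illustrative, not the
real s390 layout.

#include <stdint.h>
#include <stdio.h>

#define NR_CPUS		2
#define LOWCORE_SIZE	0x2000

static unsigned char lowcore[NR_CPUS][LOWCORE_SIZE];
static void *lowcore_ptr[NR_CPUS] = { lowcore[0], lowcore[1] };
static int current_cpu = 1;

/*
 * Resolve an access through "address 0" to the current CPU's real lowcore,
 * mirroring what arch_kmsan_get_meta_or_null() does before calling the
 * generic kmsan_get_metadata() on the resolved address.
 */
static void *resolve_lowcore(uintptr_t addr)
{
	if (addr < LOWCORE_SIZE)
		return (unsigned char *)lowcore_ptr[current_cpu] + addr;
	return (void *)addr;
}

int main(void)
{
	void *p = resolve_lowcore(0x10);

	printf("prefixed 0x10 -> %p (cpu %d lowcore at %p)\n",
	       p, current_cpu, lowcore_ptr[current_cpu]);
	return 0;
}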




[PATCH v3 27/34] s390/irqflags: Do not instrument arch_local_irq_*() with KMSAN

2023-12-13 Thread Ilya Leoshkevich
KMSAN generates the following false positives on s390x:

[6.063666] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled())
[ ...]
[6.577050] Call Trace:
[6.619637]  [<0690d2de>] check_flags+0x1fe/0x210
[6.665411] ([<0690d2da>] check_flags+0x1fa/0x210)
[6.707478]  [<006cec1a>] lock_acquire+0x2ca/0xce0
[6.749959]  [<069820ea>] _raw_spin_lock_irqsave+0xea/0x190
[6.794912]  [<041fc988>] __stack_depot_save+0x218/0x5b0
[6.838420]  [<0197affe>] __msan_poison_alloca+0xfe/0x1a0
[6.882985]  [<07c5827c>] start_kernel+0x70c/0xd50
[6.927454]  [<00100036>] startup_continue+0x36/0x40

Between trace_hardirqs_on() and `stosm __mask, 3` lockdep thinks that
interrupts are on, but on the CPU they are still off. KMSAN
instrumentation takes spinlocks, giving lockdep a chance to see and
complain about this discrepancy.

KMSAN instrumentation is inserted in order to poison the __mask
variable. Disable instrumentation in the respective functions. They are
very small and it's easy to see that no important metadata updates are
lost because of this.

Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/include/asm/irqflags.h | 18 +++---
 drivers/s390/char/sclp.c |  2 +-
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/irqflags.h b/arch/s390/include/asm/irqflags.h
index 02427b205c11..7353a88b2ae2 100644
--- a/arch/s390/include/asm/irqflags.h
+++ b/arch/s390/include/asm/irqflags.h
@@ -37,12 +37,19 @@ static __always_inline void __arch_local_irq_ssm(unsigned 
long flags)
asm volatile("ssm   %0" : : "Q" (flags) : "memory");
 }
 
-static __always_inline unsigned long arch_local_save_flags(void)
+#ifdef CONFIG_KMSAN
+#define ARCH_LOCAL_IRQ_ATTRIBUTES \
+   noinline notrace __no_sanitize_memory __maybe_unused
+#else
+#define ARCH_LOCAL_IRQ_ATTRIBUTES __always_inline
+#endif
+
+static ARCH_LOCAL_IRQ_ATTRIBUTES unsigned long arch_local_save_flags(void)
 {
return __arch_local_irq_stnsm(0xff);
 }
 
-static __always_inline unsigned long arch_local_irq_save(void)
+static ARCH_LOCAL_IRQ_ATTRIBUTES unsigned long arch_local_irq_save(void)
 {
return __arch_local_irq_stnsm(0xfc);
 }
@@ -52,7 +59,12 @@ static __always_inline void arch_local_irq_disable(void)
arch_local_irq_save();
 }
 
-static __always_inline void arch_local_irq_enable(void)
+static ARCH_LOCAL_IRQ_ATTRIBUTES void arch_local_irq_enable_external(void)
+{
+   __arch_local_irq_stosm(0x01);
+}
+
+static ARCH_LOCAL_IRQ_ATTRIBUTES void arch_local_irq_enable(void)
 {
__arch_local_irq_stosm(0x03);
 }
diff --git a/drivers/s390/char/sclp.c b/drivers/s390/char/sclp.c
index d53ee34d398f..fb1d9949adca 100644
--- a/drivers/s390/char/sclp.c
+++ b/drivers/s390/char/sclp.c
@@ -736,7 +736,7 @@ sclp_sync_wait(void)
cr0_sync.val = cr0.val & ~CR0_IRQ_SUBCLASS_MASK;
cr0_sync.val |= 1UL << (63 - 54);
local_ctl_load(0, _sync);
-   __arch_local_irq_stosm(0x01);
+   arch_local_irq_enable_external();
/* Loop until driver state indicates finished request */
while (sclp_running_state != sclp_running_state_idle) {
/* Check for expired request timer */
-- 
2.43.0




[PATCH v3 32/34] s390/unwind: Disable KMSAN checks

2023-12-13 Thread Ilya Leoshkevich
The unwind code can read uninitialized frames. Furthermore, even in
the good case, KMSAN does not emit shadow for backchains. Therefore
disable it for the unwinding functions.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/kernel/unwind_bc.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/s390/kernel/unwind_bc.c b/arch/s390/kernel/unwind_bc.c
index 0ece156fdd7c..cd44be2b6ce8 100644
--- a/arch/s390/kernel/unwind_bc.c
+++ b/arch/s390/kernel/unwind_bc.c
@@ -49,6 +49,8 @@ static inline bool is_final_pt_regs(struct unwind_state 
*state,
   READ_ONCE_NOCHECK(regs->psw.mask) & PSW_MASK_PSTATE;
 }
 
+/* Avoid KMSAN false positives from touching uninitialized frames. */
+__no_kmsan_checks
 bool unwind_next_frame(struct unwind_state *state)
 {
	struct stack_info *info = &state->stack_info;
@@ -118,6 +120,8 @@ bool unwind_next_frame(struct unwind_state *state)
 }
 EXPORT_SYMBOL_GPL(unwind_next_frame);
 
+/* Avoid KMSAN false positives from touching uninitialized frames. */
+__no_kmsan_checks
 void __unwind_start(struct unwind_state *state, struct task_struct *task,
struct pt_regs *regs, unsigned long first_frame)
 {
-- 
2.43.0




[PATCH v3 12/34] kmsan: Support SLAB_POISON

2023-12-13 Thread Ilya Leoshkevich
Avoid false KMSAN negatives with SLUB_DEBUG by allowing
kmsan_slab_free() to poison the freed memory, and by preventing
init_object() from unpoisoning new allocations by using __memset().

There are two alternatives to this approach. First, init_object()
can be marked with __no_sanitize_memory. This annotation should be used
with great care, because it drops all instrumentation from the
function, and any shadow writes will be lost. Even though this is not a
concern with the current init_object() implementation, this may change
in the future.

Second, kmsan_poison_memory() calls may be added after memset() calls.
The downside is that init_object() is called from
free_debug_processing(), in which case poisoning will erase the
distinction between simply uninitialized memory and UAF.

Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/hooks.c |  2 +-
 mm/slub.c| 13 +
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
index 3acf010c9814..21004240 100644
--- a/mm/kmsan/hooks.c
+++ b/mm/kmsan/hooks.c
@@ -74,7 +74,7 @@ void kmsan_slab_free(struct kmem_cache *s, void *object)
return;
 
/* RCU slabs could be legally used after free within the RCU period */
-   if (unlikely(s->flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)))
+   if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU))
return;
/*
 * If there's a constructor, freed memory must remain in the same state
diff --git a/mm/slub.c b/mm/slub.c
index 63d281dfacdb..b111bc315e3f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1030,7 +1030,12 @@ static void init_object(struct kmem_cache *s, void 
*object, u8 val)
unsigned int poison_size = s->object_size;
 
if (s->flags & SLAB_RED_ZONE) {
-   memset(p - s->red_left_pad, val, s->red_left_pad);
+   /*
+* Use __memset() here and below in order to avoid overwriting
+* the KMSAN shadow. Keeping the shadow makes it possible to
+* distinguish uninit-value from use-after-free.
+*/
+   __memset(p - s->red_left_pad, val, s->red_left_pad);
 
if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
/*
@@ -1043,12 +1048,12 @@ static void init_object(struct kmem_cache *s, void 
*object, u8 val)
}
 
if (s->flags & __OBJECT_POISON) {
-   memset(p, POISON_FREE, poison_size - 1);
-   p[poison_size - 1] = POISON_END;
+   __memset(p, POISON_FREE, poison_size - 1);
+   __memset(p + poison_size - 1, POISON_END, 1);
}
 
if (s->flags & SLAB_RED_ZONE)
-   memset(p + poison_size, val, s->inuse - poison_size);
+   __memset(p + poison_size, val, s->inuse - poison_size);
 }
 
 static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
-- 
2.43.0




[PATCH v3 30/34] s390/traps: Unpoison the kernel_stack_overflow()'s pt_regs

2023-12-13 Thread Ilya Leoshkevich
This is normally done by the generic entry code, but the
kernel_stack_overflow() flow bypasses it.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/kernel/traps.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/s390/kernel/traps.c b/arch/s390/kernel/traps.c
index 1d2aa448d103..f299b1203a20 100644
--- a/arch/s390/kernel/traps.c
+++ b/arch/s390/kernel/traps.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -260,6 +261,11 @@ static void monitor_event_exception(struct pt_regs *regs)
 
 void kernel_stack_overflow(struct pt_regs *regs)
 {
+   /*
+* Normally regs are unpoisoned by the generic entry code, but
+* kernel_stack_overflow() is a rare case that is called bypassing it.
+*/
+   kmsan_unpoison_entry_regs(regs);
bust_spinlocks(1);
printk("Kernel stack overflow.\n");
show_regs(regs);
-- 
2.43.0




[PATCH v3 24/34] s390/cpumf: Unpoison STCCTM output buffer

2023-12-13 Thread Ilya Leoshkevich
stcctm() uses the "Q" constraint for dest, therefore KMSAN does not
understand that it fills multiple doublewords pointed to by dest, not
just one. This results in false positives.

Unpoison the whole dest manually with kmsan_unpoison_memory().

Reported-by: Alexander Gordeev 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/include/asm/cpu_mf.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/s390/include/asm/cpu_mf.h b/arch/s390/include/asm/cpu_mf.h
index a0de5b9b02ea..9e4bbc3e53f8 100644
--- a/arch/s390/include/asm/cpu_mf.h
+++ b/arch/s390/include/asm/cpu_mf.h
@@ -10,6 +10,7 @@
 #define _ASM_S390_CPU_MF_H
 
 #include 
+#include 
 #include 
 #include 
 
@@ -239,6 +240,11 @@ static __always_inline int stcctm(enum stcctm_ctr_set set, 
u64 range, u64 *dest)
: "=d" (cc)
: "Q" (*dest), "d" (range), "i" (set)
: "cc", "memory");
+   /*
+* If cc == 2, less than RANGE counters are stored, but it's not easy
+* to tell how many. Always unpoison the whole range for simplicity.
+*/
+   kmsan_unpoison_memory(dest, range * sizeof(u64));
return cc;
 }
 
-- 
2.43.0




[PATCH v3 10/34] kmsan: Export panic_on_kmsan

2023-12-13 Thread Ilya Leoshkevich
When building the kmsan test as a module, modpost fails with the
following error message:

ERROR: modpost: "panic_on_kmsan" [mm/kmsan/kmsan_test.ko] undefined!

Export panic_on_kmsan in order to improve the KMSAN usability for
modules.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/report.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/kmsan/report.c b/mm/kmsan/report.c
index 02736ec757f2..c79d3b0d2d0d 100644
--- a/mm/kmsan/report.c
+++ b/mm/kmsan/report.c
@@ -20,6 +20,7 @@ static DEFINE_RAW_SPINLOCK(kmsan_report_lock);
 /* Protected by kmsan_report_lock */
 static char report_local_descr[DESCR_SIZE];
 int panic_on_kmsan __read_mostly;
+EXPORT_SYMBOL_GPL(panic_on_kmsan);
 
 #ifdef MODULE_PARAM_PREFIX
 #undef MODULE_PARAM_PREFIX
-- 
2.43.0




[PATCH v3 29/34] s390/string: Add KMSAN support

2023-12-13 Thread Ilya Leoshkevich
Add KMSAN support for the s390 implementations of the string functions.
Do this similar to how it's already done for KASAN, except that the
optimized memset{16,32,64}() functions need to be disabled: it's
important for KMSAN to know that they initialized something.

The way boot code is built with regard to string functions is
problematic, since most files think it's configured with sanitizers,
but boot/string.c doesn't. This creates various problems with the
memset64() definitions, depending on whether the code is built with
sanitizers or fortify. This should probably be streamlined, but in the
meantime resolve the issues by introducing the IN_BOOT_STRING_C macro,
similar to the existing IN_ARCH_STRING_C macro.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/boot/string.c| 16 
 arch/s390/include/asm/string.h | 20 +++-
 2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/arch/s390/boot/string.c b/arch/s390/boot/string.c
index faccb33b462c..f6b9b1df48a8 100644
--- a/arch/s390/boot/string.c
+++ b/arch/s390/boot/string.c
@@ -1,11 +1,18 @@
 // SPDX-License-Identifier: GPL-2.0
+#define IN_BOOT_STRING_C 1
 #include 
 #include 
 #include 
 #undef CONFIG_KASAN
 #undef CONFIG_KASAN_GENERIC
+#undef CONFIG_KMSAN
 #include "../lib/string.c"
 
+/*
+ * Duplicate some functions from the common lib/string.c
+ * instead of fully including it.
+ */
+
 int strncmp(const char *cs, const char *ct, size_t count)
 {
unsigned char c1, c2;
@@ -22,6 +29,15 @@ int strncmp(const char *cs, const char *ct, size_t count)
return 0;
 }
 
+void *memset64(uint64_t *s, uint64_t v, size_t count)
+{
+   uint64_t *xs = s;
+
+   while (count--)
+   *xs++ = v;
+   return s;
+}
+
 char *skip_spaces(const char *str)
 {
while (isspace(*str))
diff --git a/arch/s390/include/asm/string.h b/arch/s390/include/asm/string.h
index 351685de53d2..2ab868cbae6c 100644
--- a/arch/s390/include/asm/string.h
+++ b/arch/s390/include/asm/string.h
@@ -15,15 +15,12 @@
 #define __HAVE_ARCH_MEMCPY /* gcc builtin & arch function */
 #define __HAVE_ARCH_MEMMOVE/* gcc builtin & arch function */
 #define __HAVE_ARCH_MEMSET /* gcc builtin & arch function */
-#define __HAVE_ARCH_MEMSET16   /* arch function */
-#define __HAVE_ARCH_MEMSET32   /* arch function */
-#define __HAVE_ARCH_MEMSET64   /* arch function */
 
 void *memcpy(void *dest, const void *src, size_t n);
 void *memset(void *s, int c, size_t n);
 void *memmove(void *dest, const void *src, size_t n);
 
-#ifndef CONFIG_KASAN
+#if !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN)
 #define __HAVE_ARCH_MEMCHR /* inline & arch function */
 #define __HAVE_ARCH_MEMCMP /* arch function */
 #define __HAVE_ARCH_MEMSCAN/* inline & arch function */
@@ -36,6 +33,9 @@ void *memmove(void *dest, const void *src, size_t n);
 #define __HAVE_ARCH_STRNCPY/* arch function */
 #define __HAVE_ARCH_STRNLEN/* inline & arch function */
 #define __HAVE_ARCH_STRSTR /* arch function */
+#define __HAVE_ARCH_MEMSET16   /* arch function */
+#define __HAVE_ARCH_MEMSET32   /* arch function */
+#define __HAVE_ARCH_MEMSET64   /* arch function */
 
 /* Prototypes for non-inlined arch strings functions. */
 int memcmp(const void *s1, const void *s2, size_t n);
@@ -44,7 +44,7 @@ size_t strlcat(char *dest, const char *src, size_t n);
 char *strncat(char *dest, const char *src, size_t n);
 char *strncpy(char *dest, const char *src, size_t n);
 char *strstr(const char *s1, const char *s2);
-#endif /* !CONFIG_KASAN */
+#endif /* !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN) */
 
 #undef __HAVE_ARCH_STRCHR
 #undef __HAVE_ARCH_STRNCHR
@@ -74,20 +74,30 @@ void *__memset16(uint16_t *s, uint16_t v, size_t count);
 void *__memset32(uint32_t *s, uint32_t v, size_t count);
 void *__memset64(uint64_t *s, uint64_t v, size_t count);
 
+#ifdef __HAVE_ARCH_MEMSET16
 static inline void *memset16(uint16_t *s, uint16_t v, size_t count)
 {
return __memset16(s, v, count * sizeof(v));
 }
+#endif
 
+#ifdef __HAVE_ARCH_MEMSET32
 static inline void *memset32(uint32_t *s, uint32_t v, size_t count)
 {
return __memset32(s, v, count * sizeof(v));
 }
+#endif
 
+#ifdef __HAVE_ARCH_MEMSET64
+#ifdef IN_BOOT_STRING_C
+void *memset64(uint64_t *s, uint64_t v, size_t count);
+#else
 static inline void *memset64(uint64_t *s, uint64_t v, size_t count)
 {
return __memset64(s, v, count * sizeof(v));
 }
+#endif
+#endif
 
 #if !defined(IN_ARCH_STRING_C) && (!defined(CONFIG_FORTIFY_SOURCE) || 
defined(__NO_FORTIFY))
 
-- 
2.43.0




[PATCH v3 09/34] kmsan: Expose kmsan_get_metadata()

2023-12-13 Thread Ilya Leoshkevich
Each s390 CPU has lowcore pages associated with it. Each CPU sees its
own lowcore at virtual address 0 through a hardware mechanism called
prefixing. Additionally, all lowcores are mapped to non-0 virtual
addresses stored in the lowcore_ptr[] array.

When lowcore is accessed through virtual address 0, one needs to
resolve metadata for lowcore_ptr[raw_smp_processor_id()].

Expose kmsan_get_metadata() to make it possible to do this from the
arch code.

Signed-off-by: Ilya Leoshkevich 
---
 include/linux/kmsan.h  | 9 +
 mm/kmsan/instrumentation.c | 1 +
 mm/kmsan/kmsan.h   | 1 -
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h
index e0c23a32cdf0..fe6c2212bdb1 100644
--- a/include/linux/kmsan.h
+++ b/include/linux/kmsan.h
@@ -230,6 +230,15 @@ void kmsan_handle_urb(const struct urb *urb, bool is_out);
  */
 void kmsan_unpoison_entry_regs(const struct pt_regs *regs);
 
+/**
+ * kmsan_get_metadata() - Return a pointer to KMSAN shadow or origins.
+ * @addr:  kernel address.
+ * @is_origin: whether to return origins or shadow.
+ *
+ * Return NULL if metadata cannot be found.
+ */
+void *kmsan_get_metadata(void *addr, bool is_origin);
+
 #else
 
 static inline void kmsan_init_shadow(void)
diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c
index 8a1bbbc723ab..94b49fac9d8b 100644
--- a/mm/kmsan/instrumentation.c
+++ b/mm/kmsan/instrumentation.c
@@ -14,6 +14,7 @@
 
 #include "kmsan.h"
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/mm/kmsan/kmsan.h b/mm/kmsan/kmsan.h
index adf443bcffe8..34b83c301d57 100644
--- a/mm/kmsan/kmsan.h
+++ b/mm/kmsan/kmsan.h
@@ -66,7 +66,6 @@ struct shadow_origin_ptr {
 
 struct shadow_origin_ptr kmsan_get_shadow_origin_ptr(void *addr, u64 size,
 bool store);
-void *kmsan_get_metadata(void *addr, bool is_origin);
 void __init kmsan_init_alloc_meta_for_range(void *start, void *end);
 
 enum kmsan_bug_reason {
-- 
2.43.0




[PATCH v3 26/34] s390/ftrace: Unpoison ftrace_regs in kprobe_ftrace_handler()

2023-12-13 Thread Ilya Leoshkevich
s390 uses assembly code to initialize ftrace_regs and call
kprobe_ftrace_handler(). Therefore, from KMSAN's point of view,
ftrace_regs is poisoned on kprobe_ftrace_handler() entry. This causes
KMSAN warnings when running the ftrace testsuite.

Fix by trusting the assembly code and always unpoisoning ftrace_regs in
kprobe_ftrace_handler().

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/kernel/ftrace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index c46381ea04ec..3cc5e6d011a9 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -300,6 +301,7 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long 
parent_ip,
if (bit < 0)
return;
 
+   kmsan_unpoison_memory(fregs, sizeof(*fregs));
regs = ftrace_get_regs(fregs);
p = get_kprobe((kprobe_opcode_t *)ip);
if (!regs || unlikely(!p) || kprobe_disabled(p))
-- 
2.43.0




[PATCH v3 08/34] kmsan: Remove an x86-specific #include from kmsan.h

2023-12-13 Thread Ilya Leoshkevich
Replace the x86-specific asm/pgtable_64_types.h #include with the
linux/pgtable.h one, which all architectures have.

While at it, sort the headers alphabetically for the sake of
consistency with other KMSAN code.

Fixes: f80be4571b19 ("kmsan: add KMSAN runtime core")
Suggested-by: Heiko Carstens 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/kmsan.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/kmsan/kmsan.h b/mm/kmsan/kmsan.h
index a14744205435..adf443bcffe8 100644
--- a/mm/kmsan/kmsan.h
+++ b/mm/kmsan/kmsan.h
@@ -10,14 +10,14 @@
 #ifndef __MM_KMSAN_KMSAN_H
 #define __MM_KMSAN_KMSAN_H
 
-#include 
 #include 
+#include 
+#include 
+#include 
+#include 
 #include 
 #include 
 #include 
-#include 
-#include 
-#include 
 
 #define KMSAN_ALLOCA_MAGIC_ORIGIN 0xabcd0100
 #define KMSAN_CHAIN_MAGIC_ORIGIN 0xabcd0200
-- 
2.43.0




[PATCH v3 25/34] s390/diag: Unpoison diag224() output buffer

2023-12-13 Thread Ilya Leoshkevich
Diagnose 224 stores 4k bytes, which cannot be deduced from the inline
assembly constraints. This leads to KMSAN false positives.

Unpoison the output buffer manually with kmsan_unpoison_memory().

Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/kernel/diag.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/kernel/diag.c b/arch/s390/kernel/diag.c
index 92fdc35f028c..fb83a21014d0 100644
--- a/arch/s390/kernel/diag.c
+++ b/arch/s390/kernel/diag.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -255,6 +256,7 @@ int diag224(void *ptr)
"1:\n"
EX_TABLE(0b,1b)
: "+d" (rc) :"d" (0), "d" (addr) : "memory");
+   kmsan_unpoison_memory(ptr, PAGE_SIZE);
return rc;
 }
 EXPORT_SYMBOL(diag224);
-- 
2.43.0




[PATCH v3 23/34] s390/cpacf: Unpoison the results of cpacf_trng()

2023-12-13 Thread Ilya Leoshkevich
Prevent KMSAN from complaining about buffers filled by cpacf_trng()
being uninitialized.

Tested-by: Alexander Gordeev 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/include/asm/cpacf.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/s390/include/asm/cpacf.h b/arch/s390/include/asm/cpacf.h
index b378e2b57ad8..2bb6b4e7e082 100644
--- a/arch/s390/include/asm/cpacf.h
+++ b/arch/s390/include/asm/cpacf.h
@@ -12,6 +12,7 @@
 #define _ASM_S390_CPACF_H
 
 #include 
+#include 
 
 /*
  * Instruction opcodes for the CPACF instructions
@@ -473,6 +474,8 @@ static inline void cpacf_trng(u8 *ucbuf, unsigned long 
ucbuf_len,
	: [ucbuf] "+&d" (u.pair), [cbuf] "+&d" (c.pair)
: [fc] "K" (CPACF_PRNO_TRNG), [opc] "i" (CPACF_PRNO)
: "cc", "memory", "0");
+   kmsan_unpoison_memory(ucbuf, ucbuf_len);
+   kmsan_unpoison_memory(cbuf, cbuf_len);
 }
 
 /**
-- 
2.43.0




[PATCH v3 02/34] kmsan: Make the tests compatible with kmsan.panic=1

2023-12-13 Thread Ilya Leoshkevich
It's useful to have both tests and kmsan.panic=1 during development,
but right now the warnings that the tests cause lead to kernel
panics.

Temporarily set kmsan.panic=0 for the duration of the KMSAN testing.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/kmsan_test.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/kmsan/kmsan_test.c b/mm/kmsan/kmsan_test.c
index 07d3a3a5a9c5..9bfd11674fe3 100644
--- a/mm/kmsan/kmsan_test.c
+++ b/mm/kmsan/kmsan_test.c
@@ -659,9 +659,13 @@ static void test_exit(struct kunit *test)
 {
 }
 
+static int orig_panic_on_kmsan;
+
 static int kmsan_suite_init(struct kunit_suite *suite)
 {
register_trace_console(probe_console, NULL);
+   orig_panic_on_kmsan = panic_on_kmsan;
+   panic_on_kmsan = 0;
return 0;
 }
 
@@ -669,6 +673,7 @@ static void kmsan_suite_exit(struct kunit_suite *suite)
 {
unregister_trace_console(probe_console, NULL);
tracepoint_synchronize_unregister();
+   panic_on_kmsan = orig_panic_on_kmsan;
 }
 
 static struct kunit_suite kmsan_test_suite = {
-- 
2.43.0




[PATCH v3 22/34] s390/checksum: Add a KMSAN check

2023-12-13 Thread Ilya Leoshkevich
Add a KMSAN check to the CKSM inline assembly, similar to how it was
done for ASAN in commit e42ac7789df6 ("s390/checksum: always use cksm
instruction").

Acked-by: Alexander Gordeev 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/include/asm/checksum.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/include/asm/checksum.h b/arch/s390/include/asm/checksum.h
index 69837eec2ff5..55ba0ddd8eab 100644
--- a/arch/s390/include/asm/checksum.h
+++ b/arch/s390/include/asm/checksum.h
@@ -13,6 +13,7 @@
 #define _S390_CHECKSUM_H
 
 #include 
+#include 
 #include 
 
 /*
@@ -35,6 +36,7 @@ static inline __wsum csum_partial(const void *buff, int len, 
__wsum sum)
};
 
kasan_check_read(buff, len);
+   kmsan_check_memory(buff, len);
asm volatile(
"0: cksm%[sum],%[rp]\n"
"   jo  0b\n"
-- 
2.43.0




[PATCH v3 18/34] kmsan: Accept ranges starting with 0 on s390

2023-12-13 Thread Ilya Leoshkevich
On s390 the virtual address 0 is valid (current CPU's lowcore is mapped
there), therefore KMSAN should not complain about it.

Disable the respective check on s390. There doesn't seem to be a
Kconfig option to describe this situation, so explicitly check for
s390.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/init.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c
index ffedf4dbc49d..7a3df4d359f8 100644
--- a/mm/kmsan/init.c
+++ b/mm/kmsan/init.c
@@ -33,7 +33,10 @@ static void __init kmsan_record_future_shadow_range(void 
*start, void *end)
bool merged = false;
 
KMSAN_WARN_ON(future_index == NUM_FUTURE_RANGES);
-   KMSAN_WARN_ON((nstart >= nend) || !nstart || !nend);
+   KMSAN_WARN_ON((nstart >= nend) ||
+ /* Virtual address 0 is valid on s390. */
+ (!IS_ENABLED(CONFIG_S390) && !nstart) ||
+ !nend);
nstart = ALIGN_DOWN(nstart, PAGE_SIZE);
nend = ALIGN(nend, PAGE_SIZE);
 
-- 
2.43.0




[PATCH v3 19/34] s390: Turn off KMSAN for boot, vdso and purgatory

2023-12-13 Thread Ilya Leoshkevich
All other sanitizers are disabled for these components as well.
While at it, add a comment to boot and purgatory.

Reviewed-by: Alexander Gordeev 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/boot/Makefile  | 2 ++
 arch/s390/kernel/vdso32/Makefile | 3 ++-
 arch/s390/kernel/vdso64/Makefile | 3 ++-
 arch/s390/purgatory/Makefile | 2 ++
 4 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/s390/boot/Makefile b/arch/s390/boot/Makefile
index c7c81e5f9218..fb10fcd21221 100644
--- a/arch/s390/boot/Makefile
+++ b/arch/s390/boot/Makefile
@@ -3,11 +3,13 @@
 # Makefile for the linux s390-specific parts of the memory manager.
 #
 
+# Tooling runtimes are unavailable and cannot be linked for early boot code
 KCOV_INSTRUMENT := n
 GCOV_PROFILE := n
 UBSAN_SANITIZE := n
 KASAN_SANITIZE := n
 KCSAN_SANITIZE := n
+KMSAN_SANITIZE := n
 
 KBUILD_AFLAGS := $(KBUILD_AFLAGS_DECOMPRESSOR)
 KBUILD_CFLAGS := $(KBUILD_CFLAGS_DECOMPRESSOR)
diff --git a/arch/s390/kernel/vdso32/Makefile b/arch/s390/kernel/vdso32/Makefile
index caec7db6f966..7cbec6b0b11f 100644
--- a/arch/s390/kernel/vdso32/Makefile
+++ b/arch/s390/kernel/vdso32/Makefile
@@ -32,11 +32,12 @@ obj-y += vdso32_wrapper.o
 targets += vdso32.lds
 CPPFLAGS_vdso32.lds += -P -C -U$(ARCH)
 
-# Disable gcov profiling, ubsan and kasan for VDSO code
+# Disable gcov profiling, ubsan, kasan and kmsan for VDSO code
 GCOV_PROFILE := n
 UBSAN_SANITIZE := n
 KASAN_SANITIZE := n
 KCSAN_SANITIZE := n
+KMSAN_SANITIZE := n
 
 # Force dependency (incbin is bad)
 $(obj)/vdso32_wrapper.o : $(obj)/vdso32.so
diff --git a/arch/s390/kernel/vdso64/Makefile b/arch/s390/kernel/vdso64/Makefile
index e3c9085f8fa7..6f3252712f64 100644
--- a/arch/s390/kernel/vdso64/Makefile
+++ b/arch/s390/kernel/vdso64/Makefile
@@ -36,11 +36,12 @@ obj-y += vdso64_wrapper.o
 targets += vdso64.lds
 CPPFLAGS_vdso64.lds += -P -C -U$(ARCH)
 
-# Disable gcov profiling, ubsan and kasan for VDSO code
+# Disable gcov profiling, ubsan, kasan and kmsan for VDSO code
 GCOV_PROFILE := n
 UBSAN_SANITIZE := n
 KASAN_SANITIZE := n
 KCSAN_SANITIZE := n
+KMSAN_SANITIZE := n
 
 # Force dependency (incbin is bad)
 $(obj)/vdso64_wrapper.o : $(obj)/vdso64.so
diff --git a/arch/s390/purgatory/Makefile b/arch/s390/purgatory/Makefile
index 4e930f566878..4e421914e50f 100644
--- a/arch/s390/purgatory/Makefile
+++ b/arch/s390/purgatory/Makefile
@@ -15,11 +15,13 @@ CFLAGS_sha256.o := -D__DISABLE_EXPORTS -D__NO_FORTIFY
 $(obj)/mem.o: $(srctree)/arch/s390/lib/mem.S FORCE
$(call if_changed_rule,as_o_S)
 
+# Tooling runtimes are unavailable and cannot be linked for purgatory code
 KCOV_INSTRUMENT := n
 GCOV_PROFILE := n
 UBSAN_SANITIZE := n
 KASAN_SANITIZE := n
 KCSAN_SANITIZE := n
+KMSAN_SANITIZE := n
 
 KBUILD_CFLAGS := -fno-strict-aliasing -Wall -Wstrict-prototypes
 KBUILD_CFLAGS += -Wno-pointer-sign -Wno-sign-compare
-- 
2.43.0




[PATCH v3 17/34] lib/zlib: Unpoison DFLTCC output buffers

2023-12-13 Thread Ilya Leoshkevich
The constraints of the DFLTCC inline assembly are not precise: they
do not communicate the size of the output buffers to the compiler, so
it cannot automatically instrument it.

Add the manual kmsan_unpoison_memory() calls for the output buffers.
The logic is the same as in [1].

[1] 
https://github.com/zlib-ng/zlib-ng/commit/1f5ddcc009ac3511e99fc88736a9e1a6381168c5

Reported-by: Alexander Gordeev 
Signed-off-by: Ilya Leoshkevich 
---
 lib/zlib_dfltcc/dfltcc.h  |  1 +
 lib/zlib_dfltcc/dfltcc_util.h | 24 
 2 files changed, 25 insertions(+)

diff --git a/lib/zlib_dfltcc/dfltcc.h b/lib/zlib_dfltcc/dfltcc.h
index b96232bdd44d..0f2a16d7a48a 100644
--- a/lib/zlib_dfltcc/dfltcc.h
+++ b/lib/zlib_dfltcc/dfltcc.h
@@ -80,6 +80,7 @@ struct dfltcc_param_v0 {
 uint8_t csb[1152];
 };
 
+static_assert(offsetof(struct dfltcc_param_v0, csb) == 384);
 static_assert(sizeof(struct dfltcc_param_v0) == 1536);
 
 #define CVT_CRC32 0
diff --git a/lib/zlib_dfltcc/dfltcc_util.h b/lib/zlib_dfltcc/dfltcc_util.h
index 4a46b5009f0d..e481c6ea09b5 100644
--- a/lib/zlib_dfltcc/dfltcc_util.h
+++ b/lib/zlib_dfltcc/dfltcc_util.h
@@ -2,6 +2,8 @@
 #ifndef DFLTCC_UTIL_H
 #define DFLTCC_UTIL_H
 
+#include "dfltcc.h"
+#include <linux/kmsan-checks.h>
 #include 
 
 /*
@@ -20,6 +22,7 @@ typedef enum {
 #define DFLTCC_CMPR 2
 #define DFLTCC_XPND 4
 #define HBT_CIRCULAR (1 << 7)
+#define DFLTCC_FN_MASK ((1 << 7) - 1)
 #define HB_BITS 15
 #define HB_SIZE (1 << HB_BITS)
 
@@ -34,6 +37,7 @@ static inline dfltcc_cc dfltcc(
 )
 {
 Byte *t2 = op1 ? *op1 : NULL;
+unsigned char *orig_t2 = t2;
 size_t t3 = len1 ? *len1 : 0;
 const Byte *t4 = op2 ? *op2 : NULL;
 size_t t5 = len2 ? *len2 : 0;
@@ -59,6 +63,26 @@ static inline dfltcc_cc dfltcc(
  : "cc", "memory");
 t2 = r2; t3 = r3; t4 = r4; t5 = r5;
 
+switch (fn & DFLTCC_FN_MASK) {
+case DFLTCC_QAF:
+kmsan_unpoison_memory(param, sizeof(struct dfltcc_qaf_param));
+break;
+case DFLTCC_GDHT:
+kmsan_unpoison_memory(param, offsetof(struct dfltcc_param_v0, csb));
+break;
+case DFLTCC_CMPR:
+kmsan_unpoison_memory(param, sizeof(struct dfltcc_param_v0));
+kmsan_unpoison_memory(
+orig_t2,
+t2 - orig_t2 +
+(((struct dfltcc_param_v0 *)param)->sbb == 0 ? 0 : 1));
+break;
+case DFLTCC_XPND:
+kmsan_unpoison_memory(param, sizeof(struct dfltcc_param_v0));
+kmsan_unpoison_memory(orig_t2, t2 - orig_t2);
+break;
+}
+
 if (op1)
 *op1 = t2;
 if (len1)
-- 
2.43.0




[PATCH v3 11/34] kmsan: Allow disabling KMSAN checks for the current task

2023-12-13 Thread Ilya Leoshkevich
Like for KASAN, it's useful to temporarily disable KMSAN checks around,
e.g., redzone accesses. Introduce kmsan_disable_current() and
kmsan_enable_current(), which are similar to their KASAN counterparts.

Make them reentrant in order to handle memory allocations in interrupt
context. Repurpose the allow_reporting field for this.
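
For context, a minimal usage sketch of the pair (hypothetical helper, not
part of this patch; the real users are added later in the series, e.g. the
slub metadata accessors). Note that shadow still propagates while reports
are suppressed, so the result is unpoisoned explicitly:

  #include <linux/types.h>
  #include <linux/kmsan.h>
  #include <linux/kmsan-checks.h>

  /* Check a deliberately poisoned redzone byte without triggering a KMSAN
   * report. The disable/enable pair is a counter, so the pattern may nest,
   * e.g. when an interrupt handler does the same thing. */
  static bool redzone_byte_ok(const u8 *addr)
  {
          bool ok;

          kmsan_disable_current();
          ok = (*addr == 0xcc);           /* no report while depth > 0 */
          kmsan_enable_current();

          kmsan_unpoison_memory(&ok, sizeof(ok));
          return ok;
  }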

Signed-off-by: Ilya Leoshkevich 
---
 Documentation/dev-tools/kmsan.rst |  4 ++--
 include/linux/kmsan.h | 24 
 include/linux/kmsan_types.h   |  2 +-
 mm/kmsan/core.c   |  1 -
 mm/kmsan/hooks.c  | 18 +++---
 mm/kmsan/report.c |  7 ---
 6 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/Documentation/dev-tools/kmsan.rst 
b/Documentation/dev-tools/kmsan.rst
index 323eedad53cd..022a823f5f1b 100644
--- a/Documentation/dev-tools/kmsan.rst
+++ b/Documentation/dev-tools/kmsan.rst
@@ -338,11 +338,11 @@ Per-task KMSAN state
 
 
 Every task_struct has an associated KMSAN task state that holds the KMSAN
-context (see above) and a per-task flag disallowing KMSAN reports::
+context (see above) and a per-task counter disallowing KMSAN reports::
 
   struct kmsan_context {
 ...
-bool allow_reporting;
+unsigned int depth;
 struct kmsan_context_state cstate;
 ...
   }
diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h
index fe6c2212bdb1..23de1b3d6aee 100644
--- a/include/linux/kmsan.h
+++ b/include/linux/kmsan.h
@@ -239,6 +239,22 @@ void kmsan_unpoison_entry_regs(const struct pt_regs *regs);
  */
 void *kmsan_get_metadata(void *addr, bool is_origin);
 
+/*
+ * kmsan_enable_current(): Enable KMSAN for the current task.
+ *
+ * Each kmsan_enable_current() call must be preceded by a
+ * kmsan_disable_current() call. These call pairs may be nested.
+ */
+void kmsan_enable_current(void);
+
+/*
+ * kmsan_disable_current(): Disable KMSAN for the current task.
+ *
+ * Each kmsan_disable_current() call must be followed by a
+ * kmsan_enable_current() call. These call pairs may be nested.
+ */
+void kmsan_disable_current(void);
+
 #else
 
 static inline void kmsan_init_shadow(void)
@@ -338,6 +354,14 @@ static inline void kmsan_unpoison_entry_regs(const struct 
pt_regs *regs)
 {
 }
 
+static inline void kmsan_enable_current(void)
+{
+}
+
+static inline void kmsan_disable_current(void)
+{
+}
+
 #endif
 
 #endif /* _LINUX_KMSAN_H */
diff --git a/include/linux/kmsan_types.h b/include/linux/kmsan_types.h
index 8bfa6c98176d..27bb146ece95 100644
--- a/include/linux/kmsan_types.h
+++ b/include/linux/kmsan_types.h
@@ -29,7 +29,7 @@ struct kmsan_context_state {
 struct kmsan_ctx {
struct kmsan_context_state cstate;
int kmsan_in_runtime;
-   bool allow_reporting;
+   unsigned int depth;
 };
 
 #endif /* _LINUX_KMSAN_TYPES_H */
diff --git a/mm/kmsan/core.c b/mm/kmsan/core.c
index c19f47af0424..68c68b30441d 100644
--- a/mm/kmsan/core.c
+++ b/mm/kmsan/core.c
@@ -43,7 +43,6 @@ void kmsan_internal_task_create(struct task_struct *task)
struct thread_info *info = current_thread_info();
 
__memset(ctx, 0, sizeof(*ctx));
-   ctx->allow_reporting = true;
kmsan_internal_unpoison_memory(info, sizeof(*info), false);
 }
 
diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
index eafc45f937eb..3acf010c9814 100644
--- a/mm/kmsan/hooks.c
+++ b/mm/kmsan/hooks.c
@@ -39,12 +39,10 @@ void kmsan_task_create(struct task_struct *task)
 
 void kmsan_task_exit(struct task_struct *task)
 {
-   struct kmsan_ctx *ctx = &task->kmsan_ctx;
-
if (!kmsan_enabled || kmsan_in_runtime())
return;
 
-   ctx->allow_reporting = false;
+   kmsan_disable_current();
 }
 
 void kmsan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags)
@@ -423,3 +421,17 @@ void kmsan_check_memory(const void *addr, size_t size)
   REASON_ANY);
 }
 EXPORT_SYMBOL(kmsan_check_memory);
+
+void kmsan_enable_current(void)
+{
+   KMSAN_WARN_ON(current->kmsan_ctx.depth == 0);
+   current->kmsan_ctx.depth--;
+}
+EXPORT_SYMBOL(kmsan_enable_current);
+
+void kmsan_disable_current(void)
+{
+   current->kmsan_ctx.depth++;
+   KMSAN_WARN_ON(current->kmsan_ctx.depth == 0);
+}
+EXPORT_SYMBOL(kmsan_disable_current);
diff --git a/mm/kmsan/report.c b/mm/kmsan/report.c
index c79d3b0d2d0d..92e73ec61435 100644
--- a/mm/kmsan/report.c
+++ b/mm/kmsan/report.c
@@ -8,6 +8,7 @@
  */
 
 #include 
+#include <linux/kmsan.h>
 #include 
 #include 
 #include 
@@ -158,12 +159,12 @@ void kmsan_report(depot_stack_handle_t origin, void 
*address, int size,
 
if (!kmsan_enabled)
return;
-   if (!current->kmsan_ctx.allow_reporting)
+   if (current->kmsan_ctx.depth)
return;
if (!origin)
return;
 
-   current->kmsan_ctx.allow_reporting = false;
+   kmsan_disable_current();
ua_flags = user_access_save();

[PATCH v3 16/34] mm: kfence: Disable KMSAN when checking the canary

2023-12-13 Thread Ilya Leoshkevich
KMSAN warns about check_canary() accessing the canary.

The reason is that, even though set_canary() is properly instrumented
and sets shadow, slub explicitly poisons the canary's address range
afterwards.

Unpoisoning the canary is not the right thing to do: only
check_canary() is supposed to ever touch it. Instead, disable KMSAN
checks around canary read accesses.

Reviewed-by: Alexander Potapenko 
Tested-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kfence/core.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 3872528d0963..96138e704c5a 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -305,8 +305,14 @@ metadata_update_state(struct kfence_metadata *meta, enum 
kfence_object_state nex
WRITE_ONCE(meta->state, next);
 }
 
+#ifdef CONFIG_KMSAN
+#define CHECK_CANARY_ATTRIBUTES noinline __no_kmsan_checks
+#else
+#define CHECK_CANARY_ATTRIBUTES inline
+#endif
+
 /* Check canary byte at @addr. */
-static inline bool check_canary_byte(u8 *addr)
+static CHECK_CANARY_ATTRIBUTES bool check_canary_byte(u8 *addr)
 {
struct kfence_metadata *meta;
unsigned long flags;
@@ -341,7 +347,8 @@ static inline void set_canary(const struct kfence_metadata 
*meta)
*((u64 *)addr) = KFENCE_CANARY_PATTERN_U64;
 }
 
-static inline void check_canary(const struct kfence_metadata *meta)
+static CHECK_CANARY_ATTRIBUTES void
+check_canary(const struct kfence_metadata *meta)
 {
const unsigned long pageaddr = ALIGN_DOWN(meta->addr, PAGE_SIZE);
unsigned long addr = pageaddr;
-- 
2.43.0




[PATCH v3 07/34] kmsan: Remove a useless assignment from kmsan_vmap_pages_range_noflush()

2023-12-13 Thread Ilya Leoshkevich
The value assigned to prot is immediately overwritten on the next line
with PAGE_KERNEL. The right hand side of the assignment has no
side-effects.

Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations")
Suggested-by: Alexander Gordeev 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/shadow.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c
index b9d05aff313e..2d57408c78ae 100644
--- a/mm/kmsan/shadow.c
+++ b/mm/kmsan/shadow.c
@@ -243,7 +243,6 @@ int kmsan_vmap_pages_range_noflush(unsigned long start, 
unsigned long end,
s_pages[i] = shadow_page_for(pages[i]);
o_pages[i] = origin_page_for(pages[i]);
}
-   prot = __pgprot(pgprot_val(prot) | _PAGE_NX);
prot = PAGE_KERNEL;
 
origin_start = vmalloc_meta((void *)start, KMSAN_META_ORIGIN);
-- 
2.43.0




[PATCH v3 15/34] mm: slub: Unpoison the memchr_inv() return value

2023-12-13 Thread Ilya Leoshkevich
Even though the KMSAN warnings generated by memchr_inv() are suppressed
by metadata_access_enable(), its return value may still be poisoned.

The reason is that the last iteration of memchr_inv() returns
`*start != value ? start : NULL`, where *start is poisoned. Because of
this, somewhat counterintuitively, the shadow value computed by
visitSelectInst() is equal to `(uintptr_t)start`.

The intention behind guarding memchr_inv() behind
metadata_access_enable() is to touch poisoned metadata without
triggering KMSAN, so unpoison its return value.

Signed-off-by: Ilya Leoshkevich 
---
 mm/slub.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index 2d29d368894c..802702748925 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1076,6 +1076,7 @@ static int check_bytes_and_report(struct kmem_cache *s, 
struct slab *slab,
metadata_access_enable();
fault = memchr_inv(kasan_reset_tag(start), value, bytes);
metadata_access_disable();
+   kmsan_unpoison_memory(&fault, sizeof(fault));
if (!fault)
return 1;
 
@@ -1182,6 +1183,7 @@ static void slab_pad_check(struct kmem_cache *s, struct 
slab *slab)
metadata_access_enable();
fault = memchr_inv(kasan_reset_tag(pad), POISON_INUSE, remainder);
metadata_access_disable();
+   kmsan_unpoison_memory(&fault, sizeof(fault));
if (!fault)
return;
while (end > fault && end[-1] == POISON_INUSE)
-- 
2.43.0




[PATCH v3 14/34] mm: slub: Let KMSAN access metadata

2023-12-13 Thread Ilya Leoshkevich
Building the kernel with CONFIG_SLUB_DEBUG and CONFIG_KMSAN causes
KMSAN to complain about touching redzones in kfree().

Fix by extending the existing KASAN-related metadata_access_enable()
and metadata_access_disable() functions to KMSAN.

Acked-by: Vlastimil Babka 
Signed-off-by: Ilya Leoshkevich 
---
 mm/slub.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index b111bc315e3f..2d29d368894c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -700,10 +700,12 @@ static int disable_higher_order_debug;
 static inline void metadata_access_enable(void)
 {
kasan_disable_current();
+   kmsan_disable_current();
 }
 
 static inline void metadata_access_disable(void)
 {
+   kmsan_enable_current();
kasan_enable_current();
 }
 
-- 
2.43.0




[PATCH v3 06/34] kmsan: Fix kmsan_copy_to_user() on arches with overlapping address spaces

2023-12-13 Thread Ilya Leoshkevich
Comparing pointers with TASK_SIZE does not make sense when kernel and
userspace overlap. Assume that we are handling user memory access in
this case.

Reported-by: Alexander Gordeev 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/hooks.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
index 5d6e2dee5692..eafc45f937eb 100644
--- a/mm/kmsan/hooks.c
+++ b/mm/kmsan/hooks.c
@@ -267,7 +267,8 @@ void kmsan_copy_to_user(void __user *to, const void *from, 
size_t to_copy,
return;
 
ua_flags = user_access_save();
-   if ((u64)to < TASK_SIZE) {
+   if (!IS_ENABLED(CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE) ||
+   (u64)to < TASK_SIZE) {
/* This is a user memory access, check it. */
kmsan_internal_check_memory((void *)from, to_copy - left, to,
REASON_COPY_TO_USER);
-- 
2.43.0




[PATCH v3 04/34] kmsan: Increase the maximum store size to 4096

2023-12-13 Thread Ilya Leoshkevich
The inline assembly block in s390's chsc() stores that much.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/instrumentation.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c
index cc3907a9c33a..470b0b4afcc4 100644
--- a/mm/kmsan/instrumentation.c
+++ b/mm/kmsan/instrumentation.c
@@ -110,11 +110,10 @@ void __msan_instrument_asm_store(void *addr, uintptr_t 
size)
 
ua_flags = user_access_save();
/*
-* Most of the accesses are below 32 bytes. The two exceptions so far
-* are clwb() (64 bytes) and FPU state (512 bytes).
-* It's unlikely that the assembly will touch more than 512 bytes.
+* Most of the accesses are below 32 bytes. The exceptions so far are
+* clwb() (64 bytes), FPU state (512 bytes) and chsc() (4096 bytes).
 */
-   if (size > 512) {
+   if (size > 4096) {
WARN_ONCE(1, "assembly store size too big: %ld\n", size);
size = 8;
}
-- 
2.43.0




[PATCH v3 00/34] kmsan: Enable on s390

2023-12-13 Thread Ilya Leoshkevich
v2: https://lore.kernel.org/lkml/20231121220155.1217090-1-...@linux.ibm.com/
v2 -> v3: Drop kmsan_memmove_metadata() and strlcpy() patches;
  Remove kmsan_get_metadata() stub;
  Move kmsan_enable_current() and kmsan_disable_current() to
  include/linux/kmsan.h, explain why a counter is needed;
  Drop the memset_no_sanitize_memory() patch;
  Use __memset() in the SLAB_POISON patch;
  Add kmsan-checks.h to the DFLTCC patch;
  Add recursion check to the arch_kmsan_get_meta_or_null()
  patch (Alexander P.).

  Fix inline + __no_kmsan_checks issues.
  New patch for s390/irqflags, that resolves a lockdep warning.
  New patch for s390/diag, that resolves a false positive when
  running on an LPAR.
  New patch for STCCTM, same as above.
  New patch for check_bytes_and_report() that resolves a false
  positive that occurs even on Intel.

v1: https://lore.kernel.org/lkml/20231115203401.2495875-1-...@linux.ibm.com/
v1 -> v2: Add comments, sort #includes, introduce
  memset_no_sanitize_memory() and use it to avoid unpoisoning
  of redzones, change vmalloc alignment to _REGION3_SIZE, add
  R-bs (Alexander P.).

  Fix building
  [PATCH 28/33] s390/string: Add KMSAN support
  with FORTIFY_SOURCE.
  Reported-by: kernel test robot 
  Closes: 
https://lore.kernel.org/oe-kbuild-all/202311170550.bsbo44ix-...@intel.com/

Hi,

This series provides the minimal support for Kernel Memory Sanitizer on
s390. Kernel Memory Sanitizer is clang-only instrumentation for finding
accesses to uninitialized memory. The clang support for s390 has already
been merged [1].

With this series, I can successfully boot s390 defconfig and
debug_defconfig with kmsan.panic=1. The tool found one real
s390-specific bug (fixed in master).

Best regards,
Ilya

[1] https://reviews.llvm.org/D148596

Ilya Leoshkevich (34):
  ftrace: Unpoison ftrace_regs in ftrace_ops_list_func()
  kmsan: Make the tests compatible with kmsan.panic=1
  kmsan: Disable KMSAN when DEFERRED_STRUCT_PAGE_INIT is enabled
  kmsan: Increase the maximum store size to 4096
  kmsan: Fix is_bad_asm_addr() on arches with overlapping address spaces
  kmsan: Fix kmsan_copy_to_user() on arches with overlapping address
spaces
  kmsan: Remove a useless assignment from
kmsan_vmap_pages_range_noflush()
  kmsan: Remove an x86-specific #include from kmsan.h
  kmsan: Expose kmsan_get_metadata()
  kmsan: Export panic_on_kmsan
  kmsan: Allow disabling KMSAN checks for the current task
  kmsan: Support SLAB_POISON
  kmsan: Use ALIGN_DOWN() in kmsan_get_metadata()
  mm: slub: Let KMSAN access metadata
  mm: slub: Unpoison the memchr_inv() return value
  mm: kfence: Disable KMSAN when checking the canary
  lib/zlib: Unpoison DFLTCC output buffers
  kmsan: Accept ranges starting with 0 on s390
  s390: Turn off KMSAN for boot, vdso and purgatory
  s390: Use a larger stack for KMSAN
  s390/boot: Add the KMSAN runtime stub
  s390/checksum: Add a KMSAN check
  s390/cpacf: Unpoison the results of cpacf_trng()
  s390/cpumf: Unpoison STCCTM output buffer
  s390/diag: Unpoison diag224() output buffer
  s390/ftrace: Unpoison ftrace_regs in kprobe_ftrace_handler()
  s390/irqflags: Do not instrument arch_local_irq_*() with KMSAN
  s390/mm: Define KMSAN metadata for vmalloc and modules
  s390/string: Add KMSAN support
  s390/traps: Unpoison the kernel_stack_overflow()'s pt_regs
  s390/uaccess: Add KMSAN support to put_user() and get_user()
  s390/unwind: Disable KMSAN checks
  s390: Implement the architecture-specific kmsan functions
  kmsan: Enable on s390

 Documentation/dev-tools/kmsan.rst   |   4 +-
 arch/s390/Kconfig   |   1 +
 arch/s390/Makefile  |   2 +-
 arch/s390/boot/Makefile |   3 +
 arch/s390/boot/kmsan.c  |   6 ++
 arch/s390/boot/startup.c|   8 ++
 arch/s390/boot/string.c |  16 
 arch/s390/include/asm/checksum.h|   2 +
 arch/s390/include/asm/cpacf.h   |   3 +
 arch/s390/include/asm/cpu_mf.h  |   6 ++
 arch/s390/include/asm/irqflags.h|  18 -
 arch/s390/include/asm/kmsan.h   |  43 +++
 arch/s390/include/asm/pgtable.h |  10 +++
 arch/s390/include/asm/string.h  |  20 +++--
 arch/s390/include/asm/thread_info.h |   2 +-
 arch/s390/include/asm/uaccess.h | 111 
 arch/s390/kernel/diag.c |   2 +
 arch/s390/kernel/ftrace.c   |   2 +
 arch/s390/kernel/traps.c|   6 ++
 arch/s390/kernel/unwind_bc.c|   4 +
 arch/s390/kernel/vdso32/Makefile|   3 +-
 arch/s390/kernel/vdso64/Makefile|   3 +-
 arch/s390/purgatory/Makefile|   2 +
 drivers/s390/char/sclp.c|   2 +-
 include/linux/kmsan.h   |  33 +
 include/linux/kmsan_types.h |   2 +-
 kernel/trace/ftrace.c   |   1 +
 

[PATCH v3 01/34] ftrace: Unpoison ftrace_regs in ftrace_ops_list_func()

2023-12-13 Thread Ilya Leoshkevich
Architectures use assembly code to initialize ftrace_regs and call
ftrace_ops_list_func(). Therefore, from the KMSAN's point of view,
ftrace_regs is poisoned on ftrace_ops_list_func() entry. This causes
KMSAN warnings when running the ftrace testsuite.

Fix by trusting the architecture-specific assembly code and always
unpoisoning ftrace_regs in ftrace_ops_list_func.

Acked-by: Steven Rostedt (Google) 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 kernel/trace/ftrace.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 8de8bec5f366..dfb8b26966aa 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -7399,6 +7399,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long 
parent_ip,
 void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
   struct ftrace_ops *op, struct ftrace_regs *fregs)
 {
+   kmsan_unpoison_memory(fregs, sizeof(*fregs));
__ftrace_ops_list_func(ip, parent_ip, NULL, fregs);
 }
 #else
-- 
2.43.0




[PATCH v3 03/34] kmsan: Disable KMSAN when DEFERRED_STRUCT_PAGE_INIT is enabled

2023-12-13 Thread Ilya Leoshkevich
KMSAN relies on memblock returning all available pages to it
(see kmsan_memblock_free_pages()). It partitions these pages into 3
categories: pages available to the buddy allocator, shadow pages and
origin pages. This partitioning is static.

If new pages appear after kmsan_init_runtime(), it is considered
an error. DEFERRED_STRUCT_PAGE_INIT causes this, so mark it as
incompatible with KMSAN.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 57cd378c73d6..712bcf5f1d20 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -985,6 +985,7 @@ config DEFERRED_STRUCT_PAGE_INIT
depends on SPARSEMEM
depends on !NEED_PER_CPU_KM
depends on 64BIT
+   depends on !KMSAN
select PADATA
help
  Ordinarily all struct pages are initialised during early boot in a
-- 
2.43.0




[PATCH] driver core: Add a guard() definition for the device_lock()

2023-12-13 Thread Dan Williams
At present there are ~200 usages of device_lock() in the kernel. Some of
those usages lead to "goto unlock;" patterns which have proven to be
error prone. Define a "device" guard() definition to allow for those to
be cleaned up and prevent new ones from appearing.

Link: 
http://lore.kernel.org/r/657897453dda8_269bd29...@dwillia2-mobl3.amr.corp.intel.com.notmuch
Link: 
http://lore.kernel.org/r/6577b0c2a02df_a04c529...@dwillia2-xfh.jf.intel.com.notmuch
Cc: Vishal Verma 
Cc: Ira Weiny 
Cc: Peter Zijlstra 
Cc: Greg Kroah-Hartman 
Cc: Andrew Morton 
Signed-off-by: Dan Williams 
---
Hi Greg,

I wonder if you might include this change in v6.7-rc to ease some patch
sets alternately going through my tree and Andrew's tree. Those
discussions are linked above. Alternately I can just take it through
my tree with your ack and the other use case can circle back to it in
the v6.9 cycle.

I considered also defining a __free() helper similar to __free(mutex),
but I think "__free(device)" would be a surprising name for something
that drops a lock. Also, I like the syntax of guard(device) over
something like guard(device_lock) since a 'struct device *' is the
argument, not a lock type, but I'm open to your or Peter's thoughts on
the naming.
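
For illustration, a conversion of the goto-unlock pattern might look like
this (hypothetical driver code; foo_reset() and do_reset() are made up, and
the sketch assumes guard()/DEFINE_GUARD() from linux/cleanup.h):

  #include <linux/cleanup.h>
  #include <linux/device.h>

  static int foo_reset(struct device *dev)
  {
          guard(device)(dev);     /* takes device_lock(dev) here and runs
                                     device_unlock(dev) automatically on
                                     every return path */

          if (!dev->driver)
                  return -ENODEV;

          return do_reset(dev);   /* hypothetical helper */
  }

The early return no longer needs a "goto unlock;" label, which is the class
of bugs the guard is meant to remove.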

 include/linux/device.h |2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index d7a72a8749ea..6c83294395ac 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1007,6 +1007,8 @@ static inline void device_unlock(struct device *dev)
mutex_unlock(&dev->mutex);
 }
 
+DEFINE_GUARD(device, struct device *, device_lock(_T), device_unlock(_T))
+
 static inline void device_lock_assert(struct device *dev)
 {
lockdep_assert_held(&dev->mutex);




Re: [PATCH v3] trace/kprobe: Display the actual notrace function when rejecting a probe

2023-12-13 Thread Google
On Wed, 13 Dec 2023 20:09:14 +0530
Naveen N Rao  wrote:

> Trying to probe update_sd_lb_stats() using perf results in the below
> message in the kernel log:
>   trace_kprobe: Could not probe notrace function _text
> 
> This is because 'perf probe' specifies the kprobe location as an offset
> from '_text':
>   $ sudo perf probe -D update_sd_lb_stats
>   p:probe/update_sd_lb_stats _text+1830728
> 
> However, the error message is misleading and doesn't help convey the
> actual notrace function that is being probed. Fix this by looking up the
> actual function name that is being probed. With this fix, we now get the
> below message in the kernel log:
>   trace_kprobe: Could not probe notrace function 
> update_sd_lb_stats.constprop.0
> 
> Signed-off-by: Naveen N Rao 
> ---
> v3: Remove tk parameter from within_notrace_func() as suggested by 
> Masami
> 
>  kernel/trace/trace_kprobe.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index 3d7a180a8427..dc36c6ed6131 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -449,9 +449,8 @@ static bool __within_notrace_func(unsigned long addr)
>   return !ftrace_location_range(addr, addr + size - 1);
>  }
>  
> -static bool within_notrace_func(struct trace_kprobe *tk)
> +static bool within_notrace_func(unsigned long addr)
>  {
> - unsigned long addr = trace_kprobe_address(tk);
>   char symname[KSYM_NAME_LEN], *p;
>  
>   if (!__within_notrace_func(addr))
> @@ -471,12 +470,14 @@ static bool within_notrace_func(struct trace_kprobe *tk)
>   return true;
>  }
>  #else
> -#define within_notrace_func(tk)  (false)
> +#define within_notrace_func(addr)(false)
>  #endif
>  
>  /* Internal register function - just handle k*probes and flags */
>  static int __register_trace_kprobe(struct trace_kprobe *tk)
>  {
> + unsigned long addr = trace_kprobe_address(tk);
> + char symname[KSYM_NAME_LEN];
>   int i, ret;
>  
>   ret = security_locked_down(LOCKDOWN_KPROBES);
> @@ -486,9 +487,9 @@ static int __register_trace_kprobe(struct trace_kprobe 
> *tk)
>   if (trace_kprobe_is_registered(tk))
>   return -EINVAL;
>  
> - if (within_notrace_func(tk)) {
> + if (within_notrace_func(addr)) {
>   pr_warn("Could not probe notrace function %s\n",
> - trace_kprobe_symbol(tk));
> + lookup_symbol_name(addr, symname) ? 
> trace_kprobe_symbol(tk) : symname);

Can we just use %ps and (void *)trace_kprobe_address(tk) here?

That will be simpler.
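
For illustration, the suggestion would amount to something like this
(sketch only, not necessarily what gets merged), dropping the
lookup_symbol_name() call and the symname buffer entirely:

  if (within_notrace_func(addr)) {
          pr_warn("Could not probe notrace function %ps\n",
                  (void *)trace_kprobe_address(tk));
          return -EINVAL;
  }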

Thank you,

>   return -EINVAL;
>   }
>  
> 
> base-commit: 4758560fa268cecfa1144f015aa9f2525d164b7e
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] rpmsg: glink: Fix buffer overflow

2023-12-13 Thread Bjorn Andersson
On Wed, Dec 13, 2023 at 03:15:15PM +, Hardevsinh Palaniya wrote:
>Hello [1]@Bjorn Andersson,

Please use appropriate mail list etiquette, and avoid HTML and
top-posting in your responses.

> 
>"strscpy_pad" itself takes care of null-terminated strings. So, there
>will be no leak.

Your strscpy_pad() will NUL-terminate and zero-pad up to 32 bytes.
Following this the next line will write strlen(name) + sizeof(hdr) bytes
to the FIFO.

So if strlen(name) >= 32 you will read beyond the end of the zero-padded
string.
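
One way to keep the advertised length consistent with what actually fits in
the message might be to reject oversized names up front (sketch only, not a
patch from this thread; names follow qcom_glink_send_open_req()):

  if (strlen(channel->name) >= sizeof(req.name))
          return -EINVAL;         /* refuse names that would be truncated */
  strscpy_pad(req.name, channel->name, sizeof(req.name));

so that the name_len used for the FIFO write can never exceed the
zero-padded buffer.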

Regards,
Bjorn

>  __
> 
>From: Bjorn Andersson 
>Sent: Tuesday, December 12, 2023 12:10 AM
>To: Hardevsinh Palaniya 
>Cc: agr...@kernel.org ; anders...@kernel.org
>; Konrad Dybcio ;
>Mathieu Poirier ;
>linux-arm-...@vger.kernel.org ;
>linux-remotep...@vger.kernel.org ;
>linux-kernel@vger.kernel.org 
>Subject: Re: [PATCH] rpmsg: glink: Fix buffer overflow
> 
>On Mon, Dec 11, 2023 at 09:32:20PM +0530, Hardevsinh Palaniya wrote:
>> In qcom_glink_send_open_req() remove error: strcpy() 'channel->name'
>> too large for 'req.name' (1010102 vs 32)
>>
>As far as I can tell, channel->name comes from the struct
>rpmsg_channel_info->name, which is a 32-byte array, and all code paths
>I
>can find either uses strscpy() or explicitly NUL-terminates this
>string.
>I'm curious to know which path took us here.
>> Signed-off-by: Hardevsinh Palaniya
>
>>
>> diff --git a/drivers/rpmsg/qcom_glink_native.c
>b/drivers/rpmsg/qcom_glink_native.c
>> index 82d460ff4777..2d6a592e1c72 100644
>> --- a/drivers/rpmsg/qcom_glink_native.c
>> +++ b/drivers/rpmsg/qcom_glink_native.c
>> @@ -479,7 +479,7 @@ static int qcom_glink_send_open_req(struct
>qcom_glink *glink,
>>req.msg.cmd = cpu_to_le16(GLINK_CMD_OPEN);
>>req.msg.param1 = cpu_to_le16(channel->lcid);
>>req.msg.param2 = cpu_to_le32(name_len);
>> - strcpy(req.name, channel->name);
>> + strscpy_pad(req.name, channel->name, sizeof(req.name));
>I think this patch is incomplete. While it makes sure we don't
>overwrite
>the stack. name_len is strlen(channel->name) + 1 and the amount of data
>sent out is based on name_len.
>As such, if you can get here with a @name of arbitrary length, then you
>can control how much of the stack we're going to now leak to the
>recipient.
>Regards,
>Bjorn
>>
>>    ret = qcom_glink_tx(glink, &req, req_len, NULL, 0, true);
>>if (ret)
>> --
>> 2.25.1
>>
>>
> 
> References
> 
>1. mailto:quic_bjora...@quicinc.com



[PATCH] ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI

2023-12-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

As the ring buffer recording requires cmpxchg() to work, if the
architecture does not support cmpxchg in NMI, then do not do any recording
within an NMI.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8f8887f025c9..caaffcdc6350 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3719,6 +3719,12 @@ rb_reserve_next_event(struct trace_buffer *buffer,
int nr_loops = 0;
int add_ts_default;
 
+   /* ring buffer does cmpxchg, make sure it is safe in NMI context */
+   if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) &&
+   (unlikely(in_nmi()))) {
+   return NULL;
+   }
+
rb_start_commit(cpu_buffer);
/* The commit page can not change after this */
 
-- 
2.42.0




[PATCH 0/2] ARM: dts: qcom: msm8926-motorola-peregrine: Add initial device tree

2023-12-13 Thread André Apitzsch
This dts adds support for Motorola Moto G 4G released in 2013.

Add a device tree with initial support for:

- GPIO keys
- Hall sensor
- SDHCI
- Vibrator

Signed-off-by: André Apitzsch 
---
André Apitzsch (2):
  dt-bindings: arm: qcom: Add Motorola Moto G 4G
  ARM: dts: qcom: msm8926-motorola-peregrine: Add initial device tree

 Documentation/devicetree/bindings/arm/qcom.yaml|   1 +
 arch/arm/boot/dts/qcom/Makefile|   1 +
 .../dts/qcom/qcom-msm8926-motorola-peregrine.dts   | 297 +
 3 files changed, 299 insertions(+)
---
base-commit: 48e8992e33abf054bcc0bb2e77b2d43bb899212e
change-id: 20231208-peregrine-902ab2fb2cf5

Best regards,
-- 
André Apitzsch 




Re: [PATCH 0/2] ARM: dts: qcom: msm8926-motorola-peregrine: Add initial device tree

2023-12-13 Thread Konrad Dybcio




On 12/13/23 21:33, André Apitzsch wrote:

This dts adds support for Motorola Moto G 4G released in 2013.

I have a similar one in my drawer.. not the 4g kind, titan IIRC?
Wasn't this one codenamed thea?

Konrad



[PATCH 1/2] dt-bindings: arm: qcom: Add Motorola Moto G 4G

2023-12-13 Thread André Apitzsch
Document the compatible for the MSM8926-based Motorola Moto G 4G smartphone.

Signed-off-by: André Apitzsch 
---
 Documentation/devicetree/bindings/arm/qcom.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/arm/qcom.yaml 
b/Documentation/devicetree/bindings/arm/qcom.yaml
index c968412d86b8..91dc3205d09e 100644
--- a/Documentation/devicetree/bindings/arm/qcom.yaml
+++ b/Documentation/devicetree/bindings/arm/qcom.yaml
@@ -185,6 +185,7 @@ properties:
   - enum:
   - microsoft,superman-lte
   - microsoft,tesla
+  - motorola,peregrine
   - const: qcom,msm8926
   - const: qcom,msm8226
 

-- 
2.43.0




Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory

2023-12-13 Thread Rob Herring
On Wed, Dec 13, 2023 at 11:44 AM Alexandru Elisei
 wrote:
>
> On Wed, Dec 13, 2023 at 11:22:17AM -0600, Rob Herring wrote:
> > On Wed, Dec 13, 2023 at 8:51 AM Alexandru Elisei
> >  wrote:
> > >
> > > Hi,
> > >
> > > On Wed, Dec 13, 2023 at 08:06:44AM -0600, Rob Herring wrote:
> > > > On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
> > > >  wrote:
> > > > >
> > > > > Hi Rob,
> > > > >
> > > > > On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > > > > > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > > > > >  wrote:
> > > > > > >
> > > > > > > Hi Rob,
> > > > > > >
> > > > > > > Thank you so much for the feedback, I'm not very familiar with 
> > > > > > > device tree,
> > > > > > > and any comments are very useful.
> > > > > > >
> > > > > > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > > > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > Allow the kernel to get the size and location of the MTE tag 
> > > > > > > > > storage
> > > > > > > > > regions from the DTB. This memory is marked as reserved for 
> > > > > > > > > now.
> > > > > > > > >
> > > > > > > > > The DTB node for the tag storage region is defined as:
> > > > > > > > >
> > > > > > > > > tags0: tag-storage@8f800 {
> > > > > > > > > compatible = "arm,mte-tag-storage";
> > > > > > > > > reg = <0x08 0xf800 0x00 0x400>;
> > > > > > > > > block-size = <0x1000>;
> > > > > > > > > memory = <>;// Associated tagged 
> > > > > > > > > memory node
> > > > > > > > > };
> > > > > > > >
> > > > > > > > I skimmed thru the discussion some. If this memory range is 
> > > > > > > > within
> > > > > > > > main RAM, then it definitely belongs in /reserved-memory.
> > > > > > >
> > > > > > > Ok, will do that.
> > > > > > >
> > > > > > > If you don't mind, why do you say that it definitely belongs in
> > > > > > > reserved-memory? I'm not trying to argue otherwise, I'm curious 
> > > > > > > about the
> > > > > > > motivation.
> > > > > >
> > > > > > Simply so that /memory nodes describe all possible memory and
> > > > > > /reserved-memory is just adding restrictions. It's also because
> > > > > > /reserved-memory is what gets handled early, and we don't need
> > > > > > multiple things to handle early.
> > > > > >
> > > > > > > Tag storage is not DMA and can live anywhere in memory.
> > > > > >
> > > > > > Then why put it in DT at all? The only reason CMA is there is to set
> > > > > > the size. It's not even clear to me we need CMA in DT either. The
> > > > > > reasoning long ago was the kernel didn't do a good job of moving and
> > > > > > reclaiming contiguous space, but that's supposed to be better now 
> > > > > > (and
> > > > > > most h/w figured out they need IOMMUs).
> > > > > >
> > > > > > But for tag storage you know the size as it is a function of the
> > > > > > memory size, right? After all, you are validating the size is 
> > > > > > correct.
> > > > > > I guess there is still the aspect of whether you want enable MTE or
> > > > > > not which could be done in a variety of ways.
> > > > >
> > > > > Oh, sorry, my bad, I should have been clearer about this. I don't 
> > > > > want to
> > > > > put it in the DT as a "linux,cma" node. But I want it to be managed 
> > > > > by CMA.
> > > >
> > > > Yes, I understand, but my point remains. Why do you need this in DT?
> > > > If the location doesn't matter and you can calculate the size from the
> > > > memory size, what else is there to add to the DT?
> > >
> > > I am afraid there has been a misunderstanding. What do you mean by
> > > "location doesn't matter"?
> >
> > You said:
> > > Tag storage is not DMA and can live anywhere in memory.
> >
> > Which I took as the kernel can figure out where to put it. But maybe
> > you meant the h/w platform can hard code it to be anywhere in memory?
> > If so, then yes, DT is needed.
>
> Ah, I see, sorry for not being clear enough, you are correct: tag storage
> is a hardware property, and software needs a mechanism (in this case, the
> dt) to discover its properties.
>
> >
> > > At the very least, Linux needs to know the address and size of a memory
> > > region to use it. The series is about using the tag storage memory for
> > > data. Tag storage cannot be described as a regular memory node because it
> > > cannot be tagged (and normal memory can).
> >
> > If the tag storage lives in the middle of memory, then it would be
> > described in the memory node, but removed by being in reserved-memory
> > node.
>
> I don't follow. Would you mind going into more details?

It goes back to what I said earlier about /memory nodes describing all
the memory. There's no reason to reserve memory if you haven't
described that range as memory to begin with. One could presumably
just have a memory node for each contiguous chunk and not need
/reserved-memory (ignoring the need to say 

[PATCH v3 00/15] ring-buffer/tracing: Allow ring buffer to have bigger sub buffers

2023-12-13 Thread Steven Rostedt


[
  This is hopefully the last version before I push this to Linux next.
  It's now going through the final phases of my testing.
]

Note, this has been on my todo list since the ring buffer was created back
in 2008.

Tzvetomir last worked on this in 2021 and I need to finally get it in.

His last series was:

  
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-1-tz.stoya...@gmail.com/

With the description of:

   Currently the size of one sub buffer page is global for all buffers and
   it is hard coded to one system page. The patch set introduces configurable
   ring buffer sub page size, per ring buffer. A new user space interface is
   introduced, which allows to change the sub page size of the ftrace buffer,
   per ftrace instance.

I'm pulling in his patches mostly untouched, except that I had to tweak
a few things to forward port them.

The issues I found I added as the last 7 patches to the series, and then
I added documentation and a selftest, and then changed the UI file
from buffer_subbuf_order to buffer_subbuf_size_kb.

Basically, events to the tracing subsystem are limited to just under a
PAGE_SIZE, as the ring buffer is split into "sub buffers" of one page
size, and an event can not be bigger than a sub buffer. This allows users
to change the size of a sub buffer by the order:

  echo 3 > /sys/kernel/tracing/buffer_subbuf_order

[ last patch updates this to:

  echo 32 > /sys/kernel/tracing/buffer_subbuf_size_kb

]
  

Will make each sub buffer a size of 8 pages (32KB), allowing events to be
almost as big as 8 pages in size (sub buffers do have meta data on them as
well, keeping an event from reaching the same size as a sub buffer).

Changes since v2: 
https://lore.kernel.org/all/20231213021914.361709...@goodmis.org/

- Fixed up the subbuf tests to ignore multiple spaces after before the
  'buf' string (same fix as was done for the trace_marker test).

- This time include Cc'ing linux-trace-kernel :-p

Changes since v1: 
https://lore.kernel.org/all/20231210040448.569462...@goodmis.org/

- Add the last patch that changes the ABI from a file called:

  buffer_subbuf_order  to   buffer_subbuf_size_kb

  That is, I kept the old interface the same, but then added the last
  patch that converts the interface into the one that makes more sense.
  I like keeping this in the git history, especially because of the
  way the implementation is.

- I rebased on top of trace/core in the:

git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git

- I made the tests a bit more advanced. Still a smoke test, but it
  now checks if the string written is the same as the string read.


Steven Rostedt (Google) (10):
  ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() 
failure
  ring-buffer: Do no swap cpu buffers if order is different
  ring-buffer: Make sure the spare sub buffer used for reads has same size
  tracing: Update snapshot order along with main buffer order
  tracing: Stop the tracing while changing the ring buffer subbuf size
  ring-buffer: Keep the same size when updating the order
  ring-buffer: Just update the subbuffers when changing their allocation 
order
  ring-buffer: Add documentation on the buffer_subbuf_order file
  ringbuffer/selftest: Add basic selftest to test changing subbuf order
  tracing: Update subbuffer with kilobytes not page order

Tzvetomir Stoyanov (VMware) (5):
  ring-buffer: Refactor ring buffer implementation
  ring-buffer: Page size per ring buffer
  ring-buffer: Add interface for configuring trace sub buffer size
  ring-buffer: Set new size of the ring buffer sub page
  ring-buffer: Read and write to ring buffers with custom sub buffer size


 Documentation/trace/ftrace.rst |  21 ++
 include/linux/ring_buffer.h|  17 +-
 kernel/trace/ring_buffer.c | 409 -
 kernel/trace/ring_buffer_benchmark.c   |  10 +-
 kernel/trace/trace.c   | 155 +++-
 kernel/trace/trace.h   |   1 +
 kernel/trace/trace_events.c|  59 ++-
 .../test.d/00basic/ringbuffer_subbuf_size.tc   |  95 +
 8 files changed, 647 insertions(+), 120 deletions(-)
 create mode 100644 
tools/testing/selftests/ftrace/test.d/00basic/ringbuffer_subbuf_size.tc



[PATCH v3 04/15] ring-buffer: Set new size of the ring buffer sub page

2023-12-13 Thread Steven Rostedt
From: "Tzvetomir Stoyanov (VMware)" 

There are two approaches when changing the size of the ring buffer
sub page:
 1. Destroying all pages and allocating new pages with the new size.
 2. Allocating new pages, copying the content of the old pages before
destroying them.
The first approach is easier, it is selected in the proposed
implementation. Changing the ring buffer sub page size is supposed to
not happen frequently. Usually, that size should be set only once,
when the buffer is not in use yet and is supposed to be empty.

Link: 
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-5-tz.stoya...@gmail.com

Signed-off-by: Tzvetomir Stoyanov (VMware) 
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 80 ++
 1 file changed, 73 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8ba7a45f7c21..6c780d7204ba 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -331,6 +331,7 @@ struct buffer_page {
unsigned read;  /* index for next read */
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
+   unsigned order; /* order of the page */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -361,7 +362,7 @@ static __always_inline unsigned int rb_page_commit(struct 
buffer_page *bpage)
 
 static void free_buffer_page(struct buffer_page *bpage)
 {
-   free_page((unsigned long)bpage->page);
+   free_pages((unsigned long)bpage->page, bpage->order);
kfree(bpage);
 }
 
@@ -1658,10 +1659,12 @@ static int __rb_allocate_pages(struct 
ring_buffer_per_cpu *cpu_buffer,
 
list_add(&bpage->list, pages);
 
-   page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), mflags, 
0);
+   page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), mflags,
+   cpu_buffer->buffer->subbuf_order);
if (!page)
goto free_pages;
bpage->page = page_address(page);
+   bpage->order = cpu_buffer->buffer->subbuf_order;
rb_init_page(bpage->page);
 
if (user_thread && fatal_signal_pending(current))
@@ -1740,7 +1743,8 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
rb_check_bpage(cpu_buffer, bpage);
 
cpu_buffer->reader_page = bpage;
-   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
+
+   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 
cpu_buffer->buffer->subbuf_order);
if (!page)
goto fail_free_reader;
bpage->page = page_address(page);
@@ -1824,6 +1828,7 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long 
size, unsigned flags,
goto fail_free_buffer;
 
/* Default buffer page size - one system page */
+   buffer->subbuf_order = 0;
buffer->subbuf_size = PAGE_SIZE - BUF_PAGE_HDR_SIZE;
 
/* Max payload is buffer page size - header (8bytes) */
@@ -5645,8 +5650,8 @@ void *ring_buffer_alloc_read_page(struct trace_buffer 
*buffer, int cpu)
if (bpage)
goto out;
 
-   page = alloc_pages_node(cpu_to_node(cpu),
-   GFP_KERNEL | __GFP_NORETRY, 0);
+   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_NORETRY,
+   cpu_buffer->buffer->subbuf_order);
if (!page)
return ERR_PTR(-ENOMEM);
 
@@ -5695,7 +5700,7 @@ void ring_buffer_free_read_page(struct trace_buffer 
*buffer, int cpu, void *data
local_irq_restore(flags);
 
  out:
-   free_page((unsigned long)bpage);
+   free_pages((unsigned long)bpage, buffer->subbuf_order);
 }
 EXPORT_SYMBOL_GPL(ring_buffer_free_read_page);
 
@@ -5955,7 +5960,13 @@ EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_get);
  */
 int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
 {
+   struct ring_buffer_per_cpu **cpu_buffers;
+   int old_order, old_size;
+   int nr_pages;
int psize;
+   int bsize;
+   int err;
+   int cpu;
 
if (!buffer || order < 0)
return -EINVAL;
@@ -5967,12 +5978,67 @@ int ring_buffer_subbuf_order_set(struct trace_buffer 
*buffer, int order)
if (psize <= BUF_PAGE_HDR_SIZE)
return -EINVAL;
 
+   bsize = sizeof(void *) * buffer->cpus;
+   cpu_buffers = kzalloc(bsize, GFP_KERNEL);
+   if (!cpu_buffers)
+   return -ENOMEM;
+
+   old_order = buffer->subbuf_order;
+   old_size = buffer->subbuf_size;
+
+   /* prevent another thread from changing buffer sizes */
+   mutex_lock(&buffer->mutex);
+   atomic_inc(&buffer->record_disabled);
+
+   /* Make sure all commits have finished */
+   synchronize_rcu();
+
buffer->subbuf_order = order;
 

[PATCH v3 14/15] ringbuffer/selftest: Add basic selftest to test changing subbuf order

2023-12-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Add a self test that will write into the trace buffer with differ trace
sub buffer order sizes.

Signed-off-by: Steven Rostedt (Google) 
---
 .../ftrace/test.d/00basic/ringbuffer_order.tc | 95 +++
 1 file changed, 95 insertions(+)
 create mode 100644 
tools/testing/selftests/ftrace/test.d/00basic/ringbuffer_order.tc

diff --git a/tools/testing/selftests/ftrace/test.d/00basic/ringbuffer_order.tc 
b/tools/testing/selftests/ftrace/test.d/00basic/ringbuffer_order.tc
new file mode 100644
index ..ecbcc810e6c1
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/00basic/ringbuffer_order.tc
@@ -0,0 +1,95 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Change the ringbuffer sub-buffer order
+# requires: buffer_subbuf_order
+# flags: instance
+
+get_buffer_data_size() {
+   sed -ne 's/^.*data.*size:\([0-9][0-9]*\).*/\1/p' events/header_page
+}
+
+get_buffer_data_offset() {
+   sed -ne 's/^.*data.*offset:\([0-9][0-9]*\).*/\1/p' events/header_page
+}
+
+get_event_header_size() {
+   type_len=`sed -ne 's/^.*type_len.*:[^0-9]*\([0-9][0-9]*\).*/\1/p' 
events/header_event`
+   time_len=`sed -ne 's/^.*time_delta.*:[^0-9]*\([0-9][0-9]*\).*/\1/p' 
events/header_event`
+   array_len=`sed -ne 's/^.*array.*:[^0-9]*\([0-9][0-9]*\).*/\1/p' 
events/header_event`
+   total_bits=$((type_len+time_len+array_len))
+   total_bits=$((total_bits+7))
+   echo $((total_bits/8))
+}
+
+get_print_event_buf_offset() {
+   sed -ne 's/^.*buf.*offset:\([0-9][0-9]*\).*/\1/p' 
events/ftrace/print/format
+}
+
+event_header_size=`get_event_header_size`
+print_header_size=`get_print_event_buf_offset`
+
+data_offset=`get_buffer_data_offset`
+
+marker_meta=$((event_header_size+print_header_size))
+
+make_str() {
+cnt=$1
+   printf -- 'X%.0s' $(seq $cnt)
+}
+
+write_buffer() {
+   size=$1
+
+   str=`make_str $size`
+
+   # clear the buffer
+   echo > trace
+
+   # write the string into the marker
+   echo $str > trace_marker
+
+   echo $str
+}
+
+test_buffer() {
+   orde=$1
+   page_size=$((4096< buffer_subbuf_order
+   test_buffer $a
+done
+
+echo $ORIG > buffer_subbuf_order
+
-- 
2.42.0





[PATCH v3 05/15] ring-buffer: Read and write to ring buffers with custom sub buffer size

2023-12-13 Thread Steven Rostedt
From: "Tzvetomir Stoyanov (VMware)" 

As the size of the ring sub buffer page can be changed dynamically,
the logic that reads and writes to the buffer should be fixed to take
that into account. Some internal ring buffer APIs are changed:
 ring_buffer_alloc_read_page()
 ring_buffer_free_read_page()
 ring_buffer_read_page()
A new API is introduced:
 ring_buffer_read_page_data()

Link: 
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-6-tz.stoya...@gmail.com

Signed-off-by: Tzvetomir Stoyanov (VMware) 
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h  | 11 ++--
 kernel/trace/ring_buffer.c   | 75 
 kernel/trace/ring_buffer_benchmark.c | 10 ++--
 kernel/trace/trace.c | 34 +++--
 4 files changed, 89 insertions(+), 41 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 12573306b889..fa802db216f9 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -192,10 +192,15 @@ bool ring_buffer_time_stamp_abs(struct trace_buffer 
*buffer);
 size_t ring_buffer_nr_pages(struct trace_buffer *buffer, int cpu);
 size_t ring_buffer_nr_dirty_pages(struct trace_buffer *buffer, int cpu);
 
-void *ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu);
-void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu, void 
*data);
-int ring_buffer_read_page(struct trace_buffer *buffer, void **data_page,
+struct buffer_data_read_page;
+struct buffer_data_read_page *
+ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu);
+void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu,
+   struct buffer_data_read_page *page);
+int ring_buffer_read_page(struct trace_buffer *buffer,
+ struct buffer_data_read_page *data_page,
  size_t len, int cpu, int full);
+void *ring_buffer_read_page_data(struct buffer_data_read_page *page);
 
 struct trace_seq;
 
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 6c780d7204ba..729fd4f07c1e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -317,6 +317,11 @@ struct buffer_data_page {
unsigned chardata[] RB_ALIGN_DATA;  /* data of buffer page */
 };
 
+struct buffer_data_read_page {
+   unsignedorder;  /* order of the page */
+   struct buffer_data_page *data;  /* actual data, stored in this page */
+};
+
 /*
  * Note, the buffer_page list must be first. The buffer pages
  * are allocated in cache lines, which means that each buffer
@@ -5625,40 +5630,48 @@ EXPORT_SYMBOL_GPL(ring_buffer_swap_cpu);
  * Returns:
  *  The page allocated, or ERR_PTR
  */
-void *ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu)
+struct buffer_data_read_page *
+ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu)
 {
struct ring_buffer_per_cpu *cpu_buffer;
-   struct buffer_data_page *bpage = NULL;
+   struct buffer_data_read_page *bpage = NULL;
unsigned long flags;
struct page *page;
 
if (!cpumask_test_cpu(cpu, buffer->cpumask))
return ERR_PTR(-ENODEV);
 
+   bpage = kzalloc(sizeof(*bpage), GFP_KERNEL);
+   if (!bpage)
+   return ERR_PTR(-ENOMEM);
+
+   bpage->order = buffer->subbuf_order;
cpu_buffer = buffer->buffers[cpu];
local_irq_save(flags);
arch_spin_lock(&cpu_buffer->lock);
 
if (cpu_buffer->free_page) {
-   bpage = cpu_buffer->free_page;
+   bpage->data = cpu_buffer->free_page;
cpu_buffer->free_page = NULL;
}
 
arch_spin_unlock(&cpu_buffer->lock);
local_irq_restore(flags);
 
-   if (bpage)
+   if (bpage->data)
goto out;
 
page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_NORETRY,
cpu_buffer->buffer->subbuf_order);
-   if (!page)
+   if (!page) {
+   kfree(bpage);
return ERR_PTR(-ENOMEM);
+   }
 
-   bpage = page_address(page);
+   bpage->data = page_address(page);
 
  out:
-   rb_init_page(bpage);
+   rb_init_page(bpage->data);
 
return bpage;
 }
@@ -5668,14 +5681,15 @@ EXPORT_SYMBOL_GPL(ring_buffer_alloc_read_page);
  * ring_buffer_free_read_page - free an allocated read page
  * @buffer: the buffer the page was allocate for
  * @cpu: the cpu buffer the page came from
- * @data: the page to free
+ * @page: the page to free
  *
  * Free a page allocated from ring_buffer_alloc_read_page.
  */
-void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu, void 
*data)
+void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu,
+   struct buffer_data_read_page *data_page)
 {
struct ring_buffer_per_cpu *cpu_buffer;
-   struct buffer_data_page *bpage = data;
+   struct 

[PATCH v3 15/15] tracing: Update subbuffer with kilobytes not page order

2023-12-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Using page order for deciding what the size of the ring buffer sub buffers
are is exposing a bit too much of the implementation. Although the sub
buffers are only allocated in orders of pages, allow the user to specify
the minimum size of each sub-buffer via kilobytes like they can with the
buffer size itself.

If the user specifies 3 via:

  echo 3 > buffer_subbuf_size_kb

Then the sub-buffer size will round up to 4kb (on a 4kb page size system).

If they specify:

  echo 6 > buffer_subbuf_size_kb

The sub-buffer size will become 8kb.

and so on.
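
For reference, the rounding described above boils down to something like the
following (a simplified sketch of the conversion, not the exact trace.c
code):

  #include <linux/kernel.h>
  #include <linux/log2.h>
  #include <linux/mm.h>

  /* Convert a user-supplied sub-buffer size in KB into a page allocation
   * order, rounding up to the next power-of-two number of pages. */
  static int subbuf_kb_to_order(unsigned long kb)
  {
          unsigned long pages = DIV_ROUND_UP(kb * 1024, PAGE_SIZE);

          return pages <= 1 ? 0 : order_base_2(pages);
  }

  /* 3 KB -> order 0 (4 KB), 6 KB -> order 1 (8 KB), 20 KB -> order 3 (32 KB),
   * on a 4 KB page system. */

The write handler then hands the resulting order to
ring_buffer_subbuf_order_set(), just as the earlier buffer_subbuf_order file
did.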

Signed-off-by: Steven Rostedt (Google) 
---
 Documentation/trace/ftrace.rst| 46 ---
 kernel/trace/trace.c  | 38 +--
 ...fer_order.tc => ringbuffer_subbuf_size.tc} | 18 
 3 files changed, 54 insertions(+), 48 deletions(-)
 rename tools/testing/selftests/ftrace/test.d/00basic/{ringbuffer_order.tc => 
ringbuffer_subbuf_size.tc} (85%)

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 231d26ceedb0..933e7efb9f1b 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -203,32 +203,26 @@ of ftrace. Here is a list of some of the key files:
 
This displays the total combined size of all the trace buffers.
 
-  buffer_subbuf_order:
-
-   This sets or displays the sub buffer page size order. The ring buffer
-   is broken up into several same size "sub buffers". An event can not be
-   bigger than the size of the sub buffer. Normally, the sub buffer is
-   the size of the architecture's page (4K on x86). The sub buffer also
-   contains meta data at the start which also limits the size of an event.
-   That means when the sub buffer is a page size, no event can be larger
-   than the page size minus the sub buffer meta data.
-
-   The buffer_subbuf_order allows the user to change the size of the sub
-   buffer. As the sub buffer is a set of pages by the power of 2, thus
-   the sub buffer total size is defined by the order:
-
-   order   size
-   
-   0   PAGE_SIZE
-   1   PAGE_SIZE * 2
-   2   PAGE_SIZE * 4
-   3   PAGE_SIZE * 8
-
-   Changing the order will change the sub buffer size allowing for events
-   to be larger than the page size.
-
-   Note: When changing the order, tracing is stopped and any data in the
-   ring buffer and the snapshot buffer will be discarded.
+  buffer_subbuf_size_kb:
+
+   This sets or displays the sub buffer size. The ring buffer is broken up
+   into several same size "sub buffers". An event can not be bigger than
+   the size of the sub buffer. Normally, the sub buffer is the size of the
+   architecture's page (4K on x86). The sub buffer also contains meta data
+   at the start which also limits the size of an event.  That means when
+   the sub buffer is a page size, no event can be larger than the page
+   size minus the sub buffer meta data.
+
+   Note, the buffer_subbuf_size_kb is a way for the user to specify the
+   minimum size of the subbuffer. The kernel may make it bigger due to the
+   implementation details, or simply fail the operation if the kernel can
+   not handle the request.
+
+   Changing the sub buffer size allows for events to be larger than the
+   page size.
+
+   Note: When changing the sub-buffer size, tracing is stopped and any
+   data in the ring buffer and the snapshot buffer will be discarded.
 
   free_buffer:
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 020350da99e3..b4b2b2307b7b 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9378,42 +9378,54 @@ static const struct file_operations buffer_percent_fops 
= {
 };
 
 static ssize_t
-buffer_order_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t 
*ppos)
+buffer_subbuf_size_read(struct file *filp, char __user *ubuf, size_t cnt, 
loff_t *ppos)
 {
struct trace_array *tr = filp->private_data;
+   size_t size;
char buf[64];
+   int order;
int r;
 
-   r = sprintf(buf, "%d\n", 
ring_buffer_subbuf_order_get(tr->array_buffer.buffer));
+   order = ring_buffer_subbuf_order_get(tr->array_buffer.buffer);
+   size = (PAGE_SIZE << order) / 1024;
+
+   r = sprintf(buf, "%zd\n", size);
 
return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
 }
 
 static ssize_t
-buffer_order_write(struct file *filp, const char __user *ubuf,
-  size_t cnt, loff_t *ppos)
+buffer_subbuf_size_write(struct file *filp, const char __user *ubuf,
+size_t cnt, loff_t *ppos)
 {
struct trace_array *tr = filp->private_data;
unsigned long val;
int old_order;
+   int order;
+   int pages;
int ret;
 
ret = 

[PATCH v3 09/15] tracing: Update snapshot order along with main buffer order

2023-12-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

When updating the order of the sub buffers for the main buffer, make
sure that if the snapshot buffer exists, it gets its order updated as
well.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 45 ++--
 1 file changed, 43 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index dd7669f6503b..658bb1f04e9c 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1263,10 +1263,17 @@ static void set_buffer_entries(struct array_buffer 
*buf, unsigned long val);
 
 int tracing_alloc_snapshot_instance(struct trace_array *tr)
 {
+   int order;
int ret;
 
if (!tr->allocated_snapshot) {
 
+   /* Make the snapshot buffer have the same order as main buffer 
*/
+   order = ring_buffer_subbuf_order_get(tr->array_buffer.buffer);
+   ret = ring_buffer_subbuf_order_set(tr->max_buffer.buffer, 
order);
+   if (ret < 0)
+   return ret;
+
/* allocate spare buffer */
ret = resize_buffer_duplicate_size(>max_buffer,
   >array_buffer, RING_BUFFER_ALL_CPUS);
@@ -1286,6 +1293,7 @@ static void free_snapshot(struct trace_array *tr)
 * The max_tr ring buffer has some state (e.g. ring->clock) and
 * we want preserve it.
 */
+   ring_buffer_subbuf_order_set(tr->max_buffer.buffer, 0);
ring_buffer_resize(tr->max_buffer.buffer, 1, RING_BUFFER_ALL_CPUS);
set_buffer_entries(>max_buffer, 1);
tracing_reset_online_cpus(>max_buffer);
@@ -9387,6 +9395,7 @@ buffer_order_write(struct file *filp, const char __user 
*ubuf,
 {
struct trace_array *tr = filp->private_data;
unsigned long val;
+   int old_order;
int ret;
 
ret = kstrtoul_from_user(ubuf, cnt, 10, );
@@ -9397,12 +9406,44 @@ buffer_order_write(struct file *filp, const char __user 
*ubuf,
if (val < 0 || val > 7)
return -EINVAL;
 
+   old_order = ring_buffer_subbuf_order_get(tr->array_buffer.buffer);
+   if (old_order == val)
+   return 0;
+
ret = ring_buffer_subbuf_order_set(tr->array_buffer.buffer, val);
if (ret)
-   return ret;
+   return 0;
 
-   (*ppos)++;
+#ifdef CONFIG_TRACER_MAX_TRACE
+
+   if (!tr->allocated_snapshot)
+   goto out_max;
 
+   ret = ring_buffer_subbuf_order_set(tr->max_buffer.buffer, val);
+   if (ret) {
+   /* Put back the old order */
+   cnt = ring_buffer_subbuf_order_set(tr->array_buffer.buffer, 
old_order);
+   if (WARN_ON_ONCE(cnt)) {
+   /*
+* AARGH! We are left with different orders!
+* The max buffer is our "snapshot" buffer.
+* When a tracer needs a snapshot (one of the
+* latency tracers), it swaps the max buffer
+* with the saved snap shot. We succeeded to
+* update the order of the main buffer, but failed to
+* update the order of the max buffer. But when we tried
+* to reset the main buffer to the original size, we
+* failed there too. This is very unlikely to
+* happen, but if it does, warn and kill all
+* tracing.
+*/
+   tracing_disabled = 1;
+   }
+   return ret;
+   }
+ out_max:
+#endif
+   (*ppos)++;
return cnt;
 }
 
-- 
2.42.0





[PATCH v3 03/15] ring-buffer: Add interface for configuring trace sub buffer size

2023-12-13 Thread Steven Rostedt
From: "Tzvetomir Stoyanov (VMware)" 

The trace ring buffer sub page size can be configured per trace
instance. A new ftrace file "buffer_subbuf_order" is added to get and
set the size of the ring buffer sub page for the current trace instance.
The size must be a power-of-two multiple of the system page size, which
is why the new interface works with the system page order instead of an
absolute page size: 0 means the ring buffer sub page is equal to 1
system page, and so forth:
0 - 1 system page
1 - 2 system pages
2 - 4 system pages
...
The ring buffer sub page size is limited between 1 and 128 system
pages. The default value is 1 system page.
New ring buffer APIs are introduced:
 ring_buffer_subbuf_order_set()
 ring_buffer_subbuf_order_get()
 ring_buffer_subbuf_size_get()
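
For illustration only (not part of the patch), a kernel-side caller that
already has an allocated trace_buffer could exercise the new API roughly
like this:

	/* Sketch: needs <linux/ring_buffer.h> and <linux/printk.h> */
	static int example_set_subbuf_order(struct trace_buffer *buffer)
	{
		int ret;

		/* order 2 -> each sub buffer spans 4 system pages */
		ret = ring_buffer_subbuf_order_set(buffer, 2);
		if (ret < 0)
			return ret;

		pr_info("subbuf order=%d size=%d bytes\n",
			ring_buffer_subbuf_order_get(buffer),
			ring_buffer_subbuf_size_get(buffer));

		return 0;
	}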

Link: 
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-4-tz.stoya...@gmail.com

Signed-off-by: Tzvetomir Stoyanov (VMware) 
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |  4 ++
 kernel/trace/ring_buffer.c  | 73 +
 kernel/trace/trace.c| 48 
 3 files changed, 125 insertions(+)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index ce46218ce46d..12573306b889 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -202,6 +202,10 @@ struct trace_seq;
 int ring_buffer_print_entry_header(struct trace_seq *s);
 int ring_buffer_print_page_header(struct trace_buffer *buffer, struct 
trace_seq *s);
 
+int ring_buffer_subbuf_order_get(struct trace_buffer *buffer);
+int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order);
+int ring_buffer_subbuf_size_get(struct trace_buffer *buffer);
+
 enum ring_buffer_flags {
RB_FL_OVERWRITE = 1 << 0,
 };
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7ca97da72538..8ba7a45f7c21 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -524,6 +524,7 @@ struct trace_buffer {
booltime_stamp_abs;
 
unsigned intsubbuf_size;
+   unsigned intsubbuf_order;
unsigned intmax_data_size;
 };
 
@@ -5903,6 +5904,78 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 }
 EXPORT_SYMBOL_GPL(ring_buffer_read_page);
 
+/**
+ * ring_buffer_subbuf_size_get - get size of the sub buffer.
+ * @buffer: the buffer to get the sub buffer size from
+ *
+ * Returns size of the sub buffer, in bytes.
+ */
+int ring_buffer_subbuf_size_get(struct trace_buffer *buffer)
+{
+   return buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+}
+EXPORT_SYMBOL_GPL(ring_buffer_subbuf_size_get);
+
+/**
+ * ring_buffer_subbuf_order_get - get order of system sub pages in one buffer 
page.
+ * @buffer: The ring_buffer to get the system sub page order from
+ *
+ * By default, one ring buffer sub page equals to one system page. This 
parameter
+ * is configurable, per ring buffer. The size of the ring buffer sub page can 
be
+ * extended, but must be an order of system page size.
+ *
+ * Returns the order of buffer sub page size, in system pages:
+ * 0 means the sub buffer size is 1 system page and so forth.
+ * In case of an error < 0 is returned.
+ */
+int ring_buffer_subbuf_order_get(struct trace_buffer *buffer)
+{
+   if (!buffer)
+   return -EINVAL;
+
+   return buffer->subbuf_order;
+}
+EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_get);
+
+/**
+ * ring_buffer_subbuf_order_set - set the size of ring buffer sub page.
+ * @buffer: The ring_buffer to set the new page size.
+ * @order: Order of the system pages in one sub buffer page
+ *
+ * By default, one ring buffer pages equals to one system page. This API can be
+ * used to set new size of the ring buffer page. The size must be order of
+ * system page size, that's why the input parameter @order is the order of
+ * system pages that are allocated for one ring buffer page:
+ *  0 - 1 system page
+ *  1 - 2 system pages
+ *  3 - 4 system pages
+ *  ...
+ *
+ * Returns 0 on success or < 0 in case of an error.
+ */
+int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
+{
+   int psize;
+
+   if (!buffer || order < 0)
+   return -EINVAL;
+
+   if (buffer->subbuf_order == order)
+   return 0;
+
+   psize = (1 << order) * PAGE_SIZE;
+   if (psize <= BUF_PAGE_HDR_SIZE)
+   return -EINVAL;
+
+   buffer->subbuf_order = order;
+   buffer->subbuf_size = psize - BUF_PAGE_HDR_SIZE;
+
+   /* Todo: reset the buffer with the new page size */
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set);
+
 /*
  * We only allocate new buffers, never free them if the CPU goes down.
  * If we were to free the buffer, then the user would lose any trace that was 
in
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 276523252766..4df4c74dabc4 100644
--- a/kernel/trace/trace.c
+++ 

[PATCH v3 08/15] ring-buffer: Make sure the spare sub buffer used for reads has same size

2023-12-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Now that the ring buffer specifies the size of its sub buffers, they all
need to be the same size. When doing a read, a swap is done with a spare
page. Make sure they are the same size before doing the swap, otherwise
the read will fail.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2e201ef5b1ce..dd7669f6503b 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7576,6 +7576,7 @@ struct ftrace_buffer_info {
struct trace_iterator   iter;
void*spare;
unsigned intspare_cpu;
+   unsigned intspare_size;
unsigned intread;
 };
 
@@ -8295,6 +8296,15 @@ tracing_buffers_read(struct file *filp, char __user 
*ubuf,
 
page_size = ring_buffer_subbuf_size_get(iter->array_buffer->buffer);
 
+   /* Make sure the spare matches the current sub buffer size */
+   if (info->spare) {
+   if (page_size != info->spare_size) {
+   ring_buffer_free_read_page(iter->array_buffer->buffer,
+  info->spare_cpu, 
info->spare);
+   info->spare = NULL;
+   }
+   }
+
if (!info->spare) {
info->spare = 
ring_buffer_alloc_read_page(iter->array_buffer->buffer,
  iter->cpu_file);
@@ -8303,6 +8313,7 @@ tracing_buffers_read(struct file *filp, char __user *ubuf,
info->spare = NULL;
} else {
info->spare_cpu = iter->cpu_file;
+   info->spare_size = page_size;
}
}
if (!info->spare)
-- 
2.42.0





[PATCH v3 10/15] tracing: Stop the tracing while changing the ring buffer subbuf size

2023-12-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Because the main buffer and the snapshot buffer need to have the same
sub buffer size for some tracers (otherwise the swap will fail and
disable all tracing), the tracers need to be stopped while updating the
sub buffer sizes so that they always see the main and snapshot buffers
with the same sub buffer size.

Fixes: TBD ("ring-buffer: Add interface for configuring trace sub buffer size")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 658bb1f04e9c..020350da99e3 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9406,13 +9406,16 @@ buffer_order_write(struct file *filp, const char __user 
*ubuf,
if (val < 0 || val > 7)
return -EINVAL;
 
+   /* Do not allow tracing while changing the order of the ring buffer */
+   tracing_stop_tr(tr);
+
old_order = ring_buffer_subbuf_order_get(tr->array_buffer.buffer);
if (old_order == val)
-   return 0;
+   goto out;
 
ret = ring_buffer_subbuf_order_set(tr->array_buffer.buffer, val);
if (ret)
-   return 0;
+   goto out;
 
 #ifdef CONFIG_TRACER_MAX_TRACE
 
@@ -9439,11 +9442,15 @@ buffer_order_write(struct file *filp, const char __user 
*ubuf,
 */
tracing_disabled = 1;
}
-   return ret;
+   goto out;
}
  out_max:
 #endif
(*ppos)++;
+ out:
+   if (ret)
+   cnt = ret;
+   tracing_start_tr(tr);
return cnt;
 }
 
-- 
2.42.0





[PATCH v3 12/15] ring-buffer: Just update the subbuffers when changing their allocation order

2023-12-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The ring_buffer_subbuf_order_set() was creating ring_buffer_per_cpu
cpu_buffers with new sub buffers of the updated order, and if they were
all successfully created, then the ring_buffer's per_cpu buffers would
be freed and replaced by them.

The problem is that the freed per_cpu buffers contain state that would be
lost. Running the following commands:

1. # echo 3 > /sys/kernel/tracing/buffer_subbuf_order
2. # echo 0 > /sys/kernel/tracing/tracing_cpumask
3. # echo 1 > /sys/kernel/tracing/snapshot
4. # echo ff > /sys/kernel/tracing/tracing_cpumask
5. # echo test > /sys/kernel/tracing/trace_marker

Would result in:

 -bash: echo: write error: Bad file descriptor

That's because the state of the per_cpu buffers of the snapshot buffer is
lost when the order is changed (the order of a freed snapshot buffer goes
to 0 to save memory, and when the snapshot buffer is allocated again, it
goes back to what the main buffer is).

In operation 2, the snapshot buffers were set to "disable" (as all the
ring buffer's CPUs were disabled).

In operation 3, the snapshot is allocated, and a call to
ring_buffer_subbuf_order_set() replaces the per_cpu buffers, losing the
"record_disable" count.

When it was enabled again, the atomic_dec(&cpu_buffer->record_disable) was
decrementing a zero, setting it to -1. Writing 1 into the snapshot would
swap the snapshot buffer with the main buffer, so now the main buffer is
"disabled", and nothing can write to the ring buffer anymore.

Instead of creating new per_cpu buffers and losing the state of the old
buffers, basically do what the resize does: just allocate new subbuf
pages into the new_pages link list of the per_cpu buffer and, if they all
succeed, replace the old sub buffers with the new ones. This keeps the
per_cpu buffer descriptor intact and, by doing so, keeps its state.
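
A rough sketch of the two-phase pattern described above (illustrative
only, not the exact patch code; declarations and the error path are
omitted):

	/* Phase 1: allocate everything that can fail, for every CPU */
	for_each_buffer_cpu(buffer, cpu) {
		cpu_buffer = buffer->buffers[cpu];
		INIT_LIST_HEAD(&cpu_buffer->new_pages);
		if (__rb_allocate_pages(cpu_buffer, nr_pages, &cpu_buffer->new_pages))
			goto error;	/* nothing has been swapped yet */
	}

	/* Phase 2: nothing can fail now, swap in the new sub buffers */
	for_each_buffer_cpu(buffer, cpu) {
		cpu_buffer = buffer->buffers[cpu];
		rb_head_page_deactivate(cpu_buffer);
		/* free the old pages and splice in cpu_buffer->new_pages */
	}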

Fixes: TBD ("ring-buffer: Set new size of the ring buffer sub page")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 88 ++
 1 file changed, 71 insertions(+), 17 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 86d3cac2a877..e08e646c4277 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5998,11 +5998,11 @@ EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_get);
  */
 int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
 {
-   struct ring_buffer_per_cpu **cpu_buffers;
+   struct ring_buffer_per_cpu *cpu_buffer;
+   struct buffer_page *bpage, *tmp;
int old_order, old_size;
int nr_pages;
int psize;
-   int bsize;
int err;
int cpu;
 
@@ -6016,11 +6016,6 @@ int ring_buffer_subbuf_order_set(struct trace_buffer 
*buffer, int order)
if (psize <= BUF_PAGE_HDR_SIZE)
return -EINVAL;
 
-   bsize = sizeof(void *) * buffer->cpus;
-   cpu_buffers = kzalloc(bsize, GFP_KERNEL);
-   if (!cpu_buffers)
-   return -ENOMEM;
-
old_order = buffer->subbuf_order;
old_size = buffer->subbuf_size;
 
@@ -6036,33 +6031,88 @@ int ring_buffer_subbuf_order_set(struct trace_buffer 
*buffer, int order)
 
/* Make sure all new buffers are allocated, before deleting the old 
ones */
for_each_buffer_cpu(buffer, cpu) {
+
if (!cpumask_test_cpu(cpu, buffer->cpumask))
continue;
 
+   cpu_buffer = buffer->buffers[cpu];
+
/* Update the number of pages to match the new size */
nr_pages = old_size * buffer->buffers[cpu]->nr_pages;
nr_pages = DIV_ROUND_UP(nr_pages, buffer->subbuf_size);
 
-   cpu_buffers[cpu] = rb_allocate_cpu_buffer(buffer, nr_pages, 
cpu);
-   if (!cpu_buffers[cpu]) {
+   /* we need a minimum of two pages */
+   if (nr_pages < 2)
+   nr_pages = 2;
+
+   cpu_buffer->nr_pages_to_update = nr_pages;
+
+   /* Include the reader page */
+   nr_pages++;
+
+   /* Allocate the new size buffer */
+   INIT_LIST_HEAD(_buffer->new_pages);
+   if (__rb_allocate_pages(cpu_buffer, nr_pages,
+   _buffer->new_pages)) {
+   /* not enough memory for new pages */
err = -ENOMEM;
goto error;
}
}
 
for_each_buffer_cpu(buffer, cpu) {
+
if (!cpumask_test_cpu(cpu, buffer->cpumask))
continue;
 
-   rb_free_cpu_buffer(buffer->buffers[cpu]);
-   buffer->buffers[cpu] = cpu_buffers[cpu];
+   cpu_buffer = buffer->buffers[cpu];
+
+   /* Clear the head bit to make the link list normal to read */
+   rb_head_page_deactivate(cpu_buffer);
+
+   /* Now walk the list and free all the 

[PATCH v3 11/15] ring-buffer: Keep the same size when updating the order

2023-12-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The function ring_buffer_subbuf_order_set() just updated the sub-buffers
to the new size, but in doing so it also changed the total size of the
buffer. As the size is determined by nr_pages * subbuf_size, increasing
the subbuf_size without decreasing nr_pages causes the total size of the
buffer to increase.

This broke the latency tracers as the snapshot needs to be the same size
as the main buffer. The size of the snapshot buffer is only expanded when
needed, and because the order is still the same, the size becomes out of
sync with the main buffer, as the main buffer increased in size without
the tracing system knowing.

Calculate the nr_pages to allocate with the new subbuf_size to be
buffer_size / new_subbuf_size.
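
As a purely illustrative, worked example (assuming 4K system pages and a
16 byte sub buffer header): a per-CPU buffer of 512 order-0 sub buffers
(4080 byte payload each, roughly 2MB in total) being switched to order 3
(32752 byte payload) gives:

	nr_pages = DIV_ROUND_UP(old_subbuf_size * old_nr_pages, new_subbuf_size)
	         = DIV_ROUND_UP(4080 * 512, 32752)
	         = 64

so the total buffer stays at roughly 2MB instead of growing eight-fold.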

Fixes: TBD ("ring-buffer: Set new size of the ring buffer sub page")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 2c908b4f3f68..86d3cac2a877 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -6039,7 +6039,10 @@ int ring_buffer_subbuf_order_set(struct trace_buffer 
*buffer, int order)
if (!cpumask_test_cpu(cpu, buffer->cpumask))
continue;
 
-   nr_pages = buffer->buffers[cpu]->nr_pages;
+   /* Update the number of pages to match the new size */
+   nr_pages = old_size * buffer->buffers[cpu]->nr_pages;
+   nr_pages = DIV_ROUND_UP(nr_pages, buffer->subbuf_size);
+
cpu_buffers[cpu] = rb_allocate_cpu_buffer(buffer, nr_pages, 
cpu);
if (!cpu_buffers[cpu]) {
err = -ENOMEM;
-- 
2.42.0





[PATCH v3 02/15] ring-buffer: Page size per ring buffer

2023-12-13 Thread Steven Rostedt
From: "Tzvetomir Stoyanov (VMware)" 

Currently the size of one sub buffer page is global for all buffers and
it is hard coded to one system page. In order to introduce configurable
ring buffer sub page size, the internal logic should be refactored to
work with sub page size per ring buffer.

Link: 
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-3-tz.stoya...@gmail.com

Signed-off-by: Tzvetomir Stoyanov (VMware) 
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |  2 +-
 kernel/trace/ring_buffer.c  | 68 +
 kernel/trace/trace.c|  2 +-
 kernel/trace/trace.h|  1 +
 kernel/trace/trace_events.c | 59 
 5 files changed, 86 insertions(+), 46 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index b1b03b2c0f08..ce46218ce46d 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -200,7 +200,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer, void 
**data_page,
 struct trace_seq;
 
 int ring_buffer_print_entry_header(struct trace_seq *s);
-int ring_buffer_print_page_header(struct trace_seq *s);
+int ring_buffer_print_page_header(struct trace_buffer *buffer, struct 
trace_seq *s);
 
 enum ring_buffer_flags {
RB_FL_OVERWRITE = 1 << 0,
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index eee36f90ae25..7ca97da72538 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -373,11 +373,6 @@ static inline bool test_time_stamp(u64 delta)
return !!(delta & TS_DELTA_TEST);
 }
 
-#define BUF_PAGE_SIZE (PAGE_SIZE - BUF_PAGE_HDR_SIZE)
-
-/* Max payload is BUF_PAGE_SIZE - header (8bytes) */
-#define BUF_MAX_DATA_SIZE (BUF_PAGE_SIZE - (sizeof(u32) * 2))
-
 struct rb_irq_work {
struct irq_work work;
wait_queue_head_t   waiters;
@@ -527,6 +522,9 @@ struct trace_buffer {
 
struct rb_irq_work  irq_work;
booltime_stamp_abs;
+
+   unsigned intsubbuf_size;
+   unsigned intmax_data_size;
 };
 
 struct ring_buffer_iter {
@@ -540,10 +538,11 @@ struct ring_buffer_iter {
u64 read_stamp;
u64 page_stamp;
struct ring_buffer_event*event;
+   size_t  event_size;
int missed_events;
 };
 
-int ring_buffer_print_page_header(struct trace_seq *s)
+int ring_buffer_print_page_header(struct trace_buffer *buffer, struct 
trace_seq *s)
 {
struct buffer_data_page field;
 
@@ -567,7 +566,7 @@ int ring_buffer_print_page_header(struct trace_seq *s)
trace_seq_printf(s, "\tfield: char data;\t"
 "offset:%u;\tsize:%u;\tsigned:%u;\n",
 (unsigned int)offsetof(typeof(field), data),
-(unsigned int)BUF_PAGE_SIZE,
+(unsigned int)buffer->subbuf_size,
 (unsigned int)is_signed_type(char));
 
return !trace_seq_has_overflowed(s);
@@ -1823,7 +1822,13 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long 
size, unsigned flags,
if (!zalloc_cpumask_var(>cpumask, GFP_KERNEL))
goto fail_free_buffer;
 
-   nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+   /* Default buffer page size - one system page */
+   buffer->subbuf_size = PAGE_SIZE - BUF_PAGE_HDR_SIZE;
+
+   /* Max payload is buffer page size - header (8bytes) */
+   buffer->max_data_size = buffer->subbuf_size - (sizeof(u32) * 2);
+
+   nr_pages = DIV_ROUND_UP(size, buffer->subbuf_size);
buffer->flags = flags;
buffer->clock = trace_clock_local;
buffer->reader_lock_key = key;
@@ -2142,7 +2147,7 @@ static void update_pages_handler(struct work_struct *work)
  * @size: the new size.
  * @cpu_id: the cpu buffer to resize
  *
- * Minimum size is 2 * BUF_PAGE_SIZE.
+ * Minimum size is 2 * buffer->subbuf_size.
  *
  * Returns 0 on success and < 0 on failure.
  */
@@ -2164,7 +2169,7 @@ int ring_buffer_resize(struct trace_buffer *buffer, 
unsigned long size,
!cpumask_test_cpu(cpu_id, buffer->cpumask))
return 0;
 
-   nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+   nr_pages = DIV_ROUND_UP(size, buffer->subbuf_size);
 
/* we need a minimum of two pages */
if (nr_pages < 2)
@@ -2411,7 +2416,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
 */
barrier();
 
-   if ((iter->head + length) > commit || length > BUF_PAGE_SIZE)
+   if ((iter->head + length) > commit || length > iter->event_size)
/* Writer corrupted the read? */
goto reset;
 
@@ -2644,6 +2649,7 @@ static inline void
 rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
  unsigned long tail, 

[PATCH v3 13/15] ring-buffer: Add documentation on the buffer_subbuf_order file

2023-12-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Add to the documentation how to use the buffer_subbuf_order file to change
the size and how it affects what events can be added to the ring buffer.

Signed-off-by: Steven Rostedt (Google) 
---
 Documentation/trace/ftrace.rst | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 23572f6697c0..231d26ceedb0 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -203,6 +203,33 @@ of ftrace. Here is a list of some of the key files:
 
This displays the total combined size of all the trace buffers.
 
+  buffer_subbuf_order:
+
+   This sets or displays the sub buffer page size order. The ring buffer
+   is broken up into several same size "sub buffers". An event can not be
+   bigger than the size of the sub buffer. Normally, the sub buffer is
+   the size of the architecture's page (4K on x86). The sub buffer also
+   contains meta data at the start which also limits the size of an event.
+   That means when the sub buffer is a page size, no event can be larger
+   than the page size minus the sub buffer meta data.
+
+   The buffer_subbuf_order allows the user to change the size of the sub
+   buffer. As the sub buffer is a set of pages by the power of 2, thus
+   the sub buffer total size is defined by the order:
+
+   order   size
+   
+   0   PAGE_SIZE
+   1   PAGE_SIZE * 2
+   2   PAGE_SIZE * 4
+   3   PAGE_SIZE * 8
+
+   Changing the order will change the sub buffer size allowing for events
+   to be larger than the page size.
+
+   Note: When changing the order, tracing is stopped and any data in the
+   ring buffer and the snapshot buffer will be discarded.
+
   free_buffer:
 
If a process is performing tracing, and the ring buffer should be
-- 
2.42.0





[PATCH v3 07/15] ring-buffer: Do no swap cpu buffers if order is different

2023-12-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

As the subbuffer order (subbuffer size) must be the same throughout
the ring buffer, check the order of the buffers that are doing a CPU
buffer swap in ring_buffer_swap_cpu() to make sure they are the same.

If they are not the same, then fail to do the swap, otherwise the ring
buffer will think the CPU buffer has a specific subbuffer size when it
does not.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index ea3c217e4e43..2c908b4f3f68 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5559,6 +5559,9 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_a,
if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
goto out;
 
+   if (buffer_a->subbuf_order != buffer_b->subbuf_order)
+   goto out;
+
ret = -EAGAIN;
 
if (atomic_read(_a->record_disabled))
-- 
2.42.0





[PATCH v3 06/15] ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() failure

2023-12-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

On failure to allocate ring buffer pages, the pointer to the CPU buffer
pages is freed, but the pages that were allocated previously were not.
Make sure they are freed too.

Fixes: TBD ("tracing: Set new size of the ring buffer sub page")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 729fd4f07c1e..ea3c217e4e43 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -6069,6 +6069,7 @@ int ring_buffer_subbuf_order_set(struct trace_buffer 
*buffer, int order)
for_each_buffer_cpu(buffer, cpu) {
if (!cpu_buffers[cpu])
continue;
+   rb_free_cpu_buffer(cpu_buffers[cpu]);
kfree(cpu_buffers[cpu]);
}
kfree(cpu_buffers);
-- 
2.42.0





[PATCH v3 01/15] ring-buffer: Refactor ring buffer implementation

2023-12-13 Thread Steven Rostedt
From: "Tzvetomir Stoyanov (VMware)" 

In order to introduce sub-buffer size per ring buffer, some internal
refactoring is needed. As ring_buffer_print_page_header() will depend on
the trace_buffer structure, it is moved after the structure definition.

Link: 
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-2-tz.stoya...@gmail.com

Signed-off-by: Tzvetomir Stoyanov (VMware) 
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 60 +++---
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8f8887f025c9..eee36f90ae25 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -378,36 +378,6 @@ static inline bool test_time_stamp(u64 delta)
 /* Max payload is BUF_PAGE_SIZE - header (8bytes) */
 #define BUF_MAX_DATA_SIZE (BUF_PAGE_SIZE - (sizeof(u32) * 2))
 
-int ring_buffer_print_page_header(struct trace_seq *s)
-{
-   struct buffer_data_page field;
-
-   trace_seq_printf(s, "\tfield: u64 timestamp;\t"
-"offset:0;\tsize:%u;\tsigned:%u;\n",
-(unsigned int)sizeof(field.time_stamp),
-(unsigned int)is_signed_type(u64));
-
-   trace_seq_printf(s, "\tfield: local_t commit;\t"
-"offset:%u;\tsize:%u;\tsigned:%u;\n",
-(unsigned int)offsetof(typeof(field), commit),
-(unsigned int)sizeof(field.commit),
-(unsigned int)is_signed_type(long));
-
-   trace_seq_printf(s, "\tfield: int overwrite;\t"
-"offset:%u;\tsize:%u;\tsigned:%u;\n",
-(unsigned int)offsetof(typeof(field), commit),
-1,
-(unsigned int)is_signed_type(long));
-
-   trace_seq_printf(s, "\tfield: char data;\t"
-"offset:%u;\tsize:%u;\tsigned:%u;\n",
-(unsigned int)offsetof(typeof(field), data),
-(unsigned int)BUF_PAGE_SIZE,
-(unsigned int)is_signed_type(char));
-
-   return !trace_seq_has_overflowed(s);
-}
-
 struct rb_irq_work {
struct irq_work work;
wait_queue_head_t   waiters;
@@ -573,6 +543,36 @@ struct ring_buffer_iter {
int missed_events;
 };
 
+int ring_buffer_print_page_header(struct trace_seq *s)
+{
+   struct buffer_data_page field;
+
+   trace_seq_printf(s, "\tfield: u64 timestamp;\t"
+"offset:0;\tsize:%u;\tsigned:%u;\n",
+(unsigned int)sizeof(field.time_stamp),
+(unsigned int)is_signed_type(u64));
+
+   trace_seq_printf(s, "\tfield: local_t commit;\t"
+"offset:%u;\tsize:%u;\tsigned:%u;\n",
+(unsigned int)offsetof(typeof(field), commit),
+(unsigned int)sizeof(field.commit),
+(unsigned int)is_signed_type(long));
+
+   trace_seq_printf(s, "\tfield: int overwrite;\t"
+"offset:%u;\tsize:%u;\tsigned:%u;\n",
+(unsigned int)offsetof(typeof(field), commit),
+1,
+(unsigned int)is_signed_type(long));
+
+   trace_seq_printf(s, "\tfield: char data;\t"
+"offset:%u;\tsize:%u;\tsigned:%u;\n",
+(unsigned int)offsetof(typeof(field), data),
+(unsigned int)BUF_PAGE_SIZE,
+(unsigned int)is_signed_type(char));
+
+   return !trace_seq_has_overflowed(s);
+}
+
 #ifdef RB_TIME_32
 
 /*
-- 
2.42.0





[PATCH] remoteproc: qcom: q6v5_mpd: Add custom coredump segments

2023-12-13 Thread Vignesh Viswanathan
IPQ9574 and IPQ5332 can have multiple reserved memory regions that need
to be collected as part of the coredump.

Loop through all the regions defined under the reserved-memory node in
the devicetree for the remoteproc and add each one as a custom segment
for the coredump.

Also, update the ELF info for coredump collection for multipd
remoteprocs.

This patch depends on [1], which adds support for the IPQ9574 and IPQ5332
remoteproc q6v5_mpd driver.

[1]: 
https://lore.kernel.org/all/20231110091939.3025413-1-quic_mmani...@quicinc.com/

Signed-off-by: Vignesh Viswanathan 
---
 drivers/remoteproc/qcom_q6v5_mpd.c | 71 +-
 1 file changed, 70 insertions(+), 1 deletion(-)

diff --git a/drivers/remoteproc/qcom_q6v5_mpd.c 
b/drivers/remoteproc/qcom_q6v5_mpd.c
index 839f6a15b88d..901b0b0da8f2 100644
--- a/drivers/remoteproc/qcom_q6v5_mpd.c
+++ b/drivers/remoteproc/qcom_q6v5_mpd.c
@@ -443,11 +443,77 @@ static unsigned long q6_wcss_panic(struct rproc *rproc)
return qcom_q6v5_panic(>q6);
 }
 
+static void q6_wcss_copy_segment(struct rproc *rproc,
+struct rproc_dump_segment *segment,
+void *dest, size_t offset, size_t size)
+{
+   struct q6_wcss *wcss = rproc->priv;
+   struct device *dev = wcss->dev;
+   int base;
+   void *ptr;
+
+   base = segment->da + offset - wcss->mem_reloc;
+
+   if (base < 0 || base + size > wcss->mem_size) {
+   ptr = devm_ioremap_wc(dev, segment->da, segment->size);
+   memcpy(dest, ptr + offset, size);
+   devm_iounmap(dev, ptr);
+   } else {
+   memcpy(dest, wcss->mem_region + offset, size);
+   }
+}
+
+static int q6_wcss_dump_segments(struct rproc *rproc,
+const struct firmware *fw)
+{
+   struct device *dev = rproc->dev.parent;
+   struct reserved_mem *rmem = NULL;
+   struct device_node *node;
+   int num_segs, index = 0;
+   int ret;
+
+   /* Parse through additional reserved memory regions for the rproc
+* and add them to the coredump segments
+*/
+   num_segs = of_count_phandle_with_args(dev->of_node,
+ "memory-region", NULL);
+   while (index < num_segs) {
+   node = of_parse_phandle(dev->of_node,
+   "memory-region", index);
+   if (!node)
+   return -EINVAL;
+
+   rmem = of_reserved_mem_lookup(node);
+   if (!rmem) {
+   dev_err(dev, "unable to acquire memory-region index %d 
num_segs %d\n",
+   index, num_segs);
+   return -EINVAL;
+   }
+
+   of_node_put(node);
+
+   dev_dbg(dev, "Adding segment 0x%pa size 0x%pa",
+   >base, >size);
+   ret = rproc_coredump_add_custom_segment(rproc,
+   rmem->base,
+   rmem->size,
+   q6_wcss_copy_segment,
+   NULL);
+   if (ret)
+   return ret;
+
+   index++;
+   }
+
+   return 0;
+}
+
 static const struct rproc_ops wcss_ops = {
.start = wcss_pd_start,
.stop = wcss_pd_stop,
.load = wcss_pd_load,
.get_boot_addr = rproc_elf_get_boot_addr,
+   .parse_fw = q6_wcss_dump_segments,
 };
 
 static const struct rproc_ops q6_wcss_ops = {
@@ -457,6 +523,7 @@ static const struct rproc_ops q6_wcss_ops = {
.load = q6_wcss_load,
.get_boot_addr = rproc_elf_get_boot_addr,
.panic = q6_wcss_panic,
+   .parse_fw = q6_wcss_dump_segments,
 };
 
 static int q6_alloc_memory_region(struct q6_wcss *wcss)
@@ -672,6 +739,7 @@ static int q6_register_userpd(struct platform_device *pdev,
goto free_rproc;
 
rproc->auto_boot = false;
+   rproc_coredump_set_elf_info(rproc, ELFCLASS32, EM_NONE);
ret = rproc_add(rproc);
if (ret)
goto free_rproc;
@@ -731,9 +799,10 @@ static int q6_wcss_probe(struct platform_device *pdev)
goto free_rproc;
 
qcom_add_glink_subdev(rproc, >glink_subdev, "q6wcss");
-   qcom_add_ssr_subdev(rproc, >ssr_subdev, "q6wcss");
+   qcom_add_ssr_subdev(rproc, >ssr_subdev, pdev->name);
 
rproc->auto_boot = false;
+   rproc_coredump_set_elf_info(rproc, ELFCLASS32, EM_NONE);
ret = rproc_add(rproc);
if (ret)
goto free_rproc;
-- 
2.41.0




Re: [PATCH net-next v8 0/4] send credit update during setting SO_RCVLOWAT

2023-12-13 Thread Michael S. Tsirkin
On Wed, Dec 13, 2023 at 08:11:57PM +0300, Arseniy Krasnov wrote:
> 
> 
> On 13.12.2023 18:13, Michael S. Tsirkin wrote:
> > On Wed, Dec 13, 2023 at 10:05:44AM -0500, Michael S. Tsirkin wrote:
> >> On Wed, Dec 13, 2023 at 12:08:27PM +0300, Arseniy Krasnov wrote:
> >>>
> >>>
> >>> On 13.12.2023 11:43, Stefano Garzarella wrote:
>  On Tue, Dec 12, 2023 at 08:43:07PM +0300, Arseniy Krasnov wrote:
> >
> >
> > On 12.12.2023 19:12, Michael S. Tsirkin wrote:
> >> On Tue, Dec 12, 2023 at 06:59:03PM +0300, Arseniy Krasnov wrote:
> >>>
> >>>
> >>> On 12.12.2023 18:54, Michael S. Tsirkin wrote:
>  On Tue, Dec 12, 2023 at 12:16:54AM +0300, Arseniy Krasnov wrote:
> > Hello,
> >
> >    DESCRIPTION
> >
> > This patchset fixes old problem with hungup of both rx/tx sides and 
> > adds
> > test for it. This happens due to non-default SO_RCVLOWAT value and
> > deferred credit update in virtio/vsock. Link to previous old 
> > patchset:
> > https://lore.kernel.org/netdev/39b2e9fd-601b-189d-39a9-914e55745...@sberdevices.ru/
> 
> 
>  Patchset:
> 
>  Acked-by: Michael S. Tsirkin 
> >>>
> >>> Thanks!
> >>>
> 
> 
>  But I worry whether we actually need 3/8 in net not in net-next.
> >>>
> >>> Because of "Fixes" tag ? I think this problem is not critical and 
> >>> reproducible
> >>> only in special cases, but i'm not familiar with netdev process so 
> >>> good, so I don't
> >>> have strong opinion. I guess @Stefano knows better.
> >>>
> >>> Thanks, Arseniy
> >>
> >> Fixes means "if you have that other commit then you need this commit
> >> too". I think as a minimum you need to rearrange patches to make the
> >> fix go in first. We don't want a regression followed by a fix.
> >
> > I see, ok, @Stefano WDYT? I think rearrange doesn't break anything, 
> > because this
> > patch fixes problem that is not related with the new patches from this 
> > patchset.
> 
>  I agree, patch 3 is for sure net material (I'm fine with both 
>  rearrangement or send it separately), but IMHO also patch 2 could be.
>  I think with the same fixes tag, since before commit b89d882dc9fc 
>  ("vsock/virtio: reduce credit update messages") we sent a credit update
>  for every bytes we read, so we should not have this problem, right?
> >>>
> >>> Agree for 2, so I think I can rearrange: two fixes go first, then current 
> >>> 0001, and then tests. And send it as V9 for 'net' only ?
> >>>
> >>> Thanks, Arseniy
> >>
> >>
> >> hmm why not net-next?
> > 
> > Oh I missed your previous discussion. I think everything in net-next is
> > safer.  Having said that, I won't nack it net, either.
> 
> So, summarizing all above:
> 1) This patchset entirely goes to net-next as v9
> 2) I reorder patches like 3 - 2 - 1 - 4, e.g. two fixes goes first with Fixes 
> tag
> 3) Add Acked-by: Michael S. Tsirkin  to each patch
> 
> @Michael, @Stefano ?
> 
> Thanks, Arseniy

Fine by me.

> > 
> 
>  So, maybe all the series could be "net".
> 
>  Thanks,
>  Stefano
> 
> > 




Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory

2023-12-13 Thread Alexandru Elisei
On Wed, Dec 13, 2023 at 11:22:17AM -0600, Rob Herring wrote:
> On Wed, Dec 13, 2023 at 8:51 AM Alexandru Elisei
>  wrote:
> >
> > Hi,
> >
> > On Wed, Dec 13, 2023 at 08:06:44AM -0600, Rob Herring wrote:
> > > On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
> > >  wrote:
> > > >
> > > > Hi Rob,
> > > >
> > > > On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > > > > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > > > >  wrote:
> > > > > >
> > > > > > Hi Rob,
> > > > > >
> > > > > > Thank you so much for the feedback, I'm not very familiar with 
> > > > > > device tree,
> > > > > > and any comments are very useful.
> > > > > >
> > > > > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > Allow the kernel to get the size and location of the MTE tag 
> > > > > > > > storage
> > > > > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > > > > >
> > > > > > > > The DTB node for the tag storage region is defined as:
> > > > > > > >
> > > > > > > > tags0: tag-storage@8f800 {
> > > > > > > > compatible = "arm,mte-tag-storage";
> > > > > > > > reg = <0x08 0xf800 0x00 0x400>;
> > > > > > > > block-size = <0x1000>;
> > > > > > > > memory = <>;// Associated tagged 
> > > > > > > > memory node
> > > > > > > > };
> > > > > > >
> > > > > > > I skimmed thru the discussion some. If this memory range is within
> > > > > > > main RAM, then it definitely belongs in /reserved-memory.
> > > > > >
> > > > > > Ok, will do that.
> > > > > >
> > > > > > If you don't mind, why do you say that it definitely belongs in
> > > > > > reserved-memory? I'm not trying to argue otherwise, I'm curious 
> > > > > > about the
> > > > > > motivation.
> > > > >
> > > > > Simply so that /memory nodes describe all possible memory and
> > > > > /reserved-memory is just adding restrictions. It's also because
> > > > > /reserved-memory is what gets handled early, and we don't need
> > > > > multiple things to handle early.
> > > > >
> > > > > > Tag storage is not DMA and can live anywhere in memory.
> > > > >
> > > > > Then why put it in DT at all? The only reason CMA is there is to set
> > > > > the size. It's not even clear to me we need CMA in DT either. The
> > > > > reasoning long ago was the kernel didn't do a good job of moving and
> > > > > reclaiming contiguous space, but that's supposed to be better now (and
> > > > > most h/w figured out they need IOMMUs).
> > > > >
> > > > > But for tag storage you know the size as it is a function of the
> > > > > memory size, right? After all, you are validating the size is correct.
> > > > > I guess there is still the aspect of whether you want enable MTE or
> > > > > not which could be done in a variety of ways.
> > > >
> > > > Oh, sorry, my bad, I should have been clearer about this. I don't want 
> > > > to
> > > > put it in the DT as a "linux,cma" node. But I want it to be managed by 
> > > > CMA.
> > >
> > > Yes, I understand, but my point remains. Why do you need this in DT?
> > > If the location doesn't matter and you can calculate the size from the
> > > memory size, what else is there to add to the DT?
> >
> > I am afraid there has been a misunderstanding. What do you mean by
> > "location doesn't matter"?
> 
> You said:
> > Tag storage is not DMA and can live anywhere in memory.
> 
> Which I took as the kernel can figure out where to put it. But maybe
> you meant the h/w platform can hard code it to be anywhere in memory?
> If so, then yes, DT is needed.

Ah, I see, sorry for not being clear enough, you are correct: tag storage
is a hardware property, and software needs a mechanism (in this case, the
dt) to discover its properties.

> 
> > At the very least, Linux needs to know the address and size of a memory
> > region to use it. The series is about using the tag storage memory for
> > data. Tag storage cannot be described as a regular memory node because it
> > cannot be tagged (and normal memory can).
> 
> If the tag storage lives in the middle of memory, then it would be
> described in the memory node, but removed by being in reserved-memory
> node.

I don't follow. Would you mind going into more details?

> 
> > Then there's the matter of the tag storage block size (explained in this
> > commit message), and also knowing the memory range for which a tag storage
> > region stores the tags. This is explained in the cover letter.
> 
> Honestly, I just forgot about that part.

I totally understand, there are a lot of things to consider at the same
time.

Thanks,
Alex



Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory

2023-12-13 Thread Rob Herring
On Wed, Dec 13, 2023 at 8:51 AM Alexandru Elisei
 wrote:
>
> Hi,
>
> On Wed, Dec 13, 2023 at 08:06:44AM -0600, Rob Herring wrote:
> > On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
> >  wrote:
> > >
> > > Hi Rob,
> > >
> > > On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > > > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > > >  wrote:
> > > > >
> > > > > Hi Rob,
> > > > >
> > > > > Thank you so much for the feedback, I'm not very familiar with device 
> > > > > tree,
> > > > > and any comments are very useful.
> > > > >
> > > > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > > >  wrote:
> > > > > > >
> > > > > > > Allow the kernel to get the size and location of the MTE tag 
> > > > > > > storage
> > > > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > > > >
> > > > > > > The DTB node for the tag storage region is defined as:
> > > > > > >
> > > > > > > tags0: tag-storage@8f800 {
> > > > > > > compatible = "arm,mte-tag-storage";
> > > > > > > reg = <0x08 0xf800 0x00 0x400>;
> > > > > > > block-size = <0x1000>;
> > > > > > > memory = <>;// Associated tagged 
> > > > > > > memory node
> > > > > > > };
> > > > > >
> > > > > > I skimmed thru the discussion some. If this memory range is within
> > > > > > main RAM, then it definitely belongs in /reserved-memory.
> > > > >
> > > > > Ok, will do that.
> > > > >
> > > > > If you don't mind, why do you say that it definitely belongs in
> > > > > reserved-memory? I'm not trying to argue otherwise, I'm curious about 
> > > > > the
> > > > > motivation.
> > > >
> > > > Simply so that /memory nodes describe all possible memory and
> > > > /reserved-memory is just adding restrictions. It's also because
> > > > /reserved-memory is what gets handled early, and we don't need
> > > > multiple things to handle early.
> > > >
> > > > > Tag storage is not DMA and can live anywhere in memory.
> > > >
> > > > Then why put it in DT at all? The only reason CMA is there is to set
> > > > the size. It's not even clear to me we need CMA in DT either. The
> > > > reasoning long ago was the kernel didn't do a good job of moving and
> > > > reclaiming contiguous space, but that's supposed to be better now (and
> > > > most h/w figured out they need IOMMUs).
> > > >
> > > > But for tag storage you know the size as it is a function of the
> > > > memory size, right? After all, you are validating the size is correct.
> > > > I guess there is still the aspect of whether you want enable MTE or
> > > > not which could be done in a variety of ways.
> > >
> > > Oh, sorry, my bad, I should have been clearer about this. I don't want to
> > > put it in the DT as a "linux,cma" node. But I want it to be managed by 
> > > CMA.
> >
> > Yes, I understand, but my point remains. Why do you need this in DT?
> > If the location doesn't matter and you can calculate the size from the
> > memory size, what else is there to add to the DT?
>
> I am afraid there has been a misunderstanding. What do you mean by
> "location doesn't matter"?

You said:
> Tag storage is not DMA and can live anywhere in memory.

Which I took as the kernel can figure out where to put it. But maybe
you meant the h/w platform can hard code it to be anywhere in memory?
If so, then yes, DT is needed.

> At the very least, Linux needs to know the address and size of a memory
> region to use it. The series is about using the tag storage memory for
> data. Tag storage cannot be described as a regular memory node because it
> cannot be tagged (and normal memory can).

If the tag storage lives in the middle of memory, then it would be
described in the memory node, but removed by being in reserved-memory
node.

> Then there's the matter of the tag storage block size (explained in this
> commit message), and also knowing the memory range for which a tag storage
> region stores the tags. This is explained in the cover letter.

Honestly, I just forgot about that part.

Rob



Re: [PATCH net-next v8 0/4] send credit update during setting SO_RCVLOWAT

2023-12-13 Thread Arseniy Krasnov



On 13.12.2023 18:13, Michael S. Tsirkin wrote:
> On Wed, Dec 13, 2023 at 10:05:44AM -0500, Michael S. Tsirkin wrote:
>> On Wed, Dec 13, 2023 at 12:08:27PM +0300, Arseniy Krasnov wrote:
>>>
>>>
>>> On 13.12.2023 11:43, Stefano Garzarella wrote:
 On Tue, Dec 12, 2023 at 08:43:07PM +0300, Arseniy Krasnov wrote:
>
>
> On 12.12.2023 19:12, Michael S. Tsirkin wrote:
>> On Tue, Dec 12, 2023 at 06:59:03PM +0300, Arseniy Krasnov wrote:
>>>
>>>
>>> On 12.12.2023 18:54, Michael S. Tsirkin wrote:
 On Tue, Dec 12, 2023 at 12:16:54AM +0300, Arseniy Krasnov wrote:
> Hello,
>
>    DESCRIPTION
>
> This patchset fixes old problem with hungup of both rx/tx sides and 
> adds
> test for it. This happens due to non-default SO_RCVLOWAT value and
> deferred credit update in virtio/vsock. Link to previous old patchset:
> https://lore.kernel.org/netdev/39b2e9fd-601b-189d-39a9-914e55745...@sberdevices.ru/


 Patchset:

 Acked-by: Michael S. Tsirkin 
>>>
>>> Thanks!
>>>


 But I worry whether we actually need 3/8 in net not in net-next.
>>>
>>> Because of "Fixes" tag ? I think this problem is not critical and 
>>> reproducible
>>> only in special cases, but i'm not familiar with netdev process so 
>>> good, so I don't
>>> have strong opinion. I guess @Stefano knows better.
>>>
>>> Thanks, Arseniy
>>
>> Fixes means "if you have that other commit then you need this commit
>> too". I think as a minimum you need to rearrange patches to make the
>> fix go in first. We don't want a regression followed by a fix.
>
> I see, ok, @Stefano WDYT? I think rearrange doesn't break anything, 
> because this
> patch fixes problem that is not related with the new patches from this 
> patchset.

 I agree, patch 3 is for sure net material (I'm fine with both 
 rearrangement or send it separately), but IMHO also patch 2 could be.
 I think with the same fixes tag, since before commit b89d882dc9fc 
 ("vsock/virtio: reduce credit update messages") we sent a credit update
 for every bytes we read, so we should not have this problem, right?
>>>
>>> Agree for 2, so I think I can rearrange: two fixes go first, then current 
>>> 0001, and then tests. And send it as V9 for 'net' only ?
>>>
>>> Thanks, Arseniy
>>
>>
>> hmm why not net-next?
> 
> Oh I missed your previous discussion. I think everything in net-next is
> safer.  Having said that, I won't nack it net, either.

So, summarizing all above:
1) This patchset entirely goes to net-next as v9
2) I reorder patches like 3 - 2 - 1 - 4, e.g. two fixes goes first with Fixes 
tag
3) Add Acked-by: Michael S. Tsirkin  to each patch

@Michael, @Stefano ?

Thanks, Arseniy

> 

 So, maybe all the series could be "net".

 Thanks,
 Stefano

> 



Re: [PATCH v4 1/3] Documentatiion/ABI: Add ABI documentation for sys-bus-dax

2023-12-13 Thread Jonathan Cameron
On Tue, 12 Dec 2023 12:08:30 -0700
Vishal Verma  wrote:

> Add the missing sysfs ABI documentation for the device DAX subsystem.
> Various ABI attributes under this have been present since v5.1, and more
> have been added over time. In preparation for adding a new attribute,
> add this file with the historical details.
> 
> Cc: Dan Williams 
> Signed-off-by: Vishal Verma 

Hi Vishal,  One editorial suggestion.

I don't know the interface well enough to do a good review of the content
so leaving that for Dan or others.

> +What:/sys/bus/dax/devices/daxX.Y/mapping[0..N]/start
> +Date:October, 2020
> +KernelVersion:   v5.10
> +Contact: nvd...@lists.linux.dev
> +Description:
> + (RO) A dax device may have multiple constituent discontiguous
> + address ranges. These are represented by the different
> + 'mappingX' subdirectories. The 'start' attribute indicates the
> + start physical address for the given range.

A common option for these files is to have a single entry with two What:
lines.  Here that would avoid duplicating the majority of this text across
the start, end and page_offset entries.  Alternatively you could do an
entry for the mapping[0..N] directory with the shared text, then separate
entries for the 3 files under there.


> +
> +What:/sys/bus/dax/devices/daxX.Y/mapping[0..N]/end
> +Date:October, 2020
> +KernelVersion:   v5.10
> +Contact: nvd...@lists.linux.dev
> +Description:
> + (RO) A dax device may have multiple constituent discontiguous
> + address ranges. These are represented by the different
> + 'mappingX' subdirectories. The 'end' attribute indicates the
> + end physical address for the given range.
> +
> +What:/sys/bus/dax/devices/daxX.Y/mapping[0..N]/page_offset
> +Date:October, 2020
> +KernelVersion:   v5.10
> +Contact: nvd...@lists.linux.dev
> +Description:
> + (RO) A dax device may have multiple constituent discontiguous
> + address ranges. These are represented by the different
> + 'mappingX' subdirectories. The 'page_offset' attribute 
> indicates the
> + offset of the current range in the dax device.




Re: [PATCH v4 2/3] dax/bus: Introduce guard(device) for device_{lock,unlock} flows

2023-12-13 Thread Jonathan Cameron
On Tue, 12 Dec 2023 12:08:31 -0700
Vishal Verma  wrote:

> Introduce a guard(device) macro to lock a 'struct device', and unlock it
> automatically when going out of scope using Scope Based Resource
> Management semantics. A lot of the sysfs attribute writes in
> drivers/dax/bus.c benefit from a cleanup using these, so change these
> where applicable.
> 
> Cc: Joao Martins 
> Suggested-by: Dan Williams 
> Signed-off-by: Vishal Verma 
Hi Vishal,

I'm a big fan of this cleanup.h stuff, so I'm very happy to see it getting used
here.
There are added opportunities for cleanup that result.

Note that almost every time I see people using this stuff, they don't look again
at the code after the change and so miss the wider cleanup that it enables. So
you are in good company ;)

Jonathan

> ---
>  include/linux/device.h |   2 +
>  drivers/dax/bus.c  | 109 
> +++--
>  2 files changed, 44 insertions(+), 67 deletions(-)
> 
> diff --git a/include/linux/device.h b/include/linux/device.h
> index d7a72a8749ea..a83efd9ae949 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -1131,6 +1131,8 @@ void set_secondary_fwnode(struct device *dev, struct 
> fwnode_handle *fwnode);
>  void device_set_of_node_from_dev(struct device *dev, const struct device 
> *dev2);
>  void device_set_node(struct device *dev, struct fwnode_handle *fwnode);
>  
> +DEFINE_GUARD(device, struct device *, device_lock(_T), device_unlock(_T))

Nice. I'd expect this to be widely adopted, so maybe to make things less painful
for backporting changes that depend on it, make this a separate trivial patch
rather than having this in here.

> +
>  static inline int dev_num_vf(struct device *dev)
>  {
>   if (dev->bus && dev->bus->num_vf)
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 1ff1ab5fa105..ce1356ac6dc2 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -296,9 +296,8 @@ static ssize_t available_size_show(struct device *dev,
>   struct dax_region *dax_region = dev_get_drvdata(dev);
>   unsigned long long size;
>  
> - device_lock(dev);
> + guard(device)(dev);
>   size = dax_region_avail_size(dax_region);
> - device_unlock(dev);
>  
>   return sprintf(buf, "%llu\n", size);
	return sprintf(buf, "%llu\n", dax_region_avail_size(dax_region));
and perhaps drop the local variable, which adds little?

>  }
> @@ -314,10 +313,9 @@ static ssize_t seed_show(struct device *dev,
>   if (is_static(dax_region))
>   return -EINVAL;
>  
> - device_lock(dev);
> + guard(device)(dev);
>   seed = dax_region->seed;
>   rc = sprintf(buf, "%s\n", seed ? dev_name(seed) : "");

return sprintf();

> - device_unlock(dev);
>  
>   return rc;
>  }
> @@ -333,10 +331,9 @@ static ssize_t create_show(struct device *dev,
>   if (is_static(dax_region))
>   return -EINVAL;
>  
> - device_lock(dev);
> + guard(device)(dev);
>   youngest = dax_region->youngest;
>   rc = sprintf(buf, "%s\n", youngest ? dev_name(youngest) : "");

return sprintf();

> - device_unlock(dev);
>  
>   return rc;
>  }
> @@ -345,7 +342,14 @@ static ssize_t create_store(struct device *dev, struct 
> device_attribute *attr,
>   const char *buf, size_t len)
>  {
>   struct dax_region *dax_region = dev_get_drvdata(dev);
> + struct dev_dax_data data = {
> + .dax_region = dax_region,
> + .size = 0,
> + .id = -1,
> + .memmap_on_memory = false,
> + };
>   unsigned long long avail;
> + struct dev_dax *dev_dax;
>   ssize_t rc;
>   int val;
>  
> @@ -358,38 +362,26 @@ static ssize_t create_store(struct device *dev, struct 
> device_attribute *attr,
>   if (val != 1)
>   return -EINVAL;
>  
> - device_lock(dev);
> + guard(device)(dev);
>   avail = dax_region_avail_size(dax_region);
>   if (avail == 0)
> - rc = -ENOSPC;
> - else {
> - struct dev_dax_data data = {
> - .dax_region = dax_region,
> - .size = 0,
> - .id = -1,
> - .memmap_on_memory = false,
> - };
> - struct dev_dax *dev_dax = devm_create_dev_dax();
> + return -ENOSPC;
>  
> - if (IS_ERR(dev_dax))
> - rc = PTR_ERR(dev_dax);
> - else {
> - /*
> -  * In support of crafting multiple new devices
> -  * simultaneously multiple seeds can be created,
> -  * but only the first one that has not been
> -  * successfully bound is tracked as the region
> -  * seed.
> -  */
> - if (!dax_region->seed)
> - dax_region->seed = _dax->dev;
> - dax_region->youngest = _dax->dev;
> -

Re: [PATCH] nvdimm-btt: simplify code with the scope based resource management

2023-12-13 Thread Dave Jiang



On 12/12/23 20:12, dinghao@zju.edu.cn wrote:
>>
>> On 12/10/23 03:27, Dinghao Liu wrote:
>>> Use the scope based resource management (defined in
>>> linux/cleanup.h) to automate resource lifetime
>>> control on struct btt_sb *super in discover_arenas().
>>>
>>> Signed-off-by: Dinghao Liu 
>>> ---
>>>  drivers/nvdimm/btt.c | 12 
>>>  1 file changed, 4 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
>>> index d5593b0dc700..ff42778b51de 100644
>>> --- a/drivers/nvdimm/btt.c
>>> +++ b/drivers/nvdimm/btt.c
>>> @@ -16,6 +16,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  #include "btt.h"
>>>  #include "nd.h"
>>>  
>>> @@ -847,7 +848,7 @@ static int discover_arenas(struct btt *btt)
>>>  {
>>> int ret = 0;
>>> struct arena_info *arena;
>>> -   struct btt_sb *super;
>>> +   struct btt_sb *super __free(kfree) = NULL;
>>> size_t remaining = btt->rawsize;
>>> u64 cur_nlba = 0;
>>> size_t cur_off = 0;
>>> @@ -860,10 +861,8 @@ static int discover_arenas(struct btt *btt)
>>> while (remaining) {
>>> /* Alloc memory for arena */
>>> arena = alloc_arena(btt, 0, 0, 0);
>>> -   if (!arena) {
>>> -   ret = -ENOMEM;
>>> -   goto out_super;
>>> -   }
>>> +   if (!arena)
>>> +   return -ENOMEM;
>>>  
>>> arena->infooff = cur_off;
>>> ret = btt_info_read(arena, super);
>>> @@ -919,14 +918,11 @@ static int discover_arenas(struct btt *btt)
>>> btt->nlba = cur_nlba;
>>> btt->init_state = INIT_READY;
>>>  
>>> -   kfree(super);
>>> return ret;
>>>  
>>>   out:
>>> kfree(arena);
>>> free_arenas(btt);
>>> - out_super:
>>> -   kfree(super);
>>> return ret;
>>>  }
>>>  
>>
>> I would do the allocation like something below for the first chunk. 
>> Otherwise the rest LGTM. 
>>
>> diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
>> index d5593b0dc700..143921e7f26c 100644
>> --- a/drivers/nvdimm/btt.c
>> +++ b/drivers/nvdimm/btt.c
>> @@ -16,6 +16,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include "btt.h"
>>  #include "nd.h"
>>  
>> @@ -845,25 +846,23 @@ static void parse_arena_meta(struct arena_info *arena, 
>> struct btt_sb *super,
>>  
>>  static int discover_arenas(struct btt *btt)
>>  {
>> +   struct btt_sb *super __free(kfree) =
>> +   kzalloc(sizeof(*super), GFP_KERNEL);
>> int ret = 0;
>> struct arena_info *arena;
>> -   struct btt_sb *super;
>> size_t remaining = btt->rawsize;
>> u64 cur_nlba = 0;
>> size_t cur_off = 0;
>> int num_arenas = 0;
>>  
>> -   super = kzalloc(sizeof(*super), GFP_KERNEL);
>> if (!super)
>> return -ENOMEM;
>>  
>> while (remaining) {
>> /* Alloc memory for arena */
> 
> It's a little strange that we do not check super immediately after allocation.
> How about this:
> 
>  static int discover_arenas(struct btt *btt)
>  {
> int ret = 0;
> struct arena_info *arena;
> -   struct btt_sb *super;
> size_t remaining = btt->rawsize;
> u64 cur_nlba = 0;
> size_t cur_off = 0;
> int num_arenas = 0;
>  
> -   super = kzalloc(sizeof(*super), GFP_KERNEL);
> +   struct btt_sb *super __free(kfree) = 
> +   kzalloc(sizeof(*super), GFP_KERNEL);
> if (!super)
> return -ENOMEM;
>  
> while (remaining) {
>  

That's fine by me
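
For reference, a minimal sketch of the __free(kfree) idiom settled on above, assuming only <linux/cleanup.h> and <linux/slab.h>; the function and the validate() check are illustrative, not the btt code. The point is that the allocation happens at the point of declaration, so every early return frees the buffer automatically and no goto/label is needed.

static int example_probe(void)
{
	void *buf __free(kfree) = kzalloc(4096, GFP_KERNEL);

	if (!buf)
		return -ENOMEM;

	if (!validate(buf))	/* hypothetical check */
		return -EINVAL;	/* buf is kfree()d automatically here */

	return 0;		/* ... and here */
}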



[PATCH v2] tracing: Increase size of trace_marker_raw to max ring buffer entry

2023-12-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

There's no reason to give an arbitrary limit to the size of a raw trace
marker. Just let it be as big as the size that is allowed by the ring
buffer itself.

And there's also no reason to artificially break up the write to
TRACE_BUF_SIZE, as that's not even used.

Reviewed-by: Masami Hiramatsu (Google) 
Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20231209175716.09ac4...@gandalf.local.home

- Moved "if (size > ring_buffer_max_event_size(buffer))" to after
  "buffer" is assigned, otherwise it triggers an uninitialized-variable
  warning. (kernel test robot)

 kernel/trace/trace.c | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e07baf388ab3..95d02804d3ae 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7359,9 +7359,6 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
return written;
 }
 
-/* Limit it for now to 3K (including tag) */
-#define RAW_DATA_MAX_SIZE (1024*3)
-
 static ssize_t
 tracing_mark_raw_write(struct file *filp, const char __user *ubuf,
size_t cnt, loff_t *fpos)
@@ -7383,19 +7380,18 @@ tracing_mark_raw_write(struct file *filp, const char __user *ubuf,
return -EINVAL;
 
/* The marker must at least have a tag id */
-   if (cnt < sizeof(unsigned int) || cnt > RAW_DATA_MAX_SIZE)
+   if (cnt < sizeof(unsigned int))
return -EINVAL;
 
-   if (cnt > TRACE_BUF_SIZE)
-   cnt = TRACE_BUF_SIZE;
-
-   BUILD_BUG_ON(TRACE_BUF_SIZE >= PAGE_SIZE);
-
size = sizeof(*entry) + cnt;
if (cnt < FAULT_SIZE_ID)
size += FAULT_SIZE_ID - cnt;
 
buffer = tr->array_buffer.buffer;
+
+   if (size > ring_buffer_max_event_size(buffer))
+   return -EINVAL;
+
event = __trace_buffer_lock_reserve(buffer, TRACE_RAW_DATA, size,
tracing_gen_ctx());
if (!event)
-- 
2.42.0
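
As a rough illustration of the interface this patch touches: a raw marker write must begin with a four-byte tag id (the cnt < sizeof(unsigned int) check above), and with this change the only remaining upper bound on the write size is the ring buffer's maximum event size. A hedged userspace sketch follows; the tracefs path is the usual mount point and the tag value and payload are made up.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	unsigned int tag = 0x1234;	/* illustrative tag id */
	unsigned char buf[256];
	int fd = open("/sys/kernel/tracing/trace_marker_raw", O_WRONLY);

	if (fd < 0)
		return 1;

	memcpy(buf, &tag, sizeof(tag));				/* tag id first */
	memset(buf + sizeof(tag), 0xab, sizeof(buf) - sizeof(tag));	/* payload */
	write(fd, buf, sizeof(buf));				/* one event per write() */
	close(fd);
	return 0;
}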




Re: [PATCH net-next v8 0/4] send credit update during setting SO_RCVLOWAT

2023-12-13 Thread Michael S. Tsirkin
On Wed, Dec 13, 2023 at 10:05:44AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 13, 2023 at 12:08:27PM +0300, Arseniy Krasnov wrote:
> > 
> > 
> > On 13.12.2023 11:43, Stefano Garzarella wrote:
> > > On Tue, Dec 12, 2023 at 08:43:07PM +0300, Arseniy Krasnov wrote:
> > >>
> > >>
> > >> On 12.12.2023 19:12, Michael S. Tsirkin wrote:
> > >>> On Tue, Dec 12, 2023 at 06:59:03PM +0300, Arseniy Krasnov wrote:
> > 
> > 
> >  On 12.12.2023 18:54, Michael S. Tsirkin wrote:
> > > On Tue, Dec 12, 2023 at 12:16:54AM +0300, Arseniy Krasnov wrote:
> > >> Hello,
> > >>
> > >>    DESCRIPTION
> > >>
> > >> This patchset fixes old problem with hungup of both rx/tx sides and 
> > >> adds
> > >> test for it. This happens due to non-default SO_RCVLOWAT value and
> > >> deferred credit update in virtio/vsock. Link to previous old 
> > >> patchset:
> > >> https://lore.kernel.org/netdev/39b2e9fd-601b-189d-39a9-914e55745...@sberdevices.ru/
> > >
> > >
> > > Patchset:
> > >
> > > Acked-by: Michael S. Tsirkin 
> > 
> >  Thanks!
> > 
> > >
> > >
> > > But I worry whether we actually need 3/8 in net not in net-next.
> > 
> >  Because of "Fixes" tag ? I think this problem is not critical and 
> >  reproducible
> >  only in special cases, but i'm not familiar with netdev process so 
> >  good, so I don't
> >  have strong opinion. I guess @Stefano knows better.
> > 
> >  Thanks, Arseniy
> > >>>
> > >>> Fixes means "if you have that other commit then you need this commit
> > >>> too". I think as a minimum you need to rearrange patches to make the
> > >>> fix go in first. We don't want a regression followed by a fix.
> > >>
> > >> I see, ok, @Stefano WDYT? I think rearrange doesn't break anything, 
> > >> because this
> > >> patch fixes problem that is not related with the new patches from this 
> > >> patchset.
> > > 
> > > I agree, patch 3 is for sure net material (I'm fine with both 
> > > rearrangement or send it separately), but IMHO also patch 2 could be.
> > > I think with the same fixes tag, since before commit b89d882dc9fc 
> > > ("vsock/virtio: reduce credit update messages") we sent a credit update
> > > for every bytes we read, so we should not have this problem, right?
> > 
> > Agree for 2, so I think I can rearrange: two fixes go first, then current 
> > 0001, and then tests. And send it as V9 for 'net' only ?
> > 
> > Thanks, Arseniy
> 
> 
> hmm why not net-next?

Oh I missed your previous discussion. I think everything in net-next is
safer.  Having said that, I won't nack it net, either.

> > > 
> > > So, maybe all the series could be "net".
> > > 
> > > Thanks,
> > > Stefano
> > > 
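
For context, the "non-default SO_RCVLOWAT value" referred to throughout this thread is simply a receive low-water mark set by the application. A minimal userspace sketch on an AF_VSOCK socket; the CID, port and threshold are illustrative, and the setsockopt() is the only part relevant to the fix.

#include <string.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>
#include <unistd.h>

int main(void)
{
	int lowat = 8192;			/* illustrative low-water mark */
	struct sockaddr_vm addr;
	int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

	if (fd < 0)
		return 1;

	memset(&addr, 0, sizeof(addr));
	addr.svm_family = AF_VSOCK;
	addr.svm_cid = VMADDR_CID_HOST;		/* illustrative peer */
	addr.svm_port = 1234;

	/* The socket is not reported readable until at least 'lowat'
	 * bytes are queued -- the situation the deferred credit update
	 * interacts badly with. */
	setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat));
	connect(fd, (struct sockaddr *)&addr, sizeof(addr));
	/* ... recv()/poll() as usual ... */
	close(fd);
	return 0;
}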




Re: [PATCH net-next v8 0/4] send credit update during setting SO_RCVLOWAT

2023-12-13 Thread Michael S. Tsirkin
On Wed, Dec 13, 2023 at 12:08:27PM +0300, Arseniy Krasnov wrote:
> 
> 
> On 13.12.2023 11:43, Stefano Garzarella wrote:
> > On Tue, Dec 12, 2023 at 08:43:07PM +0300, Arseniy Krasnov wrote:
> >>
> >>
> >> On 12.12.2023 19:12, Michael S. Tsirkin wrote:
> >>> On Tue, Dec 12, 2023 at 06:59:03PM +0300, Arseniy Krasnov wrote:
> 
> 
>  On 12.12.2023 18:54, Michael S. Tsirkin wrote:
> > On Tue, Dec 12, 2023 at 12:16:54AM +0300, Arseniy Krasnov wrote:
> >> Hello,
> >>
> >>    DESCRIPTION
> >>
> >> This patchset fixes old problem with hungup of both rx/tx sides and 
> >> adds
> >> test for it. This happens due to non-default SO_RCVLOWAT value and
> >> deferred credit update in virtio/vsock. Link to previous old patchset:
> >> https://lore.kernel.org/netdev/39b2e9fd-601b-189d-39a9-914e55745...@sberdevices.ru/
> >
> >
> > Patchset:
> >
> > Acked-by: Michael S. Tsirkin 
> 
>  Thanks!
> 
> >
> >
> > But I worry whether we actually need 3/8 in net not in net-next.
> 
>  Because of "Fixes" tag ? I think this problem is not critical and 
>  reproducible
>  only in special cases, but i'm not familiar with netdev process so good, 
>  so I don't
>  have strong opinion. I guess @Stefano knows better.
> 
>  Thanks, Arseniy
> >>>
> >>> Fixes means "if you have that other commit then you need this commit
> >>> too". I think as a minimum you need to rearrange patches to make the
> >>> fix go in first. We don't want a regression followed by a fix.
> >>
> >> I see, ok, @Stefano WDYT? I think rearrange doesn't break anything, 
> >> because this
> >> patch fixes problem that is not related with the new patches from this 
> >> patchset.
> > 
> > I agree, patch 3 is for sure net material (I'm fine with both rearrangement 
> > or send it separately), but IMHO also patch 2 could be.
> > I think with the same fixes tag, since before commit b89d882dc9fc 
> > ("vsock/virtio: reduce credit update messages") we sent a credit update
> > for every bytes we read, so we should not have this problem, right?
> 
> Agree for 2, so I think I can rearrange: two fixes go first, then current 
> 0001, and then tests. And send it as V9 for 'net' only ?
> 
> Thanks, Arseniy


hmm why not net-next?

> > 
> > So, maybe all the series could be "net".
> > 
> > Thanks,
> > Stefano
> > 




Re: [PATCH v2 12/33] kmsan: Allow disabling KMSAN checks for the current task

2023-12-13 Thread Ilya Leoshkevich
On Mon, 2023-12-11 at 12:50 +0100, Alexander Potapenko wrote:
> On Tue, Nov 21, 2023 at 11:06 PM Ilya Leoshkevich 
> wrote:
> > 
> > Like for KASAN, it's useful to temporarily disable KMSAN checks
> > around,
> > e.g., redzone accesses. Introduce kmsan_disable_current() and
> > kmsan_enable_current(), which are similar to their KASAN
> > counterparts.
> 
> Initially we used to have this disablement counter in KMSAN, but
> adding it uncontrollably can result in KMSAN not functioning
> properly.
> E.g. forgetting to call kmsan_disable_current() or underflowing the
> counter will break reporting.
> We'd better put this API in include/linux/kmsan.h to indicate it
> should be discouraged.
> 
> > Even though it's not strictly necessary, make them reentrant, in
> > order
> > to match the KASAN behavior.
> 
> Until this becomes strictly necessary, I think we'd better
> KMSAN_WARN_ON if the counter is re-entered.

I encountered a case when we are freeing memory from an interrupt
handler:

[  149.840553] [ cut here ]   
[  149.840649] WARNING: CPU: 1 PID: 181 at mm/kmsan/hooks.c:447 kmsan_disable_current+0x2e/0x40
[  149.840790] Modules linked in: 
[  149.840790] CPU: 1 PID: 181 Comm: (direxec) Tainted: GB   W N 6.7.0-rc5-gd34a4b46f382 #13
[  149.841003] Hardware name: IBM 3931 A01 704 (KVM/Linux) 
[  149.841094] Krnl PSW : 0404c0018000 0197dbc2 (kmsan_disable_current+0x32/0x40)
[  149.841276] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[  149.841420] Krnl GPRS: 0040 96914100 1000 0001
[  149.841518] 036d827daee0 07c97008 80096500 92f4f000
[  149.841617] 036d  0040
[  149.841712] 92f4efc0 0001ff710f60 0193acba 037f0008f710
[  149.841893] Krnl Code: 0197dbb6: eb0018640352   mviy    14436(%r1),0
[  149.841893]            0197dbbc: 07fe           bcr     15,%r14
[  149.841893]           #0197dbbe: af00           mc      0,0
[  149.841893]           >0197dbc2: a7f4fffa       brc     15,0197dbb6
[  149.841893]            0197dbc6: 0700           bcr     0,%r0
[  149.841893]            0197dbc8: 0700           bcr     0,%r0
[  149.841893]            0197dbca: 0700           bcr     0,%r0
[  149.841893]            0197dbcc: 0700           bcr     0,%r0
[  149.842438] Call Trace:
[  149.842510]  [<0197dbc2>] kmsan_disable_current+0x32/0x40 
[  149.842631] ([<0193ac14>] slab_pad_check+0x1d4/0xac0)
[  149.842738]  [<01949222>] free_to_partial_list+0x1d72/0x3b80
[  149.842850]  [<01947066>] __slab_free+0xd86/0x11d0 
[  149.842956]  [<019111e8>] kmem_cache_free+0x15d8/0x25d0 
[  149.843062]  [<00229e3a>] __tlb_remove_table+0x20a/0xa50 
[  149.843174]  [<016c7f98>] tlb_remove_table_rcu+0x98/0x120 
[  149.843291]  [<0083e1c6>] rcu_core+0x15b6/0x54b0 
[  149.843406]  [<069c3c0e>] __do_softirq+0xa1e/0x2178 
[  149.843514]  [<003467b4>] irq_exit_rcu+0x2c4/0x630 
[  149.843623]  [<06949f6e>] do_ext_irq+0x9e/0x120 
[  149.843736]  [<069c18d4>] ext_int_handler+0xc4/0xf0 
[  149.843841]  [<0197e428>] kmsan_get_metadata+0x68/0x280 
[  149.843950]  [<0197e344>] kmsan_get_shadow_origin_ptr+0x74/0xf0
[  149.844071]  [<0197ba3a>] __msan_metadata_ptr_for_load_8+0x2a/0x40
[  149.844192]  [<00184e4a>] unwind_get_return_address+0xda/0x150
[  149.844313]  [<0018fd12>] arch_stack_walk+0x172/0x2f0
[  149.844417]  [<008f1af0>] stack_trace_save+0x100/0x160
[  149.844529]  [<0197af22>] kmsan_internal_chain_origin+0x62/0xe0
[  149.844647]  [<0197c1f0>] __msan_chain_origin+0xd0/0x160 
[  149.844763]  [<068b3ba4>] memchr_inv+0x5b4/0xb20 
[  149.844877]  [<0193e730>] check_bytes_and_report+0xa0/0xd30 
[  149.844986]  [<0193b920>] check_object+0x420/0x17d0 
[  149.845092]  [<0194aa8a>] free_to_partial_list+0x35da/0x3b80
[  149.845202]  [<01947066>] __slab_free+0xd86/0x11d0 
[  149.845308]  [<019111e8>] kmem_cache_free+0x15d8/0x25d0 
[  149.845414]  [<016bc2fe>] exit_mmap+0x87e/0x1200 
[  149.845524]  [<002f315c>] mmput+0x13c/0x5b0 
[  149.845632]  [<01b9d634>] exec_mmap+0xc34/0x1230 
[  149.845744]  [<01b996c2>] begin_new_exec+0xcf2/0x2520 
[  149.845857]  [<01f6a084>] load_elf_binary+0x2364/0x67d0 
[  149.845971]  [<01ba5ba4>] bprm_execve+0x25b4/0x4010 
[  149.846083]  [<01baa7e6>] do_execveat_common+0x2436/0x2600 
[  149.846200]  [<01ba78f8>] 
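
For reference, the usage pattern being discussed mirrors the KASAN pair; a minimal sketch, where scan_redzone() is a hypothetical stand-in for a slab consistency check that deliberately reads poisoned or uninitialized bytes. The trace above appears to show exactly such a section being re-entered from interrupt context (a slab free inside the IRQ path), which is why the re-entrancy question matters.

static void example_slab_pad_check(const void *object, size_t size)
{
	kmsan_disable_current();	/* suppress KMSAN reports for the check below */
	scan_redzone(object, size);	/* hypothetical redzone scan */
	kmsan_enable_current();		/* an unbalanced call here would break reporting */
}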

Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-13 Thread Michael S. Tsirkin
On Wed, Dec 13, 2023 at 01:45:35PM +0100, Tobias Huschle wrote:
> On Wed, Dec 13, 2023 at 07:00:53AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 13, 2023 at 11:37:23AM +0100, Tobias Huschle wrote:
> > > On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> > > > On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > > > > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin  
> > > > > wrote:
> 
> [...]
> > 
> > Apparently schedule is already called?
> > 
> 
> What about this: 
> 
> static int vhost_task_fn(void *data)
> {
>   <...>
>   did_work = vtsk->fn(vtsk->data);  --> this calls vhost_worker if I'm 
> not mistaken
>   if (!did_work)
>   schedule();
>   <...>
> }
> 
> static bool vhost_worker(void *data)
> {
>   struct vhost_worker *worker = data;
>   struct vhost_work *work, *work_next;
>   struct llist_node *node;
> 
>   node = llist_del_all(&worker->work_list);
>   if (node) {
>   <...>
>   llist_for_each_entry_safe(work, work_next, node, node) {
>   <...>
>   }
>   }
> 
>   return !!node;
> }
> 
> The llist_for_each_entry_safe does not actually change the node value, 
> doesn't it?
> 
> If it does not change it, !!node would return 1.
> Thereby skipping the schedule.
> 
> This was changed recently with:
> f9010dbdce91 fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
> 
> It returned a hardcoded 0 before. The commit message explicitly mentions this
> change to make vhost_worker return 1 if it did something.
> 
> Seems indeed like a nasty little side effect caused by EEVDF not scheduling
> the woken up kworker right away.


So we are actually making an effort to be nice.
Documentation/kernel-hacking/hacking.rst says:

If you're doing longer computations: first think userspace. If you
**really** want to do it in kernel you should regularly check if you need
to give up the CPU (remember there is cooperative multitasking per CPU).
Idiom::

cond_resched(); /* Will sleep */


and this is what vhost.c does.

At this point I'm not sure why it's appropriate to call schedule() as opposed to
cond_resched(). Ideas?


-- 
MST
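
As a point of comparison, a sketch of the two calls in a generic kernel worker loop; the loop and the handle_pending_work() helper are illustrative, not the vhost code, and a real worker would pair the sleep with a waitqueue or a wake_up_process() from the producer.

static int example_worker_loop(void *data)
{
	for (;;) {
		bool did_work = handle_pending_work(data);	/* hypothetical */

		if (!did_work) {
			/* Idle: sleep until a new request wakes us. */
			set_current_state(TASK_INTERRUPTIBLE);
			schedule();
			__set_current_state(TASK_RUNNING);
		} else {
			/* Still busy: yield only if something else should run. */
			cond_resched();
		}
	}
	return 0;
}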



