KVM Forum 2013 Save the Date
Hi, where can I get the slides from the 2012 KVM Forum? ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] swap: redirty page if page write fails on swap file
Hi Jerome, On 04/24/2013 05:57 PM, Jerome Marchand wrote: On 04/22/2013 10:37 PM, Andrew Morton wrote: On Wed, 17 Apr 2013 14:11:55 +0200 Jerome Marchand wrote: Since commit 62c230b, swap_writepage() calls direct_IO on swap files. However, in that case the page isn't redirtied if the I/O fails, and is therefore handled afterwards as if it had been successfully written to the swap file, leading to memory corruption when the page is eventually swapped back in. This patch sets the page dirty when direct_IO() fails. It fixes a memory corruption that happened while using swap-over-NFS. ...

--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -222,6 +222,8 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 	if (ret == PAGE_SIZE) {
 		count_vm_event(PSWPOUT);
 		ret = 0;
+	} else {
+		set_page_dirty(page);
 	}
 	return ret;
 }

So what happens to the page now? It remains dirty and the kernel later tries to write it again? Yes. Also, AS_EIO or AS_ENOSPC is set in the address space flags (in this case, swapper_space). After AS_EIO or AS_ENOSPC is set, we can't touch swapper_space any more, correct? And if that write also fails, the page is effectively leaked until process exit? AFAICT, there is no special handling for that page afterwards, so if all subsequent attempts fail, it's indeed going to stay in memory until freed. Jerome

Aside: Mel, __swap_writepage() is fairly hair-raising. It unlocks the page before doing the IO and doesn't set PageWriteback(). Why such an exception from normal handling? Also, what is protecting the page from concurrent reclaim or exit() during the above swap_writepage()? Seems that the code needs a bunch of fixes or a bunch of comments explaining why it is safe and why it has to be this way. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
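The semantics of the fix above can be sketched in user-space C. This is a minimal model with hypothetical names, not the kernel code: swap_writepage() clears the dirty state before issuing the write, so when direct_IO fails the page must be redirtied, otherwise reclaim treats it as clean and the next swap-in reads stale data.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the patch's error handling. */
struct page { bool dirty; int data; };

static int backing_store;   /* simulated swap file block */
static bool io_works;       /* toggle to simulate an I/O failure */

static int direct_io_write(const struct page *p)
{
	if (!io_works)
		return -1;                 /* failed or short write */
	backing_store = p->data;
	return 0;
}

static int swap_writepage(struct page *p)
{
	p->dirty = false;              /* cleared before the I/O is issued */
	if (direct_io_write(p) == 0)
		return 0;
	p->dirty = true;               /* the fix: redirty so writeback retries */
	return -1;
}
```

As the thread notes, there is no further special handling: if every retry fails, the page simply stays dirty in memory until it is freed.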
Re: [RFC PATCH 0/3] Obey mark_page_accessed hint given by filesystems
Hi Mel, On 04/30/2013 12:31 AM, Mel Gorman wrote: Andrew Perepechko reported a problem whereby pages are being prematurely evicted because the mark_page_accessed() hint is ignored for pages that are currently on a pagevec -- http://www.spinics.net/lists/linux-ext4/msg37340.html . Alexey Lyahkov and Robin Dong have also reported problems recently that could be due to hot pages reaching the end of the inactive list too quickly and being reclaimed. Both shrink_active_list and shrink_inactive_list can call lru_add_drain(), so why can't the hot pages be marked active at that time? Rather than addressing this on a per-filesystem basis, this series aims to fix the mark_page_accessed() interface by deferring the choice of which LRU a page is added to until pagevec drain time, and by allowing mark_page_accessed() to call SetPageActive on a pagevec page. This opens some races that I think should be harmless but need double checking. The races and the VM_BUG_ON checks that are removed are all described in patch 2. This series received only very light testing, but it did not immediately blow up, and a debugging patch confirmed that pages are now getting added to the active file LRU list that would previously have been added to the inactive list.

 fs/cachefiles/rdwr.c    | 30 ++--
 fs/nfs/dir.c            |  7 ++
 include/linux/pagevec.h | 34 +--
 mm/swap.c               | 61 -
 mm/vmscan.c             |  3 ---
 5 files changed, 40 insertions(+), 95 deletions(-)
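The idea of the series can be illustrated with a small user-space sketch (illustrative names, not the kernel code): instead of choosing an LRU when a page enters a pagevec, remember only the page, let mark_page_accessed() flip the active flag while the page still sits in the pagevec, and decide which LRU it joins at drain time.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGEVEC_SIZE 14

struct page { bool active; };

struct pagevec {
	unsigned int nr;
	struct page *pages[PAGEVEC_SIZE];
};

static unsigned int nr_active, nr_inactive;   /* stand-ins for the LRU lists */

static void lru_cache_add(struct pagevec *pvec, struct page *page)
{
	pvec->pages[pvec->nr++] = page;           /* LRU not chosen yet */
}

static void mark_page_accessed(struct page *page)
{
	page->active = true;                      /* works even on pagevec pages */
}

static void pagevec_drain(struct pagevec *pvec)
{
	for (unsigned int i = 0; i < pvec->nr; i++) {
		if (pvec->pages[i]->active)
			nr_active++;              /* added straight to the active LRU */
		else
			nr_inactive++;
	}
	pvec->nr = 0;
}
```

In the pre-series kernel the accessed hint was dropped for pagevec pages; here the hint survives until the drain, which is exactly the deferral the cover letter describes.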
Re: OOM-killer and strange RSS value in 3.9-rc7
Hi Christoph, On 04/29/2013 10:49 PM, Christoph Lameter wrote: On Sat, 27 Apr 2013, Will Huck wrote: Hi Christoph, On 04/26/2013 01:17 AM, Christoph Lameter wrote: On Thu, 25 Apr 2013, Han Pingtian wrote: I have enabled "slub_debug" and here is the /sys/kernel/slab/kmalloc-512/alloc_calls contents:

  50 .__alloc_workqueue_key+0x90/0x5d0 age=113630/116957/119419 pid=1-1730 cpus=0,6-8,13,24,26,44,53,57,60,68 nodes=1
  11 .__alloc_workqueue_key+0x16c/0x5d0 age=113814/116733/119419 pid=1-1730 cpus=0,44,68 nodes=1
  13 .add_sysfs_param.isra.2+0x80/0x210 age=115175/117994/118779 pid=1-1342 cpus=0,8,12,24,60 nodes=1
 160 .build_sched_domains+0x108/0xe30 age=119111/119120/119131 pid=1 cpus=0 nodes=1
9000 .alloc_fair_sched_group+0xe4/0x220 age=110549/114471/117357 pid=1-2290 cpus=0-1,5,9-11,13,24,29,33,36,38,40-41,45,48-50,53,56-58,60-63,68-69,72-73,76-77,79 nodes=1
9000 .alloc_fair_sched_group+0x114/0x220 age=110549/114471/117357 pid=1-2290 cpus=0-1,5,9-11,13,24,29,33,36,38,40-41,45,48-50,53,56-58,60-63,68-69,72-73,76-77,79 nodes=1

Could you explain the meaning of age=xx/xx/xx pid=xx-xx cpus=xx here? Age refers to the minimum / average / maximum age of the objects, in ticks. Why monitor the age of an object? pid refers to the range of pids of the processes that were running when the objects were created. cpus are the processors on which kernel threads were running when these objects were allocated.
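A record in that format can be pulled apart mechanically. This is a sketch of parsing the fields Christoph explains (field layout taken from the sample output quoted above):

```c
#include <assert.h>
#include <stdio.h>

/* Parse one slub_debug alloc_calls record of the form:
 *   <count> <symbol>+0x<off>/0x<size> age=<min>/<avg>/<max> pid=... cpus=...
 * where age is the minimum/average/maximum object age in ticks. */
static int parse_alloc_call(const char *line, unsigned long *count, char *sym,
			    unsigned long *amin, unsigned long *aavg,
			    unsigned long *amax)
{
	/* %[^+] stops the symbol name at the +0x<off>/0x<size> part */
	if (sscanf(line, "%lu %[^+]+%*s age=%lu/%lu/%lu",
		   count, sym, amin, aavg, amax) != 5)
		return -1;
	return 0;
}
```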
Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap
Hi Peter, On 04/28/2013 12:00 PM, H. Peter Anvin wrote: Not reserved pages, reserved bits in the page tables (which include all bits beyond the maximum physical address). Thanks for the clarification. When are these reserved bits set? Another question: is there any benefit to configuring UMA as fake NUMA? Will Huck wrote: On 04/28/2013 03:13 AM, Frantisek Hrbata wrote: On Sat, Apr 27, 2013 at 03:00:11PM +0800, Will Huck wrote: On 04/26/2013 11:35 PM, Frantisek Hrbata wrote: On Fri, Apr 26, 2013 at 01:21:28PM +0800, Will Huck wrote: Hi Peter, On 04/02/2013 08:28 PM, Frantisek Hrbata wrote: When CR4.PAE is set, 64-bit PTEs are used (ARCH_PHYS_ADDR_T_64BIT is set for X86_64 || X86_PAE). According to [1] Chapter 4 Paging, some higher bits in a 64-bit PTE are reserved and have to be set to zero. For example, for IA-32e paging and a 4KB page, [1] 4.5 IA-32e Paging: Table 4-19, bits 51-M (MAXPHYADDR) are reserved. So for a CPU with e.g. a 48-bit physical address width, bits 51-48 have to be zero. If one of the reserved bits is set, [1] 4.7 Page-Fault Exceptions, a #PF is generated with the RSVD error code. RSVD flag (bit 3). This flag is 1 if there is no valid translation for the linear address because a reserved bit was set in one of the paging-structure entries used to translate that address. (Because reserved bits are not checked in a paging-structure entry whose P flag is 0, bit 3 of the error code can be set only if bit 0 is also set.) In mmap_mem() the first check is valid_mmap_phys_addr_range(), but it always returns 1 on x86. So it's possible to use any pgoff we want and to set the PTE's reserved bits in remap_pfn_range(). Meaning there is a possibility to use mmap In this case, remap_pfn_range() sets up the mapping and reserved bits for mmio memory, so the mmio memory is already populated, so why trigger a #PF? Hi, I think this is described in the quote above for the RSVD flag.
remap_pfn_range() => page present => touch page => tlb miss => walk through paging structures => reserved bit set => #pf with rsvd flag. A present page can also trigger a #PF? Why? Yes, please see Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, 4.7 PAGE-FAULT EXCEPTIONS: RSVD flag (bit 3). This flag is 1 if there is no valid translation for the linear address because a reserved bit was set in one of the paging-structure entries used to translate that address. (Because reserved bits are not checked in a paging-structure entry whose P flag is 0, bit 3 of the error code can be set only if bit 0 is also set.) Bits reserved in the paging-structure entries are reserved for future functionality. Software developers should be aware that such bits may be used in the future and that a paging-structure entry that causes a page-fault exception on one processor might not do so in the future. I cannot tell you why. I guess this is more a question for some Intel guys. Anyway, this patch is trying to fix the following problem and the "Bad pagetable" oops.

-8<--
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <err.h>

#define die(fmt, ...) err(1, fmt, ##__VA_ARGS__)

/* 1) Find some non system ram in case CONFIG_STRICT_DEVMEM is defined
      $ cat /proc/iomem | grep -v "\(System RAM\|reserved\)"
   2) Find the physical address width
      $ cat /proc/cpuinfo | grep "address sizes"
      PTE bits 51 - M are reserved, where M is the physical address width found in 2)
      Note: step 2) is actually not needed, we can always set just the 51st bit (0x8) */

What's the meaning here? Do you trigger the oops because the address is beyond the maximum address the CPU supports, or because of access to a reserved page? If the latter, I think it's not right. For example, the kernel code/data section is reserved in memory; will a kernel access to it trigger an oops? I don't think so.

/* Set the OFFSET macro to (start of iomem range found in 1)) | (1 << 51),
   for example 0x000a | 0x8 = 0x8000a, where 0x000a is the start of the
   PCI BUS on my laptop */
#define OFFSET 0x8000aLL

int main(int argc, char *argv[])
{
	int fd;
	long ps;
	long pgoff;
	char *map;
	char c;

	ps = sysconf(_SC_PAGE_SIZE);
	if (ps == -1)
		die("cannot get page size");
	fd = open("/dev/mem", O_RDONLY);
	if (fd == -1)
		die("cannot open /dev/mem");
	pgoff = (OFFSET + (ps - 1)) & ~(ps - 1);
	printf("%lx\n", pgoff);
	map = mmap(NULL, ps, PROT_READ, MAP_SHARED, fd, pgoff);
	if (map == MAP_FAILED)
		die("cannot mmap");
	c = map[0];
	if (munmap(map, ps) == -1)
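The pgoff computation in the reproducer rounds the chosen offset up to a page boundary with the classic power-of-two trick. A standalone version of just that step:

```c
#include <assert.h>

/* Round x up to the next multiple of the (power-of-two) page size,
 * as the test program does for its mmap offset. */
static unsigned long page_align_up(unsigned long x, unsigned long page_size)
{
	return (x + (page_size - 1)) & ~(page_size - 1);
}
```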
Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap
[...]
	if (munmap(map, ps) == -1)
		die("cannot munmap");
	if (close(fd) == -1)
		die("cannot close");
	return 0;
}
-8<--
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.814860] pfrsvd: Corrupted page table at address 7f34087c8000
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.817356] PGD 12d0b
Re: OOM-killer and strange RSS value in 3.9-rc7
Hi Christoph, On 04/26/2013 01:17 AM, Christoph Lameter wrote: On Thu, 25 Apr 2013, Han Pingtian wrote: I have enabled "slub_debug" and here is the /sys/kernel/slab/kmalloc-512/alloc_calls contents: [...] Could you explain the meaning of age=xx/xx/xx pid=xx-xx cpus=xx here? ?? Is that normal, to have that amount of sched group allocations?
Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap
[...] A present page can also trigger a #PF? Why? I hope I didn't misunderstand your question. Thanks. [...] on /dev/mem and cause a system panic. It's probably not that serious, because access to /dev/mem is limited and the system has to have panic_on_oops set, but still I think we should check this and return an error. This patch adds the check for x86 when ARCH_PHYS_ADDR_T_64BIT is set, the same way as it is already done in e.g. ioremap.
With this fix mmap returns -EINVAL if the requested phys addr is bigger than the supported phys addr width.

[1] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A

Signed-off-by: Frantisek Hrbata
---
 arch/x86/include/asm/io.h |  4 ++++
 arch/x86/mm/mmap.c        | 13 +++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d8e8eef..39607c6 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -242,6 +242,10 @@ static inline void flush_write_buffers(void)
 #endif
 }
 
+#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
+extern int valid_phys_addr_range(phys_addr_t addr, size_t count);
+extern int valid_mmap_phys_addr_range(unsigned long pfn, size_t count);
+
 #endif /* __KERNEL__ */
 
 extern void native_io_delay(void);
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 845df68..92ec31c 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -31,6 +31,8 @@
 #include
 #include
+#include "physaddr.h"
+
 struct __read_mostly va_alignment va_align = {
 	.flags = -1,
 };
@@ -122,3 +124,14 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
 		mm->unmap_area = arch_unmap_area_topdown;
 	}
 }
+
+int valid_phys_addr_range(phys_addr_t addr, size_t count)
+{
+	return addr + count <= __pa(high_memory);
+}
+
+int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
+{
+	resource_size_t addr = (pfn << PAGE_SHIFT) + count;
+	return phys_addr_valid(addr);
+}
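The check the patch adds can be modeled in user space. This is a sketch, not the kernel code: the mmap is rejected when the end of the requested physical range needs bits above the CPU's physical address width. The 48-bit width and 4 KiB page size are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define PHYS_BITS  48   /* assumed MAXPHYADDR for illustration */
#define PAGE_SHIFT 12   /* 4 KiB pages */

static int phys_addr_valid(uint64_t addr)
{
	return (addr >> PHYS_BITS) == 0;      /* no bits above MAXPHYADDR */
}

static int valid_mmap_phys_addr_range(uint64_t pfn, uint64_t count)
{
	return phys_addr_valid((pfn << PAGE_SHIFT) + count);
}
```

With this in place, an offset like the reproducer's (PTE bit 51 set) fails the check up front instead of landing a reserved bit in the page tables and oopsing on the first access.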
Re: [PATCH] Add a sysctl for numa_balancing.
On 04/25/2013 07:56 AM, Andi Kleen wrote: From: Andi Kleen As discussed earlier, this adds a working sysctl to enable/disable automatic NUMA memory balancing at runtime. This was possible earlier through debugfs, but only with special debugging options set. Also fix the boot message. One offline question: if I configure UMA as fake NUMA, is there a benefit or downside?

Signed-off-by: Andi Kleen
---
 Documentation/sysctl/kernel.txt | 10 ++
 include/linux/sched/sysctl.h    |  4 ++++
 kernel/sched/core.c             | 24 +++-
 kernel/sysctl.c                 | 11 +++
 mm/mempolicy.c                  |  2 +-
 5 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ccd4258..17a7004 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -354,6 +354,16 @@ utilize.
 ==
 
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+TBD someone document the other numa_balancing tunables
+
+==
+
 osrelease, ostype & version:
 
 # cat osrelease
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..e228a1b 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -101,4 +101,8 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp, loff_t *ppos);
 
+extern int sched_numa_balancing(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp, loff_t *ppos);
+
 #endif /* _SCHED_SYSCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..679be74 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1614,7 +1614,29 @@ void set_numabalancing_state(bool enabled)
 	numabalancing_enabled = enabled;
 }
 #endif /* CONFIG_SCHED_DEBUG */
-#endif /* CONFIG_NUMA_BALANCING */
+
+#ifdef CONFIG_PROC_SYSCTL
+int sched_numa_balancing(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int err;
+	int state = numabalancing_enabled;
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &state;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (err < 0)
+		return err;
+	if (write)
+		set_numabalancing_state(state);
+	return err;
+}
+#endif
+#endif
 
 /*
  * fork()/clone()-time setup:
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..94164ac 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -393,6 +393,17 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "numa_balancing",
+		.data		= NULL, /* filled in by handler */
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_numa_balancing,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7431001..7eee646 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2531,7 +2531,7 @@ static void __init check_numabalancing_enable(void)
 	if (nr_node_ids > 1 && !numabalancing_override) {
 		printk(KERN_INFO "Enabling automatic NUMA balancing. "
-			"Configure with numa_balancing= or sysctl");
+			"Configure with numa_balancing= or the kernel.numa_balancing sysctl");
 		set_numabalancing_state(numabalancing_default);
 	}
 }
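Conceptually, the handler works like the following user-space model (illustrative names, not the kernel code): proc_dointvec_minmax() parses the written value into a local copy, rejects anything outside [extra1, extra2] = [0, 1], and only on success is set_numabalancing_state() called to apply the new state.

```c
#include <assert.h>
#include <stdlib.h>

static int numabalancing_enabled;

static int sched_numa_balancing_write(const char *buf)
{
	char *end;
	long state = strtol(buf, &end, 10);

	if (end == buf || state < 0 || state > 1)
		return -1;                    /* -EINVAL from proc_dointvec_minmax */
	numabalancing_enabled = (int)state;   /* set_numabalancing_state() */
	return 0;
}
```

Parsing into a local copy first is the point of the `t.data = &state` dance in the patch: a rejected write must leave the current state untouched.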
Re: [PATCH] slab: Remove unnecessary __builtin_constant_p()
Hi Steven, On 04/18/2013 03:09 AM, Steven Rostedt wrote: The slab.c code has a size check macro that checks the size of the following structs: struct arraycache_init, struct kmem_list3. The index_of() function takes the sizeof() of the above two structs and does an unnecessary __builtin_constant_p() check on it. As sizeof() always evaluates to a compile-time constant, the check is always true. The code is not incorrect, but it adds needless complexity, confuses users, and wastes the time of reviewers, who spend time trying to figure out why __builtin_constant_p() was used. In the normal case, what is __builtin_constant_p() used for? This patch is just a clean up that makes the index_of() code a little bit less complex.

Signed-off-by: Steven Rostedt

diff --git a/mm/slab.c b/mm/slab.c
index 856e4a1..6047900 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -325,9 +325,7 @@ static void cache_reap(struct work_struct *unused);
 static __always_inline int index_of(const size_t size)
 {
 	extern void __bad_size(void);
-
-	if (__builtin_constant_p(size)) {
-		int i = 0;
+	int i = 0;
 
 #define CACHE(x) \
 	if (size <= x) \
@@ -336,9 +334,7 @@ static __always_inline int index_of(const size_t size)
 		i++;
 #include
 #undef CACHE
-	__bad_size();
-	} else
-		__bad_size();
+	__bad_size();
 	return 0;
 }
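To answer the question in the thread: __builtin_constant_p(expr) is a GCC builtin that reports whether expr is known to be a compile-time constant, letting generic code pick a cheaper path for constant arguments. Since sizeof() is always a constant, guarding index_of() with it was redundant. Below is a user-space model of index_of()'s size-to-cache-index mapping, with an illustrative size table standing in for the kernel's CACHE() list:

```c
#include <assert.h>
#include <stddef.h>

static int index_of(size_t size)
{
	/* illustrative stand-in for the kmalloc size table */
	static const size_t cache_sizes[] = { 32, 64, 128, 256 };

	for (int i = 0; i < 4; i++)
		if (size <= cache_sizes[i])
			return i;
	return -1;    /* the kernel calls __bad_size() here */
}
```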
Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority
Hi Rik, On 03/22/2013 11:52 AM, Rik van Riel wrote: On 03/21/2013 08:05 PM, Will Huck wrote: One offline question: how should I understand this in function balance_pgdat?

	/*
	 * Do some background aging of the anon list, to give
	 * pages a chance to be referenced before reclaiming.
	 */
	age_active_anon(zone, &sc);

The anon lrus use a two-handed clock algorithm. New anonymous pages start off on the active anon list. Older anonymous pages get moved to the inactive anon list. The downside of the page cache use-once replacement algorithm is inter-reference distance, correct? Does it have any other downside? What's the downside of the two-handed clock algorithm for anonymous pages? If they get referenced before they reach the end of the inactive anon list, they get moved back to the active list. If we need to swap something out and find a non-referenced page at the end of the inactive anon list, we will swap it out. In order to make good pageout decisions, pages need to stay on the inactive anon list for a longer time, so they have plenty of time to get referenced before the reclaim code looks at them. To achieve that, we will move some active anon pages to the inactive anon list even when we do not want to swap anything out - as long as the inactive anon list is below its target size. Does that make sense?
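Rik's description can be reduced to a toy model (illustrative names, not the kernel code): new anon pages start on the active list, background aging moves them to the inactive list, and a page referenced while inactive is promoted back instead of being swapped out.

```c
#include <assert.h>
#include <stdbool.h>

enum lru_list { LRU_ACTIVE, LRU_INACTIVE };

struct page { enum lru_list lru; bool referenced; };

static void age_active_anon(struct page *p)
{
	if (p->lru == LRU_ACTIVE) {
		p->lru = LRU_INACTIVE;
		p->referenced = false;   /* now it has time to prove itself */
	}
}

/* Reclaim looks at the end of the inactive list; returns true if the
 * page gets swapped out. */
static bool try_to_reclaim(struct page *p)
{
	if (p->referenced) {
		p->lru = LRU_ACTIVE;     /* referenced in time: promote, don't swap */
		p->referenced = false;
		return false;
	}
	return true;                     /* not referenced: swap it out */
}
```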
Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority
Hi Rik, On 03/22/2013 11:52 AM, Rik van Riel wrote: [...] The anon lrus use a two-handed clock algorithm. Why is the algorithm described as a two-handed clock? [...]
Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority
cc Fengguang, On 04/05/2013 08:05 AM, Will Huck wrote: Hi Rik, On 03/22/2013 09:01 PM, Rik van Riel wrote: On 03/22/2013 12:59 AM, Will Huck wrote: Hi Rik, On 03/22/2013 11:56 AM, Will Huck wrote: Hi Rik, On 03/22/2013 11:52 AM, Rik van Riel wrote: On 03/21/2013 08:05 PM, Will Huck wrote: One offline question: how should I understand this in balance_pgdat? /* * Do some background aging of the anon list, to give * pages a chance to be referenced before reclaiming. */ age_active_anon(zone, &sc); The anon lrus use a two-handed clock algorithm. New anonymous pages start off on the active anon list. Older anonymous pages get moved to the inactive anon list. The file lrus also use the two-handed clock algorithm, correct? After reinvestigating the code, the answer is no. But why the difference? I think you are the expert on this question and look forward to your explanation. :-) Anonymous memory has a smaller amount of memory (on the order of system memory), most of which is or has been in a working set at some point. File system cache tends to have two distinct sets. One part is the frequently accessed files; the other part is files that are accessed just once or twice. The file working set needs to be protected from streaming IO. We do this by having new file pages start out on the Is there a streaming IO workload or benchmark? inactive file list, and only promoted to the active file list if they get accessed twice.
Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority
Hi Rik, On 03/22/2013 09:01 PM, Rik van Riel wrote: On 03/22/2013 12:59 AM, Will Huck wrote: Hi Rik, On 03/22/2013 11:56 AM, Will Huck wrote: Hi Rik, On 03/22/2013 11:52 AM, Rik van Riel wrote: On 03/21/2013 08:05 PM, Will Huck wrote: One offline question: how should I understand this in balance_pgdat? /* * Do some background aging of the anon list, to give * pages a chance to be referenced before reclaiming. */ age_active_anon(zone, &sc); The anon lrus use a two-handed clock algorithm. New anonymous pages start off on the active anon list. Older anonymous pages get moved to the inactive anon list. The file lrus also use the two-handed clock algorithm, correct? After reinvestigating the code, the answer is no. But why the difference? I think you are the expert on this question and look forward to your explanation. :-) Anonymous memory has a smaller amount of memory (on the order of system memory), most of which is or has been in a working set at some point. File system cache tends to have two distinct sets. One part is the frequently accessed files; the other part is files that are accessed just once or twice. The file working set needs to be protected from streaming IO. We do this by having new file pages start out on the Is there a streaming IO workload or benchmark? inactive file list, and only promoted to the active file list if they get accessed twice.
Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority
Hi Rik, On 03/22/2013 11:52 AM, Rik van Riel wrote: On 03/21/2013 08:05 PM, Will Huck wrote: One offline question: how should I understand this in balance_pgdat? /* * Do some background aging of the anon list, to give * pages a chance to be referenced before reclaiming. */ age_active_anon(zone, &sc); The anon lrus use a two-handed clock algorithm. New anonymous pages start off on the active anon list. Older anonymous pages get moved to the inactive anon list. The file lrus also use the two-handed clock algorithm, correct? If they get referenced before they reach the end of the inactive anon list, they get moved back to the active list. If we need to swap something out and find a non-referenced page at the end of the inactive anon list, we will swap it out. In order to make good pageout decisions, pages need to stay on the inactive anon list for a longer time, so they have plenty of time to get referenced, before the reclaim code looks at them. To achieve that, we will move some active anon pages to the inactive anon list even when we do not want to swap anything out - as long as the inactive anon list is below its target size. Does that make sense? Makes sense, thanks.
Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority
Hi Rik, On 03/22/2013 11:56 AM, Will Huck wrote: Hi Rik, On 03/22/2013 11:52 AM, Rik van Riel wrote: On 03/21/2013 08:05 PM, Will Huck wrote: One offline question: how should I understand this in balance_pgdat? /* * Do some background aging of the anon list, to give * pages a chance to be referenced before reclaiming. */ age_active_anon(zone, &sc); The anon lrus use a two-handed clock algorithm. New anonymous pages start off on the active anon list. Older anonymous pages get moved to the inactive anon list. The file lrus also use the two-handed clock algorithm, correct? After reinvestigating the code, the answer is no. But why the difference? I think you are the expert on this question and look forward to your explanation. :-) If they get referenced before they reach the end of the inactive anon list, they get moved back to the active list. If we need to swap something out and find a non-referenced page at the end of the inactive anon list, we will swap it out. In order to make good pageout decisions, pages need to stay on the inactive anon list for a longer time, so they have plenty of time to get referenced, before the reclaim code looks at them. To achieve that, we will move some active anon pages to the inactive anon list even when we do not want to swap anything out - as long as the inactive anon list is below its target size. Does that make sense? Makes sense, thanks.
Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority
Hi Rik, On 03/21/2013 08:52 AM, Rik van Riel wrote: On 03/20/2013 12:18 PM, Michal Hocko wrote: On Sun 17-03-13 13:04:07, Mel Gorman wrote: [...] diff --git a/mm/vmscan.c b/mm/vmscan.c index 88c5fed..4835a7a 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2593,6 +2593,32 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, } /* + * kswapd shrinks the zone by the number of pages required to reach + * the high watermark. + */ +static void kswapd_shrink_zone(struct zone *zone, + struct scan_control *sc, + unsigned long lru_pages) +{ +unsigned long nr_slab; +struct reclaim_state *reclaim_state = current->reclaim_state; +struct shrink_control shrink = { +.gfp_mask = sc->gfp_mask, +}; + +/* Reclaim above the high watermark. */ +sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone)); OK, so the cap is at high watermark which sounds OK to me, although I would expect balance_gap being considered here. Is it not used intentionally or you just wanted to have a reasonable upper bound? I am not objecting to that it just hit my eyes. This is the maximum number of pages to reclaim, not the point at which to stop reclaiming. What's the difference between the maximum number of pages to reclaim and the point at which to stop reclaiming? I assume Mel chose this value because it guarantees that enough pages will have been freed, while also making sure that the value is scaled according to zone size (keeping pressure between zones roughly equal). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority
Hi Johannes, On 03/21/2013 11:57 PM, Johannes Weiner wrote: On Sun, Mar 17, 2013 at 01:04:07PM +, Mel Gorman wrote: The number of pages kswapd can reclaim is bound by the number of pages it scans, which is related to the size of the zone and the scanning priority. In many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX reclaimed pages, but in the event kswapd scans a large number of pages it cannot reclaim, it will raise the priority and potentially discard a large percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible effect is a reclaim "spike" where a large percentage of memory is suddenly freed. It would be bad enough if this was just unused memory, but because of how anon/file pages are balanced it is possible that applications get pushed to swap unnecessarily. This patch limits the number of pages kswapd will reclaim to the high watermark. Reclaim will will overshoot due to it not being a hard limit (will -> still?) as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY, but it prevents kswapd reclaiming the world at higher priorities. The number of pages it reclaims is not adjusted for high-order allocations as kswapd will reclaim excessively if it is to balance zones for high-order allocations. I don't really understand this last sentence. Is the excessive reclaim a result of the patch, a description of what's happening now...? Signed-off-by: Mel Gorman Nice, thank you. Using the high watermark for larger zones is more reasonable than my hack that just always went with SWAP_CLUSTER_MAX, what with inter-zone LRU cycle time balancing and all. Acked-by: Johannes Weiner One offline question: how should I understand this in balance_pgdat? /* * Do some background aging of the anon list, to give * pages a chance to be referenced before reclaiming. */ age_active_anon(zone, &sc); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ .
Re: [PATCH] mm: page_alloc: remove branch operation in free_pages_prepare()
Hi Hugh, On 03/08/2013 10:01 AM, Hugh Dickins wrote: On Fri, 8 Mar 2013, Joonsoo Kim wrote: On Thu, Mar 07, 2013 at 10:54:15AM -0800, Hugh Dickins wrote: On Thu, 7 Mar 2013, Joonsoo Kim wrote: When we find that the flags have a bit of PAGE_FLAGS_CHECK_AT_PREP set, we clear them. If we always clear the flags, we can avoid one branch operation. So remove the check. Cc: Hugh Dickins Signed-off-by: Joonsoo Kim I don't object to this patch. But certainly I would have written it that way in order not to dirty a cacheline unnecessarily. It may be obvious to you that the cacheline in question is almost always already dirty, and the branch almost always more expensive. But I'll leave that to you, and to those who know more about these subtle costs than I do. Yes, I already thought about that: even if the cacheline is not dirty at this time, we always touch the 'struct page' in set_freepage_migratetype() a little later, so dirtying is not the problem. I expect that a very high proportion of user pages have PG_uptodate to be cleared here; and there's also the recently added When will PG_uptodate be set? page_nid_reset_last(), which will dirty the flags or a nearby field when CONFIG_NUMA_BALANCING. Those argue in favour of your patch. But now I have re-thought this and decided to drop the patch. The reason is that the 'struct page' of 'compound pages' may not be dirty at this time and will not be dirtied later. Actual compound pages would have PG_head or PG_tail or PG_compound to be cleared there, I believe (check if I'm right on that). The questionable case is the ordinary order>0 case without __GFP_COMP (and page_nid_reset_last() is applied to each subpage of those). So this patch is a bad idea. I'm not so sure. I doubt your patch will make a giant improvement in kernel performance! But it might make a little - maybe you just need to give some numbers from perf to justify it (but I'm easily dazzled by numbers - don't expect me to judge the result). Hugh Any comments? 
Thanks. Hugh diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8fcced7..778f2a9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -614,8 +614,7 @@ static inline int free_pages_check(struct page *page) return 1; } page_nid_reset_last(page); - if (page->flags & PAGE_FLAGS_CHECK_AT_PREP) - page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; + page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; return 0; }
Re: Inactive memory keep growing and how to release it?
Cc experts. Hugh, Johannes, On 03/04/2013 08:21 PM, Lenky Gao wrote: 2013/3/4 Zlatko Calusic : The drop_caches mechanism doesn't free dirty page cache pages. And your bash script is creating a lot of dirty pages. Run it like this and see if it helps your case: sync; echo 3 > /proc/sys/vm/drop_caches Thanks for your advice. The inactive memory still cannot be reclaimed after I execute the sync command: # cat /proc/meminfo | grep Inactive\(file\); Inactive(file): 882824 kB # sync; # echo 3 > /proc/sys/vm/drop_caches # cat /proc/meminfo | grep Inactive\(file\); Inactive(file): 777664 kB I find these pages become orphaned in this function, but I do not understand why: /* * If truncate cannot remove the fs-private metadata from the page, the page * becomes orphaned. It will be left on the LRU and may even be mapped into * user pagetables if we're racing with filemap_fault(). * * We need to bale out if page->mapping is no longer equal to the original * mapping. This happens a) when the VM reclaimed the page while we waited on * its lock, b) when a concurrent invalidate_mapping_pages got there first and * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space. */ static int truncate_complete_page(struct address_space *mapping, struct page *page) { ... My file system type is ext3, mounted with the option data=journal, and it is easy to reproduce. ___ devel mailing list devel@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/devel
Re: [PATCH 2/2] tmpfs: fix mempolicy object leaks
Hi Hugh, On 03/06/2013 03:40 AM, Hugh Dickins wrote: On Mon, 4 Mar 2013, Will Huck wrote: Could you explain why shmem is so involved with mempolicy? There is a lot of mempolicy-handling code in shmem, while other components of the mm subsystem have very little. NUMA mempolicy is mostly handled in mm/mempolicy.c, which services the mbind, migrate_pages, set_mempolicy, get_mempolicy system calls: which govern how process memory is distributed across NUMA nodes. mm/shmem.c is affected because it was also found useful to specify mempolicy on the shared memory objects which may back process memory: that includes SysV SHM and POSIX shared memory and tmpfs. mm/hugetlb.c contains some mempolicy handling for hugetlbfs; fs/ramfs is kept minimal, so nothing in there. Those are the memory-based filesystems, where NUMA mempolicy is most natural. The regular filesystems could support shared mempolicy too, but that would raise more awkward design questions. I found that if I mbind several processes to one node and almost exhaust its memory, the processes just get stuck: none makes progress and none is killed. Is that normal? Hugh
Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
Hi Hugh, On 03/02/2013 10:57 AM, Hugh Dickins wrote: How does ksm treat a ksm forked page? IIUC, it's not merged into the ksm stable tree. Will it just be ignored? On Sat, 2 Mar 2013, Ric Mason wrote: On 03/02/2013 04:03 AM, Hugh Dickins wrote: On Fri, 1 Mar 2013, Ric Mason wrote: I think the ksm implementation for numa awareness is buggy. Sorry, I just don't understand your comments below, but will try to answer or question them as best I can. For page migration, the new page is allocated from the node *which the page is migrated to*. Yes, by definition. - when meeting a page from the wrong NUMA node in an unstable tree get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page) I thought you were writing of the wrong NUMA node case, but now you emphasize "*==*", which means the right NUMA node. Yes, I mean the wrong NUMA node. During page migration, the new page has already been allocated on the new node and the old page may be freed. So tree_page is the page in the new node's unstable tree, page is also a new node page, so get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page). I don't understand; but here you seem to be describing a case where two pages from the same NUMA node get merged (after both have been migrated from another NUMA node?), and there's nothing wrong with that, so I won't worry about it further. - meeting a page which is a ksm page before migration get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can't capture them since stable_node is for the tree page in the current stable tree. They are always equal. When we meet a ksm page in the stable tree before it's migrated to another NUMA node, yes, it will be on the right NUMA node (because we were careful only to merge pages from the right NUMA node there), and that test will not capture them. It's for capturing a ksm page in the stable tree after it has been migrated to another NUMA node. Why is a ksm page migrated to another NUMA node still not freed? Who takes a page count on it? 
The old page, the one which used to be a ksm page on the old NUMA node, should be freed very soon: since it was isolated from lru, and its page count checked, I cannot think of anything to hold a reference to it, apart from migration itself - so it just needs to reach putback_lru_page(), and then may rest awhile on __lru_cache_add()'s pagevec before being freed. But I don't see where I said the old page was still not freed. If not freed, since the new page is allocated on the new node and is a copy of the current ksm page, the current ksm page doesn't change, so get_kpfn_nid(stable_node->kpfn) *==* NUMA(stable_node->nid). But ksm_migrate_page() did VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage)); stable_node->kpfn = page_to_pfn(newpage); without changing stable_node->nid. Hugh
Re: [PATCH 2/2] tmpfs: fix mempolicy object leaks
Hi Hugh, On 02/21/2013 04:26 AM, Hugh Dickins wrote: On Tue, 19 Feb 2013, Greg Thelen wrote: This patch fixes several mempolicy leaks in the tmpfs mount logic. These leaks are slow - on the order of one object leaked per mount attempt. Leak 1 (umount doesn't free mpol allocated in mount): while true; do mount -t tmpfs -o mpol=interleave,size=100M nodev /mnt umount /mnt done Leak 2 (errors parsing remount options will leak mpol): mount -t tmpfs -o size=100M nodev /mnt while true; do mount -o remount,mpol=interleave,size=x /mnt 2> /dev/null done umount /mnt Leak 3 (multiple mpol per mount leak mpol): while true; do mount -t tmpfs -o mpol=interleave,mpol=interleave,size=100M nodev /mnt umount /mnt done This patch fixes all of the above. I could have broken the patch into three pieces but it seemed easier to review as one. Yes, I agree, and nicely fixed - but one doubt below. If you resolve that, please add my Acked-by: Hugh Dickins Could you explain why shmem is so involved with mempolicy? There is a lot of mempolicy-handling code in shmem, while other components of the mm subsystem have very little. 
Signed-off-by: Greg Thelen --- mm/shmem.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index efd0b3a..ed2cb26 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2386,6 +2386,7 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo, bool remount) { char *this_char, *value, *rest; + struct mempolicy *mpol = NULL; uid_t uid; gid_t gid; @@ -2414,7 +2415,7 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo, printk(KERN_ERR "tmpfs: No value for mount option '%s'\n", this_char); - return 1; + goto error; } if (!strcmp(this_char,"size")) { @@ -2463,19 +2464,23 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo, if (!gid_valid(sbinfo->gid)) goto bad_val; } else if (!strcmp(this_char,"mpol")) { - if (mpol_parse_str(value, &sbinfo->mpol)) + mpol_put(mpol); I haven't tested to check, but don't we need mpol = NULL; here, in case the new option turns out to be bad? + if (mpol_parse_str(value, &mpol)) goto bad_val; } else { printk(KERN_ERR "tmpfs: Bad mount option %s\n", this_char); - return 1; + goto error; } } + sbinfo->mpol = mpol; return 0; bad_val: printk(KERN_ERR "tmpfs: Bad value '%s' for mount option '%s'\n", value, this_char); +error: + mpol_put(mpol); return 1; } @@ -2551,6 +2556,7 @@ static void shmem_put_super(struct super_block *sb) struct shmem_sb_info *sbinfo = SHMEM_SB(sb); percpu_counter_destroy(&sbinfo->used_blocks); + mpol_put(sbinfo->mpol); kfree(sbinfo); sb->s_fs_info = NULL; } -- 1.8.1.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: mailto:"d...@kvack.org";> em...@kvack.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Bug fix PATCH 1/2] acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT.
On 02/20/2013 08:31 PM, Tang Chen wrote: On 02/20/2013 07:00 PM, Tang Chen wrote: As mentioned by HPA before, when we are using movablemem_map=acpi, if all the memory ranges in SRAT is hotpluggable, then no memory can be used by kernel. Before parsing SRAT, memblock has already reserve some memory ranges for other purposes, such as for kernel image, and so on. We cannot prevent kernel from using these memory. So we need to exclude these ranges even if these memory is hotpluggable. This patch changes the movablemem_map=acpi option's behavior. The memory ranges reserved by memblock will not be added into movablemem_map.map[]. So even if all the memory is hotpluggable, there will always be memory that could be used by the kernel. What's the relationship between e820 map and SRAT? Reported-by: H Peter Anvin Signed-off-by: Tang Chen --- arch/x86/mm/srat.c | 18 +- 1 files changed, 17 insertions(+), 1 deletions(-) diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c index 62ba97b..b8028b2 100644 --- a/arch/x86/mm/srat.c +++ b/arch/x86/mm/srat.c @@ -145,7 +145,7 @@ static inline int save_add_info(void) {return 0;} static void __init handle_movablemem(int node, u64 start, u64 end, u32 hotpluggable) { -int overlap; +int overlap, i; unsigned long start_pfn, end_pfn; start_pfn = PFN_DOWN(start); @@ -161,8 +161,24 @@ handle_movablemem(int node, u64 start, u64 end, u32 hotpluggable) * * Using movablemem_map, we can prevent memblock from allocating memory * on ZONE_MOVABLE at boot time. + * + * Before parsing SRAT, memblock has already reserve some memory ranges + * for other purposes, such as for kernel image. We cannot prevent + * kernel from using these memory, so we need to exclude these memory + * even if it is hotpluggable. */ if (hotpluggable&& movablemem_map.acpi) { +/* Exclude ranges reserved by memblock. 
*/ +struct memblock_type *rgn =&memblock.reserved; + +for (i = 0; i< rgn->cnt; i++) { +if (end<= rgn->regions[i].base || +start>= rgn->regions[i].base + +rgn->regions[i].size) Hi all, Here, I scan the memblock.reserved each time we parse an entry because the rgn->regions[i].nid is set to MAX_NUMNODES in memblock_reserve(). So I cannot obtain the nid which the kernel resides in directly from memblock.reserved. I think there could be some problems if the memory ranges in SRAT are not in increasing order, since if [3,4) [1,2) are all on node0, and kernel is not using [3,4), but using [1,2), then I cannot remove [3,4) because I don't know on which node [3,4) is. Any idea for this ? And by the way, I think this approach works well when the memory entries in SRAT are arranged in increasing order. Thanks. :) +continue; +goto out; +} + insert_movablemem_map(start_pfn, end_pfn); /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: mailto:"d...@kvack.org";> em...@kvack.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Should a swapped out page be deleted from swap cache?
On 02/20/2013 03:06 AM, Hugh Dickins wrote: On Tue, 19 Feb 2013, Will Huck wrote: Another question: I don't see the connection to deleting a swapped out page from swap cache. Why does the kernel memory mapping use the direct mapping instead of kmalloc/vmalloc, which would set up mappings on demand? I may misunderstand you, and "kernel memory mapping". kmalloc does not set up a mapping, it uses the direct mapping already set up. It would be circular if the basic page allocation primitives used kmalloc, since kmalloc relies on the basic page allocation primitives. vmalloc is less efficient than using the direct mapping (repeated setup and teardown, no use of hugepages), but necessary when you want a larger Are there TLB flushes in the setup and teardown process? Are they also expensive? virtual array than you're likely to find from the buddy allocator. Hugh
Re: Should a swapped out page be deleted from swap cache?
On 02/19/2013 10:04 AM, Li Haifeng wrote: 2013/2/19 Hugh Dickins On Mon, 18 Feb 2013, Li Haifeng wrote: To explain my question, the two points should be displayed as below. 1. If an anonymous page is swapped out, this page will be deleted from swap cache and be put back into the buddy system. Yes, unless the page is referenced again before it comes to be deleted from swap cache. 2. When a page is swapped out, the sharing count of the swap slot must not be zero. That is, page_swapcount(page) will not return zero. I would not say "must not": we just prefer not to waste time on swapping a page out if its use count has already gone to 0. And its use count might go down to 0 an instant after swap_writepage() makes that check. Thanks for your reply and patience. If an anonymous page is swapped out and comes to be reclaimable, shrink_page_list() will call __remove_mapping() to delete the swapped-out page from swap cache. The corresponding code is listed below. I'm not sure whether if (PageAnon(page) && !PageSwapCache(page)) { ... } will add the page to swap cache again. 765 static unsigned long shrink_page_list(struct list_head *page_list, 766 struct mem_cgroup_zone *mz, 767 struct scan_control *sc, 768 int priority, 769 unsigned long *ret_nr_dirty, 770 unsigned long *ret_nr_writeback) 771 { ... 971 if (!mapping || !__remove_mapping(mapping, page)) 972 goto keep_locked; 973 974 /* 975 * At this point, we have no other references and there is 976 * no way to pick any more up (removed from LRU, removed 977 * from pagecache). Can use non-atomic bitops now (and 978 * we obviously don't have to worry about waking up a process 979 * waiting on the page lock, because there are no references. 980 */ 981 __clear_page_locked(page); 982 free_it: 983 nr_reclaimed++; 984 985 /* 986 * Is there need to periodically free_page_list? It would 987 * appear not as the counts should be low 988 */ 989 list_add(&page->lru, &free_pages); 990 continue; Please correct me if my understanding is wrong. Thanks. 
>> Are both of the points above right? Given them, I was confused by line
>> 655 below. When a page is swapped out, the return value of
>> page_swapcount(page) will not be zero, so the page could not be
>> deleted from the swap cache.
>
> Yes, we cannot free the swap as long as its data might be needed again.
> But a swap cache page may linger in memory for an indefinite time, in
> between being queued for writeout and actually being freed from the end
> of the LRU by memory pressure. At various points where we hold the page
> lock on a swap cache page, it's worth checking whether it is still
> actually needed, or could now be freed from swap cache, and the
> corresponding swap slot freed: that's what try_to_free_swap() does.

I do agree. Thanks again.

> Hugh
>
>> 644  * If swap is getting full, or if there are no more mappings of this page,
>> 645  * then try_to_free_swap is called to free its swap space.
>> 646  */
>> 647 int try_to_free_swap(struct page *page)
>> 648 {
>> 649         VM_BUG_ON(!PageLocked(page));
>> 650
>> 651         if (!PageSwapCache(page))
>> 652                 return 0;
>> 653         if (PageWriteback(page))
>> 654                 return 0;
>> 655         if (page_swapcount(page))       /* still referenced by some swap user */
>> 656                 return 0;
>> 657
>> 658         /*
>> 659          * Once hibernation has begun to create its image of memory,
>> 660          * there's a danger that one of the calls to try_to_free_swap()
>> 661          * - most probably a call from __try_to_reclaim_swap() while
>> 662          * hibernation is allocating its own swap pages for the image,
>> 663          * but conceivably even a call from memory reclaim - will free
>> 664          * the swap from a page which has already been recorded in the
>> 665          * image as a clean swapcache page, and then reuse its swap for
>> 666          * another page of the image. On waking from hibernation, the
>> 667          * original page might be freed under memory pressure, then
>> 668          * later read back in from swap, now with the wrong data.
>> 669          *
>> 670          * Hibernation suspends storage while it is writing the image
>> 671          * to disk so check that here.
>> 672          */
>> 673         if (pm_suspended_storage())
>> 674                 return 0;
>> 675
>> 676         delete_from_swap_cache(page);
>> 677         SetPageDirty(page);
>> 678         return 1;
>> 679 }
>>
>> Thanks.
Re: Should a swapped out page be deleted from swap cache?
Hi Hugh,

Another question: why does the kernel memory mapping use a direct mapping
instead of kmalloc/vmalloc, which would set up the mapping on demand?

On 02/19/2013 02:06 AM, Hugh Dickins wrote:
> On Mon, 18 Feb 2013, Li Haifeng wrote:
>> To explain my question, consider the two points below.
>>
>> 1. If an anonymous page is swapped out, this page will be deleted from
>>    the swap cache and put back into the buddy system.
>
> Yes, unless the page is referenced again before it comes to be deleted
> from swap cache.
>
>> 2. When a page is swapped out, the sharing count of its swap slot must
>>    not be zero. That is, page_swapcount(page) will not return zero.
>
> I would not say "must not": we just prefer not to waste time swapping a
> page out if its use count has already gone to 0. And its use count
> might go down to 0 an instant after swap_writepage() makes that check.
>
>> Are both of the points above right? Given them, I was confused by line
>> 655 below. When a page is swapped out, the return value of
>> page_swapcount(page) will not be zero, so the page could not be
>> deleted from the swap cache.
>
> Yes, we cannot free the swap as long as its data might be needed again.
> But a swap cache page may linger in memory for an indefinite time, in
> between being queued for writeout and actually being freed from the end
> of the LRU by memory pressure. At various points where we hold the page
> lock on a swap cache page, it's worth checking whether it is still
> actually needed, or could now be freed from swap cache, and the
> corresponding swap slot freed: that's what try_to_free_swap() does.
>
> Hugh
>
>> 644  * If swap is getting full, or if there are no more mappings of this page,
>> 645  * then try_to_free_swap is called to free its swap space.
>> 646  */
>> 647 int try_to_free_swap(struct page *page)
>> 648 {
>> 649         VM_BUG_ON(!PageLocked(page));
>> 650
>> 651         if (!PageSwapCache(page))
>> 652                 return 0;
>> 653         if (PageWriteback(page))
>> 654                 return 0;
>> 655         if (page_swapcount(page))       /* still referenced by some swap user */
>> 656                 return 0;
>> 657
>> 658         /*
>> 659          * Once hibernation has begun to create its image of memory,
>> 660          * there's a danger that one of the calls to try_to_free_swap()
>> 661          * - most probably a call from __try_to_reclaim_swap() while
>> 662          * hibernation is allocating its own swap pages for the image,
>> 663          * but conceivably even a call from memory reclaim - will free
>> 664          * the swap from a page which has already been recorded in the
>> 665          * image as a clean swapcache page, and then reuse its swap for
>> 666          * another page of the image. On waking from hibernation, the
>> 667          * original page might be freed under memory pressure, then
>> 668          * later read back in from swap, now with the wrong data.
>> 669          *
>> 670          * Hibernation suspends storage while it is writing the image
>> 671          * to disk so check that here.
>> 672          */
>> 673         if (pm_suspended_storage())
>> 674                 return 0;
>> 675
>> 676         delete_from_swap_cache(page);
>> 677         SetPageDirty(page);
>> 678         return 1;
>> 679 }
>>
>> Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/