KVM Forum 2013 Save the Date

2013-05-02 Thread Will Huck

Hi,

Where can I get the slides from the 2012 KVM Forum?


Re: [PATCH] swap: redirty page if page write fails on swap file

2013-05-01 Thread Will Huck

Hi Jerome,
On 04/24/2013 05:57 PM, Jerome Marchand wrote:

On 04/22/2013 10:37 PM, Andrew Morton wrote:

On Wed, 17 Apr 2013 14:11:55 +0200 Jerome Marchand  wrote:


Since commit 62c230b, swap_writepage() calls direct_IO on swap files.
However, in that case page isn't redirtied if I/O fails, and is therefore
handled afterwards as if it has been successfully written to the swap
file, leading to memory corruption when the page is eventually swapped
back in.
This patch sets the page dirty when direct_IO() fails. It fixes a memory
corruption that happened while using swap-over-NFS.

...

--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -222,6 +222,8 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
if (ret == PAGE_SIZE) {
count_vm_event(PSWPOUT);
ret = 0;
+   } else {
+   set_page_dirty(page);
}
return ret;
}

So what happens to the page now?  It remains dirty and the kernel later
tries to write it again?

Yes. Also, AS_EIO or AS_ENOSPC is set to the address space flags (in this
case, swapper_space).


After AS_EIO or AS_ENOSPC is set, we can't touch swapper_space any more, correct?
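For reference, a minimal user-space sketch of what those flags do; the kernel records the error with its mapping_set_error() helper, which sets the AS_EIO or AS_ENOSPC bit in mapping->flags. This standalone snippet is only an illustration of the idea, not kernel code:

#include <errno.h>
#include <stdio.h>

/* Illustration only: latch a write error as two sticky bits, the way
 * AS_EIO/AS_ENOSPC behave on an address_space such as swapper_space. */
enum { AS_EIO_BIT = 1u << 0, AS_ENOSPC_BIT = 1u << 1 };

static unsigned int mapping_flags;

static void record_write_error(int error)
{
	if (error == -ENOSPC)
		mapping_flags |= AS_ENOSPC_BIT;
	else if (error)
		mapping_flags |= AS_EIO_BIT;
}

int main(void)
{
	record_write_error(-EIO);	/* e.g. a failed swap write */
	printf("EIO latched: %d\n", !!(mapping_flags & AS_EIO_BIT));
	return 0;
}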





And if that write also fails, the page is
effectively leaked until process exit?

AFAICT, there is no special handling for that page afterwards, so if all
subsequent attempts fail, it's indeed going to stay in memory until freed.

Jerome




Aside: Mel, __swap_writepage() is fairly hair-raising.  It unlocks the
page before doing the IO and doesn't set PageWriteback().  Why such an
exception from normal handling?

Also, what is protecting the page from concurrent reclaim or exit()
during the above swap_writepage()?

Seems that the code needs a bunch of fixes or a bunch of comments
explaining why it is safe and why it has to be this way.




Re: [RFC PATCH 0/3] Obey mark_page_accessed hint given by filesystems

2013-04-30 Thread Will Huck

Hi Mel,
On 04/30/2013 12:31 AM, Mel Gorman wrote:

Andrew Perepechko reported a problem whereby pages are being prematurely
evicted as the mark_page_accessed() hint is ignored for pages that are
currently on a pagevec -- http://www.spinics.net/lists/linux-ext4/msg37340.html.
Alexey Lyahkov and Robin Dong have also reported problems recently that
could be due to hot pages reaching the end of the inactive list too quickly
and being reclaimed.


Both shrink_active_list() and shrink_inactive_list() can call lru_add_drain(),
so why can't the hot pages be marked active at that point?



Rather than addressing this on a per-filesystem basis, this series aims
to fix the mark_page_accessed() interface by deferring the choice of which
LRU a page is added to until pagevec drain time and allowing
mark_page_accessed() to call SetPageActive on a pagevec page. This opens some
races that I think should be harmless but need double checking. The races and
the VM_BUG_ON checks that are removed are all described in patch 2.

This series received only very light testing but it did not immediately
blow up and a debugging patch confirmed that pages are now getting added
to the active file LRU list that would previously have been added to the
inactive list.

  fs/cachefiles/rdwr.c| 30 ++--
  fs/nfs/dir.c|  7 ++
  include/linux/pagevec.h | 34 +--
  mm/swap.c   | 61 -
  mm/vmscan.c |  3 ---
  5 files changed, 40 insertions(+), 95 deletions(-)





Re: OOM-killer and strange RSS value in 3.9-rc7

2013-04-30 Thread Will Huck

Hi Christoph,
On 04/29/2013 10:49 PM, Christoph Lameter wrote:

On Sat, 27 Apr 2013, Will Huck wrote:


Hi Christoph,
On 04/26/2013 01:17 AM, Christoph Lameter wrote:

On Thu, 25 Apr 2013, Han Pingtian wrote:


I have enabled "slub_debug" and here is the
/sys/kernel/slab/kmalloc-512/alloc_calls contents:

   50 .__alloc_workqueue_key+0x90/0x5d0 age=113630/116957/119419
pid=1-1730 cpus=0,6-8,13,24,26,44,53,57,60,68 nodes=1
   11 .__alloc_workqueue_key+0x16c/0x5d0 age=113814/116733/119419
pid=1-1730 cpus=0,44,68 nodes=1
   13 .add_sysfs_param.isra.2+0x80/0x210 age=115175/117994/118779
pid=1-1342 cpus=0,8,12,24,60 nodes=1
  160 .build_sched_domains+0x108/0xe30 age=119111/119120/119131 pid=1
cpus=0 nodes=1
 9000 .alloc_fair_sched_group+0xe4/0x220 age=110549/114471/117357
pid=1-2290
cpus=0-1,5,9-11,13,24,29,33,36,38,40-41,45,48-50,53,56-58,60-63,68-69,72-73,76-77,79
nodes=1
 9000 .alloc_fair_sched_group+0x114/0x220 age=110549/114471/117357
pid=1-2290
cpus=0-1,5,9-11,13,24,29,33,36,38,40-41,45,48-50,53,56-58,60-63,68-69,72-73,76-77,79
nodes=1

Could you explain the meaning of  age=xx/xx/xx  pid=xx-xx cpus=xx here?


Age refers to the minimum / average / maximum age of the objects, in ticks.


Why do we need to monitor the age of the objects?



pid refers to the range of pids of the processes that were running when the
objects were created.

cpus are the processors on which kernel threads were running when these
objects were allocated.
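As an illustration, with slub_debug enabled the allocation sites of a cache can be listed and sorted by object count straight from sysfs (same file as quoted above):

# Top allocation call sites for the kmalloc-512 cache; each line is
# <count> <caller> age=<min/avg/max ticks> pid=<range> cpus=<list> nodes=<list>
sort -rn /sys/kernel/slab/kmalloc-512/alloc_calls | head -5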





Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

2013-04-28 Thread Will Huck

Hi Peter,
On 04/28/2013 12:00 PM, H. Peter Anvin wrote:

Not reserved page, reserved bits in the page tables (which includes all bits 
beyond the maximum physical address.)


Thanks for the clarification. When are these reserved bits set?

Another question: is there any benefit to configuring a UMA machine as fake NUMA?



Will Huck  wrote:


On 04/28/2013 03:13 AM, Frantisek Hrbata wrote:

On Sat, Apr 27, 2013 at 03:00:11PM +0800, Will Huck wrote:

On 04/26/2013 11:35 PM, Frantisek Hrbata wrote:

On Fri, Apr 26, 2013 at 01:21:28PM +0800, Will Huck wrote:

Hi Peter,
On 04/02/2013 08:28 PM, Frantisek Hrbata wrote:

When CR4.PAE is set, the 64b PTE's are used(ARCH_PHYS_ADDR_T_64BIT is set for
X86_64 || X86_PAE). According to [1] Chapter 4 Paging, some higher bits in 64b
PTE are reserved and have to be set to zero. For example, for IA-32e and 4KB
page [1] 4.5 IA-32e Paging: Table 4-19, bits 51-M(MAXPHYADDR) are reserved. So
for a CPU with e.g. 48bit phys addr width, bits 51-48 have to be zero. If one of
the reserved bits is set, [1] 4.7 Page-Fault Exceptions, the #PF is generated
with RSVD error code.


RSVD flag (bit 3).
This flag is 1 if there is no valid translation for the linear address because a
reserved bit was set in one of the paging-structure entries used to translate
that address. (Because reserved bits are not checked in a paging-structure entry
whose P flag is 0, bit 3 of the error code can be set only if bit 0 is also
set.)


In mmap_mem() the first check is valid_mmap_phys_addr_range(), but it always
returns 1 on x86. So it's possible to use any pgoff we want and to set the PTE's
reserved bits in remap_pfn_range(). Meaning there is a possibility to use mmap

In this case, remap_pfn_range() sets up the mapping and the reserved bits for
the MMIO memory, so the MMIO mapping is already populated; why would it
trigger a #PF?

Hi,

I think this is described in the quote above for the RSVD flag.

remap_pfn_range() => page present => touch page => tlb miss =>
walk through paging structures => reserved bit set => #pf with rsvd flag

A present page can also trigger a #PF? Why?

Yes, please see
Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A

4.7 PAGE-FAULT EXCEPTIONS

· RSVD flag (bit 3).
This flag is 1 if there is no valid translation for the linear address because
a reserved bit was set in one of the paging-structure entries used to
translate that address. (Because reserved bits are not checked in a
paging-structure entry whose P flag is 0, bit 3 of the error code can be set
only if bit 0 is also set.) Bits reserved in the paging-structure entries are
reserved for future functionality. Software developers should be aware that
such bits may be used in the future and that a paging-structure entry that
causes a page-fault exception on one processor might not do so in the future.


I cannot tell you why. I guess this is more a question for some Intel guys.

Anyway this patch is trying to fix the following problem and
the "Bad pagetable" oops.


-8<--

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define die(fmt, ...) err(1, fmt, ##__VA_ARGS__)

/*
 1) Find some non system ram in case the CONFIG_STRICT_DEVMEM is defined
 $ cat /proc/iomem | grep -v "\(System RAM\|reserved\)"

 2) Find physical address width
 $ cat /proc/cpuinfo | grep "address sizes"

 PTE bits 51 - M are reserved, where M is the physical address width found in 2)

 Note: step 2) is actually not needed, we can always set just the 51st bit
 (0x8000000000000)

What do you mean here? Do you trigger the oops because the address is beyond
the maximum address the CPU supports, or because you access a reserved page?
If it's the latter, I don't think that's right. For example, the kernel
code/data section is reserved in memory; would a kernel access to it trigger
an oops? I don't think so.


 Set OFFSET macro to

 (start of iomem range found in 1)) | (1 << 51)

 for example
 0x000a0000 | 0x8000000000000 = 0x80000000a0000

 where 0x000a0000 is the start of the PCI bus range on my laptop

   */

#define OFFSET 0x80000000a0000LL

int main(int argc, char *argv[])
{
int fd;
long ps;
long pgoff;
char *map;
char c;

ps = sysconf(_SC_PAGE_SIZE);
if (ps == -1)
die("cannot get page size");

fd = open("/dev/mem", O_RDONLY);
if (fd == -1)
die("cannot open /dev/mem");

printf("%Lx\n", pgoff);
pgoff = (OFFSET + (ps - 1)) & ~(ps - 1);
printf("%Lx\n", pgoff);

map = mmap(NULL, ps, PROT_READ, MAP_SHARED, fd, pgoff);
if (map == MAP_FAILED)
die("cannot mmap");

c = map[0];

if (munmap(map, ps) == -1)

Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

2013-04-27 Thread Will Huck

On 04/28/2013 03:13 AM, Frantisek Hrbata wrote:

On Sat, Apr 27, 2013 at 03:00:11PM +0800, Will Huck wrote:

On 04/26/2013 11:35 PM, Frantisek Hrbata wrote:

On Fri, Apr 26, 2013 at 01:21:28PM +0800, Will Huck wrote:

Hi Peter,
On 04/02/2013 08:28 PM, Frantisek Hrbata wrote:

When CR4.PAE is set, the 64b PTE's are used(ARCH_PHYS_ADDR_T_64BIT is set for
X86_64 || X86_PAE). According to [1] Chapter 4 Paging, some higher bits in 64b
PTE are reserved and have to be set to zero. For example, for IA-32e and 4KB
page [1] 4.5 IA-32e Paging: Table 4-19, bits 51-M(MAXPHYADDR) are reserved. So
for a CPU with e.g. 48bit phys addr width, bits 51-48 have to be zero. If one of
the reserved bits is set, [1] 4.7 Page-Fault Exceptions, the #PF is generated
with RSVD error code.


RSVD flag (bit 3).
This flag is 1 if there is no valid translation for the linear address because a
reserved bit was set in one of the paging-structure entries used to translate
that address. (Because reserved bits are not checked in a paging-structure entry
whose P flag is 0, bit 3 of the error code can be set only if bit 0 is also
set.)


In mmap_mem() the first check is valid_mmap_phys_addr_range(), but it always
returns 1 on x86. So it's possible to use any pgoff we want and to set the PTE's
reserved bits in remap_pfn_range(). Meaning there is a possibility to use mmap

In this case, remap_pfn_range() sets up the mapping and the reserved bits for
the MMIO memory, so the MMIO mapping is already populated; why would it
trigger a #PF?

Hi,

I think this is described in the quote above for the RSVD flag.

remap_pfn_range() => page present => touch page => tlb miss =>
walk through paging structures => reserved bit set => #pf with rsvd flag

A present page can also trigger a #PF? Why?

Yes, please see
Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A

4.7 PAGE-FAULT EXCEPTIONS

· RSVD flag (bit 3).
This flag is 1 if there is no valid translation for the linear address because
a reserved bit was set in one of the paging-structure entries used to
translate that address. (Because reserved bits are not checked in a
paging-structure entry whose P flag is 0, bit 3 of the error code can be set
only if bit 0 is also set.) Bits reserved in the paging-structure entries are
reserved for future functionality. Software developers should be aware that
such bits may be used in the future and that a paging-structure entry that
causes a page-fault exception on one processor might not do so in the future.


I cannot tell you why. I guess this is more a question for some Intel guys.

Anyway this patch is trying to fix the following problem and
the "Bad pagetable" oops.

-8<--
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define die(fmt, ...) err(1, fmt, ##__VA_ARGS__)

/*
1) Find some non system ram in case the CONFIG_STRICT_DEVMEM is defined
$ cat /proc/iomem | grep -v "\(System RAM\|reserved\)"

2) Find physical address width
$ cat /proc/cpuinfo | grep "address sizes"

PTE bits 51 - M are reserved, where M is the physical address width found in 2)
Note: step 2) is actually not needed, we can always set just the 51st bit
(0x8000000000000)


What do you mean here? Do you trigger the oops because the address is beyond
the maximum address the CPU supports, or because you access a reserved page?
If it's the latter, I don't think that's right. For example, the kernel
code/data section is reserved in memory; would a kernel access to it trigger
an oops? I don't think so.




Set OFFSET macro to

(start of iomem range found in 1)) | (1 << 51)

for example
0x000a0000 | 0x8000000000000 = 0x80000000a0000

where 0x000a0000 is the start of the PCI bus range on my laptop

  */

#define OFFSET 0x80000000a0000LL

int main(int argc, char *argv[])
{
int fd;
long ps;
long pgoff;
char *map;
char c;

ps = sysconf(_SC_PAGE_SIZE);
if (ps == -1)
die("cannot get page size");

fd = open("/dev/mem", O_RDONLY);
if (fd == -1)
die("cannot open /dev/mem");

printf("%Lx\n", pgoff);
pgoff = (OFFSET + (ps - 1)) & ~(ps - 1);
printf("%Lx\n", pgoff);

map = mmap(NULL, ps, PROT_READ, MAP_SHARED, fd, pgoff);
if (map == MAP_FAILED)
die("cannot mmap");

c = map[0];

if (munmap(map, ps) == -1)
die("cannot munmap");

if (close(fd) == -1)
die("cannot close");

return 0;
}
-8<--

Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.814860] pfrsvd: Corrupted page table at address 7f34087c8000
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.817356] PGD 12d0b

Re: OOM-killer and strange RSS value in 3.9-rc7

2013-04-27 Thread Will Huck

Hi Christoph,
On 04/26/2013 01:17 AM, Christoph Lameter wrote:

On Thu, 25 Apr 2013, Han Pingtian wrote:


I have enabled "slub_debug" and here is the
/sys/kernel/slab/kmalloc-512/alloc_calls contents:

  50 .__alloc_workqueue_key+0x90/0x5d0 age=113630/116957/119419 pid=1-1730 
cpus=0,6-8,13,24,26,44,53,57,60,68 nodes=1
  11 .__alloc_workqueue_key+0x16c/0x5d0 age=113814/116733/119419 pid=1-1730 
cpus=0,44,68 nodes=1
  13 .add_sysfs_param.isra.2+0x80/0x210 age=115175/117994/118779 pid=1-1342 
cpus=0,8,12,24,60 nodes=1
 160 .build_sched_domains+0x108/0xe30 age=119111/119120/119131 pid=1 cpus=0 
nodes=1
9000 .alloc_fair_sched_group+0xe4/0x220 age=110549/114471/117357 pid=1-2290 
cpus=0-1,5,9-11,13,24,29,33,36,38,40-41,45,48-50,53,56-58,60-63,68-69,72-73,76-77,79
 nodes=1
9000 .alloc_fair_sched_group+0x114/0x220 age=110549/114471/117357 
pid=1-2290 
cpus=0-1,5,9-11,13,24,29,33,36,38,40-41,45,48-50,53,56-58,60-63,68-69,72-73,76-77,79
 nodes=1


Could you explain the meaning of  age=xx/xx/xx  pid=xx-xx cpus=xx here?


?? Is that normal to have that amount of sched group allocations?



Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

2013-04-27 Thread Will Huck

On 04/26/2013 11:35 PM, Frantisek Hrbata wrote:

On Fri, Apr 26, 2013 at 01:21:28PM +0800, Will Huck wrote:

Hi Peter,
On 04/02/2013 08:28 PM, Frantisek Hrbata wrote:

When CR4.PAE is set, the 64b PTE's are used(ARCH_PHYS_ADDR_T_64BIT is set for
X86_64 || X86_PAE). According to [1] Chapter 4 Paging, some higher bits in 64b
PTE are reserved and have to be set to zero. For example, for IA-32e and 4KB
page [1] 4.5 IA-32e Paging: Table 4-19, bits 51-M(MAXPHYADDR) are reserved. So
for a CPU with e.g. 48bit phys addr width, bits 51-48 have to be zero. If one of
the reserved bits is set, [1] 4.7 Page-Fault Exceptions, the #PF is generated
with RSVD error code.


RSVD flag (bit 3).
This flag is 1 if there is no valid translation for the linear address because a
reserved bit was set in one of the paging-structure entries used to translate
that address. (Because reserved bits are not checked in a paging-structure entry
whose P flag is 0, bit 3 of the error code can be set only if bit 0 is also
set.)


In mmap_mem() the first check is valid_mmap_phys_addr_range(), but it always
returns 1 on x86. So it's possible to use any pgoff we want and to set the PTE's
reserved bits in remap_pfn_range(). Meaning there is a possibility to use mmap

In this case, remap_pfn_range() sets up the mapping and the reserved bits for
the MMIO memory, so the MMIO mapping is already populated; why would it
trigger a #PF?

Hi,

I think this is described in the quote above for the RSVD flag.

remap_pfn_range() => page present => touch page => tlb miss =>
walk through paging structures => reserved bit set => #pf with rsvd flag


A present page can also trigger a #PF? Why?



I hope I didn't misunderstand your question.

Thanks


on /dev/mem and cause system panic. It's probably not that serious, because
access to /dev/mem is limited and the system has to have panic_on_oops set, but
still I think we should check this and return error.

This patch adds a check for x86 when ARCH_PHYS_ADDR_T_64BIT is set, the same way
as it is already done in e.g. ioremap. With this fix mmap returns -EINVAL if the
requested phys addr is bigger than the supported phys addr width.

[1] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A

Signed-off-by: Frantisek Hrbata 
---
  arch/x86/include/asm/io.h |  4 
  arch/x86/mm/mmap.c| 13 +
  2 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d8e8eef..39607c6 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -242,6 +242,10 @@ static inline void flush_write_buffers(void)
  #endif
  }
+#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
+extern int valid_phys_addr_range(phys_addr_t addr, size_t count);
+extern int valid_mmap_phys_addr_range(unsigned long pfn, size_t count);
+
  #endif /* __KERNEL__ */
  extern void native_io_delay(void);
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 845df68..92ec31c 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -31,6 +31,8 @@
  #include 
  #include 
+#include "physaddr.h"
+
  struct __read_mostly va_alignment va_align = {
.flags = -1,
  };
@@ -122,3 +124,14 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
mm->unmap_area = arch_unmap_area_topdown;
}
  }
+
+int valid_phys_addr_range(phys_addr_t addr, size_t count)
+{
+   return addr + count <= __pa(high_memory);
+}
+
+int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
+{
+   resource_size_t addr = (pfn << PAGE_SHIFT) + count;
+   return phys_addr_valid(addr);
+}



KVM Forum 2013 Save the Date

2013-04-26 Thread Will Huck

Hi,

Where can I get the slides from the 2012 KVM Forum?


Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

2013-04-25 Thread Will Huck

Hi Peter,
On 04/02/2013 08:28 PM, Frantisek Hrbata wrote:

When CR4.PAE is set, the 64b PTE's are used(ARCH_PHYS_ADDR_T_64BIT is set for
X86_64 || X86_PAE). According to [1] Chapter 4 Paging, some higher bits in 64b
PTE are reserved and have to be set to zero. For example, for IA-32e and 4KB
page [1] 4.5 IA-32e Paging: Table 4-19, bits 51-M(MAXPHYADDR) are reserved. So
for a CPU with e.g. 48bit phys addr width, bits 51-48 have to be zero. If one of
the reserved bits is set, [1] 4.7 Page-Fault Exceptions, the #PF is generated
with RSVD error code.


RSVD flag (bit 3).
This flag is 1 if there is no valid translation for the linear address because a
reserved bit was set in one of the paging-structure entries used to translate
that address. (Because reserved bits are not checked in a paging-structure entry
whose P flag is 0, bit 3 of the error code can be set only if bit 0 is also
set.)


In mmap_mem() the first check is valid_mmap_phys_addr_range(), but it always
returns 1 on x86. So it's possible to use any pgoff we want and to set the PTE's
reserved bits in remap_pfn_range(). Meaning there is a possibility to use mmap


In this case, remap_pfn_range() sets up the mapping and the reserved bits for the
MMIO memory, so the MMIO mapping is already populated; why would it trigger a #PF?



on /dev/mem and cause system panic. It's probably not that serious, because
access to /dev/mem is limited and the system has to have panic_on_oops set, but
still I think we should check this and return error.

This patch adds a check for x86 when ARCH_PHYS_ADDR_T_64BIT is set, the same way
as it is already done in e.g. ioremap. With this fix mmap returns -EINVAL if the
requested phys addr is bigger than the supported phys addr width.

[1] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A

Signed-off-by: Frantisek Hrbata 
---
  arch/x86/include/asm/io.h |  4 
  arch/x86/mm/mmap.c| 13 +
  2 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d8e8eef..39607c6 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -242,6 +242,10 @@ static inline void flush_write_buffers(void)
  #endif
  }
  
+#define ARCH_HAS_VALID_PHYS_ADDR_RANGE

+extern int valid_phys_addr_range(phys_addr_t addr, size_t count);
+extern int valid_mmap_phys_addr_range(unsigned long pfn, size_t count);
+
  #endif /* __KERNEL__ */
  
  extern void native_io_delay(void);

diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 845df68..92ec31c 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -31,6 +31,8 @@
  #include 
  #include 
  
+#include "physaddr.h"

+
  struct __read_mostly va_alignment va_align = {
.flags = -1,
  };
@@ -122,3 +124,14 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
mm->unmap_area = arch_unmap_area_topdown;
}
  }
+
+int valid_phys_addr_range(phys_addr_t addr, size_t count)
+{
+   return addr + count <= __pa(high_memory);
+}
+
+int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
+{
+   resource_size_t addr = (pfn << PAGE_SHIFT) + count;
+   return phys_addr_valid(addr);
+}




Re: [PATCH] Add a sysctl for numa_balancing.

2013-04-24 Thread Will Huck

On 04/25/2013 07:56 AM, Andi Kleen wrote:

From: Andi Kleen 

As discussed earlier, this adds a working sysctl to enable/disable
automatic numa memory balancing at runtime.

This was possible earlier through debugfs, but only with special
debugging options set. Also fix the boot message.
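For illustration, once this patch is applied the switch can be flipped at runtime through the new sysctl (the name comes from the kernel/sysctl.c hunk below):

# query the current state
sysctl kernel.numa_balancing
# disable, then re-enable, automatic NUMA balancing at runtime
sysctl -w kernel.numa_balancing=0
echo 1 > /proc/sys/kernel/numa_balancing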


One offline question.

If I configure a UMA machine as fake NUMA, is there any benefit or downside?
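For what it's worth, fake NUMA on x86 is set up with a boot parameter rather than this sysctl; a small illustration (the node count is just an example):

# kernel command line (e.g. in GRUB): split a UMA machine into 4 fake nodes
#   ... numa=fake=4
# after reboot, inspect the emulated topology
numactl --hardware
cat /sys/devices/system/node/online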



Signed-off-by: Andi Kleen 
---
  Documentation/sysctl/kernel.txt |   10 ++
  include/linux/sched/sysctl.h|4 
  kernel/sched/core.c |   24 +++-
  kernel/sysctl.c |   11 +++
  mm/mempolicy.c  |2 +-
  5 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ccd4258..17a7004 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -354,6 +354,16 @@ utilize.
  
  ==
  
+numa_balancing

+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+TBD someone document the other numa_balancing tunables
+
+==
+
  osrelease, ostype & version:
  
  # cat osrelease

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..e228a1b 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -101,4 +101,8 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
  
+extern int sched_numa_balancing(struct ctl_table *table, int write,

+void __user *buffer, size_t *lenp,
+loff_t *ppos);
+
  #endif /* _SCHED_SYSCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..679be74 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1614,7 +1614,29 @@ void set_numabalancing_state(bool enabled)
numabalancing_enabled = enabled;
  }
  #endif /* CONFIG_SCHED_DEBUG */
-#endif /* CONFIG_NUMA_BALANCING */
+
+#ifdef CONFIG_PROC_SYSCTL
+int sched_numa_balancing(struct ctl_table *table, int write,
+void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   struct ctl_table t;
+   int err;
+   int state = numabalancing_enabled;
+
+   if (write && !capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
+   t = *table;
+   t.data = &state;
+   err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+   if (err < 0)
+   return err;
+   if (write)
+   set_numabalancing_state(state);
+   return err;
+}
+#endif
+#endif
  
  /*

   * fork()/clone()-time setup:
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..94164ac 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -393,6 +393,17 @@ static struct ctl_table kern_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec,
},
+   {
+   .procname   = "numa_balancing",
+   .data   = NULL, /* filled in by handler */
+   .maxlen = sizeof(unsigned int),
+   .mode   = 0644,
+   .proc_handler   = sched_numa_balancing,
+   .extra1 = &zero,
+   .extra2 = &one,
+   },
+
+
  #endif /* CONFIG_NUMA_BALANCING */
  #endif /* CONFIG_SCHED_DEBUG */
{
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7431001..7eee646 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2531,7 +2531,7 @@ static void __init check_numabalancing_enable(void)
  
  	if (nr_node_ids > 1 && !numabalancing_override) {

printk(KERN_INFO "Enabling automatic NUMA balancing. "
-   "Configure with numa_balancing= or sysctl");
+   "Configure with numa_balancing= or the kernel.numa_balancing 
sysctl");
set_numabalancing_state(numabalancing_default);
}
  }




Re: [PATCH] slab: Remove unnecessary __builtin_constant_p()

2013-04-17 Thread Will Huck

Hi Steven,
On 04/18/2013 03:09 AM, Steven Rostedt wrote:

The slab.c code has a size check macro that checks the size of the
following structs:

struct arraycache_init
struct kmem_list3

The index_of() function takes the sizeof() of the above two structs
and does an unnecessary __builtin_constant_p() on that. As sizeof() will
always end up being a constant, the check is always true. The code is
not incorrect, but it adds complexity, confuses users, and wastes the
time of reviewers of the code, who spend time trying to figure out why
the __builtin_constant_p() was used.


In the normal case, what is __builtin_constant_p() used for?
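For context, a typical use of __builtin_constant_p() is to let a macro take a fully compile-time path when its argument is a literal constant and fall back to a runtime helper otherwise; a standalone sketch (not the slab.c code) is:

#include <stdio.h>

static int ilog2_runtime(unsigned long v)
{
	int r = -1;

	while (v) {
		v >>= 1;
		r++;
	}
	return r;
}

/* Constant argument: the compiler folds the whole expression at build time.
 * Non-constant argument: fall back to the runtime loop. */
#define my_ilog2(v) \
	(__builtin_constant_p(v) ? (int)(63 - __builtin_clzll(v)) : ilog2_runtime(v))

int main(void)
{
	unsigned long x = 4096;

	printf("%d %d\n", my_ilog2(256), my_ilog2(x));	/* prints "8 12" */
	return 0;
}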



This patch is just a clean up that makes the index_of() code a little
bit less complex.

Signed-off-by: Steven Rostedt 

diff --git a/mm/slab.c b/mm/slab.c
index 856e4a1..6047900 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -325,9 +325,7 @@ static void cache_reap(struct work_struct *unused);
  static __always_inline int index_of(const size_t size)
  {
extern void __bad_size(void);
-
-   if (__builtin_constant_p(size)) {
-   int i = 0;
+   int i = 0;
  
  #define CACHE(x) \

if (size <=x) \
@@ -336,9 +334,7 @@ static __always_inline int index_of(const size_t size)
i++;
  #include 
  #undef CACHE
-   __bad_size();
-   } else
-   __bad_size();
+   __bad_size();
return 0;
  }
  





Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

2013-04-10 Thread Will Huck

Hi Rik,
On 03/22/2013 11:52 AM, Rik van Riel wrote:

On 03/21/2013 08:05 PM, Will Huck wrote:


One offline question: how should I understand this code in balance_pgdat()?
/*
 * Do some background aging of the anon list, to give
 * pages a chance to be referenced before reclaiming.
 */
age_active_anon(zone, &sc);


The anon lrus use a two-handed clock algorithm. New anonymous pages
start off on the active anon list. Older anonymous pages get moved
to the inactive anon list.


The downside of the page cache use-once replacement algorithm is the
inter-reference distance problem, correct? Does it have any other downsides?
And what is the downside of the two-handed clock algorithm for anonymous pages?




If they get referenced before they reach the end of the inactive anon
list, they get moved back to the active list.

If we need to swap something out and find a non-referenced page at the
end of the inactive anon list, we will swap it out.

In order to make good pageout decisions, pages need to stay on the
inactive anon list for a longer time, so they have plenty of time to
get referenced, before the reclaim code looks at them.

To achieve that, we will move some active anon pages to the inactive
anon list even when we do not want to swap anything out - as long as
the inactive anon list is below its target size.

Does that make sense?
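As a small aside, the effect of this background aging can be watched from userspace by comparing the sizes of the two anon lists:

$ grep -E 'Active\(anon\)|Inactive\(anon\)' /proc/meminfo
$ grep -E 'nr_(in)?active_anon' /proc/vmstat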





Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

2013-04-10 Thread Will Huck

Hi Rik,
On 03/22/2013 11:52 AM, Rik van Riel wrote:

On 03/21/2013 08:05 PM, Will Huck wrote:


One offline question: how should I understand this code in balance_pgdat()?
/*
 * Do some background aging of the anon list, to give
 * pages a chance to be referenced before reclaiming.
 */
age_active_anon(zone, &sc);


The anon lrus use a two-handed clock algorithm. New anonymous pages


Why is this algorithm described as a two-handed clock?


start off on the active anon list. Older anonymous pages get moved
to the inactive anon list.

If they get referenced before they reach the end of the inactive anon
list, they get moved back to the active list.

If we need to swap something out and find a non-referenced page at the
end of the inactive anon list, we will swap it out.

In order to make good pageout decisions, pages need to stay on the
inactive anon list for a longer time, so they have plenty of time to
get referenced, before the reclaim code looks at them.

To achieve that, we will move some active anon pages to the inactive
anon list even when we do not want to swap anything out - as long as
the inactive anon list is below its target size.

Does that make sense?





Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

2013-04-07 Thread Will Huck

cc Fengguang,
On 04/05/2013 08:05 AM, Will Huck wrote:

Hi Rik,
On 03/22/2013 09:01 PM, Rik van Riel wrote:

On 03/22/2013 12:59 AM, Will Huck wrote:

Hi Rik,
On 03/22/2013 11:56 AM, Will Huck wrote:

Hi Rik,
On 03/22/2013 11:52 AM, Rik van Riel wrote:

On 03/21/2013 08:05 PM, Will Huck wrote:

One offline question: how should I understand this code in balance_pgdat()?

/*
 * Do some background aging of the anon list, to give
 * pages a chance to be referenced before reclaiming.
 */
age_active_anon(zone, &sc);


The anon lrus use a two-handed clock algorithm. New anonymous pages
start off on the active anon list. Older anonymous pages get moved
to the inactive anon list.


The file lrus also use the two-handed clock algorithm, correct?


After re-examining the code, the answer is no. But why the difference?
I think you are the expert on this question and would appreciate your
explanation. :-)


Anonymous memory has a smaller amount of memory (on the order
of system memory), most of which is or has been in a working
set at some point.

File system cache tends to have two distinct sets. One part
are the frequently accessed files, another part are the files
that are accessed just once or twice.

The file working set needs to be protected from streaming
IO. We do this by having new file pages start out on the


Is there a streaming IO workload or benchmark for this?


inactive file list, and only promoted to the active file
list if they get accessed twice.
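As one simple way to generate a use-once streaming read workload for such an experiment (file name and size are placeholders):

$ dd if=/dev/zero of=streamfile bs=1M count=4096   # create a large file
# echo 3 > /proc/sys/vm/drop_caches                # start from a cold cache (as root)
$ dd if=streamfile of=/dev/null bs=1M              # read it once, front to back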








Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

2013-04-07 Thread Will Huck

Ping Rik.
On 04/05/2013 08:05 AM, Will Huck wrote:

Hi Rik,
On 03/22/2013 09:01 PM, Rik van Riel wrote:

On 03/22/2013 12:59 AM, Will Huck wrote:

Hi Rik,
On 03/22/2013 11:56 AM, Will Huck wrote:

Hi Rik,
On 03/22/2013 11:52 AM, Rik van Riel wrote:

On 03/21/2013 08:05 PM, Will Huck wrote:

One offline question: how should I understand this code in balance_pgdat()?

/*
 * Do some background aging of the anon list, to give
 * pages a chance to be referenced before reclaiming.
 */
age_active_anon(zone, &sc);


The anon lrus use a two-handed clock algorithm. New anonymous pages
start off on the active anon list. Older anonymous pages get moved
to the inactive anon list.


The file lrus also use the two-handed clock algorithm, correct?


After re-examining the code, the answer is no. But why the difference?
I think you are the expert on this question and would appreciate your
explanation. :-)


Anonymous memory has a smaller amount of memory (on the order
of system memory), most of which is or has been in a working
set at some point.

File system cache tends to have two distinct sets. One part
are the frequently accessed files, another part are the files
that are accessed just once or twice.

The file working set needs to be protected from streaming
IO. We do this by having new file pages start out on the


Is there a streaming IO workload or benchmark for this?


inactive file list, and only promoted to the active file
list if they get accessed twice.








Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

2013-04-04 Thread Will Huck

Hi Rik,
On 03/22/2013 09:01 PM, Rik van Riel wrote:

On 03/22/2013 12:59 AM, Will Huck wrote:

Hi Rik,
On 03/22/2013 11:56 AM, Will Huck wrote:

Hi Rik,
On 03/22/2013 11:52 AM, Rik van Riel wrote:

On 03/21/2013 08:05 PM, Will Huck wrote:

One offline question: how should I understand this code in balance_pgdat()?

/*
 * Do some background aging of the anon list, to give
 * pages a chance to be referenced before reclaiming.
 */
age_active_anon(zone, &sc);


The anon lrus use a two-handed clock algorithm. New anonymous pages
start off on the active anon list. Older anonymous pages get moved
to the inactive anon list.


The file lrus also use the two-handed clock algorithm, correct?


After re-examining the code, the answer is no. But why the difference?
I think you are the expert on this question and would appreciate your
explanation. :-)


Anonymous memory has a smaller amount of memory (on the order
of system memory), most of which is or has been in a working
set at some point.

File system cache tends to have two distinct sets. One part
are the frequently accessed files, another part are the files
that are accessed just once or twice.

The file working set needs to be protected from streaming
IO. We do this by having new file pages start out on the


Is there a streaming IO workload or benchmark for this?


inactive file list, and only promoted to the active file
list if they get accessed twice.






Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

2013-03-21 Thread Will Huck

Hi Rik,
On 03/22/2013 11:52 AM, Rik van Riel wrote:

On 03/21/2013 08:05 PM, Will Huck wrote:


One offline question: how should I understand this code in balance_pgdat()?
/*
 * Do some background aging of the anon list, to give
 * pages a chance to be referenced before reclaiming.
 */
age_active_anon(zone, &sc);


The anon lrus use a two-handed clock algorithm. New anonymous pages
start off on the active anon list. Older anonymous pages get moved
to the inactive anon list.


The file lrus also use the two-handed clock algorithm, correct?



If they get referenced before they reach the end of the inactive anon
list, they get moved back to the active list.

If we need to swap something out and find a non-referenced page at the
end of the inactive anon list, we will swap it out.

In order to make good pageout decisions, pages need to stay on the
inactive anon list for a longer time, so they have plenty of time to
get referenced, before the reclaim code looks at them.

To achieve that, we will move some active anon pages to the inactive
anon list even when we do not want to swap anything out - as long as
the inactive anon list is below its target size.

Does that make sense?


Makes sense, thanks.



Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

2013-03-21 Thread Will Huck

Hi Rik,
On 03/22/2013 11:56 AM, Will Huck wrote:

Hi Rik,
On 03/22/2013 11:52 AM, Rik van Riel wrote:

On 03/21/2013 08:05 PM, Will Huck wrote:


One offline question: how should I understand this code in balance_pgdat()?
/*
 * Do some background aging of the anon list, to give
 * pages a chance to be referenced before reclaiming.
 */
age_active_anon(zone, &sc);


The anon lrus use a two-handed clock algorithm. New anonymous pages
start off on the active anon list. Older anonymous pages get moved
to the inactive anon list.


The file lrus also use the two-handed clock algorithm, correct?


After re-examining the code, the answer is no. But why the difference?
I think you are the expert on this question and would appreciate your
explanation. :-)






If they get referenced before they reach the end of the inactive anon
list, they get moved back to the active list.

If we need to swap something out and find a non-referenced page at the
end of the inactive anon list, we will swap it out.

In order to make good pageout decisions, pages need to stay on the
inactive anon list for a longer time, so they have plenty of time to
get referenced, before the reclaim code looks at them.

To achieve that, we will move some active anon pages to the inactive
anon list even when we do not want to swap anything out - as long as
the inactive anon list is below its target size.

Does that make sense?


Makes sense, thanks.





Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

2013-03-21 Thread Will Huck

Hi Rik,
On 03/21/2013 08:52 AM, Rik van Riel wrote:

On 03/20/2013 12:18 PM, Michal Hocko wrote:

On Sun 17-03-13 13:04:07, Mel Gorman wrote:
[...]

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 88c5fed..4835a7a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2593,6 +2593,32 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,

  }

  /*
+ * kswapd shrinks the zone by the number of pages required to reach
+ * the high watermark.
+ */
+static void kswapd_shrink_zone(struct zone *zone,
+			       struct scan_control *sc,
+			       unsigned long lru_pages)
+{
+	unsigned long nr_slab;
+	struct reclaim_state *reclaim_state = current->reclaim_state;
+	struct shrink_control shrink = {
+		.gfp_mask = sc->gfp_mask,
+	};
+
+	/* Reclaim above the high watermark. */
+	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));


OK, so the cap is at high watermark which sounds OK to me, although I
would expect balance_gap being considered here. Is it not used
intentionally or you just wanted to have a reasonable upper bound?

I am not objecting to that it just hit my eyes.


This is the maximum number of pages to reclaim, not the point
at which to stop reclaiming.


What's the difference between the maximum number of pages to reclaim and 
the point at which to stop reclaiming?




I assume Mel chose this value because it guarantees that enough
pages will have been freed, while also making sure that the value
is scaled according to zone size (keeping pressure between zones
roughly equal).
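For reference, the per-zone high watermark that this reclaim target is derived from can be read off a running system (values are in pages):

$ grep -E 'Node|pages free|min|low|high' /proc/zoneinfo | head -20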





Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

2013-03-21 Thread Will Huck

Hi Johannes,
On 03/21/2013 11:57 PM, Johannes Weiner wrote:

On Sun, Mar 17, 2013 at 01:04:07PM +, Mel Gorman wrote:

The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority. In
many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
reclaimed pages but in the event kswapd scans a large number of pages it
cannot reclaim, it will raise the priority and potentially discard a large
percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
effect is a reclaim "spike" where a large percentage of memory is suddenly
freed. It would be bad enough if this was just unused memory but because
of how anon/file pages are balanced it is possible that applications get
pushed to swap unnecessarily.

This patch limits the number of pages kswapd will reclaim to the high
watermark. Reclaim will will overshoot due to it not being a hard limit as

will -> still?


shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities. The number of
pages it reclaims is not adjusted for high-order allocations as kswapd will
reclaim excessively if it is to balance zones for high-order allocations.

I don't really understand this last sentence.  Is the excessive
reclaim a result of the patch, a description of what's happening
now...?


Signed-off-by: Mel Gorman 

Nice, thank you.  Using the high watermark for larger zones is more
reasonable than my hack that just always went with SWAP_CLUSTER_MAX,
what with inter-zone LRU cycle time balancing and all.

Acked-by: Johannes Weiner 


One offline question: how should I understand this code in balance_pgdat()?
/*
 * Do some background aging of the anon list, to give
 * pages a chance to be referenced before reclaiming.
 */
age_active_anon(zone, &sc);




Re: [PATCH] mm: page_alloc: remove branch operation in free_pages_prepare()

2013-03-08 Thread Will Huck

Hi Hugh,
On 03/08/2013 10:01 AM, Hugh Dickins wrote:

On Fri, 8 Mar 2013, Joonsoo Kim wrote:

On Thu, Mar 07, 2013 at 10:54:15AM -0800, Hugh Dickins wrote:

On Thu, 7 Mar 2013, Joonsoo Kim wrote:


When we found that the flag has a bit of PAGE_FLAGS_CHECK_AT_PREP,
we reset the flag. If we always reset the flag, we can reduce one
branch operation. So remove it.

Cc: Hugh Dickins 
Signed-off-by: Joonsoo Kim 

I don't object to this patch.  But certainly I would have written it
that way in order not to dirty a cacheline unnecessarily.  It may be
obvious to you that the cacheline in question is almost always already
dirty, and the branch almost always more expensive.  But I'll leave that
to you, and to those who know more about these subtle costs than I do.

Yes. I already think about that. I thought that even if a cacheline is
not dirty at this time, we always touch the 'struct page' in
set_freepage_migratetype() a little later, so dirtying is not the problem.

I expect that a very high proportion of user pages have
PG_uptodate to be cleared here; and there's also the recently added


When will PG_uptodate be set?


page_nid_reset_last(), which will dirty the flags or a nearby field
when CONFIG_NUMA_BALANCING.  Those argue in favour of your patch.


But now I have re-thought this and decided to drop this patch.
The reason is that the 'struct page' of compound pages may not be dirty
at this time and will not be dirtied later.

Actual compound pages would have PG_head or PG_tail or PG_compound
to be cleared there, I believe (check if I'm right on that).  The
questionable case is the ordinary order>0 case without __GFP_COMP
(and page_nid_reset_last() is applied to each subpage of those).


So this patch is bad idea.

I'm not so sure.  I doubt your patch will make a giant improvement
in kernel performance!  But it might make a little - maybe you just
need to give some numbers from perf to justify it (but I'm easily
dazzled by numbers - don't expect me to judge the result).

Hugh


Are there any comments?

Thanks.


Hugh


diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fcced7..778f2a9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -614,8 +614,7 @@ static inline int free_pages_check(struct page *page)
return 1;
}
page_nid_reset_last(page);
-   if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
-   page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+   page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
return 0;
  }
  



Re: Inactive memory keep growing and how to release it?

2013-03-08 Thread Will Huck

Cc experts. Hugh, Johannes,

On 03/04/2013 08:21 PM, Lenky Gao wrote:

2013/3/4 Zlatko Calusic :

The drop_caches mechanism doesn't free dirty page cache pages. And your bash
script is creating a lot of dirty pages. Run it like this and see if it
helps your case:

sync; echo 3 > /proc/sys/vm/drop_caches

Thanks for your advice.

The inactive memory still cannot be reclaimed after I execute the sync command:

# cat /proc/meminfo | grep Inactive\(file\);
Inactive(file):   882824 kB
# sync;
# echo 3 > /proc/sys/vm/drop_caches
# cat /proc/meminfo | grep Inactive\(file\);
Inactive(file):   777664 kB

I find these pages become orphaned in this function, but I do not understand why:

/*
  * If truncate cannot remove the fs-private metadata from the page, the page
  * becomes orphaned.  It will be left on the LRU and may even be mapped into
  * user pagetables if we're racing with filemap_fault().
  *
  * We need to bale out if page->mapping is no longer equal to the original
  * mapping.  This happens a) when the VM reclaimed the page while we waited on
  * its lock, b) when a concurrent invalidate_mapping_pages got there first and
  * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space.
  */
static int
truncate_complete_page(struct address_space *mapping, struct page *page)
{
...

My file system type is ext3, mounted with the option data=journal, and
it is easy to reproduce.
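For anyone trying to reproduce this, something along these lines should generate plenty of dirty file pages on such a mount (the device and paths are placeholders; this is not the reporter's original script):

# mount -t ext3 -o data=journal /dev/sdb1 /mnt/test
# for i in $(seq 1 1000); do dd if=/dev/zero of=/mnt/test/f$i bs=1M count=1 2>/dev/null; done
# grep 'Inactive(file)' /proc/meminfo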






Re: Inactive memory keep growing and how to release it?

2013-03-08 Thread Will Huck

Cc experts. Hugh, Johannes,

On 03/04/2013 08:21 PM, Lenky Gao wrote:

2013/3/4 Zlatko Calusic :

The drop_caches mechanism doesn't free dirty page cache pages. And your bash
script is creating a lot of dirty pages. Run it like this and see if it
helps your case:

sync; echo 3 > /proc/sys/vm/drop_caches

Thanks for your advice.

The inactive memory still cannot be reclaimed after I execute the sync command:

# cat /proc/meminfo | grep Inactive\(file\);
Inactive(file):   882824 kB
# sync;
# echo 3 > /proc/sys/vm/drop_caches
# cat /proc/meminfo | grep Inactive\(file\);
Inactive(file):   777664 kB

I find these pages become orphaned in this function, but I do not understand why:

/*
  * If truncate cannot remove the fs-private metadata from the page, the page
  * becomes orphaned.  It will be left on the LRU and may even be mapped into
  * user pagetables if we're racing with filemap_fault().
  *
  * We need to bale out if page->mapping is no longer equal to the original
  * mapping.  This happens a) when the VM reclaimed the page while we waited on
  * its lock, b) when a concurrent invalidate_mapping_pages got there first and
  * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space.
  */
static int
truncate_complete_page(struct address_space *mapping, struct page *page)
{
...

My file system type is ext3, mounted with the option data=journal, and
it is easy to reproduce.






Re: [PATCH 2/2] tmpfs: fix mempolicy object leaks

2013-03-06 Thread Will Huck

Hi Hugh,
On 03/06/2013 03:40 AM, Hugh Dickins wrote:

On Mon, 4 Mar 2013, Will Huck wrote:

Could you explain why shmem has such a close relationship with mempolicy?
It seems there is a lot of code in shmem handling mempolicy, while other
components in the mm subsystem have very little.

NUMA mempolicy is mostly handled in mm/mempolicy.c, which services the
mbind, migrate_pages, set_mempolicy, get_mempolicy system calls: which
govern how process memory is distributed across NUMA nodes.

mm/shmem.c is affected because it was also found useful to specify
mempolicy on the shared memory objects which may back process memory:
that includes SysV SHM and POSIX shared memory and tmpfs.  mm/hugetlb.c
contains some mempolicy handling for hugetlbfs; fs/ramfs is kept minimal,
so nothing in there.

Those are the memory-based filesystems, where NUMA mempolicy is most
natural.  The regular filesystems could support shared mempolicy too,
but that would raise more awkward design questions.
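As a concrete illustration of the two interfaces Hugh describes, a mount-time policy on tmpfs versus a per-process policy applied through set_mempolicy() (mount point and program name are placeholders):

# interleave a tmpfs mount's pages across nodes 0 and 1
mount -t tmpfs -o size=1G,mpol=interleave:0-1 tmpfs /mnt/test
# interleave a single process's memory via numactl
numactl --interleave=0,1 ./some_program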


I found that if I mbind several processes to one node and almost exhaust
its memory, the processes just get stuck: none of them makes progress or
gets killed. Is that normal?




Hugh




Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree

2013-03-05 Thread Will Huck

Hi Hugh,
On 03/02/2013 10:57 AM, Hugh Dickins wrote:

How does ksm treat a ksm-forked page? IIUC, it's not merged into the ksm
stable tree. Will it just be ignored?



On Sat, 2 Mar 2013, Ric Mason wrote:

On 03/02/2013 04:03 AM, Hugh Dickins wrote:

On Fri, 1 Mar 2013, Ric Mason wrote:

I think the ksm implementation for NUMA awareness is buggy.

Sorry, I just don't understand your comments below,
but will try to answer or question them as best I can.


For the page migration case, the new page is allocated on the node *which the
page is migrated to*.

Yes, by definition.


- when meeting a page from the wrong NUMA node in an unstable tree
  get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page)

I thought you were writing of the wrong NUMA node case,
but now you emphasize "*==*", which means the right NUMA node.

Yes, I mean the wrong NUMA node. During page migration, the new page has already
been allocated on the new node and the old page may be freed.  So tree_page is a
page in the new node's unstable tree, and page is also a new-node page, so
get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page).

I don't understand; but here you seem to be describing a case where two
pages from the same NUMA node get merged (after both have been migrated
from another NUMA node?), and there's nothing wrong with that,
so I won't worry about it further.


 - meeting a page which is ksm page before migration
   get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can't
capture
them since stable_node is for tree page in current stable tree. They are
always equal.

When we meet a ksm page in the stable tree before it's migrated to another
NUMA node, yes, it will be on the right NUMA node (because we were careful
only to merge pages from the right NUMA node there), and that test will not
capture them.  It's for capturng a ksm page in the stable tree after it has
been migrated to another NUMA node.

Why is a ksm page that has been migrated to another NUMA node still not freed?
Who holds a page count reference on it?

The old page, the one which used to be a ksm page on the old NUMA node,
should be freed very soon: since it was isolated from lru, and its page
count checked, I cannot think of anything to hold a reference to it,
apart from migration itself - so it just needs to reach putback_lru_page(),
and then may rest awhile on __lru_cache_add()'s pagevec before being freed.

But I don't see where I said the old page was still not freed.


If it is not freed, then since the new page is allocated on the new node it is
a copy of the current ksm page, so the current ksm page doesn't change and
get_kpfn_nid(stable_node->kpfn) *==* NUMA(stable_node->nid).

But ksm_migrate_page() did
VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
stable_node->kpfn = page_to_pfn(newpage);
without changing stable_node->nid.

Hugh
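
As an aside, a small self-contained C toy that models the point being made
here; the struct and helpers are hypothetical stand-ins for ksm's stable_node,
get_kpfn_nid() and NUMA(), not the real mm/ksm.c code:

#include <stdio.h>

/* Toy model: ksm_migrate_page() updates kpfn to the new page but leaves
 * nid alone, so after a cross-node migration the two stop agreeing and
 * the stale stable-tree entry can be detected on a later tree walk. */
struct stable_node_demo {
	unsigned long kpfn;	/* pfn of the current KSM page */
	int nid;		/* node the stable-tree entry was built for */
};

/* hypothetical stand-in for get_kpfn_nid(): pretend node = pfn / 1000 */
static int demo_pfn_to_nid(unsigned long kpfn)
{
	return (int)(kpfn / 1000);
}

static int stable_node_is_stale(const struct stable_node_demo *sn)
{
	return demo_pfn_to_nid(sn->kpfn) != sn->nid;
}

int main(void)
{
	struct stable_node_demo sn = { .kpfn = 123, .nid = 0 };

	printf("before migration: stale=%d\n", stable_node_is_stale(&sn));
	sn.kpfn = 2123;		/* migration: kpfn updated, nid left as is */
	printf("after migration:  stale=%d\n", stable_node_is_stale(&sn));
	return 0;
}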



Re: [PATCH 2/2] tmpfs: fix mempolicy object leaks

2013-03-03 Thread Will Huck


Hi Hugh,
On 02/21/2013 04:26 AM, Hugh Dickins wrote:

On Tue, 19 Feb 2013, Greg Thelen wrote:


This patch fixes several mempolicy leaks in the tmpfs mount logic.
These leaks are slow - on the order of one object leaked per mount
attempt.

Leak 1 (umount doesn't free mpol allocated in mount):
 while true; do
 mount -t tmpfs -o mpol=interleave,size=100M nodev /mnt
 umount /mnt
 done

Leak 2 (errors parsing remount options will leak mpol):
 mount -t tmpfs -o size=100M nodev /mnt
 while true; do
 mount -o remount,mpol=interleave,size=x /mnt 2> /dev/null
 done
 umount /mnt

Leak 3 (multiple mpol per mount leak mpol):
 while true; do
 mount -t tmpfs -o mpol=interleave,mpol=interleave,size=100M nodev /mnt
 umount /mnt
 done

This patch fixes all of the above.  I could have broken the patch into
three pieces but it seemed easier to review as one.

Yes, I agree, and nicely fixed - but one doubt below.  If you resolve
that, please add my Acked-by: Hugh Dickins 


Could you explain why shmem has a closer relationship with mempolicy? It
seems that there is a lot of code in shmem handling mempolicy, but other
components in the mm subsystem have very little.





Signed-off-by: Greg Thelen 
---
  mm/shmem.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index efd0b3a..ed2cb26 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2386,6 +2386,7 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
   bool remount)
  {
char *this_char, *value, *rest;
+   struct mempolicy *mpol = NULL;
uid_t uid;
gid_t gid;
  
@@ -2414,7 +2415,7 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,

printk(KERN_ERR
"tmpfs: No value for mount option '%s'\n",
this_char);
-   return 1;
+   goto error;
}
  
  		if (!strcmp(this_char,"size")) {

@@ -2463,19 +2464,23 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
if (!gid_valid(sbinfo->gid))
goto bad_val;
} else if (!strcmp(this_char,"mpol")) {
-   if (mpol_parse_str(value, &sbinfo->mpol))
+   mpol_put(mpol);

I haven't tested to check, but don't we need
mpol = NULL;
here, in case the new option turns out to be bad?


+   if (mpol_parse_str(value, &mpol))
goto bad_val;
} else {
printk(KERN_ERR "tmpfs: Bad mount option %s\n",
   this_char);
-   return 1;
+   goto error;
}
}
+   sbinfo->mpol = mpol;
return 0;
  
  bad_val:

printk(KERN_ERR "tmpfs: Bad value '%s' for mount option '%s'\n",
   value, this_char);
+error:
+   mpol_put(mpol);
return 1;
  
  }

@@ -2551,6 +2556,7 @@ static void shmem_put_super(struct super_block *sb)
struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
  
  	percpu_counter_destroy(&sbinfo->used_blocks);

+   mpol_put(sbinfo->mpol);
kfree(sbinfo);
sb->s_fs_info = NULL;
  }
--
1.8.1.3



Re: [Bug fix PATCH 1/2] acpi, movablemem_map: Exclude memblock.reserved ranges when parsing SRAT.

2013-02-20 Thread Will Huck

On 02/20/2013 08:31 PM, Tang Chen wrote:

On 02/20/2013 07:00 PM, Tang Chen wrote:
As mentioned by HPA before, when we are using movablemem_map=acpi, if 
all the
memory ranges in SRAT is hotpluggable, then no memory can be used by 
kernel.


Before parsing SRAT, memblock has already reserve some memory ranges 
for other
purposes, such as for kernel image, and so on. We cannot prevent 
kernel from
using these memory. So we need to exclude these ranges even if these 
memory is

hotpluggable.

This patch changes the movablemem_map=acpi option's behavior. The 
memory ranges
reserved by memblock will not be added into movablemem_map.map[]. So 
even if
all the memory is hotpluggable, there will always be memory that 
could be used

by the kernel.



What's the relationship between e820 map and SRAT?


Reported-by: H Peter Anvin
Signed-off-by: Tang Chen
---
  arch/x86/mm/srat.c |   18 +-
  1 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 62ba97b..b8028b2 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -145,7 +145,7 @@ static inline int save_add_info(void) {return 0;}
  static void __init
  handle_movablemem(int node, u64 start, u64 end, u32 hotpluggable)
  {
-int overlap;
+int overlap, i;
  unsigned long start_pfn, end_pfn;

  start_pfn = PFN_DOWN(start);
@@ -161,8 +161,24 @@ handle_movablemem(int node, u64 start, u64 end, u32 hotpluggable)
 	 *
 	 * Using movablemem_map, we can prevent memblock from allocating memory
 	 * on ZONE_MOVABLE at boot time.
+	 *
+	 * Before parsing SRAT, memblock has already reserved some memory ranges
+	 * for other purposes, such as for the kernel image. We cannot prevent
+	 * the kernel from using this memory, so we need to exclude these ranges
+	 * even if the memory is hotpluggable.
 	 */
 	if (hotpluggable && movablemem_map.acpi) {
+		/* Exclude ranges reserved by memblock. */
+		struct memblock_type *rgn = &memblock.reserved;
+
+		for (i = 0; i < rgn->cnt; i++) {
+			if (end <= rgn->regions[i].base ||
+			    start >= rgn->regions[i].base +
+				     rgn->regions[i].size)

Hi all,

Here, I scan memblock.reserved each time we parse an entry, because
rgn->regions[i].nid is set to MAX_NUMNODES in memblock_reserve(), so I
cannot obtain the nid the kernel resides on directly from memblock.reserved.

I think there could be problems if the memory ranges in SRAT are not in
increasing order: if [3,4) and [1,2) are both on node0, and the kernel is
using [1,2) but not [3,4), then I cannot remove [3,4) because I don't know
which node [3,4) is on.

Any ideas for this?

And by the way, I think this approach works well when the memory entries
in SRAT are arranged in increasing order.

Thanks. :)


+				continue;
+			goto out;
+		}
+
 		insert_movablemem_map(start_pfn, end_pfn);

  /*




Re: Should a swapped out page be deleted from swap cache?

2013-02-19 Thread Will Huck

On 02/20/2013 03:06 AM, Hugh Dickins wrote:

On Tue, 19 Feb 2013, Will Huck wrote:

Another question:

I don't see the connection to deleting a swapped out page from swap cache.


Why does kernel memory mapping use the direct mapping instead of
kmalloc/vmalloc, which set up mappings on demand?

I may misunderstand you, and "kernel memory mapping".

kmalloc does not set up a mapping, it uses the direct mapping already set up.

It would be circular if the basic page allocation primitives used kmalloc,
since kmalloc relies on the basic page allocation primitives.

vmalloc is less efficient than using the direct mapping (repeated setup
and teardown, no use of hugepages), but necessary when you want a larger


Are there TLB flushes in the setup and teardown process, and are they also expensive?


virtual array than you're likely to find from the buddy allocator.

Hugh
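
To make the contrast concrete, a minimal, purely illustrative module sketch
(module name and buffer sizes are made up): kmalloc() hands back memory that
is already covered by the direct mapping, while vmalloc() has to build
page-table entries for a new virtually contiguous range, and vfree() is where
that mapping is torn down; roughly speaking, the teardown path is also where
TLB flushing comes in, batched by the lazy purging of vmap areas.

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

static void *k_buf;
static void *v_buf;

static int __init alloc_demo_init(void)
{
	/* Physically contiguous; the returned address is simply the pages'
	 * location in the already-established direct mapping. */
	k_buf = kmalloc(16 * 1024, GFP_KERNEL);

	/* Virtually contiguous only; each underlying page is mapped one by
	 * one into the vmalloc area, so page tables are set up here. */
	v_buf = vmalloc(4 * 1024 * 1024);

	if (!k_buf || !v_buf) {
		kfree(k_buf);
		vfree(v_buf);
		return -ENOMEM;
	}
	return 0;
}

static void __exit alloc_demo_exit(void)
{
	kfree(k_buf);
	vfree(v_buf);	/* the vmalloc mapping is torn down here */
}

module_init(alloc_demo_init);
module_exit(alloc_demo_exit);
MODULE_LICENSE("GPL");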




Re: Should a swapped out page be deleted from swap cache?

2013-02-18 Thread Will Huck

On 02/19/2013 10:04 AM, Li Haifeng wrote:

2013/2/19 Hugh Dickins 

On Mon, 18 Feb 2013, Li Haifeng wrote:


To explain my question, the two points are laid out below.

1.  If an anonymous page is swapped out, this page will be deleted
from the swap cache and put back into the buddy system.

Yes, unless the page is referenced again before it comes to be
deleted from swap cache.


2. When a page is swapped out, the sharing count of swap slot must not
be zero. That is, page_swapcount(page) will not return zero.

I would not say "must not": we just prefer not to waste time on swapping
a page out if its use count has already gone to 0.  And its use count
might go down to 0 an instant after swap_writepage() makes that check.


Thanks for your reply and patience.

If an anonymous page is swapped out and becomes reclaimable,
shrink_page_list() will call __remove_mapping() to delete the swapped-out
page from the swap cache. The corresponding code is listed below.


I'm not sure if
	if (PageAnon(page) && !PageSwapCache(page)) {
		...
	}
will add the page to the swap cache again.



  765 static unsigned long shrink_page_list(struct list_head *page_list,
  766   struct mem_cgroup_zone *mz,
  767   struct scan_control *sc,
  768   int priority,
  769   unsigned long *ret_nr_dirty,
  770   unsigned long *ret_nr_writeback)
  771 {
...
  971 if (!mapping || !__remove_mapping(mapping, page))
  972 goto keep_locked;
  973
  974 /*
  975  * At this point, we have no other references and there is
  976  * no way to pick any more up (removed from LRU, removed
  977  * from pagecache). Can use non-atomic bitops now (and
  978  * we obviously don't have to worry about waking
up a process
  979  * waiting on the page lock, because there are no
references.
  980  */
  981 __clear_page_locked(page);
  982 free_it:
  983 nr_reclaimed++;
  984
  985 /*
  986  * Is there need to periodically free_page_list? It would
  987  * appear not as the counts should be low
  988  */
  989 list_add(&page->lru, &free_pages);
  990 continue;

Please correct me if my understanding is wrong.

Thanks.

Are both of them above right?

According to the two points above, I was confused by line 655 below.
When a page is swapped out, the return value of page_swapcount(page)
will not be zero, so the page couldn't be deleted from the swap cache.

Yes, we cannot free the swap as long as its data might be needed again.

But a swap cache page may linger in memory for an indefinite time,
in between being queued for write out, and actually being freed from
the end of the lru by memory pressure.

At various points where we hold the page lock on a swap cache page,
it's worth checking whether it is still actually needed, or could
now be freed from swap cache, and the corresponding swap slot freed:
that's what try_to_free_swap() does.

I do agree. Thanks again.

Hugh


  644  * If swap is getting full, or if there are no more mappings of this page,
  645  * then try_to_free_swap is called to free its swap space.
  646  */
  647 int try_to_free_swap(struct page *page)
  648 {
  649 VM_BUG_ON(!PageLocked(page));
  650
  651 if (!PageSwapCache(page))
  652 return 0;
  653 if (PageWriteback(page))
  654 return 0;
  655 if (page_swapcount(page))//Has referenced by other swap out page.
  656 return 0;
  657
  658 /*
  659  * Once hibernation has begun to create its image of memory,
  660  * there's a danger that one of the calls to try_to_free_swap()
  661  * - most probably a call from __try_to_reclaim_swap() while
  662  * hibernation is allocating its own swap pages for the image,
  663  * but conceivably even a call from memory reclaim - will free
  664  * the swap from a page which has already been recorded in the
  665  * image as a clean swapcache page, and then reuse its swap for
  666  * another page of the image.  On waking from hibernation, the
  667  * original page might be freed under memory pressure, then
  668  * later read back in from swap, now with the wrong data.
  669  *
  670  * Hibration suspends storage while it is writing the image
  671  * to disk so check that here.
  672  */
  673 if (pm_suspended_storage())
  674 return 0;
  675
  676 delete_from_swap_cache(page);
  677 SetPageDirty(page);
  678 return 1;
  679 }

Thanks.


Re: Should a swapped out page be deleted from swap cache?

2013-02-18 Thread Will Huck

Hi Hugh,
On 02/19/2013 02:06 AM, Hugh Dickins wrote:

Another question:

Why does kernel memory mapping use the direct mapping instead of
kmalloc/vmalloc, which set up mappings on demand?



On Mon, 18 Feb 2013, Li Haifeng wrote:


To explain my question, the two points are laid out below.

1.  If an anonymous page is swapped out, this page will be deleted
from the swap cache and put back into the buddy system.

Yes, unless the page is referenced again before it comes to be
deleted from swap cache.


2. When a page is swapped out, the sharing count of swap slot must not
be zero. That is, page_swapcount(page) will not return zero.

I would not say "must not": we just prefer not to waste time on swapping
a page out if its use count has already gone to 0.  And its use count
might go down to 0 an instant after swap_writepage() makes that check.


Are both of them above right?

According to the two points above, I was confused by line 655 below.
When a page is swapped out, the return value of page_swapcount(page)
will not be zero, so the page couldn't be deleted from the swap cache.

Yes, we cannot free the swap as long as its data might be needed again.

But a swap cache page may linger in memory for an indefinite time,
in between being queued for write out, and actually being freed from
the end of the lru by memory pressure.

At various points where we hold the page lock on a swap cache page,
it's worth checking whether it is still actually needed, or could
now be freed from swap cache, and the corresponding swap slot freed:
that's what try_to_free_swap() does.

Hugh


  644  * If swap is getting full, or if there are no more mappings of this page,
  645  * then try_to_free_swap is called to free its swap space.
  646  */
  647 int try_to_free_swap(struct page *page)
  648 {
  649 VM_BUG_ON(!PageLocked(page));
  650
  651 if (!PageSwapCache(page))
  652 return 0;
  653 if (PageWriteback(page))
  654 return 0;
  655 if (page_swapcount(page))//Has referenced by other swap out page.
  656 return 0;
  657
  658 /*
  659  * Once hibernation has begun to create its image of memory,
  660  * there's a danger that one of the calls to try_to_free_swap()
  661  * - most probably a call from __try_to_reclaim_swap() while
  662  * hibernation is allocating its own swap pages for the image,
  663  * but conceivably even a call from memory reclaim - will free
  664  * the swap from a page which has already been recorded in the
  665  * image as a clean swapcache page, and then reuse its swap for
  666  * another page of the image.  On waking from hibernation, the
  667  * original page might be freed under memory pressure, then
  668  * later read back in from swap, now with the wrong data.
  669  *
  670  * Hibration suspends storage while it is writing the image
  671  * to disk so check that here.
  672  */
  673 if (pm_suspended_storage())
  674 return 0;
  675
  676 delete_from_swap_cache(page);
  677 SetPageDirty(page);
  678 return 1;
  679 }

Thanks.
