Hi Dave,

On 2/16/2018 4:18 PM, Dave Anderson wrote:
...
>>>> OK, I understand your point.  But what concerns me is that the function's
>>>> purpose is to absolutely identify whether the incoming page structure
>>>> address is a correct page structure address.  But if an invalid address
>>>> gets passed into is_page_ptr(), your patch would take the invalid address,
>>>> calculate an invalid "nr", and continue from there, right?
>>
>> Yes, if an invalid "nr" refers to a section that does not exist,
>> valid_section_nr() would return 0.  Even if it happens to refer to an
>> existing section, the invalid "addr" will not fall between mem_map and
>> end_mem_map, or will not be page-aligned, because if it did, it would be
>> a page structure address.
>>
>> Also, without this patch, when an invalid address comes in, the loop
>> could try many invalid "nr" values below NR_MEM_SECTIONS().
>>
>> I hope this answers your concern.
>>
>>>
>>> Another suggestion/question -- if is_page_ptr() is called with a NULL phys
>>> argument (as is done most of the time), could it skip the "if IS_SPARSEMEM()"
>>> section at the top, and still utilize the part at the bottom, where it
>>> walks through the vt->node_table[x] array?  I'm not sure about the "ppend"
>>> calculation though -- even if there are holes in the node's address space,
>>> is it still a contiguous chunk of page structure addresses per-node?
>>
>> I'm still investigating and am not sure yet, but I think the fact that
>> SPARSEMEM uses mem_section instead of node_mem_map means that page
>> structures could be non-contiguous within a node, depending on the
>> architecture or configuration.
>>
>>   typedef struct pglist_data {
>>   ...
>>   #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
>>           struct page *node_mem_map;
>>
>> I'll continue to check it.
> 
> You are right, but in the case where pglist_data.node_mem_map does *not*
> exist, the crash utility initializes each vt->node_table[node].mem_map with
> the node's starting mem_map address by using the return value from
> phys_to_page() of the node's starting physical address -- which uses the
> sparsemem functions.
>  
> The question is whether the current "ppend" calculation is correct for the
> last physical page in a node.  If it is not correct, then perhaps a
> "mem_map_end" value can be added to the node_table structure, initialized
> by using phys_to_page() to get the page address of the last physical
> address in the node.  And then in that case, the question is whether the
> mem_map range of virtual addresses is contiguous -- even if there are
> holes in the mem_map virtual address range.

"node_size" is set to pglist_data.node_spanned_pages, which includes holes.
So I think that if VMEMMAP, which a page address is linear against its pfn,
the current "ppend" calculation is correct for the last page in a node.
But if not VMEMMAP, since there is no guarantee of the linearity, the
calculation could be incorrect.
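
For reference, a minimal sketch of the calculation in question (the field
names follow the crash utility's node_table; the contiguity assumption is
the one described above):

  /*
   * Sketch only: how "ppend" follows from the node table.  It is
   * valid only if the page structs for all spanned pfns are laid out
   * contiguously in virtual memory, which only VMEMMAP guarantees.
   */
  ulong mem_map = vt->node_table[node].mem_map;  /* first page struct */
  ulong size    = vt->node_table[node].size;     /* node_spanned_pages,
                                                    holes included */
  ulong ppend   = mem_map + (size * SIZE(page)); /* one past the last */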

I found an example with RHEL5:

crash> help -o
...
                    size_table:
                          page: 56
...
crash> kmem -n
NODE    SIZE      PGLIST_DATA       BOOTMEM_DATA       NODE_ZONES
  0    524279   ffff810000014000  ffffffff804e1900  ffff810000014000
                                                    ffff810000014b00
                                                    ffff810000015600
                                                    ffff810000016100
    MEM_MAP       START_PADDR  START_MAPNR
ffff8100007da000       0            0

ZONE  NAME         SIZE       MEM_MAP      START_PADDR  START_MAPNR
  0   DMA          4096  ffff8100007da000            0            0
  1   DMA32      520183  ffff810000812000      1000000         4096
  2   Normal          0                 0            0            0
  3   HighMem         0                 0            0            0

-------------------------------------------------------------------

NR      SECTION        CODED_MEM_MAP        MEM_MAP       PFN
 0  ffff810009000000  ffff8100007da000  ffff8100007da000  0
 1  ffff810009000008  ffff8100007da000  ffff81000099a000  32768
 2  ffff810009000010  ffff8100007da000  ffff810000b5a000  65536
 3  ffff810009000018  ffff8100007da000  ffff810000d1a000  98304   <= there is a
 4  ffff810009000020  ffff810008901000  ffff810009001000  131072  <= mem_map gap.
 5  ffff810009000028  ffff810008901000  ffff8100091c1000  163840
 :
14  ffff810009000070  ffff810008901000  ffff81000a181000  458752
15  ffff810009000078  ffff810008901000  ffff81000a341000  491520
crash> 

In this case, the "ppend" will be

  0xffff8100007da000 + (524279 * 56)
  = 0xffff8100007da000 + 0x1bffe08
  = 0xffff8100023d9e08

but it looks like the actual value is around 0xffff81000a501000.

Also, we can see the gap between NR=3 and NR=4.  This means that even if the
correct "mem_map_end" were added to the node_table structure, it would not
be enough to check whether an address is a page structure.
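
For what it's worth, here is a rough sketch (not a proposed patch, and
"looks_like_page_ptr" is a hypothetical name) of the kind of per-section
test that would be needed instead, since a single [mem_map, mem_map_end)
range check cannot account for such gaps:

  /* Hypothetical helper, mirroring the per-section logic that
   * is_page_ptr() already applies inside its loop. */
  int
  looks_like_page_ptr(ulong addr, ulong nr)
  {
          ulong sec_addr, coded_mem_map, mem_map, end_mem_map;

          if (!(sec_addr = valid_section_nr(nr)))
                  return FALSE;

          coded_mem_map = section_mem_map_addr(sec_addr);
          mem_map = sparse_decode_mem_map(coded_mem_map, nr);
          end_mem_map = mem_map + (PAGES_PER_SECTION() * SIZE(page));

          /* inside this section's mem_map, and page-struct aligned */
          if ((addr < mem_map) || (addr >= end_mem_map))
                  return FALSE;
          if ((addr - mem_map) % SIZE(page))
                  return FALSE;

          return TRUE;
  }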

Thanks,
Kazuhito Hagio

> 
> Thanks,
>   Dave
> 
> 
> 
>>
>> Thanks,
>> Kazuhito Hagio
>>
>>>
>>>>
>>>> Dave
>>>>
>>>>>
>>>>>>
>>>>>> There is really no compelling reason that count_partial() absolutely
>>>>>> *must* use is_page_ptr(), and so I'm thinking that perhaps you could
>>>>>> come up with a less heavy-handed method for simply testing whether a
>>>>>> page.lru entry points to another vmemmap'd page.  Something along the
>>>>>> lines of adding this for vmemmap-enabled kernels:
>>>>>>
>>>>>>   #define IN_VMEMMAP_RANGE(page) ((page >= VMEMMAP_VADDR) && (page <= VMEMMAP_END))
>>>>>>
>>>>>> and then have count_partial() replace the is_page_ptr() call with
>>>>>> another slub function that does something like this for vmemmap-enabled
>>>>>> kernels:
>>>>>>
>>>>>>    (IN_VMEMMAP_RANGE(next) && accessible(next))
>>>>>>
>>>>>> Or instead of accessible(), it could read "next" as a list_head with
>>>>>> RETURN_ON_ERROR, and verify that next->prev points back to the current
>>>>>> list_head.
>>>>>>
>>>>>> Non-vmemmap-enabled kernels could still use is_page_ptr().
>>>>>>
>>>>>> What do you think of doing something like that?
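
For concreteness, a rough sketch of the check described above (the helper
name and the exact reads are illustrative, not from the crash source):

  #define IN_VMEMMAP_RANGE(page) \
          (((page) >= VMEMMAP_VADDR) && ((page) <= VMEMMAP_END))

  /* Hypothetical helper: "next" is the page.lru pointer being
   * validated, and "curr" is the list_head it was read from. */
  static int
  vmemmap_page_ptr(ulong next, ulong curr)
  {
          struct kernel_list_head list;

          if (!IN_VMEMMAP_RANGE(next))
                  return FALSE;

          /* read "next" as a list_head; fail quietly if unreadable */
          if (!readmem(next, KVADDR, &list, sizeof(list),
              "page.lru", RETURN_ON_ERROR|QUIET))
                  return FALSE;

          /* a valid list entry's prev should point back at us */
          return ((ulong)list.prev == curr);
  }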
>>>>>
>>>>> Given the possible compatibility issues you mentioned, I think the way
>>>>> you suggested might well be enough for now.  I'll try a method like the
>>>>> above.
>>>>>
>>>>> Thanks,
>>>>> Kazuhito Hagio
>>>>>
>>>>>>
>>>>>> Dave
>>>>>>
>>>>>>
>>>>>>       
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> Hi,
>>>>>>>
>>>>>>> The "kmem -[sS]" commands can take several minutes to complete with
>>>>>>> the following conditions:
>>>>>>>   * The system has a lot of memory sections with CONFIG_SPARSEMEM.
>>>>>>>   * The kernel uses SLUB and it has a very long partial slab list.
>>>>>>>
>>>>>>>   crash> kmem -s dentry | awk '{print strftime("%T"), $0}'
>>>>>>>   10:18:34 CACHE            NAME            OBJSIZE  ALLOCATED     TOTAL   SLABS  SSIZE
>>>>>>>   10:19:41 ffff88017fc78a00 dentry              192    9038949  10045728  239184     8k
>>>>>>>   crash> kmem -S dentry | bash -c 'cat >/dev/null ; echo $SECONDS'
>>>>>>>   334
>>>>>>>
>>>>>>> One of the causes is that is_page_ptr() in count_partial() checks if
>>>>>>> a given slub page address is a page struct by searching all memory
>>>>>>> sections linearly for the one which includes it.
>>>>>>>
>>>>>>>         nr_mem_sections = NR_MEM_SECTIONS();
>>>>>>>         for (nr = 0; nr < nr_mem_sections ; nr++) {
>>>>>>>                 if ((sec_addr = valid_section_nr(nr))) {
>>>>>>>                         ...
>>>>>>>
>>>>>>> With CONFIG_SPARSEMEM{_VMEMMAP}, we can calculate the memory section
>>>>>>> which includes a page struct from its page.flags, or from its address
>>>>>>> and VMEMMAP_VADDR.  With this patch doing so, the amount of computation
>>>>>>> can be significantly reduced in that case.
>>>>>>>
>>>>>>>   crash> kmem -s dentry | awk '{print strftime("%T"), $0}'
>>>>>>>   10:34:55 CACHE            NAME            OBJSIZE  ALLOCATED     TOTAL   SLABS  SSIZE
>>>>>>>   10:34:55 ffff88017fc78a00 dentry              192    9038949  10045728  239184     8k
>>>>>>>   crash> kmem -S dentry | bash -c 'cat >/dev/null ; echo $SECONDS'
>>>>>>>   2
>>>>>>>
>>>>>>> This patch uses VMEMMAP_VADDR.  It is not defined on PPC64, but it looks
>>>>>>> like PPC64 supports the VMEMMAP flag and machdep->machspec->vmemmap_base
>>>>>>> serves that purpose, so this patch also defines it for PPC64.  This might
>>>>>>> need some help from the PPC folks.
>>>>>>>
>>>>>>> Signed-off-by: Kazuhito Hagio <k-ha...@ab.jp.nec.com>
>>>>>>> ---
>>>>>>>  defs.h   |  2 ++
>>>>>>>  memory.c | 15 +++++++++++++++
>>>>>>>  2 files changed, 17 insertions(+)
>>>>>>>
>>>>>>> diff --git a/defs.h b/defs.h
>>>>>>> index aa17792..84e68ca 100644
>>>>>>> --- a/defs.h
>>>>>>> +++ b/defs.h
>>>>>>> @@ -3861,6 +3861,8 @@ struct efi_memory_desc_t {
>>>>>>>  #define IS_VMALLOC_ADDR(X) machdep->machspec->is_vmaddr(X)
>>>>>>>  #define KERNELBASE      machdep->pageoffset
>>>>>>>  
>>>>>>> +#define VMEMMAP_VADDR   (machdep->machspec->vmemmap_base)
>>>>>>> +
>>>>>>>  #define PGDIR_SHIFT     (machdep->pageshift + (machdep->pageshift -3) + (machdep->pageshift - 2))
>>>>>>>  #define PMD_SHIFT       (machdep->pageshift + (machdep->pageshift - 3))
>>>>>>>  
>>>>>>> diff --git a/memory.c b/memory.c
>>>>>>> index 0df8ecc..0696763 100644
>>>>>>> --- a/memory.c
>>>>>>> +++ b/memory.c
>>>>>>> @@ -13348,10 +13348,25 @@ is_page_ptr(ulong addr, physaddr_t *phys)
>>>>>>>         ulong nr_mem_sections;
>>>>>>>         ulong coded_mem_map, mem_map, end_mem_map;
>>>>>>>         physaddr_t section_paddr;
>>>>>>> +#ifdef VMEMMAP_VADDR
>>>>>>> +       ulong flags;
>>>>>>> +#endif
>>>>>>>  
>>>>>>>         if (IS_SPARSEMEM()) {
>>>>>>>                 nr_mem_sections = NR_MEM_SECTIONS();
>>>>>>> +#ifdef VMEMMAP_VADDR
>>>>>>> +               nr = nr_mem_sections;
>>>>>>> +               if (machdep->flags & VMEMMAP)
>>>>>>> +                       nr = pfn_to_section_nr((addr - VMEMMAP_VADDR) / SIZE(page));
>>>>>>> +               else if (readmem(addr + OFFSET(page_flags), KVADDR, &flags,
>>>>>>> +                       sizeof(ulong), "page.flags", RETURN_ON_ERROR|QUIET))
>>>>>>> +                       nr = (flags >> (SIZE(page_flags)*8 - SECTIONS_SHIFT())
>>>>>>> +                               & ((1UL << SECTIONS_SHIFT()) - 1));
>>>>>>> +
>>>>>>> +               if (nr < nr_mem_sections) {
>>>>>>> +#else
>>>>>>>                 for (nr = 0; nr < nr_mem_sections ; nr++) {
>>>>>>> +#endif
>>>>>>>                         if ((sec_addr = valid_section_nr(nr))) {
>>>>>>>                                 coded_mem_map = section_mem_map_addr(sec_addr);
>>>>>>>                                 mem_map = sparse_decode_mem_map(coded_mem_map, nr);
>>>>>>> --
>>>>>>> 1.8.3.1

--
Crash-utility mailing list
Crash-utility@redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility
