On 9/20/16 10:53 PM, Dave Hansen wrote:
On 09/20/2016 07:45 AM, Rui Teng wrote:
On 9/17/16 12:25 AM, Dave Hansen wrote:

That's an interesting data point, but it still doesn't quite explain
what is going on.

It seems like there might be parts of gigantic pages that have
PageHuge() set on tail pages, while other parts don't.  If that's true,
we have another bug and your patch just papers over the issue.

I think you really need to find the root cause before we apply this
patch.

The root cause is that the test script
(tools/testing/selftests/memory-hotplug/mem-on-off-test.sh) changes the
online/offline status of memory blocks other than the one that holds the
head page. It *randomly* selects 10% of the memory blocks from
/sys/devices/system/memory/memory* and changes their online/offline
status.

Ahh, that does explain it!  Thanks for digging into that!

That's why we need a PageHead() check now, and why this problem does
not happen on systems with smaller huge pages, such as 16M.
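
To put rough numbers on this, here is a minimal user-space sketch. The
16G gigantic page size and the 256M memory block size used in the usage
note are assumptions for illustration, not values taken from this
thread, and blocks_spanned() and tail_only_blocks() are hypothetical
helpers, not kernel functions. The sketch also assumes power-of-two
sizes and a naturally aligned huge page.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of memory blocks that one naturally
 * aligned huge page touches. Sizes are assumed to be powers of two. */
static uint64_t blocks_spanned(uint64_t huge_page_bytes, uint64_t block_bytes)
{
    assert(block_bytes != 0);
    /* A huge page no larger than a block fits inside a single block. */
    if (huge_page_bytes <= block_bytes)
        return 1;
    return huge_page_bytes / block_bytes;
}

/* Hypothetical helper: blocks that hold only tail pages of the huge
 * page, i.e. every spanned block except the one with the head page. */
static uint64_t tail_only_blocks(uint64_t huge_page_bytes, uint64_t block_bytes)
{
    return blocks_spanned(huge_page_bytes, block_bytes) - 1;
}
```

Under those assumed sizes, tail_only_blocks(16ULL << 30, 256ULL << 20)
is 63, so a randomly chosen block inside the gigantic page almost always
starts on a tail page. For a 16M huge page the result is 0, so the head
page is always in the block being offlined.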

As for PageHuge() being set, I think PageHuge() returns true for all
tail pages, because it first looks up the compound head of the tail
page and then checks the huge page flag on that head:
    page = compound_head(page);
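
The behaviour described above can be modelled in user space. This is a
toy sketch only: struct toy_page and the toy_*() functions are
hypothetical stand-ins, not the kernel's struct page or its real
PageHuge()/PageHead() implementations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; all field names are hypothetical. */
struct toy_page {
    struct toy_page *head; /* points at the head page for tail pages, else NULL */
    bool is_head;          /* true only on the head page of a compound page */
    bool hugetlb;          /* huge page marker, only meaningful on the head */
};

/* Resolve a tail page to its head; other pages resolve to themselves. */
static struct toy_page *toy_compound_head(struct toy_page *page)
{
    assert(page != NULL);
    return page->head ? page->head : page;
}

/* Mirrors the lookup order discussed above: go to the head first, then
 * test the flag there, so every tail page reports "huge" as well. */
static bool toy_page_huge(struct toy_page *page)
{
    page = toy_compound_head(page);
    return page->hugetlb;
}

/* Only the head page itself passes this check. */
static bool toy_page_head(const struct toy_page *page)
{
    assert(page != NULL);
    return page->is_head;
}
```

In this model a tail page satisfies toy_page_huge() but not
toy_page_head(), which is the distinction the extra PageHead() check
relies on.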

As for the failure message, if a memory block is in use, offlining it
returns a failure.

That's good, but aren't we still left with a situation where we've
offlined and dissolved the _middle_ of a gigantic huge page while the
head page is still in place and online?

That seems bad.

What about refusing to change the status of such a memory block if it
contains part of a huge page that is larger than the block itself? (in
function memory_block_action())
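
A rough user-space sketch of the check being proposed, using page frame
number ranges. struct pfn_range, ranges_overlap() and
block_status_change_allowed() are hypothetical names for illustration;
this is not the actual memory_block_action() code and it ignores
everything except the size comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical half-open range of page frame numbers: [start, start + nr). */
struct pfn_range {
    uint64_t start;
    uint64_t nr;
};

/* True if the two half-open ranges share at least one page frame. */
static bool ranges_overlap(struct pfn_range a, struct pfn_range b)
{
    return a.start < b.start + b.nr && b.start < a.start + a.nr;
}

/* Refuse the online/offline change when the block holds part of a huge
 * page that is larger than the block itself. */
static bool block_status_change_allowed(struct pfn_range block,
                                        struct pfn_range huge_page)
{
    assert(block.nr > 0);
    if (!ranges_overlap(block, huge_page))
        return true;               /* huge page lives elsewhere */
    return huge_page.nr <= block.nr; /* whole huge page fits in one block */
}
```

With a block of 4096 pages, a huge page of 262144 pages that overlaps
the block is refused, while a 256-page huge page inside the block is
allowed.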

I think it will not affect the hot-plug function too much. We can
set nr_hugepages to zero first if we really want to hot-plug that
memory.

I also found that the __test_page_isolated_in_pageblock() function
cannot handle a gigantic page well. It causes a device busy error
later. I am still investigating that.

Any suggestions?

Thanks!
