Hi All,

Under heavy stress with constant memory hot add/remove, I have observed the following loop occasionally spinning forever:

mm/memory_hotplug.c:__offline_pages

repeat:
       /* start memory hot removal */
       ret = -EINTR;
       if (signal_pending(current))
               goto failed_removal;

       cond_resched();
       lru_add_drain_all();
       drain_all_pages(zone);

       pfn = scan_movable_pages(start_pfn, end_pfn);
       if (pfn) { /* We have movable pages */
               ret = do_migrate_range(pfn, end_pfn);
               goto repeat;
       }

What appears to be happening in this case is that do_migrate_range returns a failure code that is ignored, so the loop jumps back to repeat unconditionally. The failure stems from migrate_pages returning "1", which I'm guessing is the result of us hitting the following case:

mm/migrate.c: migrate_pages

        default:
                /*
                 * Permanent failure (-EBUSY, -ENOSYS, etc.):
                 * unlike -EAGAIN case, the failed page is
                 * removed from migration page list and not
                 * retried in the next outer loop.
                 */
                nr_failed++;
                break;
        }

Does a failure in do_migrate_range indicate that the range is unmigratable, meaning the loop in __offline_pages should terminate and goto failed_removal? Or should we allow a certain number of retries before we give up on migrating the range?
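
To make the second option concrete, here is a rough userspace sketch of a bounded-retry loop. It is not a patch and not kernel code: MAX_MIGRATE_RETRIES, struct fake_range, and the fake_* helpers are names made up for illustration, and the stubs only imitate the behaviour described above (a scan that keeps finding pages, and a migrate call that returns a positive count of pages it could not move).

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical retry budget; the value is an assumption, not from the kernel. */
#define MAX_MIGRATE_RETRIES 5

/* Illustrative stand-in for the state of the range being offlined. */
struct fake_range {
	int movable_pages;	/* pages that migrate successfully */
	int pinned_pages;	/* pages that fail to migrate every time */
};

/* Stub imitating scan_movable_pages(): nonzero while any page remains. */
static unsigned long fake_scan_movable_pages(const struct fake_range *r)
{
	return (r->movable_pages + r->pinned_pages) > 0 ? 1 : 0;
}

/*
 * Stub imitating do_migrate_range(): moves every movable page and
 * returns the number of pages that could not be migrated.
 */
static int fake_do_migrate_range(struct fake_range *r)
{
	r->movable_pages = 0;
	return r->pinned_pages;
}

/*
 * Bounded variant of the repeat loop: a failed pass consumes one retry,
 * a clean pass resets the counter, and the loop gives up with -EBUSY
 * once MAX_MIGRATE_RETRIES consecutive passes have failed.
 */
static int offline_range_sketch(struct fake_range *r)
{
	int retries = 0;

	assert(r);

	while (fake_scan_movable_pages(r)) {
		int ret = fake_do_migrate_range(r);

		if (ret == 0) {
			retries = 0;	/* progress was made, start over */
			continue;
		}
		if (++retries >= MAX_MIGRATE_RETRIES)
			return -EBUSY;	/* analogous to goto failed_removal */
	}
	return 0;
}
```

With these stubs, a range that contains only movable pages finishes after one pass, while a range with a single page that never migrates returns -EBUSY after five failed passes instead of looping forever. Whether -EBUSY is the right error, and whether the counter should reset on partial progress, are open choices in this sketch.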

This issue was observed on a ppc64le LPAR running a 4.18-rc6 kernel.

-John
