Re: [PATCH -mm] kexec jump -v9
On Thu, 2008-05-15 at 20:51 -0400, Vivek Goyal wrote:

On Thu, May 15, 2008 at 01:41:50PM +0800, Huang, Ying wrote:

Hi, Vivek,

On Wed, 2008-05-14 at 16:52 -0400, Vivek Goyal wrote:

[...]

Ok, I have done some testing on this patch. Currently I have just tested switching back and forth between two kernels and it is working for me. Just that I had to put LAPIC and IOAPIC in legacy mode for it to work. A few comments/questions are inline.

It seems that for LAPIC and IOAPIC, there are lapic_suspend()/lapic_resume() and ioapic_suspend()/ioapic_resume(), which will be called before/after the kexec jump through device_power_down()/device_power_up(). So the mechanism for LAPIC/IOAPIC is there; we may need to check the corresponding implementation.

ioapic_suspend() is not putting APICs in legacy mode and that's why we are seeing the issue. It only saves the IOAPIC routing table entries, and these entries are restored during ioapic_resume(). But I think somebody has to put APICs in legacy mode for normal hibernation also. Not sure who does it. Maybe the BIOS, so that during resume the second kernel can get the timer interrupts.

As for IOAPIC legacy mode, is it related to the following code, which sets the routing table entry for the i8259?

void disable_IO_APIC(void)
{
        /*
         * Clear the IO-APIC before rebooting:
         */
        clear_IO_APIC();

        /*
         * If the i8259 is routed through an IOAPIC
         * Put that IOAPIC in virtual wire mode
         * so legacy interrupts can be delivered.
         */
        if (ioapic_i8259.pin != -1) {
                struct IO_APIC_route_entry entry;

                memset(&entry, 0, sizeof(entry));
                entry.mask = 0; /* Enabled */
                entry.trigger = 0; /* Edge */
                entry.irr = 0;
                entry.polarity = 0; /* High */
                entry.delivery_status = 0;
                entry.dest_mode = 0; /* Physical */
                entry.delivery_mode = dest_ExtINT; /* ExtInt */
                entry.vector = 0;
                entry.dest.physical.physical_dest =
                        GET_APIC_ID(apic_read(APIC_ID));

                /*
                 * Add it to the IO-APIC irq-routing table:
                 */
                ioapic_write_entry(ioapic_i8259.apic, ioapic_i8259.pin, entry);
        }
        disconnect_bsp_APIC(ioapic_i8259.pin != -1);
}

But because the IOAPIC may need to be in its original state during suspend/resume, it is not appropriate to call disable_IO_APIC() in ioapic_suspend(). So I think we can call disable_IO_APIC() in the new hibernation/restore callback. Am I right?

Best Regards,
Huang Ying

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
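For readers following the virtual wire discussion, here is a minimal standalone sketch of what the quoted routing entry amounts to. It is not kernel code: `struct route_entry` and `make_virtual_wire_entry()` are hypothetical stand-ins for `struct IO_APIC_route_entry` and the body of `disable_IO_APIC()`, and the struct layout is illustrative only. The one hardware fact relied on is that ExtINT is delivery mode 7 (binary 111).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct IO_APIC_route_entry.
 * Field names follow the quoted code; the layout is illustrative only. */
struct route_entry {
    uint8_t vector;
    uint8_t delivery_mode;
    uint8_t dest_mode;
    uint8_t polarity;
    uint8_t trigger;
    uint8_t mask;
    uint8_t dest;
};

enum { DEST_EXTINT = 7 };  /* ExtINT delivery mode encoding (binary 111) */

/* Build the entry that routes the i8259 pin straight through to the
 * boot CPU: unmasked, edge-triggered, active high, physical destination. */
static struct route_entry make_virtual_wire_entry(uint8_t bsp_apic_id)
{
    struct route_entry e;

    memset(&e, 0, sizeof(e));       /* note the address-of operator */
    e.mask = 0;                     /* enabled */
    e.trigger = 0;                  /* edge */
    e.polarity = 0;                 /* active high */
    e.dest_mode = 0;                /* physical */
    e.delivery_mode = DEST_EXTINT;  /* let the 8259 supply the vector */
    e.vector = 0;
    e.dest = bsp_apic_id;
    return e;
}
```

In the real function the entry is then written to the IO-APIC with ioapic_write_entry(); the sketch stops before touching any hardware.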
Re: [PATCH -mm] kexec jump -v9
On Tue, 2008-05-27 at 18:15 -0400, Vivek Goyal wrote:

[...]

But because the IOAPIC may need to be in its original state during suspend/resume, it is not appropriate to call disable_IO_APIC() in ioapic_suspend(). So I think we can call disable_IO_APIC() in the new hibernation/restore callback.

My hunch is suspend/resume will still work if we put this call in ioapic_suspend(), but I would not recommend that. suspend/resume does not need to put the IOAPIC in legacy mode. I am not sure what the new hibernation/restore callback is. Are you referring to the new patches from Rafael?

Yes. Rafael has a new patch to separate the suspend and hibernation device callbacks.

http://kerneltrap.org/Linux/Separating_Suspend_and_Hibernation

I think this issue is specific to kexec and kjump, so probably we should not be tweaking any suspend/resume related bit. How about calling disable_IO_APIC() in kexec_jump()? We can probably even optimize it by calling it only when we are transitioning into the new image for the first time and not for subsequent transitions (by keeping some kind of count in kimage). This is a little hackish, but should work...

Yes. This issue is kexec/kjump specific. We can call it in kexec_jump(). Maybe we also need to call something else from native_machine_shutdown()?

BTW: I have a new version, -v10: http://lkml.org/lkml/2008/5/22/106. Do you have time to review it?

Best Regards,
Huang Ying
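The "count in kimage" idea suggested above could look roughly like the sketch below. Everything here is an assumption made for illustration: `struct kimage_sketch`, its `jump_count` field and `disable_io_apic_stub()` are invented names, not part of the patch, and the stub only counts calls instead of reprogramming the IO-APIC.

```c
#include <assert.h>

/* Hypothetical subset of struct kimage; jump_count is an invented field. */
struct kimage_sketch {
    int preserve_context;
    unsigned int jump_count;   /* transitions into this image so far */
};

static int legacy_mode_calls;  /* how often the stand-in below has run */

/* Stand-in for disable_IO_APIC(); the real function touches hardware. */
static void disable_io_apic_stub(void)
{
    legacy_mode_calls++;
}

/* Put the APICs in legacy mode only on the first transition into the
 * image; later transitions skip the call. */
static void kexec_jump_sketch(struct kimage_sketch *image)
{
    if (image->jump_count == 0)
        disable_io_apic_stub();
    image->jump_count++;
    /* ... the actual jump to the other kernel would happen here ... */
}
```

Whether skipping the call on later transitions is safe depends on what state the other kernel leaves the APICs in, which the thread leaves open.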
Re: [PATCH -mm] kexec jump -v9
On Fri 2008-05-16 09:48:34, Huang, Ying wrote:

On Thu, 2008-05-15 at 16:09 -0400, Vivek Goyal wrote:

[...]

Ok, you want to make BIOS calls. We already do that using vm86 mode and use BIOS real mode interrupts. So why do we need this interface? Or, IOW, how is this interface better?

It can call code in 32-bit physical mode in addition to real mode. So it can be used to call EFI runtime services, especially to call EFI 64 runtime services under a 32-bit kernel or vice versa.

The main purpose of kexec jump is hibernation. But I think if the effort is small, why not support general 32-bit physical mode code calls at the same time.

I believe we should focus on kexecing kernels first. The only way to prove the effort is small is by having a small followup patch, and that needs the two patches separated...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [linux-pm] [PATCH -mm] kexec jump -v9
On Wed, 14 May 2008, Eric W. Biederman wrote:

My take on the situation is this. For proper handling we need driver device_detach and device_reattach methods, with the following semantics. The device_detach methods will disable DMA and place the hardware in a sane state from which the device driver can reclaim and reinitialize it, but the hardware will not be touched. device_reattach reattaches the driver to the hardware.

How would these differ from the already-existing remove and probe methods?

Alan Stern
Re: [linux-pm] [PATCH -mm] kexec jump -v9
Alan Stern writes:

On Wed, 14 May 2008, Eric W. Biederman wrote:

My take on the situation is this. For proper handling we need driver device_detach and device_reattach methods, with the following semantics. The device_detach methods will disable DMA and place the hardware in a sane state from which the device driver can reclaim and reinitialize it, but the hardware will not be touched. device_reattach reattaches the driver to the hardware.

How would these differ from the already-existing remove and probe methods?

Honestly I would like for them not to, and they should be proper factors of the remove and probe methods.

However we have a fundamental gotcha that we need to handle: logical abstractions on physical devices. I.e., how do we handle the case of a filesystem on a block device, when we remove the block device and then read it?

We have two choices.

1) We go through the pain of teaching the upper layers in the kernel how to deal with hotplug, and then we are sane when someone removes a USB stick accidentally before unmounting it and then reinserts the USB stick.

2) Teach the drivers how to do just the lower half of hotplug/remove. In which case, with the driver still present and presenting its upper layer queues, we have the driver relinquish its hardware and then later check to see if its hardware is still present and reinitialize it.

I don't know if anyone has looked at moving this to an upper layer. Definitely a question worth asking. The simpler we can make this for driver authors the better, especially as that will make the drivers more maintainable long term.

Eric
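Choice 2 above (the driver keeps its upper-layer queue but lets go of the hardware) can be pictured with a toy model. This is only a sketch under stated assumptions: the names device_detach and device_reattach come from the proposal, but `struct device_sketch`, `submit()` and all of the behaviour shown are hypothetical and model no real driver or kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a driver whose upper-layer queue survives a detach. */
struct device_sketch {
    bool hw_owned;    /* driver currently controls the hardware */
    bool hw_present;  /* hardware is still physically there */
    int  queued;      /* requests held back while detached */
};

/* Relinquish the hardware; the queue stays visible to upper layers. */
static void device_detach(struct device_sketch *d)
{
    d->hw_owned = false;
}

/* Reclaim the hardware if it is still present and flush what was held.
 * Returns the number of flushed requests, or -1 if the device is gone. */
static int device_reattach(struct device_sketch *d)
{
    int flushed;

    if (!d->hw_present)
        return -1;
    d->hw_owned = true;
    flushed = d->queued;
    d->queued = 0;
    return flushed;
}

/* Returns 1 if the request went to the hardware, 0 if it was queued. */
static int submit(struct device_sketch *d)
{
    if (!d->hw_owned) {
        d->queued++;
        return 0;
    }
    return 1;
}
```

The point of the model is that a filesystem sitting on top never sees the device disappear; it only sees requests take longer.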
Re: [linux-pm] [PATCH -mm] kexec jump -v9
Rafael J. Wysocki writes:

On Thursday, 15 of May 2008, Eric W. Biederman wrote:

Rafael J. Wysocki writes:

Just an added partial data point. In the kexec case I have not heard anyone screaming to me that ACPI doesn't work after we switch kernels.

You don't remove power from devices while doing that.

No. It is the second half of S5, when we go from the boot kernel to the restored kernel, that I am talking about.

That path is exactly what happens successfully in the kexec case: transitioning from one kernel to another. If that path works reliably in kexec then we are talking about something that can be solved without respect to any specific ACPI implementation.

Eric
Re: [linux-pm] [PATCH -mm] kexec jump -v9
On Thursday, 15 of May 2008, Eric W. Biederman wrote:

Rafael J. Wysocki writes:

On Thursday, 15 of May 2008, Eric W. Biederman wrote:

Rafael J. Wysocki writes:

Just an added partial data point. In the kexec case I have not heard anyone screaming to me that ACPI doesn't work after we switch kernels.

You don't remove power from devices while doing that.

No. It is the second half of S5, when we go from the boot kernel to the restored kernel, that I am talking about.

Well, you don't remove the power from devices doing that, do you?

I was referring to the fact that you remove the power from devices after saving the image (i.e. in the poweroff stage). Then you initialize them and pass all that to the restored kernel, and the question here is:

(a) Should they be reinitialized before the restored kernel has a chance to access them?
(b) If they should, what state ought they to be in when the restored kernel accesses them?

That basically depends on how you're going to handle the resuming of devices, especially on the ACPI bus, in the restored kernel.

If we are to follow ACPI, the answer to (a) is no, except for devices used to read the image, and it's better if the boot kernel doesn't touch ACPI at all.

Then, the benefit of putting the system into S4 during the poweroff stage is that (a) the resume can be carried out faster and (b) the restored kernel may use some context preserved by the platform over the sleep state. Also, that allows you to use the wakeup capabilities of some devices that need not be available from S5.

In any case, however, I don't really think that doing the kexec jump before creating the image is really necessary. The kexec jump during resume is in fact very similar to what the current hibernation code does, but it's slightly more complicated. :-)

Thanks,
Rafael
Re: [PATCH -mm] kexec jump -v9
On Friday, 16 of May 2008, Eric W. Biederman wrote:

Rafael J. Wysocki writes:

Well, it looks like we do similar things concurrently. Please have a look here:

http://kerneltrap.org/Linux/Separating_Suspend_and_Hibernation

Yes. Part of the reason I wanted to separate these two conversations: I knew something was going on.

Similar patches are in Greg's tree already.

Taking a look.

I just can't get past the fact that the only reason hibernation can not use the widely implemented and tested probe/remove is because of filesystems on block devices, and that you are proposing to add 4 methods for each and every driver to handle that case, when they don't need ANYTHING!

Why exactly do you think that removing()/probing() devices just for creating a hibernation image is a good idea?

Also, ->poweroff() is actually similar to the late phase of ->suspend().

I wonder how hard teaching the upper layers to deal with hotplug/remove is?

The more I look at this the more I get the impression that hibernation and suspend should be solved in separate patches. I'm not at all convinced that what is good for the goose is good for the gander for things like your prepare method.

This was discussed a lot with people who had exactly opposite opinions. With BenH in particular (CCed).

Hibernation seems to be an extreme case of hotplug.

I don't agree with that.

Suspend seems to be just an extreme case of putting unused devices in low power state.

Ditto.

I don't like the fact that these methods are power management specific.

Please be more specific.

How should this impact the greater kernel ecosystem.

+ * The externally visible transitions are handled with the help of the following
+ * callbacks included in this structure:
+ *
+ * @prepare: Prepare the device for the upcoming transition, but do NOT change
+ *      its hardware state.  Prevent new children of the device from being
+ *      registered after @prepare() returns (the driver's subsystem and
+ *      generally the rest of the kernel is supposed to prevent new calls to the
+ *      probe method from being made too once @prepare() has succeeded).  If
+ *      @prepare() detects a situation it cannot handle (e.g. registration of a
+ *      child already in progress), it may return -EAGAIN, so that the PM core
+ *      can execute it once again (e.g. after the new child has been registered)
+ *      to recover from the race condition.  This method is executed for all
+ *      kinds of suspend transitions and is followed by one of the suspend
+ *      callbacks: @suspend(), @freeze(), or @poweroff().
+ *      The PM core executes @prepare() for all devices before starting to
+ *      execute suspend callbacks for any of them, so drivers may assume all of
+ *      the other devices to be present and functional while @prepare() is being
+ *      executed.  In particular, it is safe to make GFP_KERNEL memory
+ *      allocations from within @prepare(), although they are likely to fail in
+ *      case of hibernation, if a substantial amount of memory is requested.
+ *      However, drivers may NOT assume anything about the availability of the
+ *      user space at that time and it is not correct to request firmware from
+ *      within @prepare() (it's too late to do that).
+ *
+ * @complete: Undo the changes made by @prepare().  This method is executed for
+ *      all kinds of resume transitions, following one of the resume callbacks:
+ *      @resume(), @thaw(), @restore().  Also called if the state transition
+ *      fails before the driver's suspend callback (@suspend(), @freeze(),
+ *      @poweroff()) can be executed (e.g. if the suspend callback fails for one
+ *      of the other devices that the PM core has unsucessfully attempted to
+ *      suspend earlier).
+ *      The PM core executes @complete() after it has executed the appropriate
+ *      resume callback for all devices.

The names above are terrible. Perhaps: @pause/@unpause.

The names have been discussed either and I don't intend to change them now. Sorry.

@pause Stop all device driver user space facing activities, and prepare for a possible power state transition.

Essentially these should be very much like bringing an ethernet interface down. The device is still there but we can't do anything with it. The only difference is that this may not be user visible.

+ * @suspend: Executed before putting the system into a sleep state in which the
+ *      contents of main memory are preserved.  Quiesce the device, put it into
+ *      a low power state appropriate for the upcoming system state (such as
+ *      PCI_D3hot), and enable wakeup events as appropriate.
+ *
+ * @resume: Executed after waking the system up from a sleep state in which the
+ *      contents of main memory were preserved.  Put the device into the
+ *      appropriate state, according to the information saved in memory by the
+ *      preceding @suspend().  The driver starts working
Re: [PATCH -mm] kexec jump -v9
On Thu, May 15, 2008 at 01:41:50PM +0800, Huang, Ying wrote:

Hi, Vivek,

On Wed, 2008-05-14 at 16:52 -0400, Vivek Goyal wrote:

[...]

Ok, I have done some testing on this patch. Currently I have just tested switching back and forth between two kernels and it is working for me. Just that I had to put LAPIC and IOAPIC in legacy mode for it to work. A few comments/questions are inline.

It seems that for LAPIC and IOAPIC, there are lapic_suspend()/lapic_resume() and ioapic_suspend()/ioapic_resume(), which will be called before/after the kexec jump through device_power_down()/device_power_up(). So the mechanism for LAPIC/IOAPIC is there; we may need to check the corresponding implementation.

ioapic_suspend() is not putting APICs in legacy mode and that's why we are seeing the issue. It only saves the IOAPIC routing table entries, and these entries are restored during ioapic_resume().

But I think somebody has to put APICs in legacy mode for normal hibernation also. Not sure who does it. Maybe the BIOS, so that during resume the second kernel can get the timer interrupts.

Thanks
Vivek
Re: [PATCH -mm] kexec jump -v9
On Thu, 2008-05-15 at 18:35 -0700, Eric W. Biederman wrote:

Vivek Goyal writes:

ioapic_suspend() is not putting APICs in legacy mode and that's why we are seeing the issue. It only saves the IOAPIC routing table entries, and these entries are restored during ioapic_resume().

But I think somebody has to put APICs in legacy mode for normal hibernation also. Not sure who does it. Maybe the BIOS, so that during resume the second kernel can get the timer interrupts.

I doubt anything cares in the suspend to RAM case. There should just be a small BIOS trampoline to get back to Linux when the processor restarts. And you don't need interrupts for any of that.

As far as I know, in suspend to RAM an interrupt is used as the wakeup event, such as a keyboard interrupt.

Best Regards,
Huang Ying
Re: [PATCH -mm] kexec jump -v9
On Thu, 2008-05-15 at 21:51 -0400, Vivek Goyal wrote:

On Fri, May 16, 2008 at 09:48:34AM +0800, Huang, Ying wrote:

On Thu, 2008-05-15 at 16:09 -0400, Vivek Goyal wrote:

[...]

Ok, you want to make BIOS calls. We already do that using vm86 mode and use BIOS real mode interrupts. So why do we need this interface? Or, IOW, how is this interface better?

It can call code in 32-bit physical mode in addition to real mode. So it can be used to call EFI runtime services, especially to call EFI 64 runtime services under a 32-bit kernel or vice versa.

The main purpose of kexec jump is hibernation. But I think if the effort is small, why not support general 32-bit physical mode code calls at the same time.

In general, what are the environment requirements for EFI runtime services? I mean, is it just that the processor should be in protected mode with paging disabled, or does one need to stop all other CPUs and devices and then make the call (as we are doing in this case)?

Putting the processor in protected mode with paging disabled is sufficient. In one of the previous kexec jump versions, I provided an option to choose the state saved (whether to stop other CPUs, whether to stop devices).

I agree that now we should focus on kexec based hibernation. But I think it is reasonable to keep the possibility open with minimal effort.

Best Regards,
Huang Ying
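"Protected mode with paging disabled" corresponds to two bits in CR0 on x86: PE (bit 0) set and PG (bit 31) clear. A small sketch of that check follows; the helper name `is_physical_protected_mode()` is invented here, and the function inspects a CR0 value passed in as a plain integer rather than reading the register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR0_PE (UINT32_C(1) << 0)   /* protection enable */
#define CR0_PG (UINT32_C(1) << 31)  /* paging enable */

/* True when the given CR0 value describes 32-bit protected mode with
 * paging turned off, i.e. the "physical mode" discussed in the thread. */
static bool is_physical_protected_mode(uint32_t cr0)
{
    return (cr0 & CR0_PE) != 0 && (cr0 & CR0_PG) == 0;
}
```

Segment setup, interrupt state and the other requirements of a real callee are outside what this check covers.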
Re: [PATCH -mm] kexec jump -v9
On Wed, May 14, 2008 at 12:03:29PM -0400, Vivek Goyal wrote:

On Thu, Mar 06, 2008 at 11:13:08AM +0800, Huang, Ying wrote:

This is a minimal patch with only the essential features. All additional features are split out and can be discussed later. I think it may be easier to get consensus on this minimal patch.

Hi Huang,

Ok, after a long time, I am back to testing and reviewing this patch.

[..]

7. Boot kernel compiled in step 1 (kernel C). Use the rootfs.gz as root file system.

8. In kernel C, load the memory image of kernel A as follows:

/sbin/kexec -l --args-none --entry=`cat kexec_jump_back_entry` dump.elf

How do I get back to the original kernel without loading dump.elf? I mean, the original kernel is already in memory and I don't have to first save it to disk and then reload it. Is there a way to do it? If not, then we need to modify kexec-tools to support that. Something like kexec --entry=entry point should tell kexec that the kernel is already loaded; just do the bit to set the entry point properly.

Never mind. I found it. The following worked for me for returning to the original kernel.

kexec --load-jump-back-helper --entry=entry point

Just wondering if --load-jump-back-helper should be an explicit option, or whether kexec should silently assume it if no -l or -p is given.

Thanks
Vivek
Re: [PATCH -mm] kexec jump -v9
On Thu, Mar 06, 2008 at 11:13:08AM +0800, Huang, Ying wrote:

This is a minimal patch with only the essential features. All additional features are split out and can be discussed later. I think it may be easier to get consensus on this minimal patch.

Best Regards,
Huang Ying

This patch provides an enhancement to kexec/kdump. It implements the following features:

- Jumping between the original kernel and the kexeced kernel.
- Backup/restore memory used by both the original kernel and the kexeced kernel.
- Save/restore CPU and devices state before/after kexec.

Hi Huang,

Ok, I have done some testing on this patch. Currently I have just tested switching back and forth between two kernels and it is working for me. Just that I had to put LAPIC and IOAPIC in legacy mode for it to work. A few comments/questions are inline.

[..]

        .text
        .align PAGE_ALIGNED
+       .global kexec_relocate_page
+kexec_relocate_page:
+
+/*
+ * Entry point for jumping back from kexeced kernel, the paging is
+ * turned off.
+ */
+kexec_jump_back_entry:
+       call    1f
+1:
+       popl    %ebx
+       subl    $(1b - kexec_relocate_page), %ebx
+       movl    %edi, KJUMP_ENTRY_OFF(%ebx)
+       movl    CP_VA_CONTROL_PAGE(%ebx), %edi
+       lea     STACK_TOP(%ebx), %esp
+       movl    CP_PA_SWAP_PAGE(%ebx), %eax
+       movl    CP_PA_BACKUP_PAGES_MAP(%ebx), %edx
+       pushl   %eax
+       pushl   %edx
+       call    swap_pages
+       addl    $8, %esp
+       movl    CP_PA_PGD(%ebx), %eax
+       movl    %eax, %cr3
+       movl    %cr0, %eax
+       orl     $(1<<31), %eax
+       movl    %eax, %cr0
+       lea     STACK_TOP(%edi), %esp
+       movl    %edi, %eax
+       addl    $(virtual_mapped - kexec_relocate_page), %eax
+       pushl   %eax
+       ret

Upon re-entering the kernel, what happens to the GDT? So gdtr will be pointing to the GDT of the other kernel (which is not there, as pages have been swapped)? Do we need to reload the gdtr upon re-entering the kernel?

[..]

@@ -197,8 +282,54 @@ identity_mapped:
        xorl    %eax, %eax
        movl    %eax, %cr3
+       movl    CP_PA_SWAP_PAGE(%edi), %eax
+       pushl   %eax
+       pushl   %ebx
+       call    swap_pages
+       addl    $8, %esp
+
+       /* To be certain of avoiding problems with self-modifying code
+        * I need to execute a serializing instruction here.
+        * So I flush the TLB, it's handy, and not processor dependent.
+        */
+       xorl    %eax, %eax
+       movl    %eax, %cr3
+
+       /* set all of the registers to known values */
+       /* leave %esp alone */
+
+       movl    KJUMP_MAGIC_OFF(%edi), %eax
+       cmpl    $KJUMP_MAGIC_NUMBER, %eax
+       jz      1f
+       xorl    %edi, %edi
+       xorl    %eax, %eax
+       xorl    %ebx, %ebx
+       xorl    %ecx, %ecx
+       xorl    %edx, %edx
+       xorl    %esi, %esi
+       xorl    %ebp, %ebp
+       ret
+1:
+       popl    %edx
+       movl    CP_PA_SWAP_PAGE(%edi), %esp
+       addl    $PAGE_SIZE_asm, %esp
+       pushl   %edx
+2:
+       call    *%edx
+       movl    %edi, %edx
+       popl    %edi
+       pushl   %edx
+       jmp     2b
+

What does the above piece of code do? It looks redundant for switching between the kernels. After call *%edx, we never return here. Instead we come back to kexec_jump_back_entry?

[..]

--- /dev/null
+++ b/Documentation/i386/jump_back_protocol.txt
@@ -0,0 +1,66 @@
+               THE LINUX/I386 JUMP BACK PROTOCOL
+               ---------------------------------
+
+               Huang Ying
+               Last update 2007-12-19
+
+Currently, the following versions of the jump back protocol exist.
+
+Protocol 1.00: Jumping between original kernel and kexeced kernel
+               support. Calling ordinary C function support.
+
+
+*** JUMP BACK ENTRY
+
+At jump back entry of callee, the CPU must be in 32-bit protected mode
+with paging disabled; the CS, DS, ES and SS must be 4G flat segments;
+CS must have execute/read permission, and DS, ES and SS must have
+read/write permission; interrupt must be disabled; the contents of
+registers and corresponding memory must be as follow:
+
+Offset/Size    Meaning
+
+%edi           Real jump back entry of caller if supported,
+               otherwise 0.
+%esp           Stack top pointer, the size of stack is about 4k bytes.
+(%esp)/4       Helper jump back entry of caller if %edi != 0,
+               otherwise undefined.
+

I am not sure what the helper jump back entry is. I understand that you are using %edi to pass the entry point around between the two kernels. Can you please shed some more light on this?

Thanks
Vivek
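The patch's swap_pages routine is assembly and its body is not shown in the quoted hunks, so the following is only a guess at the underlying idea: two pages trade contents through a third scratch page (the "swap page"), which is what lets both kernels' memory be preserved across a jump. `swap_one_page()` and `PAGE_SZ` are names invented for this sketch.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SZ 4096

/* Exchange the contents of pages a and b using a scratch page.
 * Swapping the same pair twice restores the original contents,
 * which is the property that makes jumping back possible. */
static void swap_one_page(unsigned char *a, unsigned char *b,
                          unsigned char *scratch)
{
    memcpy(scratch, a, PAGE_SZ);
    memcpy(a, b, PAGE_SZ);
    memcpy(b, scratch, PAGE_SZ);
}
```

The real routine presumably walks a list of page pairs with paging off; this sketch handles a single pair in ordinary user-space memory.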
Re: [linux-pm] [PATCH -mm] kexec jump -v9
Rafael J. Wysocki writes:

On Saturday, 22 of March 2008, Alan Stern wrote:

The spec doesn't say much about that, so we'll need to carry out some experiments.

Still, as far as I can figure out what the spec authors _might_ mean, I think that it would be inappropriate to restore the ACPI NVS area if S5 was entered on power off. The idea seems to be that the restoration of the ACPI NVS area should complement whatever has been preserved by the platform over the hibernation/resume cycle.

IMO, if S5 was entered on power off, there are two possible ways to go. Either ACPI is initialized by the boot kernel, in which case the image kernel should not touch things like _WAK and similar, just throw away whatever ACPI-related state it got from the image and try to rebuild the ACPI-related data from scratch. Or the boot kernel doesn't touch ACPI and the image kernel initializes it in the same way as during a fresh boot (that might be difficult, though).

Just an added partial data point. In the kexec case I have not heard anyone screaming to me that ACPI doesn't work after we switch kernels.

So I expect shutting down ACPI and restarting it should work reliably, and that is easy to test as that is already implemented with kexec.

Eric
Re: [PATCH -mm] kexec jump -v9
Huang, Ying writes:

This is a minimal patch with only the essential features. All additional features are split out and can be discussed later. I think it may be easier to get consensus on this minimal patch.

A minimal patch route sounds good.

  * Do not allocate memory (or fail in any way) in machine_kexec().
  * We are past the point of no return, committed to rebooting now.
  */
-NORET_TYPE void machine_kexec(struct kimage *image)
+void machine_kexec(struct kimage *image)
 {
        unsigned long page_list[PAGES_NR];
        void *control_page;
+       asmlinkage NORET_TYPE void
+               (*relocate_kernel_ptr)(unsigned long indirection_page,
+                                      unsigned long control_page,
+                                      unsigned long start_address,
+                                      unsigned int has_pae) ATTRIB_NORET;
        /* Interrupts aren't acceptable while we reboot */
        local_irq_disable();
        control_page = page_address(image->control_code_page);
-       memcpy(control_page, relocate_kernel, PAGE_SIZE);
+       memcpy(control_page, kexec_relocate_page, PAGE_SIZE/2);
+       KJUMP_MAGIC(control_page) = 0;
+       if (image->preserve_context) {
+               KJUMP_MAGIC(control_page) = KJUMP_MAGIC_NUMBER;
+               if (kexec_jump_save_cpu(control_page)) {
+                       image->start = KJUMP_ENTRY(control_page);
+                       return;

Tricky, and I expect unnecessary. We should be able to just have relocate_new_kernel return?

+               }
+       }
+
+       relocate_kernel_ptr = control_page +
+               ((void *)relocate_kernel - (void *)kexec_relocate_page);
        page_list[PA_CONTROL_PAGE] = __pa(control_page);
-       page_list[VA_CONTROL_PAGE] = (unsigned long)relocate_kernel;
+       page_list[VA_CONTROL_PAGE] = (unsigned long)control_page;
        page_list[PA_PGD] = __pa(kexec_pgd);
        page_list[VA_PGD] = (unsigned long)kexec_pgd;
 #ifdef CONFIG_X86_PAE
@@ -127,6 +148,7 @@ NORET_TYPE void machine_kexec(struct kim
        page_list[VA_PTE_0] = (unsigned long)kexec_pte0;
        page_list[PA_PTE_1] = __pa(kexec_pte1);
        page_list[VA_PTE_1] = (unsigned long)kexec_pte1;
+       page_list[PA_SWAP_PAGE] = (page_to_pfn(image->swap_page) << PAGE_SHIFT);
        /* The segment registers are funny things, they have both a
         * visible and an invisible part. Whenever the visible part is
@@ -145,8 +167,9 @@ NORET_TYPE void machine_kexec(struct kim
        set_idt(phys_to_virt(0),0);
        /* now call it */
-       relocate_kernel((unsigned long)image->head, (unsigned long)page_list,
-                       image->start, cpu_has_pae);
+       relocate_kernel_ptr((unsigned long)image->head,
+                           (unsigned long)page_list,
+                           image->start, cpu_has_pae);
 }
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -301,18 +301,26 @@ EXPORT_SYMBOL_GPL(kernel_restart);
  * Move into place and start executing a preloaded standalone
  * executable. If nothing was preloaded return an error.
  */
-static void kernel_kexec(void)
+static int kernel_kexec(void)
 {
+       int ret = -ENOSYS;
 #ifdef CONFIG_KEXEC
-       struct kimage *image;
-       image = xchg(&kexec_image, NULL);
-       if (!image)
-               return;
-       kernel_restart_prepare(NULL);
-       printk(KERN_EMERG "Starting new kernel\n");
-       machine_shutdown();
-       machine_kexec(image);
+       if (xchg(&kexec_lock, 1))
+               return -EBUSY;
+       if (!kexec_image) {
+               ret = -EINVAL;
+               goto unlock;
+       }
+       if (!kexec_image->preserve_context) {
+               kernel_restart_prepare(NULL);
+               printk(KERN_EMERG "Starting new kernel\n");
+               machine_shutdown();
+       }
+       ret = kexec_jump(kexec_image);
+unlock:
+       xchg(&kexec_lock, 0);
 #endif

Ugh. No. Not sharing the shutdown methods with reboot and the normal kexec path looks like a recipe for failure to me.

This looks like where we really need to have the conversation. What methods do we use to shut down the system?

My take on the situation is this. For proper handling we need driver device_detach and device_reattach methods, with the following semantics. The device_detach methods will disable DMA and place the hardware in a sane state from which the device driver can reclaim and reinitialize it, but the hardware will not be touched. device_reattach reattaches the driver to the hardware.

So looking at this patch I see two very productive directions we can go.

1) A patch that just fixes up the kexec infrastructure code so it implements the swap page and provides the kernel reentry point, and doesn't handle the upper layer user interface portion.

2) A patch that renames device_shutdown to device_detach, and starts implementing the driver hooks needed for a resumable kexec.

Then we have the question what do we do with devices in the kernel that
Re: [linux-pm] [PATCH -mm] kexec jump -v9
Maxim Levitsky writes:

First of all, S4 ACPI code turns some LEDs on on some systems; a cosmetic thing, but still nice.

Secondly, what about wakeup devices? Hardware can disable some devices in S5 while leaving them running in S4. On my system, for example, the network card will do WOL in S4, but to make it do WOL in S5 I have to turn on a specific option in the BIOS.

While my system doesn't have this, it isn't uncommon for a system to leave USB ports running so one can turn on the PC with keyboard/mouse even in S4. In S5 those ports will probably be disabled. My system only has this for S3.

On laptops we can expect even more ACPI functionality, so some more differences between S4 and S5 can happen.

The last thing that I want to say is that when Linux puts the PC in an S? state, on top of executing the _PTS and _GTS ACPI functions, it writes the destination S state to a fixed register, thus the hardware can (and does) behave differently.

Yes. S4 looks interesting. Especially the weird "fans don't work on restore from S5" case.

S4 still appears to be a premature optimization that adds lots of complexity and reduces the reliability of the code.

Software hibernation to disk should be a rock solid proposition that needs little if any cooperation from drivers, and it should work on every box, because fundamentally it is hardware agnostic. The only cooperation we need from drivers is for devices for which we can't tolerate an unplug and replug event at upper layers, like block devices, because we would lose our filesystems.

All of the reports say hibernation is not rock solid reliable. Things like S4 support keep us from being hardware agnostic. Therefore it appears to me we have a design bug. Which is why I'm not at all happy with S4 support.

It actually occurs to me that the first mode we should really support is the mode where the user hits the power button themselves. That totally removes the hibernation path from any weird hardware interactions.

Then S5 is an optimization upon that (just a little more work on the shutdown path).

Then ultimately S4, reusing and refactoring the work for S3? suspend to ram to allow us to leave very specific devices on. But that is a lot of complexity for a little bit of gain.

We should have code that works by design. Code that works practically every time. Something that is easy to diagnose.

Eric
Re: [linux-pm] [PATCH -mm] kexec jump -v9
On Thursday, 15 of May 2008, Eric W. Biederman wrote:
Rafael J. Wysocki [EMAIL PROTECTED] writes:
On Saturday, 22 of March 2008, Alan Stern wrote:

The spec doesn't say much about that, so we'll need to carry out some experiments.

Still, as far as I can figure out what the spec authors _might_ mean, I think that it would be inappropriate to restore the ACPI NVS area if S5 was entered on power off. The idea seems to be that the restoration of the ACPI NVS area should complement whatever has been preserved by the platform over the hibernation/resume cycle.

IMO, if S5 was entered on power off, there are two possible ways to go. Either ACPI is initialized by the boot kernel, in which case the image kernel should not touch things like _WAK and similar, just throw away whatever ACPI-related state it got from the image and try to rebuild the ACPI-related data from scratch. Or the boot kernel doesn't touch ACPI and the image kernel initializes it in the same way as during a fresh boot (that might be difficult, though).

Just an added partial data point. In the kexec case I have not heard anyone screaming to me that ACPI doesn't work after we switch kernels.

You don't remove power from devices while doing that.

So I expect shutting down ACPI and restarting it should work reliably, and that is easy to test as that is already implemented with kexec.

You can't program devices to generate wakeup events without ACPI, among other things.

Anyway, I don't think you should focus so much on replacing the current hibernation code entirely.

Thanks,
Rafael
Re: [PATCH -mm] kexec jump -v9
On Thursday, 15 of May 2008, Eric W. Biederman wrote:
Huang, Ying [EMAIL PROTECTED] writes:

This is a minimal patch with only the essential features. All additional features are split out and can be discussed later. I think it may be easier to get consensus on this minimal patch.

A minimal patch route sounds good.

  * Do not allocate memory (or fail in any way) in machine_kexec().
  * We are past the point of no return, committed to rebooting now.
  */
-NORET_TYPE void machine_kexec(struct kimage *image)
+void machine_kexec(struct kimage *image)
 {
 	unsigned long page_list[PAGES_NR];
 	void *control_page;
+	asmlinkage NORET_TYPE void
+		(*relocate_kernel_ptr)(unsigned long indirection_page,
+				       unsigned long control_page,
+				       unsigned long start_address,
+				       unsigned int has_pae) ATTRIB_NORET;

 	/* Interrupts aren't acceptable while we reboot */
 	local_irq_disable();

 	control_page = page_address(image->control_code_page);
-	memcpy(control_page, relocate_kernel, PAGE_SIZE);
+	memcpy(control_page, kexec_relocate_page, PAGE_SIZE/2);
+	KJUMP_MAGIC(control_page) = 0;
+	if (image->preserve_context) {
+		KJUMP_MAGIC(control_page) = KJUMP_MAGIC_NUMBER;
+		if (kexec_jump_save_cpu(control_page)) {
+			image->start = KJUMP_ENTRY(control_page);
+			return;

Tricky, and I expect unnecessary. We should be able to just have relocate_new_kernel return?

+		}
+	}
+
+	relocate_kernel_ptr = control_page +
+		((void *)relocate_kernel - (void *)kexec_relocate_page);
 	page_list[PA_CONTROL_PAGE] = __pa(control_page);
-	page_list[VA_CONTROL_PAGE] = (unsigned long)relocate_kernel;
+	page_list[VA_CONTROL_PAGE] = (unsigned long)control_page;
 	page_list[PA_PGD] = __pa(kexec_pgd);
 	page_list[VA_PGD] = (unsigned long)kexec_pgd;
 #ifdef CONFIG_X86_PAE
@@ -127,6 +148,7 @@ NORET_TYPE void machine_kexec(struct kim
 	page_list[VA_PTE_0] = (unsigned long)kexec_pte0;
 	page_list[PA_PTE_1] = __pa(kexec_pte1);
 	page_list[VA_PTE_1] = (unsigned long)kexec_pte1;
+	page_list[PA_SWAP_PAGE] = (page_to_pfn(image->swap_page) << PAGE_SHIFT);

 	/* The segment registers are funny things, they have both a
 	 * visible and an invisible part. Whenever the visible part is
@@ -145,8 +167,9 @@ NORET_TYPE void machine_kexec(struct kim
 	set_idt(phys_to_virt(0),0);

 	/* now call it */
-	relocate_kernel((unsigned long)image->head, (unsigned long)page_list,
-			image->start, cpu_has_pae);
+	relocate_kernel_ptr((unsigned long)image->head,
+			    (unsigned long)page_list,
+			    image->start, cpu_has_pae);
 }

--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -301,18 +301,26 @@ EXPORT_SYMBOL_GPL(kernel_restart);
  * Move into place and start executing a preloaded standalone
  * executable. If nothing was preloaded return an error.
  */
-static void kernel_kexec(void)
+static int kernel_kexec(void)
 {
+	int ret = -ENOSYS;
 #ifdef CONFIG_KEXEC
-	struct kimage *image;
-	image = xchg(&kexec_image, NULL);
-	if (!image)
-		return;
-	kernel_restart_prepare(NULL);
-	printk(KERN_EMERG "Starting new kernel\n");
-	machine_shutdown();
-	machine_kexec(image);
+	if (xchg(&kexec_lock, 1))
+		return -EBUSY;
+	if (!kexec_image) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+	if (!kexec_image->preserve_context) {
+		kernel_restart_prepare(NULL);
+		printk(KERN_EMERG "Starting new kernel\n");
+		machine_shutdown();
+	}
+	ret = kexec_jump(kexec_image);
+unlock:
+	xchg(&kexec_lock, 0);
 #endif

Ugh. No. Not sharing the shutdown methods with reboot and the normal kexec path looks like a recipe for failure to me.

This looks like where we really need to have the conversation. What methods do we use to shut down the system?

My take on the situation is this. For proper handling we need driver device_detach and device_reattach methods, with the following semantics. The device_detach methods will disable DMA and place the hardware in a sane state from which the device driver can reclaim and reinitialize it, but the hardware will not be touched. device_reattach reattaches the driver to the hardware.

So looking at this patch I see two very productive directions we can go.

1) A patch that just fixes up the kexec infrastructure code so it implements the swap page and provides the kernel reentry point. And doesn't handle the upper layer user interface portion.

2) A patch that renames device_shutdown to device_detach. And starts implementing the driver hooks needed for a resumable kexec.

Then we have
Re: [PATCH -mm] kexec jump -v9
On Wed, 2008-05-14 at 15:30 -0700, Eric W. Biederman wrote:
[...]

+	if (image->preserve_context) {
+		KJUMP_MAGIC(control_page) = KJUMP_MAGIC_NUMBER;
+		if (kexec_jump_save_cpu(control_page)) {
+			image->start = KJUMP_ENTRY(control_page);
+			return;

Tricky, and I expect unnecessary. We should be able to just have relocate_new_kernel return?

OK, I will check this. Maybe we can move the CPU state saving code into relocate_new_kernel.

[...]

-static void kernel_kexec(void)
+static int kernel_kexec(void)
 {
+	int ret = -ENOSYS;
 #ifdef CONFIG_KEXEC
-	struct kimage *image;
-	image = xchg(&kexec_image, NULL);
-	if (!image)
-		return;
-	kernel_restart_prepare(NULL);
-	printk(KERN_EMERG "Starting new kernel\n");
-	machine_shutdown();
-	machine_kexec(image);
+	if (xchg(&kexec_lock, 1))
+		return -EBUSY;
+	if (!kexec_image) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+	if (!kexec_image->preserve_context) {
+		kernel_restart_prepare(NULL);
+		printk(KERN_EMERG "Starting new kernel\n");
+		machine_shutdown();
+	}
+	ret = kexec_jump(kexec_image);
+unlock:
+	xchg(&kexec_lock, 0);
 #endif

Ugh. No. Not sharing the shutdown methods with reboot and the normal kexec path looks like a recipe for failure to me.

This looks like where we really need to have the conversation. What methods do we use to shut down the system?

My take on the situation is this. For proper handling we need driver device_detach and device_reattach methods, with the following semantics. The device_detach methods will disable DMA and place the hardware in a sane state from which the device driver can reclaim and reinitialize it, but the hardware will not be touched. device_reattach reattaches the driver to the hardware.

Yes. The current device PM callbacks are not suitable for hibernation (kexec based or original). I think we can collaborate with Rafael J. Wysocki on the new device driver hibernation callbacks.

So looking at this patch I see two very productive directions we can go.

1) A patch that just fixes up the kexec infrastructure code so it implements the swap page and provides the kernel reentry point. And doesn't handle the upper layer user interface portion.

2) A patch that renames device_shutdown to device_detach. And starts implementing the driver hooks needed for a resumable kexec.

OK. I can separate the patch into two patches.

Best Regards,
Huang Ying
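The device_detach/device_reattach semantics described above can be sketched in user-space C. This is only an illustration under stated assumptions: struct demo_device, its two flags, and every demo_* function are hypothetical, and the choice to detach in reverse registration order is an assumption that the thread does not specify.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a device; not a kernel structure. */
struct demo_device {
	const char *name;
	int dma_enabled;   /* 1 while the device may issue DMA */
	int driver_bound;  /* 1 while a driver owns the hardware */
};

/* Detach: stop DMA and release the hardware in a state from which
 * the driver can later reclaim and reinitialize it. */
static void demo_device_detach(struct demo_device *dev)
{
	dev->dma_enabled = 0;
	dev->driver_bound = 0;
}

/* Reattach: the driver reclaims the hardware and re-enables DMA. */
static void demo_device_reattach(struct demo_device *dev)
{
	dev->driver_bound = 1;
	dev->dma_enabled = 1;
}

/* Detach every device, last registered first (assumed ordering). */
void demo_detach_all(struct demo_device *devs, size_t n)
{
	assert(devs != NULL || n == 0);
	while (n-- > 0)
		demo_device_detach(&devs[n]);
}

/* Reattach every device in registration order. */
void demo_reattach_all(struct demo_device *devs, size_t n)
{
	size_t i;

	assert(devs != NULL || n == 0);
	for (i = 0; i < n; i++)
		demo_device_reattach(&devs[i]);
}
```

Calling demo_detach_all() before the jump and demo_reattach_all() after it mirrors the pairing Eric describes; a real driver hook would of course have to program the hardware and could fail.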
Re: [PATCH -mm] kexec jump -v9
On Wed, 2008-05-14 at 16:52 -0400, Vivek Goyal wrote:
[...]

Ok, I have done some testing on this patch. Currently I have just tested switching back and forth between two kernels and it is working for me.

Thanks.

[...]

+/*
+ * Entry point for jumping back from kexeced kernel, the paging is
+ * turned off.
+ */
+kexec_jump_back_entry:
+	call	1f
+1:
+	popl	%ebx
+	subl	$(1b - kexec_relocate_page), %ebx
+	movl	%edi, KJUMP_ENTRY_OFF(%ebx)
+	movl	CP_VA_CONTROL_PAGE(%ebx), %edi
+	lea	STACK_TOP(%ebx), %esp
+	movl	CP_PA_SWAP_PAGE(%ebx), %eax
+	movl	CP_PA_BACKUP_PAGES_MAP(%ebx), %edx
+	pushl	%eax
+	pushl	%edx
+	call	swap_pages
+	addl	$8, %esp
+	movl	CP_PA_PGD(%ebx), %eax
+	movl	%eax, %cr3
+	movl	%cr0, %eax
+	orl	$(1<<31), %eax
+	movl	%eax, %cr0
+	lea	STACK_TOP(%edi), %esp
+	movl	%edi, %eax
+	addl	$(virtual_mapped - kexec_relocate_page), %eax
+	pushl	%eax
+	ret

Upon re-entering the kernel, what happens to the GDT? So gdtr will be pointing to the GDT of the other kernel (which is not there as pages have been swapped)? Do we need to reload the gdtr upon re-entering the kernel?

After re-entering the kernel and returning from machine_kexec, restore_processor_state() is called, where the GDTR and some other CPU state such as FPU, IDT, etc. are restored.

[..]

@@ -197,8 +282,54 @@ identity_mapped:
 	xorl	%eax, %eax
 	movl	%eax, %cr3

+	movl	CP_PA_SWAP_PAGE(%edi), %eax
+	pushl	%eax
+	pushl	%ebx
+	call	swap_pages
+	addl	$8, %esp
+
+	/* To be certain of avoiding problems with self-modifying code
+	 * I need to execute a serializing instruction here.
+	 * So I flush the TLB, it's handy, and not processor dependent.
+	 */
+	xorl	%eax, %eax
+	movl	%eax, %cr3
+
+	/* set all of the registers to known values */
+	/* leave %esp alone */
+
+	movl	KJUMP_MAGIC_OFF(%edi), %eax
+	cmpl	$KJUMP_MAGIC_NUMBER, %eax
+	jz	1f
+	xorl	%edi, %edi
+	xorl	%eax, %eax
+	xorl	%ebx, %ebx
+	xorl	%ecx, %ecx
+	xorl	%edx, %edx
+	xorl	%esi, %esi
+	xorl	%ebp, %ebp
+	ret
+1:
+	popl	%edx
+	movl	CP_PA_SWAP_PAGE(%edi), %esp
+	addl	$PAGE_SIZE_asm, %esp
+	pushl	%edx
+2:
+	call	*%edx
+	movl	%edi, %edx
+	popl	%edi
+	pushl	%edx
+	jmp	2b
+

What does the above piece of code do? Looks like it is redundant for switching between the kernels? After call *%edx, we never return here. Instead we come back to kexec_jump_back_entry?

For switching between the kernels, this is redundant. Originally another feature of kexec jump was to call some code in physical mode. This is used to provide a C ABI to the called code.

Now, Eric suggests using a C ABI compatible mode to pass the jump back entry point too, that is, use the return address on the stack instead of %edi. I think that is reasonable. Maybe we can revise this code to be compatible with the C ABI and provide a convenient interface for both the kernel and other physical mode code.

[..]

--- /dev/null
+++ b/Documentation/i386/jump_back_protocol.txt
@@ -0,0 +1,66 @@
+		THE LINUX/I386 JUMP BACK PROTOCOL
+		---------------------------------
+
+		Huang Ying [EMAIL PROTECTED]
+		Last update 2007-12-19
+
+Currently, the following versions of the jump back protocol exist.
+
+Protocol 1.00:	Jumping between original kernel and kexeced kernel
+		support. Calling ordinary C function support.
+
+
+*** JUMP BACK ENTRY
+
+At jump back entry of callee, the CPU must be in 32-bit protected mode
+with paging disabled; the CS, DS, ES and SS must be 4G flat segments;
+CS must have execute/read permission, and DS, ES and SS must have
+read/write permission; interrupt must be disabled; the contents of
+registers and corresponding memory must be as follow:
+
+Offset/Size	Meaning
+
+%edi		Real jump back entry of caller if supported,
+		otherwise 0.
+%esp		Stack top pointer, the size of stack is about 4k bytes.
+(%esp)/4	Helper jump back entry of caller if %edi != 0,
+		otherwise undefined.
+

I am not sure what the helper jump back entry is. I understand that you are using %edi to pass around the entry point between two kernels. Can you please shed some more light on this?

The helper jump back entry is used to provide a C ABI to some physical mode code other than the kernel. It is the redundant code above.

Best Regards,
Huang Ying
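The patch distinguishes an ordinary kexec from a context-preserving jump by a magic word stored in the control page (KJUMP_MAGIC in machine_kexec(), tested with cmpl in the relocation code). A minimal user-space sketch of that idea follows. The offset and the magic value here are hypothetical placeholders, since the quoted hunks do not show the real definitions, and the demo_* names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE       4096u
/* Hypothetical: the thread does not show the real offset or value. */
#define DEMO_KJUMP_MAGIC_OFF (DEMO_PAGE_SIZE / 2)
#define DEMO_KJUMP_MAGIC     0x4b4a4d50u

/* Writer side: clear the magic, or set it when context is preserved. */
void demo_set_preserve_context(unsigned char *control_page, int preserve)
{
	uint32_t magic = preserve ? DEMO_KJUMP_MAGIC : 0;

	assert(control_page != NULL);
	memcpy(control_page + DEMO_KJUMP_MAGIC_OFF, &magic, sizeof(magic));
}

/* Reader side: nonzero if the page was marked for a jump back. */
int demo_is_jump_back(const unsigned char *control_page)
{
	uint32_t magic;

	assert(control_page != NULL);
	memcpy(&magic, control_page + DEMO_KJUMP_MAGIC_OFF, sizeof(magic));
	return magic == DEMO_KJUMP_MAGIC;
}
```

Keeping the flag inside the control page means the relocation code, which runs with paging off, can read it from the one page it is guaranteed to have.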
Re: [PATCH -mm] kexec jump -v9
On Fri, 2008-03-21 at 15:12 -0400, Vivek Goyal wrote:
[...]

Hi Huang, I am kind of ok with both the methods.

- Communicate information between two kernels using an ELF NOTE prepared by the kernel.
- Communicate information between user space tools using initrd.

I think the ELF_NOTES mechanism is sufficient for communication between two kernels, because it can be written from a user space tool in kernel A (/sbin/kexec via sys_kexec_load) and read from a user space tool in kernel B (via /proc/vmcore). It can be used as a user space communication mechanism. So I think it may not be necessary to communicate with initrd.

If we want to load the hibernated image with sys_kexec_load (/sbin/kexec -l), we must add a multiple-stage loading feature to sys_kexec_load, because the segments in the hibernated image can easily exceed KEXEC_SEGMENT_MAX (16), considering there will be many memory holes when free pages are excluded. Multiple sys_kexec_load calls must be used to load a normal hibernated image.

If multiple-stage loading is unavoidable, I think the better method to communicate information like the jump back entry and the backup pages map is multiple-stage loading, like you said in a previous mail. And they can be encapsulated as ELF_NOTES. So the only information that needs to be passed on the stack is the address of the ELF core header.

But which method to use will depend on what information we want to exchange between two kernels. For example, re-entry points can be on the stack or in an ELF NOTE. The backup page map probably can be communicated using initrd, as only user space needs to access that. (ELF core headers can be put in a memory area which is not swapped during the transition from kernel A to B. This way kernel B never needs to know that kernel A had done some swapping of pages?)

ELF core headers are in the destination memory range of kernel B, so they can be accessed by kernel B directly without knowing about the page swapping in kernel A.

So far I have understood only the following.

1. We need to pass around entry/re-entry points between kernels.
2. We need to pass the backup pages map from kernel A to kernel B, so that a user space tool can do filtering.
3. We need to pass the address of the ELF core headers from kernel A to kernel B so that a valid vmcore of kernel A can be exported.
   - For the first boot of kernel B, the address of the ELF core header is passed through the command line.
   - For re-entry into B, the ELF core header address can be passed using some register, or on the stack, or using a kernel ELF NOTE.

What else? What information do we need to communicate from kernel B to kernel A or from kernel C to kernel A? I am sure that you have told it in the past. Just that I don't recollect it.

For now, no information needs to be passed from kernel B/C to kernel A. But I think in the future some ACPI-related information will need to be passed in this way, such as from kernel C to kernel A: whether the system is restored from ACPI S4 or ACPI S5. So I think it is necessary to make it possible to pass some information from kernel B/C to kernel A. But I think an ELF core header and some memory are sufficient to do this.

Best Regards,
Huang Ying
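To make the KEXEC_SEGMENT_MAX argument concrete, here is a rough user-space sketch. It assumes each contiguous run of in-use pages becomes one segment and that a single load call accepts at most 16 segments, as stated in the mail; the per-page bitmap representation and the demo_* helpers are hypothetical and do not correspond to any kexec interface.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SEGMENT_MAX 16 /* mirrors KEXEC_SEGMENT_MAX from the mail */

/* Count contiguous runs of in-use pages; each run would need its
 * own segment once free pages (the holes) are excluded. */
size_t demo_count_segments(const unsigned char *page_in_use, size_t nr_pages)
{
	size_t i, segments = 0;

	assert(page_in_use != NULL || nr_pages == 0);
	for (i = 0; i < nr_pages; i++)
		if (page_in_use[i] && (i == 0 || !page_in_use[i - 1]))
			segments++;
	return segments;
}

/* Number of load stages needed if one stage takes at most
 * DEMO_SEGMENT_MAX segments (rounded up). */
size_t demo_load_stages(size_t nr_segments)
{
	return (nr_segments + DEMO_SEGMENT_MAX - 1) / DEMO_SEGMENT_MAX;
}
```

Even a lightly fragmented memory map produces more than 16 runs, which is why a single sys_kexec_load call would not be enough for a typical hibernation image.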
Re: [linux-pm] [PATCH -mm] kexec jump -v9
On Saturday, 22 of March 2008, Pavel Machek wrote:
On Fri, 21 Mar 2008, Rafael J. Wysocki wrote:

Well, in fact ACPI has something called the NVS memory, which we're supposed to restore during the resume and which we're not doing. The problem may be related to this.

No, it can't be. ACPI won't expect the NVS memory to be restored following an S5-shutdown. In fact, as far as ACPI is concerned, resuming from an S5-type hibernation should not be considered a resume at all but just an ordinary reboot.

I agree here.

All ACPI-related memory areas in the boot kernel should be passed directly through to the image kernel.

However, the image kernel is supposed to restore the NVS area (from the image) before executing _WAK.

It's supposed to do that when resuming from an S4 hibernation, not when resuming from an S5 hibernation.

How can we pass interpreter state?

I do not think we do this kind of passing.

The interpreter state is passed within the image. The platform state is not.

For an S5 hibernation, the interpreter state within the image is wrong. The image kernel needs to have the interpreter state from the boot kernel -- I don't know if this is possible.

Yes, nosave pages could be used to do this passing -- if we can put the interpreter state into a pre-allocated memory block.

On x86-64 there's no guarantee that the nosave pages will be at the same locations in both the image kernel and the boot kernel. What we could do is to pass the data in the image header, preallocate some safe pages from the boot kernel, put the data in there and pass a pointer to them to the image kernel.

However, as far as the ACPI NVS area is concerned, this is probably not necessary, because the spec wants us to restore the ACPI NVS before calling _WAK, which is just after the image kernel gets the control back. So, in theory, the ACPI NVS data could be stored in the image and restored by the image kernel from a location known to it (the procedure may be to copy the ACPI NVS data into a region of regular RAM before creating the image and copy them back into the ACPI NVS area in platform-leave(), for example), but I suspect that for this to work we'll have to switch ACPI off in the boot kernel, just prior to passing control back to the image kernel.

Thanks,
Rafael
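The save-then-restore procedure suggested for the ACPI NVS area can be modeled in a few lines of user-space C. This is a sketch under assumptions: struct demo_nvs and the demo_nvs_* functions are hypothetical, an ordinary buffer stands in for the NVS region, and malloc() stands in for "a region of regular RAM" that would end up in the image.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model: 'region' stands in for the ACPI NVS area and
 * 'copy' for regular RAM that is saved as part of the image. */
struct demo_nvs {
	unsigned char *region;
	size_t size;
	unsigned char *copy;
};

/* Before creating the image: copy the NVS contents into regular RAM.
 * Returns 0 on success, -1 if the copy could not be allocated. */
int demo_nvs_save(struct demo_nvs *nvs)
{
	assert(nvs != NULL && nvs->region != NULL);
	nvs->copy = malloc(nvs->size);
	if (nvs->copy == NULL)
		return -1;
	memcpy(nvs->copy, nvs->region, nvs->size);
	return 0;
}

/* After the image kernel regains control (before _WAK would run):
 * copy the saved contents back and drop the copy. */
void demo_nvs_restore(struct demo_nvs *nvs)
{
	assert(nvs != NULL);
	if (nvs->copy == NULL)
		return;
	memcpy(nvs->region, nvs->copy, nvs->size);
	free(nvs->copy);
	nvs->copy = NULL;
}
```

Whether the restore step is appropriate at all after an S5 power off is exactly the open question in this thread, so the sketch only covers the S4-style case.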
Re: [linux-pm] [PATCH -mm] kexec jump -v9
On Sat, 22 Mar 2008, Rafael J. Wysocki wrote:

However, as far as the ACPI NVS area is concerned, this is probably not necessary, because the spec wants us to restore the ACPI NVS before calling _WAK, which is just after the image kernel gets the control back. So, in theory, the ACPI NVS data could be stored in the image and restored by the image kernel from a location known to it (the procedure may be to copy the ACPI NVS data into a region of regular RAM before creating the image and copy them back into the ACPI NVS area in platform-leave(), for example), but I suspect that for this to work we'll have to switch ACPI off in the boot kernel, just prior to passing control back to the image kernel.

That sounds by far the simplest solution. If the boot kernel can tell (by looking at some header field in the image or any other way) that the hibernation used S5 instead of S4, then it should just turn off ACPI before passing control to the image kernel. Then the image kernel can turn ACPI back on and all should be well.

If you do this, does the NVS region still need to be preserved?

Alan Stern
Re: [linux-pm] [PATCH -mm] kexec jump -v9
On Saturday, 22 of March 2008, Alan Stern wrote:
On Sat, 22 Mar 2008, Rafael J. Wysocki wrote:

However, as far as the ACPI NVS area is concerned, this is probably not necessary, because the spec wants us to restore the ACPI NVS before calling _WAK, which is just after the image kernel gets the control back. So, in theory, the ACPI NVS data could be stored in the image and restored by the image kernel from a location known to it (the procedure may be to copy the ACPI NVS data into a region of regular RAM before creating the image and copy them back into the ACPI NVS area in platform-leave(), for example), but I suspect that for this to work we'll have to switch ACPI off in the boot kernel, just prior to passing control back to the image kernel.

That sounds by far the simplest solution. If the boot kernel can tell (by looking at some header field in the image or any other way) that the hibernation used S5 instead of S4, then it should just turn off ACPI before passing control to the image kernel. Then the image kernel can turn ACPI back on and all should be well.

If you do this, does the NVS region still need to be preserved?

The spec doesn't say much about that, so we'll need to carry out some experiments.

Still, as far as I can figure out what the spec authors _might_ mean, I think that it would be inappropriate to restore the ACPI NVS area if S5 was entered on power off. The idea seems to be that the restoration of the ACPI NVS area should complement whatever has been preserved by the platform over the hibernation/resume cycle.

IMO, if S5 was entered on power off, there are two possible ways to go. Either ACPI is initialized by the boot kernel, in which case the image kernel should not touch things like _WAK and similar, just throw away whatever ACPI-related state it got from the image and try to rebuild the ACPI-related data from scratch. Or the boot kernel doesn't touch ACPI and the image kernel initializes it in the same way as during a fresh boot (that might be difficult, though).

Thanks,
Rafael
Re: [linux-pm] [PATCH -mm] kexec jump -v9
On Tue 2008-03-18 21:25:27, Eric W. Biederman wrote:
Alan Stern [EMAIL PROTECTED] writes:
On Wed, 19 Mar 2008, Rafael J. Wysocki wrote:

Well, I've been saying that for I-don't-remember-how-long: on my box, if you use S5 instead of entering S4, the fan doesn't work correctly after the resume. Plain and simple. Perhaps there's a problem with our ACPI drivers that causes this to happen, but I have no idea what that can be at the moment.

IMO it would be worthwhile to track this down. It's a clear indication that something is wrong somewhere.

Could it be connected with the way the boot kernel hands control over to the image kernel? Presumably ACPI isn't prepared to deal with that sort of thing during a boot from S5. It would have to be fooled into thinking the two kernels were one and the same.

It should be easy to test if it is a hand over problem, by turning off the laptop by placing it in S5 (shutdown -h now) and then booting the same kernel again.

Feel free to help with testing. I believe ACPI is simply getting confused by us overwriting memory with that from the old image. I don't see how you can emulate it with shutdown.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(Czech, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
help save the Klanovice forest: http://www.ujezdskystrom.info/
Re: [linux-pm] [PATCH -mm] kexec jump -v9
On Tue, 18 Mar 2008, Eric W. Biederman wrote:
Alan Stern [EMAIL PROTECTED] writes:

Could it be connected with the way the boot kernel hands control over to the image kernel? Presumably ACPI isn't prepared to deal with that sort of thing during a boot from S5. It would have to be fooled into thinking the two kernels were one and the same.

It should be easy to test if it is a hand over problem, by turning off the laptop by placing it in S5 (shutdown -h now) and then booting the same kernel again.

? Doesn't this happen every time Rafael turns the computer off and then turns it back on? Do you mean that Rafael should do an S5-type hibernate, but then reboot in such a way that the image isn't loaded and resumed?

Alan Stern
Re: [linux-pm] [PATCH -mm] kexec jump -v9
On Wednesday, 19 of March 2008, Alan Stern wrote:
On Tue, 18 Mar 2008, Eric W. Biederman wrote:
Alan Stern [EMAIL PROTECTED] writes:

Could it be connected with the way the boot kernel hands control over to the image kernel? Presumably ACPI isn't prepared to deal with that sort of thing during a boot from S5. It would have to be fooled into thinking the two kernels were one and the same.

It should be easy to test if it is a hand over problem, by turning off the laptop by placing it in S5 (shutdown -h now) and then booting the same kernel again.

? Doesn't this happen every time Rafael turns the computer off and then turns it back on? Do you mean that Rafael should do an S5-type hibernate, but then reboot in such a way that the image isn't loaded and resumed?

That will work. The problem happens when the control goes back to the hibernated kernel. I _think_ it has to do with the suspend(PRETHAW) thing we do before that, but frankly I'm not too inclined to verify it as the problem is generally dangerous to the hardware (not working thermal management on a notebook is never fun).

Thanks,
Rafael
Re: [linux-pm] [PATCH -mm] kexec jump -v9
On Tuesday, 18 of March 2008, Eric W. Biederman wrote:
Rafael J. Wysocki [EMAIL PROTECTED] writes:
On Friday, 14 of March 2008, Eric W. Biederman wrote:

Still, it would be sufficient if we disconnected the drivers from the hardware and thus prevented applications from accessing that hardware.

My gut feeling is that except for a handful of drivers we could even get away with simply implementing hot unplug and hot replug. Disks are the big exception here.

Which suggests to me that it is at least possible that the methods we want for a kexec jump hibernation may be different from an in-kernel hibernation and quite possibly are easier to implement.

I'm not sure about the easier part, quite frankly. Also, with our current ordering of code the in-kernel hibernation will need the same callbacks as the kexec-based thing. However, with the in-kernel approach we can attempt (in the future) to be more ACPI compliant, so to speak, but with the kexec-based approach that won't be possible.

Whether it's a good idea to follow ACPI, as far as hibernation is concerned, is a separate question, but IMO we won't be able to answer it without _lots_ of testing on various BIOS/firmware configurations. Our experience so far indicates that at least some BIOSes expect us to follow ACPI and misbehave otherwise, so for those systems there should be an ACPI way available. [Others just don't work well if we try to follow ACPI and those may be handled using the kexec-based approach, but that doesn't mean that we can just ignore the ACPI compliance issue, at least for now.]

If we do use the ACPI S4 state I completely agree we should be at least spec compliant in how we use it.

I took a quick skim through my copy of the ACPI spec so I could get a feel for this issue. Hibernation maps to the ACPI S4 state. The only thing we appear to gain from S4 is the ability to tell the BIOS (so it can tell a bootloader) that this was a hibernation power off instead of simply a software power off.

It looks like entering the ACPI S4 state has a few advantages with respect to how the system wakes up. In general using the ACPI S5 state (soft off) appears simpler, and potentially more reliable.

The sequence we appear to want is:
- Disconnecting drivers from devices.
- Saving the image.
- Placing the system in a low power or off state.
- Coming out of the low power state.
- Restoring the image.
- Reconnecting drivers to devices. (We must assume the device state could have changed here no matter what we do.)

It is mostly a matter of where we place the code.

Right now I don't see a limitation either with a kexec based approach or without one. Especially since the common case would be using the same kernel with the same drivers both before and after the hibernation event.

The low power states for S4 seem to be just so that we can decide which devices have enough life that they can wake up the system. If we handle all of that as a second pass after we have the system in a state where we have saved it we should be in good shape.

My inclination is to just use S5 (soft off). One of the cool things about hibernation to disk was that we were supposed to get the BIOS totally out of that path so we could get something that was rock solid and reliable. I don't see why, when the BIOS doesn't seem to give us anything useful and causes us headaches, we should even consider using S4.

Does using the S4 state have advantages that I currently do not see? Len? Rafael? Anyone?

Well, I've been saying that for I-don't-remember-how-long: on my box, if you use S5 instead of entering S4, the fan doesn't work correctly after the resume. Plain and simple. Perhaps there's a problem with our ACPI drivers that causes this to happen, but I have no idea what that can be at the moment.

Thanks,
Rafael
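The six-step sequence listed above can be written down as an ordered table of callbacks. The sketch below is only an illustration: the enum, the callback type, and the policy of stopping at the first failing step are assumptions, and none of the demo_* names exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* The steps from the mail, in the order they are listed. */
enum demo_step {
	DEMO_DISCONNECT_DRIVERS,
	DEMO_SAVE_IMAGE,
	DEMO_ENTER_LOW_POWER,
	DEMO_LEAVE_LOW_POWER,
	DEMO_RESTORE_IMAGE,
	DEMO_RECONNECT_DRIVERS,
	DEMO_NR_STEPS
};

/* Each step returns 0 on success, nonzero on failure. */
typedef int (*demo_step_fn)(void);

/* Example callbacks used to exercise the runner. */
int demo_step_ok(void)   { return 0; }
int demo_step_fail(void) { return -1; }

/* Run the steps in order and stop at the first missing or failing
 * one. Returns the number of steps that completed. */
size_t demo_run_cycle(const demo_step_fn steps[DEMO_NR_STEPS])
{
	size_t i;

	assert(steps != NULL);
	for (i = 0; i < DEMO_NR_STEPS; i++)
		if (steps[i] == NULL || steps[i]() != 0)
			break;
	return i;
}
```

A return value below DEMO_NR_STEPS tells the caller which step failed, which fits the goal stated earlier in the thread of something that is easy to diagnose.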
Re: [linux-pm] [PATCH -mm] kexec jump -v9
Alan Stern [EMAIL PROTECTED] writes:
On Wed, 19 Mar 2008, Rafael J. Wysocki wrote:

Well, I've been saying that for I-don't-remember-how-long: on my box, if you use S5 instead of entering S4, the fan doesn't work correctly after the resume. Plain and simple. Perhaps there's a problem with our ACPI drivers that causes this to happen, but I have no idea what that can be at the moment.

IMO it would be worthwhile to track this down. It's a clear indication that something is wrong somewhere.

Could it be connected with the way the boot kernel hands control over to the image kernel? Presumably ACPI isn't prepared to deal with that sort of thing during a boot from S5. It would have to be fooled into thinking the two kernels were one and the same.

It should be easy to test if it is a hand over problem, by turning off the laptop by placing it in S5 (shutdown -h now) and then booting the same kernel again.

Eric
Re: [PATCH -mm] kexec jump -v9
On Wed, 2008-03-12 at 15:37 -0400, Vivek Goyal wrote: On Tue, Mar 11, 2008 at 08:17:45PM -0600, Eric W. Biederman wrote: Huang, Ying [EMAIL PROTECTED] writes: Yes. The entry point should be saved in dump.elf itself, this can be done via a user-space tool such as makedumpfile. Because makedumpfile is also used to exclude free pages from disk image, it needs a communication method between two kernels (to get backup pages map or something like that from kernel A). We have talked about this before. - Your opinion is to communicate via the purgatory. (But I don't know how to communicate between kernel A and purgatory). How about the return address on the stack? I think he needs to pass on much more data than just return address. IIUC, he needs to pass backup pages map to new kernel, so that any user space tool can use backup pages map to reconstruct/rearrange the first kernel's memory core and tools like makedumpfile can do filtering before hibernated images is saved. This brings me to a random thought. Can we break the process of loading a hibernation kernel in two steps. - In first step just do the memory reservation for running second kernel. (kexec -l dummpy-file-for-reserving-memory) - This memory map of reserved pages is exported to user space. - Use this memory map and regenerate the hibernation kernel initrd (rootfs.gz) and put the memory map there. This memory map can be used by makedumpfile in second kernel for filtering. This way it will user space to user space communication of information which gets fixed at kernel loading time. Doing kexec load in two steps is a possible solution. Although this is a little complex, we can wrap the two steps into one /sbin/kexec invoking. That is, When do /sbin/kexec --load-preserve-context kernel-image, /sbin/kexec first call sys_kexec_load() to load the kernel image and reserving memory, then amend the memory image of loaded kernel (B) according to the new information available such as return address and backup pages map. 
For this solution, what still needs to be solved is how to pass some information back from kernel B (hibernating kernel) to kernel A (original kernel), and how to pass some information from kernel C (resuming kernel) to kernel A (original kernel).

- Another possible solution for passing information between kernels (in user space): the needed information from the kernel is passed on the stack, and a special ELF_NOTES is used to access the information in the peer kernel. Details are as follows:

1. Possible information that needs to be passed:
   1.1 From user space (known before sys_kexec_load):
       a. ELF core header
       b. vmcoreinfo (pointer only)
   1.2 From kernel space (known after sys_kexec_load):
       a. jump back entry (return address)
       b. backup pages map

2. When jumping from kernel A to kernel B:
   2.1 In /sbin/kexec --load-preserve-context kernel-image, /sbin/kexec allocates a special ELF_NOTES (ELF NOTES kernel) for information from kernel space.
   2.2 When doing sys_reboot(REBOOT_CMD_KEXEC), the kernel puts the needed information and the physical address of the ELF core header onto the stack just before jumping to purgatory.
   2.3 After jumping to purgatory, purgatory fills ELF NOTES kernel with the corresponding address on the stack.
   2.4 When kernel B is booted, /proc/vmcore is created and the information from ELF NOTES kernel is available too.

3. When jumping back from kernel B to kernel A and jumping from kernel C to kernel A:
   3.1 Same as 2.1.
   3.2 Same as 2.2, but there is no purgatory in kernel A, so once the information is put on the stack, jump to the jump back entry of kernel A directly.
   3.3 The code at the jump back entry of kernel A will work as a purgatory to fill ELF NOTES kernel with the corresponding address on the stack. Then the /proc/vmcore reset code is called again to (re-)construct /proc/vmcore with the new information.

Best Regards,
Huang Ying
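As an illustration of the ELF-note idea proposed above, here is a minimal Python sketch of how the two kernel-space values (jump back entry and backup pages map address) might be carried in a single ELF note. It follows the usual note layout found in Linux core files (three 32-bit words for name size, descriptor size and type, then name and descriptor each padded to 4 bytes). The note name "KEXEC_JUMP", the type value, and the two-field descriptor are assumptions made only for this example; nothing in the thread defines them.

```python
import struct

# Hypothetical note name and type, chosen only for this illustration.
NOTE_NAME = b"KEXEC_JUMP\0"
NOTE_TYPE = 0x4B4A0001


def _pad4(data: bytes) -> bytes:
    """Pad a field with NUL bytes up to the next 4-byte boundary."""
    return data + b"\0" * (-len(data) % 4)


def pack_note(jump_back_entry: int, backup_pages_map: int) -> bytes:
    """Build one little-endian ELF note holding two 64-bit addresses."""
    desc = struct.pack("<QQ", jump_back_entry, backup_pages_map)
    header = struct.pack("<III", len(NOTE_NAME), len(desc), NOTE_TYPE)
    return header + _pad4(NOTE_NAME) + _pad4(desc)


def unpack_note(blob: bytes):
    """Parse a note produced by pack_note (assumes a 16-byte descriptor)."""
    namesz, descsz, ntype = struct.unpack_from("<III", blob, 0)
    name_off = 12                                  # right after the 3-word header
    desc_off = name_off + namesz + (-namesz % 4)   # skip the padded name
    name = blob[name_off:name_off + namesz]
    entry, pages_map = struct.unpack_from("<QQ", blob, desc_off)
    return name, ntype, entry, pages_map
```

In this sketch the writer side plays the role of purgatory (or the jump back entry code in kernel A) filling in the note, and the reader side plays the role of a user-space tool such as makedumpfile reading it back through /proc/vmcore.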
Re: [PATCH -mm] kexec jump -v9
Hi!

The features of this patch can be used as follows:

- A simple hibernation implementation without ACPI support. You can kexec a hibernating kernel, save the memory image of the original system and shut down the system. When resuming, you restore the memory image of the original system via ordinary kexec load, then jump back.

The main usage of this functionality is for hibernation. I am not sure what has been the conclusion of previous discussions. Rafael/Pavel, does the approach of doing hibernation using a separate kernel hold promise?

It's certainly a more traditional method of doing hibernation than the tricks swsusp currently plays.

What exactly are you referring to?

Well, traditionally it is 'A saves B to disk' (like bootloader saves kernel and userspace). In swsusp we have 'kernel saves itself'... which works, too, but is a pretty different design.

Now, I guess there are some difficulties, like ACPI integration, and some basic drawbacks, like the few seconds needed to boot the second kernel during suspend. ...OTOH this is probably the only chance to eliminate the freezer from swsusp...

Some facts:
* There's no reason to think that we can't use this same mechanism for hibernation (the only difficulty seems to be the handling of devices used for saving the image).

Ok, at least kexec makes handling of the suspend device easier.

Moreover, if this had been the _only_ argument for the $subject functionality, I'd have been against it.

Fortunately it's not the only one :-).

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [linux-pm] [PATCH -mm] kexec jump -v9
On Wed, 12 Mar 2008, Huang, Ying wrote:

I think kexec based hibernation is the only currently available method to write out the image without the freezer (after the driver work is done). If other processes are running, how do we prevent them from writing to disk without freezing them in the current implementation?

This is a very good question. It's a matter of managing the block layer's request queues. Somehow the existing I/O requests must remain blocked while the requests needed for writing the image must be allowed to proceed. I don't know what would be needed to make this work, but it ought to be possible somehow...

Alan Stern
Re: [linux-pm] [PATCH -mm] kexec jump -v9
On Wednesday, 12 of March 2008, Alan Stern wrote:
On Wed, 12 Mar 2008, Huang, Ying wrote:

I think kexec based hibernation is the only currently available method to write out the image without the freezer (after the driver work is done). If other processes are running, how do we prevent them from writing to disk without freezing them in the current implementation?

This is a very good question. It's a matter of managing the block layer's request queues. Somehow the existing I/O requests must remain blocked while the requests needed for writing the image must be allowed to proceed. I don't know what would be needed to make this work, but it ought to be possible somehow...

Yes, it ought to be possible. Ultimately, IMHO, we should put all devices unnecessary for saving the image (and doing some eye-candy work) into low power states before the image is created and keep them in low power states until the system is eventually powered off.

If this is done, the remaining problem is the handling of the devices that we need to save the image. I believe that will be achievable without using the freezer.

Thanks,
Rafael
Re: [linux-pm] [PATCH -mm] kexec jump -v9
Rafael J. Wysocki writes:

Yes, it ought to be possible. Ultimately, IMHO, we should put all devices unnecessary for saving the image (and doing some eye-candy work) into low power states before the image is created and keep them in low power states until the system is eventually powered off.

Why? I guess I don't see why we care what power state the devices are in. Especially since we should be able to quickly save the image. We need to disconnect the drivers from the hardware, yes. So filesystems still work and applications that do direct hardware access still work and don't need to reopen their connections. I'm leery of low power states as they don't always work, and bringing in low power states seems to confuse hibernation to disk with suspend to RAM.

If this is done, the remaining problem is the handling of the devices that we need to save the image. I believe that will be achievable without using the freezer.

Reasonable. In general the problem is much easier if we don't store the hibernation image in a filesystem or partition that the rest of the system is using. That way we avoid inconsistencies.

Eric
Re: [PATCH -mm] kexec jump -v9
On Thu, Mar 06, 2008 at 11:13:08AM +0800, Huang, Ying wrote:

This is a minimal patch with only the essential features. All additional features are split out and can be discussed later. I think it may be easier to get consensus on this minimal patch.

Hi Huang,

This patchset is slowly getting better. True that first we need to come up with a minimal infrastructure patch and then think of building more functionality on top of it.

This patch provides an enhancement to kexec/kdump. It implements the following features:

- Jumping between the original kernel and the kexeced kernel.
- Backup/restore memory used by both the original kernel and the kexeced kernel.
- Save/restore CPU and devices state before/after kexec.

The features of this patch can be used as follows:

- A simple hibernation implementation without ACPI support. You can kexec a hibernating kernel, save the memory image of the original system and shut down the system. When resuming, you restore the memory image of the original system via ordinary kexec load, then jump back.

The main usage of this functionality is for hibernation. I am not sure what has been the conclusion of previous discussions. Rafael/Pavel, does the approach of doing hibernation using a separate kernel hold promise?

[..]

Usage example of simple hibernation:

1. Compile and install the patched kernel with the following options selected:

   CONFIG_X86_32=y
   CONFIG_RELOCATABLE=y
   CONFIG_KEXEC=y
   CONFIG_CRASH_DUMP=y
   CONFIG_PM=y

2. Build an initramfs image containing kexec-tool and makedumpfile, or download the pre-built initramfs image, called rootfs.gz in the following text.

3. Prepare a partition to save the memory image of the original kernel, called the hibernating partition in the following text.

4. Boot the kernel compiled in step 1 (kernel A).

5. In kernel A, load the kernel compiled in step 1 (kernel B) with /sbin/kexec. The shell command line can be as follows:

   /sbin/kexec --load-preserve-context /boot/bzImage --mem-min=0x10 --mem-max=0xff --initrd=rootfs.gz

6. Boot kernel B with the following shell command line:

   /sbin/kexec -e

7. Kernel B will boot as a normal kexec. In kernel B the memory image of kernel A can be saved into the hibernating partition as follows:

   jump_back_entry=`cat /proc/cmdline | tr ' ' '\n' | grep kexec_jump_back_entry | cut -d '=' -f 2`
   echo $jump_back_entry > kexec_jump_back_entry
   cp /proc/vmcore dump.elf

Why not store the entry point in dump.elf itself, instead of storing it in a separate file? I think this is more like a resumable core file. Something similar to what qemu does for resuming an already booted kernel image. So we might have to introduce an ELF_NOTE to mark an image as a resumable core.

   Then you can shut down the machine as normal.

8. Boot the kernel compiled in step 1 (kernel C). Use rootfs.gz as the root file system.

9. In kernel C, load the memory image of kernel A as follows:

   /sbin/kexec -l --args-none --entry=`cat kexec_jump_back_entry` dump.elf

How are the memory segments of dump.elf loaded? The normal kexec way? Memory segments of dump.elf are first stored somewhere and then moved to their destination at kexec -e time?

Does this really work? If we have 4G RAM, what will be the size of dump.elf? And when we load it back for resuming, do we have sufficient memory left?

Maybe we can have a separate load flag (--load-resume-image) to mark that we are resuming a hibernated image, so that kexec does not have to prepare the command line, does not have to prepare the zero page/setup page etc.

I have thought through it again and tried to put together some of the new kexec options we can introduce to make the whole thing work.

I am considering a simple case where a user boots kernel A and then launches kernel B using kexec --load-preserve-context. Now a user might save the hibernated image or might want to come back to A.

- kexec -l kernel-image
  Normal kexec functionality. Boot a new kernel, without preserving the existing kernel's context.

- kexec --load-preserve-context kernel-image
  Boot a new kernel while preserving the existing kernel's context. Will be used for booting kernel B for the first time.

- kexec --load-resume-image resumable-core
  Resumes a hibernated image. Loads an ELF64 hibernated image. The context of the first kernel/boot-loader will not be preserved. The first kernel will not save CPU states. It will put devices into the suspended state though, so that these can be resumed by the resumable core. This option can be used by kboot or kernel C to resume a hibernated image.

- kexec --load-resume-entry entry-point
  Image is already loaded. Just prepare the entry point so that one can enter back into the previous image. CPU states will be saved and devices will be put into suspended states. Will be used for A --> B and B --> A transitions. Both A
Re: [PATCH -mm] kexec jump -v9
On Tuesday, 11 of March 2008, Vivek Goyal wrote:
On Thu, Mar 06, 2008 at 11:13:08AM +0800, Huang, Ying wrote:

This is a minimal patch with only the essential features. All additional features are split out and can be discussed later. I think it may be easier to get consensus on this minimal patch.

Hi Huang,

This patchset is slowly getting better. True that first we need to come up with a minimal infrastructure patch and then think of building more functionality on top of it.

This patch provides an enhancement to kexec/kdump. It implements the following features:

- Jumping between the original kernel and the kexeced kernel.
- Backup/restore memory used by both the original kernel and the kexeced kernel.
- Save/restore CPU and devices state before/after kexec.

The features of this patch can be used as follows:

- A simple hibernation implementation without ACPI support. You can kexec a hibernating kernel, save the memory image of the original system and shut down the system. When resuming, you restore the memory image of the original system via ordinary kexec load, then jump back.

The main usage of this functionality is for hibernation. I am not sure what has been the conclusion of previous discussions. Rafael/Pavel, does the approach of doing hibernation using a separate kernel hold promise?

Well, what can I say?

I haven't been a big fan of doing hibernation this way since the very beginning and I still have the same reservations. Namely, my opinion is that the hibernation-related problems we have are not just solvable this way. For one example, in order to stop using the freezer for suspend/hibernation we first need to revamp the suspending/resuming of devices (under way) and the kexec-based approach doesn't help us here. I wouldn't like to start another discussion about it though.

That said, I can imagine some applications of the $subject functionality not directly related to hibernation. For example, one can use it for kernel debugging (jump to a new kernel, change something in the old kernel's data, jump back and see what happens etc.). Also, in principle it may be used for such things as live migration of VMs.

Thanks,
Rafael
Re: [PATCH -mm] kexec jump -v9
On Wednesday, 12 of March 2008, Pavel Machek wrote:

Hi!

Hi,

This is a minimal patch with only the essential features. All additional features are split out and can be discussed later. I think it may be easier to get consensus on this minimal patch.

Hi Huang,

This patchset is slowly getting better. True that first we need to come up with a minimal infrastructure patch and then think of building more functionality on top of it.

...

The features of this patch can be used as follows:

- A simple hibernation implementation without ACPI support. You can kexec a hibernating kernel, save the memory image of the original system and shut down the system. When resuming, you restore the memory image of the original system via ordinary kexec load, then jump back.

The main usage of this functionality is for hibernation. I am not sure what has been the conclusion of previous discussions. Rafael/Pavel, does the approach of doing hibernation using a separate kernel hold promise?

It's certainly a more traditional method of doing hibernation than the tricks swsusp currently plays.

What exactly are you referring to?

Yes, I'd like these patches to go in; being able to switch kernels seems like a useful tool.

No objection from me.

Now, I guess there are some difficulties, like ACPI integration, and some basic drawbacks, like the few seconds needed to boot the second kernel during suspend. ...OTOH this is probably the only chance to eliminate the freezer from swsusp...

Some facts:
* In order to be able to do suspend (STR) without the freezer, we need to make device drivers block access to devices from applications during suspend.
* There's no reason to think that we can't use this same mechanism for hibernation (the only difficulty seems to be the handling of devices used for saving the image).
* We need the drivers to quiesce devices to be able to do the kexec jump in the first place (and to avoid races, we'll need them to block applications' access to devices just like for STR, which is the sufficient condition for removing the freezer).

So, I don't really think that the freezer removal argument is valid here. Moreover, if this had been the _only_ argument for the $subject functionality, I'd have been against it.

Thanks,
Rafael
Re: [PATCH -mm] kexec jump -v9
Nigel Cunningham writes:

Hi all.

I hope kexec turns out to be a good, usable solution. Unfortunately, however, I still have some areas where I'm not convinced that kexec is going to work or work well:

1. Reliability.

It's being sold as a replacement for freezing processes, yet AFAICS it's still going to require the freezer in order to be reliable. In the normal case, there isn't much of an issue with freeing memory or allocating swap, and so these steps can be expected to progress without pain. Imagine, however, the situation where another process or processes are trying to allocate large amounts of memory at the same time, or the system is swapping heavily. Although such situations will not be common, they are entirely conceivable, and any implementation ought to be able to handle such a situation efficiently. If the freezer is removed, any hibernation implementation - not just kexec - is going to have a much harder job of being reliable in all circumstances.

AFAICS, the only way a kexec based solution is going to be able to get around this will be to not have to allocate memory, but that will require permanent allocation of memory for the kexec kernel and its work area as well as the permanent, exclusive allocation of storage for the kexec hibernation implementation that's currently in place (making the LCA complaint about not being able to hibernate to swap on NTFS on fuse equally relevant). While this might be feasible on machines with larger amounts of memory (you might validly be able to argue that a user won't miss 10MB of RAM), it does make hibernation less viable or unviable for systems with less memory (embedded!). It also means that there are 10MB of RAM (or whatever amount) that the user has paid good money for, but which are probably only used for 30s at a time a couple of times a day.

Right. I can address the memory concerns with a kexec based approach as they are core to kexec and completely orthogonal to the rest.

A kexec is done in two passes. The first to load the target kernel and do whatever memory allocation is needed. The second to actually switch which kernel is running.

Using a Linux kernel to save off the image, or in any other way be the target, is not required; it is simply the sane thing to do in a general implementation. An embedded developer could likely implement a save-to-disk routine in a couple of hundred lines of C and a couple of K of RAM if it was an important feature.

Any attempt to start to use storage available to the hibernating kernel is also going to have these race issues.

Yep. Although disk storage is frequently less expensive, and more readily available, so this is less of an issue. Still it does suggest that a dedicated partition likely will be required.

2. Lack of ACPI support.

At the moment, no one is going to want to use kexec based hibernation if they have an ACPI system. This needs to be addressed before it can be considered a serious contender.

Yes.

3. Usability.

Right now, kexec based hibernation looks quite complicated to configure, and the user is apparently going to have to remember to boot a different kernel or at least a different bootloader entry in order to resume. Not a plus. It would be good if you could find a way to use one bootloader entry, resuming if there's an image, booting normally if there's not.

I completely agree here.

Eric
Re: [PATCH -mm] kexec jump -v9
Hi. On Wed, 2008-03-12 at 00:24 +0100, Pavel Machek wrote: ...OTOH this is probably only chance to eliminate freezer from swsusp... I think eliminating the freezer and having reliable hibernation under load look like incompatible goals at the moment. Do you see that as 'not a problem' or have some idea on how that issue can be addressed? Regards, Nigel
Re: [PATCH -mm] kexec jump -v9
On Tue, 2008-03-11 at 17:10 -0400, Vivek Goyal wrote:
On Thu, Mar 06, 2008 at 11:13:08AM +0800, Huang, Ying wrote:

This is a minimal patch with only the essential features. All additional features are split out and can be discussed later. I think it may be easier to get consensus on this minimal patch.

Hi Huang,

This patchset is slowly getting better. True that first we need to come up with a minimal infrastructure patch and then think of building more functionality on top of it.

This patch provides an enhancement to kexec/kdump. It implements the following features:

- Jumping between the original kernel and the kexeced kernel.
- Backup/restore memory used by both the original kernel and the kexeced kernel.
- Save/restore CPU and devices state before/after kexec.

The features of this patch can be used as follows:

- A simple hibernation implementation without ACPI support. You can kexec a hibernating kernel, save the memory image of the original system and shut down the system. When resuming, you restore the memory image of the original system via ordinary kexec load, then jump back.

The main usage of this functionality is for hibernation. I am not sure what has been the conclusion of previous discussions. Rafael/Pavel, does the approach of doing hibernation using a separate kernel hold promise?

[..]

Usage example of simple hibernation:

1. Compile and install the patched kernel with the following options selected:

   CONFIG_X86_32=y
   CONFIG_RELOCATABLE=y
   CONFIG_KEXEC=y
   CONFIG_CRASH_DUMP=y
   CONFIG_PM=y

2. Build an initramfs image containing kexec-tool and makedumpfile, or download the pre-built initramfs image, called rootfs.gz in the following text.

3. Prepare a partition to save the memory image of the original kernel, called the hibernating partition in the following text.

4. Boot the kernel compiled in step 1 (kernel A).

5. In kernel A, load the kernel compiled in step 1 (kernel B) with /sbin/kexec. The shell command line can be as follows:

   /sbin/kexec --load-preserve-context /boot/bzImage --mem-min=0x10 --mem-max=0xff --initrd=rootfs.gz

6. Boot kernel B with the following shell command line:

   /sbin/kexec -e

7. Kernel B will boot as a normal kexec. In kernel B the memory image of kernel A can be saved into the hibernating partition as follows:

   jump_back_entry=`cat /proc/cmdline | tr ' ' '\n' | grep kexec_jump_back_entry | cut -d '=' -f 2`
   echo $jump_back_entry > kexec_jump_back_entry
   cp /proc/vmcore dump.elf

Why not store the entry point in dump.elf itself, instead of storing it in a separate file? I think this is more like a resumable core file. Something similar to what qemu does for resuming an already booted kernel image. So we might have to introduce an ELF_NOTE to mark an image as a resumable core.

Yes. The entry point should be saved in dump.elf itself; this can be done via a user-space tool such as makedumpfile. Because makedumpfile is also used to exclude free pages from the disk image, it needs a communication method between the two kernels (to get the backup pages map or something like that from kernel A).

We have talked about this before.

- Your opinion is to communicate via the purgatory. (But I don't know how to communicate between kernel A and purgatory.)
- Eric's opinion is to communicate between the user space in kernel A and the user space in kernel B.
- My opinion is to communicate between the two kernels directly.

I think as a minimal infrastructure patch, we can communicate minimal information between the user space of the two kernels. When we have consensus on this topic, we can use makedumpfile for both excluding free pages and saving the entry point. For now, we can save the entry point in a separate file, or I can write a simple tool to do this.

   Then you can shut down the machine as normal.

8. Boot the kernel compiled in step 1 (kernel C). Use rootfs.gz as the root file system.

9. In kernel C, load the memory image of kernel A as follows:

   /sbin/kexec -l --args-none --entry=`cat kexec_jump_back_entry` dump.elf

How are the memory segments of dump.elf loaded? The normal kexec way? Memory segments of dump.elf are first stored somewhere and then moved to their destination at kexec -e time?

Yes. Exactly. But during kexec loading, if the source page is the same as the destination page, we need just one page.

Does this really work? If we have 4G RAM, what will be the size of dump.elf? And when we load it back for resuming, do we have sufficient memory left?

Yes. It really works. If we have 4G RAM, the size of dump.elf is 4G - (memory area used by the second kernel); in this example, it is 4G - 16M. The loading kernel will live in 16M of memory, and load dump.elf into all the other memory area.

May be we can have a separate load flag (--load-resume-image) to mark that we are resuming an hibernated image and kexec does not have to
Re: [PATCH -mm] kexec jump -v9
On Wed, 2008-03-12 at 00:49 +0100, Rafael J. Wysocki wrote:
On Wednesday, 12 of March 2008, Pavel Machek wrote:

[...]

It's certainly a more traditional method of doing hibernation than the tricks swsusp currently plays.

What exactly are you referring to?

Long ago, hibernation was not done by the Linux kernel itself but by the BIOS (APM). In those days, the kernel just did some preparation and jumped to the BIOS to do the hibernation. Imagine kernel B is the hibernation BIOS: kernel A does some preparation and jumps to the BIOS (kernel B), just like in the old days.

Yes, I'd like these patches to go in; being able to switch kernels seems like a useful tool.

No objection from me.

Now, I guess there are some difficulties, like ACPI integration, and some basic drawbacks, like the few seconds needed to boot the second kernel during suspend. ...OTOH this is probably the only chance to eliminate the freezer from swsusp...

Some facts:
* In order to be able to do suspend (STR) without the freezer, we need to make device drivers block access to devices from applications during suspend.
* There's no reason to think that we can't use this same mechanism for hibernation (the only difficulty seems to be the handling of devices used for saving the image).

I think kexec based hibernation is the only currently available method to write out the image without the freezer (after the driver work is done). If other processes are running, how do we prevent them from writing to disk without freezing them in the current implementation?

* We need the drivers to quiesce devices to be able to do the kexec jump in the first place (and to avoid races, we'll need them to block applications' access to devices just like for STR, which is the sufficient condition for removing the freezer).

So, I don't really think that the freezer removal argument is valid here. Moreover, if this had been the _only_ argument for the $subject functionality, I'd have been against it.

Best Regards,
Huang Ying
Re: [PATCH -mm] kexec jump -v9
Rafael J. Wysocki writes:

Well, what can I say?

I haven't been a big fan of doing hibernation this way since the very beginning and I still have the same reservations. Namely, my opinion is that the hibernation-related problems we have are not just solvable this way. For one example, in order to stop using the freezer for suspend/hibernation we first need to revamp the suspending/resuming of devices (under way) and the kexec-based approach doesn't help us here. I wouldn't like to start another discussion about it though.

Agreed. At best all this does is move the policy on how to save the kernel image from the kernel itself out to user space, and it is not a cure-all.

That said, I can imagine some applications of the $subject functionality not directly related to hibernation. For example, one can use it for kernel debugging (jump to a new kernel, change something in the old kernel's data, jump back and see what happens etc.). Also, in principle it may be used for such things as live migration of VMs.

Also such things as calling BIOS services or EFI services on x86_64, where vm86 is not useful.

So in principle I think a kexec with return is a logical extension to the current kexec functionality.

That said, it looks like it will be next month before I have time to do a reasonable job of reviewing the current patches.

Eric
Re: [PATCH -mm] kexec jump -v9
Huang, Ying writes:

Yes. The entry point should be saved in dump.elf itself; this can be done via a user-space tool such as makedumpfile. Because makedumpfile is also used to exclude free pages from the disk image, it needs a communication method between the two kernels (to get the backup pages map or something like that from kernel A).

We have talked about this before.

- Your opinion is to communicate via the purgatory. (But I don't know how to communicate between kernel A and purgatory.)

How about the return address on the stack?

- Eric's opinion is to communicate between the user space in kernel A and the user space in kernel B.

Purgatory is for all intents and purposes user space. Because the return address falls on the trampoline page, we won't know its address before we call kexec. But a return address and a stack on that page should be a perfectly good way to communicate.

- My opinion is to communicate between the two kernels directly.

I think as a minimal infrastructure patch, we can communicate minimal information between the user space of the two kernels. When we have consensus on this topic, we can use makedumpfile for both excluding free pages and saving the entry point. For now, we can save the entry point in a separate file, or I can write a simple tool to do this.

We need a fixed protocol so we do not make assumptions about how things will be implemented, allowing kernels to diverge and all kinds of other good things.

For communicating extra information from the kernel being shut down we have ELF notes.

Direct kernel to kernel communication is forbidden. We must have a well defined protocol, allowing the implementations to change at their different speeds and still work together.

Maybe we can have a separate load flag (--load-resume-image) to mark that we are resuming a hibernated image, so that kexec does not have to prepare the command line, does not have to prepare the zero page/setup page etc.

There is already a similar flag in the original kexec-tools implementation: --args-none. If it is specified, kexec-tools does not prepare the command line, zero page/setup page etc. I think we can just re-use this flag. And if it is desired, an alias is good for me too.

My gut feel is we look at the image and detect what kind it is, and simply not enable image processing after we have read the note that says it is a resumable core or whatever.

I have thought through it again and tried to put together some of the new kexec options we can introduce to make the whole thing work.

I am considering a simple case where a user boots kernel A and then launches kernel B using kexec --load-preserve-context. Now a user might save the hibernated image or might want to come back to A.

- kexec -l kernel-image
  Normal kexec functionality. Boot a new kernel, without preserving the existing kernel's context.

- kexec --load-preserve-context kernel-image
  Boot a new kernel while preserving the existing kernel's context. Will be used for booting kernel B for the first time.

- kexec --load-resume-image resumable-core

In the original kexec-tools, this can be done through:

   kexec -l --args-none resumable-core

Do you need to define an alias for it?

Make common cases fast to use. The UI equivalent of make the common case fast.

Eric
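The idea above of looking at the image to detect what kind it is could start from something like the following minimal Python sketch. It only inspects the standard ELF header: the magic bytes, the data-encoding byte, and whether e_type is ET_CORE. The function name classify_image and its return strings are hypothetical. Recognizing a "resumable core" specifically would additionally require scanning the notes for a marker whose name and type this thread has not defined, so that part is deliberately left out.

```python
import struct

ELF_MAGIC = b"\x7fELF"
EI_DATA = 5          # index of the data-encoding byte in e_ident
ELFDATA2LSB = 1      # little-endian encoding
E_TYPE_OFFSET = 16   # e_type follows the 16-byte e_ident array
ET_CORE = 4          # e_type value used for core files


def classify_image(data: bytes) -> str:
    """Return 'core', 'other-elf' or 'not-elf' for the start of an image."""
    if len(data) < E_TYPE_OFFSET + 2 or data[:4] != ELF_MAGIC:
        return "not-elf"
    endian = "<" if data[EI_DATA] == ELFDATA2LSB else ">"
    (e_type,) = struct.unpack_from(endian + "H", data, E_TYPE_OFFSET)
    return "core" if e_type == ET_CORE else "other-elf"
```

A loader built along these lines could skip command-line and setup-page preparation whenever the image is classified as a core, which is roughly the behaviour --args-none selects by hand today.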
Re: [PATCH -mm] kexec jump -v9
On Tue, 2008-03-11 at 23:18 +0100, Rafael J. Wysocki wrote:
On Tuesday, 11 of March 2008, Vivek Goyal wrote:

[...]

Rafael/Pavel, does the approach of doing hibernation using a separate kernel hold promise?

Well, what can I say?

I haven't been a big fan of doing hibernation this way since the very beginning and I still have the same reservations. Namely, my opinion is that the hibernation-related problems we have are not just solvable this way. For one example, in order to stop using the freezer for suspend/hibernation we first need to revamp the suspending/resuming of devices (under way) and the kexec-based approach doesn't help us here. I wouldn't like to start another discussion about it though.

Yes. We need to work on device drivers for all hibernation implementations. And kexec-based hibernation provides a possible method to avoid the freezer after the driver work is done.

Best Regards,
Huang Ying